# PDF Layer Extractor for Industrial Diagrams

Extract colored layers from PDF industrial diagrams with white backgrounds. Automatically handles variable layer counts and antialiasing around text.

## Features

- **PDF Support**: Direct PDF processing at configurable DPI
- **Automatic Layer Detection**: K-means clustering identifies distinct colored layers
- **Handles Antialiasing**: Tolerates color mixing around text and fine details
- **Variable Layer Counts**: Auto-detects all colored layers
- **Strict White Filtering**: Pure white (255,255,255) treated as background only
- **High Quality Output**: Each layer saved as transparent PNG

## Installation

```bash
pip install -r requirements.txt
```

## Quick Start

```bash
# Basic usage
python layer_extractor.py diagram.pdf

# Higher resolution
python layer_extractor.py diagram.pdf --dpi 600

# Extract to specific folder
python layer_extractor.py diagram.pdf -o my_layers/
```

## Usage

### Basic Command

```bash
python layer_extractor.py diagram.pdf
```

Output: `output/diagram_layer1_255_000_000.png`, `output/diagram_layer2_000_000_255.png`, etc.

### Common Options

```bash
# High resolution rendering (better for detailed diagrams)
python layer_extractor.py diagram.pdf --dpi 600

# Adjust color tolerance (for antialiasing issues)
python layer_extractor.py diagram.pdf -t 40

# Extract only top 3 layers
python layer_extractor.py diagram.pdf -n 3

# Custom output directory
python layer_extractor.py diagram.pdf -o layers/
```

## Parameters

- `--dpi` (default: 300) - PDF rendering resolution
  - 300: Standard quality, faster
  - 600: High quality, larger files
  - 150: Draft quality, quick preview

- `-t, --tolerance` (default: 30) - Color matching tolerance (0-100 scale)
  - **10-15**: Very strict, only nearly identical colors
  - **20-25**: Strict, minimal antialiasing
  - **30**: Default, handles moderate antialiasing (RECOMMENDED)
  - **40-50**: Lenient, good for heavy antialiasing around text
  - **60+**: Very lenient, may blur layer boundaries

- `-n, --n-layers` - Extract specific number of layers (default: auto-detect)

- `-m, --min-pixels` (default: 100) - Minimum pixels to consider a valid layer
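
As a rough sketch, the options above could be declared with `argparse` as follows. The flag names and defaults are taken from this README; the long form `--output`, its default of `output`, and the function name `build_parser` are assumptions for illustration and may not match `layer_extractor.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(
        prog="layer_extractor.py",
        description="Extract colored layers from a PDF diagram.",
    )
    parser.add_argument("pdf", help="input PDF file")
    parser.add_argument("--dpi", type=int, default=300,
                        help="PDF rendering resolution")
    parser.add_argument("-t", "--tolerance", type=float, default=30,
                        help="color matching tolerance (0-100)")
    parser.add_argument("-n", "--n-layers", type=int, default=None,
                        help="number of layers (default: auto-detect)")
    parser.add_argument("-m", "--min-pixels", type=int, default=100,
                        help="minimum pixels for a valid layer")
    # Assumption: the README only shows the short form `-o`.
    parser.add_argument("-o", "--output", default="output",
                        help="output directory")
    return parser


# Parse an example command line instead of sys.argv.
args = build_parser().parse_args(["diagram.pdf", "-t", "40", "--dpi", "600"])
print(args.tolerance, args.dpi, args.n_layers, args.min_pixels)
```

Options that are not passed keep their documented defaults, so `args.n_layers` stays `None` here, which stands in for auto-detection.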

## How It Works

1. **PDF Rendering**: Converts PDF to high-resolution image at specified DPI
2. **Color Analysis**: Uses K-means clustering on pixel colors
3. **White Filtering**: Removes the white background (pixels with R, G, and B all ≥ 250)
4. **Layer Extraction**: For each color, creates a mask of similar pixels
5. **Alpha Blending**: Handles antialiasing with gradient transparency
6. **Output**: Saves each layer as transparent PNG
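
Steps 3 to 5 can be sketched for a single pixel as shown below. This is an illustration, not the code in `layer_extractor.py`: the Euclidean RGB distance, the mapping of the 0-100 tolerance onto that distance, and the linear alpha falloff are all assumptions, and the function names are hypothetical. The clustering step is omitted.

```python
import math

MAX_DIST = math.sqrt(3 * 255 ** 2)  # distance from black to white in RGB


def is_background(rgb, threshold=250):
    """Step 3: a pixel is white background if every channel >= threshold."""
    return all(c >= threshold for c in rgb)


def color_distance(a, b):
    """Euclidean RGB distance rescaled to 0-100 (assumed tolerance scale)."""
    raw = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 100 * raw / MAX_DIST


def layer_alpha(pixel, layer_color, tolerance=30):
    """Steps 4-5: alpha (0-255) of `pixel` in the layer for `layer_color`.

    Exact matches are opaque; alpha falls off linearly to 0 at `tolerance`.
    """
    if is_background(pixel):
        return 0
    d = color_distance(pixel, layer_color)
    if d >= tolerance:
        return 0
    return round(255 * (1 - d / tolerance))


red = (220, 50, 50)
for pixel in [(220, 50, 50), (235, 120, 120), (255, 255, 255), (50, 100, 220)]:
    print(pixel, layer_alpha(pixel, red))
```

In this sketch the exact layer color is fully opaque, the lighter antialiased red is partly transparent, and both the white background and the blue pixel get an alpha of 0.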

## Output Format

Files are named `{pdf_name}_layer{N}_{R}_{G}_{B}.png`, with each color channel zero-padded to three digits.

Example:
```
output/
├── piping_diagram_layer1_220_050_050.png  (Red layer)
├── piping_diagram_layer2_050_100_220.png  (Blue layer)
└── piping_diagram_layer3_050_180_050.png  (Green layer)
```
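
The naming pattern can be reproduced with a short helper. `layer_filename` is a hypothetical name, and the default `output` directory is inferred from the examples in this README rather than confirmed from the script.

```python
from pathlib import Path


def layer_filename(pdf_path, index, rgb, out_dir="output"):
    """Build `{pdf_name}_layer{N}_{R}_{G}_{B}.png` with zero-padded channels."""
    r, g, b = rgb
    name = f"{Path(pdf_path).stem}_layer{index}_{r:03d}_{g:03d}_{b:03d}.png"
    return Path(out_dir) / name


print(layer_filename("piping_diagram.pdf", 1, (220, 50, 50)).as_posix())
# output/piping_diagram_layer1_220_050_050.png
```

Zero-padding keeps the color suffix a fixed width, so the RGB value can be read back from the name without ambiguity.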

## Troubleshooting

### Colors bleeding between layers (antialiasing issue)

**Problem**: Gray antialiasing pixels appear in the wrong layer, especially around black text on gray layers

**Explanation**: When black text (0,0,0) sits on a gray layer (150,150,150), antialiasing creates intermediate grays, such as (75,75,75) and (100,100,100), that are far from both black and gray in color space.

**Solution**: Increase tolerance to capture these intermediate colors
```bash
# For moderate antialiasing (default, usually works)
python layer_extractor.py diagram.pdf -t 30

# For heavy antialiasing (small text, compressed PDFs)
python layer_extractor.py diagram.pdf -t 45

# For extreme cases (very compressed or low quality)
python layer_extractor.py diagram.pdf -t 60
```
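
To put numbers on the black-on-gray case, the snippet below measures how far a few intermediate grays sit from black and from the gray layer. It assumes the tolerance is a Euclidean RGB distance rescaled to 0-100; the metric actually used by `layer_extractor.py` may differ, so treat the values as indicative only.

```python
import math

MAX_DIST = math.sqrt(3) * 255  # black-to-white distance in RGB


def scaled_distance(a, b):
    """Euclidean RGB distance on an assumed 0-100 scale."""
    raw = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 100 * raw / MAX_DIST


black, gray = (0, 0, 0), (150, 150, 150)
for value in (75, 100, 125):
    pixel = (value,) * 3
    print(value,
          round(scaled_distance(pixel, black), 1),
          round(scaled_distance(pixel, gray), 1))
# 75 29.4 29.4
# 100 39.2 19.6
# 125 49.0 9.8
```

Under this assumption a mid-gray of 75 is about 29 units from both colors, so it only just falls inside the default tolerance of 30, and a stricter setting of 20-25 would leave it unassigned.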

### Missing fine details

**Problem**: Thin lines or small text not captured

**Solution**: Increase tolerance or DPI
```bash
python layer_extractor.py diagram.pdf -t 40 --dpi 600
```

### Too many layers detected

**Problem**: Small color artifacts creating extra layers

**Solution**: Increase minimum pixel threshold
```bash
python layer_extractor.py diagram.pdf -m 500
```
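
The effect of `-m` can be pictured as a filter on per-color pixel counts. The sketch below represents layers as a mapping from color to pixel count, which is a simplification chosen for illustration. `filter_layers` is a hypothetical helper, the counts are made up, and whether the real threshold is inclusive is an assumption.

```python
def filter_layers(pixel_counts, min_pixels=100):
    """Keep colors with at least `min_pixels` pixels, largest first."""
    kept = {c: n for c, n in pixel_counts.items() if n >= min_pixels}
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)


# Made-up counts for illustration only.
counts = {
    (220, 50, 50): 48_210,   # red layer
    (50, 100, 220): 31_775,  # blue layer
    (180, 90, 140): 240,     # antialiasing artifact
    (90, 90, 90): 35,        # stray pixels
}
print(len(filter_layers(counts)))                  # 3
print(len(filter_layers(counts, min_pixels=500)))  # 2
```

With the default of 100 the 240-pixel artifact survives as a third layer; raising the threshold to 500 removes it.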

### Blurry output

**Problem**: Rendered layers look blurry or pixelated

**Solution**: Increase DPI
```bash
python layer_extractor.py diagram.pdf --dpi 600
```
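
PDF page sizes are defined in points (72 per inch), so the rendered image grows linearly with DPI in each dimension and quadratically in pixel count. The calculation below uses an A4 page (595 × 842 points) as an assumed example; `rendered_size` is a hypothetical helper, and the renderer may round dimensions slightly differently.

```python
def rendered_size(width_pt, height_pt, dpi):
    """Pixel dimensions of a page rendered at `dpi` (1 pt = 1/72 inch)."""
    scale = dpi / 72
    return round(width_pt * scale), round(height_pt * scale)


for dpi in (150, 300, 600):
    w, h = rendered_size(595, 842, dpi)
    print(dpi, w, h, f"{w * h / 1e6:.1f} MP")
```

Doubling the DPI from 300 to 600 roughly quadruples the number of pixels, which is why high-DPI runs are slower and produce larger files.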

## Examples

### Standard industrial diagram
```bash
python layer_extractor.py electrical_schematic.pdf
```

### High-detail mechanical drawing
```bash
python layer_extractor.py mechanical_drawing.pdf --dpi 600 -t 25
```

### Diagram with known 4 layers
```bash
python layer_extractor.py hvac_diagram.pdf -n 4
```

### Compressed/low-quality PDF
```bash
python layer_extractor.py scanned_diagram.pdf -t 50 --dpi 300
```

## Tips

1. **Start with defaults** - They work for most diagrams
2. **Check first** - Run once and review the output before batch processing
3. **DPI vs File Size** - Higher DPI = better quality but larger files
4. **Tolerance tuning** - Adjust by ±5-10 at a time
5. **Layer count** - Use `-n` if you know the exact number of layers; this speeds up processing

## Requirements

- Python 3.7+
- PyMuPDF (PDF rendering)
- Pillow (image processing)
- NumPy (array operations)
- scikit-learn (color clustering)