# PDF Layer Extractor for Industrial Diagrams

Extract colored layers from PDF industrial diagrams with white backgrounds. Automatically handles variable layer counts and antialiasing around text.

## Features

- **PDF Support**: Direct PDF processing at configurable DPI
- **Automatic Layer Detection**: K-means clustering identifies distinct colored layers
- **Handles Antialiasing**: Tolerates color mixing around text and fine details
- **Variable Layer Counts**: Auto-detects all colored layers
- **Strict White Filtering**: Pure white (255,255,255) treated as background only
- **High Quality Output**: Each layer saved as transparent PNG

## Installation

```bash
pip install -r requirements.txt
```

## Quick Start

```bash
# Basic usage
python layer_extractor.py diagram.pdf

# Higher resolution
python layer_extractor.py diagram.pdf --dpi 600

# Extract to specific folder
python layer_extractor.py diagram.pdf -o my_layers/
```

## Usage

### Basic Command

```bash
python layer_extractor.py diagram.pdf
```

Output: `output/diagram_layer1_255_000_000.png`, `output/diagram_layer2_000_000_255.png`, etc.
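
The output names encode the layer index and the layer's RGB color, with each channel zero-padded to three digits. A minimal sketch of how such a name could be assembled, assuming a hypothetical helper `layer_filename` and a default `output` directory (neither is confirmed to exist in `layer_extractor.py`):

```python
from pathlib import Path


def layer_filename(pdf_path, index, rgb, out_dir="output"):
    """Build a path like output/diagram_layer1_255_000_000.png.

    Hypothetical helper for illustration; the real script may differ.
    """
    r, g, b = rgb
    stem = Path(pdf_path).stem  # "diagram.pdf" -> "diagram"
    # :03d zero-pads each channel so names sort and align consistently.
    name = f"{stem}_layer{index}_{r:03d}_{g:03d}_{b:03d}.png"
    return Path(out_dir) / name


print(layer_filename("diagram.pdf", 1, (255, 0, 0)).as_posix())
# output/diagram_layer1_255_000_000.png
```

Zero-padding keeps the color readable straight from the file name, so a layer can be identified without opening the image.
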
### Common Options

```bash
# High resolution rendering (better for detailed diagrams)
python layer_extractor.py diagram.pdf --dpi 600

# Adjust color tolerance (for antialiasing issues)
python layer_extractor.py diagram.pdf -t 40

# Extract only top 3 layers
python layer_extractor.py diagram.pdf -n 3

# Custom output directory
python layer_extractor.py diagram.pdf -o layers/
```

## Parameters

- `--dpi` (default: 300) - PDF rendering resolution
  - 300: Standard quality, faster
  - 600: High quality, larger files
  - 150: Draft quality, quick preview
- `-t, --tolerance` (default: 30) - Color matching tolerance (0-100 scale)
  - **10-15**: Very strict, only nearly identical colors
  - **20-25**: Strict, minimal antialiasing
  - **30**: Default, handles moderate antialiasing (RECOMMENDED)
  - **40-50**: Lenient, good for heavy antialiasing around text
  - **60+**: Very lenient, may blur layer boundaries
- `-n, --n-layers` - Extract a specific number of layers (default: auto-detect)
- `-m, --min-pixels` (default: 100) - Minimum number of pixels for a color to count as a valid layer

## How It Works

1. **PDF Rendering**: Converts the PDF to a high-resolution image at the specified DPI
2. **Color Analysis**: Uses K-means clustering on pixel colors
3. **White Filtering**: Removes the white background (every RGB channel ≥ 250)
4. **Layer Extraction**: For each color, creates a mask of similar pixels
5. **Alpha Blending**: Handles antialiasing with gradient transparency
6. **Output**: Saves each layer as a transparent PNG

## Output Format

Files are named: `{pdf_name}_layer{N}_{R}_{G}_{B}.png`

Example:

```
output/
├── piping_diagram_layer1_220_050_050.png  (Red layer)
├── piping_diagram_layer2_050_100_220.png  (Blue layer)
└── piping_diagram_layer3_050_180_050.png  (Green layer)
```

## Troubleshooting

### Colors bleeding between layers (antialiasing issue)

**Problem**: Gray pixels from antialiasing appear in the wrong layer, especially around black text on gray layers

**Explanation**: When black text (0,0,0) sits on a gray layer (150,150,150), antialiasing creates intermediate grays such as (75,75,75) and (100,100,100) that are far from both black and the layer's gray in color space.

**Solution**: Increase tolerance to capture these intermediate colors

```bash
# For moderate antialiasing (default, usually works)
python layer_extractor.py diagram.pdf -t 30

# For heavy antialiasing (small text, compressed PDFs)
python layer_extractor.py diagram.pdf -t 45

# For extreme cases (very compressed or low quality)
python layer_extractor.py diagram.pdf -t 60
```

### Missing fine details

**Problem**: Thin lines or small text not captured

**Solution**: Increase tolerance or DPI

```bash
python layer_extractor.py diagram.pdf -t 40 --dpi 600
```

### Too many layers detected

**Problem**: Small color artifacts creating extra layers

**Solution**: Increase minimum pixel threshold

```bash
python layer_extractor.py diagram.pdf -m 500
```

### Blurry output

**Problem**: Output quality not good enough

**Solution**: Increase DPI

```bash
python layer_extractor.py diagram.pdf --dpi 600
```

## Examples

### Standard industrial diagram

```bash
python layer_extractor.py electrical_schematic.pdf
```

### High-detail mechanical drawing

```bash
python layer_extractor.py mechanical_drawing.pdf --dpi 600 -t 25
```

### Diagram with known 4 layers

```bash
python layer_extractor.py hvac_diagram.pdf -n 4
```

### Compressed/low-quality PDF

```bash
python layer_extractor.py scanned_diagram.pdf -t 50 --dpi 300
```

## Tips

1. **Start with defaults** - They work for most diagrams
2. **Check first** - Run once and review the output before batch processing
3. **DPI vs File Size** - Higher DPI = better quality but larger files
4. **Tolerance tuning** - Adjust by ±5-10 at a time
5. **Layer count** - Use `-n` if you know the exact number of layers, for faster processing

## Requirements

- Python 3.7+
- PyMuPDF (PDF rendering)
- Pillow (image processing)
- NumPy (array operations)
- scikit-learn (color clustering)
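
## Appendix: Tolerance and Alpha Sketch

The white filtering, layer extraction, and alpha blending steps under How It Works can be illustrated in plain Python for a single pixel. This is a sketch under stated assumptions, not the script's actual code: the helper name `layer_alpha`, the mapping of the 0-100 tolerance scale onto a fraction of the maximum RGB distance, and the linear alpha falloff are all assumptions made for illustration.

```python
import math

# Largest possible Euclidean distance between two RGB colors.
MAX_RGB_DISTANCE = math.sqrt(3 * 255 ** 2)


def layer_alpha(pixel, layer_color, tolerance=30, white_threshold=250):
    """Return an alpha value (0-255) for one pixel relative to one layer color.

    Hypothetical helper; the real script's matching rule may differ.
    """
    # White filtering: every channel at or above the threshold is background.
    if all(c >= white_threshold for c in pixel):
        return 0
    # Assumption: tolerance is a percentage of the maximum RGB distance.
    max_dist = tolerance / 100 * MAX_RGB_DISTANCE
    dist = math.sqrt(sum((p - q) ** 2 for p, q in zip(pixel, layer_color)))
    if dist > max_dist:
        return 0  # too far from this layer's color
    if max_dist == 0:
        return 255  # zero tolerance: only exact matches reach this point
    # Assumption: opacity falls off linearly with color distance, so
    # antialiased edge pixels become partially transparent.
    return round(255 * (1 - dist / max_dist))


red = (255, 0, 0)
print(layer_alpha((255, 255, 255), red))  # white background
print(layer_alpha((255, 0, 0), red))      # exact match
print(layer_alpha((230, 20, 20), red))    # antialiased near-red pixel
```

Under these assumptions, raising `tolerance` widens the band of colors that receive a non-zero alpha, which is why a larger `-t` value picks up more antialiased pixels but can also blur the boundary between layers.
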