# PDF Layer Extractor - Summary

## What It Does

Extracts colored layers from PDF industrial diagrams into separate transparent PNG files.

- ✓ Single PDF file processing
- ✓ White background filtered (pure white only)
- ✓ Variable number of layers (auto-detected)
- ✓ Handles antialiasing around text
- ✓ High-quality output at configurable DPI

## Quick Start

1. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

2. Run on your PDF:
   ```bash
   python layer_extractor.py your_diagram.pdf
   ```

3. Find layers in `output/` folder

## Key Features

### Automatic Color Detection

Uses K-means clustering to identify distinct colored layers. White (RGB ≥ 250) is treated as background only.

### Antialiasing Handling

The tolerance parameter (default 30) handles color mixing:

- Text antialiasing creates gray pixels around black text
- Tolerance value captures these gradual color transitions
- Each pixel gets alpha based on distance from target color

### Output Format

Files named: `diagram_layerN_RRR_GGG_BBB.png`

- Transparent PNG with only that color layer
- RGB values in filename for reference

## Common Usage

```bash
# Default (works for most diagrams)
python layer_extractor.py diagram.pdf

# High quality
python layer_extractor.py diagram.pdf --dpi 600

# Strict color separation (less antialiasing bleed)
python layer_extractor.py diagram.pdf -t 20

# Lenient (more antialiasing tolerance)
python layer_extractor.py diagram.pdf -t 40

# Extract top 3 layers only
python layer_extractor.py diagram.pdf -n 3

# Custom output folder
python layer_extractor.py diagram.pdf -o my_layers/
```

## Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--dpi` | 300 | PDF rendering resolution (150/300/600) |
| `-t, --tolerance` | 30 | Color matching tolerance (15-50 typical) |
| `-n, --n-layers` | auto | Number of layers to extract |
| `-m, --min-pixels` | 100 | Minimum pixels for valid layer |
| `-o, --output` | output | Output directory |

## Tolerance Guide

The tolerance parameter is key to handling antialiasing:

- **15-20**: Very strict, clean diagrams with no antialiasing
- **30** (default): Balanced, handles moderate antialiasing
- **40-50**: Lenient, for heavy antialiasing or compression artifacts

### Example: Gray Layer with Black Text

When you have a light gray layer with black text:

- Black text creates gray antialiasing pixels
- These gray pixels are close to the gray layer color
- Higher tolerance includes them in the gray layer
- Lower tolerance might miss them

Start with default (30) and adjust ±10 based on results.

## Files Included

1. **layer_extractor.py** - Main script
2. **requirements.txt** - Dependencies (PyMuPDF, Pillow, numpy, scikit-learn)
3. **README.md** - Full documentation
4. **QUICKSTART.md** - Quick reference guide

## Technical Notes

- Uses PyMuPDF to render PDF at specified DPI
- K-means clustering identifies dominant colors
- Euclidean distance in RGB space for color matching
- Alpha channel gradient for smooth edges
- White detection: all RGB values ≥ 250

## Example Output

Input: `piping_diagram.pdf`

Output:

```
output/
├── piping_diagram_layer1_220_050_050.png  (red piping)
├── piping_diagram_layer2_050_100_220.png  (blue electrical)
├── piping_diagram_layer3_150_150_150.png  (gray annotations)
└── piping_diagram_layer4_050_180_050.png  (green mechanical)
```

Each PNG has transparent background with only that color layer visible.
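
## Alpha Calculation Sketch

The distance-based alpha described under Antialiasing Handling and Technical Notes can be illustrated with a small NumPy sketch. This is not taken from `layer_extractor.py`: the helper name `layer_alpha` is hypothetical, and the linear falloff (fully opaque at distance 0, fully transparent at the tolerance) is an assumption, since the summary only says that alpha depends on the distance from the target color. The white threshold of 250 and the default tolerance of 30 come from the notes above.

```python
import numpy as np

WHITE_THRESHOLD = 250  # from the notes: all RGB values >= 250 count as white


def layer_alpha(rgb, target, tolerance=30.0):
    """Return a uint8 alpha mask for one target color (hypothetical helper)."""
    pixels = np.asarray(rgb, dtype=np.float64)
    # Euclidean distance in RGB space between every pixel and the target color.
    dist = np.linalg.norm(pixels - np.asarray(target, dtype=np.float64), axis=-1)
    # Assumed linear falloff: opaque at distance 0, transparent at `tolerance`.
    alpha = np.clip(1.0 - dist / tolerance, 0.0, 1.0)
    # Pure-white background pixels are always excluded from the layer.
    is_white = np.all(pixels >= WHITE_THRESHOLD, axis=-1)
    alpha[is_white] = 0.0
    return np.round(alpha * 255).astype(np.uint8)


# A 1x4 test image: exact red, slightly shifted red, white, and blue.
image = np.array(
    [[[220, 50, 50], [232, 50, 50], [255, 255, 255], [50, 100, 220]]],
    dtype=np.uint8,
)
alpha = layer_alpha(image, target=(220, 50, 50), tolerance=30)
print(alpha.tolist())
```

Under this assumed mapping, raising the tolerance widens the band of near-target pixels that receive partial alpha, which matches the guidance that higher values capture more antialiasing and lower values give stricter separation.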