PDF Layer Extractor for Industrial Diagrams
Extract colored layers from PDF industrial diagrams with white backgrounds. Automatically handles variable layer counts and antialiasing around text.
Features
- PDF Support: Direct PDF processing at configurable DPI
- Automatic Layer Detection: K-means clustering identifies distinct colored layers
- Handles Antialiasing: Tolerates color mixing around text and fine details
- Variable Layer Counts: Auto-detects all colored layers
- Strict White Filtering: Pure white (255,255,255) treated as background only
- High Quality Output: Each layer saved as transparent PNG
Installation
pip install -r requirements.txt
Quick Start
# Basic usage
python layer_extractor.py diagram.pdf
# Higher resolution
python layer_extractor.py diagram.pdf --dpi 600
# Extract to specific folder
python layer_extractor.py diagram.pdf -o my_layers/
Usage
Basic Command
python layer_extractor.py diagram.pdf
Output: output/diagram_layer1_255_000_000.png, output/diagram_layer2_000_000_255.png, etc.
Common Options
# High resolution rendering (better for detailed diagrams)
python layer_extractor.py diagram.pdf --dpi 600
# Adjust color tolerance (for antialiasing issues)
python layer_extractor.py diagram.pdf -t 40
# Extract only top 3 layers
python layer_extractor.py diagram.pdf -n 3
# Custom output directory
python layer_extractor.py diagram.pdf -o layers/
Parameters
- --dpi (default: 300) - PDF rendering resolution
  - 300: Standard quality, faster
  - 600: High quality, larger files
  - 150: Draft quality, quick preview
- -t, --tolerance (default: 30) - Color matching tolerance (0-100 scale)
  - 10-15: Very strict, only nearly identical colors
  - 20-25: Strict, minimal antialiasing
  - 30: Default, handles moderate antialiasing (RECOMMENDED)
  - 40-50: Lenient, good for heavy antialiasing around text
  - 60+: Very lenient, may blur layer boundaries
- -n, --n-layers - Extract specific number of layers (default: auto-detect)
- -m, --min-pixels (default: 100) - Minimum pixels to consider a valid layer
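To get a feel for what --dpi costs, the rendered image size can be estimated from the page size. PDF page dimensions are expressed in points (72 per inch), so the scale factor is dpi / 72. The helper name `rendered_size` and the A4 example page are illustrative only and not part of the tool.

```python
from typing import Tuple

def rendered_size(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Estimate the pixel dimensions of a PDF page rendered at `dpi`."""
    scale = dpi / 72  # PDF user space uses 72 points per inch
    return round(width_pt * scale), round(height_pt * scale)

# An A4 page is roughly 595 x 842 points
for dpi in (150, 300, 600):
    w, h = rendered_size(595, 842, dpi)
    print(f"{dpi} DPI -> {w} x {h} px ({w * h / 1e6:.1f} megapixels)")
```

Doubling the DPI roughly quadruples the pixel count, which is why 600 DPI output is noticeably slower to process and larger on disk than 300 DPI.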
How It Works
- PDF Rendering: Converts PDF to high-resolution image at specified DPI
- Color Analysis: Uses K-means clustering on pixel colors
- White Filtering: Removes the white background (pixels with R, G, and B all ≥ 250, which includes pure white)
- Layer Extraction: For each color, creates a mask of similar pixels
- Alpha Blending: Handles antialiasing with gradient transparency
- Output: Saves each layer as transparent PNG
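The white filtering, layer extraction, and alpha blending steps above can be sketched with NumPy. This is a minimal illustration rather than the tool's actual code: the function name `extract_layer`, the linear alpha falloff, and the mapping of the 0-100 tolerance onto a percentage of the maximum RGB distance are all assumptions.

```python
import numpy as np

MAX_RGB_DIST = float(np.sqrt(3 * 255.0 ** 2))  # black-to-white distance, about 441.7

def extract_layer(rgb, color, tolerance=30, white_cutoff=250):
    """Return an (H, W, 4) RGBA array holding only pixels close to `color`.

    rgb: (H, W, 3) uint8 image; color: (R, G, B) of the layer.
    Assumption: tolerance (0-100) is a percentage of MAX_RGB_DIST.
    """
    pixels = rgb.astype(np.float64)
    # White filtering: all channels at or above the cutoff means background
    background = np.all(rgb >= white_cutoff, axis=-1)
    # Euclidean distance from every pixel to the layer color
    dist = np.linalg.norm(pixels - np.asarray(color, dtype=np.float64), axis=-1)
    radius = max(tolerance / 100.0 * MAX_RGB_DIST, 1e-9)  # avoid divide-by-zero
    # Linear falloff: alpha 1.0 at an exact match, 0.0 at the tolerance radius
    alpha = np.clip(1.0 - dist / radius, 0.0, 1.0)
    alpha[background] = 0.0
    rgba = np.zeros(rgb.shape[:2] + (4,), dtype=np.uint8)
    rgba[..., :3] = color
    rgba[..., 3] = np.round(alpha * 255).astype(np.uint8)
    return rgba

# 1x4 test image: exact red, antialiased red, white background, blue
img = np.array(
    [[[220, 50, 50], [230, 100, 100], [255, 255, 255], [50, 100, 220]]],
    dtype=np.uint8,
)
layer = extract_layer(img, (220, 50, 50))
print(layer[0, :, 3])  # alpha per pixel
```

In this sketch the exact match is fully opaque, the antialiased pixel is partially transparent, and the white and blue pixels are fully transparent. Raising `tolerance` widens the radius, so more edge pixels are kept and each is more opaque.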
Output Format
Files are named: {pdf_name}_layer{N}_{R}_{G}_{B}.png
Example:
output/
├── piping_diagram_layer1_220_050_050.png (Red layer)
├── piping_diagram_layer2_050_100_220.png (Blue layer)
└── piping_diagram_layer3_050_180_050.png (Green layer)
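The naming scheme zero-pads each color channel to three digits, which makes the files easy to generate and parse in a batch script. The two helpers below, `layer_filename` and `parse_layer_filename`, are hypothetical and shown only to illustrate the pattern.

```python
import re

def layer_filename(pdf_name, index, rgb):
    """Build a name following {pdf_name}_layer{N}_{R}_{G}_{B}.png."""
    r, g, b = rgb
    return f"{pdf_name}_layer{index}_{r:03d}_{g:03d}_{b:03d}.png"

_PATTERN = re.compile(r"^(.+)_layer(\d+)_(\d{3})_(\d{3})_(\d{3})\.png$")

def parse_layer_filename(filename):
    """Recover (pdf_name, layer index, (R, G, B)) from a layer file name."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a layer file name: {filename}")
    name, index, r, g, b = match.groups()
    return name, int(index), (int(r), int(g), int(b))

print(layer_filename("piping_diagram", 1, (220, 50, 50)))
print(parse_layer_filename("piping_diagram_layer2_050_100_220.png"))
```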
Troubleshooting
Colors bleeding between layers (antialiasing issue)
Problem: Gray pixels from antialiasing appear in the wrong layer, especially around black text on gray layers
Explanation: When black text (0,0,0) sits on a gray layer (150,150,150), antialiasing creates intermediate grays such as (75,75,75) and (100,100,100). These blended pixels can be far from both black and the layer's gray in color space, so a strict tolerance assigns them to neither.
Solution: Increase tolerance to capture these intermediate colors
# For moderate antialiasing (default, usually works)
python layer_extractor.py diagram.pdf -t 30
# For heavy antialiasing (small text, compressed PDFs)
python layer_extractor.py diagram.pdf -t 45
# For extreme cases (very compressed or low quality)
python layer_extractor.py diagram.pdf -t 60
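The arithmetic behind the explanation can be checked directly. The snippet assumes colors are compared by Euclidean distance in RGB space and that tolerance is a percentage of the maximum possible distance; the README does not state the exact metric, so treat both as assumptions.

```python
import math

def rgb_distance(a, b):
    """Euclidean distance between two RGB triples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

black = (0, 0, 0)
gray_layer = (150, 150, 150)
blended = (75, 75, 75)  # 50% mix produced by antialiasing

d_black = rgb_distance(blended, black)
d_gray = rgb_distance(blended, gray_layer)
max_dist = rgb_distance((0, 0, 0), (255, 255, 255))

print(f"to black: {d_black:.1f}, to gray: {d_gray:.1f}")           # to black: 129.9, to gray: 129.9
print(f"as % of max RGB distance: {100 * d_gray / max_dist:.1f}")  # as % of max RGB distance: 29.4
```

Under these assumptions the halfway pixel sits right at the edge of the default tolerance of 30, which is consistent with small or heavily compressed text needing 40 or more.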
Missing fine details
Problem: Thin lines or small text not captured
Solution: Increase tolerance or DPI
python layer_extractor.py diagram.pdf -t 40 --dpi 600
Too many layers detected
Problem: Small color artifacts creating extra layers
Solution: Increase minimum pixel threshold
python layer_extractor.py diagram.pdf -m 500
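The effect of -m can be pictured as a count filter over the detected color clusters. The sketch below assumes every pixel has already been assigned a cluster label; `filter_small_layers` and the sample counts are made up for illustration.

```python
from collections import Counter

def filter_small_layers(labels, min_pixels=100):
    """Keep cluster labels with at least `min_pixels` members, largest first."""
    counts = Counter(labels)
    return [label for label, n in counts.most_common() if n >= min_pixels]

# Two real layers plus a 40-pixel artifact
labels = ["red"] * 5000 + ["blue"] * 1200 + ["speck"] * 40
print(filter_small_layers(labels))        # ['red', 'blue']
print(filter_small_layers(labels, 2000))  # ['red']
```

Setting the threshold too high drops genuine but sparse layers (thin annotation lines, for example), so it is worth raising it in steps.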
Blurry output
Problem: Output quality not good enough
Solution: Increase DPI
python layer_extractor.py diagram.pdf --dpi 600
Examples
Standard industrial diagram
python layer_extractor.py electrical_schematic.pdf
High-detail mechanical drawing
python layer_extractor.py mechanical_drawing.pdf --dpi 600 -t 25
Diagram with known 4 layers
python layer_extractor.py hvac_diagram.pdf -n 4
Compressed/low-quality PDF
python layer_extractor.py scanned_diagram.pdf -t 50 --dpi 300
Tips
- Start with defaults - They work for most diagrams
- Check first - Run once and review output before batch processing
- DPI vs File Size - Higher DPI = better quality but larger files
- Tolerance tuning - Adjust by ±5-10 at a time
- Layer count - Use -n if you know the exact number, for faster processing
Requirements
- Python 3.7+
- PyMuPDF (PDF rendering)
- Pillow (image processing)
- NumPy (array operations)
- scikit-learn (color clustering)