PDF Layer Extractor - Summary
What It Does
Extracts colored layers from PDF industrial diagrams into separate transparent PNG files.
- ✓ Single PDF file processing
- ✓ White background filtered (near-white only: all RGB channels ≥ 250)
- ✓ Variable number of layers (auto-detected)
- ✓ Handles antialiasing around text
- ✓ High-quality output at configurable DPI
Quick Start
- Install dependencies:
  pip install -r requirements.txt
- Run on your PDF:
  python layer_extractor.py your_diagram.pdf
- Find the extracted layers in the output/ folder.
Key Features
Automatic Color Detection
Uses K-means clustering to identify distinct colored layers. Pixels whose R, G, and B values are all ≥ 250 are treated as white background and never become a layer.
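The detection step can be sketched roughly as follows. This is a pure-Python stand-in for scikit-learn's KMeans (which the tool lists as a dependency), written so it runs without third-party packages. The function names, the farthest-point initialization, and the explicit `k` argument are assumptions for illustration; how the script picks the layer count automatically is not shown here.

```python
import math

def is_white(pixel, threshold=250):
    """Background test from this summary: every RGB channel >= 250."""
    return all(channel >= threshold for channel in pixel)

def dominant_colors(pixels, k, iterations=20):
    """Cluster non-white pixels into k colors with a small Lloyd's k-means."""
    foreground = [p for p in pixels if not is_white(p)]
    if not foreground:
        return []
    # Deterministic farthest-point initialization (an assumption, not the script's method).
    centers = [foreground[0]]
    while len(centers) < k:
        centers.append(max(foreground,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iterations):
        # Assign each pixel to its nearest center.
        groups = [[] for _ in centers]
        for p in foreground:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        new_centers = [
            tuple(sum(ch) / len(group) for ch in zip(*group)) if group else center
            for center, group in zip(centers, groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return [tuple(round(c) for c in center) for center in centers]

sample = [(255, 255, 255)] * 5 + [
    (220, 50, 50), (222, 52, 48), (218, 48, 52),    # reddish strokes
    (50, 100, 220), (52, 98, 222), (48, 102, 218),  # bluish strokes
]
print(dominant_colors(sample, k=2))  # [(220, 50, 50), (50, 100, 220)]
```

White pixels are dropped before clustering so that the page background cannot dominate the result.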
Antialiasing Handling
The tolerance parameter (default 30) handles color mixing:
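- Text antialiasing creates gray pixels around black text
- The tolerance value sets how far a pixel's color may be from a layer's color and still count as part of that layer, which captures these gradual transitions
- Each pixel's alpha is derived from its distance to the target color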
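A minimal sketch of the distance-to-alpha idea described above. The linear falloff and the name `alpha_for_pixel` are assumptions for illustration; the script's actual curve may differ.

```python
import math

def alpha_for_pixel(pixel, target, tolerance=30):
    """Return a 0-255 alpha: opaque at the target color, fading to 0 at `tolerance`."""
    distance = math.dist(pixel, target)  # Euclidean distance in RGB space
    if distance >= tolerance:
        return 0                         # too far from the layer color: fully transparent
    return round(255 * (1 - distance / tolerance))

black = (0, 0, 0)
print(alpha_for_pixel((0, 0, 0), black))     # 255, exact match
print(alpha_for_pixel((10, 10, 10), black))  # 108, antialiased edge pixel
print(alpha_for_pixel((60, 60, 60), black))  # 0, outside the tolerance
```

With this kind of mapping, edge pixels keep partial opacity instead of being cut off abruptly, which is what gives the extracted layer smooth edges.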
Output Format
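Files are named diagram_layerN_RRR_GGG_BBB.png, where diagram is the input PDF's base name and N is the layer number:
- Transparent PNG containing only that color layer
- The layer's RGB values appear in the filename for reference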
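The naming pattern can be reproduced with a short helper. The function name `layer_filename` is hypothetical, and the three-digit zero padding is inferred from the example filenames in this summary (such as 220_050_050).

```python
def layer_filename(stem, index, rgb):
    """Build '<stem>_layerN_RRR_GGG_BBB.png' with zero-padded channel values."""
    r, g, b = rgb
    return f"{stem}_layer{index}_{r:03d}_{g:03d}_{b:03d}.png"

print(layer_filename("piping_diagram", 1, (220, 50, 50)))
# piping_diagram_layer1_220_050_050.png
```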
Common Usage
# Default (works for most diagrams)
python layer_extractor.py diagram.pdf
# High quality
python layer_extractor.py diagram.pdf --dpi 600
# Strict color separation (less antialiasing bleed)
python layer_extractor.py diagram.pdf -t 20
# Lenient (more antialiasing tolerance)
python layer_extractor.py diagram.pdf -t 40
# Extract top 3 layers only
python layer_extractor.py diagram.pdf -n 3
# Custom output folder
python layer_extractor.py diagram.pdf -o my_layers/
Parameters
| Parameter | Default | Description |
|---|---|---|
| --dpi | 300 | PDF rendering resolution (150/300/600) |
| -t, --tolerance | 30 | Color matching tolerance (15-50 typical) |
| -n, --n-layers | auto | Number of layers to extract |
| -m, --min-pixels | 100 | Minimum pixels for valid layer |
| -o, --output | output | Output directory |
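For orientation, a command-line interface with these flags and defaults could be declared with argparse roughly as below. This is a sketch, not the script's actual parser: the positional argument name `pdf`, the argument types, and the use of `None` to represent "auto" are assumptions.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        prog="layer_extractor.py",
        description="Extract colored layers from a PDF into transparent PNGs.",
    )
    parser.add_argument("pdf", help="input PDF file")
    parser.add_argument("--dpi", type=int, default=300,
                        help="PDF rendering resolution")
    parser.add_argument("-t", "--tolerance", type=float, default=30,
                        help="color matching tolerance")
    parser.add_argument("-n", "--n-layers", type=int, default=None,
                        help="number of layers to extract (default: auto)")
    parser.add_argument("-m", "--min-pixels", type=int, default=100,
                        help="minimum pixels for a valid layer")
    parser.add_argument("-o", "--output", default="output",
                        help="output directory")
    return parser

args = build_parser().parse_args(["diagram.pdf", "--dpi", "600", "-t", "20"])
print(args.dpi, args.tolerance, args.n_layers, args.min_pixels, args.output)
# 600 20.0 None 100 output
```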
Tolerance Guide
The tolerance parameter is key to handling antialiasing:
- 15-20: Very strict, clean diagrams with no antialiasing
- 30 (default): Balanced, handles moderate antialiasing
- 40-50: Lenient, for heavy antialiasing or compression artifacts
Example: Gray Layer with Black Text
When you have a light gray layer with black text:
- Black text creates gray antialiasing pixels
- These gray pixels are close to the gray layer color
- A higher tolerance includes them in the gray layer
- A lower tolerance may exclude them from it
Start with the default (30) and adjust by ±10 based on the results.
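A small worked example of why the setting matters, using made-up colors: a gray layer at (150, 150, 150) and a hypothetical antialiasing pixel at (130, 130, 130). The helper name `in_layer` and the strict less-than comparison are assumptions.

```python
import math

def in_layer(pixel, layer_color, tolerance):
    """True if the pixel lies within `tolerance` of the layer color in RGB space."""
    return math.dist(pixel, layer_color) < tolerance

gray_layer = (150, 150, 150)
edge_pixel = (130, 130, 130)  # hypothetical pixel between black text and gray fill

distance = math.dist(edge_pixel, gray_layer)  # 20 * sqrt(3), about 34.6
for tolerance in (20, 30, 40):
    verdict = "included" if in_layer(edge_pixel, gray_layer, tolerance) else "excluded"
    print(f"tolerance {tolerance}: {verdict}")
```

Here the pixel is excluded at tolerances 20 and 30 and included at 40, which is the kind of difference a ±10 adjustment makes.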
Files Included
- layer_extractor.py - Main script
- requirements.txt - Dependencies (PyMuPDF, Pillow, numpy, scikit-learn)
- README.md - Full documentation
- QUICKSTART.md - Quick reference guide
Technical Notes
- Uses PyMuPDF to render PDF at specified DPI
- K-means clustering identifies dominant colors
- Euclidean distance in RGB space for color matching
- Alpha channel gradient for smooth edges
- White detection: all RGB values ≥ 250
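The rendering step might look roughly like the sketch below. PDF coordinates use 72 points per inch, so a target DPI translates to a scale factor of dpi / 72. The function names are hypothetical, and rendering only the first page is an assumption; the PyMuPDF import is deferred so the scale helper works without the library installed.

```python
def dpi_to_zoom(dpi):
    """PDF user space is 72 points per inch, so the scale factor is dpi / 72."""
    return dpi / 72

def render_first_page(pdf_path, dpi=300):
    """Render page 1 and return (width, height, raw RGB bytes). Requires PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily

    zoom = dpi_to_zoom(dpi)
    with fitz.open(pdf_path) as doc:
        # alpha=False yields plain RGB samples, 3 bytes per pixel
        pix = doc[0].get_pixmap(matrix=fitz.Matrix(zoom, zoom), alpha=False)
        return pix.width, pix.height, pix.samples

print(dpi_to_zoom(300))  # about 4.17: each PDF point becomes roughly 4 pixels
```

Because pixel count grows with the square of the scale factor, a 600 DPI render holds four times as many pixels as a 300 DPI one.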
Example Output
Input: piping_diagram.pdf
Output:
output/
├── piping_diagram_layer1_220_050_050.png (red piping)
├── piping_diagram_layer2_050_100_220.png (blue electrical)
├── piping_diagram_layer3_150_150_150.png (gray annotations)
└── piping_diagram_layer4_050_180_050.png (green mechanical)
Each PNG has a transparent background with only that color layer visible.