# PDF Layer Extractor - Summary

## What It Does

Extracts colored layers from PDF industrial diagrams into separate transparent PNG files.

- ✓ Single PDF file processing
- ✓ White background filtered out (all RGB channels ≥ 250)
- ✓ Variable number of layers (auto-detected)
- ✓ Handles antialiasing around text
- ✓ High-quality output at configurable DPI

## Quick Start

1. Install dependencies:
```bash
pip install -r requirements.txt
```

2. Run on your PDF:
```bash
python layer_extractor.py your_diagram.pdf
```

3. Find the extracted layers in the `output/` folder.

## Key Features

### Automatic Color Detection
Uses K-means clustering to identify distinct colored layers. White pixels (all RGB channels ≥ 250) are treated as background and never become a layer.

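The detection step could be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `detect_layer_colors` is hypothetical, the number of clusters is passed in explicitly (how the tool auto-detects the layer count is not described here), and the `min_pixels` filter is assumed to work like the `--min-pixels` option.

```python
import numpy as np
from sklearn.cluster import KMeans

WHITE_THRESHOLD = 250  # from the text: all RGB channels >= 250 count as background


def detect_layer_colors(rgb, n_layers, min_pixels=100):
    """Return dominant non-white colors as (R, G, B) tuples, largest cluster first.

    Hypothetical helper: `rgb` is an (H, W, 3) uint8 array of the rendered page.
    """
    pixels = rgb.reshape(-1, 3)
    is_white = np.all(pixels >= WHITE_THRESHOLD, axis=1)
    colored = pixels[~is_white].astype(np.float64)
    if len(colored) == 0:
        return []
    # K-means cannot produce more clusters than there are distinct colors.
    n_clusters = min(n_layers, len(np.unique(colored, axis=0)))
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(colored)
    counts = np.bincount(km.labels_, minlength=n_clusters)
    order = np.argsort(counts)[::-1]  # biggest cluster first
    return [
        tuple(int(round(c)) for c in km.cluster_centers_[i])
        for i in order
        if counts[i] >= min_pixels  # assumed meaning of --min-pixels
    ]
```

Dropping white pixels before clustering keeps the background from claiming a cluster of its own.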
### Antialiasing Handling
The tolerance parameter (default 30) handles color mixing:
- Text antialiasing creates gray pixels around black text
- The tolerance value captures these gradual color transitions
- Each pixel gets an alpha value based on its distance from the target color

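One way to turn distance into alpha is a linear falloff, sketched below. The exact mapping used by the script is not stated, so the linear ramp, the function name `layer_rgba`, and the choice to keep each pixel's original color are all assumptions made for illustration.

```python
import numpy as np


def layer_rgba(rgb, target, tolerance=30.0):
    """Build an RGBA layer whose alpha fades with distance from `target`.

    Assumed mapping: alpha is 255 at distance 0 and falls linearly to 0
    at distance == tolerance. `rgb` is an (H, W, 3) uint8 array.
    """
    diff = rgb.astype(np.float64) - np.asarray(target, dtype=np.float64)
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # Euclidean distance in RGB space
    alpha = np.clip(1.0 - dist / tolerance, 0.0, 1.0)
    rgba = np.zeros(rgb.shape[:2] + (4,), dtype=np.uint8)
    rgba[..., :3] = rgb  # keep the original pixel colors (an assumption)
    rgba[..., 3] = np.round(alpha * 255).astype(np.uint8)
    return rgba
```

The resulting array can be saved as a transparent PNG with Pillow, for example `Image.fromarray(rgba, "RGBA").save(path)`.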
### Output Format
Files are named `diagram_layerN_RRR_GGG_BBB.png`, where `diagram` is the input PDF's base name:
- Transparent PNG containing only that color layer
- RGB values of the layer color in the filename for reference

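The naming pattern can be reproduced with a small helper. The function name `layer_filename` is hypothetical, and the zero-padding to three digits is inferred from filenames such as `piping_diagram_layer1_220_050_050.png`.

```python
from pathlib import Path


def layer_filename(pdf_path, index, color):
    """Build '<stem>_layer<N>_<RRR>_<GGG>_<BBB>.png' for one layer.

    Hypothetical helper: `index` is the 1-based layer number and
    `color` is an (R, G, B) tuple of ints in 0..255.
    """
    r, g, b = color
    stem = Path(pdf_path).stem  # input file name without directory or extension
    return f"{stem}_layer{index}_{r:03d}_{g:03d}_{b:03d}.png"


print(layer_filename("piping_diagram.pdf", 1, (220, 50, 50)))
# piping_diagram_layer1_220_050_050.png
```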
## Common Usage

```bash
# Default (works for most diagrams)
python layer_extractor.py diagram.pdf

# High quality
python layer_extractor.py diagram.pdf --dpi 600

# Strict color separation (less antialiasing bleed)
python layer_extractor.py diagram.pdf -t 20

# Lenient (more antialiasing tolerance)
python layer_extractor.py diagram.pdf -t 40

# Extract top 3 layers only
python layer_extractor.py diagram.pdf -n 3

# Custom output folder
python layer_extractor.py diagram.pdf -o my_layers/
```

## Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--dpi` | 300 | PDF rendering resolution (150/300/600) |
| `-t, --tolerance` | 30 | Color matching tolerance (15-50 typical) |
| `-n, --n-layers` | auto | Number of layers to extract |
| `-m, --min-pixels` | 100 | Minimum pixels for valid layer |
| `-o, --output` | output | Output directory |

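For readers who want to see how such an interface is typically declared, here is an `argparse` sketch that mirrors the table. It is an illustration only: the positional argument name `pdf`, the argument types, and the use of `None` to mean "auto" are assumptions, not taken from `layer_extractor.py`.

```python
import argparse


def build_parser():
    """Declare options matching the parameter table (types are assumed)."""
    parser = argparse.ArgumentParser(
        prog="layer_extractor.py",
        description="Extract colored layers from a PDF into transparent PNGs.",
    )
    parser.add_argument("pdf", help="input PDF file (hypothetical argument name)")
    parser.add_argument("--dpi", type=int, default=300,
                        help="PDF rendering resolution")
    parser.add_argument("-t", "--tolerance", type=float, default=30,
                        help="color matching tolerance")
    parser.add_argument("-n", "--n-layers", type=int, default=None,
                        help="number of layers to extract (auto if omitted)")
    parser.add_argument("-m", "--min-pixels", type=int, default=100,
                        help="minimum pixels for a valid layer")
    parser.add_argument("-o", "--output", default="output",
                        help="output directory")
    return parser


args = build_parser().parse_args(["diagram.pdf", "-t", "20", "-n", "3"])
print(args.dpi, args.tolerance, args.n_layers, args.output)
```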
## Tolerance Guide

The tolerance parameter is key to handling antialiasing:

- **15-20**: Very strict, for clean diagrams with no antialiasing
- **30** (default): Balanced, handles moderate antialiasing
- **40-50**: Lenient, for heavy antialiasing or compression artifacts

### Example: Gray Layer with Black Text

When you have a light gray layer with black text:
- Black text creates gray antialiasing pixels
- These gray pixels are close to the gray layer color
- Higher tolerance includes them in the gray layer
- Lower tolerance might miss them

Start with default (30) and adjust ±10 based on results.

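A small calculation shows why a ±10 adjustment can matter. The colors below are made-up values for illustration, and the rule "a pixel matches when its distance is below the tolerance" is an assumption about how the threshold is applied.

```python
import math

# Hypothetical colors for illustration; real values depend on the diagram.
GRAY_LAYER = (200, 200, 200)
EDGE_PIXEL = (180, 180, 180)  # a slightly darker pixel next to black text


def within_tolerance(pixel, target, tolerance):
    """True if the pixel's Euclidean RGB distance to target is below tolerance."""
    return math.dist(pixel, target) < tolerance


distance = math.dist(EDGE_PIXEL, GRAY_LAYER)  # sqrt(3 * 20**2), about 34.6
for tol in (20, 30, 40):
    print(tol, within_tolerance(EDGE_PIXEL, GRAY_LAYER, tol))
```

With these values the edge pixel is missed at tolerances 20 and 30 and captured at 40.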
## Files Included

1. **layer_extractor.py** - Main script
2. **requirements.txt** - Dependencies (PyMuPDF, Pillow, numpy, scikit-learn)
3. **README.md** - Full documentation
4. **QUICKSTART.md** - Quick reference guide

## Technical Notes

- Uses PyMuPDF to render PDF at specified DPI
- K-means clustering identifies dominant colors
- Euclidean distance in RGB space for color matching
- Alpha channel gradient for smooth edges
- White detection: all RGB values ≥ 250

## Example Output

Input: `piping_diagram.pdf`
Output:
```
output/
├── piping_diagram_layer1_220_050_050.png (red piping)
├── piping_diagram_layer2_050_100_220.png (blue electrical)
├── piping_diagram_layer3_150_150_150.png (gray annotations)
└── piping_diagram_layer4_050_180_050.png (green mechanical)
```

Each PNG has a transparent background with only that color layer visible.