Initial commit

2026-05-14 14:07:04 -03:00
commit e0bc5d784b
34 changed files with 7496 additions and 0 deletions
--- a/label/cores/file/README.md
+++ b/label/cores/file/README.md
@@ -0,0 +1,181 @@
+# PDF Layer Extractor for Industrial Diagrams
+
+Extract colored layers from PDF industrial diagrams with white backgrounds. Automatically handles variable layer counts and antialiasing around text.
+
+## Features
+
+- **PDF Support**: Direct PDF processing at configurable DPI
+- **Automatic Layer Detection**: K-means clustering identifies distinct colored layers
+- **Handles Antialiasing**: Tolerates color mixing around text and fine details
+- **Variable Layer Counts**: Auto-detects all colored layers
+- **Strict White Filtering**: White and near-white pixels are treated as background only, never as a layer
+- **High Quality Output**: Each layer saved as transparent PNG
+
+## Installation
+
+```bash
+pip install -r requirements.txt
+```
+
+## Quick Start
+
+```bash
+# Basic usage
+python layer_extractor.py diagram.pdf
+
+# Higher resolution
+python layer_extractor.py diagram.pdf --dpi 600
+
+# Extract to specific folder
+python layer_extractor.py diagram.pdf -o my_layers/
+```
+
+## Usage
+
+### Basic Command
+
+```bash
+python layer_extractor.py diagram.pdf
+```
+
+Output: `output/diagram_layer1_255_000_000.png`, `output/diagram_layer2_000_000_255.png`, etc.
+
+### Common Options
+
+```bash
+# High resolution rendering (better for detailed diagrams)
+python layer_extractor.py diagram.pdf --dpi 600
+
+# Adjust color tolerance (for antialiasing issues)
+python layer_extractor.py diagram.pdf -t 40
+
+# Extract only the top 3 layers
+python layer_extractor.py diagram.pdf -n 3
+
+# Custom output directory
+python layer_extractor.py diagram.pdf -o layers/
+```
+
+## Parameters
+
+- `--dpi` (default: 300) - PDF rendering resolution
+  - 300: Standard quality, faster
+  - 600: High quality, larger files
+  - 150: Draft quality, quick preview
+
+- `-t, --tolerance` (default: 30) - Color matching tolerance (0-100 scale)
+  - **10-15**: Very strict, only nearly identical colors
+  - **20-25**: Strict, minimal antialiasing
+  - **30**: Default, handles moderate antialiasing (RECOMMENDED)
+  - **40-50**: Lenient, good for heavy antialiasing around text
+  - **60+**: Very lenient, may blur layer boundaries
+
+- `-n, --n-layers` - Extract a specific number of layers (default: auto-detect)
+
+- `-m, --min-pixels` (default: 100) - Minimum number of pixels a color must cover to count as a valid layer
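+
+To make the `--dpi` trade-off concrete, the sketch below estimates the rendered image size for a page. It relies only on the fact that PDF page dimensions are measured in points (72 per inch); the helper name `rendered_size` and the A4 example page are illustrative assumptions, not part of `layer_extractor.py`.
+
+```python
+POINTS_PER_INCH = 72  # PDF page dimensions are expressed in points (1/72 inch)
+
+
+def rendered_size(width_pt, height_pt, dpi):
+    """Hypothetical helper: pixel size of a page rendered at `dpi`."""
+    scale = dpi / POINTS_PER_INCH
+    return round(width_pt * scale), round(height_pt * scale)
+
+
+# Example page: A4, 595 x 842 points
+for dpi in (150, 300, 600):
+    width_px, height_px = rendered_size(595, 842, dpi)
+    megapixels = width_px * height_px / 1e6
+    print(f"{dpi} DPI: {width_px} x {height_px} px, {megapixels:.1f} MP")
+```
+
+Doubling the DPI roughly quadruples the pixel count, which is why 600 DPI output is noticeably larger and slower to process than 300 DPI.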
+
+## How It Works
+
+1. **PDF Rendering**: Renders the PDF to an image at the specified DPI
+2. **Color Analysis**: Uses K-means clustering on pixel colors
+3. **White Filtering**: Removes the white background (pixels with R, G, and B all ≥ 250)
+4. **Layer Extraction**: For each color, creates a mask of similar pixels
+5. **Alpha Blending**: Handles antialiasing with gradient transparency
+6. **Output**: Saves each layer as transparent PNG
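+
+Steps 3 to 5 can be sketched in plain Python on a handful of pixels. This is a minimal illustration under stated assumptions, not the tool's actual code: the layer color is supplied by hand rather than found by K-means, similarity is plain Euclidean distance in RGB, alpha falls off linearly with distance, and `max_distance` is a raw RGB distance (how the tool maps `--tolerance` onto a distance is not documented here). All function names are hypothetical.
+
+```python
+import math
+
+WHITE_THRESHOLD = 250  # step 3: every channel >= 250 counts as background
+
+
+def is_background(pixel):
+    return all(channel >= WHITE_THRESHOLD for channel in pixel)
+
+
+def color_distance(a, b):
+    """Euclidean distance between two RGB tuples."""
+    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
+
+
+def extract_layer(pixels, layer_color, max_distance):
+    """Steps 4-5 (assumed form): mask similar pixels, fade alpha with distance."""
+    layer = []
+    for pixel in pixels:
+        distance = color_distance(pixel, layer_color)
+        if is_background(pixel) or distance > max_distance:
+            layer.append((0, 0, 0, 0))  # not part of this layer: transparent
+        else:
+            alpha = round(255 * (1 - distance / max_distance))
+            layer.append((*layer_color, alpha))
+    return layer
+
+
+pixels = [
+    (255, 255, 255),  # background
+    (220, 50, 50),    # exact layer color
+    (235, 150, 150),  # antialiased edge between red and white
+    (50, 100, 220),   # belongs to a different (blue) layer
+]
+for rgba in extract_layer(pixels, layer_color=(220, 50, 50), max_distance=150.0):
+    print(rgba)
+```
+
+The exact-color pixel comes out fully opaque, the antialiased edge pixel comes out mostly transparent, and the background and blue pixels are dropped from the red layer.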
+
+## Output Format
+
+Files are named `{pdf_name}_layer{N}_{R}_{G}_{B}.png`, with each color channel zero-padded to three digits.
+
+Example:
+```
+output/
+├── piping_diagram_layer1_220_050_050.png  (Red layer)
+├── piping_diagram_layer2_050_100_220.png  (Blue layer)
+└── piping_diagram_layer3_050_180_050.png  (Green layer)
+```
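+
+The naming pattern is easy to reproduce when post-processing the output, for example to predict or parse file names in a batch script. The helper below is a hypothetical sketch that follows the pattern and zero-padding shown in the example above; it is not taken from `layer_extractor.py`.
+
+```python
+def layer_filename(pdf_name, index, rgb):
+    """Hypothetical helper: build a layer file name from the documented pattern."""
+    r, g, b = rgb
+    # :03d zero-pads each channel to three digits, e.g. 50 -> "050"
+    return f"{pdf_name}_layer{index}_{r:03d}_{g:03d}_{b:03d}.png"
+
+
+print(layer_filename("piping_diagram", 1, (220, 50, 50)))
+# prints: piping_diagram_layer1_220_050_050.png
+```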
+
+## Troubleshooting
+
+### Colors bleeding between layers (antialiasing issue)
+
+**Problem**: Gray pixels from antialiasing appear in the wrong layer, especially around black text on gray layers
+
+**Explanation**: When black text (0,0,0) sits on a gray layer (150,150,150), antialiasing creates intermediate grays such as (75,75,75) and (100,100,100) that are far from both black and the layer gray in color space.
+
+**Solution**: Increase tolerance to capture these intermediate colors
+```bash
+# For moderate antialiasing (default, usually works)
+python layer_extractor.py diagram.pdf -t 30
+
+# For heavy antialiasing (small text, compressed PDFs)
+python layer_extractor.py diagram.pdf -t 45
+
+# For extreme cases (very compressed or low quality)
+python layer_extractor.py diagram.pdf -t 60
+```
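+
+The distances involved can be checked directly. The sketch below assumes plain Euclidean distance in RGB, which may differ from the metric and scaling the tool uses for `--tolerance`; it only illustrates how far the intermediate grays sit from either cluster center.
+
+```python
+import math
+
+
+def rgb_distance(a, b):
+    """Euclidean distance between two RGB tuples (assumed metric)."""
+    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
+
+
+black = (0, 0, 0)
+layer_gray = (150, 150, 150)
+
+for value in (75, 100, 125):
+    edge = (value, value, value)
+    to_black = rgb_distance(edge, black)
+    to_gray = rgb_distance(edge, layer_gray)
+    print(f"{edge}: {to_black:.1f} from black, {to_gray:.1f} from layer gray")
+```
+
+The midpoint (75,75,75) is about 129.9 units from both colors, out of a maximum possible RGB distance of about 441.7, so a tight tolerance assigns it to neither layer.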
+
+### Missing fine details
+
+**Problem**: Thin lines or small text not captured
+
+**Solution**: Increase tolerance or DPI
+```bash
+python layer_extractor.py diagram.pdf -t 40 --dpi 600
+```
+
+### Too many layers detected
+
+**Problem**: Small color artifacts creating extra layers
+
+**Solution**: Increase minimum pixel threshold
+```bash
+python layer_extractor.py diagram.pdf -m 500
+```
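+
+The effect of the threshold can be pictured as a filter on per-color pixel counts. This is a hedged sketch: the function name `keep_layers`, the use of integer labels, and the inclusive `>=` comparison are assumptions for illustration, not confirmed details of the tool.
+
+```python
+from collections import Counter
+
+
+def keep_layers(labels, min_pixels=100):
+    """Hypothetical filter: keep color labels covering at least `min_pixels` pixels."""
+    counts = Counter(labels)
+    return {label: n for label, n in counts.items() if n >= min_pixels}
+
+
+# One label per pixel: two real layers plus a 40-pixel artifact
+labels = [0] * 5000 + [1] * 1200 + [2] * 40
+print(keep_layers(labels))                   # artifact removed at the default
+print(keep_layers(labels, min_pixels=2000))  # a stricter threshold also drops layer 1
+```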
+
+### Blurry output
+
+**Problem**: Extracted layers look soft or pixelated
+
+**Solution**: Increase DPI
+```bash
+python layer_extractor.py diagram.pdf --dpi 600
+```
+
+## Examples
+
+### Standard industrial diagram
+```bash
+python layer_extractor.py electrical_schematic.pdf
+```
+
+### High-detail mechanical drawing
+```bash
+python layer_extractor.py mechanical_drawing.pdf --dpi 600 -t 25
+```
+
+### Diagram with known 4 layers
+```bash
+python layer_extractor.py hvac_diagram.pdf -n 4
+```
+
+### Compressed/low-quality PDF
+```bash
+python layer_extractor.py scanned_diagram.pdf -t 50 --dpi 300
+```
+
+## Tips
+
+1. **Start with defaults** - They work for most diagrams
+2. **Check first** - Run once and review output before batch processing
+3. **DPI vs File Size** - Higher DPI = better quality but larger files
+4. **Tolerance tuning** - Adjust by ±5-10 at a time
+5. **Layer count** - Use `-n` if you know the exact number of layers, for faster processing
+
+## Requirements
+
+- Python 3.7+
+- PyMuPDF (PDF rendering)
+- Pillow (image processing)
+- NumPy (array operations)
+- scikit-learn (color clustering)