2026-05-14 14:07:04 -03:00

PDF Layer Extractor - Summary

What It Does

Extracts colored layers from PDF industrial diagrams into separate transparent PNG files.

  ✓ Single PDF file processing
  ✓ White background filtered (near-white pixels only, all RGB channels ≥ 250)
  ✓ Variable number of layers (auto-detected)
  ✓ Handles antialiasing around text
  ✓ High-quality output at configurable DPI

Quick Start

  1. Install dependencies:
     pip install -r requirements.txt
  2. Run on your PDF:
     python layer_extractor.py your_diagram.pdf
  3. Find the layers in the output/ folder

Key Features

Automatic Color Detection

K-means clustering over the pixel colors identifies the distinct colored layers. Pixels whose R, G, and B values are all ≥ 250 are treated as white background and never form a layer.
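As a rough illustration of these two steps, the sketch below drops near-white pixels and then clusters what remains. It is a standard-library stand-in, not the script's code: requirements.txt lists scikit-learn, so the script presumably uses that library's K-means instead. The helper names `is_background` and `kmeans_colors` are hypothetical, and the layer count is fixed at 2 here rather than auto-detected.

```python
import math
import random

WHITE_THRESHOLD = 250  # from the text: all channels >= 250 count as background


def is_background(pixel):
    """True when every channel is at or above the white threshold."""
    return all(channel >= WHITE_THRESHOLD for channel in pixel)


def kmeans_colors(pixels, k, iterations=20, seed=0):
    """Cluster RGB tuples into k groups with plain Lloyd's algorithm."""
    rng = random.Random(seed)
    # Start from k distinct colors that actually occur in the image.
    centers = rng.sample(sorted(set(pixels)), k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in pixels:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        new_centers = []
        for old_center, group in zip(centers, groups):
            if group:
                # Per-channel mean of the pixels assigned to this center.
                new_centers.append(tuple(sum(c) / len(group) for c in zip(*group)))
            else:
                new_centers.append(old_center)
        if new_centers == centers:
            break  # assignments stopped changing
        centers = new_centers
    return [tuple(round(c) for c in center) for center in centers]


# Toy "image": white background plus red and blue strokes with slight noise.
pixels = (
    [(255, 255, 255)] * 50
    + [(220, 50, 50), (218, 52, 49), (222, 48, 51)] * 5
    + [(50, 100, 220), (52, 98, 221), (48, 102, 219)] * 5
)
foreground = [p for p in pixels if not is_background(p)]
layer_colors = sorted(kmeans_colors(foreground, k=2))
print(layer_colors)  # [(50, 100, 220), (220, 50, 50)]
```

Filtering the background before clustering keeps the large white area from claiming a cluster of its own.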

Antialiasing Handling

The tolerance parameter (default 30) handles color mixing:

  • Text antialiasing creates gray pixels around black text
  • Tolerance value captures these gradual color transitions
  • Each pixel gets alpha based on distance from target color
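The distance-to-alpha idea can be sketched as follows. The linear falloff is an assumption made for illustration (the text only says alpha depends on distance), and `pixel_alpha` is a hypothetical helper name, not a function taken from layer_extractor.py.

```python
import math


def pixel_alpha(pixel, target, tolerance=30):
    """Opacity (0-255) of `pixel` in the layer whose color is `target`."""
    distance = math.dist(pixel, target)  # Euclidean distance in RGB space
    if distance >= tolerance:
        return 0  # too far from the layer color: fully transparent
    # Assumed linear falloff: exact match is opaque, the tolerance edge is clear.
    return round(255 * (1 - distance / tolerance))


black = (0, 0, 0)
print(pixel_alpha((0, 0, 0), black))        # 255, exact match
print(pixel_alpha((10, 10, 10), black))     # 108, antialiased edge pixel
print(pixel_alpha((128, 128, 128), black))  # 0, mid gray is outside tolerance
```

Under this assumption a larger tolerance both admits more distant pixels and makes nearby ones more opaque.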

Output Format

Files are named after the input PDF: for diagram.pdf, diagram_layerN_RRR_GGG_BBB.png

  • Transparent PNG with only that color layer
  • RGB values in filename for reference
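A minimal sketch of how such a name could be built, assuming each channel is zero-padded to three digits as in the example output (220_050_050). `layer_filename` is a hypothetical helper, not necessarily what the script defines.

```python
from pathlib import Path


def layer_filename(pdf_path, index, rgb):
    """Build '<pdf stem>_layerN_RRR_GGG_BBB.png' for one extracted layer."""
    stem = Path(pdf_path).stem  # 'piping_diagram.pdf' -> 'piping_diagram'
    r, g, b = rgb
    return f"{stem}_layer{index}_{r:03d}_{g:03d}_{b:03d}.png"


print(layer_filename("piping_diagram.pdf", 1, (220, 50, 50)))
# piping_diagram_layer1_220_050_050.png
```

Fixed-width channel values keep the files sorting predictably and make the color easy to parse back out of the name.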

Common Usage

# Default (works for most diagrams)
python layer_extractor.py diagram.pdf

# High quality
python layer_extractor.py diagram.pdf --dpi 600

# Strict color separation (less antialiasing bleed)
python layer_extractor.py diagram.pdf -t 20

# Lenient (more antialiasing tolerance)
python layer_extractor.py diagram.pdf -t 40

# Extract top 3 layers only
python layer_extractor.py diagram.pdf -n 3

# Custom output folder
python layer_extractor.py diagram.pdf -o my_layers/

Parameters

Parameter         Default   Description
--dpi             300       PDF rendering resolution (150/300/600)
-t, --tolerance   30        Color matching tolerance (15-50 typical)
-n, --n-layers    auto      Number of layers to extract
-m, --min-pixels  100       Minimum pixels for valid layer
-o, --output      output    Output directory
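For readers who want to see how these options fit together, here is a hedged argparse sketch that mirrors the flags and defaults listed above. The real parser in layer_extractor.py may differ: the value types, the help strings, and the use of `None` to mean "auto" are assumptions.

```python
import argparse


def build_parser():
    """Command-line interface mirroring the parameter list above."""
    parser = argparse.ArgumentParser(
        prog="layer_extractor.py",
        description="Extract colored layers from a PDF into transparent PNGs.",
    )
    parser.add_argument("pdf", help="input PDF file")
    parser.add_argument("--dpi", type=int, default=300,
                        help="PDF rendering resolution")
    parser.add_argument("-t", "--tolerance", type=float, default=30.0,
                        help="color matching tolerance")
    # Assumption: None stands for "auto-detect the number of layers".
    parser.add_argument("-n", "--n-layers", type=int, default=None,
                        help="number of layers to extract (default: auto)")
    parser.add_argument("-m", "--min-pixels", type=int, default=100,
                        help="minimum pixels for a valid layer")
    parser.add_argument("-o", "--output", default="output",
                        help="output directory")
    return parser


args = build_parser().parse_args(["diagram.pdf", "-t", "20", "-n", "3"])
print(args.dpi, args.tolerance, args.n_layers, args.min_pixels, args.output)
# 300 20.0 3 100 output
```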

Tolerance Guide

The tolerance parameter is key to handling antialiasing:

  • 15-20: Very strict, clean diagrams with no antialiasing
  • 30 (default): Balanced, handles moderate antialiasing
  • 40-50: Lenient, for heavy antialiasing or compression artifacts

Example: Gray Layer with Black Text

When you have a light gray layer with black text:

  • Black text creates gray antialiasing pixels
  • These gray pixels are close to the gray layer color
  • Higher tolerance includes them in the gray layer
  • Lower tolerance might miss them

Start with the default (30) and adjust by ±10 based on the results.
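A small worked example makes the effect concrete. The layer color (200, 200, 200) and the ink coverage values are invented for illustration, and `blend` and `tolerances_capturing` are hypothetical helpers. The sketch only measures the Euclidean RGB distance between an antialiased pixel and the gray layer color and checks which tolerance settings would include it.

```python
import math

GRAY_LAYER = (200, 200, 200)  # assumed light gray layer color
BLACK_TEXT = (0, 0, 0)


def blend(background, ink, coverage):
    """Color of a pixel covered by `coverage` (0.0 to 1.0) ink over background."""
    return tuple(round(b + (i - b) * coverage) for b, i in zip(background, ink))


def tolerances_capturing(pixel, layer, candidates=(20, 30, 40)):
    """Tolerance settings under which `pixel` falls inside `layer`."""
    distance = math.dist(pixel, layer)
    return [t for t in candidates if distance < t]


for coverage in (0.05, 0.10, 0.15):
    pixel = blend(GRAY_LAYER, BLACK_TEXT, coverage)
    distance = round(math.dist(pixel, GRAY_LAYER), 1)
    print(pixel, distance, tolerances_capturing(pixel, GRAY_LAYER))
# (190, 190, 190) 17.3 [20, 30, 40]
# (180, 180, 180) 34.6 [40]
# (170, 170, 170) 52.0 []
```

With these numbers, a pixel that is 10% text ink is missed at the default tolerance of 30 but joins the gray layer at 40, which is the trade-off described above.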

Files Included

  1. layer_extractor.py - Main script
  2. requirements.txt - Dependencies (PyMuPDF, Pillow, numpy, scikit-learn)
  3. README.md - Full documentation
  4. QUICKSTART.md - Quick reference guide

Technical Notes

  • Uses PyMuPDF to render PDF at specified DPI
  • K-means clustering identifies dominant colors
  • Euclidean distance in RGB space for color matching
  • Alpha channel gradient for smooth edges
  • White detection: all RGB values ≥ 250

Example Output

Input: piping_diagram.pdf

Output:

output/
├── piping_diagram_layer1_220_050_050.png  (red piping)
├── piping_diagram_layer2_050_100_220.png  (blue electrical)
├── piping_diagram_layer3_150_150_150.png  (gray annotations)
└── piping_diagram_layer4_050_180_050.png  (green mechanical)

Each PNG has a transparent background with only that color layer visible.