# PDF Page Extraction Script

The script `extract_pages_from_csv.py` extracts the PDF pages listed in `master_signatures.csv` and saves each one as a single-page PDF.

## What It Does

**Simple page extraction - NO image detection:**
1. Reads the CSV file with filename and page number
2. Finds the PDF file in batch directories
3. Extracts the specified page
4. Saves it as a single-page PDF

**No filtering** - extracts all pages listed in the CSV regardless of content.

## Configuration

Edit these values in `extract_pages_from_csv.py`:

```python
CSV_PATH = "/Volumes/NV2/PDF-Processing/master_signatures.csv"
PDF_BASE_PATH = "/Volumes/NV2/PDF-Processing/total-pdf"
OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output"
TEST_LIMIT = 100  # Number of rows to process from CSV
```

## Usage

### Test with 100 files (current setting)
```bash
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py
```

### Process all files in CSV
Set `TEST_LIMIT` (line 16 of `extract_pages_from_csv.py`) to `None`:
```python
TEST_LIMIT = None  # Process all rows
```

Or set a specific number:
```python
TEST_LIMIT = 1000  # Process first 1000 rows
```
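
A minimal sketch of how a limit like `TEST_LIMIT` can be applied while reading rows, assuming the script caps the row count before processing. The helper name `limited_rows` is hypothetical and not taken from `extract_pages_from_csv.py`.

```python
from itertools import islice

TEST_LIMIT = 100  # None means "process every row"

def limited_rows(rows, limit=TEST_LIMIT):
    """Yield at most `limit` items from `rows`; a limit of None yields all."""
    # islice(iterable, None) imposes no upper bound, so None needs no special case.
    return islice(rows, limit)
```

Because `islice` is lazy, iteration stops at the limit instead of loading the whole CSV first.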

## Input Format

CSV file must have these columns:
- `source_folder` - Original folder name
- `source_subfolder` - Subfolder name
- `filename` - PDF filename
- `page` - Page number to extract (1-indexed)

Example:
```csv
source_folder,source_subfolder,filename,page
Ai1,01,201301_1324_AI1.pdf,3
Ai1,01,201301_2061_AI1.pdf,5
```
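
A short sketch of reading this format with the standard `csv` module, checking for the required columns up front. The function name `read_rows` and the error message are illustrative assumptions, not the script's actual code.

```python
import csv
import io

REQUIRED_COLUMNS = {"source_folder", "source_subfolder", "filename", "page"}

def read_rows(csv_file):
    """Yield one dict per CSV row, with `page` converted to an int."""
    reader = csv.DictReader(csv_file)
    # fieldnames is None for an empty file, so fall back to an empty list.
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"CSV is missing columns: {sorted(missing)}")
    for row in reader:
        row["page"] = int(row["page"])  # page numbers are 1-indexed
        yield row

sample = io.StringIO(
    "source_folder,source_subfolder,filename,page\n"
    "Ai1,01,201301_1324_AI1.pdf,3\n"
    "Ai1,01,201301_2061_AI1.pdf,5\n"
)
rows = list(read_rows(sample))
```

With a real file, pass `open(CSV_PATH, newline="", encoding="utf-8")` instead of the in-memory sample.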

## Output

### Extracted PDFs
Location: `/Volumes/NV2/PDF-Processing/signature-image-output/`

**Naming:** `{original_filename}_page{page_number}.pdf`, where `{original_filename}` is the source PDF name without its `.pdf` extension.

Examples:
- `201301_1324_AI1_page3.pdf` - Page 3 from original
- `201302_4915_AI1_page4.pdf` - Page 4 from original
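
The naming rule can be expressed in one line with `pathlib`. This is a sketch; `output_name` is a hypothetical helper name.

```python
from pathlib import Path

def output_name(filename, page):
    """Build the output filename for a given source PDF and 1-indexed page."""
    # Path.stem drops the final extension, e.g. "a.pdf" -> "a".
    return f"{Path(filename).stem}_page{page}.pdf"

name = output_name("201301_1324_AI1.pdf", 3)
```

Here `name` is `201301_1324_AI1_page3.pdf`, matching the first example above.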

### Log File
Location: `/Volumes/NV2/PDF-Processing/signature-image-output/page_extraction_log_YYYYMMDD_HHMMSS.csv`

Columns:
- `source_folder` - From CSV
- `source_subfolder` - From CSV
- `filename` - PDF filename
- `page` - Page number
- `pdf_found` - `True` if the PDF was found, otherwise `False`
- `exported` - `True` if the page was extracted, otherwise `False`
- `error_message` - Error details if any
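
A minimal sketch of writing a log with these columns using `csv.DictWriter`. The helper names `log_path` and `write_log` are assumptions for illustration; the script's own logging code may differ.

```python
import csv
from datetime import datetime
from pathlib import Path

LOG_COLUMNS = [
    "source_folder", "source_subfolder", "filename", "page",
    "pdf_found", "exported", "error_message",
]

def log_path(output_dir, now=None):
    """Return a timestamped log path such as page_extraction_log_20240102_030405.csv."""
    now = now or datetime.now()
    return Path(output_dir) / f"page_extraction_log_{now:%Y%m%d_%H%M%S}.csv"

def write_log(path, records):
    """Write one row per processed CSV entry, with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_COLUMNS)
        writer.writeheader()
        writer.writerows(records)
```

Filtering the log for rows where `exported` is `False` gives a quick list of entries to investigate.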

## How It Works

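```python
import fitz  # PyMuPDF

# 1. Find PDF in batch directories (find_pdf_file is defined elsewhere in the script)
pdf_path = find_pdf_file(filename)

# 2. Open PDF and copy the requested page into a new, empty document
doc = fitz.open(pdf_path)
output_doc = fitz.open()
# CSV page numbers are 1-indexed; PyMuPDF page numbers are 0-indexed
output_doc.insert_pdf(doc, from_page=page - 1, to_page=page - 1)

# 3. Save extracted page, then release both documents
output_doc.save(output_path)
output_doc.close()
doc.close()
```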

**Key points:**
- ✅ Simple and fast - no image analysis
- ✅ Extracts exactly what's in the CSV
- ✅ Handles missing PDFs gracefully
- ✅ Validates page numbers
- ✅ Detailed logging for troubleshooting

## Directory Structure

```
/Volumes/NV2/PDF-Processing/
├── master_signatures.csv          # Input CSV
├── total-pdf/                     # Source PDFs
│   ├── batch_01/
│   ├── batch_02/
│   └── ...
└── signature-image-output/        # Output directory
    ├── page_extraction_log_*.csv  # Processing log
    └── *_page*.pdf                # Extracted pages
```
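
Given this layout, the PDF lookup can be sketched as a scan over the `batch_*` directories. This is an assumption about how the search works, not the script's actual `find_pdf_file`; the name `find_pdf_in_batches` and the `base_path` parameter are hypothetical.

```python
from pathlib import Path

def find_pdf_in_batches(filename, base_path):
    """Return the first batch_*/filename under base_path, or None if absent."""
    base = Path(base_path)
    # sorted() makes the search order deterministic (batch_01, batch_02, ...).
    for batch_dir in sorted(base.glob("batch_*")):
        candidate = batch_dir / filename
        if candidate.is_file():
            return candidate
    return None
```

Returning `None` rather than raising lets the caller record `pdf_found = False` in the log and continue with the next row.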

## Performance

- Processing speed: ~1-2 files per second
- 100 files: ~1-2 minutes
- Full dataset (86,073 files): ~12-24 hours estimated
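
The full-dataset estimate follows directly from the throughput figures. A quick check of the arithmetic (the function name is illustrative):

```python
def estimate_hours(n_files, files_per_second):
    """Estimated wall-clock hours to process n_files at a steady rate."""
    return n_files / files_per_second / 3600

slow = estimate_hours(86_073, 1)  # about 23.9 hours at 1 file/s
fast = estimate_hours(86_073, 2)  # about 12.0 hours at 2 files/s
```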

## Error Handling

The script handles:
- ✅ PDF file not found in batch directories
- ✅ Invalid page numbers (beyond PDF page count)
- ✅ Corrupt or unreadable PDFs
- ✅ File system errors

All errors are logged in the CSV log file.
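
As an illustration of the page-number check, here is a sketch of a validator that returns a message suitable for the `error_message` column. The function name and message wording are assumptions; in the script the page count would come from the opened PDF (for example `len(doc)` in PyMuPDF).

```python
def check_page(page, page_count):
    """Return an error message for an out-of-range 1-indexed page, else None."""
    if page < 1:
        return f"Invalid page number {page}: pages are 1-indexed"
    if page > page_count:
        return f"Page {page} exceeds PDF page count ({page_count})"
    return None
```

Checking the range before calling `insert_pdf` keeps a bad CSV row from producing an empty or incorrect output file.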

## Next Steps

After extracting pages, use `extract_handwriting.py` to detect and extract handwritten regions from the extracted pages.

## Dependencies

- Python 3.9+
- PyMuPDF (imported as `fitz`), installed in the venv