# PDF Page Extraction Script

This script extracts specific PDF pages listed in `master_signatures.csv`.

## What It Does

**Simple page extraction - NO image detection:**

1. Reads the CSV file with filename and page number
2. Finds the PDF file in batch directories
3. Extracts the specified page
4. Saves it as a single-page PDF

**No filtering** - extracts all pages listed in the CSV regardless of content.

## Configuration

Edit these values in `extract_pages_from_csv.py`:

```python
CSV_PATH = "/Volumes/NV2/PDF-Processing/master_signatures.csv"
PDF_BASE_PATH = "/Volumes/NV2/PDF-Processing/total-pdf"
OUTPUT_PATH = "/Volumes/NV2/PDF-Processing/signature-image-output"
TEST_LIMIT = 100  # Number of rows to process from CSV
```

## Usage

### Test with 100 files (current setting)

```bash
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py
```

### Process all files in CSV

Edit line 16 in `extract_pages_from_csv.py`:

```python
TEST_LIMIT = None  # Process all rows
```

Or set a specific number:

```python
TEST_LIMIT = 1000  # Process first 1000 rows
```

## Input Format

CSV file must have these columns:

- `source_folder` - Original folder name
- `source_subfolder` - Subfolder name
- `filename` - PDF filename
- `page` - Page number to extract (1-indexed)

Example:

```csv
source_folder,source_subfolder,filename,page
Ai1,01,201301_1324_AI1.pdf,3
Ai1,01,201301_2061_AI1.pdf,5
```

## Output

### Extracted PDFs

Location: `/Volumes/NV2/PDF-Processing/signature-image-output/`

**Naming:** `{original_filename}_page{page_number}.pdf`

Examples:

- `201301_1324_AI1_page3.pdf` - Page 3 from original
- `201302_4915_AI1_page4.pdf` - Page 4 from original

### Log File

Location: `/Volumes/NV2/PDF-Processing/signature-image-output/page_extraction_log_YYYYMMDD_HHMMSS.csv`

Columns:

- `source_folder` - From CSV
- `source_subfolder` - From CSV
- `filename` - PDF filename
- `page` - Page number
- `pdf_found` - True/False if PDF was found
- `exported` - True/False if page was extracted
- `error_message` - Error details if any

## How It Works

```python
# 1. Find PDF in batch directories
pdf_path = find_pdf_file(filename)

# 2. Open PDF and extract specific page
doc = fitz.open(pdf_path)
output_doc = fitz.open()
output_doc.insert_pdf(doc, from_page=page-1, to_page=page-1)

# 3. Save extracted page
output_doc.save(output_path)
```

**Key points:**

- ✅ Simple and fast - no image analysis
- ✅ Extracts exactly what's in the CSV
- ✅ Handles missing PDFs gracefully
- ✅ Validates page numbers
- ✅ Detailed logging for troubleshooting

## Directory Structure

```
/Volumes/NV2/PDF-Processing/
├── master_signatures.csv          # Input CSV
├── total-pdf/                     # Source PDFs
│   ├── batch_01/
│   ├── batch_02/
│   └── ...
└── signature-image-output/        # Output directory
    ├── page_extraction_log_*.csv  # Processing log
    └── *_page*.pdf                # Extracted pages
```

## Performance

- Processing speed: ~1-2 files per second
- 100 files: ~1-2 minutes
- Full dataset (86,073 files): ~12-24 hours estimated

## Error Handling

The script handles:

- ✅ PDF file not found in batch directories
- ✅ Invalid page numbers (beyond PDF page count)
- ✅ Corrupt or unreadable PDFs
- ✅ File system errors

All errors are logged in the CSV log file.

## Next Steps

After extracting pages, use `extract_handwriting.py` to detect and extract handwritten regions from the extracted pages.

## Dependencies

- Python 3.9+
- PyMuPDF (fitz)
- Installed in venv
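
## Appendix: Lookup and Naming Sketch

The CSV reading, batch-directory lookup, and output naming described above can be sketched with the standard library alone. This is a minimal illustration, not the code in `extract_pages_from_csv.py`: the `base` parameter on `find_pdf_file`, the helper names `read_rows` and `output_name`, and the assumption that PDFs sit directly inside each batch directory (no recursive search) are all hypothetical.

```python
import csv
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, Optional


def read_rows(csv_lines: Iterable[str], limit: Optional[int] = None) -> Iterator[dict]:
    """Yield CSV rows as dicts; stop after `limit` rows (None means all rows).

    Mirrors the TEST_LIMIT setting: 100 for a test run, None for the full CSV.
    """
    return islice(csv.DictReader(csv_lines), limit)


def find_pdf_file(base: Path, filename: str) -> Optional[Path]:
    """Return the first `filename` found in a direct subdirectory of `base`.

    Assumption: PDFs live one level down (batch_01/, batch_02/, ...).
    Returns None when no batch directory contains the file.
    """
    for batch_dir in sorted(p for p in base.iterdir() if p.is_dir()):
        candidate = batch_dir / filename
        if candidate.is_file():
            return candidate
    return None


def output_name(filename: str, page: int) -> str:
    """Build `{original_filename}_page{page_number}.pdf` for an extracted page."""
    return f"{Path(filename).stem}_page{page}.pdf"
```

With these helpers, a row such as `Ai1,01,201301_1324_AI1.pdf,3` maps to the output name `201301_1324_AI1_page3.pdf`, and a `None` result from `find_pdf_file` corresponds to `pdf_found = False` in the log.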