function process_multi_page_pdf

Maturity: 58

A convenience wrapper function that processes multi-page PDF files and extracts analysis data from each page along with document metadata.

File: /tf/active/vicechatdev/e-ink-llm/multi_page_processor.py
Lines: 374 - 386
Complexity: simple

Purpose

This function provides a simplified interface for processing PDF documents with multiple pages. It instantiates a MultiPagePDFProcessor, extracts content and analysis from all pages (up to a specified maximum), and returns structured data about each page along with overall document metadata. It's designed for use cases requiring automated PDF content extraction, document analysis pipelines, or batch processing of PDF files.

Source Code

def process_multi_page_pdf(pdf_path: str, max_pages: int = 50) -> Tuple[List[PageAnalysis], Dict[str, Any]]:
    """
    Convenience function to process multi-page PDF
    
    Args:
        pdf_path: Path to PDF file
        max_pages: Maximum pages to process
        
    Returns:
        Tuple of (page analyses, document metadata)
    """
    processor = MultiPagePDFProcessor(max_pages=max_pages)
    return processor.extract_all_pages(Path(pdf_path))

Parameters

Name       Type  Default  Kind
pdf_path   str   -        positional_or_keyword
max_pages  int   50       positional_or_keyword

Parameter Details

pdf_path: Path to the PDF file to process, given as a string. It may be absolute or relative. The function wraps it in pathlib.Path before passing it to MultiPagePDFProcessor.extract_all_pages. The file must exist and be a valid PDF.

max_pages: Integer specifying the maximum number of pages to process from the PDF. Defaults to 50. This parameter helps control processing time and resource usage for large documents. If the PDF has fewer pages than max_pages, all pages will be processed.
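The parameter requirements above can be checked before the call. The following is a minimal sketch using only the standard library. The helper name check_pdf_args is hypothetical and not part of multi_page_processor.py, and the source does not show whether the processor itself rejects non-positive max_pages values, so that check is an assumption.

```python
from pathlib import Path


def check_pdf_args(pdf_path: str, max_pages: int = 50) -> Path:
    """Hypothetical pre-flight check; not part of multi_page_processor.py."""
    # Assumption: a non-positive page limit is never what the caller wants.
    if max_pages < 1:
        raise ValueError(f"max_pages must be at least 1, got {max_pages}")

    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")

    # Cheap heuristic only: a PDF normally starts with the bytes "%PDF-".
    # Passing this check does not guarantee the file can be parsed.
    with path.open("rb") as handle:
        if handle.read(5) != b"%PDF-":
            raise ValueError(f"File does not look like a PDF: {path}")

    return path
```

A caller could run check_pdf_args(pdf_path, max_pages) first and then pass the same arguments to process_multi_page_pdf.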

Return Value

Type: Tuple[List[PageAnalysis], Dict[str, Any]]

Returns a two-element tuple, passed through unchanged from MultiPagePDFProcessor.extract_all_pages: (1) a list of PageAnalysis objects, each holding the analysis data for a single page; (2) a dictionary of document-level metadata. The fields of PageAnalysis and the keys of the metadata dictionary are defined by the MultiPagePDFProcessor implementation and are not shown here. They may include items such as extracted text, images, layout information, total page count, file information, and processing statistics.
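Because the exact structure of the result is not documented here, it can help to inspect it at runtime rather than assume field names. The sketch below summarizes the shape of a (page analyses, metadata) pair. The helpers field_names and describe_result are hypothetical, and FakePage is a stand-in type used only for demonstration; the real PageAnalysis may have different fields.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, Dict, List


def field_names(obj: Any) -> List[str]:
    """List attribute names of a dataclass instance or a plain object."""
    if is_dataclass(obj) and not isinstance(obj, type):
        return [f.name for f in fields(obj)]
    return sorted(vars(obj)) if hasattr(obj, "__dict__") else []


def describe_result(pages: List[Any], metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize the shape of a result without assuming any field names."""
    return {
        "page_count": len(pages),
        "page_fields": field_names(pages[0]) if pages else [],
        "metadata_keys": sorted(metadata),
    }


# Stand-in type for demonstration only; the real PageAnalysis may differ.
@dataclass
class FakePage:
    page_number: int
    text: str


summary = describe_result(
    [FakePage(1, "first"), FakePage(2, "second")],
    {"total_pages": 2, "title": "demo"},
)
print(summary)
```

With real data, the call would be describe_result(*process_multi_page_pdf('document.pdf')).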

Dependencies

  • PyMuPDF (imported as fitz)
  • Pillow
  • pathlib
  • typing
  • dataclasses
  • logging
  • base64
  • io
  • sys

Required Imports

from pathlib import Path
from typing import List, Dict, Any, Tuple

Usage Example

from pathlib import Path

# Assumes multi_page_processor.py is importable (e.g. its directory is on sys.path)
from multi_page_processor import process_multi_page_pdf

# Process a PDF with default settings (max 50 pages)
page_analyses, metadata = process_multi_page_pdf('document.pdf')

# Access results ('total_pages' is an assumed metadata key; .get() returns None if it is absent)
print(f"Processed {len(page_analyses)} pages")
print(f"Total pages in document: {metadata.get('total_pages')}")

# Iterate through page analyses
for i, page_analysis in enumerate(page_analyses):
    print(f"Page {i+1}: {page_analysis}")

# Process with custom max_pages limit
page_analyses, metadata = process_multi_page_pdf(
    pdf_path='/path/to/large_document.pdf',
    max_pages=10
)

# Using a Path object (converted to str to match the annotated parameter type)
pdf_file = Path('reports/annual_report.pdf')
page_analyses, metadata = process_multi_page_pdf(str(pdf_file), max_pages=100)

Best Practices

  • Ensure the PDF file exists and is readable before calling this function to avoid file not found errors
  • Set an appropriate max_pages value based on your memory constraints and processing requirements
  • Handle potential exceptions from PDF processing (corrupted files, permission issues, etc.)
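  • For very large PDFs, note that this wrapper exposes only max_pages, which caps how many pages are processed; it has no start-page or page-range parameter, so repeated calls cannot cover different ranges of the same file. To work in batches, split the document into smaller PDFs first, or check whether MultiPagePDFProcessor offers range support when used directly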
  • The function returns all data in memory, so be cautious with very large documents
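  • The wrapper creates a new MultiPagePDFProcessor on every call and passes it only max_pages; instantiate MultiPagePDFProcessor directly if you need other configuration or want to reuse one instance across several files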
  • Consider logging or error handling around this function call in production environments
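  • The pdf_path parameter is annotated as str, but the function wraps it in Path(), so a pathlib.Path object also works at runtime; convert with str() if you want to satisfy a type checker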
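Several of the points above (checking that the file exists, handling exceptions, logging) can be combined in one small guard. This is a minimal sketch, not part of the original module: safe_process is a hypothetical helper, and it takes the processing function as an argument so it can be exercised without PyMuPDF installed. The exception types raised by the processor are not documented in the source, so the sketch catches broadly.

```python
import logging
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional, Tuple

logger = logging.getLogger(__name__)

Result = Tuple[List[Any], Dict[str, Any]]


def safe_process(
    process_fn: Callable[..., Result],
    pdf_path: str,
    max_pages: int = 50,
) -> Optional[Result]:
    """Call process_fn(pdf_path, max_pages=...) and return None on failure."""
    path = Path(pdf_path)
    if not path.is_file():
        logger.error("PDF not found: %s", path)
        return None
    try:
        pages, metadata = process_fn(str(path), max_pages=max_pages)
    except Exception:
        # The processor's exception types are not documented, so this
        # sketch catches broadly and logs the full traceback.
        logger.exception("Failed to process %s", path)
        return None
    logger.info("Processed %d page(s) from %s", len(pages), path)
    return pages, metadata
```

In use, this would look like result = safe_process(process_multi_page_pdf, 'document.pdf', max_pages=10), followed by a check for None before unpacking.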

Similar Components

AI-powered semantic similarity - components with related functionality:

  • class MultiPagePDFProcessor 68.7% similar

    A class for processing multi-page PDF documents with context-aware analysis, OCR, and summarization capabilities.

    From: /tf/active/vicechatdev/e-ink-llm/multi_page_processor.py
  • function process_single_file 58.3% similar

    Asynchronously processes a single file (likely PDF) through an LLM pipeline, generating a response PDF with optional conversation continuity, multi-page support, and editing workflow capabilities.

    From: /tf/active/vicechatdev/e-ink-llm/processor.py
  • class MultiPageAnalysisResult 57.2% similar

    A dataclass that encapsulates the complete results of analyzing a multi-page document, including individual page analyses, document summary, combined response, and processing statistics.

    From: /tf/active/vicechatdev/e-ink-llm/multi_page_llm_handler.py
  • class MultiPageLLMHandler 55.4% similar

    Handles LLM processing for multi-page documents with context awareness, automatically selecting optimal analysis strategies based on document size.

    From: /tf/active/vicechatdev/e-ink-llm/multi_page_llm_handler.py
  • function test_enhanced_pdf_processing 54.7% similar

    A comprehensive test function that validates PDF processing capabilities, including text extraction, cleaning, chunking, and table detection across multiple PDF processing libraries.

    From: /tf/active/vicechatdev/vice_ai/test_enhanced_pdf.py