process_multi_page_pdf - Code Extractor

function process_multi_page_pdf

Maturity: 58

A convenience wrapper function that processes multi-page PDF files and extracts analysis data from each page along with document metadata.

File:
/tf/active/vicechatdev/e-ink-llm/multi_page_processor.py

Lines:
374 - 386

Complexity:
simple

Purpose

This function provides a simplified interface for processing PDF documents with multiple pages. It instantiates a MultiPagePDFProcessor, extracts content and analysis from all pages (up to a specified maximum), and returns structured data about each page along with overall document metadata. It's designed for use cases requiring automated PDF content extraction, document analysis pipelines, or batch processing of PDF files.
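The batch-processing use case mentioned above could be sketched as follows. This is a minimal illustration, not part of the documented module: process_pdf_directory and its process_fn argument are hypothetical names, and process_fn is assumed to be any callable with the same shape as process_multi_page_pdf.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List, Tuple

# Result shape of process_multi_page_pdf: (page analyses, document metadata).
PdfResult = Tuple[List[Any], Dict[str, Any]]


def process_pdf_directory(
    directory: str,
    process_fn: Callable[..., PdfResult],
    max_pages: int = 50,
) -> Dict[str, PdfResult]:
    """Run process_fn on every *.pdf file directly inside `directory`.

    process_fn is a stand-in (assumption) for process_multi_page_pdf; it is
    passed in so this sketch does not depend on the real module.
    """
    results: Dict[str, PdfResult] = {}
    # sorted() gives a deterministic processing order across platforms.
    for pdf_file in sorted(Path(directory).glob("*.pdf")):
        results[pdf_file.name] = process_fn(str(pdf_file), max_pages=max_pages)
    return results
```

In real use, process_multi_page_pdf itself would be passed as process_fn. Because every result is held in memory, a large directory may call for writing each result out as it is produced instead of collecting them all.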

Source Code

def process_multi_page_pdf(pdf_path: str, max_pages: int = 50) -> Tuple[List[PageAnalysis], Dict[str, Any]]:
    """
    Convenience function to process multi-page PDF
    
    Args:
        pdf_path: Path to PDF file
        max_pages: Maximum pages to process
        
    Returns:
        Tuple of (page analyses, document metadata)
    """
    processor = MultiPagePDFProcessor(max_pages=max_pages)
    return processor.extract_all_pages(Path(pdf_path))

Parameters

Name	Type	Default	Kind
`pdf_path`	str	-	positional_or_keyword
`max_pages`	int	50	positional_or_keyword

Parameter Details

pdf_path: String path to the PDF file to process, absolute or relative. The file must exist and be a valid PDF. The function wraps this value in pathlib.Path before passing it to the processor.

max_pages: Integer specifying the maximum number of pages to process from the PDF. Defaults to 50. This parameter helps control processing time and resource usage for large documents. If the PDF has fewer pages than max_pages, all pages will be processed.

Return Value

Type: Tuple[List[PageAnalysis], Dict[str, Any]]

Returns a two-element tuple: (1) A list of PageAnalysis objects, one per processed page, holding page-specific data such as extracted text, images, and layout information. (2) A dictionary of document-level metadata such as total page count, file information, and processing statistics. The source shown here does not define the PageAnalysis fields or the metadata keys; both are determined by MultiPagePDFProcessor.extract_all_pages, so check that implementation before relying on a specific field.
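Since max_pages can cut processing short, a caller may want to know whether the returned list covers the whole document. The sketch below is one hedged way to check this. The helper name was_truncated is hypothetical, and the 'total_pages' metadata key is an assumption taken from this page's usage example, not a confirmed part of the processor's output.

```python
from typing import Any, Dict, List, Optional


def was_truncated(page_analyses: List[Any], metadata: Dict[str, Any]) -> Optional[bool]:
    """Report whether fewer pages were analysed than the document contains.

    Returns None when the metadata has no usable page count, because the
    'total_pages' key is an assumption about the metadata layout.
    """
    total = metadata.get("total_pages")  # assumed key name
    if not isinstance(total, int):
        return None
    return len(page_analyses) < total
```

Returning None instead of False keeps "no information" distinct from "not truncated", which matters if the real metadata uses a different key.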

Dependencies

PyMuPDF (imported as fitz)
Pillow
pathlib
typing
dataclasses
logging
base64
io
sys

Required Imports

from pathlib import Path
from typing import List, Dict, Any, Tuple

Usage Example

# Assumes the directory containing multi_page_processor.py is on the import path
from pathlib import Path
from multi_page_processor import process_multi_page_pdf

# Process a PDF with default settings (max 50 pages)
page_analyses, metadata = process_multi_page_pdf('document.pdf')

# Access results ('total_pages' is an assumed key; check the actual metadata)
print(f"Processed {len(page_analyses)} pages")
print(f"Total pages in document: {metadata.get('total_pages')}")

# Iterate through page analyses, numbering pages from 1
for i, page_analysis in enumerate(page_analyses, start=1):
    print(f"Page {i}: {page_analysis}")

# Process with custom max_pages limit
page_analyses, metadata = process_multi_page_pdf(
    pdf_path='/path/to/large_document.pdf',
    max_pages=10
)

# Using a Path object (converted to str to match the type hint)
pdf_file = Path('reports/annual_report.pdf')
page_analyses, metadata = process_multi_page_pdf(str(pdf_file), max_pages=100)

Best Practices

Ensure the PDF file exists and is readable before calling this function to avoid file-not-found errors.
Set max_pages according to your memory constraints and processing requirements.
Handle exceptions raised during PDF processing (corrupted files, permission issues, etc.).
This wrapper exposes only max_pages, not a starting page, so it cannot process separate page ranges. For very large PDFs, split the file beforehand or check whether MultiPagePDFProcessor offers finer control.
The function returns all data in memory, so be cautious with very large documents.
The wrapper creates its own MultiPagePDFProcessor and passes only max_pages. Use the class directly if you need any other configuration.
Add logging and error handling around this call in production environments.
pdf_path is annotated as str, but the function wraps it in Path(), so a Path object also works at runtime. Convert with str() if you want to match the type hint.
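The existence check, exception handling, and logging recommended above can be combined in a small guard. This is a sketch under stated assumptions: safe_process_pdf is a hypothetical helper, process_fn stands in for process_multi_page_pdf, and the broad except clause is used only because this page does not document which exceptions the processor raises.

```python
import logging
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional, Tuple

logger = logging.getLogger(__name__)

PdfResult = Tuple[List[Any], Dict[str, Any]]


def safe_process_pdf(
    pdf_path: str,
    process_fn: Callable[..., PdfResult],
    max_pages: int = 50,
) -> Optional[PdfResult]:
    """Call process_fn defensively; return None instead of raising.

    process_fn is a stand-in (assumption) for process_multi_page_pdf.
    """
    path = Path(pdf_path)
    # Fail early with a clear message rather than inside the PDF library.
    if not path.is_file():
        logger.error("PDF not found or not a regular file: %s", path)
        return None
    try:
        return process_fn(str(path), max_pages=max_pages)
    except Exception:
        # Broad catch: the concrete exception types are not documented here.
        logger.exception("Failed to process PDF: %s", path)
        return None
```

Once the exceptions raised by the underlying library are known, narrowing the except clause to those types would avoid hiding unrelated bugs.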

Similar Components

AI-powered semantic similarity - components with related functionality:

class MultiPagePDFProcessor 68.7% similar

A class for processing multi-page PDF documents with context-aware analysis, OCR, and summarization capabilities.
From: /tf/active/vicechatdev/e-ink-llm/multi_page_processor.py

function process_single_file 58.3% similar

Asynchronously processes a single file (likely PDF) through an LLM pipeline, generating a response PDF with optional conversation continuity, multi-page support, and editing workflow capabilities.
From: /tf/active/vicechatdev/e-ink-llm/processor.py

class MultiPageAnalysisResult 57.2% similar

A dataclass that encapsulates the complete results of analyzing a multi-page document, including individual page analyses, document summary, combined response, and processing statistics.
From: /tf/active/vicechatdev/e-ink-llm/multi_page_llm_handler.py

class MultiPageLLMHandler 55.4% similar

Handles LLM processing for multi-page documents with context awareness, automatically selecting optimal analysis strategies based on document size.
From: /tf/active/vicechatdev/e-ink-llm/multi_page_llm_handler.py

function test_enhanced_pdf_processing 54.7% similar

A comprehensive test function that validates PDF processing capabilities, including text extraction, cleaning, chunking, and table detection across multiple PDF processing libraries.
From: /tf/active/vicechatdev/vice_ai/test_enhanced_pdf.py
