
function merge_pdfs

Maturity: 49

Merges multiple PDF files into a single output PDF file with robust error handling and fallback mechanisms.

File: /tf/active/vicechatdev/msg_to_eml.py
Lines: 412 - 474
Complexity: moderate

Purpose

This function combines multiple PDF files into one consolidated PDF document. It filters out input paths that do not exist or that point to empty files, then merges the remaining files with PyMuPDF (fitz). PyPDF2 is used as a fallback only when PyMuPDF cannot be imported. If exactly one valid file remains, it is copied to the output path instead of being merged. An individual PDF that fails to merge is logged and skipped, and processing continues with the remaining files.
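The input filtering step can be sketched in isolation. This is a minimal illustration that mirrors the existence and size check in the source below; the helper name filter_valid_paths is hypothetical and is not part of the original module.

```python
import os
import tempfile


def filter_valid_paths(input_paths):
    """Keep paths that exist and are non-empty, preserving input order.

    Hypothetical helper: mirrors the list comprehension used in merge_pdfs.
    """
    return [p for p in input_paths if os.path.exists(p) and os.path.getsize(p) > 0]


# Demo with throwaway files: one with content, one empty, one missing.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.pdf")
    empty = os.path.join(tmp, "empty.pdf")
    missing = os.path.join(tmp, "missing.pdf")
    with open(good, "wb") as fh:
        fh.write(b"%PDF-1.4\n")
    open(empty, "wb").close()

    kept = filter_valid_paths([good, empty, missing])
    print([os.path.basename(p) for p in kept])  # only good.pdf survives
```

Note that this check looks only at the file system; it does not inspect file contents.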

Source Code

def merge_pdfs(input_paths, output_path):
    """Merge multiple PDF files with better error handling"""
    try:
        # Filter out non-existent files
        valid_paths = [path for path in input_paths if os.path.exists(path) and os.path.getsize(path) > 0]
        
        if not valid_paths:
            logger.error("No valid PDF files to merge")
            return None
            
        if len(valid_paths) == 1:
            # Just copy the single file
            shutil.copy2(valid_paths[0], output_path)
            return output_path
        
        # Try PyMuPDF first, as it's commonly used and more robust
        try:
            import fitz
            
            # Create output PDF
            output_pdf = fitz.open()
            
            # Add each input PDF
            for input_path in valid_paths:
                try:
                    pdf = fitz.open(input_path)
                    output_pdf.insert_pdf(pdf)
                except Exception as e:
                    logger.warning(f"Problem with PDF {input_path}: {str(e)}")
                    continue
            
            # Save merged PDF
            output_pdf.save(output_path)
            output_pdf.close()
            
            return output_path
            
        except ImportError:
            # Fall back to using PyPDF2
            try:
                from PyPDF2 import PdfMerger
                
                merger = PdfMerger()
                
                for input_path in valid_paths:
                    try:
                        merger.append(input_path)
                    except Exception as e:
                        logger.warning(f"Problem with PDF {input_path}: {str(e)}")
                        continue
                
                merger.write(output_path)
                merger.close()
                
                return output_path
            except ImportError:
                logger.error("No PDF merging library available. Install PyMuPDF or PyPDF2.")
                return None
            
    except Exception as e:
        logger.error(f"Error merging PDFs: {str(e)}")
        logger.error(traceback.format_exc())
        return None

Parameters

Name         Type  Default  Kind
input_paths  -     -        positional_or_keyword
output_path  -     -        positional_or_keyword

Parameter Details

input_paths: A list or iterable of file paths (strings) pointing to PDF files to be merged. The function will filter out non-existent files and empty files (size 0 bytes) automatically. Order in the list determines the order in the merged output.

output_path: A string representing the file path where the merged PDF should be saved. Should include the filename and .pdf extension. The parent directory must already exist and be writable; the function does not create it.
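Since the function does not create the output directory, a caller may want to create it first. The following is a minimal sketch of one way to do that; ensure_parent_dir is a hypothetical helper name, not something the original module provides.

```python
import os
import tempfile


def ensure_parent_dir(output_path):
    """Create the parent directory of output_path if missing; return it.

    Hypothetical helper, not part of the original module.
    """
    parent = os.path.dirname(os.path.abspath(output_path))
    os.makedirs(parent, exist_ok=True)  # no error if it already exists
    return parent


# Demo: a nested output location that does not exist yet.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "reports", "2024", "merged_output.pdf")
    parent = ensure_parent_dir(target)
    print(os.path.isdir(parent))  # True
```

Calling such a helper immediately before merge_pdfs(input_paths, output_path) avoids a failure when the merged file is saved.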

Return Value

Returns output_path (string) if the merge succeeds, or None if it fails (no valid input files, no merging library installed, or any other error). A returned path means the output file was written; it does not guarantee that every input was included, because inputs that fail to merge are skipped with a logged warning.

Dependencies

  • os
  • shutil
  • traceback
  • fitz (PyMuPDF)
  • PyPDF2
  • logging

Required Imports

import os
import shutil
import traceback
import logging

Conditional/Optional Imports

These imports are only needed under specific conditions:

import fitz

Condition: Optional. Primary PDF merging library (PyMuPDF). Used first if available. Install with: pip install PyMuPDF

from PyPDF2 import PdfMerger

Condition: Optional. Fallback PDF merging library. Used only if PyMuPDF (fitz) is not available. Install with: pip install PyPDF2
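To see in advance which library the function is likely to try, the same preference order can be probed without importing either package. This is a sketch under stated assumptions: available_pdf_backend is a hypothetical helper, and finding a module this way does not guarantee that importing it (or importing PdfMerger from it) will succeed.

```python
import importlib.util


def available_pdf_backend(candidates=("fitz", "PyPDF2")):
    """Return the first importable module name from candidates, or None.

    Hypothetical helper: the default order mirrors merge_pdfs, which tries
    PyMuPDF (fitz) first and PyPDF2 only if fitz cannot be imported.
    """
    for module_name in candidates:
        # find_spec locates the module without executing it.
        if importlib.util.find_spec(module_name) is not None:
            return module_name
    return None


backend = available_pdf_backend()
if backend is None:
    print("No PDF merging library found; merge_pdfs would return None.")
else:
    print(f"merge_pdfs would try: {backend}")
```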

Usage Example

import os
import shutil
import traceback
import logging

# Setup logger
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))
logger.addHandler(handler)

# Install at least one library:
# pip install PyMuPDF  # or pip install PyPDF2

def merge_pdfs(input_paths, output_path):
    # ... (function code here)
    pass

# Example usage
input_files = ['document1.pdf', 'document2.pdf', 'document3.pdf']
output_file = 'merged_output.pdf'

result = merge_pdfs(input_files, output_file)

if result:
    print(f'Successfully merged PDFs to: {result}')
else:
    print('Failed to merge PDFs')

Best Practices

  • Ensure at least one PDF merging library (PyMuPDF or PyPDF2) is installed before calling this function
  • Always check the return value - None indicates failure, a path string indicates success
  • The function logs warnings for individual PDF files that fail to merge but continues processing remaining files
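  • Input files are checked only for existence and non-zero size - paths that fail these checks are filtered out, but file contents are not verified to be valid PDF
  • When only one valid input remains, the function copies it with shutil.copy2 rather than merging, so that file is not opened by either PDF library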
  • PyMuPDF (fitz) is preferred over PyPDF2 for better robustness and performance
  • Ensure the logger object is properly configured in the calling scope
  • The output directory must exist before calling this function
  • The function catches exceptions internally, logs them with a traceback, and returns None, so the log output is where the cause of a failure appears
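Because the function's own input check covers only existence and file size, a caller that wants a cheap content check could look for the PDF header marker before merging. The sketch below is a heuristic only, with a hypothetical helper name (looks_like_pdf) and an assumed probe window of 1024 bytes; a file that passes can still be corrupt.

```python
import os
import tempfile


def looks_like_pdf(path, probe_bytes=1024):
    """Heuristic: True if the "%PDF-" marker appears near the file start.

    Hypothetical helper, not part of the original module. Unreadable or
    missing files are reported as False rather than raising.
    """
    try:
        with open(path, "rb") as fh:
            return b"%PDF-" in fh.read(probe_bytes)
    except OSError:
        return False


# Demo: one file with a PDF header, one plain-text file.
with tempfile.TemporaryDirectory() as tmp:
    pdf_like = os.path.join(tmp, "a.pdf")
    not_pdf = os.path.join(tmp, "b.pdf")
    with open(pdf_like, "wb") as fh:
        fh.write(b"%PDF-1.7\n% rest of file omitted\n")
    with open(not_pdf, "wb") as fh:
        fh.write(b"just some text\n")

    candidates = [pdf_like, not_pdf]
    print([os.path.basename(p) for p in candidates if looks_like_pdf(p)])  # a.pdf only
```

Filtering input_paths this way before calling merge_pdfs is most relevant to the single-file case, where the file is copied without being opened by a PDF library.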

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function eml_to_pdf 50.0% similar

    Converts an .eml email file to PDF format, including the email body and all attachments merged into a single PDF document.

    From: /tf/active/vicechatdev/msg_to_eml.py
  • function msg_to_pdf_improved 44.1% similar

    Converts a Microsoft Outlook .msg file to PDF format using EML as an intermediate format for improved reliability, with fallback to direct conversion if needed.

    From: /tf/active/vicechatdev/msg_to_eml.py
  • function test_mixed_previous_reports 43.7% similar

    A test function that validates the DocumentExtractor's ability to extract text content from multiple file formats (text and markdown) and combine them into a unified previous reports summary.

    From: /tf/active/vicechatdev/leexi/test_enhanced_reports.py
  • function test_multiple_files 43.5% similar

    A test function that validates the extraction of text content from multiple document files using a DocumentExtractor instance, displaying extraction results and simulating combined content processing.

    From: /tf/active/vicechatdev/leexi/test_multiple_files.py
  • function merge_word_documents 43.1% similar

    Merges track changes and comments from a revision Word document into a base Word document, creating a combined output document.

    From: /tf/active/vicechatdev/word_merge.py