function compare_pdf_content
Compares the textual content similarity between two PDF files by extracting text samples and computing a similarity ratio using sequence matching.
/tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
150 - 158
moderate
Purpose
This function estimates how similar two PDF documents are based on their textual content. It extracts a text sample from each PDF (the first few pages, according to the docstring) and uses `difflib.SequenceMatcher` to compute a similarity ratio between 0.0 (no matching text) and 1.0 (identical sampled text). This is useful for detecting duplicate documents, finding similar PDFs, or identifying versions of the same document.
Source Code
def compare_pdf_content(file1: str, file2: str) -> float:
    """Compare PDF content similarity (first few pages)"""
    text1 = extract_text_from_pdf_sample(file1)
    text2 = extract_text_from_pdf_sample(file2)
    if not text1 or not text2:
        return 0.0
    return SequenceMatcher(None, text1, text2).ratio()
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| file1 | str | - | positional_or_keyword |
| file2 | str | - | positional_or_keyword |
Parameter Details
file1: Path to the first PDF file as a string. Should be a valid file path pointing to an existing PDF document. Can be absolute or relative path.
file2: Path to the second PDF file as a string. Should be a valid file path pointing to an existing PDF document. Can be absolute or relative path.
Return Value
Type: float
Returns a float between 0.0 and 1.0: the `SequenceMatcher.ratio()` of the two extracted text samples. A value of 1.0 means the sampled text is identical, and values near 0.0 mean the samples share almost no matching character runs. The function also returns 0.0 when extraction yields no text for either file, so a 0.0 result alone does not distinguish "no similarity" from "extraction failed".
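To make the ratio easier to interpret, the sketch below recomputes it by hand. `SequenceMatcher.ratio()` is documented as `2.0 * M / T`, where `M` is the number of matched characters and `T` is the combined length of both strings. The helper name `explain_ratio` is hypothetical and is not part of the documented module.

```python
from difflib import SequenceMatcher
from typing import Tuple


def explain_ratio(a: str, b: str) -> Tuple[int, int, float]:
    """Return (matched characters M, total characters T, 2*M/T).

    Hypothetical helper for illustration only.
    """
    matcher = SequenceMatcher(None, a, b)
    # get_matching_blocks() ends with a zero-size sentinel, which adds nothing.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    # Two empty strings are treated as identical, mirroring difflib.
    ratio = 2.0 * matched / total if total else 1.0
    return matched, total, ratio


matched, total, ratio = explain_ratio("abcd", "bcde")
print(matched, total, ratio)  # 3 8 0.75
```

Because the denominator is the combined length, a short document compared against a much longer one scores low even when the short text appears in full inside the long one.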
Dependencies
- PyPDF2
- difflib
Required Imports
from difflib import SequenceMatcher
Usage Example
from difflib import SequenceMatcher

# Assuming extract_text_from_pdf_sample is defined
def extract_text_from_pdf_sample(file_path: str) -> str:
    # Implementation to extract text from PDF
    pass

# Compare two PDF files
file1_path = 'document1.pdf'
file2_path = 'document2.pdf'
similarity_score = compare_pdf_content(file1_path, file2_path)

if similarity_score > 0.9:
    print(f'Documents are highly similar: {similarity_score:.2%}')
elif similarity_score > 0.5:
    print(f'Documents are moderately similar: {similarity_score:.2%}')
else:
    print(f'Documents are different: {similarity_score:.2%}')
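The same threshold idea extends to a batch of documents. The following is a minimal sketch that assumes the text of each PDF has already been extracted into a dictionary keyed by file name. The function `find_similar_pairs` and its `threshold` parameter are hypothetical and are not part of the documented module.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple


def find_similar_pairs(
    texts: Dict[str, str], threshold: float = 0.9
) -> List[Tuple[str, str, float]]:
    """Return (name1, name2, ratio) for every pair at or above the threshold.

    Hypothetical helper: `texts` maps a file name to its extracted text.
    """
    pairs = []
    # Sort the names so the output order is deterministic.
    for name1, name2 in combinations(sorted(texts), 2):
        text1, text2 = texts[name1], texts[name2]
        if not text1 or not text2:
            continue  # Skip documents with no extracted text.
        ratio = SequenceMatcher(None, text1, text2).ratio()
        if ratio >= threshold:
            pairs.append((name1, name2, ratio))
    return pairs


sample_texts = {
    "a.pdf": "the quick brown fox",
    "b.pdf": "the quick brown fox",
    "c.pdf": "zzzz",
}
print(find_similar_pairs(sample_texts))
```

Comparing every pair grows quadratically with the number of documents, so this approach is only practical for modest collections.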
Best Practices
- Ensure both PDF file paths are valid and accessible before calling this function to avoid file not found errors.
- The function returns 0.0 if text extraction fails for either file, so check for this case if you need to distinguish between 'no similarity' and 'extraction failed'.
- This function only compares a sample of pages (not the entire document), so it may not be suitable for detecting differences in later pages of long documents.
- The SequenceMatcher algorithm is case-sensitive and whitespace-sensitive, so formatting differences may affect the similarity score.
- For large-scale PDF comparison operations, consider caching extracted text to avoid repeated PDF parsing.
- The function depends on 'extract_text_from_pdf_sample' being available in the same scope - ensure this dependency is properly defined or imported.
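Several of the points above can be sketched together: normalizing case and whitespace before comparing, returning `None` instead of 0.0 when text is missing, and caching extracted text. All three helper names below are hypothetical, and the caching wrapper assumes the extractor takes a single file path string, as `extract_text_from_pdf_sample` appears to.

```python
from difflib import SequenceMatcher
from functools import lru_cache
from typing import Callable, Optional


def normalize_text(text: str) -> str:
    """Lowercase the text and collapse every whitespace run to one space."""
    return " ".join(text.lower().split())


def compare_texts_strict(
    text1: Optional[str], text2: Optional[str]
) -> Optional[float]:
    """Return None when either text is missing or blank, else the ratio.

    Hypothetical variant that separates 'extraction failed' from
    'no similarity', which the documented function reports as 0.0 in both cases.
    """
    norm1 = normalize_text(text1 or "")
    norm2 = normalize_text(text2 or "")
    if not norm1 or not norm2:
        return None
    return SequenceMatcher(None, norm1, norm2).ratio()


def make_cached_extractor(extract: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap an extractor so each path is parsed at most once per process."""
    return lru_cache(maxsize=None)(extract)


print(compare_texts_strict("Hello   World", "hello world"))  # 1.0
print(compare_texts_strict("", "hello world"))  # None
```

The cache is keyed on the path string only, so a file that changes on disk during the run would still return the earlier text.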
Similar Components
AI-powered semantic similarity - components with related functionality:
- function compare_documents_v1 70.4% similar
- function extract_text_from_pdf_sample 67.0% similar
- function fuzzy_match_score 59.4% similar
- function test_extraction_methods 58.4% similar
- function find_best_match 56.5% similar