function compare_documents
Compares documents from an output folder with documents in a wuxi2 repository by matching document codes, file hashes, sizes, and filenames to identify identical, similar, or missing documents.
/tf/active/vicechatdev/mailsearch/compare_documents.py
208 - 321
moderate
Purpose
This function compares documents from two sources, an output folder and a wuxi2 repository, to identify matches, duplicates, and discrepancies. Documents are paired by document code first. For each code present in both sources, the function looks for an exact hash match (identical file), then an exact size match, and finally falls back to the candidate with the highest fuzzy filename similarity. Because the hash check takes precedence, a file reported as a size match has the same size as the output document but a different hash. Each result records the match type, file metadata, and a similarity score, which makes the function useful for document reconciliation, migration validation, or duplicate detection workflows.
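The source code assigns its match labels in a fixed order: hash match, then size match, then filename similarity with cut-offs at 0.7 and 0.4. The following is a minimal sketch of that precedence. The labels and thresholds are taken from the source; the helper name `classify_match` and its parameters are hypothetical and do not exist in the original module.

```python
def classify_match(hash_equal: bool, size_equal: bool, best_score: float) -> str:
    """Return a match_type label using the same precedence as compare_documents.

    Hypothetical helper for illustration only; not part of the original module.
    """
    if hash_equal:
        return 'IDENTICAL (hash match)'
    if size_equal:
        return 'SIZE MATCH (possible identical)'
    if best_score <= 0.0:
        # No candidate scored above zero, so no best match was selected.
        return 'CODE MATCH ONLY'
    if best_score > 0.7:
        return 'HIGH SIMILARITY'
    if best_score > 0.4:
        return 'MODERATE SIMILARITY'
    return 'LOW SIMILARITY'
```

Both thresholds are strict, so a score of exactly 0.7 is labelled 'MODERATE SIMILARITY' and a score of exactly 0.4 is labelled 'LOW SIMILARITY'.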
Source Code
def compare_documents(
    output_docs: Dict[str, Dict],
    wuxi2_docs: Dict[str, List[Dict]]
) -> List[Dict]:
    """
    Compare documents from output folder with wuxi2 repository

    Args:
        output_docs: Documents from output folder
        wuxi2_docs: Documents from wuxi2 repository

    Returns:
        List of comparison results
    """
    print(f"\n{'='*80}")
    print("Comparing documents...")
    print(f"{'='*80}\n")

    results = []

    for code, output_info in output_docs.items():
        result = {
            'document_code': code,
            'output_filename': output_info['filename'],
            'output_size': output_info['size'],
            'output_hash': output_info['hash'],
            'status': 'ABSENT',
            'match_type': 'N/A',
            'wuxi2_filename': '',
            'wuxi2_path': '',
            'wuxi2_size': 0,
            'wuxi2_hash': '',
            'size_match': False,
            'hash_match': False,
            'filename_similarity': 0.0,
            'notes': ''
        }

        # Check if code exists in wuxi2
        if code in wuxi2_docs:
            result['status'] = 'PRESENT'
            wuxi2_matches = wuxi2_docs[code]

            # Find best match
            best_match = None
            best_score = 0.0
            exact_hash_match = None
            exact_size_match = None

            for wuxi2_file in wuxi2_matches:
                # Check for exact hash match (identical file)
                if wuxi2_file['hash'] == output_info['hash']:
                    exact_hash_match = wuxi2_file
                    break

                # Check for exact size match
                if wuxi2_file['size'] == output_info['size'] and not exact_size_match:
                    exact_size_match = wuxi2_file

                # Calculate filename similarity
                similarity = fuzzy_match_filename(
                    output_info['filename'],
                    wuxi2_file['filename'],
                    code
                )
                if similarity > best_score:
                    best_score = similarity
                    best_match = wuxi2_file

            # Determine match type and populate result
            if exact_hash_match:
                match = exact_hash_match
                result['match_type'] = 'IDENTICAL (hash match)'
                result['hash_match'] = True
                result['size_match'] = True
            elif exact_size_match:
                match = exact_size_match
                result['match_type'] = 'SIZE MATCH (possible identical)'
                result['size_match'] = True
                result['hash_match'] = False
            elif best_match:
                match = best_match
                if best_score > 0.7:
                    result['match_type'] = 'HIGH SIMILARITY'
                elif best_score > 0.4:
                    result['match_type'] = 'MODERATE SIMILARITY'
                else:
                    result['match_type'] = 'LOW SIMILARITY'
                result['hash_match'] = False
                result['size_match'] = False
            else:
                result['match_type'] = 'CODE MATCH ONLY'
                result['notes'] = f"{len(wuxi2_matches)} file(s) with same code but no good match"

            if best_match or exact_hash_match or exact_size_match:
                match = exact_hash_match or exact_size_match or best_match
                result['wuxi2_filename'] = match['filename']
                result['wuxi2_path'] = match['relative_path']
                result['wuxi2_size'] = match['size']
                result['wuxi2_hash'] = match['hash']
                result['filename_similarity'] = best_score

            # Add notes for multiple matches
            if len(wuxi2_matches) > 1:
                result['notes'] = f"{len(wuxi2_matches)} files with code {code} in wuxi2"

        results.append(result)

        # Print progress
        status_symbol = "✓" if result['status'] == 'PRESENT' else "✗"
        print(f"{status_symbol} {code:20s} {result['match_type']:30s} {output_info['filename'][:50]}")

    return results
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| output_docs | Dict[str, Dict] | - | positional_or_keyword |
| wuxi2_docs | Dict[str, List[Dict]] | - | positional_or_keyword |
Parameter Details
output_docs: Dictionary mapping document codes (strings) to document metadata dictionaries. Each metadata dict must contain 'filename' (str), 'size' (int), and 'hash' (str) keys representing the document's filename, file size in bytes, and hash value respectively.
wuxi2_docs: Dictionary mapping document codes (strings) to lists of document metadata dictionaries. Each metadata dict must contain 'filename' (str), 'size' (int), 'hash' (str), and 'relative_path' (str) keys. Multiple documents can share the same code, hence the list structure.
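One way to produce metadata dictionaries with the required keys is to read each file once and derive the size and hash from its bytes. This is a sketch under assumptions: the helper name `describe_file`, the choice of SHA-256, and the use of `pathlib` are illustrative and are not taken from the original module, which may build these dictionaries differently.

```python
import hashlib
from pathlib import Path
from typing import Dict


def describe_file(path: Path, root: Path) -> Dict:
    """Build one metadata dict with the keys compare_documents reads.

    Hypothetical helper; the hash algorithm (SHA-256) is an assumption.
    """
    data = path.read_bytes()  # acceptable for a sketch; stream large files in chunks
    return {
        'filename': path.name,
        'size': len(data),  # size in bytes
        'hash': hashlib.sha256(data).hexdigest(),
        # Only wuxi2_docs entries need 'relative_path'; the extra key is
        # harmless in output_docs entries.
        'relative_path': path.relative_to(root).as_posix(),
    }
```

Using the same helper for both sources keeps the hash algorithm and size unit consistent, which the hash and size comparisons depend on.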
Return Value
Type: List[Dict]
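Returns a list of dictionaries, one per document in output_docs. Each result dictionary contains:
- 'document_code' (str), 'output_filename' (str), 'output_size' (int), 'output_hash' (str): the code and metadata of the output document.
- 'status' (str): 'PRESENT' if the code exists in wuxi2_docs, otherwise 'ABSENT'.
- 'match_type' (str): one of 'IDENTICAL (hash match)', 'SIZE MATCH (possible identical)', 'HIGH SIMILARITY', 'MODERATE SIMILARITY', 'LOW SIMILARITY', 'CODE MATCH ONLY', or 'N/A' (used when status is 'ABSENT').
- 'wuxi2_filename' (str), 'wuxi2_path' (str), 'wuxi2_size' (int), 'wuxi2_hash' (str): metadata of the selected wuxi2 file. These keep their defaults ('' or 0) when no file is selected.
- 'size_match' (bool), 'hash_match' (bool): whether the selected file matched on size or on hash.
- 'filename_similarity' (float 0.0-1.0): the highest filename similarity computed among the candidates examined. For hash or size matches, this score can belong to a different candidate than the selected file, because the loop stops at the first hash match and tracks the best score independently of the size match.
- 'notes' (str): additional information, such as the number of wuxi2 files sharing the code.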
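Because every result dictionary has the same flat set of keys, the list is easy to aggregate. The sketch below counts results per match type and lists the codes that lack a hash-confirmed match. The helper names `summarize` and `needs_review` are hypothetical and not part of the original module.

```python
from collections import Counter
from typing import Dict, List


def summarize(results: List[Dict]) -> Dict[str, int]:
    """Count comparison results per 'match_type' label."""
    return dict(Counter(r['match_type'] for r in results))


def needs_review(results: List[Dict]) -> List[str]:
    """Return document codes that are absent or not confirmed by hash."""
    return [r['document_code'] for r in results if not r['hash_match']]
```

Treating everything without a hash match as needing review is a conservative choice for this illustration; a real workflow might accept high-similarity matches after a manual spot check.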
Required Imports
from typing import Dict, List
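The function also calls fuzzy_match_filename, whose implementation is not shown on this page. To experiment with compare_documents in isolation, a stand-in with the same call shape (two filenames plus the document code, returning a float) can be defined. The version below is an assumption built on the standard library's `difflib.SequenceMatcher`; the real helper may normalize names and score them differently.

```python
from difflib import SequenceMatcher
from pathlib import Path


def fuzzy_match_filename(name_a: str, name_b: str, code: str) -> float:
    """Stand-in similarity score in [0.0, 1.0]; NOT the original implementation."""

    def normalize(name: str) -> str:
        # Drop the extension, lowercase, and remove the shared document code
        # so that it does not inflate the score.
        stem = Path(name).stem.lower()
        return stem.replace(code.lower(), '').strip(' _-')

    a, b = normalize(name_a), normalize(name_b)
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()
```

Scores from this stand-in are not comparable to those of the real helper, so the 0.7 and 0.4 thresholds may classify files differently.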
Usage Example
# Prepare input data
output_docs = {
    'DOC001': {
        'filename': 'report_2023.pdf',
        'size': 1024000,
        'hash': 'abc123def456'
    },
    'DOC002': {
        'filename': 'summary.docx',
        'size': 512000,
        'hash': 'xyz789ghi012'
    }
}

wuxi2_docs = {
    'DOC001': [
        {
            'filename': 'report_2023_final.pdf',
            'size': 1024000,
            'hash': 'abc123def456',
            'relative_path': 'documents/2023/report_2023_final.pdf'
        }
    ],
    'DOC003': [
        {
            'filename': 'other_doc.pdf',
            'size': 2048000,
            'hash': 'mno345pqr678',
            'relative_path': 'documents/other/other_doc.pdf'
        }
    ]
}

# Compare documents
results = compare_documents(output_docs, wuxi2_docs)

# Process results
for result in results:
    print(f"Code: {result['document_code']}, Status: {result['status']}, Match: {result['match_type']}")
    if result['hash_match']:
        print(f"  Identical file found at: {result['wuxi2_path']}")
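Since each result is a flat dictionary, the list can also be exported for review in a spreadsheet. The following is a minimal sketch using the standard `csv` module; the helper name `results_to_csv` is hypothetical, and the original module may export results in another way or not at all.

```python
import csv
import io
from typing import Dict, List


def results_to_csv(results: List[Dict]) -> str:
    """Serialize comparison results to CSV text, one row per document."""
    if not results:
        return ''
    buffer = io.StringIO()
    # All result dicts share the same keys, so the first one defines the header.
    writer = csv.DictWriter(buffer, fieldnames=list(results[0].keys()))
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()
```

Returning a string rather than writing a file keeps the sketch free of path assumptions; the caller decides where the text goes.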
Best Practices
- Ensure both input dictionaries have consistent structure with required keys ('filename', 'size', 'hash' for output_docs; additional 'relative_path' for wuxi2_docs)
- The fuzzy_match_filename function must be available in scope before calling this function
- Hash values should be computed using a consistent algorithm (e.g., MD5, SHA256) across both document sources
- File sizes should be in bytes for accurate comparison
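- Runtime scales with the total number of wuxi2 files that share a code with an output document: each candidate costs one hash comparison, one size comparison, and, unless a hash match ends the loop early, one fuzzy_match_filename call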
- The function prints progress to stdout; redirect or suppress output if running in a non-interactive environment
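- Review the 'notes' field in results for documents with multiple potential matches in wuxi2_docs. When a code has more than one wuxi2 file, 'notes' holds the file count and replaces any earlier 'CODE MATCH ONLY' note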
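The stdout recommendation above can be followed without modifying the function by capturing its output at the call site. This is a sketch using the standard library's `contextlib.redirect_stdout`; the wrapper name `run_quietly` is hypothetical, and it takes the comparison function as an argument so that it does not depend on how the original module is imported.

```python
import contextlib
import io
from typing import Callable, Dict, List, Tuple


def run_quietly(
    compare: Callable[[Dict, Dict], List[Dict]],
    output_docs: Dict,
    wuxi2_docs: Dict,
) -> Tuple[List[Dict], str]:
    """Call compare(...) while capturing everything it prints to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        results = compare(output_docs, wuxi2_docs)
    # Return the captured progress text so the caller can log or discard it.
    return results, buffer.getvalue()
```

A call such as `results, log = run_quietly(compare_documents, output_docs, wuxi2_docs)` would then leave the console untouched.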
Similar Components
AI-powered semantic similarity - components with related functionality:
- function compare_documents_v1 (77.5% similar)
- function main_v57 (67.3% similar)
- function print_summary (64.3% similar)
- function main_v102 (64.2% similar)
- function scan_wuxi2_folder (62.8% similar)