function print_summary
Prints a formatted summary report of document comparison results, including presence status, match quality statistics, and examples of absent and modified documents.
/tf/active/vicechatdev/mailsearch/compare_documents.py
355 - 409
simple
Purpose
This function prints a console summary of a document comparison between two datasets (output and wuxi2). It reports how many documents are present or absent, how many are identical by hash, how many match by size but not by hash, and how many fall into each similarity level. It then lists up to five absent documents and up to five modified documents together with their file sizes. It is typically used as the final reporting step in a document comparison or validation workflow.
Source Code
def print_summary(results: List[Dict]):
    """
    Print summary statistics

    Args:
        results: List of comparison results
    """
    print(f"\n{'='*80}")
    print("COMPARISON SUMMARY")
    print(f"{'='*80}\n")

    total = len(results)
    present = sum(1 for r in results if r['status'] == 'PRESENT')
    absent = total - present

    identical = sum(1 for r in results if r['hash_match'])
    size_match = sum(1 for r in results if r['size_match'] and not r['hash_match'])
    high_sim = sum(1 for r in results if r['match_type'] == 'HIGH SIMILARITY')
    mod_sim = sum(1 for r in results if r['match_type'] == 'MODERATE SIMILARITY')
    low_sim = sum(1 for r in results if r['match_type'] == 'LOW SIMILARITY')

    print(f"Total coded documents in output: {total}")
    print(f"\nPresence Status:")
    print(f" Present in wuxi2: {present:4d} ({present/total*100:.1f}%)")
    print(f" Absent from wuxi2: {absent:4d} ({absent/total*100:.1f}%)")

    print(f"\nMatch Quality (for present documents):")
    print(f" Identical (hash match): {identical:4d} ({identical/present*100:.1f}% of present)")
    print(f" Size match (likely same): {size_match:4d} ({size_match/present*100:.1f}% of present)")
    print(f" High similarity (>70%): {high_sim:4d} ({high_sim/present*100:.1f}% of present)")
    print(f" Moderate similarity: {mod_sim:4d} ({mod_sim/present*100:.1f}% of present)")
    print(f" Low similarity: {low_sim:4d} ({low_sim/present*100:.1f}% of present)")

    print(f"\n{'='*80}\n")

    # Show some examples
    print("Examples of ABSENT documents:")
    print("-" * 80)
    absent_docs = [r for r in results if r['status'] == 'ABSENT']
    for doc in absent_docs[:5]:
        print(f" {doc['document_code']:15s} {doc['output_filename']}")
    if len(absent_docs) > 5:
        print(f" ... and {len(absent_docs) - 5} more")

    print(f"\nExamples of MODIFIED documents (size/hash mismatch):")
    print("-" * 80)
    modified_docs = [r for r in results if r['status'] == 'PRESENT' and not r['hash_match']]
    for doc in modified_docs[:5]:
        print(f" {doc['document_code']:15s} {doc['output_filename'][:60]}")
        print(f" → Output: {doc['output_size']:,} bytes")
        print(f" → Wuxi2: {doc['wuxi2_size']:,} bytes")
    if len(modified_docs) > 5:
        print(f" ... and {len(modified_docs) - 5} more")
    print()
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| results | List[Dict] | - | positional_or_keyword |
Parameter Details
results: A list of dictionaries where each dictionary represents a comparison result for a single document. Each dictionary must contain the following keys: 'status' (str: 'PRESENT' or 'ABSENT'), 'hash_match' (bool: whether file hashes match), 'size_match' (bool: whether file sizes match), 'match_type' (str: similarity level such as 'HIGH SIMILARITY', 'MODERATE SIMILARITY', or 'LOW SIMILARITY'), 'document_code' (str: document identifier), 'output_filename' (str: filename in the output dataset), 'output_size' (int: file size in bytes in output), and 'wuxi2_size' (int: file size in bytes in wuxi2). The list must not be empty and must contain at least one entry with status 'PRESENT': the function divides by both the total count and the present count, so either being zero raises ZeroDivisionError.
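The key requirements above can be checked before the function is called. The following is a minimal sketch, not part of compare_documents.py: the helper name find_invalid_results and the REQUIRED_KEYS constant are hypothetical and are introduced here only for illustration.

```python
from typing import Dict, List

# Keys that print_summary reads, as listed in the parameter details.
REQUIRED_KEYS = {
    'status', 'hash_match', 'size_match', 'match_type',
    'document_code', 'output_filename', 'output_size', 'wuxi2_size',
}


def find_invalid_results(results: List[Dict]) -> List[str]:
    """Return readable problem descriptions; an empty list means no problems were found."""
    problems = []
    for index, result in enumerate(results):
        missing = REQUIRED_KEYS - set(result)
        if missing:
            problems.append(f"result {index}: missing keys {sorted(missing)}")
        elif result['status'] not in ('PRESENT', 'ABSENT'):
            problems.append(f"result {index}: unexpected status {result['status']!r}")
    return problems


# Hypothetical, deliberately incomplete entry to show a reported problem.
sample = [{'status': 'PRESENT', 'hash_match': True}]
for problem in find_invalid_results(sample):
    print(problem)
```

Returning a list of messages rather than raising on the first problem lets a caller report every malformed entry in one pass.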
Return Value
This function returns None. It produces side effects by printing formatted text to standard output (console). The output includes: a header section with total counts and percentages, presence status breakdown, match quality metrics for present documents, examples of absent documents (up to 5), and examples of modified documents with size comparisons (up to 5).
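Because the function only prints and returns None, the report text has to be captured from standard output if it is needed as a string. The sketch below uses contextlib.redirect_stdout from the standard library. The helper capture_printed_output and the stand-in demo_report are hypothetical names; demo_report exists only to keep the example runnable and would be replaced by print_summary in practice.

```python
import io
from contextlib import redirect_stdout
from typing import Any, Callable


def capture_printed_output(func: Callable[..., Any], *args: Any, **kwargs: Any) -> str:
    """Call func and return everything it printed to stdout as a single string."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()


def demo_report(rows):
    # Stand-in for print_summary; it only mimics the first lines of the report.
    print("COMPARISON SUMMARY")
    print(f"Total coded documents in output: {len(rows)}")


report_text = capture_printed_output(demo_report, [{'status': 'PRESENT'}])
```

The captured string can then be written to a file or attached to a log entry without changing the function itself.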
Required Imports
from typing import List
from typing import Dict
Usage Example
# Example usage with sample comparison results
from typing import List, Dict

def print_summary(results: List[Dict]):
    # ... function code ...
    pass

# Sample data structure
results = [
    {
        'status': 'PRESENT',
        'hash_match': True,
        'size_match': True,
        'match_type': 'IDENTICAL',
        'document_code': 'DOC001',
        'output_filename': 'document1.pdf',
        'output_size': 1024000,
        'wuxi2_size': 1024000
    },
    {
        'status': 'PRESENT',
        'hash_match': False,
        'size_match': False,
        'match_type': 'HIGH SIMILARITY',
        'document_code': 'DOC002',
        'output_filename': 'document2.pdf',
        'output_size': 2048000,
        'wuxi2_size': 2050000
    },
    {
        'status': 'ABSENT',
        'hash_match': False,
        'size_match': False,
        'match_type': 'N/A',
        'document_code': 'DOC003',
        'output_filename': 'document3.pdf',
        'output_size': 512000,
        'wuxi2_size': 0
    }
]

# Print the summary
print_summary(results)
Best Practices
- Ensure the results list is not empty and contains at least one 'PRESENT' entry before calling this function; otherwise the percentage calculations raise ZeroDivisionError
- All dictionaries in the results list must contain the required keys: 'status', 'hash_match', 'size_match', 'match_type', 'document_code', 'output_filename', 'output_size', and 'wuxi2_size'
- The 'status' field should only contain 'PRESENT' or 'ABSENT'; any other value is counted as absent in the totals but is left out of the list of absent examples
- File sizes should be provided in bytes for consistent formatting
- This function is designed for console output; redirect stdout if you need to capture the output to a file
- The function shows only the first 5 examples of absent and modified documents; consider the full results list size when interpreting the summary
- Match type values should follow the expected categories: 'HIGH SIMILARITY', 'MODERATE SIMILARITY', 'LOW SIMILARITY' for proper categorization
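The division-by-zero precaution in the list above can be expressed as a small pre-check. This is a sketch under stated assumptions: safe_percent and summary_is_safe are hypothetical helpers, not functions defined in compare_documents.py.

```python
from typing import Dict, List


def safe_percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, or 0.0 when whole is zero."""
    return part / whole * 100 if whole else 0.0


def summary_is_safe(results: List[Dict]) -> bool:
    """True only when both divisors used by print_summary (total and present) are non-zero."""
    present = sum(1 for r in results if r.get('status') == 'PRESENT')
    return len(results) > 0 and present > 0


# Hypothetical input: every document is absent, so the present count is zero.
all_absent = [{'status': 'ABSENT'}, {'status': 'ABSENT'}]
if not summary_is_safe(all_absent):
    print("Skipping summary: no documents with status 'PRESENT'")
```

A caller could either skip the report when summary_is_safe returns False, or adapt the percentage expressions to use a helper like safe_percent so the report still prints with 0.0% values.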
Similar Components
AI-powered semantic similarity - components with related functionality:
- function print_summary_v1 87.3% similar
- function compare_documents 64.3% similar
- function save_results 62.3% similar
- function main_v57 57.9% similar
- function compare_documents_v1 57.3% similar