function print_summary

Maturity: 47

Prints a formatted summary report of document comparison results, including presence status, match quality statistics, and examples of absent and modified documents.

File: /tf/active/vicechatdev/mailsearch/compare_documents.py
Lines: 355 - 409
Complexity: simple

Purpose

This function prints a console report summarizing a document comparison between two datasets (output and wuxi2). It counts how many documents are present or absent, how many match by hash, how many match by size only (same size, different hash), and how many fall into each similarity level. It then lists up to five absent documents and up to five modified documents (present but with a hash mismatch) along with their file sizes. It is typically the final reporting step in a document comparison or validation workflow.
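The counting logic described above can be sketched as a function that returns the numbers instead of printing them, which makes them easier to inspect or test. This is a minimal sketch, not part of compare_documents.py: the name `summarize_counts` and the returned key names are hypothetical, while the conditions mirror those in the source code below.

```python
from typing import Dict, List


def summarize_counts(results: List[Dict]) -> Dict[str, int]:
    """Return the counts that print_summary reports (hypothetical helper)."""
    total = len(results)
    present = sum(1 for r in results if r["status"] == "PRESENT")
    return {
        "total": total,
        "present": present,
        # Every entry whose status is not 'PRESENT' is counted as absent.
        "absent": total - present,
        "identical": sum(1 for r in results if r["hash_match"]),
        # Same size but different hash.
        "size_match": sum(
            1 for r in results if r["size_match"] and not r["hash_match"]
        ),
        "high_sim": sum(1 for r in results if r["match_type"] == "HIGH SIMILARITY"),
        "mod_sim": sum(1 for r in results if r["match_type"] == "MODERATE SIMILARITY"),
        "low_sim": sum(1 for r in results if r["match_type"] == "LOW SIMILARITY"),
    }


# Illustrative data only; real entries carry more keys.
sample = [
    {"status": "PRESENT", "hash_match": True, "size_match": True,
     "match_type": "IDENTICAL"},
    {"status": "PRESENT", "hash_match": False, "size_match": False,
     "match_type": "HIGH SIMILARITY"},
    {"status": "ABSENT", "hash_match": False, "size_match": False,
     "match_type": "N/A"},
]
counts = summarize_counts(sample)
print(counts)
```

Note that an entry with both a hash match and a size match is counted as identical only, not as a size match.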

Source Code

def print_summary(results: List[Dict]):
    """
    Print summary statistics
    
    Args:
        results: List of comparison results
    """
    print(f"\n{'='*80}")
    print("COMPARISON SUMMARY")
    print(f"{'='*80}\n")
    
    total = len(results)
    present = sum(1 for r in results if r['status'] == 'PRESENT')
    absent = total - present
    
    identical = sum(1 for r in results if r['hash_match'])
    size_match = sum(1 for r in results if r['size_match'] and not r['hash_match'])
    high_sim = sum(1 for r in results if r['match_type'] == 'HIGH SIMILARITY')
    mod_sim = sum(1 for r in results if r['match_type'] == 'MODERATE SIMILARITY')
    low_sim = sum(1 for r in results if r['match_type'] == 'LOW SIMILARITY')
    
    print(f"Total coded documents in output:  {total}")
    print(f"\nPresence Status:")
    print(f"  Present in wuxi2:  {present:4d} ({present/total*100:.1f}%)")
    print(f"  Absent from wuxi2: {absent:4d} ({absent/total*100:.1f}%)")
    
    print(f"\nMatch Quality (for present documents):")
    print(f"  Identical (hash match):    {identical:4d} ({identical/present*100:.1f}% of present)")
    print(f"  Size match (likely same):  {size_match:4d} ({size_match/present*100:.1f}% of present)")
    print(f"  High similarity (>70%):    {high_sim:4d} ({high_sim/present*100:.1f}% of present)")
    print(f"  Moderate similarity:       {mod_sim:4d} ({mod_sim/present*100:.1f}% of present)")
    print(f"  Low similarity:            {low_sim:4d} ({low_sim/present*100:.1f}% of present)")
    
    print(f"\n{'='*80}\n")
    
    # Show some examples
    print("Examples of ABSENT documents:")
    print("-" * 80)
    absent_docs = [r for r in results if r['status'] == 'ABSENT']
    for doc in absent_docs[:5]:
        print(f"  {doc['document_code']:15s} {doc['output_filename']}")
    if len(absent_docs) > 5:
        print(f"  ... and {len(absent_docs) - 5} more")
    
    print(f"\nExamples of MODIFIED documents (size/hash mismatch):")
    print("-" * 80)
    modified_docs = [r for r in results if r['status'] == 'PRESENT' and not r['hash_match']]
    for doc in modified_docs[:5]:
        print(f"  {doc['document_code']:15s} {doc['output_filename'][:60]}")
        print(f"    → Output: {doc['output_size']:,} bytes")
        print(f"    → Wuxi2:  {doc['wuxi2_size']:,} bytes")
    if len(modified_docs) > 5:
        print(f"  ... and {len(modified_docs) - 5} more")
    
    print()

Parameters

Name     Type        Default  Kind
results  List[Dict]  -        positional_or_keyword

Parameter Details

results: A list of dictionaries where each dictionary represents a comparison result for a single document. Each dictionary must contain the following keys: 'status' (str: 'PRESENT' or 'ABSENT'), 'hash_match' (bool: whether file hashes match), 'size_match' (bool: whether file sizes match), 'match_type' (str: similarity level such as 'HIGH SIMILARITY', 'MODERATE SIMILARITY', or 'LOW SIMILARITY'), 'document_code' (str: document identifier), 'output_filename' (str: filename in the output dataset), 'output_size' (int: file size in bytes in output), and 'wuxi2_size' (int: file size in bytes in wuxi2). The list must not be empty, and at least one entry must have status 'PRESENT'; otherwise the percentage calculations divide by zero and raise ZeroDivisionError.
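The requirements on results can be checked before calling print_summary. The following is a minimal sketch under stated assumptions: `check_results` and `REQUIRED_KEYS` are hypothetical names that do not exist in compare_documents.py, and the key list is taken from the parameter description above.

```python
from typing import Dict, List

# Keys that print_summary reads from each entry (per the description above).
REQUIRED_KEYS = {
    "status", "hash_match", "size_match", "match_type",
    "document_code", "output_filename", "output_size", "wuxi2_size",
}


def check_results(results: List[Dict]) -> List[str]:
    """Return a list of problems; an empty list means the input looks safe."""
    problems = []
    if not results:
        problems.append("results is empty (total would be 0)")
    for i, entry in enumerate(results):
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(f"entry {i} is missing keys: {sorted(missing)}")
    if results and not any(r.get("status") == "PRESENT" for r in results):
        problems.append("no entry has status 'PRESENT' (present would be 0)")
    return problems


empty_problems = check_results([])
partial_problems = check_results([{"status": "ABSENT"}])
print(empty_problems)
print(partial_problems)
```

A caller could skip the report, or print a short notice instead, whenever the returned list is not empty.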

Return Value

This function returns None. It produces side effects by printing formatted text to standard output (console). The output includes: a header section with total counts and percentages, presence status breakdown, match quality metrics for present documents, examples of absent documents (up to 5), and examples of modified documents with size comparisons (up to 5).
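Because the function only prints, a caller that needs the report as a string has to capture standard output. One way to do that, sketched here with the standard library's `contextlib.redirect_stdout`, is a small wrapper; the name `capture_stdout` is hypothetical, and the built-in `print` stands in for print_summary so that the sketch runs on its own.

```python
import io
from contextlib import redirect_stdout
from typing import Any, Callable


def capture_stdout(func: Callable[..., Any], *args: Any, **kwargs: Any) -> str:
    """Run func and return everything it printed to standard output."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()


# print is a stand-in here; in practice one would pass print_summary and results.
text = capture_stdout(print, "COMPARISON SUMMARY")
```

The captured string can then be written to a file or attached to a log entry.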

Required Imports

from typing import List, Dict

Usage Example

# Example usage with sample comparison results
from typing import List, Dict

# Assumption: compare_documents.py is importable (for example, on sys.path)
from compare_documents import print_summary

# Sample data structure
results = [
    {
        'status': 'PRESENT',
        'hash_match': True,
        'size_match': True,
        'match_type': 'IDENTICAL',
        'document_code': 'DOC001',
        'output_filename': 'document1.pdf',
        'output_size': 1024000,
        'wuxi2_size': 1024000
    },
    {
        'status': 'PRESENT',
        'hash_match': False,
        'size_match': False,
        'match_type': 'HIGH SIMILARITY',
        'document_code': 'DOC002',
        'output_filename': 'document2.pdf',
        'output_size': 2048000,
        'wuxi2_size': 2050000
    },
    {
        'status': 'ABSENT',
        'hash_match': False,
        'size_match': False,
        'match_type': 'N/A',
        'document_code': 'DOC003',
        'output_filename': 'document3.pdf',
        'output_size': 512000,
        'wuxi2_size': 0
    }
]

# Print the summary
print_summary(results)

Best Practices

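  • Ensure the results list is not empty and contains at least one 'PRESENT' entry before calling this function; both percentage sections divide by a count (total, then present) and raise ZeroDivisionError when that count is zero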
  • All dictionaries in the results list must contain the required keys: 'status', 'hash_match', 'size_match', 'match_type', 'document_code', 'output_filename', 'output_size', and 'wuxi2_size'
  • The 'status' field should only contain 'PRESENT' or 'ABSENT': any other value is counted as absent in the totals but is never listed among the examples of absent documents
  • File sizes should be provided in bytes for consistent formatting
  • This function is designed for console output; redirect stdout if you need to capture the output to a file
  • The function shows only the first 5 examples of absent and modified documents; consider the full results list size when interpreting the summary
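  • Only the match type values 'HIGH SIMILARITY', 'MODERATE SIMILARITY', and 'LOW SIMILARITY' are counted in the similarity lines; other values (such as 'IDENTICAL' or 'N/A' in the usage example) are accepted but do not contribute to those counts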
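The division by zero noted in the first point could also be avoided at the formatting step. The sketch below shows one possible approach, not the behavior of the current function: `pct` is a hypothetical helper, and the choice of 'n/a' as the fallback text is an assumption.

```python
def pct(part: int, whole: int) -> str:
    """Format part/whole as a percentage, or 'n/a' when whole is 0."""
    if whole == 0:
        return "n/a"
    return f"{part / whole * 100:.1f}%"


# Same layout as the "Presence Status" lines, with illustrative numbers.
line = f"  Present in wuxi2:  {2:4d} ({pct(2, 3)})"
print(line)
print(f"  Identical (hash match):    {0:4d} ({pct(0, 0)} of present)")
```

Returning a string rather than a float keeps the call sites simple, at the cost of the helper deciding the number format.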

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function print_summary_v1 87.3% similar

    Prints a comprehensive summary report of document comparison results, including status breakdowns, signature analysis, match quality metrics, and examples from each category.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function compare_documents 64.3% similar

    Compares documents from an output folder with documents in a wuxi2 repository by matching document codes, file hashes, sizes, and filenames to identify identical, similar, or missing documents.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py
  • function save_results 62.3% similar

    Saves comparison results data to both CSV and JSON file formats with predefined field structure and UTF-8 encoding.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py
  • function main_v57 57.9% similar

    Main execution function that orchestrates a document comparison workflow between two directories (mailsearch/output and wuxi2 repository), scanning for coded documents, comparing them, and generating results.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py
  • function compare_documents_v1 57.3% similar

    Compares two sets of PDF documents by matching document codes, detecting signatures, calculating content similarity, and generating detailed comparison results with signature information.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py