print_summary - Code Extractor

function print_summary

Maturity: 47

Prints a formatted summary report of document comparison results, including presence status, match quality statistics, and examples of absent and modified documents.

File:
/tf/active/vicechatdev/mailsearch/compare_documents.py

Lines:
355 - 409

Complexity:
simple

Purpose

This function generates a comprehensive console output summarizing the results of a document comparison operation between two datasets (output and wuxi2). It calculates and displays statistics about document presence, hash matches, size matches, and similarity levels. It also provides concrete examples of absent and modified documents with their metadata. This is typically used as a final reporting step in a document comparison or validation workflow.

Source Code

def print_summary(results: List[Dict]):
    """
    Print summary statistics
    
    Args:
        results: List of comparison results
    """
    print(f"\n{'='*80}")
    print("COMPARISON SUMMARY")
    print(f"{'='*80}\n")
    
    total = len(results)
    present = sum(1 for r in results if r['status'] == 'PRESENT')
    absent = total - present
    
    identical = sum(1 for r in results if r['hash_match'])
    size_match = sum(1 for r in results if r['size_match'] and not r['hash_match'])
    high_sim = sum(1 for r in results if r['match_type'] == 'HIGH SIMILARITY')
    mod_sim = sum(1 for r in results if r['match_type'] == 'MODERATE SIMILARITY')
    low_sim = sum(1 for r in results if r['match_type'] == 'LOW SIMILARITY')
    
    print(f"Total coded documents in output:  {total}")
    print(f"\nPresence Status:")
    print(f"  Present in wuxi2:  {present:4d} ({present/total*100:.1f}%)")
    print(f"  Absent from wuxi2: {absent:4d} ({absent/total*100:.1f}%)")
    
    print(f"\nMatch Quality (for present documents):")
    print(f"  Identical (hash match):    {identical:4d} ({identical/present*100:.1f}% of present)")
    print(f"  Size match (likely same):  {size_match:4d} ({size_match/present*100:.1f}% of present)")
    print(f"  High similarity (>70%):    {high_sim:4d} ({high_sim/present*100:.1f}% of present)")
    print(f"  Moderate similarity:       {mod_sim:4d} ({mod_sim/present*100:.1f}% of present)")
    print(f"  Low similarity:            {low_sim:4d} ({low_sim/present*100:.1f}% of present)")
    
    print(f"\n{'='*80}\n")
    
    # Show some examples
    print("Examples of ABSENT documents:")
    print("-" * 80)
    absent_docs = [r for r in results if r['status'] == 'ABSENT']
    for doc in absent_docs[:5]:
        print(f"  {doc['document_code']:15s} {doc['output_filename']}")
    if len(absent_docs) > 5:
        print(f"  ... and {len(absent_docs) - 5} more")
    
    print(f"\nExamples of MODIFIED documents (size/hash mismatch):")
    print("-" * 80)
    modified_docs = [r for r in results if r['status'] == 'PRESENT' and not r['hash_match']]
    for doc in modified_docs[:5]:
        print(f"  {doc['document_code']:15s} {doc['output_filename'][:60]}")
        print(f"    → Output: {doc['output_size']:,} bytes")
        print(f"    → Wuxi2:  {doc['wuxi2_size']:,} bytes")
    if len(modified_docs) > 5:
        print(f"  ... and {len(modified_docs) - 5} more")
    
    print()
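
The percentage lines in the function above divide by `total` and by `present`, so an empty `results` list, or a list with no 'PRESENT' entries, raises ZeroDivisionError. One way to guard against this is a small formatting helper. This is only a sketch: the name `pct` and the 'n/a' placeholder are assumptions and are not part of compare_documents.py.

```python
def pct(part: int, whole: int) -> str:
    """Format part/whole as a percentage, or 'n/a' when whole is zero."""
    if whole == 0:
        return "n/a"
    return f"{part / whole * 100:.1f}%"


# Hypothetical counts: 3 documents in total, none of them present in wuxi2.
total, present, identical = 3, 0, 0
print(f"  Present in wuxi2:  {present:4d} ({pct(present, total)})")
print(f"  Identical (hash match):    {identical:4d} ({pct(identical, present)} of present)")
```

With a helper like this, each `x/present*100` expression in the report could be swapped for `pct(x, present)` without changing the layout of the other lines.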

Parameters

Name	Type	Default	Kind
`results`	List[Dict]	-	positional_or_keyword

Parameter Details

results: A list of dictionaries, one per compared document. Each dictionary must contain the following keys: 'status' (str: 'PRESENT' or 'ABSENT'), 'hash_match' (bool: whether the file hashes match), 'size_match' (bool: whether the file sizes match), 'match_type' (str: similarity level such as 'HIGH SIMILARITY', 'MODERATE SIMILARITY', or 'LOW SIMILARITY'), 'document_code' (str: document identifier), 'output_filename' (str: filename in the output dataset), 'output_size' (int: file size in bytes in output), and 'wuxi2_size' (int: file size in bytes in wuxi2). The list must not be empty and must contain at least one 'PRESENT' entry; otherwise the percentage calculations divide by zero and raise ZeroDivisionError.
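
The required keys listed above can be written down as a type and checked before the report is printed. The sketch below is an illustration only: `ComparisonResult` and `find_missing_keys` are hypothetical names that do not exist in compare_documents.py, and the field types are taken from the description above.

```python
from typing import Dict, List, TypedDict


class ComparisonResult(TypedDict):
    """Hypothetical type for one entry of `results` (not defined in the source)."""
    status: str
    hash_match: bool
    size_match: bool
    match_type: str
    document_code: str
    output_filename: str
    output_size: int
    wuxi2_size: int


# All eight keys are required because the TypedDict is total by default.
REQUIRED_KEYS = ComparisonResult.__required_keys__


def find_missing_keys(results: List[Dict]) -> Dict[int, List[str]]:
    """Map the index of each incomplete entry to its sorted missing keys."""
    missing = {}
    for index, entry in enumerate(results):
        absent_keys = sorted(REQUIRED_KEYS.difference(entry))
        if absent_keys:
            missing[index] = absent_keys
    return missing
```

An empty return value means every entry has the keys that print_summary reads; anything else pinpoints the entries that would raise KeyError.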

Return Value

This function returns None. It produces side effects by printing formatted text to standard output (console). The output includes: a header section with total counts and percentages, presence status breakdown, match quality metrics for present documents, examples of absent documents (up to 5), and examples of modified documents with size comparisons (up to 5).
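
Because the report goes only to standard output, a caller who needs it as a string has to capture it. A minimal sketch using the standard library's `contextlib.redirect_stdout` follows; `capture_stdout` and `demo_report` are hypothetical names, and `demo_report` is only a stand-in for print_summary so that the sketch runs on its own.

```python
import io
from contextlib import redirect_stdout
from typing import Callable


def capture_stdout(func: Callable[..., None], *args, **kwargs) -> str:
    """Run func and return everything it printed to standard output."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()


def demo_report(count: int) -> None:
    """Stand-in for print_summary (assumption, for illustration only)."""
    print("COMPARISON SUMMARY")
    print(f"Total coded documents in output:  {count}")


text = capture_stdout(demo_report, 3)
```

With the real function in scope, `capture_stdout(print_summary, results)` would return the whole report, which could then be written to a file or attached to a log.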

Required Imports

from typing import Dict, List

Usage Example

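# Example usage with sample comparison results.
# print_summary must already be defined (paste the function from Source Code
# above) or imported from compare_documents.py. A stub with `pass` in its
# place would shadow the real function and print nothing.
from typing import Dict, List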

# Sample data structure
results = [
    {
        'status': 'PRESENT',
        'hash_match': True,
        'size_match': True,
        'match_type': 'IDENTICAL',
        'document_code': 'DOC001',
        'output_filename': 'document1.pdf',
        'output_size': 1024000,
        'wuxi2_size': 1024000
    },
    {
        'status': 'PRESENT',
        'hash_match': False,
        'size_match': False,
        'match_type': 'HIGH SIMILARITY',
        'document_code': 'DOC002',
        'output_filename': 'document2.pdf',
        'output_size': 2048000,
        'wuxi2_size': 2050000
    },
    {
        'status': 'ABSENT',
        'hash_match': False,
        'size_match': False,
        'match_type': 'N/A',
        'document_code': 'DOC003',
        'output_filename': 'document3.pdf',
        'output_size': 512000,
        'wuxi2_size': 0
    }
]

# Print the summary
print_summary(results)

Best Practices

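Ensure the results list is not empty and contains at least one 'PRESENT' entry before calling this function; both the total count and the present count are used as divisors, so a zero in either raises ZeroDivisionError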
All dictionaries in the results list must contain the required keys: 'status', 'hash_match', 'size_match', 'match_type', 'document_code', 'output_filename', 'output_size', and 'wuxi2_size'
The 'status' field should contain only 'PRESENT' or 'ABSENT'; any other value is counted as absent (absent = total - present) but never appears in the list of ABSENT examples
File sizes should be provided in bytes for consistent formatting
This function is designed for console output; redirect stdout if you need to capture the output to a file
The function shows only the first 5 examples of absent and modified documents; consider the full results list size when interpreting the summary
Only the match_type values 'HIGH SIMILARITY', 'MODERATE SIMILARITY', and 'LOW SIMILARITY' are tallied in the similarity lines; entries with any other value are not counted there
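
The points above about 'status' and 'match_type' values can be checked before the report is printed. The following is a sketch under stated assumptions: `unexpected_statuses` and `tally_match_types` are hypothetical helpers that are not part of compare_documents.py.

```python
from collections import Counter
from typing import Dict, List

EXPECTED_STATUSES = {"PRESENT", "ABSENT"}


def unexpected_statuses(results: List[Dict]) -> Counter:
    """Count 'status' values other than 'PRESENT' or 'ABSENT'."""
    return Counter(
        r["status"] for r in results if r["status"] not in EXPECTED_STATUSES
    )


def tally_match_types(results: List[Dict]) -> Counter:
    """Count every 'match_type' value so uncounted categories are visible."""
    return Counter(r["match_type"] for r in results)


# Hypothetical sample: the lowercase status would be counted as absent.
sample = [
    {"status": "PRESENT", "match_type": "HIGH SIMILARITY"},
    {"status": "present", "match_type": "IDENTICAL"},
    {"status": "ABSENT", "match_type": "N/A"},
]
print(unexpected_statuses(sample))
print(tally_match_types(sample))
```

A non-empty result from `unexpected_statuses` suggests the presence percentages will not match the examples shown in the report.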

Similar Components

AI-powered semantic similarity - components with related functionality:

function print_summary_v1 87.3% similar

Prints a comprehensive summary report of document comparison results, including status breakdowns, signature analysis, match quality metrics, and examples from each category.
From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py

function compare_documents 64.3% similar

Compares documents from an output folder with documents in a wuxi2 repository by matching document codes, file hashes, sizes, and filenames to identify identical, similar, or missing documents.
From: /tf/active/vicechatdev/mailsearch/compare_documents.py

function save_results 62.3% similar

Saves comparison results data to both CSV and JSON file formats with predefined field structure and UTF-8 encoding.
From: /tf/active/vicechatdev/mailsearch/compare_documents.py

function main_v57 57.9% similar

Main execution function that orchestrates a document comparison workflow between two directories (mailsearch/output and wuxi2 repository), scanning for coded documents, comparing them, and generating results.
From: /tf/active/vicechatdev/mailsearch/compare_documents.py

function compare_documents_v1 57.3% similar

Compares two sets of PDF documents by matching document codes, detecting signatures, calculating content similarity, and generating detailed comparison results with signature information.
From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
