
function scan_output_folder_v1

Maturity: 55

Scans a specified folder for PDF documents with embedded codes in their filenames, extracting metadata and signature information for each coded document found.

File: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
Lines: 161 - 185
Complexity: moderate

Purpose

This function inventories the PDF documents in an output folder by identifying those whose filenames contain a document code. For each coded document it records the filename, file path, size in bytes, content hash, and signature detection results. This is useful for document management systems, compliance tracking, or automated document processing pipelines in which documents are identified by code and must be tracked together with their metadata.

Source Code

def scan_output_folder(folder_path: str) -> Dict[str, Dict]:
    """Scan output folder for coded documents"""
    print(f"\nScanning output folder: {folder_path}")
    documents = {}
    
    for filename in os.listdir(folder_path):
        if filename.endswith('.pdf') and not filename.startswith('.'):
            filepath = os.path.join(folder_path, filename)
            if not os.path.isfile(filepath):
                continue
                
            code = extract_document_code(filename)
            if code:
                print(f"  Found: {code} - {filename[:80]}")
                documents[code] = {
                    'code': code,
                    'filename': filename,
                    'filepath': filepath,
                    'size': os.path.getsize(filepath),
                    'hash': calculate_file_hash(filepath),
                    'signature_info': detect_signatures_in_pdf(filepath)
                }
    
    print(f"\nTotal coded documents in output: {len(documents)}")
    return documents
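
The function depends on three helpers whose implementations are not shown on this page. As a rough illustration only, two of them might look like the sketch below. The code pattern (two to five capital letters, a hyphen, then three or more digits) and the choice of SHA-256 are assumptions made for this example, not details taken from the source file.

```python
import hashlib
import re
from typing import Optional

# Hypothetical code format; the real pattern used by the project is not shown.
CODE_PATTERN = re.compile(r"[A-Z]{2,5}-\d{3,}")


def extract_document_code(filename: str) -> Optional[str]:
    """Return the first code-like token found in the filename, or None."""
    match = CODE_PATTERN.search(filename)
    return match.group(0) if match else None


def calculate_file_hash(filepath: str, chunk_size: int = 65536) -> str:
    """Return a hex digest of the file, read in chunks to bound memory use.

    SHA-256 is an assumption; the source does not name the algorithm.
    """
    digest = hashlib.sha256()
    with open(filepath, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because scan_output_folder only keeps files for which extract_document_code returns a truthy value, a helper that returns None for uncoded filenames is enough to exclude them from the result.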

Parameters

Name         Type  Default  Kind
folder_path  str   -        positional_or_keyword

Parameter Details

folder_path: Path to the directory to scan, given as a string. It may be absolute or relative but must refer to an existing folder, because the function passes it directly to os.listdir. Only the files directly inside this directory are examined (the scan is non-recursive), and only PDF files whose filenames yield a document code are included in the result.
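
Since the function does not validate its argument, a caller may want to check the path first. The following is a minimal sketch of such a check; resolve_scan_folder is a hypothetical helper introduced here and is not part of the documented module.

```python
from pathlib import Path


def resolve_scan_folder(folder_path: str) -> Path:
    """Return the folder as an absolute Path, or raise if it cannot be scanned."""
    folder = Path(folder_path).expanduser().resolve()
    if not folder.exists():
        raise FileNotFoundError(f"Folder does not exist: {folder}")
    if not folder.is_dir():
        raise NotADirectoryError(f"Not a directory: {folder}")
    return folder
```

The returned Path can be converted with str() before being passed to scan_output_folder, which expects a string.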

Return Value

Type: Dict[str, Dict]

Returns a dictionary that maps each extracted document code (a string) to a nested dictionary of metadata for that document. Each nested dictionary contains:

  • 'code': the extracted document code
  • 'filename': the original filename
  • 'filepath': folder_path joined with the filename (relative if folder_path is relative)
  • 'size': the file size in bytes
  • 'hash': the file content hash, usable for integrity verification
  • 'signature_info': information about signatures detected in the PDF

Returns an empty dictionary if no coded documents are found.
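
The shape of the return value can be written down as a typed sketch. DocumentRecord, ScanResult, and total_size are hypothetical names introduced for illustration, the sample values are placeholders, and signature_info is typed as Any because its structure depends on detect_signatures_in_pdf, which is not shown here.

```python
from typing import Any, Dict, TypedDict


class DocumentRecord(TypedDict):
    """One entry of the scan result (field names taken from the source code)."""
    code: str
    filename: str
    filepath: str
    size: int
    hash: str
    signature_info: Any  # structure is defined by detect_signatures_in_pdf


ScanResult = Dict[str, DocumentRecord]

# Placeholder data showing the layout only; none of these values are real.
example: ScanResult = {
    "DOC-001": {
        "code": "DOC-001",
        "filename": "DOC-001_report.pdf",
        "filepath": "/path/to/output/documents/DOC-001_report.pdf",
        "size": 1024,
        "hash": "<hex digest>",
        "signature_info": None,
    }
}


def total_size(result: ScanResult) -> int:
    """Sum the sizes, in bytes, of all scanned documents."""
    return sum(record["size"] for record in result.values())
```

Note that the code appears twice, once as the dictionary key and once inside the record, so a record remains self-describing when passed around on its own.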

Dependencies

  • os
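  • PyPDF2 (not referenced in the function body shown above; presumably required by the detect_signatures_in_pdf helper)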

Required Imports

import os
from typing import Dict
import PyPDF2  # not used by scan_output_folder itself; presumably needed by detect_signatures_in_pdf

Usage Example

# Assumes scan_output_folder and its helper functions
# (extract_document_code, calculate_file_hash, detect_signatures_in_pdf)
# are already defined or imported in the current module.

# Example usage
output_folder = '/path/to/output/documents'
result = scan_output_folder(output_folder)

# Access scanned documents
for code, metadata in result.items():
    print(f"Document Code: {code}")
    print(f"  Filename: {metadata['filename']}")
    print(f"  Size: {metadata['size']} bytes")
    print(f"  Hash: {metadata['hash']}")
    print(f"  Signatures: {metadata['signature_info']}")

# Check whether a specific document exists ('DOC-001' is a placeholder code)
if 'DOC-001' in result:
    doc_path = result['DOC-001']['filepath']
    print(f"Found document at: {doc_path}")
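
The function reports progress with print, so callers that do not want that output can redirect stdout for the duration of the call. A minimal sketch using the standard library follows; run_quietly is a hypothetical wrapper, not part of the documented module.

```python
import contextlib
import io
from typing import Callable, Dict


def run_quietly(scan: Callable[[str], Dict[str, Dict]], folder_path: str) -> Dict[str, Dict]:
    """Call scan(folder_path) while discarding anything it prints to stdout."""
    with contextlib.redirect_stdout(io.StringIO()):
        return scan(folder_path)
```

With the real function in scope, this would be called as run_quietly(scan_output_folder, output_folder). Output written to stderr is not affected.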

Best Practices

  • Confirm that folder_path exists and is a readable directory before calling this function; otherwise os.listdir raises FileNotFoundError, NotADirectoryError, or PermissionError
  • Only the immediate directory is scanned (non-recursive), so PDFs in nested folders are not processed
  • Hidden files (names starting with '.') are excluded from scanning
  • The extension check is case-sensitive: only filenames ending in lowercase '.pdf' are processed, so a file named 'REPORT.PDF' is skipped along with all other file types
  • The function prints progress information to stdout, which may need to be redirected or suppressed in production environments
  • The helper functions (extract_document_code, calculate_file_hash, detect_signatures_in_pdf) are called without error handling, so an exception raised by any of them aborts the whole scan; make sure they handle errors gracefully
  • Every coded PDF is hashed and checked for signatures one file at a time, so run time grows with the number and size of the files in the folder
  • Document codes are used as dictionary keys, so when two files yield the same code the one listed later overwrites the earlier entry; os.listdir returns entries in arbitrary order, so which file survives is not predictable
  • Consider adding error handling around the per-file operations if the folder may contain corrupted or inaccessible files
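
Several of the points above can be combined into a more defensive variant. The sketch below is an illustration under stated assumptions rather than the project's implementation: scan_folder_defensively is a hypothetical name, the helpers are passed in as callables so that the block is self-contained, the extension check is made case-insensitive, and duplicates keep the first file in sorted order instead of overwriting.

```python
import os
from typing import Any, Callable, Dict, List, Optional, Tuple


def scan_folder_defensively(
    folder_path: str,
    extract_code: Callable[[str], Optional[str]],
    describe: Callable[[str], Dict[str, Any]],
) -> Tuple[Dict[str, Dict[str, Any]], List[str]]:
    """Scan folder_path and return (documents, problems).

    describe(filepath) supplies the per-file metadata (size, hash, ...).
    problems lists files skipped because of a duplicate code or an error.
    """
    documents: Dict[str, Dict[str, Any]] = {}
    problems: List[str] = []
    # Sorting makes the outcome independent of os.listdir's arbitrary order.
    for filename in sorted(os.listdir(folder_path)):
        if filename.startswith('.') or not filename.lower().endswith('.pdf'):
            continue
        filepath = os.path.join(folder_path, filename)
        if not os.path.isfile(filepath):
            continue
        code = extract_code(filename)
        if not code:
            continue
        if code in documents:
            # Keep the first file and report the clash instead of overwriting.
            problems.append(f"duplicate code {code}: {filename}")
            continue
        try:
            metadata = describe(filepath)
        except Exception as exc:  # broad on purpose: helper errors are unknown
            problems.append(f"failed {filename}: {exc}")
            continue
        documents[code] = {'code': code, 'filename': filename,
                           'filepath': filepath, **metadata}
    return documents, problems
```

Returning the problems list alongside the documents lets the caller decide whether a duplicate code or an unreadable file should stop the pipeline or merely be logged.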

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function scan_output_folder 81.9% similar

    Scans a specified output folder for PDF files containing document codes, extracts those codes, and returns a dictionary mapping each code to its associated file information.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py
  • function scan_wuxi2_folder_v1 69.5% similar

    Recursively scans a directory for PDF files, extracts document codes from filenames, and returns a dictionary mapping each unique document code to a list of file metadata dictionaries.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function scan_wuxi2_folder 62.2% similar

    Recursively scans a wuxi2 folder for PDF documents, extracts document codes from filenames, and organizes them into a dictionary mapping codes to file information.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py
  • function main_v100 58.0% similar

    Main entry point function that orchestrates a document comparison workflow between two folders (mailsearch/output and wuxi2 repository), detecting signatures and generating comparison results.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function compare_documents_v1 57.4% similar

    Compares two sets of PDF documents by matching document codes, detecting signatures, calculating content similarity, and generating detailed comparison results with signature information.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py