
function scan_wuxi2_folder_v1

Maturity: 53

Recursively scans a directory for PDF files, extracts document codes from filenames, and returns a dictionary mapping each unique document code to a list of file metadata dictionaries.

File: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
Lines: 188 - 222
Complexity: moderate

Purpose

This function inventories the PDF documents in the wuxi2 repository by extracting a document code from each filename. Files that share a code (e.g., renumbered versions of the same document) are grouped under that code, and PDFs whose filenames yield no code are skipped. For each coded PDF the function records the file path, the containing directory relative to the scan root, the size, and a content hash, which supports document tracking, deduplication, and management workflows.

Source Code

def scan_wuxi2_folder(folder_path: str) -> Dict[str, List[Dict]]:
    """
    Scan wuxi2 repository for all PDFs
    Returns dict mapping document codes to list of matching files
    (allowing for renumbered variants in same folder)
    """
    print(f"\nScanning wuxi2 repository: {folder_path}")
    documents = defaultdict(list)
    count = 0
    
    for root, dirs, files in os.walk(folder_path):
        for filename in files:
            if filename.endswith('.pdf') and not filename.startswith('.'):
                filepath = os.path.join(root, filename)
                code = extract_document_code(filename)
                
                if code:
                    count += 1
                    if count % 100 == 0:
                        print(f"  Processed {count} coded documents...")
                    
                    relative_path = os.path.relpath(root, folder_path)
                    documents[code].append({
                        'code': code,
                        'filename': filename,
                        'filepath': filepath,
                        'relative_path': relative_path,
                        'size': os.path.getsize(filepath),
                        'hash': calculate_file_hash(filepath),
                        'signature_info': None  # Lazy load when needed
                    })
    
    print(f"\nTotal coded documents found: {count}")
    print(f"Unique document codes: {len(documents)}")
    return documents
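
The print calls in the source above write progress messages straight to stdout. A minimal sketch of capturing that output from the caller's side follows; `noisy_scan` is a hypothetical stand-in that only imitates the messages, not the real function.

```python
import contextlib
import io
from typing import Dict, List


def noisy_scan(folder_path: str) -> Dict[str, List[Dict]]:
    """Hypothetical stand-in that prints the way scan_wuxi2_folder does."""
    print(f"\nScanning wuxi2 repository: {folder_path}")
    print("\nTotal coded documents found: 0")
    return {}


# Redirect everything printed during the call into an in-memory buffer.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    result = noisy_scan("/path/to/wuxi2/documents")

# The captured text can be logged or simply discarded.
captured = buffer.getvalue()
```

The same `with` block should work around a call to the real function, since it relies only on `print` writing to `sys.stdout`.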

Parameters

Name         Type  Default  Kind
folder_path  str   -        positional_or_keyword

Parameter Details

folder_path: String path to the root directory of the wuxi2 repository to scan. Should be an absolute or relative path to a valid directory containing PDF files. The function will recursively traverse all subdirectories from this starting point.
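
Because the function passes folder_path directly to os.walk, which by default ignores errors from unreadable or missing directories, a wrong path produces an empty result and no exception. The sketch below shows one way a caller might check the path first; `validate_scan_root` is a hypothetical helper, not part of the documented module.

```python
import os
import tempfile


def validate_scan_root(folder_path: str) -> str:
    """Hypothetical helper: return an absolute path, or raise if it is not a directory."""
    absolute = os.path.abspath(os.path.expanduser(folder_path))
    if not os.path.isdir(absolute):
        raise NotADirectoryError(f"Not a directory: {absolute}")
    return absolute


with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "does_not_exist")
    # os.walk yields nothing for a missing directory, so a scan would report zero documents.
    walked = list(os.walk(missing))
    # A valid directory comes back normalized to an absolute path.
    root = validate_scan_root(tmp)
```

Passing the normalized path on to the scan also makes each 'filepath' in the result absolute.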

Return Value

Type: Dict[str, List[Dict]]

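Returns a dictionary whose keys are document codes (strings) extracted from PDF filenames and whose values are lists of dictionaries, one per PDF file. Each inner dictionary contains: 'code' (document code), 'filename' (original filename), 'filepath' (the walked directory joined with the filename, so it is absolute only when folder_path is absolute), 'relative_path' (the directory containing the file, relative to folder_path; '.' for files directly in folder_path), 'size' (file size in bytes), 'hash' (the value returned by calculate_file_hash), and 'signature_info' (initially None, intended for lazy loading). Several files can share a code when they are variants of the same document. The returned object is a collections.defaultdict, so looking up a missing code with square brackets inserts an empty list instead of raising KeyError; use .get() or an 'in' check for lookups.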
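
To make the shape of the return value concrete, here is a small sketch that works on a hand-written sample. The codes, filenames, paths, and hash strings are invented for illustration, and `record`, `codes_with_variants`, and `identical_copies` are hypothetical helpers.

```python
from collections import defaultdict
from typing import Dict, List


def record(code: str, filename: str, relative_path: str, size: int, file_hash: str) -> Dict:
    """Build one metadata dictionary with the documented keys (sample data only)."""
    return {
        "code": code,
        "filename": filename,
        "filepath": f"/repo/{relative_path}/{filename}",
        "relative_path": relative_path,
        "size": size,
        "hash": file_hash,
        "signature_info": None,
    }


sample: Dict[str, List[Dict]] = {
    "AB-1001": [
        record("AB-1001", "AB-1001 report.pdf", "batch1", 2048, "aaa"),
        record("AB-1001", "AB-1001 report copy.pdf", "batch2", 2048, "aaa"),
        record("AB-1001", "AB-1001 report rev2.pdf", "batch2", 4096, "bbb"),
    ],
    "CD-2002": [record("CD-2002", "CD-2002 memo.pdf", "batch1", 512, "ccc")],
}


def codes_with_variants(documents: Dict[str, List[Dict]]) -> List[str]:
    """Return the codes that map to more than one file."""
    return sorted(code for code, files in documents.items() if len(files) > 1)


def identical_copies(files: List[Dict]) -> List[List[str]]:
    """Group one code's files by hash and keep groups with more than one file."""
    by_hash = defaultdict(list)
    for info in files:
        by_hash[info["hash"]].append(info["filename"])
    return [names for names in by_hash.values() if len(names) > 1]


variant_codes = codes_with_variants(sample)
duplicates = identical_copies(sample["AB-1001"])
```

Here `variant_codes` holds only "AB-1001", and `duplicates` groups the two files that share the hash "aaa", which is the kind of check the 'hash' field is meant to support.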

Dependencies

  • os
  • hashlib
  • collections

Required Imports

import os
from collections import defaultdict
from typing import Dict, List

Usage Example

import hashlib
import re

# Illustrative stand-ins for the module's helper functions;
# the real implementations may differ
def extract_document_code(filename):
    # Example implementation
    match = re.search(r'([A-Z]{2,}-\d{4,})', filename)
    return match.group(1) if match else None

def calculate_file_hash(filepath):
    # Example implementation
    with open(filepath, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

# Scan the repository
repository_path = '/path/to/wuxi2/documents'
document_map = scan_wuxi2_folder(repository_path)

# Access results
print(f"Found {len(document_map)} unique document codes")
for code, files in document_map.items():
    print(f"Code {code}: {len(files)} file(s)")
    for file_info in files:
        print(f"  - {file_info['filename']} ({file_info['size']} bytes)")

Best Practices

  • Ensure the helper functions 'extract_document_code' and 'calculate_file_hash' are properly defined before calling this function
  • For large repositories, be aware that this function loads all metadata into memory; consider pagination or streaming for very large datasets
  • The function prints progress messages to stdout; redirect or suppress output if running in a non-interactive environment
  • File hashing is performed for every PDF which can be I/O intensive; consider caching results for repeated scans
  • The 'signature_info' field is set to None for lazy loading; implement a separate function to populate this field when needed to avoid unnecessary processing
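  • Hidden files (starting with '.') are automatically excluded from scanning, and the '.pdf' extension check is case-sensitive, so a file named 'REPORT.PDF' is skipped
  • The function calls os.walk with its default followlinks=False, so symbolic links to directories are not traversed and PDFs reachable only through such links are not scanned; if link following is ever enabled, guard against circular references in the directory structure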
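
The caching suggestion in the list above could be sketched as follows. This assumes a SHA-256 digest and an in-memory cache keyed by path, size, and modification time; the algorithm used by the real calculate_file_hash is not shown in this document, and `cached_file_hash` is a hypothetical name.

```python
import hashlib
import os
import tempfile
from typing import Dict, Tuple

# Cache key: (absolute path, size in bytes, modification time in nanoseconds).
_hash_cache: Dict[Tuple[str, int, int], str] = {}


def cached_file_hash(filepath: str, chunk_size: int = 1 << 20) -> str:
    """Hypothetical helper: hash a file in chunks and reuse the result while the file is unchanged."""
    stat = os.stat(filepath)
    key = (os.path.abspath(filepath), stat.st_size, stat.st_mtime_ns)
    if key not in _hash_cache:
        digest = hashlib.sha256()  # assumption: the real helper may use another algorithm
        with open(filepath, "rb") as handle:
            # Read fixed-size chunks so large PDFs are never loaded whole.
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        _hash_cache[key] = digest.hexdigest()
    return _hash_cache[key]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.pdf")
    with open(path, "wb") as handle:
        handle.write(b"%PDF-1.4 demo bytes")
    first = cached_file_hash(path)
    second = cached_file_hash(path)  # served from the cache, no second read
    cache_entries = len(_hash_cache)
```

Keying on size and modification time is a heuristic: it avoids re-reading unchanged files, but it would miss an edit that preserves both values. Persisting the cache between runs (for example as JSON) would be a separate step.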

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function scan_wuxi2_folder 91.7% similar

    Recursively scans a wuxi2 folder for PDF documents, extracts document codes from filenames, and organizes them into a dictionary mapping codes to file information.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py
  • function scan_output_folder_v1 69.5% similar

    Scans a specified folder for PDF documents with embedded codes in their filenames, extracting metadata and signature information for each coded document found.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function find_best_folder 67.1% similar

    Finds the best matching folder in a directory tree by comparing hierarchical document codes with folder names containing numeric codes.

    From: /tf/active/vicechatdev/mailsearch/copy_signed_documents.py
  • function scan_output_folder 66.3% similar

    Scans a specified output folder for PDF files containing document codes, extracts those codes, and returns a dictionary mapping each code to its associated file information.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py
  • function compare_documents_v1 61.8% similar

    Compares two sets of PDF documents by matching document codes, detecting signatures, calculating content similarity, and generating detailed comparison results with signature information.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py