function scan_wuxi2_folder_v1
Recursively scans a directory for PDF files, extracts document codes from filenames, and returns a dictionary mapping each unique document code to a list of file metadata dictionaries.
/tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
188 - 222
moderate
Purpose
This function is designed to inventory and organize PDF documents in the wuxi2 repository by extracting standardized document codes from filenames. It handles multiple variants of the same document (e.g., renumbered versions) by grouping them under the same code. The function collects comprehensive metadata for each PDF including file path, size, hash, and relative location, enabling document tracking, deduplication, and management workflows.
Source Code
def scan_wuxi2_folder(folder_path: str) -> Dict[str, List[Dict]]:
    """
    Scan wuxi2 repository for all PDFs
    Returns dict mapping document codes to list of matching files
    (allowing for renumbered variants in same folder)
    """
    print(f"\nScanning wuxi2 repository: {folder_path}")
    documents = defaultdict(list)
    count = 0
    for root, dirs, files in os.walk(folder_path):
        for filename in files:
            if filename.endswith('.pdf') and not filename.startswith('.'):
                filepath = os.path.join(root, filename)
                code = extract_document_code(filename)
                if code:
                    count += 1
                    if count % 100 == 0:
                        print(f"  Processed {count} coded documents...")
                    relative_path = os.path.relpath(root, folder_path)
                    documents[code].append({
                        'code': code,
                        'filename': filename,
                        'filepath': filepath,
                        'relative_path': relative_path,
                        'size': os.path.getsize(filepath),
                        'hash': calculate_file_hash(filepath),
                        'signature_info': None  # Lazy load when needed
                    })
    print(f"\nTotal coded documents found: {count}")
    print(f"Unique document codes: {len(documents)}")
    return documents
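One detail of the filter in the source above is that `filename.endswith('.pdf')` is case-sensitive, so a file named `REPORT.PDF` is skipped. The sketch below contrasts the original check with a case-insensitive variant. The helper name `is_visible_pdf` and the sample filenames are hypothetical and not part of the original module.

```python
def is_visible_pdf(filename: str) -> bool:
    """Hypothetical variant of the original check that ignores extension case."""
    return filename.lower().endswith(".pdf") and not filename.startswith(".")


# Made-up filenames for illustration only.
names = ["SOP-0001.pdf", "SOP-0002.PDF", ".hidden.pdf", "notes.txt"]

# The check exactly as written in scan_wuxi2_folder.
original_matches = [
    n for n in names if n.endswith(".pdf") and not n.startswith(".")
]

# The relaxed check also accepts upper-case extensions.
relaxed_matches = [n for n in names if is_visible_pdf(n)]
```

Whether upper-case extensions should be included depends on how files are named in the repository being scanned.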
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| folder_path | str | - | positional_or_keyword |
Parameter Details
folder_path: String path to the root directory of the wuxi2 repository to scan. Should be an absolute or relative path to a valid directory containing PDF files. The function will recursively traverse all subdirectories from this starting point.
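Because `os.walk` ignores listing errors by default, passing a path that does not exist yields no entries rather than raising, so the scan would simply report zero documents. A minimal sketch of a pre-check follows. The helper `require_directory` is a hypothetical name and is not part of the original module.

```python
import os


def require_directory(folder_path: str) -> str:
    """Hypothetical guard: fail loudly instead of silently scanning nothing."""
    if not os.path.isdir(folder_path):
        raise NotADirectoryError(f"Not a directory: {folder_path}")
    # Returning an absolute path makes the 'filepath' values absolute as well.
    return os.path.abspath(folder_path)
```

A caller could then write `scan_wuxi2_folder(require_directory(path))`, assuming `scan_wuxi2_folder` is in scope.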
Return Value
Type: Dict[str, List[Dict]]
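Returns a dictionary (a collections.defaultdict of lists) where keys are document codes (strings) extracted from PDF filenames, and values are lists of dictionaries. Each inner dictionary contains metadata for a single PDF file: 'code' (document code), 'filename' (original filename), 'filepath' (the result of os.path.join(root, filename), which is absolute only if folder_path is absolute), 'relative_path' (the directory containing the file, relative to folder_path; '.' for files directly in folder_path), 'size' (file size in bytes), 'hash' (file hash from the calculate_file_hash function), and 'signature_info' (initially None, intended for lazy loading). Multiple files can share the same code if they are variants. Because the result is a defaultdict, looking up a missing code with documents[code] inserts an empty list; use documents.get(code) or wrap the result in dict() to avoid this.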
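To illustrate how the returned structure might be consumed, the sketch below picks out codes that have more than one file and, within one code, groups files whose hashes match. The helper names `codes_with_variants` and `identical_copies` and all sample values are hypothetical; only the dictionary keys come from the function described above.

```python
from collections import defaultdict
from typing import Dict, List


def codes_with_variants(document_map: Dict[str, List[Dict]]) -> Dict[str, List[Dict]]:
    """Keep only the codes that map to more than one file."""
    return {code: files for code, files in document_map.items() if len(files) > 1}


def identical_copies(files: List[Dict]) -> List[List[str]]:
    """Group file paths that share a hash; return groups with 2+ members."""
    by_hash = defaultdict(list)
    for info in files:
        by_hash[info["hash"]].append(info["filepath"])
    return [paths for paths in by_hash.values() if len(paths) > 1]


# Made-up sample shaped like the function's return value (subset of keys).
sample_map = {
    "SOP-0001": [
        {"filepath": "a/SOP-0001.pdf", "hash": "h1"},
        {"filepath": "b/SOP-0001_v2.pdf", "hash": "h1"},
        {"filepath": "c/SOP-0001_renumbered.pdf", "hash": "h2"},
    ],
    "SOP-0002": [
        {"filepath": "a/SOP-0002.pdf", "hash": "h3"},
    ],
}

variants = codes_with_variants(sample_map)
duplicates = identical_copies(sample_map["SOP-0001"])
```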
Dependencies
- os
- hashlib
- collections
Required Imports
import os
from collections import defaultdict
from typing import Dict, List
Usage Example
import hashlib
import re

# Assuming scan_wuxi2_folder is in scope and these helper functions are
# defined in the same module
def extract_document_code(filename):
    # Example implementation: matches codes such as "SOP-0001"
    match = re.search(r'([A-Z]{2,}-\d{4,})', filename)
    return match.group(1) if match else None

def calculate_file_hash(filepath):
    # Example implementation: reads the whole file into memory
    with open(filepath, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

# Scan the repository
repository_path = '/path/to/wuxi2/documents'
document_map = scan_wuxi2_folder(repository_path)

# Access results
print(f"Found {len(document_map)} unique document codes")
for code, files in document_map.items():
    print(f"Code {code}: {len(files)} file(s)")
    for file_info in files:
        print(f"  - {file_info['filename']} ({file_info['size']} bytes)")
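The example `calculate_file_hash` above reads each file fully into memory, which may be costly for large PDFs. A minimal sketch of a chunked alternative follows. The name `calculate_file_hash_chunked` is hypothetical, and MD5 is used only because the usage example uses it; the hash algorithm of the real helper is not shown in the source.

```python
import hashlib


def calculate_file_hash_chunked(filepath: str, chunk_size: int = 1024 * 1024) -> str:
    """Hypothetical helper: hash a file in fixed-size chunks."""
    digest = hashlib.md5()
    with open(filepath, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest is the same as hashing the full contents in one call, so results stay comparable with hashes computed the other way.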
Best Practices
- Ensure the helper functions 'extract_document_code' and 'calculate_file_hash' are properly defined before calling this function
- For large repositories, be aware that this function loads all metadata into memory; consider pagination or streaming for very large datasets
- The function prints progress messages to stdout; redirect or suppress output if running in a non-interactive environment
- File hashing is performed for every PDF which can be I/O intensive; consider caching results for repeated scans
- The 'signature_info' field is set to None for lazy loading; implement a separate function to populate this field when needed to avoid unnecessary processing
- Hidden files (starting with '.') are automatically excluded from scanning
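- The function calls os.walk with its default followlinks=False, so subdirectories reached through symbolic links are not traversed; if that behaviour is changed to followlinks=True, be cautious of circular references in the directory structure, which can cause infinite recursion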
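The caching suggestion in the list above could be sketched as a wrapper around whatever hash function is in use. This is an assumption-laden illustration, not part of the original module: `make_cached_hasher` is a hypothetical name, the cache lives only in memory for the current process, and a file is treated as unchanged when its absolute path, size, and modification time are all the same.

```python
import os
from typing import Callable, Dict, Tuple

CacheKey = Tuple[str, int, int]


def make_cached_hasher(hash_func: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap hash_func so unchanged files are hashed only once per process."""
    cache: Dict[CacheKey, str] = {}

    def cached(filepath: str) -> str:
        stat = os.stat(filepath)
        # Assumption: same path, size, and mtime means same contents.
        key = (os.path.abspath(filepath), stat.st_size, stat.st_mtime_ns)
        if key not in cache:
            cache[key] = hash_func(filepath)
        return cache[key]

    return cached
```

To benefit repeated scans across separate runs, the cache would need to be persisted (for example to a file), which this sketch does not attempt.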
Similar Components
AI-powered semantic similarity - components with related functionality:
- function scan_wuxi2_folder 91.7% similar
- function scan_output_folder_v1 69.5% similar
- function find_best_folder 67.1% similar
- function scan_output_folder 66.3% similar
- function compare_documents_v1 61.8% similar