function scan_output_folder_v1
Scans a specified folder for PDF documents with embedded codes in their filenames, extracting metadata and signature information for each coded document found.
/tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
161 - 185
moderate
Purpose
This function inventories the PDF documents in an output folder by identifying files whose names contain a document code. For each match it records the filename, file path, size in bytes, content hash, and signature-detection results. This is useful for document management systems, compliance tracking, or automated document processing pipelines where documents are identified by codes and need to be tracked with their associated metadata.
Source Code
def scan_output_folder(folder_path: str) -> Dict[str, Dict]:
    """Scan output folder for coded documents"""
    print(f"\nScanning output folder: {folder_path}")

    documents = {}

    for filename in os.listdir(folder_path):
        if filename.endswith('.pdf') and not filename.startswith('.'):
            filepath = os.path.join(folder_path, filename)
            if not os.path.isfile(filepath):
                continue

            code = extract_document_code(filename)
            if code:
                print(f"  Found: {code} - {filename[:80]}")
                documents[code] = {
                    'code': code,
                    'filename': filename,
                    'filepath': filepath,
                    'size': os.path.getsize(filepath),
                    'hash': calculate_file_hash(filepath),
                    'signature_info': detect_signatures_in_pdf(filepath)
                }

    print(f"\nTotal coded documents in output: {len(documents)}")
    return documents
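The source above calls three helpers that are not shown on this page. The sketch below offers hypothetical stand-ins for two of them so that the function can be exercised in isolation. The code pattern (uppercase letters, a hyphen, then digits, as in `DOC-001`) and the choice of SHA-256 are assumptions, not the project's actual implementation.

```python
import hashlib
import re
from typing import Optional

# ASSUMPTION: codes look like "DOC-001" (uppercase letters, hyphen, digits).
# The real pattern used by the project is not shown in this document.
_CODE_PATTERN = re.compile(r"[A-Z]+-\d+")


def extract_document_code(filename: str) -> Optional[str]:
    """Hypothetical stand-in: return the first code found in the filename, else None."""
    match = _CODE_PATTERN.search(filename)
    return match.group(0) if match else None


def calculate_file_hash(filepath: str, chunk_size: int = 65536) -> str:
    """Hypothetical stand-in: SHA-256 hex digest of the file, read in chunks."""
    digest = hashlib.sha256()
    with open(filepath, "rb") as handle:
        # Read fixed-size chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat regardless of file size, which matters when a folder holds many large PDFs.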
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| folder_path | str | - | positional_or_keyword |
Parameter Details
folder_path: String path to the directory containing PDF documents to scan. Should be an absolute or relative path to an existing folder. The function will iterate through all files in this directory (non-recursive) looking for PDF files with extractable document codes in their filenames.
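Because the function passes `folder_path` straight to `os.listdir`, a caller may want to check the path first. The following is a minimal sketch of such a pre-check; `resolve_scan_folder` is a hypothetical helper name and is not part of the documented module.

```python
from pathlib import Path


def resolve_scan_folder(folder_path: str) -> Path:
    """Hypothetical pre-check: return the resolved folder or raise a clear error."""
    folder = Path(folder_path).expanduser().resolve()
    if not folder.exists():
        raise FileNotFoundError(f"Folder does not exist: {folder}")
    if not folder.is_dir():
        raise NotADirectoryError(f"Not a directory: {folder}")
    return folder
```

The resolved path can then be passed on as `scan_output_folder(str(resolve_scan_folder(path)))`.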
Return Value
Type: Dict[str, Dict]
Returns a dictionary where keys are extracted document codes (strings) and values are nested dictionaries containing metadata for each document. Each nested dictionary includes: 'code' (the extracted document code), 'filename' (original filename), 'filepath' (full path to the file), 'size' (file size in bytes), 'hash' (file content hash for integrity verification), and 'signature_info' (information about detected signatures in the PDF). Returns an empty dictionary if no coded documents are found.
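The shape described above can be written down as a typed structure, which may help when consuming the result. This is a sketch only: `DocumentRecord` and `total_size_bytes` are hypothetical names, the `hash` field is assumed to be a string, and `signature_info` is typed as `Any` because the return type of `detect_signatures_in_pdf` is not shown here. The sample data is hand-written, not real scan output.

```python
from typing import Any, Dict, TypedDict


class DocumentRecord(TypedDict):
    """Hypothetical name for one value in the returned dictionary."""
    code: str
    filename: str
    filepath: str
    size: int            # bytes, from os.path.getsize
    hash: str            # ASSUMED to be a string digest
    signature_info: Any  # structure depends on detect_signatures_in_pdf


def total_size_bytes(documents: Dict[str, DocumentRecord]) -> int:
    """Sum the 'size' field across all scanned documents."""
    return sum(record["size"] for record in documents.values())


# Hand-written sample in the documented shape (illustrative values only).
sample: Dict[str, DocumentRecord] = {
    "DOC-001": {
        "code": "DOC-001",
        "filename": "DOC-001_contract.pdf",
        "filepath": "/path/to/output/documents/DOC-001_contract.pdf",
        "size": 1024,
        "hash": "placeholder-digest",
        "signature_info": None,
    },
    "DOC-002": {
        "code": "DOC-002",
        "filename": "DOC-002_invoice.pdf",
        "filepath": "/path/to/output/documents/DOC-002_invoice.pdf",
        "size": 2048,
        "hash": "placeholder-digest",
        "signature_info": None,
    },
}

print(total_size_bytes(sample))
```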
Dependencies
- os
- PyPDF2
Required Imports
import os
from typing import Dict
import PyPDF2
Usage Example
# Assuming helper functions are defined
# extract_document_code, calculate_file_hash, detect_signatures_in_pdf

import os
from typing import Dict
import PyPDF2

# Example usage
output_folder = '/path/to/output/documents'
result = scan_output_folder(output_folder)

# Access scanned documents
for code, metadata in result.items():
    print(f"Document Code: {code}")
    print(f"  Filename: {metadata['filename']}")
    print(f"  Size: {metadata['size']} bytes")
    print(f"  Hash: {metadata['hash']}")
    print(f"  Signatures: {metadata['signature_info']}")

# Check if specific document exists
if 'DOC-001' in result:
    doc_path = result['DOC-001']['filepath']
    print(f"Found document at: {doc_path}")
Best Practices
- Ensure folder_path exists, is a directory, and is readable before calling this function; otherwise os.listdir raises FileNotFoundError, NotADirectoryError, or PermissionError
- The function only scans the immediate directory (non-recursive), so nested folders will not be processed
- Hidden files (starting with '.') are automatically excluded from scanning
- Only files whose names end in lowercase .pdf are processed; the check is case-sensitive, so files named .PDF are skipped along with all other file types
- The function prints progress information to stdout, which may need to be suppressed in production environments
- Ensure helper functions (extract_document_code, calculate_file_hash, detect_signatures_in_pdf) are properly implemented and handle errors gracefully
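- For large folders with many PDFs, this function may take significant time, because every matching file is hashed and inspected for signatures one at a time
- The returned dictionary uses document codes as keys, so if two files yield the same code the later one overwrites the earlier one; os.listdir returns entries in arbitrary order, so which file survives is not predictable
- The function has no exception handling of its own, so an error raised for one corrupted or inaccessible file aborts the whole scan; consider adding error handling around the per-file operations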
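Two of the points above, silencing the progress output and spotting duplicate codes before they overwrite each other, can be handled from the calling side without changing the function. The sketch below is one possible approach; `call_quietly` and `find_duplicate_codes` are hypothetical helper names, not part of the documented module.

```python
import contextlib
import io
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def call_quietly(func: Callable[..., T], *args, **kwargs) -> Tuple[T, str]:
    """Run func with stdout captured; return (result, captured_text)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args, **kwargs)
    return result, buffer.getvalue()


def find_duplicate_codes(
    filenames: Iterable[str],
    extract_code: Callable[[str], Optional[str]],
) -> Dict[str, List[str]]:
    """Group filenames by extracted code; keep only codes seen more than once."""
    by_code: Dict[str, List[str]] = defaultdict(list)
    for name in filenames:
        code = extract_code(name)
        if code:
            by_code[code].append(name)
    return {code: names for code, names in by_code.items() if len(names) > 1}
```

A caller could then write `documents, log = call_quietly(scan_output_folder, folder)` to keep stdout clean, and `find_duplicate_codes(os.listdir(folder), extract_document_code)` to list colliding files before scanning.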
Similar Components
AI-powered semantic similarity - components with related functionality:
- function scan_output_folder (81.9% similar)
- function scan_wuxi2_folder_v1 (69.5% similar)
- function scan_wuxi2_folder (62.2% similar)
- function main_v100 (58.0% similar)
- function compare_documents_v1 (57.4% similar)