function extract_document_code
Extracts a structured document code (e.g., '4.5.38.2') from a filename using regex pattern matching.
/tf/active/vicechatdev/mailsearch/compare_documents.py
27 - 40
simple
Purpose
This function parses filenames to extract hierarchical document codes that appear at the beginning of document names. This is useful in document management systems where files are prefixed with numerical codes for organization and categorization. Because the function returns None when no code pattern matches, its result can be used directly in filtering and validation workflows.
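As an illustration of the filtering use described above, here is a minimal sketch that groups filenames by their document code and skips names without one. The `CODE_PATTERN` shown is an assumption (the module's real pattern is not reproduced in this entry), and `group_by_code` is a hypothetical helper, not part of the source module.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional

# Assumption: the module's actual CODE_PATTERN is not shown; this one
# matches dotted numeric codes such as '4.5.38.2' at the start of a string.
CODE_PATTERN = re.compile(r'^(\d+(?:\.\d+)+)')


def extract_document_code(filename: str) -> Optional[str]:
    match = CODE_PATTERN.match(filename)
    if match:
        return match.group(1)
    return None


def group_by_code(filenames: List[str]) -> Dict[str, List[str]]:
    """Hypothetical helper: group filenames by code, skipping uncoded names."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in filenames:
        code = extract_document_code(name)
        if code is not None:  # None means no code prefix was found
            groups[code].append(name)
    return dict(groups)


files = ['4.5.38.2 Spec.pdf', '4.5.38.2 Spec.docx', 'Notes.txt', '1.2 Plan.pdf']
grouped = group_by_code(files)
```

Checking `code is not None` explicitly keeps the skip condition tied to the function's documented "no match" result.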
Source Code
def extract_document_code(filename: str) -> Optional[str]:
    """
    Extract document code from filename (e.g., '4.5.38.2' from '4.5.38.2 Document Name.pdf')

    Args:
        filename: The filename to extract code from

    Returns:
        Document code or None if no code found
    """
    match = CODE_PATTERN.match(filename)
    if match:
        return match.group(1)
    return None
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| filename | str | - | positional_or_keyword |
Parameter Details
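filename: A string from which to extract the document code. It is expected to begin with a numerical code pattern (e.g., '4.5.38.2 Document Name.pdf'). Because the function uses CODE_PATTERN.match(), which only matches at the start of the string, a full path such as '/path/to/1.2.3 Report.docx' will not match a pattern that expects the code first; pass the basename instead. There are no constraints on length, but the value must be a str (passing None raises TypeError).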
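Since `re.Pattern.match` only matches at the beginning of the string, a leading directory prevents a start-anchored pattern from finding the code. A minimal sketch of a wrapper that strips the directory first is shown below; `extract_code_from_path` is a hypothetical helper, and the `CODE_PATTERN` is assumed rather than taken from the source module.

```python
import os
import re
from typing import Optional

# Assumption: stand-in for the module's CODE_PATTERN, which is not shown here.
CODE_PATTERN = re.compile(r'^(\d+(?:\.\d+)+)')


def extract_document_code(filename: str) -> Optional[str]:
    match = CODE_PATTERN.match(filename)
    if match:
        return match.group(1)
    return None


def extract_code_from_path(path: str) -> Optional[str]:
    """Hypothetical helper: reduce a path to its basename before matching."""
    return extract_document_code(os.path.basename(path))


# The raw path starts with '/', so the anchored pattern cannot match.
direct = extract_document_code('/path/to/1.2.3 Report.docx')

# After stripping the directory, the code is at the start of the string.
via_basename = extract_code_from_path('/path/to/1.2.3 Report.docx')
```

Keeping the basename step in a separate wrapper leaves `extract_document_code` unchanged for callers that already pass bare filenames.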
Return Value
Type: Optional[str]
Returns the extracted document code as a string (e.g., '4.5.38.2') when CODE_PATTERN matches at the start of the filename, or None when it does not. The returned code is the first capturing group of the CODE_PATTERN match.
Dependencies
re
Required Imports
import re
from typing import Optional
Usage Example
import re
from typing import Optional

# Define the CODE_PATTERN (must be defined before using the function)
CODE_PATTERN = re.compile(r'^(\d+(?:\.\d+)+)')

def extract_document_code(filename: str) -> Optional[str]:
    match = CODE_PATTERN.match(filename)
    if match:
        return match.group(1)
    return None

# Example usage
filename1 = '4.5.38.2 Document Name.pdf'
code1 = extract_document_code(filename1)
print(code1)  # Output: 4.5.38.2

filename2 = 'Document Without Code.pdf'
code2 = extract_document_code(filename2)
print(code2)  # Output: None

filename3 = '/path/to/1.2.3 Report.docx'
code3 = extract_document_code(filename3)
print(code3)  # Output: None (pattern matches start of string, not basename)
Best Practices
- Define CODE_PATTERN as a module-level constant before calling this function; the function reads it as a global and raises NameError if it is missing
- CODE_PATTERN.match() only matches at the start of the string, so when processing full paths, extract the basename first using os.path.basename() or Path(...).name
- The regex pattern must contain at least one capturing group, because the function returns match.group(1); without one, that call raises IndexError
- Check for None before using the result (for example, before calling string methods on it) to avoid AttributeError or TypeError in calling code
- Consider validating the extracted code format if specific hierarchical structures are required
- Compile the regex pattern once at module level rather than inside the function, so repeated calls during batch processing reuse the compiled object
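The validation advice can be sketched as follows, under stated assumptions: the `CODE_PATTERN` is a stand-in for the module's real pattern, and `code_to_tuple` and `extract_code_with_depth` are hypothetical helpers introduced only for illustration. The sketch checks the number of hierarchy levels and sorts codes numerically rather than as strings.

```python
import re
from typing import Optional, Tuple

# Assumption: stand-in for the module's CODE_PATTERN, which is not shown here.
CODE_PATTERN = re.compile(r'^(\d+(?:\.\d+)+)')


def extract_document_code(filename: str) -> Optional[str]:
    match = CODE_PATTERN.match(filename)
    if match:
        return match.group(1)
    return None


def code_to_tuple(code: str) -> Tuple[int, ...]:
    """Hypothetical helper: '4.5.38.2' -> (4, 5, 38, 2) for numeric ordering."""
    return tuple(int(part) for part in code.split('.'))


def extract_code_with_depth(filename: str, depth: int) -> Optional[str]:
    """Hypothetical helper: return the code only if it has `depth` levels."""
    code = extract_document_code(filename)
    if code is None or len(code.split('.')) != depth:
        return None
    return code


names = ['4.10 Later.pdf', '4.5.38.2 Deep.pdf', '4.9 Earlier.pdf', 'README.md']

# Drop names without a code, then sort so that 4.9 comes before 4.10.
codes = [c for c in map(extract_document_code, names) if c is not None]
ordered = sorted(codes, key=code_to_tuple)
```

Sorting on integer tuples avoids the string-ordering pitfall where '4.10' would sort before '4.9'.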
Similar Components
AI-powered semantic similarity - components with related functionality:
- function extract_document_code_v1 (84.1% similar)
- function extract_code_parts (58.2% similar)
- function get_document_type_code (53.8% similar)
- function fuzzy_match_filename (53.1% similar)
- function is_valid_document_file (51.3% similar)