🔍 Code Extractor

function extract_document_code

Maturity: 53

Extracts a structured document code (e.g., '4.5.38.2') from a filename using regex pattern matching.

File: /tf/active/vicechatdev/mailsearch/compare_documents.py
Lines: 27 - 40
Complexity: simple

Purpose

This function parses a filename and extracts the hierarchical document code at its start. Such prefixes are typical of document management systems, where files carry numerical codes for organization and categorization. The function returns None when no code pattern matches, so it can be used in filtering and validation workflows as long as the caller checks for None.
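
As an illustration of the filtering use case, the sketch below separates coded filenames from uncoded ones. The CODE_PATTERN shown is an assumption (the definition in compare_documents.py is not shown here), and split_by_code is a hypothetical helper, not part of the module.

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumed pattern: digits separated by dots at the start of the string.
# The real CODE_PATTERN in compare_documents.py may differ.
CODE_PATTERN = re.compile(r'^(\d+(?:\.\d+)+)')


def extract_document_code(filename: str) -> Optional[str]:
    match = CODE_PATTERN.match(filename)
    return match.group(1) if match else None


def split_by_code(filenames: List[str]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Hypothetical helper: group filenames by code, collect the rest."""
    coded: Dict[str, List[str]] = {}
    uncoded: List[str] = []
    for name in filenames:
        code = extract_document_code(name)
        if code is None:
            uncoded.append(name)  # no code prefix found
        else:
            coded.setdefault(code, []).append(name)
    return coded, uncoded


coded, uncoded = split_by_code([
    '4.5.38.2 Document Name.pdf',
    'Document Without Code.pdf',
    '4.5.38.2 Document Name (signed).pdf',
])
```

Grouping by code rather than just filtering keeps files that share a prefix together, which may be convenient when comparing versions of the same document.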

Source Code

def extract_document_code(filename: str) -> Optional[str]:
    """
    Extract document code from filename (e.g., '4.5.38.2' from '4.5.38.2 Document Name.pdf')
    
    Args:
        filename: The filename to extract code from
        
    Returns:
        Document code or None if no code found
    """
    match = CODE_PATTERN.match(filename)
    if match:
        return match.group(1)
    return None

Parameters

Name      Type  Default  Kind
filename  str   -        positional_or_keyword

Parameter Details

filename: The filename from which to extract the document code. The code is expected at the very start of the string (e.g., '4.5.38.2 Document Name.pdf'). Because CODE_PATTERN.match() only matches at the start of the string, pass the bare filename rather than a full path; a path with a directory prefix will normally return None. There is no constraint on length, but the value must be a str.
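
When the input may be a full path, one option is to strip the directory part first. The sketch below assumes the same CODE_PATTERN as the usage example on this page; extract_code_from_path is a hypothetical wrapper, not a function in the module.

```python
import os.path
import re
from typing import Optional

# Assumed pattern; the real CODE_PATTERN in compare_documents.py may differ.
CODE_PATTERN = re.compile(r'^(\d+(?:\.\d+)+)')


def extract_document_code(filename: str) -> Optional[str]:
    match = CODE_PATTERN.match(filename)
    return match.group(1) if match else None


def extract_code_from_path(path: str) -> Optional[str]:
    """Hypothetical wrapper: drop directories, then extract the code."""
    return extract_document_code(os.path.basename(path))


direct = extract_document_code('/path/to/1.2.3 Report.docx')    # path prefix blocks the match
wrapped = extract_code_from_path('/path/to/1.2.3 Report.docx')  # basename is '1.2.3 Report.docx'
```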

Return Value

Type: Optional[str]

Returns the extracted document code as a string (e.g., '4.5.38.2') when CODE_PATTERN matches, or None when it does not. The returned value is the first capturing group of the match.

Dependencies

  • re

Required Imports

import re
from typing import Optional

Usage Example

import re
from typing import Optional

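# CODE_PATTERN must be defined at module level before the function is called.
# The pattern below is an assumed example; the definition used in
# compare_documents.py is not shown on this page.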
CODE_PATTERN = re.compile(r'^(\d+(?:\.\d+)+)')

def extract_document_code(filename: str) -> Optional[str]:
    match = CODE_PATTERN.match(filename)
    if match:
        return match.group(1)
    return None

# Example usage
filename1 = '4.5.38.2 Document Name.pdf'
code1 = extract_document_code(filename1)
print(code1)  # Output: 4.5.38.2

filename2 = 'Document Without Code.pdf'
code2 = extract_document_code(filename2)
print(code2)  # Output: None

filename3 = '/path/to/1.2.3 Report.docx'
code3 = extract_document_code(filename3)
print(code3)  # Output: None (pattern matches start of string, not basename)
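
A few edge cases are worth checking against whatever pattern is actually in use. The sketch below runs them against the assumed pattern from the example above; results may differ with the real CODE_PATTERN.

```python
import re
from typing import Optional

# Assumed pattern, copied from the usage example; not confirmed by the source file.
CODE_PATTERN = re.compile(r'^(\d+(?:\.\d+)+)')


def extract_document_code(filename: str) -> Optional[str]:
    match = CODE_PATTERN.match(filename)
    return match.group(1) if match else None


samples = [
    '4 Document.pdf',        # single number, no dot: the pattern needs at least one '.'
    ' 4.5 Document.pdf',     # leading space: match() anchors at position 0
    '4.5.38.2_Document.pdf', # underscore separator: the code is still captured
    '2024.01 Report.pdf',    # date-like prefix: also matches the assumed pattern
]
results = {name: extract_document_code(name) for name in samples}
for name, code in results.items():
    print(f'{name!r} -> {code!r}')
```

Under this assumed pattern, a date-like prefix is indistinguishable from a document code, so callers that care about the difference would need an additional check.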

Best Practices

  • Ensure CODE_PATTERN is defined as a module-level constant before calling this function
  • The function uses CODE_PATTERN.match(), which only matches at the start of the string. When processing full paths, extract the basename first with os.path.basename() or pathlib.Path(...).name
  • CODE_PATTERN must contain at least one capturing group; otherwise match.group(1) raises IndexError
  • Handle the None return value appropriately in calling code to avoid NoneType errors
  • Consider validating the extracted code format if specific hierarchical structures are required
  • For batch processing, compile the regex pattern once at module level rather than inside the function for better performance
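
The validation suggested above could look like the following sketch. parse_code and its max_depth limit are hypothetical, chosen only for illustration; adjust the rules to the hierarchy your documents actually use.

```python
from typing import Optional, Tuple


def parse_code(code: str, max_depth: int = 6) -> Optional[Tuple[int, ...]]:
    """Hypothetical validator: return the code as a tuple of ints, or None.

    A code is accepted when it has between 2 and max_depth dot-separated
    parts and every part consists of ASCII digits only.
    """
    parts = code.split('.')
    if not 2 <= len(parts) <= max_depth:
        return None
    if not all(p.isascii() and p.isdigit() for p in parts):
        return None
    return tuple(int(p) for p in parts)


# Integer tuples sort numerically, so '4.10.1' comes after '4.5.38.2'.
codes = ['4.10.1', '4.5.38.2', '4.5.2']
ordered = sorted(codes, key=parse_code)
```

Converting to integers avoids the string-ordering pitfall where '4.10' would sort before '4.5'. Note that the sort above assumes every code is valid; filter out codes for which parse_code returns None before sorting.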

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function extract_document_code_v1 84.1% similar

    Extracts a structured document code (e.g., 2.13.4.3.3.2) from a filename using regex pattern matching.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function extract_code_parts 58.2% similar

    Splits a document code string into its component parts using a period (.) as the delimiter.

    From: /tf/active/vicechatdev/mailsearch/copy_signed_documents.py
  • function get_document_type_code 53.8% similar

    Retrieves a document type code from a dictionary lookup using the provided document type name, returning the name itself if no mapping exists.

    From: /tf/active/vicechatdev/CDocs/settings_prod.py
  • function fuzzy_match_filename 53.1% similar

    Calculates a fuzzy match similarity score between two filenames by comparing them after normalization, using exact matching, substring containment, and word overlap techniques.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py
  • function is_valid_document_file 51.3% similar

    Validates whether a given filename has an extension corresponding to a supported document type by checking against a predefined list of valid document extensions.

    From: /tf/active/vicechatdev/CDocs/utils/__init__.py