function clean_text_for_xml

Maturity: 45

Sanitizes text by removing or replacing XML-incompatible characters to ensure compatibility with Word document XML structure.

File: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py
Lines: 44 - 73
Complexity: moderate

Purpose

This function prepares text for safe insertion into Word documents by ensuring all characters comply with XML 1.0 specifications. It removes control characters, null bytes, and other problematic characters that could corrupt Word document XML structure. The function first applies general text cleaning, then filters characters based on XML 1.0 valid character ranges (tab, newline, carriage return, and characters in ranges 0x20-0xD7FF, 0xE000-0xFFFD, and 0x10000-0x10FFFF).
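
The character ranges listed above can be written as a single predicate. The following is a minimal sketch for illustration, not code from the source file; the helper name `is_xml10_char` is hypothetical.

```python
def is_xml10_char(char):
    """Return True if a single character falls in the XML 1.0 ranges listed above."""
    code = ord(char)
    return (
        code in (0x09, 0x0A, 0x0D)          # tab, newline, carriage return
        or 0x20 <= code <= 0xD7FF           # most of the Basic Multilingual Plane
        or 0xE000 <= code <= 0xFFFD         # skips surrogates, U+FFFE and U+FFFF
        or 0x10000 <= code <= 0x10FFFF      # supplementary planes
    )


# Letter and tab are allowed; the null byte is not.
print([is_xml10_char(c) for c in "A\t\x00"])  # [True, True, False]
```

Surrogate code points (0xD800-0xDFFF) and the non-characters 0xFFFE and 0xFFFF fall in the gaps between these ranges, which is why they are rejected.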

Source Code

def clean_text_for_xml(text):
    """Clean text to be XML compatible for Word documents."""
    if not text:
        return ""
    
    # First apply general cleaning
    text = clean_text(text)
    
    # Remove or replace XML-incompatible characters
    # Remove null bytes and control characters except tab, newline, carriage return
    text = ''.join(char for char in text if ord(char) >= 32 or char in '\t\n\r')
    
    # Replace any remaining problematic characters
    text = text.replace('\x00', '')  # Remove null bytes
    text = text.replace('\x0b', ' ')  # Replace vertical tab with space
    text = text.replace('\x0c', ' ')  # Replace form feed with space
    
    # Ensure only valid XML characters (XML 1.0 specification)
    cleaned = ''
    for char in text:
        code = ord(char)
        if (code == 0x09 or code == 0x0A or code == 0x0D or 
            (0x20 <= code <= 0xD7FF) or 
            (0xE000 <= code <= 0xFFFD) or 
            (0x10000 <= code <= 0x10FFFF)):
            cleaned += char
        else:
            cleaned += ' '  # Replace invalid characters with space
    
    return cleaned

Parameters

Name  Type  Default  Kind
text  -     -        positional_or_keyword

Parameter Details

text: Input string to be cleaned for XML compatibility. May be None or an empty string (or any other falsy value), in which case the function returns an empty string. Expected to be text content that will be inserted into a Word document's XML structure.

Return Value

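Returns a string containing only characters allowed by XML 1.0. Control characters below 0x20 other than tab, newline, and carriage return are removed outright; any remaining invalid code points (surrogates 0xD800-0xDFFF, 0xFFFE, and 0xFFFF) are replaced with a space. If the input is None or empty, returns an empty string (''). Because the returned string contains no characters that XML 1.0 forbids, it will not trigger invalid-character errors when written into Word document XML.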

Usage Example

# Minimal stand-in for clean_text so the example runs on its own;
# the real function also removes HTML tags and markdown formatting
def clean_text(text):
    return text.strip() if text else ''

def clean_text_for_xml(text):
    if not text:
        return ''
    text = clean_text(text)
    text = ''.join(char for char in text if ord(char) >= 32 or char in '\t\n\r')
    text = text.replace('\x00', '')
    text = text.replace('\x0b', ' ')
    text = text.replace('\x0c', ' ')
    cleaned = ''
    for char in text:
        code = ord(char)
        if (code == 0x09 or code == 0x0A or code == 0x0D or 
            (0x20 <= code <= 0xD7FF) or 
            (0xE000 <= code <= 0xFFFD) or 
            (0x10000 <= code <= 0x10FFFF)):
            cleaned += char
        else:
            cleaned += ' '
    return cleaned

# Example usage
raw_text = 'Hello\x00World\x0b\x0cTest\u0001Data'
cleaned = clean_text_for_xml(raw_text)
print(cleaned)  # Output: HelloWorldTestData (all four control characters are below 0x20 and are dropped)

# Handle None or empty input
result = clean_text_for_xml(None)
print(result)  # Output: ''

result = clean_text_for_xml('')
print(result)  # Output: ''

Best Practices

  • Always use this function before inserting user-generated or external text into Word document XML structures to prevent document corruption
  • The function depends on a 'clean_text' function that must be defined or imported in the same scope
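  • Control characters below 0x20 (other than tab, newline, and carriage return) are removed entirely, so the vertical-tab and form-feed replacements that follow never match; only the remaining invalid code points (surrogates, 0xFFFE, 0xFFFF) are replaced with spaces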
  • The function handles None and empty string inputs gracefully by returning empty strings
  • This function targets the XML 1.0 character set, which is the XML version used by Word documents
  • Consider the performance impact when processing very large text strings: the function iterates character by character and builds the result through repeated string concatenation
  • The function preserves tabs, newlines, and carriage returns as they are valid in XML and useful for formatting
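
For large inputs, the same filtering can be done with two precompiled regular expressions instead of a per-character loop. The following is a sketch under stated assumptions, not the source file's implementation: the name `strip_invalid_xml_chars` is hypothetical, and the `clean_text` pre-processing step is deliberately left out so the block runs on its own.

```python
import re

# Control characters below 0x20, except tab (0x09), newline (0x0A)
# and carriage return (0x0D). These are dropped, as in the original.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

# Remaining code points outside the XML 1.0 ranges:
# surrogates, U+FFFE and U+FFFF. These become a space.
_OTHER_INVALID = re.compile(r"[\ud800-\udfff\ufffe\uffff]")


def strip_invalid_xml_chars(text):
    """Hypothetical regex-based variant; does not call clean_text."""
    if not text:
        return ""
    text = _CONTROL_CHARS.sub("", text)
    return _OTHER_INVALID.sub(" ", text)


print(strip_invalid_xml_chars("Hello\x00World\x0b\x0cTest\x01Data"))  # HelloWorldTestData
print(repr(strip_invalid_xml_chars("a\ufffeb")))  # 'a b'
```

Both patterns are compiled once at import time, and each substitution makes a single pass over the string, which avoids the repeated concatenation of the loop-based version.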

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function clean_text 71.0% similar

    Cleans and normalizes text content by removing HTML tags, normalizing whitespace, and stripping markdown formatting elements.

    From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py
  • function sanitize_folders 47.5% similar

    Recursively traverses a directory tree and sanitizes folder names by removing non-ASCII characters, renaming folders to ASCII-only versions.

    From: /tf/active/vicechatdev/creation_updater.py
  • function create_word_report 43.2% similar

    Generates a formatted Microsoft Word document report containing warranty disclosures with a table of contents, metadata, and structured sections for each warranty.

    From: /tf/active/vicechatdev/convert_disclosures_to_table.py
  • function create_word_report_improved 42.9% similar

    Generates a formatted Microsoft Word document report containing warranty disclosures with table of contents, structured sections, and references.

    From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py
  • function main 41.5% similar

    Orchestrates the conversion of an improved markdown file containing warranty disclosures into multiple tabular formats (CSV, Excel, Word) with timestamp-based file naming.

    From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py