
function clean_text_for_xml_v1

Maturity: 43

Sanitizes a text string for XML 1.0 compatibility by removing disallowed control characters and replacing any other invalid characters with spaces, so the text can be written safely into generated Word documents.

File: /tf/active/vicechatdev/enhanced_word_converter_fixed.py
Lines: 22 - 49
Complexity: simple

Purpose

This function prepares text content for safe insertion into XML-based Word documents (.docx format) by filtering out characters that would cause XML parsing errors. It works in two passes. The first pass removes null bytes and every other control character below U+0020 except tab, newline, and carriage return. The second pass replaces any remaining character outside the valid XML 1.0 ranges (in practice, surrogate code points and U+FFFE/U+FFFF) with a space. This is essential when processing user-generated content or data from external sources, which may contain characters that would otherwise corrupt Word document generation.
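The valid character ranges described above can be written as a small predicate. This is a minimal sketch rather than part of the original module; the helper name `is_xml10_char` is hypothetical and is used here only for illustration.

```python
def is_xml10_char(char: str) -> bool:
    """Return True if a single character falls in the XML 1.0 valid ranges."""
    code = ord(char)
    return (
        code in (0x09, 0x0A, 0x0D)       # tab, newline, carriage return
        or 0x20 <= code <= 0xD7FF        # excludes surrogates
        or 0xE000 <= code <= 0xFFFD      # excludes U+FFFE and U+FFFF
        or 0x10000 <= code <= 0x10FFFF   # supplementary planes
    )


print(is_xml10_char("\t"))      # tab is allowed
print(is_xml10_char("\x00"))    # the null byte is not
print(is_xml10_char("\ufffe"))  # U+FFFE is not
```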

Source Code

def clean_text_for_xml(text):
    """Clean text to be XML compatible for Word documents."""
    if not text:
        return ""
    
    # Remove or replace XML-incompatible characters
    # Remove null bytes and control characters except tab, newline, carriage return
    text = ''.join(char for char in text if ord(char) >= 32 or char in '\t\n\r')
    
    # No-ops in practice: the filter above has already removed \x00, \x0b, and \x0c
    text = text.replace('\x00', '')  # Remove null bytes
    text = text.replace('\x0b', ' ')  # Replace vertical tab with space
    text = text.replace('\x0c', ' ')  # Replace form feed with space
    
    # Ensure only valid XML characters (XML 1.0 specification)
    # Valid characters: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    cleaned = ''
    for char in text:
        code = ord(char)
        if (code == 0x09 or code == 0x0A or code == 0x0D or 
            (0x20 <= code <= 0xD7FF) or 
            (0xE000 <= code <= 0xFFFD) or 
            (0x10000 <= code <= 0x10FFFF)):
            cleaned += char
        else:
            cleaned += ' '  # Replace invalid characters with space
    
    return cleaned

Parameters

Name Type Default Kind
text - - positional_or_keyword

Parameter Details

text: Input string to be cleaned. Can be any text content, including user input, file contents, or data from external sources. None or an empty string (or any other falsy value) returns an empty string. The input may contain control characters, null bytes, or other XML-incompatible characters that need sanitization.

Return Value

Returns a cleaned string containing only XML 1.0 compatible characters. Control characters below U+0020 (other than tab, newline, and carriage return) are removed outright; any other invalid character, such as a surrogate code point or U+FFFE/U+FFFF, is replaced with a space. If the input is None or empty, returns an empty string (''). The returned string is safe to insert into XML structures used by python-docx for Word document generation.
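The difference between removal and replacement can be checked directly. The snippet below is a sketch: it uses a condensed restatement of the documented function, included only so the example runs on its own, and the inner helper `valid` is a hypothetical name.

```python
def clean_text_for_xml(text):
    """Condensed restatement of the documented function, for illustration only."""
    if not text:
        return ""
    # Pass 1: drop control characters below U+0020 except tab, newline, carriage return
    text = "".join(c for c in text if ord(c) >= 32 or c in "\t\n\r")

    # Pass 2: replace anything still outside the XML 1.0 ranges with a space
    def valid(code):
        return (code in (0x09, 0x0A, 0x0D) or 0x20 <= code <= 0xD7FF
                or 0xE000 <= code <= 0xFFFD or 0x10000 <= code <= 0x10FFFF)

    return "".join(c if valid(ord(c)) else " " for c in text)


removed = clean_text_for_xml("a\x00b\x0bc")  # control characters are dropped
replaced = clean_text_for_xml("a\ufffeb")    # U+FFFE becomes a space
print(repr(removed))   # 'abc'
print(repr(replaced))  # 'a b'
```

Because removed characters leave no gap, words separated only by a control character end up joined.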

Usage Example

# Basic usage
text_with_invalid_chars = "Hello\x00World\x0b\x0cTest\t\nValid"
cleaned = clean_text_for_xml(text_with_invalid_chars)
print(repr(cleaned))  # Output: 'HelloWorldTest\t\nValid' (control characters are removed, not replaced)

# Handle None input
result = clean_text_for_xml(None)
print(result)  # Output: ''

# Handle empty string
result = clean_text_for_xml('')
print(result)  # Output: ''

# Use with python-docx
from docx import Document

user_input = "User\x00provided\x0btext"
safe_text = clean_text_for_xml(user_input)

doc = Document()
doc.add_paragraph(safe_text)
doc.save('output.docx')

Best Practices

  • Always use this function when inserting user-generated content or external data into Word documents to prevent XML parsing errors
  • Call this function before passing text to python-docx methods like add_paragraph(), add_heading(), or add_run()
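  • Be aware that disallowed control characters are removed without leaving a gap (which can join adjacent words), while other invalid characters are replaced with spaces; both can affect text formatting or length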
  • For performance-critical applications with large text volumes, consider caching results if the same text is processed multiple times
  • This function preserves tabs (\t), newlines (\n), and carriage returns (\r) as they are valid in XML and useful for document formatting
  • The function is defensive and returns an empty string for None input, so no None check is needed before calling it
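For large inputs, one alternative to the character-by-character loop is a pair of precompiled regular expressions. The following is a sketch under the assumption that the same two-pass behavior is wanted (remove low control characters, replace other invalid characters with a space); the name `clean_text_for_xml_regex` and the pattern constants are hypothetical and do not exist in the original module.

```python
import re

# Control characters below U+0020 other than tab, newline, carriage return (removed)
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
# Surrogate code points and U+FFFE / U+FFFF (replaced with a space)
_OTHER_INVALID = re.compile(r"[\ud800-\udfff\ufffe\uffff]")


def clean_text_for_xml_regex(text):
    """Regex-based variant intended to mirror the documented function's behavior."""
    if not text:
        return ""
    text = _CONTROL_CHARS.sub("", text)
    return _OTHER_INVALID.sub(" ", text)


print(repr(clean_text_for_xml_regex("a\x00b\ufffec\td")))
```

The patterns are compiled once at import time, so repeated calls avoid both the per-character Python loop and the repeated string concatenation.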

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function clean_text_for_xml 97.0% similar

    Sanitizes text by removing or replacing XML-incompatible characters to ensure compatibility with Word document XML structure.

    From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py
  • function clean_text 72.6% similar

    Cleans and normalizes text content by removing HTML tags, normalizing whitespace, and stripping markdown formatting elements.

    From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py
  • function xml_escape 63.3% similar

    Escapes special XML characters in a string by replacing them with their corresponding XML entity references.

    From: /tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/office365/runtime/auth/providers/saml_token_provider.py
  • function create_enhanced_word_document 50.7% similar

    Converts markdown-formatted warranty disclosure content into a formatted Microsoft Word document with hierarchical headings, styled text, lists, and special formatting for block references.

    From: /tf/active/vicechatdev/improved_word_converter.py
  • function create_enhanced_word_document_v1 50.6% similar

    Converts markdown content into a formatted Microsoft Word document with proper styling, table of contents, warranty sections, and reference handling for Project Victoria warranty disclosures.

    From: /tf/active/vicechatdev/enhanced_word_converter_fixed.py