function clean_text_for_xml
Sanitizes text by removing or replacing XML-incompatible characters to ensure compatibility with Word document XML structure.
/tf/active/vicechatdev/improved_convert_disclosures_to_table.py
44 - 73
moderate
Purpose
This function prepares text for insertion into Word documents by ensuring every character is permitted by the XML 1.0 specification. It first applies general text cleaning via clean_text, then strips null bytes and other control characters below 0x20 (keeping tab, newline, and carriage return), which could otherwise corrupt the Word document's XML structure. Finally, it checks each remaining character against the XML 1.0 valid character ranges (0x09, 0x0A, 0x0D, 0x20-0xD7FF, 0xE000-0xFFFD, and 0x10000-0x10FFFF) and replaces anything outside them with a space.
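As a minimal sketch of the range check described above, the XML 1.0 valid-character test can be written as a standalone predicate. The helper name `is_xml10_char` is hypothetical and is not part of the documented module; only the ranges themselves come from the function's source.

```python
def is_xml10_char(ch: str) -> bool:
    """Return True if the single character ch is allowed by XML 1.0."""
    code = ord(ch)
    return (
        code in (0x09, 0x0A, 0x0D)        # tab, newline, carriage return
        or 0x20 <= code <= 0xD7FF         # most of the Basic Multilingual Plane
        or 0xE000 <= code <= 0xFFFD       # skips surrogates, U+FFFE and U+FFFF
        or 0x10000 <= code <= 0x10FFFF    # supplementary planes
    )


# A few representative characters and whether XML 1.0 permits them.
samples = ["A", "\t", "\x00", "\x0b", "\ud800", "\ufffe", "\U0001F600"]
allowed = [is_xml10_char(ch) for ch in samples]
```

Here `allowed` marks the letter, the tab, and the supplementary-plane character as valid, and the null byte, vertical tab, lone surrogate, and U+FFFE as invalid.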
Source Code
def clean_text_for_xml(text):
    """Clean text to be XML compatible for Word documents."""
    if not text:
        return ""

    # First apply general cleaning
    text = clean_text(text)

    # Remove or replace XML-incompatible characters
    # Remove null bytes and control characters except tab, newline, carriage return
    text = ''.join(char for char in text if ord(char) >= 32 or char in '\t\n\r')

    # Replace any remaining problematic characters
    text = text.replace('\x00', '')  # Remove null bytes
    text = text.replace('\x0b', ' ')  # Replace vertical tab with space
    text = text.replace('\x0c', ' ')  # Replace form feed with space

    # Ensure only valid XML characters (XML 1.0 specification)
    cleaned = ''
    for char in text:
        code = ord(char)
        if (code == 0x09 or code == 0x0A or code == 0x0D or
                (0x20 <= code <= 0xD7FF) or
                (0xE000 <= code <= 0xFFFD) or
                (0x10000 <= code <= 0x10FFFF)):
            cleaned += char
        else:
            cleaned += ' '  # Replace invalid characters with space

    return cleaned
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| text | - | - | positional_or_keyword |
Parameter Details
text: Input string to be cleaned for XML compatibility. Can be None or empty string, which will return an empty string. Expected to be any text content that needs to be inserted into a Word document's XML structure.
Return Value
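Returns a string containing only XML 1.0 compatible characters. Control characters below 0x20 (other than tab, newline, and carriage return) are removed outright, because the first filtering pass drops them before the later replacement steps run. Any remaining invalid characters (lone surrogates, U+FFFE, and U+FFFF) are replaced with a space. If the input is None or empty, the function returns an empty string (''). The returned string contains no characters that XML 1.0 forbids; markup characters such as '&' and '<' are valid characters and pass through unchanged, so escaping them remains the job of whatever writes the XML.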
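The following sketch illustrates why character validity and XML escaping are separate concerns. It uses only the standard library (`xml.etree.ElementTree` and `xml.sax.saxutils.escape`); the sample text and the `<t>` element are made up for illustration. It assumes XML is being assembled by hand, since libraries that build Word documents typically escape text themselves.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Every character here is a valid XML 1.0 character, so a character-range
# filter such as clean_text_for_xml would leave the string untouched.
text = "Fees < 5% & rising"

# Dropping the raw text into hand-built XML is still not well-formed,
# because '<' and '&' are markup characters.
try:
    ET.fromstring(f"<t>{text}</t>")
    raw_is_well_formed = True
except ET.ParseError:
    raw_is_well_formed = False

# Escaping the markup characters first makes the fragment parse,
# and the parser hands back the original text.
element = ET.fromstring(f"<t>{escape(text)}</t>")
round_tripped = element.text
```

In this sketch `raw_is_well_formed` ends up False, while `round_tripped` equals the original text.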
Usage Example
# Stand-in for the module's clean_text helper so the example runs on its own.
# Assumption: the real clean_text may do more than strip whitespace.
def clean_text(text):
    return text.strip() if text else ''

def clean_text_for_xml(text):
    if not text:
        return ''
    text = clean_text(text)
    text = ''.join(char for char in text if ord(char) >= 32 or char in '\t\n\r')
    text = text.replace('\x00', '')
    text = text.replace('\x0b', ' ')
    text = text.replace('\x0c', ' ')
    cleaned = ''
    for char in text:
        code = ord(char)
        if (code == 0x09 or code == 0x0A or code == 0x0D or
                (0x20 <= code <= 0xD7FF) or
                (0xE000 <= code <= 0xFFFD) or
                (0x10000 <= code <= 0x10FFFF)):
            cleaned += char
        else:
            cleaned += ' '
    return cleaned

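# Example usage
raw_text = 'Hello\x00World\x0b\x0cTest\u0001Data'
cleaned = clean_text_for_xml(raw_text)
# The null byte, vertical tab, form feed, and U+0001 are all below 0x20,
# so the first filtering pass removes them without inserting spaces.
print(cleaned)  # Output: 'HelloWorldTestData'
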
# Handle None or empty input
result = clean_text_for_xml(None)
print(result) # Output: ''
result = clean_text_for_xml('')
print(result) # Output: ''
Best Practices
- Always use this function before inserting user-generated or external text into Word document XML structures to prevent document corruption
- The function depends on a 'clean_text' function that must be defined or imported in the same scope
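- Control characters below 0x20 (except tab, newline, and carriage return) are removed without a replacement, so words separated only by such a character are joined; other invalid XML characters (lone surrogates, U+FFFE, U+FFFF) are replaced with spaces to preserve spacing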
- The function handles None and empty string inputs gracefully by returning empty strings
- This function is specifically designed for XML 1.0 specification compliance, which is used by Word documents
- Consider the performance impact when processing very large text strings due to character-by-character iteration
- The function preserves tabs, newlines, and carriage returns as they are valid in XML and useful for formatting
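For the performance point above, one possible alternative to character-by-character iteration is a pair of precompiled regular expressions. This is a sketch under stated assumptions, not the module's implementation: the name `clean_text_for_xml_fast` is hypothetical, and the initial `clean_text` call is omitted because that helper's behaviour is not shown here.

```python
import re

# Control characters below 0x20, excluding tab (0x09), newline (0x0A)
# and carriage return (0x0D). The documented function removes these.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

# The remaining code points outside the XML 1.0 ranges: surrogates,
# U+FFFE and U+FFFF. The documented function replaces these with a space.
_OTHER_INVALID = re.compile(r"[\ud800-\udfff\ufffe\uffff]")


def clean_text_for_xml_fast(text):
    """Regex-based sketch of the same filtering, without the clean_text step."""
    if not text:
        return ""
    text = _CONTROL_CHARS.sub("", text)
    return _OTHER_INVALID.sub(" ", text)


sample = clean_text_for_xml_fast("Hello\x00World\x0b\x0cTest\x01Data")
```

Both substitutions run inside the regex engine rather than in a Python-level loop, which also avoids building the result through repeated string concatenation. Whether this is measurably faster for a given workload would need to be confirmed by profiling.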
Similar Components
AI-powered semantic similarity - components with related functionality:
- function clean_text (71.0% similar)
- function sanitize_folders (47.5% similar)
- function create_word_report (43.2% similar)
- function create_word_report_improved (42.9% similar)
- function main (41.5% similar)