function clean_text_for_xml_v1
Sanitizes text strings to ensure XML 1.0 compatibility by removing or replacing invalid control characters and ensuring all characters meet XML specification requirements for Word document generation.
/tf/active/vicechatdev/enhanced_word_converter_fixed.py
22 - 49
simple
Purpose
This function prepares text for safe insertion into XML-based Word documents (.docx format) by filtering out characters that would cause XML parsing errors. It removes null bytes and other control characters below U+0020 (except tab, newline, and carriage return), then replaces any remaining character outside the valid XML 1.0 ranges, such as lone surrogates, U+FFFE, and U+FFFF, with a space. This matters when processing user-generated content or data from external sources, where stray invalid characters would otherwise corrupt Word document generation.
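To make the "valid XML 1.0 character ranges" concrete, here is a minimal sketch of a predicate for the XML 1.0 `Char` production, plus a diagnostic helper that reports which characters in a string fall outside it. The names `is_valid_xml_char` and `find_invalid_xml_chars` are hypothetical and are not part of the documented module.

```python
def is_valid_xml_char(char):
    """Return True if char is allowed by the XML 1.0 Char production."""
    code = ord(char)
    return (
        code in (0x09, 0x0A, 0x0D)        # tab, newline, carriage return
        or 0x20 <= code <= 0xD7FF         # excludes the surrogate block
        or 0xE000 <= code <= 0xFFFD       # excludes U+FFFE and U+FFFF
        or 0x10000 <= code <= 0x10FFFF    # supplementary planes
    )


def find_invalid_xml_chars(text):
    """List (index, code point) pairs for characters outside the valid ranges."""
    return [
        (i, f"U+{ord(c):04X}")
        for i, c in enumerate(text)
        if not is_valid_xml_char(c)
    ]


sample = "Hello\x00World\x0bTest\ufffe"
print(find_invalid_xml_chars(sample))
# [(5, 'U+0000'), (11, 'U+000B'), (16, 'U+FFFE')]
```

A helper like this can be useful for logging what a sanitizing pass is about to change before any text is altered.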
Source Code
def clean_text_for_xml(text):
    """Clean text to be XML compatible for Word documents."""
    if not text:
        return ""
    # Remove or replace XML-incompatible characters
    # Remove null bytes and control characters except tab, newline, carriage return
    text = ''.join(char for char in text if ord(char) >= 32 or char in '\t\n\r')
    # Replace any remaining problematic characters
    text = text.replace('\x00', '')  # Remove null bytes
    text = text.replace('\x0b', ' ')  # Replace vertical tab with space
    text = text.replace('\x0c', ' ')  # Replace form feed with space
    # Ensure only valid XML characters (XML 1.0 specification)
    # Valid characters: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    cleaned = ''
    for char in text:
        code = ord(char)
        if (code == 0x09 or code == 0x0A or code == 0x0D or
                (0x20 <= code <= 0xD7FF) or
                (0xE000 <= code <= 0xFFFD) or
                (0x10000 <= code <= 0x10FFFF)):
            cleaned += char
        else:
            cleaned += ' '  # Replace invalid characters with space
    return cleaned
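The two passes in the function above (drop C0 control characters, then replace anything else outside the XML 1.0 ranges with a space) can also be expressed with two precompiled regular expressions. This is a sketch of an alternative that should behave the same way, not the module's implementation; the name `clean_text_for_xml_regex` and the pattern names are hypothetical.

```python
import re

# C0 control characters other than tab (0x09), newline (0x0A), and
# carriage return (0x0D). These are dropped outright, mirroring the
# first filtering pass of the original function.
_C0_CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

# What remains outside the XML 1.0 Char production after that pass:
# the surrogate block plus U+FFFE and U+FFFF. These become spaces.
_NON_XML_CHARS = re.compile(r"[\ud800-\udfff\ufffe\uffff]")


def clean_text_for_xml_regex(text):
    """Regex-based sketch of the same sanitizing logic."""
    if not text:
        return ""
    text = _C0_CONTROLS.sub("", text)
    return _NON_XML_CHARS.sub(" ", text)


print(repr(clean_text_for_xml_regex("Hello\x00World\x0b\x0cTest\t\nValid")))
# 'HelloWorldTest\t\nValid'
```

Because the first pass already strips every character below U+0020 except tab, newline, and carriage return, vertical tab and form feed are removed rather than replaced with spaces, both in this sketch and in the original function.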
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| text | - | - | positional_or_keyword |
Parameter Details
text: Input string to be cleaned. Can be any text content including user input, file contents, or data from external sources. Accepts None or empty strings, which will return an empty string. May contain control characters, null bytes, or other XML-incompatible characters that need sanitization.
Return Value
Returns a cleaned string containing only XML 1.0 compatible characters. Control characters below U+0020 (other than tab, newline, and carriage return) are removed; any other character outside the valid XML 1.0 ranges is replaced with a space. If the input is None or empty, returns an empty string (''). The returned string is safe to insert into XML structures used by python-docx for Word document generation.
Usage Example
# Basic usage
text_with_invalid_chars = "Hello\x00World\x0b\x0cTest\t\nValid"
cleaned = clean_text_for_xml(text_with_invalid_chars)
print(repr(cleaned))  # Output: 'HelloWorldTest\t\nValid' (\x00, \x0b, and \x0c are removed)
# Handle None input
result = clean_text_for_xml(None)
print(result) # Output: ''
# Handle empty string
result = clean_text_for_xml('')
print(result) # Output: ''
# Use with python-docx
from docx import Document
user_input = "User\x00provided\x0btext"
safe_text = clean_text_for_xml(user_input)
doc = Document()
doc.add_paragraph(safe_text)
doc.save('output.docx')
Best Practices
- Always use this function when inserting user-generated content or external data into Word documents to prevent XML parsing errors
- Call this function before passing text to python-docx methods like add_paragraph(), add_heading(), or add_run()
- Be aware that control characters are removed and other invalid characters are replaced with spaces, so the output may differ from the input in length and spacing
- For performance-critical applications with large text volumes, consider caching results if the same text is processed multiple times
- This function preserves tabs (\t), newlines (\n), and carriage returns (\r) as they are valid in XML and useful for document formatting
- The function is defensive and returns empty string for None input, so no need for null checks before calling
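The caching suggestion in the list above could be sketched with `functools.lru_cache` from the standard library. The helper `make_cached_cleaner` and the stand-in `_demo_cleaner` are hypothetical names used only for illustration; in practice the wrapped callable would be `clean_text_for_xml` itself.

```python
from functools import lru_cache


def make_cached_cleaner(cleaner, maxsize=1024):
    """Wrap a text-cleaning callable in an LRU cache (hypothetical helper)."""
    return lru_cache(maxsize=maxsize)(cleaner)


# Stand-in for clean_text_for_xml so the sketch runs on its own.
# It records each real invocation to show when the cache is hit.
calls = []


def _demo_cleaner(text):
    calls.append(text)
    return text.replace("\x00", "") if text else ""


cached_clean = make_cached_cleaner(_demo_cleaner)
cached_clean("a\x00b")  # computed and stored
cached_clean("a\x00b")  # served from the cache; _demo_cleaner is not called again
print(len(calls))  # 1
```

Caching only pays off when identical strings recur often; for mostly unique inputs the cache adds memory overhead without saving work.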
Similar Components
AI-powered semantic similarity - components with related functionality:
- function clean_text_for_xml 97.0% similar
- function clean_text 72.6% similar
- function xml_escape 63.3% similar
- function create_enhanced_word_document 50.7% similar
- function create_enhanced_word_document_v1 50.6% similar