function parse
Parses XML input (string, file-like object, or generator) and converts it into a Python dictionary representation with configurable options for attributes, namespaces, comments, and streaming.
/tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/xmltodict.py
202 - 379
complex
Purpose
This function provides a flexible XML-to-dictionary parser that supports multiple input formats, streaming mode for large files, namespace handling, attribute processing, comment preservation, and custom postprocessing. It's designed for converting XML data into Python dictionaries for easier manipulation and access, with security features like entity disabling and support for alternative expat parsers.
Source Code
def parse(xml_input, encoding=None, expat=expat, process_namespaces=False,
          namespace_separator=':', disable_entities=True, process_comments=False, **kwargs):
    """Parse the given XML input and convert it into a dictionary.
`xml_input` can either be a `string`, a file-like object, or a generator of strings.
If `xml_attribs` is `True`, element attributes are put in the dictionary
among regular child elements, using `@` as a prefix to avoid collisions. If
set to `False`, they are just ignored.
Simple example::
>>> import xmltodict
>>> doc = xmltodict.parse(\"\"\"
... <a prop="x">
... <b>1</b>
... <b>2</b>
... </a>
... \"\"\")
>>> doc['a']['@prop']
u'x'
>>> doc['a']['b']
[u'1', u'2']
If `item_depth` is `0`, the function returns a dictionary for the root
element (default behavior). Otherwise, it calls `item_callback` every time
an item at the specified depth is found and returns `None` in the end
(streaming mode).
The callback function receives two parameters: the `path` from the document
root to the item (name-attribs pairs), and the `item` (dict). If the
callback's return value is false-ish, parsing will be stopped with the
:class:`ParsingInterrupted` exception.
Streaming example::
>>> def handle(path, item):
...     print('path:%s item:%s' % (path, item))
...     return True
...
>>> xmltodict.parse(\"\"\"
... <a prop="x">
... <b>1</b>
... <b>2</b>
... </a>\"\"\", item_depth=2, item_callback=handle)
path:[(u'a', {u'prop': u'x'}), (u'b', None)] item:1
path:[(u'a', {u'prop': u'x'}), (u'b', None)] item:2
The optional argument `postprocessor` is a function that takes `path`,
`key` and `value` as positional arguments and returns a new `(key, value)`
pair where both `key` and `value` may have changed. Usage example::
>>> def postprocessor(path, key, value):
...     try:
...         return key + ':int', int(value)
...     except (ValueError, TypeError):
...         return key, value
>>> xmltodict.parse('<a><b>1</b><b>2</b><b>x</b></a>',
...                 postprocessor=postprocessor)
{'a': {'b:int': [1, 2], 'b': 'x'}}
You can pass an alternate version of `expat` (such as `defusedexpat`) by
using the `expat` parameter. E.g:
>>> import defusedexpat
>>> xmltodict.parse('<a>hello</a>', expat=defusedexpat.pyexpat)
{'a': 'hello'}
You can use the force_list argument to force lists to be created even
when there is only a single child of a given level of hierarchy. The
force_list argument is a tuple of keys. If the key for a given level
of hierarchy is in the force_list argument, that level of hierarchy
will have a list as a child (even if there is only one sub-element).
The index_keys operation takes precedence over this. This is applied
after any user-supplied postprocessor has already run.
For example, given this input:
<servers>
<server>
<name>host1</name>
<os>Linux</os>
<interfaces>
<interface>
<name>em0</name>
<ip_address>10.0.0.1</ip_address>
</interface>
</interfaces>
</server>
</servers>
If called with force_list=('interface',), it will produce
this dictionary:
{'servers':
  {'server':
    {'name': 'host1',
     'os': 'Linux',
     'interfaces':
      {'interface':
        [ {'name': 'em0', 'ip_address': '10.0.0.1' } ] } } } }
`force_list` can also be a callable that receives `path`, `key` and
`value`. This is helpful in cases where the logic that decides whether
a list should be forced is more complex.
If `process_comments` is `True`, each comment is added under `comment_key`
(default=`'#comment'`) to the tag that contains it.
For example, given this input:
<a>
<b>
<!-- b comment -->
<c>
<!-- c comment -->
1
</c>
<d>2</d>
</b>
</a>
If called with process_comments=True, it will produce
this dictionary:
{'a': {
    'b': {
        '#comment': 'b comment',
        'c': {
            '#comment': 'c comment',
            '#text': '1',
        },
        'd': '2',
    },
}}
"""
    handler = _DictSAXHandler(namespace_separator=namespace_separator,
                              **kwargs)
    if isinstance(xml_input, _unicode):
        if not encoding:
            encoding = 'utf-8'
        xml_input = xml_input.encode(encoding)
    if not process_namespaces:
        namespace_separator = None
    parser = expat.ParserCreate(
        encoding,
        namespace_separator
    )
    try:
        parser.ordered_attributes = True
    except AttributeError:
        # Jython's expat does not support ordered_attributes
        pass
    parser.StartNamespaceDeclHandler = handler.startNamespaceDecl
    parser.StartElementHandler = handler.startElement
    parser.EndElementHandler = handler.endElement
    parser.CharacterDataHandler = handler.characters
    if process_comments:
        parser.CommentHandler = handler.comments
    parser.buffer_text = True
    if disable_entities:
        try:
            # Attempt to disable DTD in Jython's expat parser (Xerces-J).
            feature = "http://apache.org/xml/features/disallow-doctype-decl"
            parser._reader.setFeature(feature, True)
        except AttributeError:
            # For CPython / expat parser.
            # Anything not handled ends up here and entities aren't expanded.
            parser.DefaultHandler = lambda x: None
            # Expects an integer return; zero means failure -> expat.ExpatError.
            parser.ExternalEntityRefHandler = lambda *x: 1
    if hasattr(xml_input, 'read'):
        parser.ParseFile(xml_input)
    elif isgenerator(xml_input):
        for chunk in xml_input:
            parser.Parse(chunk, False)
        parser.Parse(b'', True)
    else:
        parser.Parse(xml_input, True)
    return handler.item
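The three input branches at the end of the function map directly onto pyexpat calls. The sketch below uses only the standard library to show that dispatch in isolation. `element_names` and `chunks` are hypothetical names introduced for illustration, not part of xmltodict, and the helper only records tag names rather than building a dictionary.

```python
import io
from inspect import isgenerator
from xml.parsers import expat


def element_names(xml_input):
    """Hypothetical helper: return tag names in document order, dispatching
    on the input type the same way the branches above do."""
    names = []
    parser = expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    if hasattr(xml_input, 'read'):
        # File-like object: expat reads the data itself.
        parser.ParseFile(xml_input)
    elif isgenerator(xml_input):
        # Generator: feed each chunk, then signal the end of input.
        for chunk in xml_input:
            parser.Parse(chunk, False)
        parser.Parse(b'', True)
    else:
        # Bytes held fully in memory: a single final call.
        parser.Parse(xml_input, True)
    return names


def chunks():
    # A tag may be split across chunks; expat buffers the partial input.
    yield b'<a><b>1</'
    yield b'b><c/></a>'


data = b'<a><b>1</b><c/></a>'
from_bytes = element_names(data)
from_file = element_names(io.BytesIO(data))
from_generator = element_names(chunks())
```

All three calls should see the same elements, which is why the choice of input type does not change the resulting dictionary.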
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| xml_input | - | - | positional_or_keyword |
| encoding | - | None | positional_or_keyword |
| expat | - | expat | positional_or_keyword |
| process_namespaces | - | False | positional_or_keyword |
| namespace_separator | - | ':' | positional_or_keyword |
| disable_entities | - | True | positional_or_keyword |
| process_comments | - | False | positional_or_keyword |
| **kwargs | - | - | var_keyword |
Parameter Details
xml_input: The XML data to parse. Can be a string (unicode or bytes), a file-like object with a read() method, or a generator yielding string chunks for streaming large files.
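encoding: Character encoding for the XML input. Defaults to None. A unicode string is encoded to bytes with this encoding before parsing ('utf-8' when None). The value is also passed to expat.ParserCreate, where a non-None encoding overrides the one declared in the document, so set it only when the declaration is missing or wrong (e.g. 'latin-1').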
expat: The expat parser module to use. Defaults to the standard xml.parsers.expat. Can be replaced with defusedexpat.pyexpat for enhanced security against XML attacks.
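process_namespaces: Boolean flag (default False). When False, prefixed names such as 'a:item' are kept exactly as written and xmlns declarations appear as ordinary attributes. When True, expat resolves each prefix to its namespace URI, and element names become the URI joined to the local name by namespace_separator. The optional namespaces kwarg maps URIs to shorter prefixes, or to None to drop them.
namespace_separator: Single-character string placed between the namespace URI and the local element name when process_namespaces is True. Defaults to ':' (e.g., 'http://a.com/:element').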
disable_entities: Boolean flag (default True). When True, disables XML entity expansion to prevent XXE (XML External Entity) attacks and other security vulnerabilities.
process_comments: Boolean flag (default False). When True, XML comments are included in the output dictionary with the key '#comment' (or custom comment_key from kwargs).
kwargs: Additional keyword arguments passed to _DictSAXHandler. Common options include: xml_attribs (bool, include attributes), attr_prefix (str, default '@'), cdata_key (str, default '#text'), force_list (tuple/callable), item_depth (int, for streaming), item_callback (callable, for streaming), postprocessor (callable, for value transformation), dict_constructor (callable, default dict), strip_whitespace (bool), comment_key (str, default '#comment'), force_cdata (bool), cdata_separator (str).
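The effect of process_namespaces comes from the second argument to `expat.ParserCreate`: `parse` passes `None` when the flag is off and the separator when it is on. The following standard-library sketch shows what expat reports in each case. `tag_names` and the sample URIs are assumptions made for illustration and are not part of xmltodict.

```python
from xml.parsers import expat

XML = (b'<root xmlns="http://defaultns.com/" xmlns:a="http://a.com/">'
       b'<x>1</x><a:y>2</a:y></root>')


def tag_names(namespace_separator):
    """Hypothetical helper: collect element names as expat reports them."""
    names = []
    parser = expat.ParserCreate(None, namespace_separator)
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(XML, True)
    return names


# process_namespaces=False -> separator None: prefixes are left as written.
raw_names = tag_names(None)
# process_namespaces=True -> separator ':': prefixes are replaced by URIs.
expanded_names = tag_names(':')
```

With the separator set, names arrive as the namespace URI, the separator, then the local name, which is the form that ends up in the dictionary keys unless a `namespaces` mapping shortens them.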
Return Value
Returns a dictionary (or OrderedDict if specified) representing the parsed XML structure. Element names become keys, text content becomes values, attributes are prefixed with '@' by default, and repeated elements become lists. In streaming mode (when item_depth > 0), returns None and calls item_callback for each item at the specified depth instead.
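Streaming mode changes both the return value and the error behavior, which a short run makes concrete. This sketch assumes xmltodict is installed. The callbacks `collect` and `stop_after_two` are hypothetical names chosen for illustration.

```python
import xmltodict

xml = '<a prop="x"><b>1</b><b>2</b><b>3</b></a>'

# Full run: every depth-2 item reaches the callback and parse() returns None.
seen = []


def collect(path, item):
    seen.append(item)
    return True


result = xmltodict.parse(xml, item_depth=2, item_callback=collect)

# Early stop: a false-ish return value raises ParsingInterrupted.
partial = []


def stop_after_two(path, item):
    partial.append(item)
    return len(partial) < 2


try:
    xmltodict.parse(xml, item_depth=2, item_callback=stop_after_two)
    interrupted = False
except xmltodict.ParsingInterrupted:
    interrupted = True
```

Because nothing is accumulated in the return value, any state the caller needs has to be collected by the callback itself, as the two lists do here.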
Dependencies
xml.parsers.expat
defusedexpat
inspect
Required Imports
from xml.parsers import expat
from inspect import isgenerator
Conditional/Optional Imports
These imports are only needed under specific conditions:
from defusedexpat import pyexpat as expat
Condition: only if using defusedexpat for enhanced security (alternative to standard expat)
Optional
from collections import OrderedDict
Condition: only if dict_constructor=OrderedDict is passed in kwargs to preserve element order
Optional
Usage Example
import xmltodict
from xml.parsers import expat

# Basic usage - parse XML string
xml_string = '''<root><person name="John"><age>30</age><city>NYC</city></person></root>'''
result = xmltodict.parse(xml_string)
print(result)
# Output: {'root': {'person': {'@name': 'John', 'age': '30', 'city': 'NYC'}}}

# Parse with file object
with open('data.xml', 'rb') as f:
    result = xmltodict.parse(f)

# Streaming mode for large files
def handle_item(path, item):
    print(f'Processing: {item}')
    return True

xmltodict.parse(xml_string, item_depth=2, item_callback=handle_item)

# With postprocessor to convert types
def postprocessor(path, key, value):
    if key == 'age':
        return key, int(value)
    return key, value

result = xmltodict.parse(xml_string, postprocessor=postprocessor)

# Force lists for specific elements
result = xmltodict.parse(xml_string, force_list=('person',))

# Process comments
xml_with_comments = '''<root><!-- comment --><data>value</data></root>'''
result = xmltodict.parse(xml_with_comments, process_comments=True)

# Use defusedexpat for security
import defusedexpat
result = xmltodict.parse(xml_string, expat=defusedexpat.pyexpat)
Best Practices
- Always use disable_entities=True (default) when parsing untrusted XML to prevent XXE attacks
- Consider using defusedexpat.pyexpat instead of standard expat for enhanced security when parsing external XML
- Use streaming mode (item_depth + item_callback) for large XML files to avoid loading entire document into memory
- Provide explicit encoding parameter when working with non-UTF-8 XML documents
- Use force_list parameter when you need consistent list structures even for single elements
- Implement postprocessor functions for type conversion and data validation during parsing
- When processing namespaces, set process_namespaces=True and choose appropriate namespace_separator
- On Python 3.7+ a plain dict already preserves element order; pass dict_constructor=OrderedDict in kwargs only when downstream code needs OrderedDict-specific behavior
- Handle ParsingInterrupted exception when using item_callback that may return False
- Test with sample data before processing large production XML files
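The disable_entities advice rests on two pyexpat handlers that the CPython branch of `parse` installs. The standard-library sketch below isolates that mechanism: once a non-expanding default handler is set, expat hands internal entity references to it instead of expanding them. `text_content` and the sample document are hypothetical and chosen for illustration.

```python
from xml.parsers import expat

XML = b'<!DOCTYPE a [<!ENTITY e "expanded">]><a>&e;</a>'


def text_content(disable_entities):
    """Hypothetical helper: return the character data expat delivers."""
    parts = []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = parts.append
    if disable_entities:
        # The same two handlers that the CPython branch of parse() sets.
        parser.DefaultHandler = lambda data: None
        parser.ExternalEntityRefHandler = lambda *args: 1
    parser.Parse(XML, True)
    return ''.join(parts)


with_entities = text_content(False)
without_entities = text_content(True)
```

With the handlers installed, the entity reference never reaches the character data handler, so an element whose only content is an entity reference parses as having no text.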
Similar Components
AI-powered semantic similarity - components with related functionality:
- function unparse (66.1% similar)
- class _DictSAXHandler (63.2% similar)
- function _emit (59.1% similar)
- function _process_namespace (42.7% similar)
- function parse_log_line (36.9% similar)