
function parse

Maturity: 70

Parses XML input (string, file-like object, or generator) and converts it into a Python dictionary representation with configurable options for attributes, namespaces, comments, and streaming.

File: /tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/xmltodict.py
Lines: 202 - 379
Complexity: complex

Purpose

This function is a flexible XML-to-dictionary parser. It accepts several input forms, offers a streaming mode for large files, and supports namespace handling, attribute processing, comment preservation, and custom postprocessing. Entity expansion is disabled by default, and an alternative expat implementation can be supplied through the expat parameter.

Source Code

def parse(xml_input, encoding=None, expat=expat, process_namespaces=False,
          namespace_separator=':', disable_entities=True, process_comments=False, **kwargs):
    """Parse the given XML input and convert it into a dictionary.

    `xml_input` can either be a `string`, a file-like object, or a generator of strings.

    If `xml_attribs` is `True`, element attributes are put in the dictionary
    among regular child elements, using `@` as a prefix to avoid collisions. If
    set to `False`, they are just ignored.

    Simple example::

        >>> import xmltodict
        >>> doc = xmltodict.parse(\"\"\"
        ... <a prop="x">
        ...   <b>1</b>
        ...   <b>2</b>
        ... </a>
        ... \"\"\")
        >>> doc['a']['@prop']
        u'x'
        >>> doc['a']['b']
        [u'1', u'2']

    If `item_depth` is `0`, the function returns a dictionary for the root
    element (default behavior). Otherwise, it calls `item_callback` every time
    an item at the specified depth is found and returns `None` in the end
    (streaming mode).

    The callback function receives two parameters: the `path` from the document
    root to the item (name-attribs pairs), and the `item` (dict). If the
    callback's return value is false-ish, parsing will be stopped with the
    :class:`ParsingInterrupted` exception.

    Streaming example::

        >>> def handle(path, item):
        ...     print('path:%s item:%s' % (path, item))
        ...     return True
        ...
        >>> xmltodict.parse(\"\"\"
        ... <a prop="x">
        ...   <b>1</b>
        ...   <b>2</b>
        ... </a>\"\"\", item_depth=2, item_callback=handle)
        path:[(u'a', {u'prop': u'x'}), (u'b', None)] item:1
        path:[(u'a', {u'prop': u'x'}), (u'b', None)] item:2

    The optional argument `postprocessor` is a function that takes `path`,
    `key` and `value` as positional arguments and returns a new `(key, value)`
    pair where both `key` and `value` may have changed. Usage example::

        >>> def postprocessor(path, key, value):
        ...     try:
        ...         return key + ':int', int(value)
        ...     except (ValueError, TypeError):
        ...         return key, value
        >>> xmltodict.parse('<a><b>1</b><b>2</b><b>x</b></a>',
        ...                 postprocessor=postprocessor)
        {'a': {'b:int': [1, 2], 'b': 'x'}}

    You can pass an alternate version of `expat` (such as `defusedexpat`) by
    using the `expat` parameter. E.g:

        >>> import defusedexpat
        >>> xmltodict.parse('<a>hello</a>', expat=defusedexpat.pyexpat)
        {'a': 'hello'}

    You can use the force_list argument to force lists to be created even
    when there is only a single child of a given level of hierarchy. The
    force_list argument is a tuple of keys. If the key for a given level
    of hierarchy is in the force_list argument, that level of hierarchy
    will have a list as a child (even if there is only one sub-element).
    The index_keys operation takes precedence over this. This is applied
    after any user-supplied postprocessor has already run.

        For example, given this input:
        <servers>
          <server>
            <name>host1</name>
            <os>Linux</os>
            <interfaces>
              <interface>
                <name>em0</name>
                <ip_address>10.0.0.1</ip_address>
              </interface>
            </interfaces>
          </server>
        </servers>

        If called with force_list=('interface',), it will produce
        this dictionary:
        {'servers':
          {'server':
            {'name': 'host1',
             'os': 'Linux',
             'interfaces':
              {'interface':
                [ {'name': 'em0', 'ip_address': '10.0.0.1' } ] } } } }

        `force_list` can also be a callable that receives `path`, `key` and
        `value`. This is helpful in cases where the logic that decides whether
        a list should be forced is more complex.


        If `process_comments` is `True`, each comment is added under comment_key
        (default=`'#comment'`) to the tag that contains the comment.

            For example, given this input:
            <a>
              <b>
                <!-- b comment -->
                <c>
                    <!-- c comment -->
                    1
                </c>
                <d>2</d>
              </b>
            </a>

            If called with process_comments=True, it will produce
            this dictionary:
            'a': {
                'b': {
                    '#comment': 'b comment',
                    'c': {

                        '#comment': 'c comment',
                        '#text': '1',
                    },
                    'd': '2',
                },
            }
    """
    handler = _DictSAXHandler(namespace_separator=namespace_separator,
                              **kwargs)
    if isinstance(xml_input, _unicode):
        if not encoding:
            encoding = 'utf-8'
        xml_input = xml_input.encode(encoding)
    if not process_namespaces:
        namespace_separator = None
    parser = expat.ParserCreate(
        encoding,
        namespace_separator
    )
    try:
        parser.ordered_attributes = True
    except AttributeError:
        # Jython's expat does not support ordered_attributes
        pass
    parser.StartNamespaceDeclHandler = handler.startNamespaceDecl
    parser.StartElementHandler = handler.startElement
    parser.EndElementHandler = handler.endElement
    parser.CharacterDataHandler = handler.characters
    if process_comments:
        parser.CommentHandler = handler.comments
    parser.buffer_text = True
    if disable_entities:
        try:
            # Attempt to disable DTD in Jython's expat parser (Xerces-J).
            feature = "http://apache.org/xml/features/disallow-doctype-decl"
            parser._reader.setFeature(feature, True)
        except AttributeError:
            # For CPython / expat parser.
            # Anything not handled ends up here and entities aren't expanded.
            parser.DefaultHandler = lambda x: None
            # Expects an integer return; zero means failure -> expat.ExpatError.
            parser.ExternalEntityRefHandler = lambda *x: 1
    if hasattr(xml_input, 'read'):
        parser.ParseFile(xml_input)
    elif isgenerator(xml_input):
        for chunk in xml_input:
            parser.Parse(chunk,False)
        parser.Parse(b'',True)
    else:
        parser.Parse(xml_input, True)
    return handler.item
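
The three input paths at the end of the function can be exercised directly against the standard-library expat parser. The following is a minimal sketch, not xmltodict's own code: `feed`, `element_names`, and `chunks` are hypothetical helpers that mirror only the dispatch logic shown above.

```python
import io
from inspect import isgenerator
from xml.parsers import expat


def feed(parser, xml_input):
    """Hypothetical helper mirroring the input dispatch at the end of parse()."""
    if hasattr(xml_input, 'read'):
        # File-like object: expat pulls the data itself through read().
        parser.ParseFile(xml_input)
    elif isgenerator(xml_input):
        # Generator: feed chunk by chunk, then signal the end of input.
        for chunk in xml_input:
            parser.Parse(chunk, False)
        parser.Parse(b'', True)
    else:
        # str or bytes: a single, final Parse call.
        parser.Parse(xml_input, True)


def element_names(xml_input):
    """Return the element names seen while parsing, in document order."""
    names = []
    parser = expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    feed(parser, xml_input)
    return names


def chunks():
    # Chunk boundaries may fall anywhere, even inside a tag.
    yield b'<a><b>1</b'
    yield b'><c>2</c></a>'


from_bytes = element_names(b'<a><b>1</b><c>2</c></a>')
from_file = element_names(io.BytesIO(b'<a><b>1</b><c>2</c></a>'))
from_generator = element_names(chunks())
print(from_bytes, from_file, from_generator)
```

All three calls see the same element sequence, which is why the caller does not need to care how the document is delivered.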

Parameters

Name                 Type  Default  Kind
xml_input            -     -        positional_or_keyword
encoding             -     None     positional_or_keyword
expat                -     expat    positional_or_keyword
process_namespaces   -     False    positional_or_keyword
namespace_separator  -     ':'      positional_or_keyword
disable_entities     -     True     positional_or_keyword
process_comments     -     False    positional_or_keyword
**kwargs             -     -        var_keyword

Parameter Details

xml_input: The XML data to parse. Can be a string (unicode or bytes), a file-like object with a read() method, or a generator yielding string chunks for streaming large files.

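encoding: Character encoding for the XML input. Defaults to None. A unicode string is encoded to bytes with this encoding ('utf-8' when None) before parsing, and the value is also passed to expat.ParserCreate. Specify it (e.g., 'utf-8', 'latin-1') when the input is not UTF-8.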
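
The effect of the encoding argument is easiest to see at the expat level, since the value is handed straight to expat.ParserCreate. The sketch below uses only the standard library; `text_of` is a hypothetical helper, and the sample document deliberately carries no XML declaration.

```python
from xml.parsers import expat


def text_of(data, encoding=None):
    """Hypothetical helper: parse bytes and return the concatenated character data."""
    parts = []
    parser = expat.ParserCreate(encoding)
    parser.CharacterDataHandler = parts.append
    parser.Parse(data, True)
    return ''.join(parts)


# ISO-8859-1 bytes; the e-acute is the single byte 0xE9.
latin1_bytes = '<a>caf\u00e9</a>'.encode('iso-8859-1')

# With the encoding stated explicitly, the byte is decoded correctly.
decoded = text_of(latin1_bytes, encoding='iso-8859-1')

# Without it, expat assumes UTF-8 and rejects the lone 0xE9 byte.
try:
    text_of(latin1_bytes)
    utf8_failed = False
except expat.ExpatError:
    utf8_failed = True

print(decoded, utf8_failed)
```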

expat: The expat parser module to use. Defaults to the standard xml.parsers.expat. Can be replaced with defusedexpat.pyexpat for enhanced security against XML attacks.

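process_namespaces: Boolean flag (default False). When True, expat performs namespace processing and each prefixed name is expanded to its namespace URI joined to the local name with namespace_separator. When False, names are left exactly as written in the document, prefix included (e.g., 'ns:element').

namespace_separator: String separator placed between the namespace URI and the local name when process_namespaces is True. Defaults to ':' (e.g., 'http://example.com/ns:element'). It is not passed to the parser when process_namespaces is False.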
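
What the parser reports for element names under each setting can be checked with the standard-library expat module alone. This is a minimal sketch: `element_names` is a hypothetical helper and the namespace URIs are made up for illustration.

```python
from xml.parsers import expat

XML = (b'<root xmlns="http://defaultns.com/" xmlns:a="http://a.com/">'
       b'<x>1</x><a:y>2</a:y></root>')


def element_names(namespace_separator):
    """Hypothetical helper: collect element names as expat reports them."""
    names = []
    parser = expat.ParserCreate(None, namespace_separator)
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(XML, True)
    return names


# Separator None: the setting parse() uses when process_namespaces=False.
raw_names = element_names(None)

# Separator ':': the setting parse() uses when process_namespaces=True.
expanded_names = element_names(':')

print(raw_names)
print(expanded_names)
```

With the separator set, the prefix `a` disappears from the name and the URI it was bound to takes its place.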

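disable_entities: Boolean flag (default True). When True, entity references are not expanded: on CPython a no-op DefaultHandler absorbs them and the ExternalEntityRefHandler skips external entities, while on Jython DOCTYPE declarations are disallowed. This guards against XXE (XML External Entity) and entity-expansion attacks.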
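
The CPython branch of this protection can be reproduced with the standard-library expat module. The sketch below installs the same two handlers that the source code sets; `collect_text` is a hypothetical helper and the entity name is arbitrary.

```python
from xml.parsers import expat

XML = '<!DOCTYPE d [<!ENTITY e "expanded">]><d>&e;</d>'


def collect_text(xml, block_entities):
    """Hypothetical helper: return the character data delivered by expat."""
    parts = []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = parts.append
    if block_entities:
        # Same handlers as the disable_entities branch in parse():
        # a default handler turns off internal entity expansion, and the
        # external entity handler returns 1 without loading anything.
        parser.DefaultHandler = lambda data: None
        parser.ExternalEntityRefHandler = lambda *args: 1
    parser.Parse(xml, True)
    return ''.join(parts)


expanded = collect_text(XML, block_entities=False)
blocked = collect_text(XML, block_entities=True)
print(repr(expanded), repr(blocked))
```

With the handlers installed, the reference `&e;` is routed to the default handler and never reaches the character data, so the element comes back empty.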

process_comments: Boolean flag (default False). When True, XML comments are included in the output dictionary with the key '#comment' (or custom comment_key from kwargs).

kwargs: Additional keyword arguments passed to _DictSAXHandler. Common options include:

  • xml_attribs (bool): include attributes
  • attr_prefix (str, default '@')
  • cdata_key (str, default '#text')
  • force_list (tuple or callable)
  • item_depth (int): for streaming
  • item_callback (callable): for streaming
  • postprocessor (callable): for value transformation
  • dict_constructor (callable, default dict)
  • strip_whitespace (bool)
  • comment_key (str, default '#comment')
  • force_cdata (bool)
  • cdata_separator (str)

Return Value

Returns a dictionary (or OrderedDict if specified) representing the parsed XML structure. Element names become keys, text content becomes values, attributes are prefixed with '@' by default, and repeated elements become lists. In streaming mode (when item_depth > 0), returns None and calls item_callback for each item at the specified depth instead.
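
The two return modes can be compared side by side. This is a short sketch that assumes xmltodict is installed and reuses the sample document from the docstring; `collect` is a hypothetical callback name.

```python
import xmltodict

XML = '<a prop="x"><b>1</b><b>2</b></a>'

# Default mode: the whole document comes back as one dictionary.
doc = xmltodict.parse(XML)

# Streaming mode: items at depth 2 go to the callback instead.
items = []


def collect(path, item):
    items.append(item)
    return True  # a true value tells the parser to keep going


streamed = xmltodict.parse(XML, item_depth=2, item_callback=collect)

print(doc)
print(items, streamed)
```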

Dependencies

  • xml.parsers.expat
  • defusedexpat (optional)
  • inspect

Required Imports

from xml.parsers import expat
from inspect import isgenerator

Conditional/Optional Imports

These imports are only needed under specific conditions:

from defusedexpat import pyexpat as expat

Condition: optional; only if using defusedexpat as an alternative to the standard expat module

from collections import OrderedDict

Condition: optional; only if dict_constructor=OrderedDict is passed in kwargs. On Python 3.7+ the built-in dict already preserves insertion order.

Usage Example

import xmltodict

# Basic usage - parse XML string
xml_string = '''<root><person name="John"><age>30</age><city>NYC</city></person></root>'''
result = xmltodict.parse(xml_string)
print(result)
# Output: {'root': {'person': {'@name': 'John', 'age': '30', 'city': 'NYC'}}}

# Parse with file object
with open('data.xml', 'rb') as f:
    result = xmltodict.parse(f)

# Streaming mode for large files
def handle_item(path, item):
    print(f'Processing: {item}')
    return True

xmltodict.parse(xml_string, item_depth=2, item_callback=handle_item)

# With postprocessor to convert types
def postprocessor(path, key, value):
    if key == 'age':
        return key, int(value)
    return key, value

result = xmltodict.parse(xml_string, postprocessor=postprocessor)

# Force lists for specific elements
result = xmltodict.parse(xml_string, force_list=('person',))

# Process comments
xml_with_comments = '''<root><!-- comment --><data>value</data></root>'''
result = xmltodict.parse(xml_with_comments, process_comments=True)

# Use defusedexpat for security (third-party package; must be installed separately)
import defusedexpat
result = xmltodict.parse(xml_string, expat=defusedexpat.pyexpat)

Best Practices

  • Always use disable_entities=True (default) when parsing untrusted XML to prevent XXE attacks
  • Consider using defusedexpat.pyexpat instead of standard expat for enhanced security when parsing external XML
  • Use streaming mode (item_depth + item_callback) for large XML files to avoid loading entire document into memory
  • Provide explicit encoding parameter when working with non-UTF-8 XML documents
  • Use force_list parameter when you need consistent list structures even for single elements
  • Implement postprocessor functions for type conversion and data validation during parsing
  • When processing namespaces, set process_namespaces=True and choose appropriate namespace_separator
  • On Python 3.7+ the default dict already preserves element order; pass dict_constructor=OrderedDict in kwargs only if you need OrderedDict specifically
  • Handle the ParsingInterrupted exception when using an item_callback that may return a false-ish value
  • Test with sample data before processing large production XML files
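
The point about ParsingInterrupted can be illustrated with a callback that stops after the first item. This is a minimal sketch assuming xmltodict is installed; `take_first` is a hypothetical callback name.

```python
import xmltodict

seen = []


def take_first(path, item):
    """Record one item, then return a false value to stop the parser."""
    seen.append(item)
    return False


try:
    xmltodict.parse('<a><b>1</b><b>2</b></a>',
                    item_depth=2, item_callback=take_first)
    interrupted = False
except xmltodict.ParsingInterrupted:
    # Raised as soon as the callback returns a false-ish value.
    interrupted = True

print(seen, interrupted)
```

Catching the exception at the call site lets a caller use an early exit deliberately, for example to stop once the record it was looking for has been found.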

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function unparse 66.1% similar

    Converts a Python dictionary into an XML document string, serving as the reverse operation of XML parsing. Supports customizable formatting, encoding, and XML generation options.

    From: /tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/xmltodict.py
  • class _DictSAXHandler 63.2% similar

    A SAX (Simple API for XML) event handler that converts XML documents into Python dictionaries, with extensive configuration options for handling attributes, namespaces, CDATA, and structure.

    From: /tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/xmltodict.py
  • function _emit 59.1% similar

    Recursively converts a dictionary structure into XML SAX events, emitting them through a content handler for XML generation.

    From: /tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/xmltodict.py
  • function _process_namespace 42.7% similar

    Processes XML namespace prefixes in element/attribute names by resolving them against a namespace dictionary and reconstructing the full qualified name.

    From: /tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/xmltodict.py
  • function parse_log_line 36.9% similar

    Parses a structured log line string and extracts timestamp, logger name, log level, and message components into a dictionary.

    From: /tf/active/vicechatdev/SPFCsync/monitor.py