parse - Code Extractor

function parse

Maturity: 70

Parses XML input (string, file-like object, or generator) and converts it into a Python dictionary representation with configurable options for attributes, namespaces, comments, and streaming.

File:
/tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/xmltodict.py

Lines:
202 - 379

Complexity:
complex

Purpose

This function provides a flexible XML-to-dictionary parser that supports multiple input formats, streaming mode for large files, namespace handling, attribute processing, comment preservation, and custom postprocessing. It's designed for converting XML data into Python dictionaries for easier manipulation and access, with security features like entity disabling and support for alternative expat parsers.

Source Code

def parse(xml_input, encoding=None, expat=expat, process_namespaces=False,
          namespace_separator=':', disable_entities=True, process_comments=False, **kwargs):
    """Parse the given XML input and convert it into a dictionary.

    `xml_input` can either be a `string`, a file-like object, or a generator of strings.

    If `xml_attribs` is `True`, element attributes are put in the dictionary
    among regular child elements, using `@` as a prefix to avoid collisions. If
    set to `False`, they are just ignored.

    Simple example::

        >>> import xmltodict
        >>> doc = xmltodict.parse(\"\"\"
        ... <a prop="x">
        ...   <b>1</b>
        ...   <b>2</b>
        ... </a>
        ... \"\"\")
        >>> doc['a']['@prop']
        u'x'
        >>> doc['a']['b']
        [u'1', u'2']

    If `item_depth` is `0`, the function returns a dictionary for the root
    element (default behavior). Otherwise, it calls `item_callback` every time
    an item at the specified depth is found and returns `None` in the end
    (streaming mode).

    The callback function receives two parameters: the `path` from the document
    root to the item (name-attribs pairs), and the `item` (dict). If the
    callback's return value is false-ish, parsing will be stopped with the
    :class:`ParsingInterrupted` exception.

    Streaming example::

        >>> def handle(path, item):
        ...     print('path:%s item:%s' % (path, item))
        ...     return True
        ...
        >>> xmltodict.parse(\"\"\"
        ... <a prop="x">
        ...   <b>1</b>
        ...   <b>2</b>
        ... </a>\"\"\", item_depth=2, item_callback=handle)
        path:[(u'a', {u'prop': u'x'}), (u'b', None)] item:1
        path:[(u'a', {u'prop': u'x'}), (u'b', None)] item:2

    The optional argument `postprocessor` is a function that takes `path`,
    `key` and `value` as positional arguments and returns a new `(key, value)`
    pair where both `key` and `value` may have changed. Usage example::

        >>> def postprocessor(path, key, value):
        ...     try:
        ...         return key + ':int', int(value)
        ...     except (ValueError, TypeError):
        ...         return key, value
        >>> xmltodict.parse('<a><b>1</b><b>2</b><b>x</b></a>',
        ...                 postprocessor=postprocessor)
        {'a': {'b:int': [1, 2], 'b': 'x'}}

    You can pass an alternate version of `expat` (such as `defusedexpat`) by
    using the `expat` parameter. E.g:

        >>> import defusedexpat
        >>> xmltodict.parse('<a>hello</a>', expat=defusedexpat.pyexpat)
        {'a': 'hello'}

    You can use the force_list argument to force lists to be created even
    when there is only a single child of a given level of hierarchy. The
    force_list argument is a tuple of keys. If the key for a given level
    of hierarchy is in the force_list argument, that level of hierarchy
    will have a list as a child (even if there is only one sub-element).
    The index_keys operation takes precedence over this. This is applied
    after any user-supplied postprocessor has already run.

        For example, given this input:
        <servers>
          <server>
            <name>host1</name>
            <os>Linux</os>
            <interfaces>
              <interface>
                <name>em0</name>
                <ip_address>10.0.0.1</ip_address>
              </interface>
            </interfaces>
          </server>
        </servers>

        If called with force_list=('interface',), it will produce
        this dictionary:
        {'servers':
          {'server':
            {'name': 'host1',
             'os': 'Linux',
             'interfaces':
              {'interface':
                [ {'name': 'em0', 'ip_address': '10.0.0.1' } ] } } } }

        `force_list` can also be a callable that receives `path`, `key` and
        `value`. This is helpful in cases where the logic that decides whether
        a list should be forced is more complex.


        If `process_comments` is `True`, each comment is added under `comment_key`
        (default=`'#comment'`) to the tag that contains it.

            For example, given this input:
            <a>
              <b>
                <!-- b comment -->
                <c>
                    <!-- c comment -->
                    1
                </c>
                <d>2</d>
              </b>
            </a>

            If called with process_comments=True, it will produce
            this dictionary:
            'a': {
                'b': {
                    '#comment': 'b comment',
                    'c': {

                        '#comment': 'c comment',
                        '#text': '1',
                    },
                    'd': '2',
                },
            }
    """
    handler = _DictSAXHandler(namespace_separator=namespace_separator,
                              **kwargs)
    if isinstance(xml_input, _unicode):
        if not encoding:
            encoding = 'utf-8'
        xml_input = xml_input.encode(encoding)
    if not process_namespaces:
        namespace_separator = None
    parser = expat.ParserCreate(
        encoding,
        namespace_separator
    )
    try:
        parser.ordered_attributes = True
    except AttributeError:
        # Jython's expat does not support ordered_attributes
        pass
    parser.StartNamespaceDeclHandler = handler.startNamespaceDecl
    parser.StartElementHandler = handler.startElement
    parser.EndElementHandler = handler.endElement
    parser.CharacterDataHandler = handler.characters
    if process_comments:
        parser.CommentHandler = handler.comments
    parser.buffer_text = True
    if disable_entities:
        try:
            # Attempt to disable DTD in Jython's expat parser (Xerces-J).
            feature = "http://apache.org/xml/features/disallow-doctype-decl"
            parser._reader.setFeature(feature, True)
        except AttributeError:
            # For CPython / expat parser.
            # Anything not handled ends up here and entities aren't expanded.
            parser.DefaultHandler = lambda x: None
            # Expects an integer return; zero means failure -> expat.ExpatError.
            parser.ExternalEntityRefHandler = lambda *x: 1
    if hasattr(xml_input, 'read'):
        parser.ParseFile(xml_input)
    elif isgenerator(xml_input):
        for chunk in xml_input:
            parser.Parse(chunk, False)
        parser.Parse(b'', True)
    else:
        parser.Parse(xml_input, True)
    return handler.item

Parameters

Name	Type	Default	Kind
`xml_input`	-	-	positional_or_keyword
`encoding`	-	None	positional_or_keyword
`expat`	-	expat	positional_or_keyword
`process_namespaces`	-	False	positional_or_keyword
`namespace_separator`	-	':'	positional_or_keyword
`disable_entities`	-	True	positional_or_keyword
`process_comments`	-	False	positional_or_keyword
`**kwargs`	-	-	var_keyword

Parameter Details

xml_input: The XML data to parse. Can be a string (unicode or bytes), a file-like object with a read() method, or a generator yielding string chunks for streaming large files.
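The three input forms correspond to the three branches at the end of the source above. The following is a minimal sketch of that dispatch using only the standard library's expat module; collect_tags and chunks are hypothetical names for illustration, and the sketch records tag names instead of building a dictionary.

```python
import io
from inspect import isgenerator
from xml.parsers import expat


def collect_tags(xml_input):
    """Return start-tag names, feeding expat the way parse() does."""
    names = []
    parser = expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    if hasattr(xml_input, "read"):
        parser.ParseFile(xml_input)       # file-like object
    elif isgenerator(xml_input):
        for chunk in xml_input:           # generator of chunks
            parser.Parse(chunk, False)
        parser.Parse(b"", True)           # signal end of input
    else:
        parser.Parse(xml_input, True)     # whole document at once
    return names


def chunks():
    # Chunk boundaries may fall anywhere, even inside a tag.
    yield b"<a><b>1</"
    yield b"b><b>2</b></a>"


from_string = collect_tags(b"<a><b>1</b><b>2</b></a>")
from_file = collect_tags(io.BytesIO(b"<a><b>1</b><b>2</b></a>"))
from_generator = collect_tags(chunks())
```

All three calls see the same sequence of elements, which is why the resulting dictionary does not depend on how the input was supplied.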

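encoding: Character encoding of the input. Defaults to None. A unicode string is encoded to bytes before parsing, using this encoding or 'utf-8' when none is given. The value is also passed to expat.ParserCreate, where a non-None encoding overrides whatever encoding the document declares, so set it (e.g. 'iso-8859-1') only when the bytes are known to use that encoding.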
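As a rough illustration of the encoding override, the sketch below calls the standard library's expat directly on bytes that are not valid UTF-8. parse_text is a hypothetical helper, not part of xmltodict.

```python
from xml.parsers import expat

# "café" encoded as ISO-8859-1: the byte 0xE9 is not valid UTF-8 here.
RAW = "<a>caf\u00e9</a>".encode("iso-8859-1")


def parse_text(data, encoding=None):
    """Return the character data of the document as one string."""
    chunks = []
    parser = expat.ParserCreate(encoding)
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    return "".join(chunks)


decoded = parse_text(RAW, "iso-8859-1")

try:
    parse_text(RAW)        # no declaration, so expat assumes UTF-8
    failed = False
except expat.ExpatError:
    failed = True
```

With the override the text decodes cleanly; without it, the same bytes are rejected as malformed.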

expat: The expat parser module to use. Defaults to the standard xml.parsers.expat. Can be replaced with defusedexpat.pyexpat for enhanced security against XML attacks.

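process_namespaces: Boolean flag (default False). When True, expat performs namespace processing: each prefix is resolved to its namespace URI, and element names are reported as the URI and the local name joined by namespace_separator. When False, names keep their literal prefix (e.g., 'ns:element') and xmlns declarations appear as ordinary attributes.

namespace_separator: String placed between the namespace URI and the local name when process_namespaces is True. Defaults to ':' (e.g., 'http://a.com/:element').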
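The effect of the separator can be seen at the expat level, where parse() passes namespace_separator (or None) as the second argument to ParserCreate. This is a minimal sketch with the standard library only; element_names is a hypothetical helper.

```python
from xml.parsers import expat

XML = (b'<root xmlns="http://defaultns.com/" xmlns:a="http://a.com/">'
       b'<x/><a:y/></root>')


def element_names(namespace_separator):
    """Collect element names exactly as expat reports them."""
    names = []
    parser = expat.ParserCreate(None, namespace_separator)
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(XML, True)
    return names


raw = element_names(None)       # what process_namespaces=False requests
expanded = element_names(":")   # what process_namespaces=True requests
# raw      -> ['root', 'x', 'a:y']
# expanded -> ['http://defaultns.com/:root',
#              'http://defaultns.com/:x',
#              'http://a.com/:y']
```

Because a URI usually contains ':' itself, choosing a separator that cannot occur in the URIs or local names in use makes the resulting keys easier to split.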

disable_entities: Boolean flag (default True). When True, disables XML entity expansion to prevent XXE (XML External Entity) attacks and other security vulnerabilities.
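On CPython the flag works by installing the two handlers shown in the source above. The sketch below reproduces that step with the standard library's expat to show the effect on an internal entity; text_of is a hypothetical helper and the entity name e is made up for the example.

```python
from xml.parsers import expat

XML = b'<!DOCTYPE a [<!ENTITY e "expanded">]><a>x&e;y</a>'


def text_of(disable_entities):
    """Return the character data expat delivers for XML."""
    chunks = []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    if disable_entities:
        # Same two handlers the CPython branch of parse() installs.
        # Setting DefaultHandler turns off internal entity expansion.
        parser.DefaultHandler = lambda data: None
        parser.ExternalEntityRefHandler = lambda *args: 1
    parser.Parse(XML, True)
    return "".join(chunks)


with_expansion = text_of(False)   # 'xexpandedy'
without_expansion = text_of(True)  # 'xy'
```

The entity reference is silently dropped, not reported as an error, so disabled entities show up as missing text in the output.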

process_comments: Boolean flag (default False). When True, XML comments are included in the output dictionary with the key '#comment' (or custom comment_key from kwargs).
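Comments only reach the handler when a CommentHandler is registered, which is what process_comments=True does in the source above. A minimal standard-library sketch of that hook follows.

```python
from xml.parsers import expat

comments = []
parser = expat.ParserCreate()
# Without this assignment expat discards comments silently.
parser.CommentHandler = comments.append
parser.Parse(b"<root><!-- first --><data>value</data><!--second--></root>", True)
# comments -> [' first ', 'second']
```

expat reports the comment text verbatim, including surrounding spaces. The docstring example above shows trimmed values, which suggests the handler strips whitespace before storing the comment.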

kwargs: Additional keyword arguments passed to _DictSAXHandler. Common options include: xml_attribs (bool, include attributes), attr_prefix (str, default '@'), cdata_key (str, default '#text'), force_list (tuple/callable), item_depth (int, for streaming), item_callback (callable, for streaming), postprocessor (callable, for value transformation), dict_constructor (callable, default dict), strip_whitespace (bool), comment_key (str, default '#comment'), force_cdata (bool), cdata_separator (str).
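The following sketch shows how a few of these keyword arguments change the shape of the result. It assumes the xmltodict package documented here is installed; the sample document is made up for the example.

```python
import xmltodict

XML = '<a id="1"><b>x</b></a>'

# Defaults: attributes are kept and prefixed with '@'.
default = xmltodict.parse(XML)

# xml_attribs=False drops attributes entirely.
no_attribs = xmltodict.parse(XML, xml_attribs=False)

# An empty attr_prefix plus force_list for the 'b' key.
renamed = xmltodict.parse(XML, attr_prefix='', force_list=('b',))

# force_cdata wraps text in a dict under cdata_key even without attributes.
always_text = xmltodict.parse(XML, force_cdata=True, cdata_key='text')
```

An empty attr_prefix removes the guard against collisions between an attribute and a child element of the same name, so it is only safe when the schema rules that out.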

Return Value

Returns a dictionary (or OrderedDict if specified) representing the parsed XML structure. Element names become keys, text content becomes values, attributes are prefixed with '@' by default, and repeated elements become lists. In streaming mode (when item_depth > 0), returns None and calls item_callback for each item at the specified depth instead.
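The idea behind streaming mode can be sketched with the standard library alone: track the current depth and hand each element that closes at item_depth to a callback instead of accumulating it. This is a simplified illustration, not xmltodict's implementation; stream_text and Interrupted are hypothetical names, the path holds only tag names, and the item is only the element's text.

```python
from xml.parsers import expat


class Interrupted(Exception):
    """Hypothetical stand-in for xmltodict's ParsingInterrupted."""


def stream_text(xml_bytes, item_depth, item_callback):
    """Call item_callback(path, text) for each element closed at item_depth."""
    path, text = [], []

    def start(name, attrs):
        path.append(name)
        text.clear()

    def end(name):
        if len(path) == item_depth:
            # A false-ish return value stops the parse.
            if not item_callback(list(path), "".join(text).strip()):
                raise Interrupted()
        path.pop()

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text.append
    parser.Parse(xml_bytes, True)


DOC = b"<a><b>1</b><b>2</b></a>"

seen = []
def keep_going(path, item):
    seen.append((path, item))
    return True

stream_text(DOC, 2, keep_going)   # returns None; results arrive via callback

stopped = []
def stop_after_first(path, item):
    stopped.append(item)
    return False

try:
    stream_text(DOC, 2, stop_after_first)
    interrupted = False
except Interrupted:
    interrupted = True
```

Nothing is retained between callbacks, which is why this style keeps memory use flat for large documents.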

Dependencies

xml.parsers.expat
defusedexpat
inspect

Required Imports

from xml.parsers import expat
from inspect import isgenerator

Conditional/Optional Imports

These imports are only needed under specific conditions:

from defusedexpat import pyexpat as expat

Condition: only if using defusedexpat for enhanced security (alternative to standard expat)

Optional

from collections import OrderedDict

Condition: only if dict_constructor=OrderedDict is passed in kwargs to preserve element order

Optional

Usage Example

import io

import xmltodict

# Basic usage - parse XML string
xml_string = '''<root><person name="John"><age>30</age><city>NYC</city></person></root>'''
result = xmltodict.parse(xml_string)
print(result)
# Output: {'root': {'person': {'@name': 'John', 'age': '30', 'city': 'NYC'}}}

# Parse from a file-like object opened in binary mode
# (io.BytesIO stands in here for open('data.xml', 'rb'))
with io.BytesIO(xml_string.encode('utf-8')) as f:
    result = xmltodict.parse(f)

# Streaming mode: handle_item is called for each element at depth 2
# (each <person>), and parse() itself returns None
def handle_item(path, item):
    print(f'Processing: {item}')
    return True  # a false-ish return value raises ParsingInterrupted

xmltodict.parse(xml_string, item_depth=2, item_callback=handle_item)

# With postprocessor to convert types
def postprocessor(path, key, value):
    if key == 'age':
        return key, int(value)
    return key, value

result = xmltodict.parse(xml_string, postprocessor=postprocessor)
# result['root']['person']['age'] is now the int 30

# Force lists for specific elements
result = xmltodict.parse(xml_string, force_list=('person',))
# result['root']['person'] is now a one-element list

# Process comments
xml_with_comments = '''<root><!-- comment --><data>value</data></root>'''
result = xmltodict.parse(xml_with_comments, process_comments=True)
# The comment is stored under the '#comment' key of <root>

# Use defusedexpat for security (optional third-party package)
try:
    import defusedexpat
except ImportError:
    defusedexpat = None
if defusedexpat is not None:
    result = xmltodict.parse(xml_string, expat=defusedexpat.pyexpat)

Best Practices

Always use disable_entities=True (default) when parsing untrusted XML to prevent XXE attacks
Consider using defusedexpat.pyexpat instead of standard expat for enhanced security when parsing external XML
Use streaming mode (item_depth + item_callback) for large XML files to avoid loading entire document into memory
Provide explicit encoding parameter when working with non-UTF-8 XML documents
Use force_list parameter when you need consistent list structures even for single elements
Implement postprocessor functions for type conversion and data validation during parsing
When processing namespaces, set process_namespaces=True and choose appropriate namespace_separator
For ordered element preservation, pass dict_constructor=OrderedDict in kwargs
Handle ParsingInterrupted exception when using item_callback that may return False
Test with sample data before processing large production XML files

Similar Components

AI-powered semantic similarity - components with related functionality:

function unparse 66.1% similar

Converts a Python dictionary into an XML document string, serving as the reverse operation of XML parsing. Supports customizable formatting, encoding, and XML generation options.
From: /tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/xmltodict.py
class _DictSAXHandler 63.2% similar

A SAX (Simple API for XML) event handler that converts XML documents into Python dictionaries, with extensive configuration options for handling attributes, namespaces, CDATA, and structure.
From: /tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/xmltodict.py
function _emit 59.1% similar

Recursively converts a dictionary structure into XML SAX events, emitting them through a content handler for XML generation.
From: /tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/xmltodict.py
function _process_namespace 42.7% similar

Processes XML namespace prefixes in element/attribute names by resolving them against a namespace dictionary and reconstructing the full qualified name.
From: /tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/xmltodict.py
function parse_log_line 36.9% similar

Parses a structured log line string and extracts timestamp, logger name, log level, and message components into a dictionary.
From: /tf/active/vicechatdev/SPFCsync/monitor.py
