_DictSAXHandler - Code Extractor

class _DictSAXHandler

Maturity: 46

A SAX (Simple API for XML) event handler that converts XML documents into Python dictionaries, with extensive configuration options for handling attributes, namespaces, CDATA, and structure.

File:
/tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/xmltodict.py

Lines:
43 - 199

Complexity:
complex

Purpose

This class serves as a SAX parser handler for converting XML to dictionary representations. It processes XML parsing events (startElement, endElement, characters, etc.) and builds a nested dictionary structure. It supports advanced features like namespace handling, attribute prefixing, CDATA management, custom postprocessing, forcing list structures, and streaming parsing with callbacks at specific depths. The handler is designed to work with Python's SAX parser infrastructure and can handle complex XML documents with namespaces, comments, and mixed content.

Source Code

class _DictSAXHandler(object):
    def __init__(self,
                 item_depth=0,
                 item_callback=lambda *args: True,
                 xml_attribs=True,
                 attr_prefix='@',
                 cdata_key='#text',
                 force_cdata=False,
                 cdata_separator='',
                 postprocessor=None,
                 dict_constructor=_dict,
                 strip_whitespace=True,
                 namespace_separator=':',
                 namespaces=None,
                 force_list=None,
                 comment_key='#comment'):
        self.path = []
        self.stack = []
        self.data = []
        self.item = None
        self.item_depth = item_depth
        self.xml_attribs = xml_attribs
        self.item_callback = item_callback
        self.attr_prefix = attr_prefix
        self.cdata_key = cdata_key
        self.force_cdata = force_cdata
        self.cdata_separator = cdata_separator
        self.postprocessor = postprocessor
        self.dict_constructor = dict_constructor
        self.strip_whitespace = strip_whitespace
        self.namespace_separator = namespace_separator
        self.namespaces = namespaces
        self.namespace_declarations = dict_constructor()
        self.force_list = force_list
        self.comment_key = comment_key

    def _build_name(self, full_name):
        if self.namespaces is None:
            return full_name
        i = full_name.rfind(self.namespace_separator)
        if i == -1:
            return full_name
        namespace, name = full_name[:i], full_name[i+1:]
        try:
            short_namespace = self.namespaces[namespace]
        except KeyError:
            short_namespace = namespace
        if not short_namespace:
            return name
        else:
            return self.namespace_separator.join((short_namespace, name))

    def _attrs_to_dict(self, attrs):
        if isinstance(attrs, dict):
            return attrs
        return self.dict_constructor(zip(attrs[0::2], attrs[1::2]))

    def startNamespaceDecl(self, prefix, uri):
        self.namespace_declarations[prefix or ''] = uri

    def startElement(self, full_name, attrs):
        name = self._build_name(full_name)
        attrs = self._attrs_to_dict(attrs)
        if attrs and self.namespace_declarations:
            attrs['xmlns'] = self.namespace_declarations
            self.namespace_declarations = self.dict_constructor()
        self.path.append((name, attrs or None))
        if len(self.path) > self.item_depth:
            self.stack.append((self.item, self.data))
            if self.xml_attribs:
                attr_entries = []
                for key, value in attrs.items():
                    key = self.attr_prefix+self._build_name(key)
                    if self.postprocessor:
                        entry = self.postprocessor(self.path, key, value)
                    else:
                        entry = (key, value)
                    if entry:
                        attr_entries.append(entry)
                attrs = self.dict_constructor(attr_entries)
            else:
                attrs = None
            self.item = attrs or None
            self.data = []

    def endElement(self, full_name):
        name = self._build_name(full_name)
        if len(self.path) == self.item_depth:
            item = self.item
            if item is None:
                item = (None if not self.data
                        else self.cdata_separator.join(self.data))

            should_continue = self.item_callback(self.path, item)
            if not should_continue:
                raise ParsingInterrupted()
        if self.stack:
            data = (None if not self.data
                    else self.cdata_separator.join(self.data))
            item = self.item
            self.item, self.data = self.stack.pop()
            if self.strip_whitespace and data:
                data = data.strip() or None
            if data and self.force_cdata and item is None:
                item = self.dict_constructor()
            if item is not None:
                if data:
                    self.push_data(item, self.cdata_key, data)
                self.item = self.push_data(self.item, name, item)
            else:
                self.item = self.push_data(self.item, name, data)
        else:
            self.item = None
            self.data = []
        self.path.pop()

    def characters(self, data):
        if not self.data:
            self.data = [data]
        else:
            self.data.append(data)

    def comments(self, data):
        if self.strip_whitespace:
            data = data.strip()
        self.item = self.push_data(self.item, self.comment_key, data)

    def push_data(self, item, key, data):
        if self.postprocessor is not None:
            result = self.postprocessor(self.path, key, data)
            if result is None:
                return item
            key, data = result
        if item is None:
            item = self.dict_constructor()
        try:
            value = item[key]
            if isinstance(value, list):
                value.append(data)
            else:
                item[key] = [value, data]
        except KeyError:
            if self._should_force_list(key, data):
                item[key] = [data]
            else:
                item[key] = data
        return item

    def _should_force_list(self, key, value):
        if not self.force_list:
            return False
        if isinstance(self.force_list, bool):
            return self.force_list
        try:
            return key in self.force_list
        except TypeError:
            return self.force_list(self.path[:-1], key, value)

Parameters

Name	Type	Default	Kind
`bases`	object	-

Parameter Details

item_depth: Integer specifying the depth at which to trigger item_callback. 0 means root level. Used for streaming large XML documents by processing items at a specific nesting level.

item_callback: Callable function invoked when an element at item_depth is fully parsed. Receives (path, item) and should return True to continue parsing or False to stop. Default is lambda that always returns True.

xml_attribs: Boolean indicating whether to include XML attributes in the output dictionary. If True, attributes are added with attr_prefix. Default is True.

attr_prefix: String prefix for attribute keys in the output dictionary. Default is '@'. For example, an attribute 'id' becomes '@id'.

cdata_key: String key name for character data (text content) in elements that also have children or attributes. Default is '#text'.

force_cdata: Boolean that forces creation of a dictionary with cdata_key even when element has only text content. Default is False.

cdata_separator: String used to join multiple character data segments within an element. Default is empty string ''.

postprocessor: Optional callable for transforming parsed data. Receives (path, key, data) and should return None to skip or (new_key, new_data) tuple. Default is None.

dict_constructor: Callable that creates dictionary objects. Allows using OrderedDict or custom dict types. Default is _dict (typically OrderedDict).

strip_whitespace: Boolean indicating whether to strip leading/trailing whitespace from text content. Default is True.

namespace_separator: String separator for namespace prefix and local name. Default is ':'. Used when building qualified names.

namespaces: Optional dictionary mapping namespace URIs to prefixes for namespace abbreviation. None means no namespace processing.

force_list: Controls when to force values into lists. Can be: None (no forcing), True (always force), False (never force), iterable of keys to force, or callable(path, key, value) returning boolean.

comment_key: String key name for XML comments in the output dictionary. Default is '#comment'.

Return Value

Instantiation returns a _DictSAXHandler object that can be used as a SAX ContentHandler. The handler maintains state during parsing and builds the final dictionary structure in the 'item' attribute. Methods like startElement, endElement, and characters return None (they modify internal state). The push_data method returns the updated item dictionary.

Class Interface

Methods

`init(self, item_depth=0, item_callback=lambda *args: True, xml_attribs=True, attr_prefix='@', cdata_key='#text', force_cdata=False, cdata_separator='', postprocessor=None, dict_constructor=_dict, strip_whitespace=True, namespace_separator=':', namespaces=None, force_list=None, comment_key='#comment')`

Purpose: Initializes the SAX handler with configuration options for XML-to-dict conversion

Parameters:

item_depth: Depth level for triggering item callbacks
item_callback: Function called when item at specified depth is parsed
xml_attribs: Whether to include XML attributes
attr_prefix: Prefix for attribute keys
cdata_key: Key name for character data
force_cdata: Force CDATA key creation
cdata_separator: Separator for joining character data
postprocessor: Optional data transformation function
dict_constructor: Function to create dictionary objects
strip_whitespace: Whether to strip whitespace from text
namespace_separator: Separator for namespace and local name
namespaces: Namespace URI to prefix mapping
force_list: Configuration for forcing list structures
comment_key: Key name for XML comments

Returns: None (constructor)

`_build_name(self, full_name) -> str`

Purpose: Builds element name by processing namespaces according to configuration

Parameters:

full_name: Full qualified name potentially containing namespace

Returns: Processed name with namespace prefix applied or removed based on namespaces mapping

`_attrs_to_dict(self, attrs) -> dict`

Purpose: Converts SAX attributes object to dictionary using configured dict_constructor

Parameters:

attrs: SAX attributes object or dictionary

Returns: Dictionary representation of attributes

`startNamespaceDecl(self, prefix, uri) -> None`

Purpose: SAX event handler for namespace declarations, stores namespace mappings

Parameters:

prefix: Namespace prefix (empty string for default namespace)
uri: Namespace URI

Returns: None

`startElement(self, full_name, attrs) -> None`

Purpose: SAX event handler for element start, processes element name and attributes, updates parsing state

Parameters:

full_name: Full element name including namespace
attrs: Element attributes

Returns: None

`endElement(self, full_name) -> None`

Purpose: SAX event handler for element end, finalizes element processing, triggers callbacks, and updates item structure

Parameters:

full_name: Full element name including namespace

Returns: None (may raise ParsingInterrupted if callback returns False)

`characters(self, data) -> None`

Purpose: SAX event handler for character data, accumulates text content in data list

Parameters:

data: Character data string

Returns: None

`comments(self, data) -> None`

Purpose: SAX event handler for XML comments, adds comment to current item with comment_key

Parameters:

data: Comment text

Returns: None

`push_data(self, item, key, data) -> dict`

Purpose: Adds data to item dictionary, handling duplicate keys by creating lists and applying postprocessor

Parameters:

item: Current item dictionary or None
key: Key to add
data: Value to add

Returns: Updated item dictionary with new data added

`_should_force_list(self, key, value) -> bool`

Purpose: Determines if a key should be forced into a list structure based on force_list configuration

Parameters:

key: Dictionary key being evaluated
value: Value associated with the key

Returns: Boolean indicating whether to force list structure

Attributes

Name	Type	Description	Scope
`path`	list	Stack of (name, attrs) tuples representing current path in XML tree during parsing	instance
`stack`	list	Stack of (item, data) tuples for maintaining parent context during nested element parsing	instance
`data`	list	Accumulator for character data segments within current element	instance
`item`	dict or None	Current item being built or final parsed result after parsing completes	instance
`item_depth`	int	Depth level at which to trigger item_callback	instance
`xml_attribs`	bool	Whether to include XML attributes in output	instance
`item_callback`	callable	Function called when element at item_depth is fully parsed	instance
`attr_prefix`	str	Prefix string for attribute keys in output dictionary	instance
`cdata_key`	str	Key name for character data in output dictionary	instance
`force_cdata`	bool	Whether to force creation of dictionary with cdata_key for text-only elements	instance
`cdata_separator`	str	String used to join multiple character data segments	instance
`postprocessor`	callable or None	Optional function for transforming parsed data	instance
`dict_constructor`	callable	Function used to create dictionary objects (e.g., dict, OrderedDict)	instance
`strip_whitespace`	bool	Whether to strip leading/trailing whitespace from text content	instance
`namespace_separator`	str	Separator between namespace prefix and local name	instance
`namespaces`	dict or None	Mapping of namespace URIs to prefixes for namespace processing	instance
`namespace_declarations`	dict	Temporary storage for namespace declarations in current element	instance
`force_list`	bool, iterable, callable, or None	Configuration for forcing values into list structures	instance
`comment_key`	str	Key name for XML comments in output dictionary	instance

Dependencies

xml.sax
xml.parsers.expat
defusedexpat

Required Imports

from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl
from xml.parsers import expat

Conditional/Optional Imports

These imports are only needed under specific conditions:

from defusedexpat import pyexpat as expat

Condition: when defusedexpat is available for secure XML parsing

Optional

from collections import OrderedDict as _dict

Condition: for maintaining element order in parsed dictionaries

Required (conditional)

from cStringIO import StringIO

Condition: Python 2.x only

Optional

from io import StringIO

Condition: Python 3.x only

Optional

Usage Example

from collections import OrderedDict
from xml.sax import make_parser

# Create handler with custom configuration
handler = _DictSAXHandler(
    item_depth=0,
    xml_attribs=True,
    attr_prefix='@',
    cdata_key='#text',
    strip_whitespace=True,
    dict_constructor=OrderedDict
)

# Use with SAX parser
parser = make_parser()
parser.setContentHandler(handler)
parser.parse('example.xml')

# Access parsed result
result = handler.item

# Example with streaming callback
def process_item(path, item):
    print(f'Parsed item at {path}: {item}')
    return True  # Continue parsing

streaming_handler = _DictSAXHandler(
    item_depth=2,
    item_callback=process_item
)

# Example with namespace handling
ns_handler = _DictSAXHandler(
    namespaces={'http://example.com/ns': 'ex'},
    namespace_separator=':'
)

# Example with force_list
list_handler = _DictSAXHandler(
    force_list=['item', 'entry']  # Always make these keys lists
)

Best Practices

This handler is designed to be used with Python's SAX parser infrastructure via make_parser() and setContentHandler()
The handler maintains internal state (path, stack, data, item) that is modified during parsing - do not reuse the same handler instance for multiple documents
Use item_depth and item_callback for memory-efficient streaming parsing of large XML documents
The item attribute contains the final parsed result after parsing completes
When using postprocessor, return None to skip an element or (key, value) tuple to transform it
The force_list parameter is crucial for ensuring consistent structure when elements may appear once or multiple times
Set xml_attribs=False if you don't need attributes in the output to simplify the resulting dictionary
Use namespace_separator and namespaces parameters together for proper namespace handling
The handler can raise ParsingInterrupted exception if item_callback returns False - ensure this exception is defined
Character data is accumulated in the data list and joined with cdata_separator when the element closes
The stack attribute maintains parent context during nested element parsing
Path tracking (self.path) provides context for postprocessor and item_callback functions

Similar Components

AI-powered semantic similarity - components with related functionality:

function _emit 70.6% similar

Recursively converts a dictionary structure into XML SAX events, emitting them through a content handler for XML generation.
From: /tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/xmltodict.py
function parse 63.2% similar

Parses XML input (string, file-like object, or generator) and converts it into a Python dictionary representation with configurable options for attributes, namespaces, comments, and streaming.
From: /tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/xmltodict.py
function unparse 59.5% similar

Converts a Python dictionary into an XML document string, serving as the reverse operation of XML parsing. Supports customizable formatting, encoding, and XML generation options.
From: /tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/xmltodict.py
function _process_namespace 47.5% similar

Processes XML namespace prefixes in element/attribute names by resolving them against a namespace dictionary and reconstructing the full qualified name.
From: /tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/xmltodict.py
class SimpleDataHandle 46.0% similar

A data handler class that manages multiple data sources with different types (dataframes, vector stores, databases) and their associated processing configurations.
From: /tf/active/vicechatdev/OneCo_hybrid_RAG copy.py

← Back to Browse

Assistant

Hi! I can help improve this code. Tell me what you'd like to enhance (e.g., "add error handling", "optimize performance", "improve readability", "add type hints").

Code Comparison

Original Code

                            class _DictSAXHandler(object):
    def __init__(self,
                 item_depth=0,
                 item_callback=lambda *args: True,
                 xml_attribs=True,
                 attr_prefix='@',
                 cdata_key='#text',
                 force_cdata=False,
                 cdata_separator='',
                 postprocessor=None,
                 dict_constructor=_dict,
                 strip_whitespace=True,
                 namespace_separator=':',
                 namespaces=None,
                 force_list=None,
                 comment_key='#comment'):
        self.path = []
        self.stack = []
        self.data = []
        self.item = None
        self.item_depth = item_depth
        self.xml_attribs = xml_attribs
        self.item_callback = item_callback
        self.attr_prefix = attr_prefix
        self.cdata_key = cdata_key
        self.force_cdata = force_cdata
        self.cdata_separator = cdata_separator
        self.postprocessor = postprocessor
        self.dict_constructor = dict_constructor
        self.strip_whitespace = strip_whitespace
        self.namespace_separator = namespace_separator
        self.namespaces = namespaces
        self.namespace_declarations = dict_constructor()
        self.force_list = force_list
        self.comment_key = comment_key

    def _build_name(self, full_name):
        if self.namespaces is None:
            return full_name
        i = full_name.rfind(self.namespace_separator)
        if i == -1:
            return full_name
        namespace, name = full_name[:i], full_name[i+1:]
        try:
            short_namespace = self.namespaces[namespace]
        except KeyError:
            short_namespace = namespace
        if not short_namespace:
            return name
        else:
            return self.namespace_separator.join((short_namespace, name))

    def _attrs_to_dict(self, attrs):
        if isinstance(attrs, dict):
            return attrs
        return self.dict_constructor(zip(attrs[0::2], attrs[1::2]))

    def startNamespaceDecl(self, prefix, uri):
        self.namespace_declarations[prefix or ''] = uri

    def startElement(self, full_name, attrs):
        name = self._build_name(full_name)
        attrs = self._attrs_to_dict(attrs)
        if attrs and self.namespace_declarations:
            attrs['xmlns'] = self.namespace_declarations
            self.namespace_declarations = self.dict_constructor()
        self.path.append((name, attrs or None))
        if len(self.path) > self.item_depth:
            self.stack.append((self.item, self.data))
            if self.xml_attribs:
                attr_entries = []
                for key, value in attrs.items():
                    key = self.attr_prefix+self._build_name(key)
                    if self.postprocessor:
                        entry = self.postprocessor(self.path, key, value)
                    else:
                        entry = (key, value)
                    if entry:
                        attr_entries.append(entry)
                attrs = self.dict_constructor(attr_entries)
            else:
                attrs = None
            self.item = attrs or None
            self.data = []

    def endElement(self, full_name):
        name = self._build_name(full_name)
        if len(self.path) == self.item_depth:
            item = self.item
            if item is None:
                item = (None if not self.data
                        else self.cdata_separator.join(self.data))

            should_continue = self.item_callback(self.path, item)
            if not should_continue:
                raise ParsingInterrupted()
        if self.stack:
            data = (None if not self.data
                    else self.cdata_separator.join(self.data))
            item = self.item
            self.item, self.data = self.stack.pop()
            if self.strip_whitespace and data:
                data = data.strip() or None
            if data and self.force_cdata and item is None:
                item = self.dict_constructor()
            if item is not None:
                if data:
                    self.push_data(item, self.cdata_key, data)
                self.item = self.push_data(self.item, name, item)
            else:
                self.item = self.push_data(self.item, name, data)
        else:
            self.item = None
            self.data = []
        self.path.pop()

    def characters(self, data):
        if not self.data:
            self.data = [data]
        else:
            self.data.append(data)

    def comments(self, data):
        if self.strip_whitespace:
            data = data.strip()
        self.item = self.push_data(self.item, self.comment_key, data)

    def push_data(self, item, key, data):
        if self.postprocessor is not None:
            result = self.postprocessor(self.path, key, data)
            if result is None:
                return item
            key, data = result
        if item is None:
            item = self.dict_constructor()
        try:
            value = item[key]
            if isinstance(value, list):
                value.append(data)
            else:
                item[key] = [value, data]
        except KeyError:
            if self._should_force_list(key, data):
                item[key] = [data]
            else:
                item[key] = data
        return item

    def _should_force_list(self, key, value):
        if not self.force_list:
            return False
        if isinstance(self.force_list, bool):
            return self.force_list
        try:
            return key in self.force_list
        except TypeError:
            return self.force_list(self.path[:-1], key, value)
                        

Improved Code

🔍 Code Extractor

class _DictSAXHandler

Purpose

Source Code

Parameters

Parameter Details

Return Value

Class Interface

Methods

`_build_name(self, full_name) -> str`

`_attrs_to_dict(self, attrs) -> dict`

`startNamespaceDecl(self, prefix, uri) -> None`

`startElement(self, full_name, attrs) -> None`

`endElement(self, full_name) -> None`

`characters(self, data) -> None`

`comments(self, data) -> None`

`push_data(self, item, key, data) -> dict`

`_should_force_list(self, key, value) -> bool`

Attributes

Dependencies

Required Imports

Conditional/Optional Imports

Usage Example

Best Practices

Tags

Similar Components

function _emit 70.6% similar

function parse 63.2% similar

function unparse 59.5% similar

function _process_namespace 47.5% similar

class SimpleDataHandle 46.0% similar

class _DictSAXHandler

Purpose

Source Code

Parameters

Parameter Details

Return Value

Class Interface

Methods

_build_name(self, full_name) -> str

_attrs_to_dict(self, attrs) -> dict

startNamespaceDecl(self, prefix, uri) -> None

startElement(self, full_name, attrs) -> None

endElement(self, full_name) -> None

characters(self, data) -> None

comments(self, data) -> None

push_data(self, item, key, data) -> dict

_should_force_list(self, key, value) -> bool

Attributes

Dependencies

Required Imports

Conditional/Optional Imports

Usage Example

Best Practices

Tags

Similar Components

function _emit 70.6% similar

function parse 63.2% similar

function unparse 59.5% similar

function _process_namespace 47.5% similar

class SimpleDataHandle 46.0% similar

✨ Improve Code: _DictSAXHandler

Code Comparison

`_build_name(self, full_name) -> str`

`_attrs_to_dict(self, attrs) -> dict`

`startNamespaceDecl(self, prefix, uri) -> None`

`startElement(self, full_name, attrs) -> None`

`endElement(self, full_name) -> None`

`characters(self, data) -> None`

`comments(self, data) -> None`

`push_data(self, item, key, data) -> dict`

`_should_force_list(self, key, value) -> bool`