SimpleDataHandle - Code Extractor

class SimpleDataHandle

Maturity: 40

A data handler class that manages multiple data sources with different types (dataframes, vector stores, databases) and their associated processing configurations.

File:
/tf/active/vicechatdev/OneCo_hybrid_RAG copy.py

Lines:
718 - 787

Complexity:
moderate

Purpose

SimpleDataHandle provides a centralized registry for managing heterogeneous data sources in a data processing or RAG (Retrieval-Augmented Generation) pipeline. It stores data along with metadata including type, filters, processing steps, inclusion limits, and instructions for how to use each data source. The class automatically configures default settings based on data type and can convert documents to vector stores using FAISS and OpenAI embeddings.
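To make the document-to-vector-store conversion more concrete, here is a minimal, dependency-free sketch of what a flat L2 index with a document store does. It is not the FAISS or LangChain implementation: `toy_embed` and `ToyFlatL2Store` are hypothetical stand-ins, and the letter-frequency "embedding" only replaces the OpenAI embedding call for illustration.

```python
import math
from uuid import uuid4


def toy_embed(text: str) -> list:
    """Hypothetical stand-in for an embedding model: a 26-dim letter-frequency vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


class ToyFlatL2Store:
    """Brute-force L2 search over stored vectors (the role a flat L2 index plays)."""

    def __init__(self) -> None:
        self.vectors = {}  # id -> embedding vector
        self.docs = {}     # id -> original document text

    def add_documents(self, documents: list) -> list:
        # One random UUID per document, mirroring the id scheme in add_data.
        ids = [str(uuid4()) for _ in documents]
        for doc_id, doc in zip(ids, documents):
            self.vectors[doc_id] = toy_embed(doc)
            self.docs[doc_id] = doc
        return ids

    def similarity_search(self, query: str, k: int = 1) -> list:
        # Rank every stored vector by Euclidean distance to the query vector.
        q = toy_embed(query)
        ranked = sorted(self.vectors, key=lambda i: math.dist(q, self.vectors[i]))
        return [self.docs[i] for i in ranked[:k]]


store = ToyFlatL2Store()
store.add_documents(["aaa bbb", "xyz xyz"])
nearest = store.similarity_search("aab", k=1)
print(nearest)
```

A flat index compares the query against every stored vector, which is simple and exact but scales linearly with the number of documents.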

Source Code

class SimpleDataHandle:
     
    def __init__(self):
        self.handlers = {}
        return
     
    def add_data(self, name:str, type:str, data:Any, filters:str="", processing_steps:List[str]=[], inclusions:int=10,instructions:str=""):
        ## Default values for type, filters, processing_steps, instructions
        if type == "":
            type = "text"
        if type=="dataframe":
            filters=""
            if processing_steps==[]:
                processing_steps=["markdown"]
            if instructions=="":
                instructions="""Start with a summary of the internal data, using summary tables when possible. If the internal data is presented as chemical formulas in SMILES format, try to find the corresponding chemical names and properties and report those in your answer.
                            Use them to compare it to other chemical data in the external sources."""
        if type in ("vectorstore", "to_vectorstore"):  # applies only to the two vector-store types
            if processing_steps==[]:
                processing_steps=["similarity"]
            if instructions=="":
                instructions="""Provide a summary of the given context data extracted from lab data and reports and from scientific literature, using summary tables when possible.
                            """
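        if type =="to_vectorstore":
            # Build an in-memory FAISS store from the given documents.
            embeddings = OpenAIEmbeddings()
            # Embed a probe string once to discover the embedding dimension.
            index = faiss.IndexFlatL2(len(embeddings.embed_query("hello world")))
            vector_store = FAISS(
                embedding_function=embeddings,
                docstore=InMemoryDocstore(),
                index_to_docstore_id={},
                index=index
            )
            # One random UUID per document serves as its docstore id.
            uuids = [str(uuid4()) for _ in range(len(data))]
            vector_store.add_documents(
                documents=data,
                ids=uuids,
            )
            # Store the populated vector store and record the final type.
            data=vector_store
            type="vectorstore"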
        if type == "db_search":
            if processing_steps==[]:
                processing_steps=["similarity"]
            if instructions=="":
                instructions="""Provide a summary of the given context data extracted from lab data and reports and from scientific literature, using summary tables when possible.
                            """
        if type=="chromaDB":
            if processing_steps==[]:
                processing_steps=["similarity"]
            if instructions=="":    
                instructions="""Provide a summary of the given context data extracted from lab data and reports and from scientific literature, using summary tables when possible.
                            """
        
        self.handlers[name] = {
            "type" : type,
            "data" : data,
            "filters" : filters,
            "processing_steps" : processing_steps,
            "inclusions" : inclusions,
            "instructions" : instructions
        }
        return
     
    def remove_data(self, name:str):
        if name in self.handlers:
            del self.handlers[name]
        return
    
    def clear_data(self):
        self.handlers = {}
        return
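
When one branch should apply to more than one type string, as with the vector-store defaults in add_data, note that an expression of the form `type == "vectorstore" or "to_vectorstore"` is truthy for every value of `type`, because the non-empty string literal on the right is itself truthy. The short sketch below contrasts that form with a membership test; the function names are hypothetical and exist only for this illustration.

```python
def chained_or(kind: str) -> bool:
    # Parsed as (kind == "vectorstore") or "to_vectorstore".
    # The right-hand operand is a non-empty string, so the result is always truthy.
    return bool(kind == "vectorstore" or "to_vectorstore")


def membership(kind: str) -> bool:
    # True only when kind is one of the two listed values.
    return kind in ("vectorstore", "to_vectorstore")


results = {
    kind: (chained_or(kind), membership(kind))
    for kind in ("text", "dataframe", "vectorstore")
}
print(results)
```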

Parameters

Name	Type	Default	Kind
`bases`	-	-	-

Parameter Details

__init__: No parameters required. Initializes an empty handlers dictionary to store data sources.

Return Value

The class constructor returns None. The add_data, remove_data, and clear_data methods all return None (they modify internal state). The primary value is accessed through the handlers attribute which contains a dictionary mapping data source names to their configuration dictionaries.

Class Interface

Methods

`__init__(self) -> None`

Purpose: Initialize a new SimpleDataHandle instance with an empty handlers dictionary

Returns: None

`add_data(self, name: str, type: str, data: Any, filters: str = '', processing_steps: List[str] = [], inclusions: int = 10, instructions: str = '') -> None`

Purpose: Add a new data source to the handler with associated configuration. Automatically sets defaults based on type and can convert documents to FAISS vector stores.

Parameters:

name: Unique identifier for this data source
type: Type of data: 'text', 'dataframe', 'vectorstore', 'to_vectorstore', 'db_search', or 'chromaDB'
data: The actual data object (text, DataFrame, vector store, list of Documents, etc.)
filters: Filter criteria for the data (defaults to empty string, auto-cleared for dataframes)
processing_steps: List of processing steps to apply (defaults vary by type: ['markdown'] for 'dataframe'; ['similarity'] for 'vectorstore', 'to_vectorstore', 'db_search', and 'chromaDB')
inclusions: Number of items to include from this source (default 10)
instructions: Instructions for how to use this data source (defaults vary by type with specific guidance for chemical data, lab reports, etc.)

Returns: None - modifies internal handlers dictionary
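
The per-type defaults listed above can be summarized as a small lookup. The following is a sketch under stated assumptions rather than the class's own code: `resolve_defaults` and `SIMILARITY_TYPES` are hypothetical names, only the type and processing_steps defaults are modeled (not filters or instructions), and the behavior follows the parameter descriptions above.

```python
from typing import List, Optional, Tuple

# Types documented as defaulting to a similarity search step.
SIMILARITY_TYPES = {"vectorstore", "to_vectorstore", "db_search", "chromaDB"}


def resolve_defaults(
    type_: str, processing_steps: Optional[List[str]] = None
) -> Tuple[str, List[str]]:
    """Hypothetical helper: return (effective type, effective processing steps)."""
    effective_type = type_ or "text"          # empty type falls back to "text"
    steps = list(processing_steps or [])      # copy so the caller's list is not shared
    if not steps:
        if effective_type == "dataframe":
            steps = ["markdown"]
        elif effective_type in SIMILARITY_TYPES:
            steps = ["similarity"]
    return effective_type, steps


print(resolve_defaults(""))
print(resolve_defaults("dataframe"))
print(resolve_defaults("chromaDB"))
print(resolve_defaults("dataframe", ["custom_step"]))
```

Explicitly passed processing steps are kept as given; the defaults only apply when the list is empty.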

`remove_data(self, name: str) -> None`

Purpose: Remove a data source from the handler by name

Parameters:

name: The name of the data source to remove

Returns: None - modifies internal handlers dictionary by deleting the entry if it exists
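
Because remove_data ignores unknown names and returns None, a caller cannot tell whether anything was deleted. The sketch below shows one possible variant on a plain dictionary that reports the outcome; `remove_entry` is a hypothetical helper, not part of the class.

```python
def remove_entry(registry: dict, name: str) -> bool:
    """Hypothetical variant: delete the entry if present and report whether it existed."""
    # dict.pop with a default never raises KeyError for a missing key.
    return registry.pop(name, None) is not None


handlers = {"reports": {"type": "text"}, "assays": {"type": "dataframe"}}

removed_existing = remove_entry(handlers, "reports")
removed_missing = remove_entry(handlers, "does_not_exist")
remaining = sorted(handlers)
print(removed_existing, removed_missing, remaining)
```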

`clear_data(self) -> None`

Purpose: Remove all data sources from the handler

Returns: None - resets handlers dictionary to empty
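
One consequence of resetting by assignment is worth illustrating: `self.handlers = {}` rebinds the attribute to a new dictionary, so any reference to the old dictionary taken earlier still sees the old entries. The sketch uses a minimal stand-in class (`Registry` is a hypothetical name) with the same clear_data body.

```python
class Registry:
    """Minimal stand-in with the same clear_data behaviour as described above."""

    def __init__(self) -> None:
        self.handlers = {}

    def clear_data(self) -> None:
        # Rebinds the attribute; the previous dict object is left untouched.
        self.handlers = {}


reg = Registry()
reg.handlers["notes"] = {"type": "text"}
old_view = reg.handlers   # external reference taken before clearing
reg.clear_data()

stale_count = len(old_view)       # the old dict still holds its entry
fresh_count = len(reg.handlers)   # the new dict is empty
print(stale_count, fresh_count)
```

If shared references should be emptied as well, calling `.clear()` on the existing dictionary would be an alternative design; whether that matters depends on how the surrounding code holds on to `handlers`.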

Attributes

Name	Type	Description	Scope
`handlers`	Dict[str, Dict[str, Any]]	Dictionary mapping data source names to their configuration dictionaries. Each configuration contains keys: 'type', 'data', 'filters', 'processing_steps', 'inclusions', and 'instructions'	instance
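
The shape of one configuration entry can be written down as a typed dictionary. This is an illustrative sketch: `HandlerEntry` is a hypothetical name, and the class itself stores plain dictionaries without any type enforcement.

```python
from typing import Any, Dict, List, TypedDict


class HandlerEntry(TypedDict):
    """Keys stored for each data source, as listed in the attribute description."""
    type: str
    data: Any
    filters: str
    processing_steps: List[str]
    inclusions: int
    instructions: str


handlers: Dict[str, HandlerEntry] = {}
handlers["notes"] = HandlerEntry(
    type="text",
    data="Some text content",
    filters="",
    processing_steps=[],
    inclusions=10,
    instructions="",
)

# Example read access: map each source name to its type.
by_type = {name: entry["type"] for name, entry in handlers.items()}
print(by_type)
```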

Dependencies

typing
panel
langchain_community
langchain_openai
uuid
pandas
sentence_transformers
faiss
numpy
neo4j
openai
chromadb
tiktoken
pybtex

Required Imports

from typing import List, Any, Dict
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
from uuid import uuid4
import faiss

Conditional/Optional Imports

These imports are only needed under specific conditions:

from langchain_community.embeddings import OpenAIEmbeddings

Condition: only when adding data with type='to_vectorstore'

Required (conditional)

from langchain_community.vectorstores import FAISS

Condition: only when adding data with type='to_vectorstore'

Required (conditional)

import faiss

Condition: only when adding data with type='to_vectorstore'

Required (conditional)
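
Since these imports matter only on the 'to_vectorstore' path, it can be useful to check whether the corresponding packages are installed before choosing that type. The sketch below uses the standard library's `importlib.util.find_spec`; `missing_vectorstore_modules` and `VECTORSTORE_MODULES` are hypothetical names introduced for this example.

```python
from importlib.util import find_spec

# Top-level modules used by the 'to_vectorstore' conversion.
VECTORSTORE_MODULES = ("faiss", "langchain_community")


def missing_vectorstore_modules() -> list:
    """Return the names of the listed modules that cannot be found in this environment."""
    return [name for name in VECTORSTORE_MODULES if find_spec(name) is None]


missing = missing_vectorstore_modules()
if missing:
    print("Not installed, 'to_vectorstore' would fail:", ", ".join(missing))
else:
    print("All vector store dependencies are importable.")
```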

Usage Example

# Instantiate the handler
handler = SimpleDataHandle()

# Add a text data source
handler.add_data(
    name='my_text_data',
    type='text',
    data='Some text content',
    inclusions=5
)

# Add a dataframe
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
handler.add_data(
    name='my_dataframe',
    type='dataframe',
    data=df,
    processing_steps=['markdown']
)

# Add documents and convert to vector store
# (requires configured OpenAI API credentials, since embeddings are computed via the API)
from langchain_core.documents import Document
docs = [Document(page_content='doc1'), Document(page_content='doc2')]
handler.add_data(
    name='my_vectors',
    type='to_vectorstore',
    data=docs
)

# Access stored data
print(handler.handlers['my_text_data'])

# Remove a data source
handler.remove_data('my_text_data')

# Clear all data
handler.clear_data()

Best Practices

Always instantiate the class before adding data sources
Use unique, descriptive names when adding data: add_data silently overwrites any existing entry with the same name in the handlers dictionary
When using type='to_vectorstore', ensure data is a list of Document objects compatible with LangChain
The class modifies the type from 'to_vectorstore' to 'vectorstore' after conversion, so check the final type in handlers
Default processing_steps and instructions vary by data type - review the source to understand defaults
The handlers attribute is a public dictionary that can be directly accessed or modified
No validation is performed on data types or parameters - ensure correct types are passed
Methods return None, so check the handlers dictionary to verify operations succeeded
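For type='to_vectorstore', OpenAI API credentials must be configured before calling add_data, because the conversion computes embeddings; an existing store passed with type='vectorstore' is stored as-is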
The inclusions parameter defaults to 10 and likely controls how many items to retrieve/include from the data source
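
Because the class performs no validation, a caller may want a pre-flight check before calling add_data. The following is a sketch only: `check_add_data_args` and `KNOWN_TYPES` are hypothetical, and treating inclusions below 1 as a problem is an assumption, not something the source states.

```python
from typing import Iterable, List

# Type strings mentioned in the add_data documentation ("" falls back to "text").
KNOWN_TYPES = {"", "text", "dataframe", "vectorstore", "to_vectorstore", "db_search", "chromaDB"}


def check_add_data_args(
    existing_names: Iterable[str], name: str, type_: str, inclusions: int
) -> List[str]:
    """Hypothetical pre-flight check; returns a list of problems (empty if none)."""
    problems: List[str] = []
    if not isinstance(name, str) or not name:
        problems.append("name must be a non-empty string")
    elif name in set(existing_names):
        problems.append(f"name {name!r} already exists and would be overwritten")
    if type_ not in KNOWN_TYPES:
        problems.append(f"unknown type {type_!r}")
    # Assumption: a useful inclusions value is a positive integer.
    if not isinstance(inclusions, int) or inclusions < 1:
        problems.append("inclusions should be a positive integer")
    return problems


ok = check_add_data_args(["reports"], "assays", "dataframe", 10)
bad = check_add_data_args(["reports"], "reports", "csv", 0)
print(ok)
print(bad)
```

In practice the first argument could be `handler.handlers.keys()`, so the check sees the names already registered.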

Similar Components

AI-powered semantic similarity - components with related functionality:

class _DictSAXHandler 46.0% similar

A SAX (Simple API for XML) event handler that converts XML documents into Python dictionaries, with extensive configuration options for handling attributes, namespaces, CDATA, and structure.
From: /tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/xmltodict.py

class DataSource 44.6% similar

DataSource is an abstract base class that serves as a foundation for identifying and representing sources of content in eDiscovery operations within the Office 365 ecosystem.
From: /tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/office365/directory/security/data_source.py

class DocumentProcessor 42.9% similar

Process different document types for RAG context extraction
From: /tf/active/vicechatdev/offline_docstore_multi_vice.py

class DocumentProcessor_v1 42.3% similar

Process different document types for RAG context extraction
From: /tf/active/vicechatdev/offline_docstore_multi.py

class ODataModel 42.2% similar

A container class for managing OData type schemas, providing a registry to store and retrieve type definitions by their names.
From: /tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/office365/runtime/odata/model.py
