
class SimpleDataHandle

Maturity: 40

A data handler class that manages multiple data sources with different types (dataframes, vector stores, databases) and their associated processing configurations.

File: /tf/active/vicechatdev/OneCo_hybrid_RAG copy.py
Lines: 718 - 787
Complexity: moderate

Purpose

SimpleDataHandle provides a centralized registry for managing heterogeneous data sources in a data processing or RAG (Retrieval-Augmented Generation) pipeline. It stores data along with metadata including type, filters, processing steps, inclusion limits, and instructions for how to use each data source. The class automatically configures default settings based on data type and can convert documents to vector stores using FAISS and OpenAI embeddings.
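
As a rough illustration of the registry shape described above, the sketch below builds a plain dictionary with the same six keys that add_data stores and groups source names by type. The group_by_type helper and the sample entries are hypothetical and not part of the original file; how the real pipeline consumes handlers is not shown in this section.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical stand-in for SimpleDataHandle().handlers: same keys, made-up values.
handlers: Dict[str, Dict[str, Any]] = {
    "assays": {
        "type": "dataframe", "data": None, "filters": "",
        "processing_steps": ["markdown"], "inclusions": 10,
        "instructions": "Summarize the internal data.",
    },
    "literature": {
        "type": "chromaDB", "data": None, "filters": "",
        "processing_steps": ["similarity"], "inclusions": 5,
        "instructions": "Summarize the retrieved context.",
    },
}

def group_by_type(registry: Dict[str, Dict[str, Any]]) -> Dict[str, List[str]]:
    """Return a mapping from source type to the names registered under it."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for name, config in registry.items():
        grouped[config["type"]].append(name)
    return dict(grouped)

by_type = group_by_type(handlers)
print(by_type)
```

A consumer could use such a grouping to route each source to a type-specific processing function, though that dispatch logic is an assumption here.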

Source Code

class SimpleDataHandle:
     
    def __init__(self):
        self.handlers = {}
        return
     
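    def add_data(self, name:str, type:str, data:Any, filters:str="", processing_steps:List[str]=[], inclusions:int=10,instructions:str=""):
        # Work on a copy so the shared default list (or the caller's list) is never stored or mutated
        processing_steps = list(processing_steps)
        ## Default values for type, filters, processing_steps, instructions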
        if type == "":
            type = "text"
        if type=="dataframe":
            filters=""
            if processing_steps==[]:
                processing_steps=["markdown"]
            if instructions=="":
                instructions="""Start with a summary of the internal data, using summary tables when possible. If the internal data is presented as chemical formulas in SMILES format, try to find the corresponding chemical names and properties and report those in your answer.
                            Use them to compare it to other chemical data in the external sources."""
        if type in ("vectorstore", "to_vectorstore"):
            if processing_steps==[]:
                processing_steps=["similarity"]
            if instructions=="":
                instructions="""Provide a summary of the given context data extracted from lab data and reports and from scientific literature, using summary tables when possible.
                            """
        if type =="to_vectorstore":
            embeddings = OpenAIEmbeddings()
            index = faiss.IndexFlatL2(len(embeddings.embed_query("hello world")))
            vector_store = FAISS(
                embedding_function=embeddings,
                docstore=InMemoryDocstore(),
                index_to_docstore_id={},
                index=index
            )
            uuids = [str(uuid4()) for _ in range(len(data))]
            vector_store.add_documents(
            documents=data,  
            ids=uuids,
            )
            data=vector_store
            type="vectorstore"
        if type == "db_search":
            if processing_steps==[]:
                processing_steps=["similarity"]
            if instructions=="":
                instructions="""Provide a summary of the given context data extracted from lab data and reports and from scientific literature, using summary tables when possible.
                            """
        if type=="chromaDB":
            if processing_steps==[]:
                processing_steps=["similarity"]
            if instructions=="":    
                instructions="""Provide a summary of the given context data extracted from lab data and reports and from scientific literature, using summary tables when possible.
                            """
        
        self.handlers[name] = {
            "type" : type,
            "data" : data,
            "filters" : filters,
            "processing_steps" : processing_steps,
            "inclusions" : inclusions,
            "instructions" : instructions
        }
        return
     
    def remove_data(self, name:str):
        if name in self.handlers:
            del self.handlers[name]
        return
    
    def clear_data(self):
        self.handlers = {}
        return      
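
add_data branches on several string values of type. One general Python point is worth keeping in mind when reading or modifying those branches: an expression of the form x == "a" or "b" is parsed as (x == "a") or "b", and a non-empty string is always truthy, so such a branch runs for every value of x. A membership test avoids this. The function names below are hypothetical and exist only for the demonstration.

```python
def always_true_check(kind: str) -> bool:
    # Parsed as (kind == "vectorstore") or "to_vectorstore".
    # The right-hand operand is a non-empty string, so the result is always truthy.
    return bool(kind == "vectorstore" or "to_vectorstore")

def membership_check(kind: str) -> bool:
    # True only when kind is one of the listed values.
    return kind in ("vectorstore", "to_vectorstore")

for kind in ("vectorstore", "to_vectorstore", "dataframe", "text"):
    print(kind, always_true_check(kind), membership_check(kind))
```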

Parameters

Name Type Default Kind
bases - -

Parameter Details

__init__: No parameters required. Initializes an empty handlers dictionary to store data sources.

Return Value

Instantiating the class returns a SimpleDataHandle object (__init__ itself returns None). The add_data, remove_data, and clear_data methods all return None because they only modify internal state. Stored sources are read through the handlers attribute, a dictionary mapping each data source name to its configuration dictionary.

Class Interface

Methods

__init__(self) -> None

Purpose: Initialize a new SimpleDataHandle instance with an empty handlers dictionary

Returns: None

add_data(self, name: str, type: str, data: Any, filters: str = '', processing_steps: List[str] = [], inclusions: int = 10, instructions: str = '') -> None

Purpose: Add a new data source to the handler with associated configuration. Automatically sets defaults based on type and can convert documents to FAISS vector stores.

Parameters:

  • name: Unique identifier for this data source
  • type: Type of data: 'text', 'dataframe', 'vectorstore', 'to_vectorstore', 'db_search', or 'chromaDB'
  • data: The actual data object (text, DataFrame, vector store, list of Documents, etc.)
  • filters: Filter criteria for the data (defaults to empty string, auto-cleared for dataframes)
  • processing_steps: List of processing steps to apply. An empty list triggers a type-specific default: ['markdown'] for dataframes, ['similarity'] for vector store and database search types
  • inclusions: Number of items to include from this source (default 10)
  • instructions: Instructions for how to use this data source (defaults vary by type with specific guidance for chemical data, lab reports, etc.)

Returns: None - modifies internal handlers dictionary
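
The per-type defaults for processing_steps can be summarized as a lookup table. The sketch below is one possible restructuring under stated assumptions, not the code in the original file: DEFAULT_STEPS and resolve_steps are hypothetical names, None is used as the "not provided" sentinel instead of a mutable [] default, and types without a table entry fall back to an empty list.

```python
from typing import Dict, List, Optional

# Hypothetical table mirroring the processing_steps defaults documented for add_data.
DEFAULT_STEPS: Dict[str, List[str]] = {
    "dataframe": ["markdown"],
    "vectorstore": ["similarity"],
    "to_vectorstore": ["similarity"],
    "db_search": ["similarity"],
    "chromaDB": ["similarity"],
}

def resolve_steps(kind: str, processing_steps: Optional[List[str]] = None) -> List[str]:
    """Return the caller's steps if any were given, else the default for this type."""
    if processing_steps:
        # Copy so later edits to the stored list do not affect the caller's list.
        return list(processing_steps)
    # Copy the table entry so callers never share (or mutate) the default list.
    return list(DEFAULT_STEPS.get(kind, []))

print(resolve_steps("dataframe"))
print(resolve_steps("chromaDB", ["keyword"]))
```

Keeping the defaults in one table makes it easier to see at a glance which types share a default, and the None sentinel sidesteps the shared-list pitfall of mutable default arguments.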

remove_data(self, name: str) -> None

Purpose: Remove a data source from the handler by name

Parameters:

  • name: The name of the data source to remove

Returns: None - modifies internal handlers dictionary by deleting the entry if it exists
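
Because remove_data returns None whether or not the name existed, a caller that needs to know if anything was removed has to inspect handlers itself. A minimal sketch on a plain dictionary standing in for handlers (the remove_if_present helper is hypothetical):

```python
from typing import Any, Dict

def remove_if_present(registry: Dict[str, Any], name: str) -> bool:
    """Delete name from registry and report whether an entry was actually removed."""
    sentinel = object()
    # dict.pop with a default never raises KeyError for a missing key.
    return registry.pop(name, sentinel) is not sentinel

registry = {"my_text_data": {"type": "text"}}
first = remove_if_present(registry, "my_text_data")
second = remove_if_present(registry, "my_text_data")
print(first, second)
```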

clear_data(self) -> None

Purpose: Remove all data sources from the handler

Returns: None - resets handlers dictionary to empty

Attributes

Name: handlers
Type: Dict[str, Dict[str, Any]]
Scope: instance
Description: Dictionary mapping data source names to their configuration dictionaries. Each configuration contains the keys 'type', 'data', 'filters', 'processing_steps', 'inclusions', and 'instructions'.
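
The six keys of each configuration dictionary can be written down as a typed structure. This is an illustrative sketch: the original class stores plain dictionaries, and HandlerEntry and has_expected_keys are hypothetical names introduced here.

```python
from typing import Any, Dict, List, TypedDict

class HandlerEntry(TypedDict):
    """Shape of one value in SimpleDataHandle.handlers, as documented above."""
    type: str
    data: Any
    filters: str
    processing_steps: List[str]
    inclusions: int
    instructions: str

EXPECTED_KEYS = set(HandlerEntry.__annotations__)

def has_expected_keys(entry: Dict[str, Any]) -> bool:
    # Checks key names only; value types are not validated at runtime.
    return set(entry) == EXPECTED_KEYS

sample = {
    "type": "text", "data": "Some text content", "filters": "",
    "processing_steps": [], "inclusions": 5, "instructions": "",
}
print(has_expected_keys(sample))
```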

Dependencies

  • typing
  • panel
  • langchain_community
  • langchain_openai
  • uuid
  • pandas
  • sentence_transformers
  • faiss
  • numpy
  • neo4j
  • openai
  • chromadb
  • tiktoken
  • pybtex

Required Imports

from typing import List, Any, Dict
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
from uuid import uuid4
import faiss

Conditional/Optional Imports

These imports are only needed under specific conditions:

from langchain_community.embeddings import OpenAIEmbeddings

Condition: only when adding data with type='to_vectorstore'. Required (conditional).

from langchain_community.vectorstores import FAISS

Condition: only when adding data with type='to_vectorstore'. Required (conditional).

import faiss

Condition: only when adding data with type='to_vectorstore'. Required (conditional).
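
Since these packages matter only for the 'to_vectorstore' path, an application could check for them before offering that option. The sketch below uses the standard library's importlib.util.find_spec to test whether the top-level modules are installed without importing them; VECTORSTORE_MODULES and missing_modules are hypothetical names, and the original file does not perform such a check.

```python
import importlib.util
from typing import List

# Top-level modules used by the 'to_vectorstore' branch, per the imports listed above.
VECTORSTORE_MODULES = ["langchain_community", "faiss"]

def missing_modules(names: List[str]) -> List[str]:
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(VECTORSTORE_MODULES)
if missing:
    print("to_vectorstore unavailable, missing:", ", ".join(missing))
else:
    print("to_vectorstore dependencies found")
```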

Usage Example

# Instantiate the handler
handler = SimpleDataHandle()

# Add a text data source
handler.add_data(
    name='my_text_data',
    type='text',
    data='Some text content',
    inclusions=5
)

# Add a dataframe
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
handler.add_data(
    name='my_dataframe',
    type='dataframe',
    data=df,
    processing_steps=['markdown']
)

# Add documents and convert to vector store
from langchain_core.documents import Document
docs = [Document(page_content='doc1'), Document(page_content='doc2')]
handler.add_data(
    name='my_vectors',
    type='to_vectorstore',
    data=docs
)

# Access stored data
print(handler.handlers['my_text_data'])

# Remove a data source
handler.remove_data('my_text_data')

# Clear all data
handler.clear_data()

Best Practices

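  • Create a SimpleDataHandle instance first; handlers is per-instance state, so sources added to one instance are not visible from another
  • Use unique, descriptive names: add_data assigns handlers[name] directly, so reusing a name silently replaces the earlier entry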
  • When using type='to_vectorstore', ensure data is a list of Document objects compatible with LangChain
  • The class modifies the type from 'to_vectorstore' to 'vectorstore' after conversion, so check the final type in handlers
  • Default processing_steps and instructions vary by data type - review the source to understand defaults
  • The handlers attribute is a public dictionary that can be directly accessed or modified
  • No validation is performed on data types or parameters - ensure correct types are passed
  • Methods return None, so check the handlers dictionary to verify operations succeeded
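  • For type='to_vectorstore', OpenAI API credentials must be configured before calling add_data, because the conversion creates OpenAIEmbeddings and embeds text to size and fill the FAISS index. Passing an existing store with type='vectorstore' does not call the embedding API inside add_data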
  • The inclusions parameter defaults to 10 and likely controls how many items to retrieve/include from the data source
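
Several of the points above (no validation, silent overwrites, credential requirements) can be checked before calling add_data. The helper below is a hedged sketch, not part of the original class: check_add_data_args and KNOWN_TYPES are hypothetical, and the credential check assumes the key is supplied through the OPENAI_API_KEY environment variable, which is where OpenAIEmbeddings() looks when called without arguments.

```python
import os
from typing import Iterable, List, Mapping

# Type names documented for add_data; the empty string is accepted and treated as 'text'.
KNOWN_TYPES = {"", "text", "dataframe", "vectorstore", "to_vectorstore", "db_search", "chromaDB"}

def check_add_data_args(
    name: str,
    kind: str,
    existing_names: Iterable[str],
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems: List[str] = []
    if kind not in KNOWN_TYPES:
        problems.append(f"unknown type: {kind!r}")
    if name in existing_names:
        problems.append(f"name already registered and would be overwritten: {name!r}")
    # Assumption: credentials come from the OPENAI_API_KEY environment variable.
    if kind == "to_vectorstore" and not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems

print(check_add_data_args("my_vectors", "to_vectorstore", ["my_text_data"], env={}))
```

With a real handler, existing_names would be handler.handlers, and the caller decides whether a reported problem should block the add_data call or only be logged.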

Similar Components

AI-powered semantic similarity - components with related functionality:

  • class _DictSAXHandler 46.0% similar

    A SAX (Simple API for XML) event handler that converts XML documents into Python dictionaries, with extensive configuration options for handling attributes, namespaces, CDATA, and structure.

    From: /tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/xmltodict.py
  • class DataSource 44.6% similar

    DataSource is an abstract base class that serves as a foundation for identifying and representing sources of content in eDiscovery operations within the Office 365 ecosystem.

    From: /tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/office365/directory/security/data_source.py
  • class DocumentProcessor 42.9% similar

    Process different document types for RAG context extraction

    From: /tf/active/vicechatdev/offline_docstore_multi_vice.py
  • class DocumentProcessor_v1 42.3% similar

    Process different document types for RAG context extraction

    From: /tf/active/vicechatdev/offline_docstore_multi.py
  • class ODataModel 42.2% similar

    A container class for managing OData type schemas, providing a registry to store and retrieve type definitions by their names.

    From: /tf/active/vicechatdev/SPFCsync/venv/lib64/python3.11/site-packages/office365/runtime/odata/model.py