class SimpleDataHandle
A data handler class that manages multiple data sources with different types (dataframes, vector stores, databases) and their associated processing configurations.
Source: /tf/active/vicechatdev/OneCo_hybrid_RAG copy.py
Lines: 718-787
Complexity: moderate
Purpose
SimpleDataHandle provides a centralized registry for managing heterogeneous data sources in a data processing or RAG (Retrieval-Augmented Generation) pipeline. It stores data along with metadata including type, filters, processing steps, inclusion limits, and instructions for how to use each data source. The class automatically configures default settings based on data type and can convert documents to vector stores using FAISS and OpenAI embeddings.
Source Code
class SimpleDataHandle:
    def __init__(self):
        self.handlers = {}

    def add_data(self, name: str, type: str, data: Any, filters: str = "",
                 processing_steps: List[str] = [], inclusions: int = 10,
                 instructions: str = ""):
        ## Default values for type, filters, processing_steps, instructions
        if type == "":
            type = "text"
        if type == "dataframe":
            filters = ""
            if processing_steps == []:
                processing_steps = ["markdown"]
            if instructions == "":
                instructions = """Start with a summary of the internal data, using summary tables when possible. If the internal data is presented as chemical formulas in SMILES format, try to find the corresponding chemical names and properties and report those in your answer.
Use them to compare it to other chemical data in the external sources."""
        if type in ("vectorstore", "to_vectorstore"):
            if processing_steps == []:
                processing_steps = ["similarity"]
            if instructions == "":
                instructions = """Provide a summary of the given context data extracted from lab data and reports and from scientific literature, using summary tables when possible.
"""
            if type == "to_vectorstore":
                embeddings = OpenAIEmbeddings()
                index = faiss.IndexFlatL2(len(embeddings.embed_query("hello world")))
                vector_store = FAISS(
                    embedding_function=embeddings,
                    docstore=InMemoryDocstore(),
                    index_to_docstore_id={},
                    index=index,
                )
                uuids = [str(uuid4()) for _ in range(len(data))]
                vector_store.add_documents(
                    documents=data,
                    ids=uuids,
                )
                data = vector_store
                type = "vectorstore"
        if type == "db_search":
            if processing_steps == []:
                processing_steps = ["similarity"]
            if instructions == "":
                instructions = """Provide a summary of the given context data extracted from lab data and reports and from scientific literature, using summary tables when possible.
"""
        if type == "chromaDB":
            if processing_steps == []:
                processing_steps = ["similarity"]
            if instructions == "":
                instructions = """Provide a summary of the given context data extracted from lab data and reports and from scientific literature, using summary tables when possible.
"""
        self.handlers[name] = {
            "type": type,
            "data": data,
            "filters": filters,
            "processing_steps": processing_steps,
            "inclusions": inclusions,
            "instructions": instructions,
        }

    def remove_data(self, name: str):
        if name in self.handlers:
            del self.handlers[name]

    def clear_data(self):
        self.handlers = {}
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| bases | - | - | - |
Parameter Details
__init__: No parameters required. Initializes an empty handlers dictionary to store data sources.
Return Value
The class constructor returns None. The add_data, remove_data, and clear_data methods all return None (they modify internal state). The primary value is accessed through the handlers attribute which contains a dictionary mapping data source names to their configuration dictionaries.
Class Interface
Methods
__init__(self) -> None
Purpose: Initialize a new SimpleDataHandle instance with an empty handlers dictionary
Returns: None
add_data(self, name: str, type: str, data: Any, filters: str = '', processing_steps: List[str] = [], inclusions: int = 10, instructions: str = '') -> None
Purpose: Add a new data source to the handler with associated configuration. Automatically sets defaults based on type and can convert documents to FAISS vector stores.
Parameters:
- name: Unique identifier for this data source
- type: Type of data: 'text', 'dataframe', 'vectorstore', 'to_vectorstore', 'db_search', or 'chromaDB'
- data: The actual data object (text, DataFrame, vector store, list of Documents, etc.)
- filters: Filter criteria for the data (defaults to empty string; auto-cleared for dataframes)
- processing_steps: List of processing steps to apply (defaults vary by type: ['markdown'] for dataframes, ['similarity'] for vector stores)
- inclusions: Number of items to include from this source (default 10)
- instructions: Instructions for how to use this data source (defaults vary by type, with specific guidance for chemical data, lab reports, etc.)
Returns: None - modifies internal handlers dictionary
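The per-type defaulting rules can be summarized in isolation. This sketch mirrors the decision logic in the source above (the helper name `resolve_defaults` is illustrative, not part of the class):

```python
from typing import List, Optional, Tuple

def resolve_defaults(type: str,
                     processing_steps: Optional[List[str]] = None
                     ) -> Tuple[str, List[str]]:
    """Mirror of add_data's defaulting rules for type and processing_steps."""
    if type == "":
        type = "text"
    if not processing_steps:
        if type == "dataframe":
            processing_steps = ["markdown"]
        elif type in ("vectorstore", "to_vectorstore", "db_search", "chromaDB"):
            processing_steps = ["similarity"]
        else:
            processing_steps = []  # plain text gets no default steps
    return type, processing_steps

assert resolve_defaults("") == ("text", [])
assert resolve_defaults("dataframe")[1] == ["markdown"]
assert resolve_defaults("chromaDB")[1] == ["similarity"]
```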
remove_data(self, name: str) -> None
Purpose: Remove a data source from the handler by name
Parameters:
name: The name of the data source to remove
Returns: None - modifies internal handlers dictionary by deleting the entry if it exists
clear_data(self) -> None
Purpose: Remove all data sources from the handler
Returns: None - resets handlers dictionary to empty
Attributes
| Name | Type | Description | Scope |
|---|---|---|---|
| handlers | Dict[str, Dict[str, Any]] | Dictionary mapping data source names to their configuration dictionaries. Each configuration contains keys: 'type', 'data', 'filters', 'processing_steps', 'inclusions', and 'instructions' | instance |
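Every entry stored under handlers has the same flat shape regardless of type. A sketch of what one dataframe entry might look like (the field values here are illustrative placeholders, not real output):

```python
entry = {
    "type": "dataframe",
    "data": "<the pandas DataFrame passed to add_data>",  # stored as-is
    "filters": "",                                        # auto-cleared for dataframes
    "processing_steps": ["markdown"],                     # per-type default
    "inclusions": 10,                                     # default limit
    "instructions": "Start with a summary of the internal data, ...",
}
# The six keys are fixed for every entry, whatever the type tag.
assert set(entry) == {"type", "data", "filters",
                      "processing_steps", "inclusions", "instructions"}
```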
Dependencies
typing, panel, langchain_community, langchain_openai, uuid, pandas, sentence_transformers, faiss, numpy, neo4j, openai, chromadb, tiktoken, pybtex
Required Imports
from typing import List, Any, Dict
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
from uuid import uuid4
import faiss
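Since several of these imports are exercised only on the type='to_vectorstore' branch, they can be deferred until first use. One way to guard such optional dependencies, sketched here with the standard library's importlib (the helper name is hypothetical):

```python
import importlib

def optional_import(module_name: str, hint: str):
    """Import a module on demand, raising a descriptive error if it is absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(f"{module_name} is required {hint}") from exc

# math is always available, so this import succeeds
math = optional_import("math", "for this demo")
assert math.sqrt(4) == 2.0

# a missing module surfaces as a descriptive RuntimeError instead of a bare ImportError
try:
    optional_import("faiss_nonexistent_demo", "when type='to_vectorstore'")
except RuntimeError as err:
    assert "faiss_nonexistent_demo" in str(err)
```

Applied to this class, `faiss`, `FAISS`, and `OpenAIEmbeddings` could be resolved this way inside the 'to_vectorstore' branch so that users of the other data types never need them installed.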
Conditional/Optional Imports
These imports are only needed under specific conditions:
- from langchain_community.embeddings import OpenAIEmbeddings (required only when adding data with type='to_vectorstore')
- from langchain_community.vectorstores import FAISS (required only when adding data with type='to_vectorstore')
- import faiss (required only when adding data with type='to_vectorstore')
Usage Example
# Instantiate the handler
handler = SimpleDataHandle()

# Add a text data source
handler.add_data(
    name='my_text_data',
    type='text',
    data='Some text content',
    inclusions=5
)

# Add a dataframe
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
handler.add_data(
    name='my_dataframe',
    type='dataframe',
    data=df,
    processing_steps=['markdown']
)

# Add documents and convert to vector store
from langchain_core.documents import Document
docs = [Document(page_content='doc1'), Document(page_content='doc2')]
handler.add_data(
    name='my_vectors',
    type='to_vectorstore',
    data=docs
)

# Access stored data
print(handler.handlers['my_text_data'])

# Remove a data source
handler.remove_data('my_text_data')

# Clear all data
handler.clear_data()
Best Practices
- Always instantiate the class before adding data sources
- Use descriptive names when adding data to avoid conflicts in the handlers dictionary
- When using type='to_vectorstore', ensure data is a list of Document objects compatible with LangChain
- The class modifies the type from 'to_vectorstore' to 'vectorstore' after conversion, so check the final type in handlers
- Default processing_steps and instructions vary by data type - review the source to understand defaults
- The handlers attribute is a public dictionary that can be directly accessed or modified
- No validation is performed on data types or parameters - ensure correct types are passed
- Methods return None, so check the handlers dictionary to verify operations succeeded
- For vectorstore types, OpenAI API credentials must be configured before calling add_data
- The inclusions parameter defaults to 10 and likely controls how many items to retrieve/include from the data source
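Because the class performs no validation itself, a thin wrapper can reject bad inputs before they reach add_data. A sketch of such a guard (`checked_add` and `_StubHandler` are hypothetical helpers, not part of the class):

```python
VALID_TYPES = {"text", "dataframe", "vectorstore",
               "to_vectorstore", "db_search", "chromaDB"}

def checked_add(handler, name: str, type: str, data, **kwargs) -> None:
    """Validate the name and type tag before delegating to handler.add_data."""
    if type not in VALID_TYPES and type != "":
        raise ValueError(f"unknown data type {type!r}; "
                         f"expected one of {sorted(VALID_TYPES)}")
    if not name:
        raise ValueError("name must be a non-empty string")
    handler.add_data(name=name, type=type, data=data, **kwargs)

class _StubHandler:
    """Records add_data calls so the guard can be tested without dependencies."""
    def __init__(self):
        self.calls = []
    def add_data(self, **kwargs):
        self.calls.append(kwargs)

stub = _StubHandler()
checked_add(stub, "notes", "text", "hello")   # valid: delegated to add_data
try:
    checked_add(stub, "bad", "graph", None)   # invalid type tag: rejected
except ValueError:
    pass
```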
Similar Components
AI-powered semantic similarity - components with related functionality:
- class _DictSAXHandler (46.0% similar)
- class DataSource (44.6% similar)
- class DocumentProcessor (42.9% similar)
- class DocumentProcessor_v1 (42.3% similar)
- class ODataModel (42.2% similar)