function add_document_to_graph
Creates nodes and relationships in a Neo4j graph database for a processed document, including its text and table chunks, connecting it to a folder hierarchy.
/tf/active/vicechatdev/offline_docstore_multi.py
1181 - 1231
moderate
Purpose
This function integrates a processed document into a Neo4j knowledge graph by creating a Document node with metadata, linking it to either a specified subfolder or root folder, and creating child nodes for text and table chunks extracted from the document. It maintains a hierarchical structure of folders and documents with their associated content chunks.
Source Code
def add_document_to_graph(session, processed_doc, deepest_folder_uid):
    """Add processed document to Neo4j graph"""
    file_path = processed_doc["file_path"]
    file_path_escaped = file_path.replace("'", "``")
    filename = processed_doc["file_name"]
    filename_escaped = filename.replace("'", "``")
    text_chunks = processed_doc.get("text_chunks", [])
    table_chunks = processed_doc.get("table_chunks", [])

    # Generate UID for the document
    doc_uid = str(uuid4())
    key = evaluate_query(session,"match (x:Docstores) where not ('Template' in labels(x)) return x.Keys")

    # Connect document to folder
    if deepest_folder_uid:
        query = f"MATCH (f:Subfolder {{UID: '{deepest_folder_uid}'}}) " \
                f"MERGE (f)-[:PATH]->(n:Document {{UID:'{doc_uid}', " \
                f"Name:'{filename_escaped}', " \
                f"File:'{file_path_escaped}', " \
                f"Type:'{processed_doc['file_type']}', " \
                f"Keys:'{key}'}})"
        run_query(session,query)
    else:
        # Connect to root folder
        query = f"MATCH (x:Rootfolder {{Name:'T001'}}) " \
                f"MERGE (x)-[:PATH]->(n:Document {{UID:'{doc_uid}', " \
                f"Name:'{filename_escaped}', " \
                f"File:'{file_path_escaped}', " \
                f"Type:'{processed_doc['file_type']}', " \
                f"Keys:'{key}'}})"
        run_query(session,query)

    # Connect chunks to the document (unchanged)
    for i,text in enumerate(text_chunks):
        out=run_query(session,f"MATCH (x {{UID:'{doc_uid}'}}) "
                      f"MERGE (x)-[:CHUNK]->(n:Text_chunk {{UID:'{text[2]}',"
                      f"Name:'{filename}:Text:{str(i)}',"
                      f"Text:'{text[1]}',"
                      f"Parent:'{text[0]}',"
                      f"Keys:'{key}'}})")

    for i,text in enumerate(table_chunks):
        out=run_query(session,f"MATCH (x {{UID:'{doc_uid}'}}) "
                      f"MERGE (x)-[:CHUNK]->(n:Table_chunk {{UID:'{text[3]}',"
                      f"Name:'{filename}:Table:{str(i)}',"
                      f"Text:'{text[2]}',"
                      f"Html:'{text[1]}',"
                      f"Parent:'{text[0]}',"
                      f"Keys:'{key}'}})")

    return doc_uid
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| session | - | - | positional_or_keyword |
| processed_doc | - | - | positional_or_keyword |
| deepest_folder_uid | - | - | positional_or_keyword |
Parameter Details
session: Neo4j database session object used to execute Cypher queries against the graph database. Should be an active session from neo4j.GraphDatabase.driver().session().
processed_doc: Dictionary containing document metadata and content. Required keys: 'file_path' (str: full path to file), 'file_name' (str: name of file), 'file_type' (str: document type/extension). Optional keys, each defaulting to an empty list: 'text_chunks' (list of tuples: [(parent, text, uid), ...]) and 'table_chunks' (list of tuples: [(parent, html, text, uid), ...]). Text chunks contain extracted text content; table chunks contain both HTML and text representations.
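deepest_folder_uid: String UID of the deepest subfolder in the hierarchy where this document should be attached. If None or empty, the document is connected to the Rootfolder node named 'T001'. Should be a valid UUID string matching an existing Subfolder node: if no Subfolder has this UID, the MATCH clause finds nothing, so (assuming run_query simply executes the statement) no Document or chunk nodes are created even though a doc_uid is still returned.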
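Because the function does not check the shape of processed_doc itself, a caller may want to validate it first. The following is a minimal sketch based only on the key names and tuple layouts described above; the helper name validate_processed_doc and the REQUIRED_KEYS constant are hypothetical and not part of the documented module.

```python
from typing import Any, Mapping

# Keys the function reads with processed_doc[...] and therefore requires.
REQUIRED_KEYS = ("file_path", "file_name", "file_type")


def validate_processed_doc(processed_doc: Mapping[str, Any]) -> None:
    """Raise ValueError if processed_doc does not match the documented shape."""
    missing = [k for k in REQUIRED_KEYS if k not in processed_doc]
    if missing:
        raise ValueError(f"processed_doc is missing required keys: {missing}")

    # Text chunks are documented as (parent, text, uid) tuples.
    for i, chunk in enumerate(processed_doc.get("text_chunks", [])):
        if len(chunk) != 3:
            raise ValueError(f"text_chunks[{i}] must have 3 items, got {len(chunk)}")

    # Table chunks are documented as (parent, html, text, uid) tuples.
    for i, chunk in enumerate(processed_doc.get("table_chunks", [])):
        if len(chunk) != 4:
            raise ValueError(f"table_chunks[{i}] must have 4 items, got {len(chunk)}")
```

Calling validate_processed_doc(processed_doc) before add_document_to_graph turns a malformed chunk into an early ValueError rather than an IndexError or a partially written graph.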
Return Value
Returns a string containing the generated UUID (doc_uid) for the newly created Document node in the Neo4j graph. This UID can be used to reference or query the document later.
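As an illustration of using the returned UID later, the sketch below lists the names of the chunks attached to a document. It assumes the Document label and CHUNK relationship shown in the source code, and it calls session.run directly rather than the module's run_query helper, whose signature is not documented here; the function name fetch_chunk_names is hypothetical.

```python
def fetch_chunk_names(session, doc_uid):
    """Return the Name of every chunk attached to the document with this UID."""
    query = (
        "MATCH (d:Document {UID: $uid})-[:CHUNK]->(c) "
        "RETURN c.Name AS name ORDER BY name"
    )
    # The Neo4j Python driver accepts query parameters as keyword arguments,
    # so the UID is never interpolated into the Cypher text.
    result = session.run(query, uid=doc_uid)
    return [record["name"] for record in result]
```

With a live session this could be called as fetch_chunk_names(session, doc_uid) right after add_document_to_graph returns.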
Dependencies
- neo4j
- uuid
Required Imports
from uuid import uuid4
from neo4j import GraphDatabase
Usage Example
from neo4j import GraphDatabase
from uuid import uuid4
# Assuming evaluate_query and run_query helper functions exist
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))
session = driver.session()
processed_doc = {
    'file_path': '/documents/report.pdf',
    'file_name': 'report.pdf',
    'file_type': 'pdf',
    'text_chunks': [
        ('section1', 'This is the first paragraph', 'uuid-text-1'),
        ('section2', 'This is the second paragraph', 'uuid-text-2')
    ],
    'table_chunks': [
        ('table1', '<table><tr><td>Data</td></tr></table>', 'Data', 'uuid-table-1')
    ]
}
folder_uid = 'existing-folder-uuid-123'
doc_uid = add_document_to_graph(session, processed_doc, folder_uid)
print(f'Document created with UID: {doc_uid}')
session.close()
driver.close()
Best Practices
- Ensure the Neo4j session is properly opened before calling this function and closed after use
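- The function builds Cypher queries by string interpolation, which is vulnerable to injection attacks; consider using parameterized queries instead
- Single quotes in file paths and filenames are replaced with two backticks (``), which changes the stored value rather than escaping it and does not handle other special characters such as backslashes. The chunk queries interpolate the original, unreplaced filename into Name, so a filename containing a single quote will break them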
- The function assumes helper functions 'evaluate_query' and 'run_query' exist in the scope; ensure these are imported or defined
- Text content in chunks should be properly escaped before passing to this function to avoid Cypher syntax errors
- The function does not validate input data structure; ensure processed_doc contains all required keys
- Consider wrapping the function call in a try-except block to handle Neo4j connection errors
- The hardcoded root folder name 'T001' should ideally be configurable
- Large documents with many chunks may result in many individual queries; consider batching for performance
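The injection and batching points above can be addressed together. The sketch below builds a single parameterized statement that creates all text chunks for a document with UNWIND. It is not the module's implementation: the function name build_text_chunk_batch is hypothetical, the Document label on the MATCH is an assumption (the original matches on UID alone), and it merges on the chunk UID and then sets the remaining properties, which differs slightly from the original MERGE on the full property set.

```python
def build_text_chunk_batch(doc_uid, filename, key, text_chunks):
    """Build one parameterized Cypher statement that creates all text chunks.

    text_chunks is a list of (parent, text, uid) tuples, as documented above.
    Returns (query, params) suitable for session.run(query, params).
    """
    rows = [
        {"uid": uid, "name": f"{filename}:Text:{i}", "text": text, "parent": parent}
        for i, (parent, text, uid) in enumerate(text_chunks)
    ]
    query = (
        "MATCH (d:Document {UID: $doc_uid}) "
        "UNWIND $rows AS row "
        "MERGE (d)-[:CHUNK]->(c:Text_chunk {UID: row.uid}) "
        "SET c.Name = row.name, c.Text = row.text, "
        "c.Parent = row.parent, c.Keys = $key"
    )
    # Values travel as parameters, so quotes in text or filenames need no escaping.
    params = {"doc_uid": doc_uid, "rows": rows, "key": key}
    return query, params
```

A caller would then run query, params = build_text_chunk_batch(...) followed by session.run(query, params), issuing one round trip per document instead of one per chunk. Table chunks could be handled the same way with an extra html field per row.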
Similar Components
AI-powered semantic similarity - components with related functionality:
- function add_document_to_graph_v1: 98.6% similar
- function create_folder_hierarchy_v1: 65.3% similar
- function create_folder_hierarchy: 63.8% similar
- function create_folder_hierarchy_v2: 63.1% similar
- function create_document: 61.9% similar