🔍 Code Extractor

class QueryBasedExtractor

Maturity: 35

A class that performs targeted information extraction from text using LLM-based query-guided extraction, with support for handling long documents through chunking and token management.

File:
/tf/active/vicechatdev/OneCo_hybrid_RAG.py
Lines:
76 - 287
Complexity:
complex

Purpose

QueryBasedExtractor is designed to extract relevant information from text documents based on user-provided queries. It uses a small LLM (default: gpt-4o-mini) to perform intelligent extraction that preserves original wording while focusing only on query-relevant content. The class handles token counting, automatic chunking for long texts, and multi-pass extraction to ensure outputs stay within token limits. It's particularly useful for reducing large documents to their most relevant portions before further processing or analysis.

Source Code

class QueryBasedExtractor:
    def __init__(self, max_output_tokens=1024, api_key=None, model_name="gpt-4o-mini"):
        """
        Initialize the extractor with configuration for a small LLM.
        
        Args:
            max_output_tokens: Maximum tokens for the extracted output
            api_key: API key for the LLM service
            model_name: Small LLM model to use
        """
        self.max_output_tokens = max_output_tokens
        self.api_key = api_key
        self.model_name = model_name
        
        # Set up tiktoken encoder for token counting
        import tiktoken
        self.tokenizer = tiktoken.get_encoding("cl100k_base")
        
        # Set up OpenAI client if API key is provided
        if api_key:
            import openai
            import os
            os.environ["OPENAI_API_KEY"] = api_key
            self.client = openai.OpenAI(api_key=api_key)
        
    def count_tokens(self, text):
        """Count tokens in a string."""
        return len(self.tokenizer.encode(text))
    
    def call_llm(self, prompt):
        """
        Call the LLM with the prompt.
        
        Args:
            prompt: The formatted prompt for extraction
            
        Returns:
            Extracted text from the LLM
        """
        from langchain_openai import ChatOpenAI
        
        # Use LangChain's ChatOpenAI for consistency with OneCo_hybrid_RAG
        llm = ChatOpenAI(
            model=self.model_name,
            temperature=0,
            max_tokens=self.max_output_tokens
        )
        
        response = llm.invoke(prompt)
        return response.content
    
    def create_extraction_prompt(self, queries, text):
        """
        Create a prompt for targeted information extraction based on queries.
        
        Args:
            queries: List of queries to guide the extraction
            text: Text to extract from
            
        Returns:
            Formatted prompt string
        """
        formatted_queries = "\n".join([f"- {q}" for q in queries])
        
        # Design an extraction-focused prompt based on OneCo_hybrid_RAG style
        prompt = f"""
You are performing targeted information extraction. Given the queries below, extract ONLY the most relevant 
passages from the provided text that directly address these queries.

IMPORTANT INSTRUCTIONS:
- DO NOT summarize or paraphrase - extract the exact relevant passages
- Maintain original wording and details crucial for answering the queries
- Include complete sentences and necessary context around key points
- Extract passages in order of relevance to the queries
- If important details are in different parts of the text, include all relevant sections
- Extract ONLY information relevant to the queries
- The extraction MUST be self-contained and make sense on its own
- Maximum output length: {self.max_output_tokens} tokens

QUERIES:
{formatted_queries}

TEXT TO EXTRACT FROM:
{text}

RELEVANT EXTRACTED INFORMATION:
"""
        return prompt
    
    def extract(self, text, queries):
        """
        Extract relevant information from text based on queries.
        
        Args:
            text: Text to extract from
            queries: List of queries to guide extraction
            
        Returns:
            Extracted relevant information
        """
        # Check text length to determine if extraction is needed
        text_tokens = self.count_tokens(text)
        
        # If text is already under token limit, just return it
        if text_tokens <= self.max_output_tokens:
            print("Text is within token limit, no extraction needed.")
            return text
            
        # Create extraction prompt
        prompt = self.create_extraction_prompt(queries, text)
        
        # Check prompt size to ensure it fits in model context
        prompt_tokens = self.count_tokens(prompt)
        
        # For very large texts that won't fit in model context, we need chunking
        if prompt_tokens > 100000:  # Assuming context limit of a small model
            return self.process_long_text(text, queries)
        
        # Otherwise, do direct extraction
        print("Extracting information from text...")
        return self.call_llm(prompt)
    
    def process_long_text(self, text, queries):
        """
        Process very long text by splitting into chunks and extracting from each.
        
        Args:
            text: Long text to process
            queries: List of queries for extraction
            
        Returns:
            Combined extraction from all chunks
        """
        # Calculate how much space we need for queries and prompt template
        query_text = "\n".join([f"- {q}" for q in queries])
        prompt_template = self.create_extraction_prompt([], "")
        fixed_tokens = self.count_tokens(prompt_template) + self.count_tokens(query_text)
        
        # Calculate available space for text in each chunk.
        # 100000 is a conservative estimate of the context window for a
        # small model like gpt-4o-mini; adjust for the actual model used.
        available_tokens = 100000 - fixed_tokens - 100  # 100 token buffer
        
        # Split text into chunks that fit in context window
        text_tokens = self.tokenizer.encode(text)
        chunks = []
        
        for i in range(0, len(text_tokens), available_tokens):
            chunk_tokens = text_tokens[i:i+available_tokens]
            chunk_text = self.tokenizer.decode(chunk_tokens)
            chunks.append(chunk_text)
        
        # Process each chunk and collect extractions
        all_extractions = []
        
        for i, chunk in enumerate(chunks):
            print(f"Processing chunk {i+1}/{len(chunks)}")
            
            # Create an extraction prompt for this chunk
            chunk_prompt = f"""
You are performing targeted information extraction. Extract ONLY the most relevant 
passages from this text chunk that directly address the queries below.

IMPORTANT CONTEXT:
- This is chunk {i+1} of {len(chunks)} from a larger document
- Extract only information relevant to the queries
- DO NOT summarize - extract exact relevant passages
- Maintain original wording and crucial details
- Maximum extraction length: {self.max_output_tokens // len(chunks)} tokens

QUERIES:
{query_text}

TEXT CHUNK {i+1}/{len(chunks)}:
{chunk}

RELEVANT EXTRACTED INFORMATION:
"""
            extracted = self.call_llm(chunk_prompt)
            if extracted.strip():
                all_extractions.append(extracted.strip())
        
        # Combine all extractions
        combined = "\n\n".join(all_extractions)
        
        # If combined extractions are still too long, do a second pass
        if self.count_tokens(combined) > self.max_output_tokens:
            consolidation_prompt = f"""
You are performing final extraction consolidation. You have extracts from different parts 
of a document that address the queries below.

Your task is to create a single coherent extract that includes ONLY the most important and 
relevant passages to answer the queries, while avoiding redundancy.

IMPORTANT INSTRUCTIONS:
- Focus only on the most relevant information for the queries
- Maintain original wording from the extracts
- Remove redundant information that appears in multiple extracts
- Create a coherent, self-contained extract
- Maximum output length: {self.max_output_tokens} tokens

QUERIES:
{query_text}

EXTRACTS TO CONSOLIDATE:
{combined}

FINAL CONSOLIDATED EXTRACT:
"""
            return self.call_llm(consolidation_prompt)
        
        return combined

Parameters

Name               Type         Default        Kind
max_output_tokens  int          1024           keyword
api_key            str or None  None           keyword
model_name         str          'gpt-4o-mini'  keyword

Parameter Details

max_output_tokens: Maximum number of tokens allowed in the extracted output. Default is 1024. This controls the size of the final extraction and is used to determine if extraction is needed at all (texts shorter than this are returned as-is).

api_key: OpenAI API key for authentication. If provided, sets up the OpenAI client and stores the key in environment variables. Can be None if the key is already set in the environment.

model_name: Name of the OpenAI model to use for extraction. Default is 'gpt-4o-mini'. Should be a model that supports chat completions and has sufficient context window for the extraction tasks.

Return Value

The class constructor returns a QueryBasedExtractor instance. The main extract() method returns a string containing the extracted relevant information from the input text, reduced to focus on query-relevant content and constrained by max_output_tokens. If the input text is already within the token limit, it returns the original text unchanged.
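
A minimal sketch of that short-circuit, assuming the class is importable as in the usage example below. No LLM call is made, because the token-count check returns early:

# Texts at or under max_output_tokens are returned unchanged by extract()
extractor = QueryBasedExtractor(max_output_tokens=1024)
short_text = "The study reports a 12% improvement over the baseline."
result = extractor.extract(short_text, ["What are the main findings?"])
assert result == short_text  # returned as-is, no extraction performed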

Class Interface

Methods

__init__(self, max_output_tokens=1024, api_key=None, model_name='gpt-4o-mini')

Purpose: Initialize the QueryBasedExtractor with configuration for token limits, API credentials, and model selection. Sets up tiktoken encoder and OpenAI client.

Parameters:

  • max_output_tokens: Maximum tokens for extracted output (default: 1024)
  • api_key: OpenAI API key, optional if already in environment (default: None)
  • model_name: OpenAI model name to use (default: 'gpt-4o-mini')

Returns: None - initializes instance attributes
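
A minimal instantiation sketch, assuming OPENAI_API_KEY is already set in the environment (so api_key can stay None and no openai.OpenAI client is created; ChatOpenAI in call_llm() picks the key up from the environment on its own):

import os

assert "OPENAI_API_KEY" in os.environ  # key must come from somewhere
extractor = QueryBasedExtractor(
    max_output_tokens=512,       # tighter output budget than the default
    model_name="gpt-4o-mini",    # the default small model
)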

count_tokens(self, text: str) -> int

Purpose: Count the number of tokens in a given text string using tiktoken's cl100k_base encoding.

Parameters:

  • text: String to count tokens for

Returns: Integer count of tokens in the text
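
For reference, a sketch of the equivalent direct tiktoken call; count_tokens() is a thin wrapper around this:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Token counts drive every chunking decision in this class."
print(len(enc.encode(text)))  # same value count_tokens(text) returns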

call_llm(self, prompt: str) -> str

Purpose: Call the configured LLM with a prompt and return the extracted text. Uses LangChain's ChatOpenAI for consistency.

Parameters:

  • prompt: The formatted prompt string for extraction

Returns: String containing the LLM's response content

create_extraction_prompt(self, queries: list, text: str) -> str

Purpose: Create a formatted prompt for targeted information extraction based on provided queries and text. Includes detailed instructions for the LLM.

Parameters:

  • queries: List of query strings to guide the extraction
  • text: Text content to extract information from

Returns: Formatted prompt string ready for LLM consumption
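
A sketch of inspecting a prompt before sending it, for example to check it against the 100000-token threshold that extract() uses to decide between direct extraction and chunking (the input file name is hypothetical):

queries = ["What methodology was used?"]
document = open("report.txt").read()  # hypothetical input file

extractor = QueryBasedExtractor()
prompt = extractor.create_extraction_prompt(queries, document)

# extract() routes to process_long_text() above this threshold
if extractor.count_tokens(prompt) > 100000:
    print("Prompt exceeds the assumed context budget; chunking applies.")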

extract(self, text: str, queries: list) -> str

Purpose: Main extraction method that extracts relevant information from text based on queries. Automatically handles short texts, normal extraction, and long text chunking.

Parameters:

  • text: Text document to extract information from
  • queries: List of query strings to guide what information to extract

Returns: String containing extracted relevant information, constrained by max_output_tokens

process_long_text(self, text: str, queries: list) -> str

Purpose: Process very long texts that exceed model context limits by splitting into chunks, extracting from each chunk, and consolidating results.

Parameters:

  • text: Long text document that exceeds normal context limits
  • queries: List of query strings for extraction guidance

Returns: String containing consolidated extraction from all chunks, within max_output_tokens limit
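
The budget arithmetic inside process_long_text() can be illustrated standalone. A sketch using the hard-coded 100000-token budget and 100-token buffer from the source; the fixed_tokens value and document size below are hypothetical:

fixed_tokens = 350                               # e.g. measured template + queries
available_tokens = 100000 - fixed_tokens - 100   # 100-token safety buffer

doc_tokens = 250000                              # hypothetical very long document
num_chunks = -(-doc_tokens // available_tokens)  # ceiling division -> 3
per_chunk_budget = 1024 // num_chunks            # max_output_tokens // len(chunks)
print(num_chunks, per_chunk_budget)              # 3 341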

Attributes

  • max_output_tokens (int, instance): Maximum number of tokens allowed in the extracted output
  • api_key (str or None, instance): OpenAI API key for authentication
  • model_name (str, instance): Name of the OpenAI model to use for extraction
  • tokenizer (tiktoken.Encoding, instance): Tiktoken encoder using the cl100k_base encoding for token counting
  • client (openai.OpenAI, instance): OpenAI client, only created if api_key is provided

Dependencies

  • tiktoken
  • openai
  • langchain_openai
  • os

Required Imports

import tiktoken
import openai
import os
from langchain_openai import ChatOpenAI

Conditional/Optional Imports

These imports are only needed under specific conditions:

  • import tiktoken (required, conditional): imported in __init__ when the class is instantiated
  • import openai (optional): imported in __init__ only if api_key is provided
  • import os (optional): imported in __init__ only if api_key is provided, to set the environment variable
  • from langchain_openai import ChatOpenAI (required, conditional): imported in call_llm when the LLM is invoked

Usage Example

# Basic usage
from query_based_extractor import QueryBasedExtractor

# Initialize the extractor
extractor = QueryBasedExtractor(
    max_output_tokens=1024,
    api_key='your-openai-api-key',
    model_name='gpt-4o-mini'
)

# Define queries to guide extraction
queries = [
    'What are the main findings of the study?',
    'What methodology was used?',
    'What are the limitations mentioned?'
]

# Extract relevant information from a long document
long_text = """[Your long document text here]..."""
extracted_info = extractor.extract(long_text, queries)

print(f"Original tokens: {extractor.count_tokens(long_text)}")
print(f"Extracted tokens: {extractor.count_tokens(extracted_info)}")
print(f"\nExtracted content:\n{extracted_info}")

# For very long documents, the class automatically handles chunking
very_long_text = """[Very long document that exceeds context window]..."""
extracted = extractor.extract(very_long_text, queries)

Best Practices

  • Always provide an API key either through the constructor or as an environment variable before calling extract()
  • The class automatically determines if extraction is needed based on token count - texts under max_output_tokens are returned unchanged
  • For very long documents (>100k tokens), the class automatically chunks the text and performs multi-pass extraction
  • Queries should be specific and focused to get the best extraction results
  • The extractor preserves original wording rather than summarizing, making it suitable for maintaining factual accuracy
  • Token counting uses tiktoken's cl100k_base encoding, which may not exactly match the model's tokenizer but provides a good approximation
  • The class is stateless after initialization - you can reuse the same instance for multiple extraction calls
  • For production use, consider implementing retry logic around call_llm() to handle API failures (see the sketch after this list)
  • The 100000 token threshold for chunking is conservative and may need adjustment based on the specific model's context window
  • When processing chunks, each chunk gets a proportional token allocation (max_output_tokens // len(chunks)) to ensure the combined result fits within limits
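
As referenced above, a minimal retry sketch around call_llm(). The retry count and backoff values are illustrative, not part of the class:

import time

def call_llm_with_retry(extractor, prompt, retries=3, backoff=2.0):
    """Hypothetical wrapper: retry call_llm() on transient API failures."""
    for attempt in range(retries):
        try:
            return extractor.call_llm(prompt)
        except Exception:                        # narrow to API errors in practice
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))  # linear backoff between attempts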

Similar Components

AI-powered semantic similarity - components with related functionality:

  • class RegulatoryExtractor 61.3% similar

    A class for extracting structured metadata from regulatory guideline PDF documents using LLM-based analysis and storing the results in an Excel tracking spreadsheet.

    From: /tf/active/vicechatdev/reg_extractor.py
  • class DocumentExtractor 61.1% similar

    A document text extraction class that supports multiple file formats including Word, PowerPoint, PDF, and plain text files, with automatic format detection and conversion capabilities.

    From: /tf/active/vicechatdev/leexi/document_extractor.py
  • function extract_previous_reports_summary 54.5% similar

    Extracts and summarizes key information from previous meeting report files using document extraction and OpenAI's GPT-4o-mini model to provide context for upcoming meetings.

    From: /tf/active/vicechatdev/leexi/app.py
  • function test_multiple_files 52.4% similar

    A test function that validates the extraction of text content from multiple document files using a DocumentExtractor instance, displaying extraction results and simulating combined content processing.

    From: /tf/active/vicechatdev/leexi/test_multiple_files.py
  • function test_document_extractor 51.5% similar

    A test function that validates the DocumentExtractor class by testing file type support detection, text extraction from various document formats, and error handling.

    From: /tf/active/vicechatdev/leexi/test_document_extractor.py