🔍 Code Extractor

Browse Components

  • function summarize_text

    A deprecated standalone function originally designed to summarize groups of similar documents. It now emits a deprecation warning and returns the input documents unchanged.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/summarization/summarizer.py | Lines: 22-35

    deprecated text-summarization document-processing nlp text-clustering
  • function init_openai_client

    Initializes the OpenAI client by setting the API key from either a provided parameter or an environment variable.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/summarization/summarizer.py | Lines: 7-19

    initialization authentication openai api-key configuration
  • function get_unique_documents

    Identifies and separates unique documents from duplicates in a list by comparing hash values of document text content.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py | Lines: 44-67

    deduplication document-processing data-cleaning hashing text-processing
  • function identify_duplicates

    Identifies duplicate documents by computing hash values of their text content and grouping documents with identical hashes.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py | Lines: 22-41

    deduplication document-processing hashing data-cleaning duplicate-detection
  • function hash_text

    Computes a SHA-256 hash of normalized text content, giving each document a content fingerprint that can be used for duplicate detection and content comparison.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py | Lines: 5-19

    hashing text-processing deduplication content-fingerprinting sha256
  • function find_similar_documents

    Identifies pairs of similar documents by comparing their embeddings and returns those exceeding a specified similarity threshold, sorted by similarity score.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py | Lines: 38-71

    document-similarity embedding-comparison duplicate-detection cosine-similarity nlp
  • function build_similarity_matrix

    Computes a pairwise cosine similarity matrix for a collection of embedding vectors, where each cell (i,j) represents the similarity between embedding i and embedding j.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py | Lines: 24-35

    embeddings similarity cosine-similarity matrix nlp
  • function calculate_similarity

    Computes the cosine similarity between two embedding vectors, returning a normalized score between 0 and 1 that measures their directional alignment.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py | Lines: 6-21

    cosine-similarity vector-comparison embeddings similarity-metric machine-learning
  • class TextClusterer

    A class that clusters similar documents based on their embeddings using various clustering algorithms (K-means, Agglomerative, DBSCAN) and optionally generates summaries for each cluster.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/clustering/text_clusterer.py | Lines: 8-171

    clustering document-clustering embeddings machine-learning kmeans
  • function test_identical_chunks_with_different_cases

    A unit test verifying that HashCleaner deduplicates case-sensitively: strings that differ only in case are treated as distinct entries and are not removed.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_hash_cleaner.py | Lines: 38-49

    unit-test pytest deduplication case-sensitive text-processing
  • function test_no_identical_chunks

    A unit test function that verifies the HashCleaner's behavior when processing a list of unique text chunks, ensuring no chunks are removed when all are distinct.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_hash_cleaner.py | Lines: 28-36

    unit-test pytest hash-cleaner deduplication text-processing
  • function test_empty_input_v1

    A pytest test function that verifies the HashCleaner's behavior when processing an empty list of text chunks.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_hash_cleaner.py | Lines: 22-26

    testing unit-test pytest edge-case boundary-condition
  • function test_remove_identical_chunks

    A pytest test function that verifies the HashCleaner's ability to remove duplicate text chunks from a list while preserving order and unique entries.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_hash_cleaner.py | Lines: 8-20

    testing pytest unit-test deduplication text-processing
  • function hash_cleaner

    A pytest fixture that instantiates and returns a HashCleaner object for use in test cases.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_hash_cleaner.py | Lines: 5-6

    pytest fixture testing hash cleaner
  • class TestCombinedCleaner

    A unittest test class that validates the functionality of the CombinedCleaner class, testing its ability to remove duplicate and similar texts from collections.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_combined_cleaner.py | Lines: 6-47

    unittest testing text-cleaning deduplication similarity-detection
  • function test_similarity_threshold_effect

    A pytest test function that validates the behavior of SimilarityCleaner at different similarity thresholds, checking that higher thresholds retain more texts while lower thresholds remove similar content more aggressively.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py | Lines: 41-59

    testing pytest text-deduplication similarity-detection data-cleaning
  • function test_single_text_input

    A pytest test function that verifies the SimilarityCleaner correctly handles a single text document by returning it unchanged.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py | Lines: 36-39

    testing unit-test pytest text-processing similarity
  • function test_empty_input

    A pytest test function that verifies the SimilarityCleaner correctly handles empty input by returning an empty list.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py | Lines: 31-34

    testing unit-test pytest edge-case empty-input
  • function test_nearly_similar_text_handling

    A pytest test function that verifies the SimilarityCleaner's ability to identify and remove near-duplicate text entries while preserving distinct ones.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py | Lines: 20-29

    testing pytest text-processing similarity-detection deduplication
  • function test_identical_text_removal

    A pytest test function that verifies the SimilarityCleaner's ability to remove exact duplicate text entries from a list while preserving unique documents.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py | Lines: 9-18

    testing pytest unit-test deduplication text-processing
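
The hashing and similarity utilities listed above can be illustrated with a minimal sketch. This is not the repository's code: the function names are borrowed from the listing, but the signatures, the document shape (a dict with a "text" key), the whitespace-only normalization, and the default threshold of 0.9 are all assumptions made for illustration. Case is left untouched during normalization, in line with the case-sensitive behavior that the HashCleaner tests describe.

```python
import hashlib
import math
from itertools import combinations


def hash_text(text):
    """Return the SHA-256 hex digest of normalized text.

    Assumption: normalization only collapses whitespace; case is preserved.
    """
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def identify_duplicates(documents):
    """Group documents by content hash; keep groups with more than one member."""
    groups = {}
    for doc in documents:
        groups.setdefault(hash_text(doc["text"]), []).append(doc)
    return {h: docs for h, docs in groups.items() if len(docs) > 1}


def get_unique_documents(documents):
    """Split documents into (first occurrences, later duplicates), keeping order."""
    seen, unique, duplicates = set(), [], []
    for doc in documents:
        digest = hash_text(doc["text"])
        if digest in seen:
            duplicates.append(doc)
        else:
            seen.add(digest)
            unique.append(doc)
    return unique, duplicates


def calculate_similarity(a, b):
    """Cosine similarity of two vectors.

    In general the value lies in [-1, 1]; it stays within [0, 1] only when
    the dot product is non-negative. Zero vectors are mapped to 0.0 here
    (an assumption, to avoid division by zero).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_similar_documents(embeddings, threshold=0.9):
    """Return (i, j, score) for pairs at or above threshold, highest score first."""
    pairs = [
        (i, j, calculate_similarity(embeddings[i], embeddings[j]))
        for i, j in combinations(range(len(embeddings)), 2)
    ]
    pairs = [p for p in pairs if p[2] >= threshold]
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

Under these assumptions, get_unique_documents treats "hello  world" and "hello world" as the same document but keeps "Hello world" as distinct. The pairwise comparison in find_similar_documents is quadratic in the number of embeddings, which is acceptable for a sketch but may call for a vectorized similarity matrix on larger collections.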