function fuzzy_match_score

Maturity: 43

Calculates a case-insensitive fuzzy similarity score between two strings using difflib.SequenceMatcher, returning a ratio between 0.0 and 1.0.

File: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
Lines: 130 - 132
Complexity: simple

Purpose

This function provides case-insensitive fuzzy matching to measure how similar two strings are. It is useful where exact matches are not required, such as finding duplicate records with slight variations, matching user input against a database, suggesting spelling corrections, or identifying similar names and addresses. The function uses Python's difflib.SequenceMatcher, whose algorithm is closely related to the Ratcliff/Obershelp "gestalt pattern matching" approach: it finds the longest contiguous matching block, then recurses on the pieces to the left and right of it.

Source Code

def fuzzy_match_score(str1: str, str2: str) -> float:
    """Calculate similarity score between two strings"""
    return SequenceMatcher(None, str1.lower(), str2.lower()).ratio()

Parameters

Name  Type  Default  Kind
str1  str   -        positional_or_keyword
str2  str   -        positional_or_keyword

Parameter Details

str1: The first string to compare. Can be any string value including empty strings. The function converts this to lowercase internally for case-insensitive comparison.

str2: The second string to compare against the first. Can be any string value including empty strings. The function converts this to lowercase internally for case-insensitive comparison.
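
The effect of the internal lowercase conversion can be seen by comparing a raw SequenceMatcher call with one on lowercased input. This is a minimal sketch; the sample strings are made up for illustration.

```python
from difflib import SequenceMatcher

# Without lowercasing, "R" and "r" are different characters,
# so these two strings have nothing in common.
raw = SequenceMatcher(None, "REPORT", "report").ratio()

# With lowercasing (what fuzzy_match_score does), they are identical.
folded = SequenceMatcher(None, "REPORT".lower(), "report".lower()).ratio()

print(raw)     # 0.0
print(folded)  # 1.0
```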

Return Value

Type: float

Returns a float between 0.0 and 1.0 representing the similarity ratio. A value of 1.0 indicates identical strings (ignoring case), 0.0 indicates that the strings share no characters, and values in between represent partial similarity. The ratio is calculated as 2.0 * M / T, where M is the number of matched characters and T is the total number of characters in both strings.
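
The formula can be checked by hand from the matching blocks that SequenceMatcher reports. The helper below, ratio_by_hand, is a hypothetical name introduced only for this illustration; it is not part of the documented module.

```python
from difflib import SequenceMatcher


def ratio_by_hand(a: str, b: str) -> float:
    """Recompute the similarity ratio as 2.0 * M / T (illustrative helper)."""
    a, b = a.lower(), b.lower()
    matcher = SequenceMatcher(None, a, b)
    # M: total size of all matching blocks (the final sentinel block has size 0).
    m = sum(block.size for block in matcher.get_matching_blocks())
    # T: total number of characters in both strings.
    t = len(a) + len(b)
    # difflib defines the ratio of two empty sequences as 1.0.
    return 2.0 * m / t if t else 1.0


# "john smith" vs "jon smyth": blocks "jo" (2), "n sm" (4), "th" (2) -> M = 8, T = 19
print(ratio_by_hand("John Smith", "Jon Smyth"))  # 16/19, about 0.842
```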

Dependencies

  • difflib

Required Imports

from difflib import SequenceMatcher

Usage Example

from difflib import SequenceMatcher

def fuzzy_match_score(str1: str, str2: str) -> float:
    """Calculate similarity score between two strings"""
    return SequenceMatcher(None, str1.lower(), str2.lower()).ratio()

# Example usage
score1 = fuzzy_match_score("hello world", "hello world")
print(f"Exact match: {score1}")  # Output: 1.0

score2 = fuzzy_match_score("hello world", "Hello World!")
print(f"Case difference with punctuation: {score2}")  # Output: ~0.96

score3 = fuzzy_match_score("John Smith", "Jon Smyth")
print(f"Similar names: {score3}")  # Output: ~0.84

score4 = fuzzy_match_score("apple", "orange")
print(f"Different words: {score4}")  # Output: ~0.36

# Practical use case: finding best match
user_input = "Microsft"
companies = ["Microsoft", "Apple", "Google", "Amazon"]
best_match = max(companies, key=lambda x: fuzzy_match_score(user_input, x))
print(f"Best match for '{user_input}': {best_match}")  # Output: Microsoft

Best Practices

  • The function performs case-insensitive comparison by converting both strings to lowercase. If case-sensitive comparison is needed, modify the function accordingly.
  • SequenceMatcher.ratio() takes quadratic time in the string lengths in the worst case. For large-scale matching, consider caching results or using a compiled implementation of an edit-distance metric such as Levenshtein.
  • SequenceMatcher works well for general text similarity but is not optimal for every scenario. Metrics such as Jaro-Winkler (short strings like names) or Levenshtein distance (counting individual edits) may suit specific domains better.
  • Empty strings will return a score of 1.0 when compared to each other, and 0.0 when compared to non-empty strings.
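  • The function does not handle None values: passing None raises AttributeError at the .lower() call. Ensure both arguments are strings or add an explicit None check.
  • For performance-critical applications with many comparisons, SequenceMatcher's quick_ratio() and real_quick_ratio() methods compute cheap upper bounds on ratio(). They are useful for discarding candidates that cannot reach a threshold before calling the slower ratio().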
  • When using this for threshold-based matching (e.g., score > 0.8 means match), test with representative data to determine appropriate threshold values for your use case.
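
Several of the points above (None handling, upper-bound prefiltering, threshold-based matching) can be combined in one place. The following is a sketch under stated assumptions, not the module's implementation: best_match_above is a hypothetical helper, and the default threshold of 0.8 is only a placeholder that should be tuned on representative data.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional


def best_match_above(
    query: Optional[str],
    candidates: Iterable[Optional[str]],
    threshold: float = 0.8,  # placeholder value, tune for your data
) -> Optional[str]:
    """Return the candidate scoring highest at or above threshold, else None."""
    if query is None:
        return None
    query_lower = query.lower()
    best, best_score = None, threshold
    for cand in candidates:
        if cand is None:  # skip missing values instead of raising AttributeError
            continue
        matcher = SequenceMatcher(None, query_lower, cand.lower())
        # real_quick_ratio() and quick_ratio() are upper bounds on ratio(),
        # so a candidate whose bound is below the current best cannot win.
        if matcher.real_quick_ratio() < best_score:
            continue
        if matcher.quick_ratio() < best_score:
            continue
        score = matcher.ratio()
        if score >= best_score:
            best, best_score = cand, score
    return best


companies = ["Microsoft", "Apple", None, "Google", "Amazon"]
print(best_match_above("Microsft", companies))  # Microsoft
print(best_match_above("zzz", companies))       # None
```

Because the prefilters only skip candidates that cannot reach the current best score, the result is the same as calling ratio() on every candidate; only the amount of work changes.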

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function fuzzy_match_filename 64.2% similar

    Calculates a fuzzy match similarity score between two filenames by comparing them after normalization, using exact matching, substring containment, and word overlap techniques.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py
  • function compare_pdf_content 59.4% similar

    Compares the textual content similarity between two PDF files by extracting text samples and computing a similarity ratio using sequence matching.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function find_best_match 53.8% similar

    Finds the best matching document from a list of candidates by comparing hash, size, filename, and content similarity with configurable confidence thresholds.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function test_similarity_threshold_effect 51.9% similar

    A pytest test function that validates the behavior of SimilarityCleaner with different similarity threshold values, ensuring that higher thresholds retain more texts while lower thresholds are more aggressive in removing similar content.

    From: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py
  • function calculate_similarity 47.0% similar

    Computes the cosine similarity between two embedding vectors, returning a normalized score between 0 and 1 that measures their directional alignment.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py