
function extract_warranty_data

Maturity: 46

Parses markdown-formatted warranty documentation to extract structured warranty information including IDs, titles, sections, source document counts, warranty text, and disclosure content.

File: /tf/active/vicechatdev/convert_disclosures_to_table.py
Lines: 75 - 139
Complexity: moderate

Purpose

This function processes markdown content in which warranty information follows specific heading patterns (## for warranty sections, ### for subsections). It returns the warranty data as a list of dictionaries. Along the way it converts escaped newlines to real ones, parses warranty IDs (including IDs with parenthetical suffixes), extracts metadata fields, and produces both a truncated summary and the full text of each disclosure. It is useful for converting markdown warranty documentation into structured data for analysis, reporting, or database storage.

Source Code

def extract_warranty_data(markdown_content):
    """Extract warranty data from markdown content."""
    warranties = []
    
    # First, normalize the content by converting escaped newlines to actual newlines
    normalized_content = markdown_content.replace('\\n', '\n')
    
    # Find all warranty sections using a more flexible pattern
    # Look for ## followed by warranty ID - Title pattern
    warranty_pattern = r'## ([\d\.]+(?:\([a-z]\))?(?:\([ivx]+\))?(?:\([A-Za-z]+\))?) - (.+?)\n'
    
    # Find all warranty sections
    warranty_matches = list(re.finditer(warranty_pattern, normalized_content))
    logger.info(f"Found {len(warranty_matches)} warranty sections")
    
    for i, match in enumerate(warranty_matches):
        warranty_id = match.group(1).strip()
        warranty_title = match.group(2).strip()
        
        # Find the content between this warranty and the next one (or end of file)
        start_pos = match.end()
        if i + 1 < len(warranty_matches):
            end_pos = warranty_matches[i + 1].start()
            content = normalized_content[start_pos:end_pos]
        else:
            content = normalized_content[start_pos:]
        
        logger.info(f"Processing warranty: {warranty_id} - {warranty_title}")
        
        # Extract section name (look for **Section**: pattern)
        section_match = re.search(r'\*\*Section\*\*:\s*(.+?)(?:\n|\*\*)', content)
        section_name = section_match.group(1).strip() if section_match else ""
        
        # Extract source documents count
        source_docs_match = re.search(r'\*\*Source Documents Found\*\*:\s*(\d+)', content)
        source_docs_count = source_docs_match.group(1) if source_docs_match else "0"
        
        # Extract warranty text (between ### Warranty Text and ### Disclosure)
        warranty_text_match = re.search(r'### Warranty Text\s*\n\n(.+?)\n\n### Disclosure', content, re.DOTALL)
        warranty_text = clean_text(warranty_text_match.group(1)) if warranty_text_match else ""
        
        # Extract disclosure content (everything after ### Disclosure until next --- or end)
        disclosure_match = re.search(r'### Disclosure\s*\n\n(.+?)(?=\n\n---\n|$)', content, re.DOTALL)
        disclosure_content = clean_text(disclosure_match.group(1)) if disclosure_match else ""
        
        # If disclosure_content is empty, try a more relaxed pattern
        if not disclosure_content:
            disclosure_match = re.search(r'### Disclosure\s*\n(.+?)(?=\n---\n|$)', content, re.DOTALL)
            disclosure_content = clean_text(disclosure_match.group(1)) if disclosure_match else ""
        
        # Create both summary and full versions
        disclosure_summary = disclosure_content[:500] + "..." if len(disclosure_content) > 500 else disclosure_content
        
        warranties.append({
            'Warranty_ID': warranty_id,
            'Warranty_Title': warranty_title,
            'Section_Name': section_name,
            'Source_Documents_Count': source_docs_count,
            'Warranty_Text': warranty_text,
            'Disclosure_Summary': disclosure_summary,
            'Full_Disclosure': disclosure_content
        })
    
    logger.info(f"Extracted {len(warranties)} warranties")
    return warranties

Parameters

Name              Type  Default  Kind
markdown_content  -     -        positional_or_keyword

Parameter Details

markdown_content: A string containing markdown-formatted warranty documentation. Each warranty section is expected to start with a '## [ID] - [Title]' header line, followed by the metadata lines '**Section**:' and '**Source Documents Found**:' and the subsections '### Warranty Text' and '### Disclosure'. Literal backslash-n sequences (the two characters \ and n) are converted to real newlines before parsing. The warranty ID pattern accepts formats such as '1.2', '1.2(a)', '1.2(a)(i)', or '1.2(a)(Example)'. A header is recognized only if its line ends with a newline.
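To make the header rules concrete, the sketch below reuses the header pattern from the source listing and runs it against a handful of sample headers. The helper name find_headers and the sample text are illustrative assumptions, not part of the original module.

```python
import re

# Same header pattern as in the function's source listing.
WARRANTY_PATTERN = r'## ([\d\.]+(?:\([a-z]\))?(?:\([ivx]+\))?(?:\([A-Za-z]+\))?) - (.+?)\n'

def find_headers(markdown_content):
    """Return (id, title) pairs for every header the pattern recognizes.

    Hypothetical helper for illustration only.
    """
    # Mirror the function's first step: literal backslash-n becomes a newline.
    normalized = markdown_content.replace('\\n', '\n')
    return [
        (m.group(1), m.group(2).strip())
        for m in re.finditer(WARRANTY_PATTERN, normalized)
    ]

sample = (
    "## 1.2 - Plain\n"
    "## 1.2(a) - Lettered\n"
    "## 1.2(a)(i) - Roman\n"
    "## 1.2(a)(Example) - Worded\n"
    "## A.1 - Not matched\n"          # ID must start with a digit or dot
    "## 2.1 - No trailing newline"    # header line lacks the final newline
)

headers = find_headers(sample)
for warranty_id, title in headers:
    print(warranty_id, "-", title)
```

The last two sample headers are skipped without any error or log message, so a count lower than expected in the "Found N warranty sections" log line is the main hint that some headers did not match.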

Return Value

Returns a list of dictionaries, one per warranty section, with the following keys:

  • 'Warranty_ID' (string): the extracted warranty identifier
  • 'Warranty_Title' (string): the warranty title
  • 'Section_Name' (string): the section name, or an empty string if not found
  • 'Source_Documents_Count' (string): the number of source documents, or '0' if not found
  • 'Warranty_Text' (string): the cleaned warranty text
  • 'Disclosure_Summary' (string): the first 500 characters of the disclosure, followed by '...' when the disclosure is longer than 500 characters
  • 'Full_Disclosure' (string): the complete disclosure content

Returns an empty list if no warranty sections are found.
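Since every value in a record is a string, including the document count, downstream code may want to cast fields before storing them. The following is a minimal sketch of one way to serialize such records to CSV. The records_to_csv helper and the sample record are assumptions made for illustration and are not taken from the original module.

```python
import csv
import io

# Hand-written record in the shape described above (illustrative values only).
records = [
    {
        'Warranty_ID': '1.1',
        'Warranty_Title': 'Limited Warranty',
        'Section_Name': 'General Warranties',
        'Source_Documents_Count': '3',
        'Warranty_Text': 'This product is warranted against defects.',
        'Disclosure_Summary': 'Warranty is valid for 1 year.',
        'Full_Disclosure': 'Warranty is valid for 1 year.',
    },
]

def records_to_csv(rows):
    """Serialize records to CSV text, casting the count field to int.

    Hypothetical helper; returns an empty string for an empty list.
    """
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    for row in rows:
        # The function stores the count as a string, so convert it here.
        count = int(row['Source_Documents_Count'])
        writer.writerow({**row, 'Source_Documents_Count': count})
    return buffer.getvalue()

csv_text = records_to_csv(records)
print(csv_text)
```

The empty-list guard matters because the function returns an empty list when no headers match, and there would be no first record to take column names from.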

Dependencies

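  • re
  • logging
  • clean_text() and logger: both are referenced by the function but not defined in the listing above, so they must exist at module level wherever the function is defined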

Required Imports

import re
import logging

Usage Example

import re
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

def clean_text(text):
    """Helper function to clean text."""
    return text.strip()

markdown_content = '''
## 1.1 - Limited Warranty

**Section**: General Warranties
**Source Documents Found**: 3

### Warranty Text

This product is warranted against defects in materials and workmanship.

### Disclosure

Warranty is valid for 1 year from date of purchase. Excludes normal wear and tear.

---

## 1.2(a) - Extended Coverage

**Section**: Optional Warranties
**Source Documents Found**: 2

### Warranty Text

Extended coverage available for purchase.

### Disclosure

Additional terms apply for extended warranty coverage.
'''

# extract_warranty_data must already be defined or imported (see Source Code above)
warranties = extract_warranty_data(markdown_content)
for warranty in warranties:
    print(f"ID: {warranty['Warranty_ID']}, Title: {warranty['Warranty_Title']}")
    print(f"Section: {warranty['Section_Name']}")
    print(f"Sources: {warranty['Source_Documents_Count']}")
    print(f"Text: {warranty['Warranty_Text'][:50]}...")
    print(f"Disclosure: {warranty['Disclosure_Summary'][:50]}...\n")

Best Practices

  • Ensure the markdown content follows the expected format with '## [ID] - [Title]' headers for warranty sections
  • Define a clean_text() helper and a module-level logger before calling this function; it references both and raises NameError if either is missing
  • Configure logging appropriately to capture the info-level messages about warranty processing
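  • The function expects specific markdown patterns (### Warranty Text, ### Disclosure); deviations may result in empty fields. In particular, the warranty text is captured only when a blank line follows '### Warranty Text' and another precedes '### Disclosure'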
  • Handle the returned list appropriately as it may be empty if no warranty sections match the expected pattern
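  • The warranty ID pattern accepts digits and dots followed by up to three optional parenthetical suffixes, in this order: a single lowercase letter, a lowercase roman numeral (i, v, x), and an alphabetic word. Headers with other ID formats are skipped silently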
  • Consider validating the markdown structure before passing to this function for better error handling
  • The disclosure summary is truncated at 500 characters; use 'Full_Disclosure' key for complete content
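The suggestion above to validate the markdown structure first could look roughly like the sketch below. It is a hypothetical pre-check, not part of the original module: the find_structure_problems name is invented here, and the header pattern is deliberately looser than the one the function uses, so it also reports sections whose IDs the function would skip.

```python
import re

# Looser header pattern than the function's own: any non-space ID is accepted.
HEADER = re.compile(r'^## (\S+) - (.+)$', re.MULTILINE)

# Markers the function searches for inside each warranty section.
REQUIRED = (
    '**Section**:',
    '**Source Documents Found**:',
    '### Warranty Text',
    '### Disclosure',
)

def find_structure_problems(markdown_content):
    """Return a list of human-readable problems; an empty list means none found."""
    text = markdown_content.replace('\\n', '\n')
    headers = list(HEADER.finditer(text))
    if not headers:
        return ["no '## [ID] - [Title]' headers found"]
    problems = []
    for i, header in enumerate(headers):
        # A section runs from this header to the next one, or to end of text.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[header.end():end]
        for marker in REQUIRED:
            if marker not in body:
                problems.append(f"{header.group(1)}: missing {marker}")
    return problems

incomplete = "## 1.2 - B\n\n**Section**: S\n\n### Warranty Text\n\nT\n"
for problem in find_structure_problems(incomplete):
    print(problem)
```

A check like this only confirms that the markers are present. It does not verify the blank-line spacing that the extraction patterns rely on, so empty fields are still possible after it passes.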

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function extract_warranty_data_improved 96.0% similar

    Parses markdown-formatted warranty documentation to extract structured warranty data including IDs, titles, sections, disclosure text, and reference citations.

    From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py
  • function extract_warranty_sections 88.9% similar

    Parses markdown content to extract warranty section headers, returning a list of dictionaries containing section IDs and titles for table of contents generation.

    From: /tf/active/vicechatdev/enhanced_word_converter_fixed.py
  • function create_enhanced_word_document 72.4% similar

    Converts markdown-formatted warranty disclosure content into a formatted Microsoft Word document with hierarchical headings, styled text, lists, and special formatting for block references.

    From: /tf/active/vicechatdev/improved_word_converter.py
  • function main_v5 70.7% similar

    Converts a markdown file containing warranty disclosure data into multiple tabular formats (CSV, Excel, Word) with timestamped output files.

    From: /tf/active/vicechatdev/convert_disclosures_to_table.py
  • function main_v1 69.5% similar

    Orchestrates the conversion of an improved markdown file containing warranty disclosures into multiple tabular formats (CSV, Excel, Word) with timestamp-based file naming.

    From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py