function extract_warranty_data
Parses markdown-formatted warranty documentation to extract structured warranty information including IDs, titles, sections, source document counts, warranty text, and disclosure content.
/tf/active/vicechatdev/convert_disclosures_to_table.py
75 - 139
moderate
Purpose
This function processes markdown content in which each warranty starts with a '## [ID] - [Title]' heading and contains '### Warranty Text' and '### Disclosure' subsections. It converts escaped newlines to real newlines, parses warranty IDs (including IDs with parenthetical suffixes), extracts the section name and source document count, and stores both a truncated summary and the full text of each disclosure. The result is a list of dictionaries, which makes the function useful for converting markdown warranty documentation into structured data for analysis, reporting, or database storage.
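The first two steps described above (newline normalization and splitting on warranty headers) can be sketched in isolation. This is a minimal illustration, not part of the original module: `split_sections` is a hypothetical helper name, and only the header pattern is copied from the source code.

```python
import re

# Header pattern copied from extract_warranty_data.
WARRANTY_PATTERN = r'## ([\d\.]+(?:\([a-z]\))?(?:\([ivx]+\))?(?:\([A-Za-z]+\))?) - (.+?)\n'


def split_sections(markdown_content):
    """Hypothetical helper: return (warranty_id, title, body) tuples per header."""
    # Turn literal backslash-n sequences into real newlines.
    normalized = markdown_content.replace('\\n', '\n')
    matches = list(re.finditer(WARRANTY_PATTERN, normalized))
    sections = []
    for i, match in enumerate(matches):
        # A section body runs from the end of its header to the next header (or EOF).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(normalized)
        body = normalized[match.end():end]
        sections.append((match.group(1), match.group(2).strip(), body))
    return sections


# The input uses escaped newlines, as the function is documented to accept.
raw = "## 1.1 - Limited Warranty\\nBody A\\n## 1.2(a) - Extended Coverage\\nBody B\\n"
sections = split_sections(raw)
for warranty_id, title, body in sections:
    print(warranty_id, title, repr(body))
```

Because the header pattern is not anchored to the start of a line, it relies on the ID shape (digits and dots) to avoid matching '### Warranty Text' style subheadings.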
Source Code
def extract_warranty_data(markdown_content):
    """Extract warranty data from markdown content."""
    warranties = []

    # First, normalize the content by converting escaped newlines to actual newlines
    normalized_content = markdown_content.replace('\\n', '\n')

    # Find all warranty sections using a more flexible pattern
    # Look for ## followed by warranty ID - Title pattern
    warranty_pattern = r'## ([\d\.]+(?:\([a-z]\))?(?:\([ivx]+\))?(?:\([A-Za-z]+\))?) - (.+?)\n'

    # Find all warranty sections
    warranty_matches = list(re.finditer(warranty_pattern, normalized_content))
    logger.info(f"Found {len(warranty_matches)} warranty sections")

    for i, match in enumerate(warranty_matches):
        warranty_id = match.group(1).strip()
        warranty_title = match.group(2).strip()

        # Find the content between this warranty and the next one (or end of file)
        start_pos = match.end()
        if i + 1 < len(warranty_matches):
            end_pos = warranty_matches[i + 1].start()
            content = normalized_content[start_pos:end_pos]
        else:
            content = normalized_content[start_pos:]

        logger.info(f"Processing warranty: {warranty_id} - {warranty_title}")

        # Extract section name (look for **Section**: pattern)
        section_match = re.search(r'\*\*Section\*\*:\s*(.+?)(?:\n|\*\*)', content)
        section_name = section_match.group(1).strip() if section_match else ""

        # Extract source documents count
        source_docs_match = re.search(r'\*\*Source Documents Found\*\*:\s*(\d+)', content)
        source_docs_count = source_docs_match.group(1) if source_docs_match else "0"

        # Extract warranty text (between ### Warranty Text and ### Disclosure)
        warranty_text_match = re.search(r'### Warranty Text\s*\n\n(.+?)\n\n### Disclosure', content, re.DOTALL)
        warranty_text = clean_text(warranty_text_match.group(1)) if warranty_text_match else ""

        # Extract disclosure content (everything after ### Disclosure until next --- or end)
        disclosure_match = re.search(r'### Disclosure\s*\n\n(.+?)(?=\n\n---\n|$)', content, re.DOTALL)
        disclosure_content = clean_text(disclosure_match.group(1)) if disclosure_match else ""

        # If disclosure_content is empty, try a more relaxed pattern
        if not disclosure_content:
            disclosure_match = re.search(r'### Disclosure\s*\n(.+?)(?=\n---\n|$)', content, re.DOTALL)
            disclosure_content = clean_text(disclosure_match.group(1)) if disclosure_match else ""

        # Create both summary and full versions
        disclosure_summary = disclosure_content[:500] + "..." if len(disclosure_content) > 500 else disclosure_content

        warranties.append({
            'Warranty_ID': warranty_id,
            'Warranty_Title': warranty_title,
            'Section_Name': section_name,
            'Source_Documents_Count': source_docs_count,
            'Warranty_Text': warranty_text,
            'Disclosure_Summary': disclosure_summary,
            'Full_Disclosure': disclosure_content
        })

    logger.info(f"Extracted {len(warranties)} warranties")
    return warranties
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| markdown_content | - | - | positional_or_keyword |
Parameter Details
markdown_content: A string containing markdown-formatted warranty documentation. Each warranty section is expected to start with a '## [ID] - [Title]' header, followed by the '**Section**:' and '**Source Documents Found**:' fields and the '### Warranty Text' and '### Disclosure' subsections. The content may contain escaped newlines (the literal two-character sequence \n), which are converted to real newlines before parsing. The warranty ID pattern accepts digits and dots followed by up to three optional parenthetical suffixes, so it supports formats like '1.2', '1.2(a)', '1.2(a)(i)', or '1.2(a)(Example)'.
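To see which ID shapes the pattern accepts, the ID portion of the header regex can be tested on its own. The sketch below is illustrative only: `is_supported_id` is a hypothetical helper, and it uses `fullmatch` for clarity, whereas the function itself matches the ID as part of the full '## [ID] - [Title]' header.

```python
import re

# ID portion of the header pattern used by extract_warranty_data.
ID_PATTERN = re.compile(r'[\d\.]+(?:\([a-z]\))?(?:\([ivx]+\))?(?:\([A-Za-z]+\))?')


def is_supported_id(candidate):
    """Hypothetical helper: True if the whole string fits the ID pattern."""
    return ID_PATTERN.fullmatch(candidate) is not None


examples = [
    '1.2',              # digits and dots only
    '1.2(a)',           # single lowercase letter suffix
    '1.2(a)(i)',        # letter suffix plus lowercase roman numeral
    '1.2(a)(Example)',  # letter suffix plus alphabetic word
    '1.2(A)',           # uppercase letter is accepted by the final word suffix
    'A.1',              # must start with a digit or dot
    '1.2(a)(b)(c)(d)',  # more suffixes than the pattern allows
]
for candidate in examples:
    print(f"{candidate!r}: {is_supported_id(candidate)}")
```

Since the header regex requires ' - ' immediately after the ID, a header whose ID does not fit this shape is not matched at all, and its content ends up inside the preceding section.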
Return Value
Returns a list of dictionaries, where each dictionary represents one warranty section with the following keys:
- 'Warranty_ID' (string): the extracted warranty identifier
- 'Warranty_Title' (string): the warranty title
- 'Section_Name' (string): the section name, or an empty string if not found
- 'Source_Documents_Count' (string): the number of source documents, or '0' if not found
- 'Warranty_Text' (string): the cleaned warranty text content
- 'Disclosure_Summary' (string): the first 500 characters of the disclosure, followed by '...' if truncated
- 'Full_Disclosure' (string): the complete disclosure content
Returns an empty list if no warranty sections are found.
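Because every value in the returned dictionaries is a string, including 'Source_Documents_Count', callers that need a table may want to convert the count themselves. The following is a minimal sketch of one way to do that with the standard library; `warranties_to_csv` and the sample record are assumptions made for illustration and are not part of the original module.

```python
import csv
import io

# Keys produced by extract_warranty_data, in the order they are added.
FIELDS = [
    'Warranty_ID', 'Warranty_Title', 'Section_Name', 'Source_Documents_Count',
    'Warranty_Text', 'Disclosure_Summary', 'Full_Disclosure',
]


def warranties_to_csv(warranties):
    """Hypothetical helper: render the returned list of dicts as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator='\n')
    writer.writeheader()
    for row in warranties:
        # The count is returned as a string such as '3' or '0'; store it as an int.
        writer.writerow({**row, 'Source_Documents_Count': int(row['Source_Documents_Count'])})
    return buffer.getvalue()


# Sample record shaped like the function's output (values are illustrative).
sample = [{
    'Warranty_ID': '1.1',
    'Warranty_Title': 'Limited Warranty',
    'Section_Name': 'General Warranties',
    'Source_Documents_Count': '3',
    'Warranty_Text': 'This product is warranted against defects.',
    'Disclosure_Summary': 'Warranty is valid for 1 year.',
    'Full_Disclosure': 'Warranty is valid for 1 year.',
}]
csv_text = warranties_to_csv(sample)
print(csv_text)
```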
Dependencies
- re
- logging
Required Imports
import re
import logging
Usage Example
import re
import logging

# extract_warranty_data (see Source Code) must be defined or imported in this
# module; it looks up `logger` and `clean_text` as module-level names.
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

def clean_text(text):
    """Helper function to clean text."""
    return text.strip()

# The blank lines after each '###' heading matter: the primary patterns
# expect an empty line between a heading and its content.
markdown_content = '''
## 1.1 - Limited Warranty

**Section**: General Warranties
**Source Documents Found**: 3

### Warranty Text

This product is warranted against defects in materials and workmanship.

### Disclosure

Warranty is valid for 1 year from date of purchase. Excludes normal wear and tear.

---

## 1.2(a) - Extended Coverage

**Section**: Optional Warranties
**Source Documents Found**: 2

### Warranty Text

Extended coverage available for purchase.

### Disclosure

Additional terms apply for extended warranty coverage.
'''

warranties = extract_warranty_data(markdown_content)

for warranty in warranties:
    print(f"ID: {warranty['Warranty_ID']}, Title: {warranty['Warranty_Title']}")
    print(f"Section: {warranty['Section_Name']}")
    print(f"Sources: {warranty['Source_Documents_Count']}")
    print(f"Text: {warranty['Warranty_Text'][:50]}...")
    print(f"Disclosure: {warranty['Disclosure_Summary'][:50]}...\n")
Best Practices
- Ensure the markdown content follows the expected format with '## [ID] - [Title]' headers for warranty sections
- Define a clean_text() helper function before using this function to properly clean extracted text
- Configure logging appropriately to capture the info-level messages about warranty processing
- The function expects specific markdown patterns (### Warranty Text, ### Disclosure); deviations may result in empty fields
- Handle the returned list appropriately as it may be empty if no warranty sections match the expected pattern
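- The warranty ID pattern accepts digits and dots followed by up to three optional parenthetical suffixes (a single lowercase letter, a lowercase roman numeral, an alphabetic word); headers whose IDs do not fit this shape are not recognized
- Consider validating the markdown structure before passing it to this function, since malformed input produces empty fields or an empty list rather than an exception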
- The disclosure summary is truncated at 500 characters; use 'Full_Disclosure' key for complete content
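The validation suggested above could look something like the following. This is a sketch under stated assumptions rather than part of the original module: `find_structure_problems` is a hypothetical helper, and it only checks for the presence of the two '###' subsections under each '## ' header.

```python
import re

# Any level-2 heading at the start of a line ('### ' lines do not match).
HEADER = re.compile(r'^## (.+)$', re.MULTILINE)
REQUIRED_SUBSECTIONS = ('### Warranty Text', '### Disclosure')


def find_structure_problems(markdown_content):
    """Hypothetical pre-check: list problems that would lead to empty fields."""
    normalized = markdown_content.replace('\\n', '\n')
    headers = list(HEADER.finditer(normalized))
    if not headers:
        return ['no "## ID - Title" headers found']
    problems = []
    for i, header in enumerate(headers):
        # Section body: from this header to the next one, or to the end.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(normalized)
        body = normalized[header.end():end]
        for required in REQUIRED_SUBSECTIONS:
            if required not in body:
                problems.append(f'{header.group(1)}: missing "{required}"')
    return problems


doc = (
    "## 1.1 - Limited Warranty\n\n### Warranty Text\n\nText.\n\n"
    "### Disclosure\n\nDetails.\n\n---\n\n"
    "## 1.2 - Extended Coverage\n\n### Warranty Text\n\nText only.\n"
)
problems = find_structure_problems(doc)
for problem in problems:
    print(problem)
```

Running a check like this before extract_warranty_data lets the caller report malformed sections explicitly instead of discovering them later as empty 'Warranty_Text' or 'Full_Disclosure' values.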
Similar Components
AI-powered semantic similarity - components with related functionality:
- function extract_warranty_data_improved: 96.0% similar
- function extract_warranty_sections: 88.9% similar
- function create_enhanced_word_document: 72.4% similar
- function main_v5: 70.7% similar
- function main_v1: 69.5% similar