function test_document_extractor
A manual test function that exercises the DocumentExtractor class: it lists the supported extensions, extracts text from a fixed set of Markdown files, and checks file type detection, printing the results to the console.
/tf/active/vicechatdev/leexi/test_document_extractor.py
15 - 55
simple
Purpose
This function is a manual smoke test for the DocumentExtractor class. It prints the extensions reported by get_supported_extensions(), attempts text extraction on three Markdown files expected in the current working directory, and checks is_supported_file() against seven extensions (.docx, .pdf, .pptx, .txt, .md, .doc, .ppt). Files that do not exist are skipped without any message, and extraction errors are caught and printed rather than raised. Results are reported through console output with ✓/✗ markers; the function makes no assertions, so a failure never stops the run.
Source Code
def test_document_extractor():
    """Test the document extractor with various file types"""
    # Initialize extractor
    extractor = DocumentExtractor()

    print("Document Extractor Test")
    print("=" * 50)

    # Test supported extensions
    supported_extensions = extractor.get_supported_extensions()
    print(f"Supported extensions: {supported_extensions}")
    print()

    # Test with existing files in the directory
    test_files = [
        "enhanced_meeting_minutes_2025-06-18.md",
        "leexi-20250618-transcript-development_team_meeting.md",
        "powerpoint_content_summary.md"
    ]

    for file_path in test_files:
        if os.path.exists(file_path):
            print(f"Testing file: {file_path}")
            try:
                content = extractor.extract_text(file_path)
                if content:
                    print(f"✓ Successfully extracted {len(content)} characters")
                    print(f"Preview: {content[:200]}...")
                else:
                    print("✗ No content extracted")
            except Exception as e:
                print(f"✗ Error: {str(e)}")
            print("-" * 40)

    # Test file type detection
    test_extensions = ['.docx', '.pdf', '.pptx', '.txt', '.md', '.doc', '.ppt']
    print("\nFile type detection test:")
    for ext in test_extensions:
        is_supported = extractor.is_supported_file(f"test{ext}")
        print(f"{ext}: {'✓ Supported' if is_supported else '✗ Not supported'}")
Return Value
This function does not return any value (implicitly returns None). It outputs test results directly to the console, including supported extensions, extraction success/failure status, character counts, content previews, and file type detection results.
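Because the results only reach the console, a caller that wants to inspect them has to capture stdout. The following is a minimal sketch of one way to do that with the standard library. The names run_and_capture and demo_report are hypothetical; demo_report is a stand-in that prints a few lines in the same style, since the real function needs the document_extractor module.

```python
import contextlib
import io


def run_and_capture(func):
    """Call func() and return everything it printed to stdout as one string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func()
    return buffer.getvalue()


def demo_report():
    # Hypothetical stand-in for test_document_extractor()
    print("Document Extractor Test")
    print("=" * 50)
    print(".md: ✓ Supported")


captured = run_and_capture(demo_report)

# Any line carrying the ✗ marker signals a failed or unsupported check
failed_lines = [line for line in captured.splitlines() if "✗" in line]
```

With the real module importable, run_and_capture(test_document_extractor) would return the full report as text, which can then be searched for the ✗ marker.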
Dependencies
- os
- sys
- pathlib
- document_extractor
Required Imports
import os
import sys
from pathlib import Path
from document_extractor import DocumentExtractor
Usage Example
# Ensure DocumentExtractor is available and test files exist
# Run the test function
test_document_extractor()
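# Example output (the extension list and character count are illustrative
# and depend on the DocumentExtractor implementation and the files present):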
# Document Extractor Test
# ==================================================
# Supported extensions: ['.md', '.txt', '.docx', '.pdf', '.pptx']
#
# Testing file: enhanced_meeting_minutes_2025-06-18.md
# ✓ Successfully extracted 1234 characters
# Preview: # Meeting Minutes...
# ----------------------------------------
# ...
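The file names in test_files are relative, so os.path.exists() resolves them against the current working directory, and a file that is not found there is skipped without any output. A minimal sketch of running the check from the directory that holds the files is shown here; the working_directory helper is hypothetical, and a temporary directory with one sample file stands in for the real project folder.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def working_directory(path):
    """Temporarily switch the current working directory, restoring it afterwards."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)


start_dir = os.getcwd()

with tempfile.TemporaryDirectory() as data_dir:
    name = "powerpoint_content_summary.md"
    with open(os.path.join(data_dir, name), "w", encoding="utf-8") as handle:
        handle.write("# Summary\n")

    with working_directory(data_dir):
        # test_document_extractor() would be called here; the relative
        # file name now resolves inside data_dir
        found_inside = os.path.exists(name)

restored = os.getcwd() == start_dir
```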
Best Practices
- Ensure the DocumentExtractor class is properly implemented before running this test
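- Run the test from the directory that contains the test files, or update the file paths: the relative names are resolved against the current working directory, and missing files are skipped without any message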
- This function is designed for manual testing and debugging; consider using pytest or unittest for automated testing
- The function uses print statements for output; redirect stdout if you need to capture results programmatically
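- Only extract_text() is wrapped in try/except; errors raised by DocumentExtractor(), get_supported_extensions(), or is_supported_file() will propagate, so guard the call if it runs as part of a larger script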
- Consider parameterizing the test_files list to make the test more flexible and reusable
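The suggestions above about automated testing and a parameterized file list can be sketched as follows. This is an illustration under stated assumptions, not the project's code: check_files and StubExtractor are hypothetical names, and the stub only mimics the three methods the test calls so that the example runs without the real document_extractor module.

```python
import os
import tempfile


class StubExtractor:
    """Hypothetical stand-in exposing the methods used by the test."""

    SUPPORTED = {".md", ".txt"}

    def get_supported_extensions(self):
        return sorted(self.SUPPORTED)

    def is_supported_file(self, file_path):
        return os.path.splitext(file_path)[1].lower() in self.SUPPORTED

    def extract_text(self, file_path):
        with open(file_path, encoding="utf-8") as handle:
            return handle.read()


def check_files(extractor, file_paths):
    """Return {path: status}; status is 'ok', 'empty', 'missing' or 'error: ...'."""
    results = {}
    for path in file_paths:
        if not os.path.exists(path):
            results[path] = "missing"  # reported instead of silently skipped
            continue
        try:
            content = extractor.extract_text(path)
        except Exception as exc:
            results[path] = f"error: {exc}"
            continue
        results[path] = "ok" if content else "empty"
    return results


with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "notes.md")
    with open(present, "w", encoding="utf-8") as handle:
        handle.write("# Meeting Minutes\n")
    absent = os.path.join(tmp, "missing.md")

    results = check_files(StubExtractor(), [present, absent])
    statuses = [results[present], results[absent]]
```

Returning a status per file lets a pytest or unittest case assert on the outcome, and passing the paths in as an argument removes the dependence on a hard-coded list.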
Similar Components
AI-powered semantic similarity - components with related functionality:
- function test_multiple_files 81.5% similar
- function test_mixed_previous_reports 73.4% similar
- class DocumentExtractor 73.0% similar
- function test_attendee_extraction 59.1% similar
- function test_attendee_extraction_comprehensive 56.3% similar