test_document_extractor - Code Extractor

function test_document_extractor

Maturity: 43

A test function that exercises the DocumentExtractor class: it lists the supported extensions, extracts text from a set of Markdown files, and checks file type detection for common document extensions.

File:
/tf/active/vicechatdev/leexi/test_document_extractor.py

Lines:
15 - 55

Complexity:
simple

Purpose

This function is a manual smoke test for the DocumentExtractor class. It prints the list of supported extensions, attempts text extraction on three Markdown files expected in the working directory, and checks whether seven extensions (.docx, .pdf, .pptx, .txt, .md, .doc, .ppt) are reported as supported. Files that do not exist are skipped without any message, and extraction errors are caught and printed rather than raised. Each step reports success or failure on the console with ✓ and ✗ markers.

Source Code

def test_document_extractor():
    """Test the document extractor with various file types"""
    
    # Initialize extractor
    extractor = DocumentExtractor()
    
    print("Document Extractor Test")
    print("=" * 50)
    
    # Test supported extensions
    supported_extensions = extractor.get_supported_extensions()
    print(f"Supported extensions: {supported_extensions}")
    print()
    
    # Test with existing files in the directory
    test_files = [
        "enhanced_meeting_minutes_2025-06-18.md",
        "leexi-20250618-transcript-development_team_meeting.md",
        "powerpoint_content_summary.md"
    ]
    
    for file_path in test_files:
        if os.path.exists(file_path):
            print(f"Testing file: {file_path}")
            try:
                content = extractor.extract_text(file_path)
                if content:
                    print(f"✓ Successfully extracted {len(content)} characters")
                    print(f"Preview: {content[:200]}...")
                else:
                    print("✗ No content extracted")
            except Exception as e:
                print(f"✗ Error: {str(e)}")
            print("-" * 40)
    
    # Test file type detection
    test_extensions = ['.docx', '.pdf', '.pptx', '.txt', '.md', '.doc', '.ppt']
    print("\nFile type detection test:")
    for ext in test_extensions:
        is_supported = extractor.is_supported_file(f"test{ext}")
        print(f"{ext}: {'✓ Supported' if is_supported else '✗ Not supported'}")

Return Value

This function does not return any value (implicitly returns None). It outputs test results directly to the console, including supported extensions, extraction success/failure status, character counts, content previews, and file type detection results.
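
Because the results are printed rather than returned, a caller that wants to inspect them has to capture standard output. A minimal sketch using the standard library's contextlib.redirect_stdout follows. Here noisy_check is a hypothetical stand-in for test_document_extractor so that the snippet runs on its own, and capture_output is likewise an illustrative helper name, not part of the original module.

```python
import contextlib
import io


def noisy_check():
    """Hypothetical stand-in for test_document_extractor(): prints, returns None."""
    print("Document Extractor Test")
    print("=" * 50)


def capture_output(func):
    """Run func() and return (return_value, everything it printed to stdout)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func()
    return result, buffer.getvalue()


result, output = capture_output(noisy_check)
lines = output.splitlines()  # one entry per printed line
```

With the real function, capture_output(test_document_extractor) should work the same way, and the captured text can then be searched for the ✓ and ✗ markers.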

Dependencies

os
sys
pathlib
document_extractor

Required Imports

import os
import sys
from pathlib import Path
from document_extractor import DocumentExtractor

Usage Example

# Ensure DocumentExtractor is available and test files exist
# Run the test function
test_document_extractor()

# Example output (the extension list, character count, and preview are illustrative):
# Document Extractor Test
# ==================================================
# Supported extensions: ['.md', '.txt', '.docx', '.pdf', '.pptx']
# 
# Testing file: enhanced_meeting_minutes_2025-06-18.md
# ✓ Successfully extracted 1234 characters
# Preview: # Meeting Minutes...
# ----------------------------------------
# ...

Best Practices

Ensure the DocumentExtractor class is properly implemented before running this test
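Run the script from the directory that contains the test files, or update the file paths accordingly; the relative paths are resolved against the current working directory, and missing files are skipped silently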
This function is designed for manual testing and debugging; consider using pytest or unittest for automated testing
The function uses print statements for output; redirect stdout if you need to capture results programmatically
Add exception handling around the entire function call if using in production environments
Consider parameterizing the test_files list to make the test more flexible and reusable
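
The suggestions about parameterizing the file list and moving towards automated tests could be combined roughly as follows. This is a sketch, not part of the original module: check_extraction and FakeExtractor are hypothetical names, and the only thing assumed about a real extractor is the extract_text(path) method that the source code above already calls. Unlike the original, missing files are reported rather than silently skipped.

```python
import os
import tempfile


def check_extraction(extractor, file_paths):
    """Return {path: status}; status is 'ok', 'empty', 'missing' or 'error: ...'."""
    results = {}
    for path in file_paths:
        if not os.path.exists(path):
            results[path] = "missing"
            continue
        try:
            content = extractor.extract_text(path)
        except Exception as exc:
            results[path] = f"error: {exc}"
        else:
            results[path] = "ok" if content else "empty"
    return results


class FakeExtractor:
    """Hypothetical stand-in that only mimics extract_text()."""

    def extract_text(self, path):
        with open(path, encoding="utf-8") as handle:
            return handle.read()


# Demonstration with temporary files: one with content, one empty, one absent.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "notes.md")
    empty = os.path.join(tmp, "empty.md")
    absent = os.path.join(tmp, "absent.md")
    with open(present, "w", encoding="utf-8") as handle:
        handle.write("# Meeting Minutes\n")
    open(empty, "w", encoding="utf-8").close()
    results = check_extraction(FakeExtractor(), [present, empty, absent])
    statuses = [results[present], results[empty], results[absent]]
```

Returning a dictionary instead of printing lets a pytest or unittest case assert on each file's status; with the real class, the call would presumably be check_extraction(DocumentExtractor(), test_files).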

Similar Components

AI-powered semantic similarity - components with related functionality:

function test_multiple_files 81.5% similar

A test function that validates the extraction of text content from multiple document files using a DocumentExtractor instance, displaying extraction results and simulating combined content processing.
From: /tf/active/vicechatdev/leexi/test_multiple_files.py

function test_mixed_previous_reports 73.4% similar

A test function that validates the DocumentExtractor's ability to extract text content from multiple file formats (text and markdown) and combine them into a unified previous reports summary.
From: /tf/active/vicechatdev/leexi/test_enhanced_reports.py

class DocumentExtractor 73.0% similar

A document text extraction class that supports multiple file formats including Word, PowerPoint, PDF, and plain text files, with automatic format detection and conversion capabilities.
From: /tf/active/vicechatdev/leexi/document_extractor.py

function test_attendee_extraction 59.1% similar

A test function that validates the attendee extraction logic of the EnhancedMeetingMinutesGenerator by parsing a meeting transcript and displaying extracted metadata including speakers, date, and duration.
From: /tf/active/vicechatdev/leexi/test_attendee_extraction.py

function test_attendee_extraction_comprehensive 56.3% similar

A comprehensive test function that validates the attendee extraction logic from meeting transcripts, comparing actual speakers versus mentioned names, and demonstrating integration with meeting minutes generation.
From: /tf/active/vicechatdev/leexi/test_attendee_comprehensive.py
