šŸ” Code Extractor

function main

Maturity: 50

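Command-line entry point that orchestrates pattern-based extraction of poultry flock data: loading and date-filtering the base data, classifying farm patterns, enriching the selected flocks, and exporting them to CSV. Geocoding and map options are parsed and echoed, but the code shown does not act on them.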

File: /tf/active/vicechatdev/pattern_based_extraction.py
Lines: 505 - 622
Complexity: complex

Purpose

This is the main entry point for a pattern-based poultry data extraction tool. It parses command-line arguments, loads flock data filtered by start date, identifies mixed farms, classifies each farm's In-Ovo usage pattern, and then extracts the flocks for the requested pattern (sequential, concurrent, mixed, or all three in turn). Extracted flocks are enriched and exported to CSV. Of the geocoding-related options, only --geocoded-data is passed to the extractor; --skip-geocoding, --cache-only, --create-map, --map-output, and --use-clustering are parsed and echoed in the startup banner, but the code shown here does not use them further.

Source Code

def main():
    """Main function for pattern-based extraction."""
    parser = argparse.ArgumentParser(description='Pattern-Based Poultry Data Extraction')
    parser.add_argument('--pattern', type=str, required=True, 
                       choices=['sequential', 'concurrent', 'mixed', 'all'],
                       help='In-Ovo usage pattern to extract')
    parser.add_argument('--output', type=str, default=None,
                       help='Output CSV filename (default: auto-generated)')
    parser.add_argument('--sample-size', type=int, default=None,
                       help='Number of flocks to sample (default: extract all)')
    parser.add_argument('--geocoded-data', type=str, default=None,
                       help='Path to geocoded data file for coordinate enrichment')
    parser.add_argument('--data-dir', type=str, default='/tf/active/pehestat_data',
                       help='Directory containing Pehestat data files')
    parser.add_argument('--skip-geocoding', action='store_true',
                       help='Skip geocoding and map generation')
    parser.add_argument('--cache-only', action='store_true',
                       help='Use geocoding cache only (no API calls)')
    parser.add_argument('--create-map', action='store_true',
                       help='Create interactive map (requires geocoding)')
    parser.add_argument('--map-output', type=str, default=None,
                       help='Output map filename (default: auto-generated)')
    parser.add_argument('--use-clustering', action='store_true',
                       help='Enable marker clustering on the map')
    parser.add_argument('--start-date', type=str, default='2020-01-01',
                       help='Start date filter (YYYY-MM-DD, default: 2020-01-01)')
    
    args = parser.parse_args()
    
    print("=" * 80)
    print("PATTERN-BASED POULTRY DATA EXTRACTION")
    print("=" * 80)
    print(f"Target pattern: {args.pattern}")
    print(f"Start date filter: {args.start_date}")
    print(f"Sample size: {'All flocks' if args.sample_size is None else f'{args.sample_size:,} flocks'}")
    print(f"Data directory: {args.data_dir}")
    if args.geocoded_data:
        print(f"Geocoded data: {args.geocoded_data}")
    if not args.skip_geocoding:
        if args.cache_only:
            print("Geocoding: Cache-only mode (no API calls)")
        else:
            print("Geocoding: Full mode (includes API calls if needed)")
        if args.create_map:
            print("Map generation: Enabled")
    else:
        print("Geocoding: Disabled")
    print("=" * 80)
    
    try:
        # Initialize extractor
        extractor = PatternBasedExtractor(
            data_dir=args.data_dir,
            geocoded_file=args.geocoded_data
        )
        
        # Load and filter base data
        flocks_df = extractor.load_and_filter_base_data(start_date=args.start_date)
        
        # Identify mixed farms
        mixed_farms_df = extractor.identify_mixed_farms(flocks_df)
        
        if len(mixed_farms_df) == 0:
            print("No mixed farms found! Cannot proceed with pattern extraction.")
            return
        
        # Classify farm patterns
        patterns_df = extractor.classify_farm_patterns(flocks_df, mixed_farms_df)
        
        if len(patterns_df) == 0:
            print("No farm patterns could be classified! Cannot proceed.")
            return
        
        # Extract flocks by pattern
        if args.pattern == 'all':
            # Extract all patterns
            for pattern in ['sequential', 'concurrent', 'mixed']:
                pattern_flocks = extractor.extract_flocks_by_pattern(
                    pattern, flocks_df, patterns_df, args.sample_size
                )
                
                if len(pattern_flocks) > 0:
                    # Enrich data
                    enriched_flocks = extractor.enrich_flock_data(pattern_flocks)
                    
                    # Export results
                    output_file = args.output
                    if output_file and args.pattern == 'all':
                        # Modify filename for each pattern
                        base, ext = os.path.splitext(output_file)
                        output_file = f"{base}_{pattern}{ext}"
                    
                    extractor.export_results(enriched_flocks, pattern, output_file)
        else:
            # Extract specific pattern
            pattern_flocks = extractor.extract_flocks_by_pattern(
                args.pattern, flocks_df, patterns_df, args.sample_size
            )
            
            if len(pattern_flocks) == 0:
                print(f"No flocks found for pattern '{args.pattern}'!")
                return
            
            # Enrich data
            enriched_flocks = extractor.enrich_flock_data(pattern_flocks)
            
            # Export results
            extractor.export_results(enriched_flocks, args.pattern, args.output)
        
        print("\n✅ Pattern-based extraction completed successfully!")
        
    except Exception as e:
        print(f"\n❌ Error during pattern-based extraction: {e}")
        import traceback
        traceback.print_exc()
        return 1
    
    return 0
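
The help text for --create-map says it requires geocoding, yet main() as shown accepts --create-map together with --skip-geocoding without complaint. The following is a minimal sketch of a consistency check that could run right after parse_args(). The function name check_geocoding_flags and the warning texts are hypothetical and are not part of the original module.

```python
import argparse
from typing import List


def check_geocoding_flags(args: argparse.Namespace) -> List[str]:
    """Return human-readable problems with the geocoding-related flags.

    Hypothetical helper: it only inspects the parsed flags and performs no
    geocoding itself.
    """
    problems = []
    if args.skip_geocoding and args.create_map:
        problems.append("--create-map requires geocoding but --skip-geocoding was given")
    if args.skip_geocoding and args.cache_only:
        problems.append("--cache-only has no effect when --skip-geocoding is given")
    return problems


# Simulate parsed arguments without touching sys.argv.
ns = argparse.Namespace(skip_geocoding=True, create_map=True, cache_only=False)
for problem in check_geocoding_flags(ns):
    print(f"warning: {problem}")
```

Returning a list of messages rather than raising leaves the caller free to warn, or to abort via parser.error().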

Return Value

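Returns an integer exit code: 0 on successful completion and 1 if an exception is raised during execution. If no mixed farms, no classifiable patterns, or (for a single pattern) no matching flocks are found, the function returns early with an implicit None, which sys.exit() reports as exit status 0. With --pattern all, patterns that yield no flocks are skipped and the function still returns 0. The function primarily produces side effects (file exports, console output) rather than returning data.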
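
Because an early return yields None and sys.exit(None) exits with status 0, a shell script cannot distinguish "no data found" from success. A minimal sketch of one way a caller might make that distinction explicit follows. The helper exit_status and its none_status parameter are hypothetical, not part of the original module.

```python
from typing import Optional


def exit_status(result: Optional[int], none_status: int = 0) -> int:
    """Map the return value of an entry point to an explicit exit status.

    None (an early `return` with no value) becomes `none_status`; any other
    value is passed through as an int.
    """
    return none_status if result is None else int(result)


print(exit_status(0))                       # normal success
print(exit_status(1))                       # exception path
print(exit_status(None))                    # early return, default mapping
print(exit_status(None, none_status=2))     # early return, reported separately
```

A wrapper would then call sys.exit(exit_status(main(), none_status=2)), with 2 chosen here purely for illustration.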

Dependencies

  • argparse
  • os
  • sys
  • pandas
  • numpy
  • datetime
  • typing
  • traceback
  • matched_sample_analysis
  • extractor

Required Imports

import os
import sys
import pandas as pd
import numpy as np
import argparse
from datetime import datetime
from typing import Dict, List, Optional, Tuple
from matched_sample_analysis import MatchedSampleAnalyzer
from extractor import PehestatDataExtractor
import traceback

Conditional/Optional Imports

These imports are only needed under specific conditions:

import traceback

Condition: imported inside the except block of main(), so it is only needed when an error is being reported

Usage Example

# Run from the command line:
# Extract sequential pattern flocks from 2020 onwards
python pattern_based_extraction.py --pattern sequential --start-date 2020-01-01 --output sequential_flocks.csv

# Extract all patterns with sampling and geocoding
python pattern_based_extraction.py --pattern all --sample-size 1000 --geocoded-data geocoded.csv --create-map

# Extract concurrent pattern with cache-only geocoding from a custom data directory
python pattern_based_extraction.py --pattern concurrent --cache-only --data-dir /custom/path

# Script entry point: pass main()'s return value to the shell as the exit status
if __name__ == '__main__':
    sys.exit(main())
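
In the source, --start-date is declared with type=str, so a malformed date is only noticed later, if at all. A minimal sketch of validating it at parse time follows, using a reduced parser with only three of the original options. The helpers iso_date and build_parser are hypothetical names introduced for illustration.

```python
import argparse
from datetime import datetime


def iso_date(value: str) -> str:
    """Check that `value` parses as a year-month-day date and return it unchanged."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {value!r}")
    return value


def build_parser() -> argparse.ArgumentParser:
    """Build a reduced parser; option names and defaults mirror the source."""
    parser = argparse.ArgumentParser(description="Pattern-Based Poultry Data Extraction")
    parser.add_argument("--pattern", type=str, required=True,
                        choices=["sequential", "concurrent", "mixed", "all"])
    parser.add_argument("--sample-size", type=int, default=None)
    parser.add_argument("--start-date", type=iso_date, default="2020-01-01")
    return parser


# Parse an explicit argument list instead of sys.argv, which is handy in tests.
args = build_parser().parse_args(["--pattern", "mixed", "--sample-size", "500"])
print(args.pattern, args.sample_size, args.start_date)
```

Note that argparse converts dashes in option names to underscores, so --sample-size is read back as args.sample_size.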

Best Practices

  • --pattern is required; omitting it makes argparse exit with a usage error
  • Use --start-date to restrict the data to the relevant period, which also reduces the amount of data processed
  • When extracting all patterns with --output, pass a base filename; the pattern name is inserted before the extension (for example, flocks.csv becomes flocks_sequential.csv)
  • Use --cache-only to avoid API rate limits when geocoding the same data repeatedly
  • Set --sample-size for test runs or when working with large datasets to reduce processing time
  • Check the console output for "no data" messages before expecting output files
  • main() returns 0 on success and 1 on an exception, which suits shell scripts and CI/CD pipelines; early exits for missing data return None, which sys.exit() also reports as status 0
  • PatternBasedExtractor must provide the methods load_and_filter_base_data, identify_mixed_farms, classify_farm_patterns, extract_flocks_by_pattern, enrich_flock_data, and export_results
  • If no mixed farms or no classifiable patterns are found, the function prints a message and returns early without writing any files
  • Use --skip-geocoding when coordinates are not needed to speed up processing
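
The per-pattern filename rule used with --pattern all can be sketched in isolation. The logic mirrors the os.path.splitext call in the source; the function name pattern_output_name is hypothetical.

```python
import os
from typing import Optional


def pattern_output_name(output: Optional[str], pattern: str) -> Optional[str]:
    """Insert the pattern name before the file extension.

    Returns None when no output name was given, leaving the filename to the
    exporter's auto-generated default.
    """
    if not output:
        return None
    base, ext = os.path.splitext(output)
    return f"{base}_{pattern}{ext}"


for name in ("sequential", "concurrent", "mixed"):
    print(pattern_output_name("flocks.csv", name))
```

A name without an extension simply gets the suffix appended, so flocks becomes flocks_mixed.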

Similar Components

AI-powered semantic similarity - components with related functionality:

  • class PatternBasedExtractor 65.4% similar

    Extract flocks based on farm-level In-Ovo usage patterns.

    From: /tf/active/vicechatdev/pattern_based_extraction.py
  • function analyze_flock_type_patterns 59.4% similar

    Analyzes and prints timing pattern statistics for flock data by categorizing issues that occur before start time and after end time, grouped by flock type.

    From: /tf/active/vicechatdev/data_quality_dashboard.py
  • function show_problematic_flocks 57.8% similar

    Analyzes and displays problematic flocks from a dataset by identifying those with systematic timing issues in their treatment records, categorizing them by severity and volume.

    From: /tf/active/vicechatdev/data_quality_dashboard.py
  • function create_data_quality_dashboard 55.5% similar

    Creates an interactive command-line dashboard for analyzing data quality issues in treatment timing data, specifically focusing on treatments administered outside of flock lifecycle dates.

    From: /tf/active/vicechatdev/data_quality_dashboard.py
  • function select_dataset 55.2% similar

    Interactive command-line function that prompts users to select between original, cleaned, or comparison of flock datasets for analysis.

    From: /tf/active/vicechatdev/data_quality_dashboard.py