🔍 Code Extractor

function quick_clean

Maturity: 36

Cleans flock data by identifying and removing flocks that have treatment records with timing inconsistencies (treatments administered outside the flock's start/end date range).

File: /tf/active/vicechatdev/quick_cleaner.py
Lines: 9-72
Complexity: moderate

Purpose

This function performs data quality cleaning on poultry flock datasets by cross-referencing treatment administration dates with flock lifecycle dates. It identifies flocks where treatments were recorded before the flock started or after it ended, removes these problematic flocks from the dataset, and saves a cleaned version. This is useful for ensuring data integrity in livestock management systems before performing analysis or reporting.
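
The core check can be illustrated on a toy dataset before reading the full listing below; the flock codes and dates in this sketch are made up purely to show which rows the out-of-range test flags.

import pandas as pd

# Hypothetical flocks and treatments (not from the real CSV files)
flocks = pd.DataFrame({
    'FlockCD': ['A', 'B'],
    'StartDate': pd.to_datetime(['2023-01-01', '2023-02-01']),
    'EndDate': pd.to_datetime(['2023-03-01', '2023-04-01']),
})
treatments = pd.DataFrame({
    'FlockCD': ['A', 'A', 'B'],
    'AdministeredDate': pd.to_datetime(['2023-02-15', '2023-04-01', '2022-12-31']),
})

merged = treatments.merge(flocks, on='FlockCD', how='inner')
out_of_range = (
    (merged['AdministeredDate'] < merged['StartDate']) |
    (merged['AdministeredDate'] > merged['EndDate'])
)

# Flock A has one treatment after its EndDate and flock B has one before its
# StartDate, so quick_clean would drop both flocks from the cleaned output.
print(merged.loc[out_of_range, ['FlockCD', 'AdministeredDate']])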

Source Code

def quick_clean():
    print("=== QUICK FLOCK DATA CLEANER ===")
    
    # Load data
    data_dir = "/tf/active/pehestat_data"
    flocks_path = os.path.join(data_dir, "dbo_Flocks.csv")
    treatments_path = os.path.join(data_dir, "dbo_Treatments.csv")
    
    print("Loading flocks data...")
    flocks_df = pd.read_csv(flocks_path)
    print(f"Loaded {len(flocks_df):,} flocks")
    
    print("Loading treatments data...")
    treatments_df = pd.read_csv(treatments_path)
    print(f"Loaded {len(treatments_df):,} treatments")
    
    # Convert dates
    print("Converting dates...")
    flocks_df['StartDate'] = pd.to_datetime(flocks_df['StartDate'], errors='coerce')
    flocks_df['EndDate'] = pd.to_datetime(flocks_df['EndDate'], errors='coerce')
    treatments_df['AdministeredDate'] = pd.to_datetime(treatments_df['AdministeredDate'], errors='coerce')
    
    # Merge treatments with flock data
    print("Merging data...")
    merged_df = pd.merge(
        treatments_df[['FlockCD', 'AdministeredDate']],
        flocks_df[['FlockCD', 'StartDate', 'EndDate']],
        on='FlockCD',
        how='inner'
    )
    print(f"Merged {len(merged_df):,} treatment records")
    
    # Find timing issues
    print("Finding timing issues...")
    timing_issues = (
        (merged_df['AdministeredDate'] < merged_df['StartDate']) |
        (merged_df['AdministeredDate'] > merged_df['EndDate'])
    )
    
    problematic_treatments = merged_df[timing_issues]
    problematic_flocks = set(problematic_treatments['FlockCD'].unique())
    
    print(f"Found {len(problematic_flocks):,} flocks with timing issues")
    print(f"Found {len(problematic_treatments):,} problematic treatments")
    
    # Create cleaned dataset
    print("Creating cleaned dataset...")
    original_count = len(flocks_df)
    cleaned_flocks = flocks_df[~flocks_df['FlockCD'].isin(problematic_flocks)].copy()
    cleaned_count = len(cleaned_flocks)
    removed_count = original_count - cleaned_count
    removal_pct = (removed_count / original_count) * 100
    
    # Save cleaned dataset
    output_path = os.path.join(data_dir, "dbo_Flocks_clean.csv")
    cleaned_flocks.to_csv(output_path, index=False)
    
    print("\n=== RESULTS ===")
    print(f"Original flocks: {original_count:,}")
    print(f"Cleaned flocks: {cleaned_count:,}")
    print(f"Removed flocks: {removed_count:,} ({removal_pct:.2f}%)")
    print(f"Output saved to: {output_path}")
    
    return cleaned_count, removed_count, removal_pct

Return Value

Returns a tuple of three values: (cleaned_count, removed_count, removal_pct). cleaned_count is the integer number of flocks remaining after cleaning, removed_count is the integer number of flocks removed due to timing issues, and removal_pct is the float percentage of flocks removed from the original dataset.
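
The returned percentage can serve as a simple safety gate before downstream processing; the threshold below is a hypothetical value, not something the function enforces.

cleaned_count, removed_count, removal_pct = quick_clean()

# Hypothetical guard: abort if an unexpectedly large share of flocks was removed
if removal_pct > 10.0:
    raise RuntimeError(
        f"Removed {removal_pct:.2f}% of flocks; review the source data before continuing"
    )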

Dependencies

  • pandas
  • os

Required Imports

import pandas as pd
import os

Usage Example

# Ensure data files exist at /tf/active/pehestat_data/
# dbo_Flocks.csv should have: FlockCD, StartDate, EndDate columns
# dbo_Treatments.csv should have: FlockCD, AdministeredDate columns

# Import the function from its module (adjust the import to your project layout)
from quick_cleaner import quick_clean

# Run the cleaning function
cleaned_count, removed_count, removal_pct = quick_clean()

print(f"Cleaning complete: {cleaned_count} flocks retained, {removed_count} removed ({removal_pct:.2f}%)")

# The cleaned data is saved to /tf/active/pehestat_data/dbo_Flocks_clean.csv
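
A follow-up check on the saved file can confirm that no retained flock still has out-of-range treatments. This is a hypothetical verification snippet that assumes the same directory, file names, and column names used by quick_clean.

import os
import pandas as pd

data_dir = "/tf/active/pehestat_data"
clean_flocks = pd.read_csv(os.path.join(data_dir, "dbo_Flocks_clean.csv"))
treatments = pd.read_csv(os.path.join(data_dir, "dbo_Treatments.csv"))

# Re-parse dates the same way quick_clean does
for df, cols in ((clean_flocks, ['StartDate', 'EndDate']), (treatments, ['AdministeredDate'])):
    for col in cols:
        df[col] = pd.to_datetime(df[col], errors='coerce')

check = treatments.merge(clean_flocks[['FlockCD', 'StartDate', 'EndDate']], on='FlockCD', how='inner')
still_bad = (
    (check['AdministeredDate'] < check['StartDate']) |
    (check['AdministeredDate'] > check['EndDate'])
)
print(f"Out-of-range treatments remaining: {still_bad.sum():,}")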

Best Practices

  • Ensure the data directory path '/tf/active/pehestat_data' exists and contains the required CSV files before calling this function
  • Verify that CSV files have the expected column names (FlockCD, StartDate, EndDate, AdministeredDate) to avoid KeyError exceptions
  • The function uses hardcoded file paths, so it's not portable without modifying the source code
  • The function prints progress information to stdout, which may not be suitable for production environments without logging configuration
  • Consider backing up original data files before running, as the function creates a new cleaned file but doesn't modify originals
  • The function uses errors='coerce' for date parsing, which converts invalid dates to NaT (Not a Time); comparisons against NaT evaluate to False, so treatments with unparseable dates are never flagged and pass through the cleaning silently (see the pre-flight check after this list)
  • Memory usage scales with dataset size since entire DataFrames are loaded into memory
  • The inner join on FlockCD means treatments without a matching flock are excluded from timing validation entirely (the pre-flight check below also counts these orphan records)
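
The two caveats above (NaT dates slipping through the range check, and orphan treatments dropped by the inner join) can be surfaced before running the cleaner. The snippet below is a pre-flight check sketch that assumes the same file layout and column names as quick_clean.

import os
import pandas as pd

data_dir = "/tf/active/pehestat_data"
flocks = pd.read_csv(os.path.join(data_dir, "dbo_Flocks.csv"))
treatments = pd.read_csv(os.path.join(data_dir, "dbo_Treatments.csv"))

administered = pd.to_datetime(treatments['AdministeredDate'], errors='coerce')

# Treatments whose dates could not be parsed become NaT; the timing comparison
# evaluates to False for NaT, so these rows are never flagged by quick_clean.
unparseable = administered.isna() & treatments['AdministeredDate'].notna()
print(f"Treatments with unparseable dates: {unparseable.sum():,}")

# Treatments referencing a FlockCD absent from the flocks file are dropped by
# the inner join, so they are never validated at all.
orphans = ~treatments['FlockCD'].isin(flocks['FlockCD'])
print(f"Treatments without a matching flock: {orphans.sum():,}")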

Similar Components

Components with related functionality, ranked by AI-powered semantic similarity:

  • function show_problematic_flocks 75.0% similar

    Analyzes and displays problematic flocks from a dataset by identifying those with systematic timing issues in their treatment records, categorizing them by severity and volume.

    From: /tf/active/vicechatdev/data_quality_dashboard.py
  • function create_data_quality_dashboard 73.7% similar

    Creates an interactive command-line dashboard for analyzing data quality issues in treatment timing data, specifically focusing on treatments administered outside of flock lifecycle dates.

    From: /tf/active/vicechatdev/data_quality_dashboard.py
  • function show_critical_errors 72.8% similar

    Displays critical data quality errors in treatment records, focusing on date anomalies including 1900 dates, extreme future dates, and extreme past dates relative to flock lifecycles.

    From: /tf/active/vicechatdev/data_quality_dashboard.py
  • function create_data_quality_dashboard_v1 72.5% similar

    Creates an interactive data quality dashboard for analyzing treatment timing issues in poultry flock management data by loading and processing CSV files containing timing anomalies.

    From: /tf/active/vicechatdev/data_quality_dashboard.py
  • function compare_datasets 70.2% similar

    Analyzes and compares two pandas DataFrames containing flock data (original vs cleaned), printing detailed statistics about removed records, type distributions, and impact assessment.

    From: /tf/active/vicechatdev/data_quality_dashboard.py