function quick_clean
Cleans flock data by identifying and removing flocks that have treatment records with timing inconsistencies (treatments administered outside the flock's start/end date range).
File: /tf/active/vicechatdev/quick_cleaner.py
Lines: 9-72
Complexity: moderate
Purpose
This function performs data quality cleaning on poultry flock datasets by cross-referencing treatment administration dates with flock lifecycle dates. It identifies flocks where treatments were recorded before the flock started or after it ended, removes these problematic flocks from the dataset, and saves a cleaned version. This is useful for ensuring data integrity in livestock management systems before performing analysis or reporting.
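The core cross-referencing logic can be illustrated on a toy dataset (the column names match the real files; the flock codes and dates below are invented for demonstration):

```python
import pandas as pd

# Two flocks with their lifecycle windows (invented values)
flocks = pd.DataFrame({
    'FlockCD': ['F1', 'F2'],
    'StartDate': pd.to_datetime(['2023-01-01', '2023-02-01']),
    'EndDate': pd.to_datetime(['2023-03-01', '2023-04-01']),
})
# One treatment per flock; F2's treatment falls after its EndDate
treatments = pd.DataFrame({
    'FlockCD': ['F1', 'F2'],
    'AdministeredDate': pd.to_datetime(['2023-02-15', '2023-05-01']),
})

# Same join and boolean mask as the function below
merged = treatments.merge(flocks, on='FlockCD', how='inner')
timing_issues = (
    (merged['AdministeredDate'] < merged['StartDate']) |
    (merged['AdministeredDate'] > merged['EndDate'])
)
bad_flocks = set(merged.loc[timing_issues, 'FlockCD'])
print(bad_flocks)  # {'F2'}
```

F1's treatment lands inside its start/end window and passes; F2's treatment postdates its EndDate, so the whole flock F2 would be dropped from the cleaned output.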
Source Code
def quick_clean():
    print("=== QUICK FLOCK DATA CLEANER ===")

    # Load data
    data_dir = "/tf/active/pehestat_data"
    flocks_path = os.path.join(data_dir, "dbo_Flocks.csv")
    treatments_path = os.path.join(data_dir, "dbo_Treatments.csv")

    print("Loading flocks data...")
    flocks_df = pd.read_csv(flocks_path)
    print(f"Loaded {len(flocks_df):,} flocks")

    print("Loading treatments data...")
    treatments_df = pd.read_csv(treatments_path)
    print(f"Loaded {len(treatments_df):,} treatments")

    # Convert dates
    print("Converting dates...")
    flocks_df['StartDate'] = pd.to_datetime(flocks_df['StartDate'], errors='coerce')
    flocks_df['EndDate'] = pd.to_datetime(flocks_df['EndDate'], errors='coerce')
    treatments_df['AdministeredDate'] = pd.to_datetime(treatments_df['AdministeredDate'], errors='coerce')

    # Merge treatments with flock data
    print("Merging data...")
    merged_df = pd.merge(
        treatments_df[['FlockCD', 'AdministeredDate']],
        flocks_df[['FlockCD', 'StartDate', 'EndDate']],
        on='FlockCD',
        how='inner'
    )
    print(f"Merged {len(merged_df):,} treatment records")

    # Find timing issues
    print("Finding timing issues...")
    timing_issues = (
        (merged_df['AdministeredDate'] < merged_df['StartDate']) |
        (merged_df['AdministeredDate'] > merged_df['EndDate'])
    )
    problematic_treatments = merged_df[timing_issues]
    problematic_flocks = set(problematic_treatments['FlockCD'].unique())
    print(f"Found {len(problematic_flocks):,} flocks with timing issues")
    print(f"Found {len(problematic_treatments):,} problematic treatments")

    # Create cleaned dataset
    print("Creating cleaned dataset...")
    original_count = len(flocks_df)
    cleaned_flocks = flocks_df[~flocks_df['FlockCD'].isin(problematic_flocks)].copy()
    cleaned_count = len(cleaned_flocks)
    removed_count = original_count - cleaned_count
    removal_pct = (removed_count / original_count) * 100

    # Save cleaned dataset
    output_path = os.path.join(data_dir, "dbo_Flocks_clean.csv")
    cleaned_flocks.to_csv(output_path, index=False)

    print("\n=== RESULTS ===")
    print(f"Original flocks: {original_count:,}")
    print(f"Cleaned flocks: {cleaned_count:,}")
    print(f"Removed flocks: {removed_count:,} ({removal_pct:.2f}%)")
    print(f"Output saved to: {output_path}")

    return cleaned_count, removed_count, removal_pct
Return Value
Returns a tuple of three values: (cleaned_count, removed_count, removal_pct). cleaned_count is the integer number of flocks remaining after cleaning, removed_count is the integer number of flocks removed due to timing issues, and removal_pct is the float percentage of flocks removed from the original dataset.
Dependencies
pandas
os
Required Imports
import pandas as pd
import os
Usage Example
# Ensure data files exist at /tf/active/pehestat_data/
# dbo_Flocks.csv should have: FlockCD, StartDate, EndDate columns
# dbo_Treatments.csv should have: FlockCD, AdministeredDate columns
import pandas as pd
import os
# Run the cleaning function
cleaned_count, removed_count, removal_pct = quick_clean()
print(f"Cleaning complete: {cleaned_count} flocks retained, {removed_count} removed ({removal_pct:.2f}%)")
# The cleaned data is saved to /tf/active/pehestat_data/dbo_Flocks_clean.csv
Best Practices
- Ensure the data directory path '/tf/active/pehestat_data' exists and contains the required CSV files before calling this function
- Verify that CSV files have the expected column names (FlockCD, StartDate, EndDate, AdministeredDate) to avoid KeyError exceptions
- The function uses hardcoded file paths, so it's not portable without modifying the source code
- The function prints progress information to stdout, which may not be suitable for production environments without logging configuration
- Consider backing up original data files before running, as the function creates a new cleaned file but doesn't modify originals
- The function uses errors='coerce' for date parsing, which converts invalid dates to NaT (Not a Time); comparisons involving NaT evaluate to False, so rows with unparseable dates silently pass the timing check instead of being flagged
- Memory usage scales with dataset size since entire DataFrames are loaded into memory
- The inner join on FlockCD means treatments without matching flocks are excluded from timing validation
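The NaT caveat above is easy to verify directly: `errors='coerce'` turns unparseable strings into NaT, and any comparison against NaT returns False, so such rows never appear in the timing-issue mask. A minimal demonstration (invented sample values):

```python
import pandas as pd

# 'not a date' cannot be parsed, so errors='coerce' yields NaT for it
dates = pd.to_datetime(pd.Series(['2023-01-15', 'not a date']), errors='coerce')
print(dates.isna().tolist())  # [False, True]

# Comparisons with NaT evaluate to False, so an AdministeredDate that
# failed to parse can never satisfy the "before StartDate" (or "after
# EndDate") condition and is silently treated as valid.
start = pd.Timestamp('2023-02-01')
print((dates < start).tolist())  # [True, False]
```

If unparseable treatment dates should also disqualify a flock, the mask would need an explicit `merged_df['AdministeredDate'].isna()` term added to it.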
Similar Components
AI-powered semantic similarity - components with related functionality:
- function show_problematic_flocks (75.0% similar)
- function create_data_quality_dashboard (73.7% similar)
- function show_critical_errors (72.8% similar)
- function create_data_quality_dashboard_v1 (72.5% similar)
- function compare_datasets (70.2% similar)