Data Cleaner

Production-ready Python tool for messy business data

Data Engineering, Production Ready, Python, Pandas, JSON Config

What This Project Does

Data Cleaner is a production-ready Python tool designed to handle the messy reality of business data. It merges Excel and CSV files, removes basic duplicates, standardizes dates and phone numbers, and outputs professionally formatted files. In testing it processed more than 1,000,000 rows in 213 seconds, about 4,684 rows per second.

This tool turns hours of manual cleanup into an automated run that finishes in seconds to a few minutes, depending on file size. It uses pandas for efficient data processing and handles the most common data cleaning tasks reliably.
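
As a rough illustration of the merge step, the sketch below loads each file with the pandas reader that matches its extension and stacks the results. The names load_table and merge_tables are hypothetical and are not taken from the repository. Reading .xlsx files with pandas also assumes an Excel engine such as openpyxl is installed.

```python
from pathlib import Path

import pandas as pd


def load_table(path):
    """Load one CSV or Excel file into a DataFrame (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".xlsx":
        # pandas needs an Excel engine such as openpyxl for .xlsx files
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file type: {suffix}")


def merge_tables(frames):
    """Stack DataFrames row-wise; columns missing from a frame become NaN."""
    return pd.concat(frames, ignore_index=True, sort=False)


# In-memory stand-ins for two loaded files with partly different columns
first = pd.DataFrame({"customer": ["Acme"], "phone": ["555-0100"]})
second = pd.DataFrame({"customer": ["Globex"], "email": ["info@globex.example"]})
print(merge_tables([first, second]))
```

Stacking with concat keeps every column that appears in any input, which suits files exported from different systems. Whether the actual tool fills or drops the resulting gaps is not stated here.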

See It in Action

Watch Data Cleaner merge Excel and CSV files, remove duplicates, and standardize dates and phone numbers on a sample dataset.

Key Features

Basic Duplicate Removal: Removes exact duplicates based on specified column combinations
Multi-Format Support: Handles Excel (.xlsx), CSV, and mixed file types
Date Standardization: Converts messy date formats into a consistent ISO format
Phone Number Cleaning: Standardizes phone numbers across different formats
Simple Column Mapping: Maps predefined column name variations (customer/client/company)
Batch Processing: Processes multiple files in one run with progress tracking
Error Handling: Fails gracefully with detailed error reporting and recovery suggestions
Configuration-Driven: Uses a JSON config file for easy customization without code changes
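
To make the date and phone features more concrete, here is a minimal standard-library sketch. It is not the tool's own code: the list of accepted date formats, the month-first reading of slash dates, and the US-style XXX-XXX-XXXX output are all assumptions chosen for illustration.

```python
import re
from datetime import datetime

# Assumed input formats, tried in order; the real tool's list may differ.
# "%m/%d/%Y" reads slash dates month-first, which is itself an assumption.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y", "%B %d, %Y")


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_phone(text):
    """Normalize a US-style number to XXX-XXX-XXXX, or None if it does not fit."""
    digits = re.sub(r"\D", "", text)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


print(to_iso_date("03/15/2024"))
print(clean_phone("(555) 123-4567"))
```

Returning None for values that do not fit keeps bad input visible instead of silently guessing, which makes it easier to report rows that need manual review.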

Rows Tested: 1,000,000+ (large-scale validation)
Processing Time: 213 seconds (for 1M+ rows)
Speed: 4,684 rows per second
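
For a rough sense of what these figures mean for smaller files, the sketch below divides a row count by the reported throughput. It assumes throughput stays constant regardless of file size, which real workloads may not match, and the function name estimated_seconds is made up for this example.

```python
ROWS_PER_SECOND = 4684  # throughput figure reported above


def estimated_seconds(rows, rows_per_second=ROWS_PER_SECOND):
    """Estimate run time, assuming throughput stays constant as files grow."""
    return rows / rows_per_second


for rows in (100_000, 500_000, 1_000_000):
    print(f"{rows:>9,} rows: about {estimated_seconds(rows):.0f} s")
```

Under that assumption, 100,000 rows would take about 21 seconds and 1,000,000 rows about 213 seconds, in line with the measured run.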

Quality Assurance

5 comprehensive automated test scenarios
Error handling for edge cases
Configuration validation
Complete documentation
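
Configuration validation could look something like the sketch below. The key names (dedupe_columns, date_columns, column_aliases) are hypothetical and do not come from the project's actual config file. The point is only to show the general idea of checking a JSON config before any data is touched.

```python
import json

# Hypothetical schema: key name -> required Python type after JSON parsing.
REQUIRED_KEYS = {
    "dedupe_columns": list,
    "date_columns": list,
    "column_aliases": dict,
}


def validate_config(text):
    """Parse JSON config text and return a list of problems (empty if valid)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"Invalid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(config, dict):
        return ["Top-level value must be a JSON object"]
    problems = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"Missing key: {key}")
        elif not isinstance(config[key], expected):
            problems.append(f"{key} must be a {expected.__name__}")
    return problems


sample = '{"dedupe_columns": ["customer", "email"], "date_columns": "order_date"}'
print(validate_config(sample))
```

Collecting every problem in one pass, instead of stopping at the first, lets a user fix the whole config in a single edit.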

Technical Note: This tool performs basic duplicate removal using pandas' drop_duplicates() method based on specified columns, and simple column name mapping for predefined variations (customer/client/company). For advanced fuzzy matching or complex data relationships, I can customize the solution for your specific needs.
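
A minimal sketch of the two operations named in the note, assuming a small alias table and a customer/email key. Only drop_duplicates() and the customer/client/company variations come from the description above. The helper names, the alias dictionary, and the sample data are illustrative.

```python
import pandas as pd

# Assumed alias table: variant column name -> canonical name.
COLUMN_ALIASES = {"client": "customer", "company": "customer"}


def canonical_name(name):
    """Trim and lower-case a header, then map known variants to one name."""
    key = str(name).strip().lower()
    return COLUMN_ALIASES.get(key, key)


def normalize_columns(df):
    """Return a copy of df with every column header canonicalized."""
    return df.rename(columns=canonical_name)


def remove_duplicates(df, key_columns):
    """Keep the first row for each exact combination of key column values."""
    deduped = df.drop_duplicates(subset=key_columns, keep="first")
    return deduped.reset_index(drop=True)


raw = pd.DataFrame({
    "Client": ["Acme", "Acme", "Globex"],
    "email": ["a@acme.example", "a@acme.example", "g@globex.example"],
    "amount": [100, 250, 75],
})
clean = remove_duplicates(normalize_columns(raw), ["customer", "email"])
print(clean)
```

Because the match is exact, "Acme" and "acme " would count as different customers here. That is the gap fuzzy matching would address.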

Real-World Applications

Sales Data Consolidation: Merge quarterly sales reports from different regions
Customer Database Cleanup: Remove duplicates and standardize customer information
Inventory Management: Combine product data from multiple suppliers
Financial Reporting: Consolidate transaction data from various payment systems
Marketing Analytics: Merge campaign data from different platforms
HR Data Processing: Standardize employee records from multiple systems

Want to Try It?

This tool runs locally on your computer. Your data never leaves your system and is never uploaded anywhere, so you keep complete control and privacy. All you need to do is clone the repo and run it locally.

Why try it locally:

Test with your actual data files
See exactly how it handles your specific format
No data privacy concerns
Full control over the cleaning process

Two Ways to Use This Tool:

If you're a developer:

Clone the repository from GitHub
Install dependencies: pip install -r requirements.txt
Run: python main.py
Place your files in data_cleaner/data/input and watch it work

If you're not a developer:

Don't worry about the technical setup; I can handle everything for you. I'll customize this tool for your specific needs, set it up to work with your data, and deliver a ready-to-use solution with clear instructions.

Let Me Handle It For You

How We'll Work Together

Discovery — You walk me through your data challenge and what you are cleaning manually. I confirm whether this tool fits and give you an honest timeline and cost estimate. No commitment needed.
Scope — I send you a written scope: what gets built, what gets tested, when it ships, and what it costs. You approve it or we adjust until it fits.
Build — I configure the tool for your specific file formats and edge cases. You get progress updates every 2 to 3 days. Nothing ships blind.
Handoff — You test with real data. I fix anything that does not work. Final payment on delivery. Seven days of support included.
You run it independently — The tool is yours. You have the script, the documentation, and the logic. No dependency on me to keep it running.
