Data Cleaner

Production-ready Python tool for messy business data


What This Project Does

Data Cleaner is a production-ready Python tool designed to handle the messy reality of business data. It merges Excel and CSV files, removes exact duplicates, standardizes dates and phone numbers, and outputs professionally formatted files. In testing it processed more than 1,000,000 rows in 213 seconds, so hundreds of thousands of rows take about a minute.

This tool turns hours of manual cleanup into an automated run that finishes in seconds to minutes. It uses pandas for efficient data processing and reliably handles the most common data cleaning tasks.

See It in Action

Watch how Data Cleaner merges Excel and CSV files, removes duplicates, and standardizes dates and phone numbers, demonstrated on a sample dataset.

Key Features

  • Basic Duplicate Removal
    Removes exact duplicates based on specified column combinations
  • Multi-Format Support
    Handles Excel (.xlsx), CSV, and mixed file types seamlessly
  • Date Standardization
    Converts messy date formats into a consistent ISO 8601 format
  • Phone Number Cleaning
    Standardizes phone numbers across different formats
  • Simple Column Mapping
    Maps predefined column name variations (customer/client/company)
  • Batch Processing
    Processes multiple files simultaneously with progress tracking
  • Error Handling
    Graceful failure with detailed error reporting and recovery suggestions
  • Configuration-Driven
    JSON config file for easy customization without code changes
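
To make the date and phone features above more concrete, here is a minimal sketch of how such standardization could work. It is not the code from the repository: the function names, the list of accepted date formats, and the North American phone format are all assumptions chosen for illustration, and it uses only the Python standard library rather than pandas.

```python
import re
from datetime import datetime

# Hypothetical list of accepted input formats; the real tool may support others.
# Ambiguous values such as 03/04/2024 are read month-first under this ordering.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y", "%B %d, %Y")


def to_iso_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_phone(raw):
    """Strip non-digits and format 10-digit numbers as XXX-XXX-XXXX.

    Assumes North American numbers; anything else comes back as bare digits.
    """
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    if len(digits) == 10:
        return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    return digits


print(to_iso_date("03/15/2024"))      # 2024-03-15
print(clean_phone("(555) 123-4567"))  # 555-123-4567
```

Returning None for unparseable dates, instead of guessing, keeps bad values visible so they can be reported rather than silently changed.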

Performance

  • Rows Tested: 1,000,000+ (large-scale validation)
  • Processing Time: 213 seconds for 1M+ rows
  • Speed: 4,684 rows per second

Quality Assurance

  • 5 comprehensive automated test scenarios
  • Error handling for edge cases
  • Configuration validation
  • Complete documentation

Technical Note: This tool performs basic duplicate removal using pandas' drop_duplicates() method based on specified columns, and simple column name mapping for predefined variations (customer/client/company). For advanced fuzzy matching or complex data relationships, I can customize the solution for your specific needs.
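
As a rough illustration of the two techniques named in this note, the sketch below maps column name variations to one canonical name and then calls pandas' drop_duplicates() on a chosen set of columns. The alias dictionary and both function names are hypothetical and are not taken from the repository.

```python
import pandas as pd

# Hypothetical alias table: each variation maps to one canonical column name.
COLUMN_ALIASES = {"client": "customer", "company": "customer"}


def normalize_columns(df):
    """Trim and lower-case headers, then apply the alias table."""
    renamed = {}
    for col in df.columns:
        key = str(col).strip().lower()
        renamed[col] = COLUMN_ALIASES.get(key, key)
    return df.rename(columns=renamed)


def remove_duplicates(df, key_columns):
    """Drop rows that exactly repeat an earlier row on every key column."""
    deduped = df.drop_duplicates(subset=key_columns, keep="first")
    return deduped.reset_index(drop=True)


raw = pd.DataFrame(
    {
        "Client": ["Acme", "Acme", "Bolt"],
        "Email": ["a@x.com", "a@x.com", "b@x.com"],
    }
)
clean = remove_duplicates(normalize_columns(raw), ["customer", "email"])
print(clean)
```

Because the comparison is exact, "Acme" and "ACME Inc." would count as different customers. That is the gap fuzzy matching would fill.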

Real-World Applications

Sales Data Consolidation
Merge quarterly sales reports from different regions
Customer Database Cleanup
Remove duplicates and standardize customer information
Inventory Management
Combine product data from multiple suppliers
Financial Reporting
Consolidate transaction data from various payment systems
Marketing Analytics
Merge campaign data from different platforms
HR Data Processing
Standardize employee records from multiple systems

Want to Try It?

This tool runs locally on your computer. Your data never leaves your system and is never uploaded anywhere, so you keep complete control and privacy. You just need to clone the repository and run it locally.

Why try it locally:
  • Test with your actual data files
  • See exactly how it handles your specific format
  • No data privacy concerns
  • Full control over the cleaning process

Two Ways to Use This Tool:

If you're a developer:

  1. Clone the repository from GitHub
  2. Install dependencies: pip install -r requirements.txt
  3. Run: python main.py
  4. Place your files in data_cleaner/data/input and watch it work
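
For developers curious about what merging a folder of mixed files might look like, here is a minimal pandas sketch. The function name load_and_merge is hypothetical, and the actual logic in main.py may differ. It assumes every file has a header row and that reading .xlsx files has an Excel engine such as openpyxl available.

```python
from pathlib import Path

import pandas as pd


def load_and_merge(input_dir):
    """Read every .csv and .xlsx file in input_dir into one DataFrame."""
    frames = []
    for path in sorted(Path(input_dir).iterdir()):
        suffix = path.suffix.lower()
        if suffix == ".csv":
            frames.append(pd.read_csv(path))
        elif suffix == ".xlsx":
            # read_excel needs an Excel engine (for example openpyxl) installed.
            frames.append(pd.read_excel(path))
    if not frames:
        raise FileNotFoundError(f"No .csv or .xlsx files found in {input_dir}")
    # Columns that a file lacks are filled with NaN in that file's rows.
    return pd.concat(frames, ignore_index=True)
```

Calling load_and_merge("data_cleaner/data/input") would return a single table ready for the cleaning steps. Files with other extensions are skipped.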

If you're not a developer:

Don't worry about the technical setup; I can handle everything for you. I'll customize this tool for your specific needs, set it up to work with your data, and deliver a ready-to-use solution with clear instructions.

Let Me Handle It For You

How We'll Work Together

  1. Discovery — You walk me through your data challenge and what you are cleaning manually. I confirm whether this tool fits and give you an honest timeline and cost estimate. No commitment needed.
  2. Scope — I send you a written scope: what gets built, what gets tested, when it ships, and what it costs. You approve it or we adjust until it fits.
  3. Build — I configure the tool for your specific file formats and edge cases. You get progress updates every 2 to 3 days. Nothing ships blind.
  4. Handoff — You test with real data. I fix anything that does not work. Final payment on delivery. Seven days of support included.
  5. You run it independently — The tool is yours. You have the script, the documentation, and the logic. No dependency on me to keep it running.