Data Cleaner
Production-ready Python tool for messy business data
What This Project Does
Data Cleaner is a Python tool built for the messy reality of business data. It merges Excel and CSV files, removes exact duplicates, standardizes dates and phone numbers, and writes professionally formatted output files.
Built on pandas, it turns hours of manual cleanup into a run that takes seconds, even on files with hundreds of thousands of rows, and it handles the most common data cleaning tasks reliably.
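As a rough illustration of the merge step, the sketch below picks a pandas reader by file extension and stacks the results. The function names `load_table` and `merge_files` are hypothetical and not taken from the project, and reading .xlsx files assumes the openpyxl package is installed.

```python
from pathlib import Path

import pandas as pd


def load_table(path):
    """Read one input file into a DataFrame, choosing the reader by extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        # dtype=str keeps phone numbers and IDs from being parsed as numbers
        return pd.read_csv(path, dtype=str)
    if suffix == ".xlsx":
        # pandas delegates .xlsx parsing to openpyxl (assumed installed)
        return pd.read_excel(path, dtype=str)
    raise ValueError(f"Unsupported file type: {path.name}")


def merge_files(paths):
    """Stack all input files into one DataFrame with a fresh index."""
    frames = [load_table(p) for p in paths]
    return pd.concat(frames, ignore_index=True)
```

Reading every column as text is a deliberate choice in this sketch: it avoids losing leading zeros before the cleaning steps run.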
See It in Action
Watch how Data Cleaner merges Excel and CSV files, removes duplicates, and standardizes dates and phone numbers, all demonstrated on a sample dataset.
Key Features
- Basic Duplicate Removal: Removes exact duplicates based on specified column combinations
- Multi-Format Support: Handles Excel (.xlsx), CSV, and mixed file types seamlessly
- Date Standardization: Converts messy date formats into consistent ISO standards
- Phone Number Cleaning: Standardizes phone numbers across different formats
- Simple Column Mapping: Maps predefined column name variations (customer/client/company)
- Batch Processing: Processes multiple files simultaneously with progress tracking
- Error Handling: Fails gracefully with detailed error reporting and recovery suggestions
- Configuration-Driven: Uses a JSON config file for easy customization without code changes
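To make the date and phone features concrete, here is a minimal sketch using only the standard library. The list of accepted date formats, the US-style 10-digit phone assumption, the `(XXX) XXX-XXXX` output format, and the function names are all illustrative assumptions; the project's actual rules may differ.

```python
import re
from datetime import datetime

# Candidate input formats, tried in order (an assumed list for illustration).
# Ambiguous dates such as 03/04/2024 resolve to the first matching format.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y", "%B %d, %Y", "%b %d, %Y")


def standardize_date(value):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no format matches."""
    text = str(value).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_phone(value):
    """Normalize a 10-digit US-style number to (XXX) XXX-XXXX, else None."""
    digits = re.sub(r"\D", "", str(value))
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading country code
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
```

Returning None for unparseable values, rather than raising, lets a caller collect the failures into an error report instead of stopping the whole run.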
Quality Assurance
- 5 comprehensive automated test scenarios
- Error handling for edge cases
- Configuration validation
- Complete documentation
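Configuration validation could look something like the sketch below. The key names (`dedupe_columns`, `date_columns`, `phone_columns`) and the `validate_config` function are hypothetical, chosen only to illustrate the idea; the project's real config schema is not shown here.

```python
import json

# Hypothetical schema: key name -> required Python type after JSON parsing.
REQUIRED_KEYS = {
    "dedupe_columns": list,
    "date_columns": list,
    "phone_columns": list,
}


def validate_config(raw_json):
    """Parse a JSON config string and return (config, list_of_problems)."""
    try:
        config = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return None, [f"Invalid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(config, dict):
        return None, ["Top-level JSON value must be an object"]

    problems = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"Missing key: {key}")
        elif not isinstance(config[key], expected):
            problems.append(f"{key} must be a {expected.__name__}")
    return config, problems
```

Collecting every problem before reporting means a user can fix the whole config in one pass rather than one error at a time.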
Technical Note: This tool performs basic duplicate removal using pandas' drop_duplicates() method based on specified columns, and simple column name mapping for predefined variations (customer/client/company). For advanced fuzzy matching or complex data relationships, I can customize the solution for your specific needs.
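The two techniques named in the note can be sketched as follows. `drop_duplicates()` is the pandas method the note refers to; the alias table, the choice of `customer` as the canonical name, and the helper function names are assumptions made for illustration.

```python
import pandas as pd

# Assumed variation table: each known header variant maps to one canonical name.
COLUMN_ALIASES = {"client": "customer", "company": "customer"}


def normalize_columns(df):
    """Lower-case and trim headers, then rename known variants."""
    cleaned = df.rename(columns=lambda name: str(name).strip().lower())
    return cleaned.rename(columns=COLUMN_ALIASES)


def remove_duplicates(df, key_columns):
    """Drop rows that repeat the same values in key_columns, keeping the first."""
    deduped = df.drop_duplicates(subset=key_columns, keep="first")
    return deduped.reset_index(drop=True)


raw = pd.DataFrame(
    {
        " Client ": ["Acme", "Acme", "Globex"],
        "Email": ["a@acme.com", "a@acme.com", "g@globex.com"],
    }
)
deduped = remove_duplicates(normalize_columns(raw), ["customer", "email"])
print(deduped)
```

Because the comparison is exact, values such as "Acme" and "acme " would count as different rows, which is the gap that fuzzy matching is meant to close. This sketch also assumes a file contains only one of the aliased headers; two of them in the same file would produce duplicate column names.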
Real-World Applications
Want to Try It?
This tool runs locally on your computer. Your data never leaves your system and is never uploaded anywhere, so you keep complete control and privacy. You just need to clone the repository and run it locally.
- Test with your actual data files
- See exactly how it handles your specific format
- No data privacy concerns
- Full control over the cleaning process
If you're a developer:
- Clone the repository from GitHub
- Install dependencies: pip install -r requirements.txt
- Run: python main.py
- Upload your files in data_cleaner/data/input and watch it work
If you're not a developer:
Don't worry about the technical setup, I can handle everything for you. I'll customize this tool for your specific needs, set it up to work with your data, and deliver a ready-to-use solution with clear instructions.
How We'll Work Together
- Discovery — You walk me through your data challenge and what you are cleaning manually. I confirm whether this tool fits and give you an honest timeline and cost estimate. No commitment needed.
- Scope — I send you a written scope: what gets built, what gets tested, when it ships, and what it costs. You approve it or we adjust until it fits.
- Build — I configure the tool for your specific file formats and edge cases. You get progress updates every 2 to 3 days. Nothing ships blind.
- Handoff — You test with real data. I fix anything that does not work. Final payment on delivery. Seven days of support included.
- You run it independently — The tool is yours. You have the script, the documentation, and the logic. No dependency on me to keep it running.