
 

"Golden Retriever"

 

TL;DR

The post demonstrates a reusable DataProcessor class on a synthetic sales dataset (standing in for a proprietary one), with support for generic datasets like the included “foo”. It loads, inspects, cleans, and partially visualizes the data — particularly focusing on outlier detection via boxplots and histogram comparisons, using the Z-score and/or IQR methods to confirm the cleanup. Finally, it saves the cleaned data to a .csv or .xlsx file.


Call me!

In the previous posts, Class of the Process and Raw Power - Data Processor Pt. 2, we implemented our DataProcessor class — so far, so good!

Now it's time to call it from our main.py program and test its functionality!

 

Dataset

It would be ideal for me to demonstrate the data I’ve extracted from my job — but there’s a catch.

The data I’ve collected is proprietary!

Unfortunately, it’s not mine to share. To work around this limitation, I’ve created a new dataset with nearly identical variables, but filled with completely random data. I think it will be interesting to see what kind of results we can get from it!

With this being said, let's move on!

The dataset we'll be using is foo_sales_dataset.csv, and you can find it in the data folder of my GitHub project DS/ML, along with main.py in the src folder.

 

The main.py!! 

Importing the DataProcessor class, setting the file path, and choosing a target variable for the analysis are the necessary first steps. For this test, the target variable will be TOTAL_SALES. After that, we are ready to load the data and create a processor object.

Python Code:

from processor import DataProcessor

file_path = '/your_path/foo_sales_dataset.csv'
target_col = 'TOTAL_SALES'

# Load data and create a processor data frame object 
processor = DataProcessor(filepath=file_path).load()

If we would like to inspect the type of our new object, we can add the following line:

print(type(processor.df)) 

If the data has been loaded successfully, the expected output is:
<class 'pandas.core.frame.DataFrame'>

From here, we can proceed with a visual inspection to understand how the data appears.  For example, we might look at the head and tail of the DataFrame, check the number of records, and examine the data types.

Python Code:

# Show head of the data
print("\n🔹 First 5 rows:")
print(processor.df.head())

# Show tail of the data
print("\n🔹 Last 5 rows:")
print(processor.df.tail())

print(f"\n🔹 Number of rows and columns in the dataset: {processor.df.shape}")

print("\n🔹 Check for Data Types:")
processor.check_dtypes()

 

The dataset consists of 654 rows and 9 columns. The first five records give us a clearer picture of what the data looks like.

 

ID DATE DAY MANAGER ESALES CARDS CASH TOTAL_SALES RECEIPT
1 2025-01-01 Wednesday Bob 166.57 304.67 334.48 805.72 34
2 2025-01-01 Wednesday Charlie 614.18 156.63 239.96 1010.77 33
3 2025-01-01 Wednesday Alice 143.21 419.47 263.33 826.01 51
4 2025-01-02 Thursday Bob 988.67 669.25 433.05 2090.97 32
5 2025-01-03 Friday Bob 263.15 615.05 439.86 1318.06 83

Table 1: First 5 records of the data set

 

It contains five sales-related numerical variables, three categorical variables (DATE, DAY, and MANAGER), and one variable, ID, which serves as the record key.

 

Column Data Type
ID int64
DATE object
DAY object
MANAGER object
ESALES float64
CARDS float64
CASH float64
TOTAL_SALES float64
RECEIPT int64

 Table 2: Data Types

From the first five rows, one important observation is that the store has at least three employees. Additionally, each date appears multiple times — once for each employee. If we were to aggregate the sales-related variables by day, we would lose valuable information, such as sales per employee. We'll need to address this.

But before we do, it's clear that our DataProcessor class needs an upgrade.

The check_categorical_columns() method currently only checks for the presence of categorical variables and, if any are found, returns their names. However, it would be better if it not only detected categorical columns but also counted how many distinct values each one contains.  And presto — such a feature is already implemented in this method!
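For reference, here is a minimal sketch of what such a method might look like internally (assuming the class keeps its DataFrame in self.df; the actual implementation may differ):

def check_categorical_columns(self):
    # Report each categorical column and how many distinct values it contains
    cat_cols = self.df.select_dtypes(include=['object', 'category']).columns
    if len(cat_cols) == 0:
        print("🔺 No categorical columns found.")
        return
    for col in cat_cols:
        print(f"{col}: {self.df[col].nunique()} unique values")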

 

🔹 Check for Categorical Columns 
Column Name Unique Values
DATE 221
DAY 7
MANAGER 3

 Table 3: Distinct values for Categorical columns 

 

After calling check_categorical_columns() method, we can confidently conclude that:

There are exactly three employees, seven distinct day names, and 221 unique calendar dates among the 654 records.

 

Index Selection in DataFrames

Setting an index in a DataFrame is useful because it provides structure, efficiency, and flexibility to your data operations in pandas. In this dataset, DATE is not the optimal candidate for an index on its own, although it could be used as part of one when combined with another variable — MANAGER for example. If you want to perform group-based analysis, an alternative would be to use a MultiIndex with DATE and MANAGER.

So, another missing feature in the DataProcessor class is the ability to set a non-date column as the index. Therefore, a new method —  set_index_column() — is necessary.

Choosing the  ID variable as the index for this analysis is best practice, as each row is uniquely identifiable. This makes it easier to perform further data manipulations if needed (e.g., filtering, joining). 

Calling processor.set_index_column() from main.py confirms that the  ID column has been set as the  index for this analysis. 
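For reference, this call boils down to something like the following in plain pandas (a sketch of the idea, not the class's exact implementation):

processor.df = processor.df.set_index('ID')                    # single-column index on the record key
# processor.df = processor.df.set_index(['DATE', 'MANAGER'])   # MultiIndex alternative for group-based analysis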

 

Drop columns in DataFrames

Dropping columns in a DataFrame is a common and useful operation in data analysis and machine learning workflows for several practical reasons. In this dataset, there aren’t any irrelevant columns to drop.

However, for demonstration purposes, if we call  processor.drop_columns([]) with an empty list, a message will inform us:

🔺No matching columns found to drop.

On the other hand, if we choose to drop a column — for example, the DAY variable — another message will indicate the action:

🟢 Dropped column(s): ['DAY']
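For completeness, the two calls from main.py that produce these messages are simply:

processor.drop_columns([])       # empty list: nothing matches, so the warning is printed
processor.drop_columns(['DAY'])  # drops the DAY column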

 

Checking for Missing Values

Missing values can lead to unexpected issues during data analysis or model training, so identifying them early is crucial. In our project, we use the check_missing() method to detect and report missing data within the dataset.

This method not only summarizes which columns have missing values, but — when called with return_rows=True — also returns the exact rows where the missing values occur. This allows for detailed inspection and targeted data cleaning.

If no missing data is found, the method confirms that everything is in order. Otherwise, it clearly reports the affected columns along with the count of missing entries, helping us decide on appropriate actions like dropping, filling, or imputing the missing values.

In our case, the dataset is clean — no missing values detected! 

However, if there were any missing values (say in the ESALES and RECEIPT columns), the method would display:

  • The count of missing values per column, and
  • The exact row indices containing those missing values (e.g., [24, 41] ).

By default, calling processor.check_missing() provides a quick overview of missing columns. For a more detailed view, processor.check_missing(return_rows=True) is the way to go. Of course, we could also make return_rows a default behavior by adjusting our DataProcessor class — making detailed inspections even easier.
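In code, the two variants mentioned above are simply:

processor.check_missing()                  # quick overview: missing counts per column
processor.check_missing(return_rows=True)  # also returns the exact rows containing missing values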

 

Handling missing values

When working with real-world data, it's common to encounter missing values — also known as NaN (Not a Number). To address this, we use a method called handle_missing_values().

It's a flexible tool for data cleaning, designed to work with both numerical and categorical columns. For example, mean or median strategies only apply to numeric data, while for categorical columns, the most_frequent strategy replaces missing entries with the most common category (mode).

Example: 

Suppose your data has missing values. Here’s what the method might report:

🔹 Handle missing values:
🔸 Filled missing numeric values using mean in columns: ['ESALES', 'RECEIPT']
🟢 All missing values handled. 

After running, the method provides a clear message showing which columns were affected and what strategy was applied — making your data-cleaning process transparent and reproducible. 
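As a rough illustration of what the mean and most_frequent strategies do under the hood, here is the plain-pandas equivalent (a sketch of the idea, not the class's actual code):

# Numeric columns: fill NaN with each column's mean
num_cols = processor.df.select_dtypes(include='number').columns
processor.df[num_cols] = processor.df[num_cols].fillna(processor.df[num_cols].mean())

# Categorical columns: fill NaN with the most frequent value (mode)
for col in processor.df.select_dtypes(include='object').columns:
    processor.df[col] = processor.df[col].fillna(processor.df[col].mode().iloc[0])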

 

Detecting and Handling Duplicate Rows in the Dataset

When analyzing data, duplicate rows can creep in due to data entry errors, merging issues, or scraping from inconsistent sources. If not handled, they can distort your results — especially in tasks like aggregation, statistical analysis, or training machine learning models.

To address this, we follow a three-step process:

  1. Inspect for duplicates using inspect_duplicates(), which reports how many duplicate rows exist.
  2. Log duplicates with log_duplicates() to create an audit trail — saved as a .csv or .xlsx file in a /logs/ directory.
  3. Handle duplicates via handle_duplicates(method=...), using your preferred strategy; the corresponding calls are sketched right after this list.
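In main.py, the three steps translate to the following calls (the available values for method are listed in Table 5 further down):

processor.inspect_duplicates()                     # 1. report how many duplicate rows exist
processor.log_duplicates()                         # 2. write them to the /logs/ directory for auditing
processor.handle_duplicates(method='keep_first')   # 3. keep the first occurrence, drop the rest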

Example output:

🔹 Check for duplicate rows in DataFrame:
🔴 Found 3 duplicate row(s).

🔹 Make log file for duplicates:
🔴 Logged 3 duplicate rows to: logs/duplicates_log_20250924_114512.csv

🔹 Handling duplicate rows:
🔴 Found 3 duplicate row(s). Applying method: 'keep_first'
🔸 Kept first occurrence of duplicates.
 

The available methods for handling duplicates can be seen in the table below.

 

Method        Description
keep_first    Keeps the first occurrence, removes the rest (default)
keep_last     Keeps the last occurrence
drop_all      Removes all duplicate entries
flag          Adds a new column is_duplicate = True for duplicate rows

 Table 5: Duplicates Handling Methods 

 

Logging for Transparency

Each time duplicates are found, they're written to a file like this:

logs/duplicates_log_20250924_114512.csv

This makes it easy to track what was removed or flagged, which is especially useful in collaborative projects or for compliance.

 

Detecting and Removing Duplicate Columns

While duplicate rows are easier to spot, duplicate columns can quietly affect the quality of your dataset. These may come from merged files, accidental copy-pastes, or inconsistent exports (e.g., two columns named "Email").

To detect and handle such issues, two methods can be used:

🔹 Check for duplicate columns:
🔴 Found 2 duplicate column(s): ['Name', 'Email'] 

🔹 Handling duplicate columns:
 🔸 Removed duplicate columns, keeping the first occurrence.  
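Under the hood, the detection boils down to something like this in plain pandas (a sketch of the idea, keeping the first occurrence of each repeated column name):

duplicated_cols = processor.df.columns[processor.df.columns.duplicated()].tolist()
print(f"Found {len(duplicated_cols)} duplicate column(s): {duplicated_cols}")
processor.df = processor.df.loc[:, ~processor.df.columns.duplicated()]   # keep the first occurrence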

Keep in mind:

Cleaning duplicates isn't just about tidiness — it directly impacts the quality of your insights. Imagine counting sales per customer or training a model: duplicates can inflate figures or bias learning.

That’s why this process is both informative and cautious — flagging, logging, and offering control over what to do next.

 

Outliers Beware — Outlier Detection Techniques in Python 

Real-world data can get real messy.  Outliers — extreme values that differ significantly from the rest of the dataset — can distort everything from averages to predictive models. Before trusting any statistical result, it’s crucial to identify and, in some cases, remove these extreme values.

There are two popular methods for outlier detection: the Z-score method and the Interquartile Range (IQR) method.

Getting a Statistical Overview 

Before detecting outliers, it’s important to understand the overall distribution of your dataset. A quick statistical summary helps reveal the central tendency, spread, and shape of your data. 


print(processor.df.describe())
 

🔹Summary Statistics: 


         ESALES     CARDS      CASH       TOTAL_SALES   RECEIPT
count    653        654        654        654           653
mean     548.83     415.05     322.23     1286.02       48.35
std      259.31     219.21     161.94     385.63        28.58
min      101.62     50.54      33.71      306.82        1
25%      336.25     222.53     191.28     990.32        24
50%      538.91     403.88     317.57     1273.10       48
75%      769.83     609.98     459.63     1574.70       72
max      999.37     796.40     599.94     2302.63       99

Table 6: Summary Statistics 

 

 Checking for Outliers

The next logical step is to check for outliers, as they can significantly distort your analysis or model performance.  This is a critical data-cleaning step to identify problematic values. 

 Detecting Outliers Using Z-Score

The Z-score tells us how far a data point is from the average, in terms of standard deviations.

For example:

  • A  Z-score of 0 means the value is exactly average.
  • A  Z-score of +2 means the value is 2 standard deviations above average.
  • A  Z-score of -2 means it's below average.

Typically, any value beyond ±2 or ±3 is considered an outlier, depending on how strict you want to be. 

Using the check_outliers() method, we scan all numeric columns and calculate Z-scores, highlighting potential outliers by measuring how far each data point lies from the mean (in standard deviations).
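In plain pandas, the Z-score for a single column comes down to this (roughly what the method computes for each numeric column):

col = processor.df['TOTAL_SALES']
z_scores = (col - col.mean()) / col.std()     # distance from the mean, in standard deviations
outliers = processor.df[z_scores.abs() > 2]   # rows beyond the ±2 threshold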

 

processor.check_outliers()
 

🔹 Check for outliers:

🟢 Numeric columns used for Z-score calculation:
Index(['ESALES', 'CARDS', 'CASH', 'TOTAL_SALES', 'RECEIPT'], dtype='object')

🟢 Sample Z-scores:

ID    ESALES    CARDS     CASH      TOTAL_SALES   RECEIPT
1     -1.476    -0.504     0.076    -1.246        -0.503
2      0.252    -1.180    -0.508    -0.714        -0.538
3     -1.567     0.020    -0.364    -1.194         0.093
4      1.699     1.161     0.685     2.089        -0.573
5     -1.103     0.913     0.727     0.083         1.215

Table 7:  Sample Z-scores 

 

🟢 Outliers per column (Z-score > threshold):
TOTAL_SALES    17
dtype: int64 

 

If you prefer to get the full rows containing outliers:

 

processor.check_outliers(z_thresh=2, return_rows=True, log_path=None)
 

  🔸 Rows with at least one outlier (Z > 2): 

ID DATE DAY MANAGER ESALES CARDS CASH TOTAL_SALES RECEIPT
4 2025-01-02 Thursday Bob 988.67 669.25 433.05 2090.97 32.0
19 2025-01-08 Wednesday Alice 224.64 107.76 41.31 373.71 25.0
33 2025-01-13 Monday Charlie 982.45 572.22 574.35 2129.02 68.0
45 2025-01-17 Friday Charlie 963.02 768.05 402.46 2133.53 86.0
107 2025-02-08 Saturday Alice 989.29 609.05 552.30 2150.64 60.0

Table 8:  Sample Z-score outliers


Optionally, we can save the flagged rows to a .csv file for inspection by providing a path to the method:

processor.check_outliers(z_thresh=2, return_rows=True, log_path='/your_path/log_outliers.csv')


Removing outliers: remove_outliers_zscore() vs. remove_outliers_from_column()

While the remove_outliers_zscore() method removes outliers across all numerical columns, the remove_outliers_from_column() method targets only a specific column.
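For example, targeting TOTAL_SALES with a Z-score threshold of 2 (the argument names here are assumed and may differ slightly in the class; the output below corresponds to this kind of call):

processor.remove_outliers_from_column('TOTAL_SALES', z_thresh=2)   # hypothetical signature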

🔹Removing outliers from specific column according to z_score
🔴 Removed 17 outlier(s) from 'TOTAL_SALES' (Z-thresh = 2)

🔹 Number of rows after outlier removal: 637

The first method is more suitable for global cleanup; use the latter when focusing on a specific metric.

 

Choosing the Right Method:  Z-Score  vs.  IQR

Before jumping into the Interquartile Range (IQR) method, it’s important to understand the limitations of the Z-score approach.

Z-score works best with data that follows a normal (bell curve) distribution.
If your data is skewed—as is often the case with financial or sales data — Z-score might misclassify legitimate values as outliers or miss real ones.

This is where IQR comes in. 

 

Interquartile Range (IQR) Method

The IQR (Interquartile Range) is a measure of statistical dispersion and is commonly used to detect outliers.  It is especially useful for non-normally distributed datasets.

It measures the spread of the middle 50% of the data:

IQR = Q3 - Q1 

  • Q1 (First Quartile): The 25th percentile – 25% of the data falls below this value.
  • Q3 (Third Quartile): The 75th percentile – 75% of the data falls below this value.

Using IQR, we define outlier thresholds: 

  • Lower Bound = Q1 - 1.5 × IQR
  • Upper Bound = Q3 + 1.5 × IQR   

Any value outside this range is considered an outlier.
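In pandas, the bounds for TOTAL_SALES work out as follows (a quick sketch of the rule above):

col = processor.df['TOTAL_SALES']
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = processor.df[(col < lower) | (col > upper)]   # values outside [lower, upper]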

 

Using remove_outliers_iqr() Method

In the main.py you can apply IQR-based outlier removal with:

 

processor.remove_outliers_iqr(column='TOTAL_SALES', iqr_multiplier=1.5)

 

  • A larger multiplier means a wider range, so fewer values are flagged as outliers.
  • A smaller multiplier means a narrower range, so more values are flagged as outliers.

 

The iqr_multiplier parameter controls sensitivity. Setting it to 1 makes the method more aggressive, which is useful when high sensitivity is needed. Setting it to 2 or 3 makes it more conservative, for use cases where you want minimal interference with the data.

By default, it's set to 1.5 (Tukey's rule). 

This method is ideal when your data is skewed or when Z-score is too sensitive to outliers. 

 

Visual  Inspection of Outliers

While we've already explored how to detect and remove outliers through statistical methods, visualizing these anomalies can be just as powerful. Graphical representations help validate our cleaning process and provide intuitive insight into the shape and spread of the data — before and after outlier removal.

Two dedicated methods help with this:

  • processor.visualize_outliers_boxplot() compares the boxplots of the original and cleaned data, clearly showing how extreme values are trimmed.
  • processor.visualize_outliers_histogram() overlays histograms with KDE curves, letting you assess changes in distribution and density. 

 

processor.visualize_outliers_boxplot(original_df,processor.df, 'TOTAL_SALES')
processor.visualize_outliers_histogram(original_df, processor.df, 'TOTAL_SALES')
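Note that original_df here is assumed to be a snapshot of the DataFrame taken before any outlier removal (for example, original_df = processor.df.copy() right after loading), so the before and after states can be compared side by side.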


For the given dataset you can see the results below.

 

Figure 1:  Boxplot comparison foo sales data

 
Figure 2:  Histogram comparison foo sales data 

 

It is clear that our data deviates significantly from what one would expect in a typical sales dataset. For real-world sales data, the graphical representations would likely look more like the following:

 

Figure 3:  Boxplot — Real sales data 
 

Figure 4:  Histogram — Real sales data 

 

Together, these visuals act as a final check — confirming that your data cleanup didn’t just remove values, but preserved the underlying structure of the dataset.

 

 

One ensemble method

So far, we’ve used several individual methods, and depending on the workflow of each project, we can think of additional actions to implement. If we know what to expect from our data — or if we want to automate the process — we can simply use the run_all_checks() method.

For example, if your project requires checks like missing data, duplicate rows, data types, or verifying whether the index is a date or datetime, you can perform all of these with just one line of code:

 

processor.run_all_checks()

 

Example output:

🔹 Running data quality checks...
🔹 Missing values per column:
🟢 No missing values found.
🟢 No duplicate rows found.
Data types:
DATE            object
DAY             object
MANAGER         object
ESALES         float64
CARDS          float64
CASH           float64
TOTAL_SALES    float64
RECEIPT          int64
dtype: object
🔴 Index is not datetime or date. Detected type: <class 'numpy.int64'> 
 


Of course, you can customize this method according to your needs!



Saving clean data

After all this wrangling, we should export the cleaned dataset to a .csv or a .xlsx file. This gives us a fresh starting point for the next steps of our analysis. The save() method helps us do just that:

 

processor.save(path="/your_path/data/processed_data.xlsx", format="xlsx")

 

 

Final thoughts 

Effective data cleaning is more than just removing unwanted values — it’s about understanding your dataset deeply and preserving its meaningful structure. By combining statistical techniques with clear visual inspections, you gain confidence that your cleaned data truly reflects the underlying trends without distortion — after all, no one wants to be accused of introducing bias!

 This careful preparation sets a strong foundation for any further analysis or modeling you plan to do. Keep these practices in mind as you tackle future datasets, and remember: quality input leads to quality insights.


Coming Up Next 

What follows next is not the end. The end is the beginning!

 

 
