
Soft introduction - Data Preparation



Introduction

A while ago, I started a store manager position at a medium-sized business specializing in high-quality wine and liquor.

This happened a few years after I had completed my MSc.

With a little effort, the physical store was set up just the way I wanted it,
and alongside my other responsibilities, I also managed the accounting.
That involved long, repetitive calculations, which I needed to automate to free up time for more important tasks.

To address this, I created a spreadsheet in Excel that significantly improved my workflow.
In addition to streamlining operations, this system also generated a large amount of data! It didn't take me long to realize that I had to do something valuable with it.

 

Data Preparation

For data science projects, the quality of the available data is critically important. Are there any missing values? If so, how should they be handled? Are the numerical columns truly integers or floats, or are there discrepancies?
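Questions like these can be answered with a quick audit before any modeling begins. Below is a minimal sketch using pandas on a made-up sample (the column names and values are hypothetical, not from my actual store data):

```python
import pandas as pd

# Hypothetical sample resembling store sales records
df = pd.DataFrame({
    "product": ["Merlot", "Gin", None, "Whisky"],
    "price": ["12.5", "20.0", "15.0", "n/a"],  # stored as strings, not floats
    "units_sold": [3, 5, 2, 4],
})

# How many values are missing per column?
missing = df.isna().sum()
print(missing)  # product: 1, price: 0, units_sold: 0

# Is the "numeric" column actually numeric? Coercion exposes discrepancies.
price_numeric = pd.to_numeric(df["price"], errors="coerce")
print(price_numeric.isna().sum())  # 1 -- the "n/a" entry is not a number
```

Running such a check first tells you which of the cleaning steps discussed below are actually needed.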

In an ideal world, a dataset for data science projects should have the following properties:

1. Relevant and Well-Defined

The data should be directly related to the problem you're trying to solve.
If the task is supervised learning (e.g., classification or regression), it should have clear objectives and labels.

2. Clean and Well-Structured

  • No missing or inconsistent values

  • Uniform formatting (e.g., consistent date formats, standardized text)

  • Consistent data types across each column (e.g., no strings in numerical columns)

3. Sufficient Volume

  • Enough rows (observations) to reliably train and test machine learning models

  • Enough variability to capture underlying patterns in the data

4. Balanced Distribution

Especially for classification tasks, a balanced class distribution helps avoid bias toward the majority class.

5. Representative and Unbiased

The dataset should accurately reflect the real-world scenario it's modeling.
It should be free from sampling bias, measurement bias, and labeling bias.

6. Well-Documented

A data dictionary or metadata should describe each column and its possible values.
The data source, collection methods, and update frequency should also be documented.

Real-world data is rarely ideal or ready to use. Problems often arise and must be addressed!
 

Problems with "Non-Proper" Datasets

Datasets that are not properly prepared can cause significant issues, including:

1. Missing Values

These can cause errors in models or skew the results.
Strategies/countermeasures: imputation, removal, or model-based filling.
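The first two strategies are one-liners in pandas. A small sketch on toy data:

```python
import pandas as pd

s = pd.Series([10.0, None, 30.0, None, 50.0])

# Imputation: fill the gaps with the column mean (here, mean of 10, 30, 50 = 30)
imputed = s.fillna(s.mean())
print(imputed.tolist())  # [10.0, 30.0, 30.0, 30.0, 50.0]

# Removal: simply drop the rows with missing values
dropped = s.dropna()
print(len(dropped))  # 3
```

Which strategy is appropriate depends on how much data you can afford to lose and whether the missingness itself carries information.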

2. Inconsistent Data Types

Example: mixing strings and numbers in a numeric column breaks numerical analysis.
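One common fix is to coerce the column to a numeric type, turning anything unparseable into a missing value you can then handle explicitly. A sketch with an invented mixed column:

```python
import pandas as pd

mixed = pd.Series(["12.5", "20", "unknown", "15.0"])

# errors="coerce" converts unparseable entries to NaN instead of raising
numeric = pd.to_numeric(mixed, errors="coerce")
print(numeric.dtype)         # float64 -- the column is usable for math again
print(numeric.isna().sum())  # 1 -- the "unknown" entry
```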

3. Outliers and Noise

These may distort model training if not identified and handled properly.
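A standard way to flag candidates is the interquartile-range (IQR) rule: values more than 1.5 × IQR beyond the first or third quartile are marked as potential outliers. A sketch on invented numbers:

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 95])  # 95 looks suspicious

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside [lower, upper] is flagged for review
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())  # [95]
```

Flagged points should be reviewed, not blindly deleted: an "outlier" may be a data-entry error, or a genuine rare event the model needs to see.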

4. Duplicated or Redundant Entries

These inflate the apparent size of the dataset and can bias the results.
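Exact duplicates are easy to remove with pandas; a sketch on a made-up order table:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "product": ["Merlot", "Gin", "Gin", "Whisky"],
})

# Drop rows that are identical across all columns
deduped = df.drop_duplicates()
print(len(df), "->", len(deduped))  # 4 -> 3
```

Near-duplicates (same entity, slightly different spelling or formatting) are harder and usually need a normalization pass first.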

5. Imbalanced Classes

This leads to models that perform poorly on minority classes (common in fraud detection or medical diagnoses).
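One simple countermeasure is resampling. The sketch below naively oversamples the minority class on an invented 9:1 fraud dataset; class weights or more careful techniques such as SMOTE are common alternatives:

```python
import pandas as pd

df = pd.DataFrame({
    "amount": range(10),
    "is_fraud": [0] * 9 + [1],  # 9:1 imbalance
})

minority = df[df["is_fraud"] == 1]
majority = df[df["is_fraud"] == 0]

# Naive oversampling: repeat minority rows (with replacement) until classes match
oversampled = pd.concat(
    [majority, minority.sample(len(majority), replace=True, random_state=0)]
)
print(oversampled["is_fraud"].value_counts().tolist())  # [9, 9]
```

Note that any resampling should happen only on the training split, never before the train/test split, or it becomes a form of leakage.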

6. Data Leakage

Occurs when future information "leaks" into the training data, leading to artificially high performance during training but poor performance in production.
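A classic example is computing preprocessing statistics (e.g., for standardization) on the full dataset before splitting. The sketch below, on random data, shows the correct order of operations:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)

# Split FIRST, before computing any statistics
train, test = data[:80], data[80:]

# Correct: the scaling statistics come from the training set only
mu, sigma = train.mean(), train.std()
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma

# Leaky (wrong): using data.mean() and data.std() would let test-set
# information influence the transformation applied to the training data
```

The same rule applies to imputation, feature selection, and resampling: fit every preprocessing step on the training split only.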

7. Unclear or Incorrect Labels

Especially problematic in supervised learning — garbage in, garbage out (GIGO).

If you feed poor-quality data into a system or model, you'll get poor-quality (unreliable or misleading) results, regardless of how good your model, algorithm, or processing is.

8. Lack of Domain Context

Without understanding what each feature means, it’s hard to interpret or validate model results.

 

Conclusion

Ideal datasets are clean, relevant, well-labeled, balanced, and well-documented.

Non-proper datasets — those with missing, inconsistent, biased, or mislabeled data — lead to unreliable models and incorrect conclusions.
