
Introduction
This happened a few years after I had completed my MSc. Alongside my other responsibilities, I also had to manage the accounting, which involved long, repetitive calculations that I needed to automate to free up time for more important tasks.
To address this, I created a spreadsheet in Excel that significantly improved my workflow.
In addition to streamlining operations, this system also generated a large amount of data, and it didn't take me long to realize that I had to do something valuable with it!
Data Preparation
For data science projects, the quality of the available data is critically important. Are there any missing values? If so, how should they be handled? Are the numerical columns truly integers or floats, or are there discrepancies?
In an ideal world, a dataset for data science projects should have the following properties:
1. Relevant and Well-Defined
The data should be directly related to the problem you're trying to solve.
It should have clear objectives and labels if it's supervised learning (e.g., classification or regression).
2. Clean and Well-Structured
- No missing or inconsistent values
- Uniform formatting (e.g., consistent date formats, standardized text)
- Consistent data types across each column (e.g., no strings in numeric columns)
3. Sufficient Volume
- Enough rows (observations) to reliably train and test machine learning models
- Enough variability to capture the underlying patterns in the data
4. Balanced Distribution
Especially for classification tasks, a balanced class distribution helps avoid bias toward the majority class.
5. Representative and Unbiased
The dataset should accurately reflect the real-world scenario it's modeling.
It should be free from sampling bias, measurement bias, and labeling bias.
6. Well-Documented
A data dictionary or metadata should describe each column and its possible values.
The data source, collection methods, and update frequency should also be documented.
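As a first pass, several of these properties (missing values, column data types, duplicates) can be checked programmatically. A minimal sketch with pandas, using hypothetical accounting-style column names:

```python
import pandas as pd

# Hypothetical data; "amount" deliberately mixes a string into a numeric column
df = pd.DataFrame({
    "invoice_id": [1, 2, 2, 4],
    "amount": [100.0, None, 250.5, "300"],
    "date": ["2023-01-05", "05/01/2023", "2023-02-10", "2023-03-01"],
})

# Missing values per column
print(df.isna().sum())

# Data types: an "object" dtype on a supposedly numeric column signals mixed types
print(df.dtypes)

# Count of fully duplicated rows
print(df.duplicated().sum())
```

Note how the inconsistent date formats on the "date" column would also surface here the moment you try `pd.to_datetime` with a fixed format.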
Problems with Poorly Prepared Datasets
Datasets that are not properly prepared can cause significant issues, including:
1. Missing Values
These can cause errors in models or skew the results.
Countermeasures: imputation, removal, or model-based filling.
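The first two strategies can be sketched in pandas (the column name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"amount": [100.0, None, 250.0, None, 300.0]})

# Strategy 1: removal -- drop rows that contain missing values
dropped = df.dropna()

# Strategy 2: imputation -- fill gaps with a summary statistic such as the median
imputed = df.fillna(df["amount"].median())

print(len(dropped))                    # rows remaining after removal
print(imputed["amount"].isna().sum())  # no missing values left after imputation
```

Which strategy is appropriate depends on how much data you can afford to lose and whether the missingness itself carries information.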
2. Inconsistent Data Types
Example: mixing strings and numbers in a numeric column breaks numerical analysis.
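One common fix is to coerce the column to a numeric type so that unparseable entries become explicit missing values instead of silently breaking arithmetic. A sketch with pandas:

```python
import pandas as pd

# "price" is stored as object dtype because one entry is non-numeric text
df = pd.DataFrame({"price": [10.5, "20.0", "n/a", 30.0]})

# errors="coerce" turns unparseable entries into NaN rather than raising
df["price"] = pd.to_numeric(df["price"], errors="coerce")

print(df["price"].dtype)         # float64 -- numerical analysis now works
print(df["price"].isna().sum())  # 1 -- the "n/a" entry, now flagged as missing
```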
3. Outliers and Noise
These may distort model training if not identified and handled properly.
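A standard way to flag outliers is the interquartile-range (IQR) rule: mark points outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A minimal sketch on made-up data:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 250])  # 250 is an obvious outlier

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(outliers.tolist())  # [250]
```

Whether a flagged point is noise to drop or a genuine extreme to keep is a domain question, not a purely statistical one.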
4. Duplicated or Redundant Entries
These inflate the apparent size of the dataset and can bias the results.
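Exact duplicates are cheap to detect and remove in pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "invoice_id": [1, 2, 2, 3],
    "amount": [100, 200, 200, 300],
})

# Keep only the first occurrence of each fully identical row
deduped = df.drop_duplicates()

print(len(df), "->", len(deduped))  # 4 -> 3
```

Near-duplicates (same record with slightly different formatting) are harder and usually need column normalization first.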
5. Imbalanced Classes
This leads to models that perform poorly on minority classes (common in fraud detection or medical diagnoses).
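One simple countermeasure is inverse-frequency class weights, which many learners accept (e.g., via a `class_weight` parameter). A sketch with hypothetical fraud labels:

```python
import pandas as pd

# Heavily skewed toward the "legit" class, as in real fraud data
y = pd.Series(["legit"] * 95 + ["fraud"] * 5)

counts = y.value_counts()
print(counts)

# Inverse-frequency weights: rare classes get proportionally larger weights
weights = {cls: len(y) / (len(counts) * n) for cls, n in counts.items()}
print(weights)  # "fraud" is weighted far more heavily than "legit"
```

Resampling (oversampling the minority or undersampling the majority) is the other common family of fixes.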
6. Data Leakage
Occurs when future information "leaks" into the training data, leading to artificially high performance during training but poor performance in production.
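A classic leakage trap is computing preprocessing statistics on the full dataset before splitting. A minimal numpy sketch of the safe pattern, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=100)

# Split FIRST, then compute preprocessing statistics on the training part only
train, test = X[:80], X[80:]
mu, sigma = train.mean(), train.std()

# Scaling both splits with train-only statistics keeps the test set unseen
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma

# Using X.mean() and X.std() here instead would leak test-set information
# into training, inflating apparent performance
```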
7. Unclear or Incorrect Labels
Especially problematic in supervised learning — garbage in, garbage out (GIGO).
If you input poor-quality data into a system or model, you’ll get poor-quality (unreliable or misleading) results — regardless of how good your model, algorithm, or processing is.
8. Lack of Domain Context
Without understanding what each feature means, it’s hard to interpret or validate model results.
Conclusion
Ideal datasets are clean, relevant, well-labeled, balanced, and well-documented.
Poorly prepared datasets — those with missing, inconsistent, biased, or mislabeled data — lead to unreliable models and incorrect conclusions.