Preprocess FIFA dataset
Goals
In this tutorial, my goal is to preprocess a dataset I often use to teach data analysis. The dataset contains characteristics of FIFA soccer players.
Using only preprocess, I want to be able to:
- Explore the dataset: know how many rows and columns it contains
- Explore summary statistics on numeric columns: number of missing values, minimum, mean, median and maximum values
- Explore summary statistics on text columns: number of missing values, count of each modality
- Fill numeric missing values with the mean of each column
- Scale numeric columns: apply a z-score to every column
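To make the last goal concrete, here is a minimal Python sketch of what a z-score does to a column. It is not how preprocess implements scaling. The function name `zscore` is hypothetical, and the use of the population standard deviation is an assumption, since the tutorial does not say which variant preprocess uses.

```python
from statistics import fmean, pstdev

def zscore(values):
    """Scale values so that they have mean 0 and standard deviation 1."""
    mu = fmean(values)
    sigma = pstdev(values, mu)  # population std dev: an assumption
    if sigma == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# A few player ages taken from the dataset preview in this tutorial.
ages = [34, 32, 36, 29, 30, 22]
scaled = zscore(ages)
print([round(z, 2) for z in scaled])
```

After scaling, values above the column mean are positive and values below it are negative, which puts columns with very different units (age, wage, height) on a comparable scale.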
Computer specs
I am using a Windows 10 Pro workstation with an Intel i7 processor and 32 GB of RAM.
Explore the dataset
Download the dataset
I am using a Windows system, so I can easily download the dataset from the command line with PowerShell's Invoke-WebRequest.
I successfully downloaded the dataset from the GitHub repo that hosts it. I encourage you to try it yourself.
Depending on your system, you may use curl or any other method to download the data as a text file.
The data is available here : https://raw.githubusercontent.com/agailloty/Outils-Analyses-R/refs/heads/main/dataset/fifa_players_22.csv
Invoke-WebRequest -Uri https://raw.githubusercontent.com/agailloty/Outils-Analyses-R/refs/heads/main/dataset/fifa_players_22.csv -Outfile "fifa_players.csv"
In my case, I save it on my system as fifa_players.csv. This is the file name I will pass to the preprocess CLI commands.
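If neither PowerShell nor curl is available, the same download can be sketched in Python with the standard library only. The helper name `download` is hypothetical. The URL is the one given above, and the actual call is left commented out because it needs network access.

```python
import shutil
import urllib.request

URL = ("https://raw.githubusercontent.com/agailloty/Outils-Analyses-R/"
       "refs/heads/main/dataset/fifa_players_22.csv")

def download(url, dest):
    """Stream the resource at `url` into the local file `dest`."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest

# Uncomment to fetch the dataset (requires network access):
# download(URL, "fifa_players.csv")
```

Streaming with `shutil.copyfileobj` avoids loading the whole file in memory, which matters little for this dataset but is a safe default for larger ones.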
Skim through the data
preprocess provides a convenient command to glance at the dataset.
Every preprocess command has built-in documentation to help you choose the right parameters for the task. Use the --help flag to explore the documentation of the command you wish to use. I use it quite often myself.
preprocess skim --help
will display
Print a subset of the dataset in the console
Usage:
preprocess skim [flags]
Flags:
-d, --data string Path to the dataset
-m, --dsep string Decimal separator (default ".")
-e, --encoding string Character encoding (default "utf-8")
-h, --help help for skim
-s, --sep string Separator for csv file (default ",")
Simply use the following command to skim through the dataset :
preprocess skim --data .\fifa_players.csv
By default, preprocess skim reads the CSV dataset with a comma (,) as the delimiter. You can specify another delimiter using the --sep flag.
The following command produces the same output as the previous one. The only difference is that here I explicitly set the delimiter.
preprocess skim --data .\fifa_players.csv --sep ","
Columns: 49 | Total rows: 16021
short_name age height_cm weight_kg [...] wage_eur league_level international_reputation preferred_foot body_type
----------------- --- --------- --------- ----- -------- ------------ ------------------------ -------------- ----------------
L, Messi 34 170 72 [...] 320000 1 5 Left Unique
R, Lewandowski 32 185 81 [...] 270000 1 5 Right Unique
Cristiano Ronaldo 36 187 83 [...] 270000 1 5 Right Unique
Neymar Jr 29 175 68 [...] 270000 1 5 Right Unique
K, De Bruyne 30 181 70 [...] 350000 1 4 Right Unique
K, Mbappé 22 182 73 [...] 230000 1 4 Right Unique
This command prints the first 30 rows and 10 columns of the dataset. It is just a convenient way to quickly see what the dataset looks like.
Goal 1 solved: the dataset contains 16021 rows and 49 columns.
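For readers who want to cross-check these dimensions outside of preprocess, here is a minimal Python sketch that counts rows and columns with the standard csv module. The function name `csv_shape` is hypothetical, and the sketch assumes the first line is a header that is not counted as a data row. It does not try to reproduce how preprocess itself counts rows.

```python
import csv

def csv_shape(path, sep=",", encoding="utf-8"):
    """Return (data_rows, columns), treating the first line as the header."""
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f, delimiter=sep)
        header = next(reader, [])          # column names
        rows = sum(1 for _ in reader)      # remaining lines are data rows
    return rows, len(header)

# Example call on the file downloaded earlier:
# rows, cols = csv_shape("fifa_players.csv")
```

Using csv.reader rather than splitting each line on the separator keeps quoted fields that contain a comma in a single column.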
Explore summary statistics
Now I would like to explore the dataset further, this time through statistics, to get a better idea of its structure.
preprocess offers two ways to explore your dataset:
- produce a TOML file with the summary statistics of all columns
- produce an HTML report to help you explore the summary statistics through your web browser
Using Summaryfile.toml
With the following command you can produce a TOML file that contains summary statistics for each column in the dataset.
First, let's display the help for this command with preprocess summary --help:
Generate dataset summary statistics.
By default it generates a TOML file that contains all summary statistics.
You can choose to generate a HTML file with the --html flag.
Usage:
preprocess summary [flags]
Flags:
-d, --data string Path to the dataset
-m, --dsep string Decimal separator (default ".")
-e, --encoding string Character encoding (default "utf-8")
-h, --help help for summary
-t, --html Generate HTML file
-o, --output string Output name for Summaryfile (default "Summaryfile.toml")
-f, --prepfile string Path to the configuration file (default "Prepfile.toml")
-s, --sep string Separator for csv file (default ",")
preprocess summary --data .\fifa_players.csv --output fifa_stats.toml
If the command succeeds, the following message is displayed.
fifa_stats.toml successfully generated.
Finished in : 254.9019ms
Now you can open and explore the content of the fifa_stats.toml file. Below is a subset of the fifa_stats.toml file.
[data]
filename = '.\fifa_players.csv'
csv_separator = ','
decimal_separator = '.'
encoding = 'utf-8'
missing_identifier = ''
[data_summary]
rows_count = 16020
columns_count = 48
numeric_columns = 46
string_columns = 2
[[columns]]
name = 'age'
type = 'numeric'
rows_count = 16020
min = 16.0
mean = 25.19307116104869
median = 25.0
max = 39.0
missing = 0
[[columns]]
name = 'height_cm'
type = 'numeric'
rows_count = 16020
min = 155.0
mean = 180.3483770287141
median = 180.0
max = 203.0
missing = 0
You can exclude some columns from the summary using the --exclude flag.
In my case, I want to exclude the short_name column from the summary because it only holds the names of the players. Counting the occurrences of each name adds no value.
preprocess summary --data .\fifa_players.csv --exclude short_name --output fifa_stats.toml
Using an HTML report
To generate an HTML report, you just need to append the --html flag to the previous command.
preprocess summary --data .\fifa_players.csv --exclude short_name --output fifa_stats.toml --html
It will generate an HTML file that contains the summary statistics. You can use your web browser to explore them.
As an example, you will get a web page that looks like the following:
[Figure: the remaining statistics]
Goal 2 solved : I am able to explore summary statistics on numeric and text columns
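To clarify what these statistics mean, here is a minimal Python sketch that computes them by hand for one column. It is an illustration under stated assumptions, not the implementation used by preprocess. The function names are hypothetical, and missing values are assumed to be empty strings, as suggested by the missing_identifier setting in the Summaryfile.

```python
from collections import Counter
from statistics import fmean, median

def numeric_summary(values, missing=""):
    """Missing count, min, mean, median and max of a numeric column."""
    present = [float(v) for v in values if v != missing]
    summary = {"missing": len(values) - len(present)}
    if present:  # statistics are undefined for an all-missing column
        summary.update(min=min(present), mean=fmean(present),
                       median=median(present), max=max(present))
    return summary

def text_summary(values, missing=""):
    """Missing count and number of occurrences of each modality."""
    present = [v for v in values if v != missing]
    return {"missing": len(values) - len(present),
            "modalities": dict(Counter(present))}

print(numeric_summary(["34", "32", "", "36"]))
print(text_summary(["Left", "Right", "Right", ""]))
```

Note that the mean and median are computed on the non-missing values only, so a column with many gaps can still report a plausible mean.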
Preprocessing the data
Now let’s explore how to use preprocess CLI for data preprocessing. Let’s suppose we want to preprocess the data so we can later feed it into a machine learning pipeline.
Introducing Prepfile
If you have worked with tools such as Docker, you will have noticed that they let the user declare the operations they wish to perform in simple files such as Dockerfile or compose.yml.
Preprocess CLI implements a similar idea with the Prepfile.toml.
With the init command you can use Preprocess CLI to generate a Prepfile.
The Prepfile contains three sections :
- Data section [data]
- Preprocess section [preprocess]
- Postprocess section [postprocess]
In the data section you provide all necessary information to read the dataset on which you want to apply the operations. The preprocess section contains the operations that will be applied on the dataset. The postprocess section contains what to do after the preprocessing steps.
Generate the Prepfile for this project
Let’s use the init command to generate a Prepfile for our dataset.
preprocess init --help
This command is used to generate a Prepfile in which you can specify the preprocessing computations either on specifics columns of the dataset or on whole numeric or text columns.
Usage:
preprocess init [flags]
Flags:
-d, --data string Path to the dataset
-m, --dsep string Decimal separator (default ".")
-e, --encoding string Character encoding (default "utf-8")
-h, --help help for init
-o, --output string Output name for Prepfile (default "Prepfile.toml")
-s, --sep string Separator for csv file (default ",")
-t, --template Generate example Prepfile.toml template
preprocess init --data .\fifa_players.csv
If you provide only the --data flag, the command defaults to utf-8 for the character encoding, a comma (,) for the CSV separator and a period (.) for the decimal separator.
The Prepfile can be quite long depending on how many columns the dataset has. For brevity, I show a truncated version of the Prepfile below.
[data]
filename = '.\fifa_players.csv'
csv_separator = ','
decimal_separator = '.'
encoding = 'utf-8'
missing_identifier = ''
[preprocess]
[[preprocess.columns]]
name = 'short_name'
type = 'string'
[[preprocess.columns]]
name = 'age'
type = 'int'
[[preprocess.columns]]
name = 'height_cm'
type = 'int'
[[preprocess.columns]]
name = 'wage_eur'
type = 'int'
[[preprocess.columns]]
name = 'league_level'
type = 'int'
[[preprocess.columns]]
name = 'international_reputation'
type = 'int'
[[preprocess.columns]]
name = 'preferred_foot'
type = 'string'
[[preprocess.columns]]
name = 'body_type'
type = 'string'
[postprocess]
format = 'csv'
filename = 'fifa_players_cleaned.csv'
With the Prepfile I can apply operations on individual columns of the dataset.
For example, if I want to fill the missing values of the wage_eur column with the column's mean (computed on the non-missing values only), I do the following.
[[preprocess.columns]]
name = 'wage_eur'
type = 'int'
operations = [
{op = "fillna", method = "mean"}
]
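To illustrate what a mean fill does to a column, here is a minimal Python sketch of the idea. It is not the code preprocess runs. The function name `fill_with_mean` is hypothetical, missing values are assumed to be empty strings, and whether preprocess rounds the mean for an int column is not something this sketch tries to reproduce.

```python
from statistics import fmean

def fill_with_mean(values, missing=""):
    """Replace missing entries with the mean of the non-missing ones."""
    present = [float(v) for v in values if v != missing]
    if not present:
        # Nothing to average: leave the column unchanged.
        return list(values)
    mean = fmean(present)
    return [mean if v == missing else float(v) for v in values]

# Wages of two players from the preview, with one value missing in between.
print(fill_with_mean(["320000", "", "270000"]))
```

The mean is computed before any value is replaced, so the filled entries do not influence the statistic used to fill them.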
Last updated 30 May 2025, 13:32 +0200 .