Prepfile

Introducing Prepfile

If you have worked with tools such as Docker, you'll have noticed that they let you declare the operations you want to perform in simple files such as a Dockerfile or compose.yml.

Preprocess CLI implements a similar idea with the Prepfile.toml.

The Prepfile.toml contains three sections:

Data section [data]
Preprocess section [preprocess]
Postprocess section [postprocess]

The data section provides all the information needed to read the dataset on which you want to apply the operations. The preprocess section lists the operations that will be applied to the dataset. The postprocess section describes what to do once the preprocessing steps have finished.
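As a rough sketch of the overall layout, a Prepfile.toml with the three sections might look like the skeleton below. The file name `dataset.csv` is a hypothetical placeholder, and the comments only summarise each section's role; this is an outline of the structure, not a complete working configuration.

```toml
# Prepfile.toml: skeleton showing the three top-level sections

[data]
# How to read the dataset. `filename` is the only mandatory parameter.
filename = "dataset.csv"  # hypothetical path, replace with your own

[preprocess]
# Operations applied to the dataset. They are declared in sub-tables
# such as [preprocess.numerics], [preprocess.texts] and [[preprocess.columns]].

[postprocess]
# What to do once preprocessing has finished.
```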

Prepfile tags

[data]

| Parameter | Default Value | Description |
|---|---|---|
| `filename` | N/A | The path to the dataset. This parameter is mandatory. |
| `csv_separator` | N/A | The separator used when reading the CSV file. Accepts any single character, for example `";"`, `"\t"`, `" "`. |
| `decimal_separator` | `"."` | The decimal separator used when reading the dataset. Possible values include `","` and `"."`. |
| `encoding` | `"utf-8"` | The character set encoding. Possible values are `"utf-8"`, `"latin-1"`. |
| `missing_identifier` | `""` | How missing values are represented in the dataset. For most data this is the empty string `""`, but some datasets explicitly fill missing values with `"NA"`, `"N/A"` … |
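To make the parameters above concrete, here is a minimal sketch of a `[data]` section for a semicolon-separated file that uses a comma as the decimal separator and marks missing values with `"NA"`. The file name `indicators.csv` is a hypothetical placeholder; every other key and value is taken from the table.

```toml
[data]
filename = "indicators.csv"   # hypothetical path to the dataset (mandatory)
csv_separator = ";"           # fields are separated by semicolons
decimal_separator = ","       # numbers are written like 3,14
encoding = "latin-1"          # character set encoding of the file
missing_identifier = "NA"     # missing values appear as NA
```

Parameters left out of the section fall back to the defaults listed in the table, so a file that is UTF-8 encoded and uses `.` as the decimal separator would not need `encoding` or `decimal_separator` at all.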

[preprocess]

[preprocess.numerics]

[preprocess.texts]

The following syntax:

[preprocess.texts]
operations = [
    {op = "fillna", method = "mean"}
]

is equivalent to:

[[preprocess.texts.operations]]
op = "fillna"
method = "mean"

However, you cannot mix the two styles. Choose one and use it throughout the Prepfile.

[[preprocess.columns]]

[postprocess]

[postprocess]
dropcolumns = ["BALANCE","SURFACE_AREA","EXPORTS_GOOD_SERVICES"]
format = 'csv'
sortdataset = {descending = false}
filename = 'indicators_numerics_cleaned.csv'

Last updated 31 May 2025, 21:32 +0200
