Data Cleanup Operations

In this example we are going to perform cleanup operations on a real-world dataset.

In this example we will perform the following steps

  • Writing cleanup operations dataflow

  • Using merge command to see the preprocessed data

  • Training our model on the preprocessed data

  • Getting the accuracy of the model

First install the data cleanup operations and scikit models

$ python -m pip install dffml-operations-data dffml-model-scikit

Dataset

The dataset we will be using is available on kaggle https://www.kaggle.com/uciml/mushroom-classification

you may go ahead and download the dataset

$ curl -fLO https://github.com/intel/dffml/files/7040983/mushrooms.csv

Data Cleanup Operations

We will be performing two cleanup operations on our dataset - ordinal_encoder , As seen in the dataset all the values are categorical values haivng a similar relationship. So we can convert these categorical values into numerical values.

Data Cleanup

** cleanup_ops.sh **

$ dffml dataflow create \
    -config \
        "mushrooms.csv"=convert_records_to_list.source.config.filename \
        csv=convert_records_to_list.source.plugin \
    -inputs \
        '["records"]'=get_single_spec \
        '["class","cap-shape","cap-surface","cap-color","bruises","odor","gill-attachment","gill-spacing","gill-size","gill-color","stalk-shape","stalk-root","stalk-surface-above-ring","stalk-surface-below-ring","stalk-color-above-ring","stalk-color-below-ring","veil-type","veil-color","ring-number","ring-type","spore-print-color","population","habitat"]'=features \
        '[]'=predict_features \
    -flow \
        '[{"convert_records_to_list": "matrix"}]'=ordinal_encoder.inputs.data \
        '[{"ordinal_encoder": "result"}]'=convert_list_to_records.inputs.matrix \
        '[{"convert_records_to_list": "keys"}]'=convert_list_to_records.inputs.keys \
        --\
            convert_list_to_records \
            convert_records_to_list \
            ordinal_encoder \
            get_single | \
    tee clean_ops.json

Data Merge

To have a look at the preprocessed data, we can use the merge command.

** merge.sh **

$ dffml merge text=df temp=csv \
    -source-text-dataflow clean_ops.json \
    -source-text-features class:float:1 cap-shape:float:1 cap-surface:float:1 cap-color:float:1 bruises:float:1 odor:float:1 gill-attachment:float:1 gill-spacing:float:1 gill-size:float:1 gill-color:float:1 stalk-shape:float:1 stalk-root:float:1 stalk-surface-above-ring:float:1 stalk-surface-below-ring:float:1 stalk-color-above-ring:float:1 stalk-color-below-ring:float:1 veil-type:float:1 veil-color:float:1 ring-number:float:1 ring-type:float:1 spore-print-color:float:1 population:float:1 habitat:float:1 \
    -source-text-source csv \
    -source-text-source-filename mushrooms.csv \
    -source-temp-filename preprocessed.csv \
    -source-temp-allowempty \
    -source-temp-readwrite \
    -log debug

$ cat preprocessed.csv

Training

Now we will be training our model using the preprocessed dataset

** train.sh **

$ dffml train \
    -model scikitmnb \
    -model-features cap-shape:float:1 cap-surface:float:1 cap-color:float:1 bruises:float:1 odor:float:1 gill-attachment:float:1 gill-spacing:float:1 gill-size:float:1 gill-color:float:1 stalk-shape:float:1 stalk-root:float:1 stalk-surface-above-ring:float:1 stalk-surface-below-ring:float:1 stalk-color-above-ring:float:1 stalk-color-below-ring:float:1 veil-type:float:1 veil-color:float:1 ring-number:float:1 ring-type:float:1 spore-print-color:float:1 population:float:1 habitat:float:1 \
    -model-predict class:str:1 \
    -model-location tempdir \
    -sources f=csv \
    -source-filename preprocessed.csv \
    -log debug

Accuracy

After training the model we can now look for accuracy of the trained model

** accuracy.sh **

$ dffml accuracy \
    -model scikitmnb \
    -scorer logloss \
    -features class:str:1 \
    -model-features cap-shape:float:1 cap-surface:float:1 cap-color:float:1 bruises:float:1 odor:float:1 gill-attachment:float:1 gill-spacing:float:1 gill-size:float:1 gill-color:float:1 stalk-shape:float:1 stalk-root:float:1 stalk-surface-above-ring:float:1 stalk-surface-below-ring:float:1 stalk-color-above-ring:float:1 stalk-color-below-ring:float:1 veil-type:float:1 veil-color:float:1 ring-number:float:1 ring-type:float:1 spore-print-color:float:1 population:float:1 habitat:float:1 \
    -model-predict class:str:1 \
    -model-location tempdir \
    -sources f=csv \
    -source-filename preprocessed.csv \
    -log debug

Conclusion

Thus, we performed cleanup operations on a classfication dataset.