Data Cleanup Operations¶
In this example we will perform cleanup operations on a real-world dataset. We will go through the following steps:

- Writing a cleanup operations dataflow
- Using the merge command to see the preprocessed data
- Training our model on the preprocessed data
- Getting the accuracy of the model
- Checking the accuracy of the model without cleanup of the data
First, install the data cleanup operations and the scikit models:
$ python -m pip install dffml-operations-data dffml-model-scikit
Dataset¶
The dataset we will be using is available on Kaggle at https://www.kaggle.com/harlfoxem/housesalesprediction. Go ahead and download it:
$ curl -fLO https://github.com/intel/dffml/files/7046671/kc_house_data.csv
Data Cleanup Operations¶
We will be performing two cleanup operations on our dataset:

- standard_scaler will standardize each feature to have zero mean and unit variance
- principal_component_analysis will reduce the data to shape (number of samples, number of components)
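To get a feel for what these two operations do, here is a minimal scikit-learn sketch on a made-up matrix. This is for illustration only and is not part of the dataflow:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up data: 4 samples, 3 features
X = np.array([[1.0, 20.0, 300.0],
              [2.0, 22.0, 310.0],
              [3.0, 19.0, 305.0],
              [4.0, 21.0, 315.0]])

# standard_scaler: each column ends up with mean ~0 and standard deviation 1
scaled = StandardScaler().fit_transform(X)
print(scaled.mean(axis=0))  # ~[0. 0. 0.]
print(scaled.std(axis=0))   # [1. 1. 1.]

# principal_component_analysis: (n_samples, n_features) -> (n_samples, n_components).
# With n_components=None, as in the dataflow below, all components are kept.
reduced = PCA(n_components=2).fit_transform(scaled)
print(reduced.shape)  # (4, 2)

Now create the dataflow that wires these operations together: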
$ dffml dataflow create \
-config \
"kc_house_data.csv"=convert_records_to_list.source.config.filename \
csv=convert_records_to_list.source.plugin \
-inputs \
'["records"]'=get_single_spec \
'["bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors", "waterfront", "view", "condition", "grade", "sqft_above", "sqft_basement", "yr_built", "yr_renovated", "zipcode", "lat", "long", "sqft_living15", "sqft_lot15"]'=features \
'["price"]'=predict_features \
None=n_components \
-flow \
'[{"convert_records_to_list": "matrix"}]'=standard_scaler.inputs.data \
'[{"standard_scaler": "result"}]'=principal_component_analysis.inputs.data \
'[{"seed": ["n_components"]}]'=principal_component_analysis.inputs.n_components \
'[{"principal_component_analysis": "result"}]'=convert_list_to_records.inputs.matrix \
'[{"convert_records_to_list": "keys"}]'=convert_list_to_records.inputs.keys \
-- \
convert_list_to_records \
convert_records_to_list \
principal_component_analysis \
standard_scaler \
get_single | \
tee clean_ops.json
Merge Command¶
Now we will run the dataflow on our dataset. With the help of the merge command we can see what our preprocessed data looks like:
$ dffml merge text=df temp=csv \
-source-text-dataflow clean_ops.json \
-source-text-features price:float:1 bedrooms:float:1 bathrooms:float:1 sqft_living:float:1 sqft_lot:float:1 floors:str:1 waterfront:float:1 view:float:1 condition:float:1 grade:float:1 sqft_above:float:1 sqft_basement:float:1 yr_built:float:1 yr_renovated:float:1 zipcode:str:1 lat:float:1 long:float:1 sqft_living15:float:1 sqft_lot15:float:1 \
-source-text-source csv \
-source-text-source-filename kc_house_data.csv \
-source-temp-filename preprocessed.csv \
-source-temp-allowempty \
-source-temp-readwrite \
-log debug
$ cat preprocessed.csv
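For readers who want to sanity-check the preprocessed file, here is a rough pandas/scikit-learn equivalent of the dataflow plus merge step above, assuming kc_house_data.csv is in the working directory. Column handling is simplified (the real dataflow also carries record keys through), and the output filename preprocessed_check.csv is arbitrary, chosen so as not to overwrite the file produced by the merge command:

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = ["bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors",
            "waterfront", "view", "condition", "grade", "sqft_above",
            "sqft_basement", "yr_built", "yr_renovated", "zipcode",
            "lat", "long", "sqft_living15", "sqft_lot15"]

df = pd.read_csv("kc_house_data.csv")

# Standardize, then apply PCA; n_components=None keeps all 18 components,
# so the transformed matrix has the same shape as the input
transformed = PCA(n_components=None).fit_transform(
    StandardScaler().fit_transform(df[features])
)

out = pd.DataFrame(transformed, columns=features)
out["price"] = df["price"].values
out.to_csv("preprocessed_check.csv", index=False)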
Training¶
Now we will train our model on the preprocessed dataset that we just created using the merge command.
$ dffml train \
-model scikiteln \
-model-features bedrooms:float:1 bathrooms:float:1 sqft_living:float:1 sqft_lot:float:1 floors:str:1 waterfront:float:1 view:float:1 condition:float:1 grade:float:1 sqft_above:float:1 sqft_basement:float:1 yr_built:float:1 yr_renovated:float:1 zipcode:str:1 lat:float:1 long:float:1 sqft_living15:float:1 sqft_lot15:float:1 \
-model-predict price:float:1 \
-model-location tempdir \
-sources f=csv \
-source-filename preprocessed.csv \
-log debug
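The scikiteln model wraps scikit-learn's ElasticNet regressor. As a rough sketch, the training step above is roughly equivalent to the following plain scikit-learn code, assuming the preprocessed.csv produced by the merge command:

import pandas as pd
from sklearn.linear_model import ElasticNet

features = ["bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors",
            "waterfront", "view", "condition", "grade", "sqft_above",
            "sqft_basement", "yr_built", "yr_renovated", "zipcode",
            "lat", "long", "sqft_living15", "sqft_lot15"]

df = pd.read_csv("preprocessed.csv")
X, y = df[features], df["price"]

# Fit an ElasticNet with default hyperparameters on the preprocessed features
model = ElasticNet()
model.fit(X, y)
print(model.predict(X[:5]))  # predictions for the first five records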
Accuracy¶
After training on the dataset, we can check the accuracy of the model.
$ dffml accuracy \
-model scikiteln \
-scorer exvscore \
-features price:float:1 \
-model-features bedrooms:float:1 bathrooms:float:1 sqft_living:float:1 sqft_lot:float:1 floors:str:1 waterfront:float:1 view:float:1 condition:float:1 grade:float:1 sqft_above:float:1 sqft_basement:float:1 yr_built:float:1 yr_renovated:float:1 zipcode:str:1 lat:float:1 long:float:1 sqft_living15:float:1 sqft_lot15:float:1 \
-model-predict price:float:1 \
-model-location tempdir \
-sources f=csv \
-source-filename preprocessed.csv \
-log debug
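The exvscore scorer corresponds to scikit-learn's explained variance score, which is 1.0 for a perfect fit and drops as the regression errors grow. A toy sketch of the metric itself, with made-up numbers:

from sklearn.metrics import explained_variance_score

y_true = [3.0, 2.0, 7.0, 1.0]  # actual values (made up)
y_pred = [2.5, 2.0, 8.0, 1.0]  # model predictions (made up)

# Computes 1 - Var(y_true - y_pred) / Var(y_true)
print(explained_variance_score(y_true, y_pred))  # ~0.94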
Without Cleanup Operations¶
Here we will check the accuracy of the model without performing the cleanup operations, training directly on the raw dataset.
$ dffml train \
-model scikiteln \
-model-features bedrooms:float:1 bathrooms:float:1 sqft_living:float:1 sqft_lot:float:1 floors:str:1 waterfront:float:1 view:float:1 condition:float:1 grade:float:1 sqft_above:float:1 sqft_basement:float:1 yr_built:float:1 yr_renovated:float:1 zipcode:str:1 lat:float:1 long:float:1 sqft_living15:float:1 sqft_lot15:float:1 \
-model-predict price:float:1 \
-model-location tempdir \
-sources f=csv \
-source-filename kc_house_data.csv \
-log debug
$ dffml accuracy \
-model scikiteln \
-scorer exvscore \
-features price:float:1 \
-model-features bedrooms:float:1 bathrooms:float:1 sqft_living:float:1 sqft_lot:float:1 floors:str:1 waterfront:float:1 view:float:1 condition:float:1 grade:float:1 sqft_above:float:1 sqft_basement:float:1 yr_built:float:1 yr_renovated:float:1 zipcode:str:1 lat:float:1 long:float:1 sqft_living15:float:1 sqft_lot15:float:1 \
-model-predict price:float:1 \
-model-location tempdir \
-sources f=csv \
-source-filename kc_house_data.csv \
-log debug
Conclusion¶
We can see that performing the cleanup operations to preprocess the data increased the model's accuracy and also reduced its training time.