Data Cleanup Operations¶
In this example we will perform cleanup operations on a real-world dataset. We will go through the following steps:

- Writing a cleanup operations dataflow
- Using the merge command to see the preprocessed data
- Training our model on the preprocessed data
- Getting the accuracy of the model
- Checking the accuracy of the model without cleanup of the data
First, install the data cleanup operations and the scikit models:
$ python -m pip install dffml-operations-data dffml-model-scikit
Dataset¶
The dataset we will be using is available on Kaggle at https://www.kaggle.com/harlfoxem/housesalesprediction. Go ahead and download it:
$ curl -fLO https://github.com/intel/dffml/files/7046671/kc_house_data.csv
Data Cleanup Operations¶
We will be performing two cleanup operations on our dataset:

- standard_scaler will standardize each feature to have zero mean and unit variance
- principal_component_analysis will reduce the data to shape (number of samples, number of components)
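To get a feel for what these two operations do, here is a minimal scikit-learn sketch on a made-up matrix. This is for illustration only and is not part of the dataflow:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up data: 4 samples, 3 features
X = np.array([[1.0, 20.0, 300.0],
              [2.0, 22.0, 310.0],
              [3.0, 19.0, 305.0],
              [4.0, 21.0, 315.0]])

# standard_scaler: each column ends up with mean ~0 and standard deviation 1
scaled = StandardScaler().fit_transform(X)
print(scaled.mean(axis=0))  # ~[0. 0. 0.]
print(scaled.std(axis=0))   # [1. 1. 1.]

# principal_component_analysis: (n_samples, n_features) -> (n_samples, n_components).
# With n_components=None, as in the dataflow below, all components are kept.
reduced = PCA(n_components=2).fit_transform(scaled)
print(reduced.shape)  # (4, 2)

Now create the dataflow that wires these operations together: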
$ dffml dataflow create \
-config \
"kc_house_data.csv"=convert_records_to_list.source.config.filename \
csv=convert_records_to_list.source.plugin \
-inputs \
'["records"]'=get_single_spec \
'["bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors", "waterfront", "view", "condition", "grade", "sqft_above", "sqft_basement", "yr_built", "yr_renovated", "zipcode", "lat", "long", "sqft_living15", "sqft_lot15"]'=features \
'["price"]'=predict_features \
None=n_components \
-flow \
'[{"convert_records_to_list": "matrix"}]'=standard_scaler.inputs.data \
'[{"standard_scaler": "result"}]'=principal_component_analysis.inputs.data \
'[{"seed": ["n_components"]}]'=principal_component_analysis.inputs.n_components \
'[{"principal_component_analysis": "result"}]'=convert_list_to_records.inputs.matrix \
'[{"convert_records_to_list": "keys"}]'=convert_list_to_records.inputs.keys \
-- \
convert_list_to_records \
convert_records_to_list \
principal_component_analysis \
standard_scaler \
get_single | \
tee clean_ops.json
Merge Command¶
Now we will run the dataflow on our dataset. With the help of the merge command we can see what our preprocessed data looks like:
$ dffml merge text=df temp=csv \
-source-text-dataflow clean_ops.json \
-source-text-features price:float:1 bedrooms:float:1 bathrooms:float:1 sqft_living:float:1 sqft_lot:float:1 floors:str:1 waterfront:float:1 view:float:1 condition:float:1 grade:float:1 sqft_above:float:1 sqft_basement:float:1 yr_built:float:1 yr_renovated:float:1 zipcode:str:1 lat:float:1 long:float:1 sqft_living15:float:1 sqft_lot15:float:1 \
-source-text-source csv \
-source-text-source-filename kc_house_data.csv \
-source-temp-filename preprocessed.csv \
-source-temp-allowempty \
-source-temp-readwrite \
-log debug
$ cat preprocessed.csv
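For readers who want to sanity-check the preprocessed file, here is a rough pandas/scikit-learn equivalent of the dataflow plus merge step above, assuming kc_house_data.csv is in the working directory. Column handling is simplified (the real dataflow also carries record keys through), and the output filename preprocessed_check.csv is arbitrary, chosen so as not to overwrite the file produced by the merge command:

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = ["bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors",
            "waterfront", "view", "condition", "grade", "sqft_above",
            "sqft_basement", "yr_built", "yr_renovated", "zipcode",
            "lat", "long", "sqft_living15", "sqft_lot15"]

df = pd.read_csv("kc_house_data.csv")

# Standardize, then apply PCA; n_components=None keeps all 18 components,
# so the transformed matrix has the same shape as the input
transformed = PCA(n_components=None).fit_transform(
    StandardScaler().fit_transform(df[features])
)

out = pd.DataFrame(transformed, columns=features)
out["price"] = df["price"].values
out.to_csv("preprocessed_check.csv", index=False)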
Training¶
Now we will train our model on the preprocessed dataset that we just created using the merge command.
$ dffml train \
-model scikiteln \
-model-features bedrooms:float:1 bathrooms:float:1 sqft_living:float:1 sqft_lot:float:1 floors:str:1 waterfront:float:1 view:float:1 condition:float:1 grade:float:1 sqft_above:float:1 sqft_basement:float:1 yr_built:float:1 yr_renovated:float:1 zipcode:str:1 lat:float:1 long:float:1 sqft_living15:float:1 sqft_lot15:float:1 \
-model-predict price:float:1 \
-model-location tempdir \
-sources f=csv \
-source-filename preprocessed.csv \
-log debug
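The scikiteln model wraps scikit-learn's ElasticNet regressor. As a rough sketch, the training step above is roughly equivalent to the following plain scikit-learn code, assuming the preprocessed.csv produced by the merge command:

import pandas as pd
from sklearn.linear_model import ElasticNet

features = ["bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors",
            "waterfront", "view", "condition", "grade", "sqft_above",
            "sqft_basement", "yr_built", "yr_renovated", "zipcode",
            "lat", "long", "sqft_living15", "sqft_lot15"]

df = pd.read_csv("preprocessed.csv")
X, y = df[features], df["price"]

# Fit an ElasticNet with default hyperparameters on the preprocessed features
model = ElasticNet()
model.fit(X, y)
print(model.predict(X[:5]))  # predictions for the first five records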
Accuracy¶
After training on the dataset, we can check the accuracy of the model.
$ dffml accuracy \
-model scikiteln \
-scorer exvscore \
-features price:float:1 \
-model-features bedrooms:float:1 bathrooms:float:1 sqft_living:float:1 sqft_lot:float:1 floors:str:1 waterfront:float:1 view:float:1 condition:float:1 grade:float:1 sqft_above:float:1 sqft_basement:float:1 yr_built:float:1 yr_renovated:float:1 zipcode:str:1 lat:float:1 long:float:1 sqft_living15:float:1 sqft_lot15:float:1 \
-model-predict price:float:1 \
-model-location tempdir \
-sources f=csv \
-source-filename preprocessed.csv \
-log debug
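The exvscore scorer corresponds to scikit-learn's explained variance score, which is 1.0 for a perfect fit and drops as the regression errors grow. A toy sketch of the metric itself, with made-up numbers:

from sklearn.metrics import explained_variance_score

y_true = [3.0, 2.0, 7.0, 1.0]  # actual values (made up)
y_pred = [2.5, 2.0, 8.0, 1.0]  # model predictions (made up)

# Computes 1 - Var(y_true - y_pred) / Var(y_true)
print(explained_variance_score(y_true, y_pred))  # ~0.94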
Without Cleanup Operations¶
Here we will check the accuracy of the model without performing the cleanup operations, training directly on the raw dataset.
$ dffml train \
-model scikiteln \
-model-features bedrooms:float:1 bathrooms:float:1 sqft_living:float:1 sqft_lot:float:1 floors:str:1 waterfront:float:1 view:float:1 condition:float:1 grade:float:1 sqft_above:float:1 sqft_basement:float:1 yr_built:float:1 yr_renovated:float:1 zipcode:str:1 lat:float:1 long:float:1 sqft_living15:float:1 sqft_lot15:float:1 \
-model-predict price:float:1 \
-model-location tempdir \
-sources f=csv \
-source-filename kc_house_data.csv \
-log debug
$ dffml accuracy \
-model scikiteln \
-scorer exvscore \
-features price:float:1 \
-model-features bedrooms:float:1 bathrooms:float:1 sqft_living:float:1 sqft_lot:float:1 floors:str:1 waterfront:float:1 view:float:1 condition:float:1 grade:float:1 sqft_above:float:1 sqft_basement:float:1 yr_built:float:1 yr_renovated:float:1 zipcode:str:1 lat:float:1 long:float:1 sqft_living15:float:1 sqft_lot15:float:1 \
-model-predict price:float:1 \
-model-location tempdir \
-sources f=csv \
-source-filename kc_house_data.csv \
-log debug
Conclusion¶
We can see that performing the cleanup operations to preprocess the data increased the model's accuracy and also reduced its training time.