Ice Cream Sales

In this tutorial we are going to predict ice cream sales.

This example consists of the following steps:

  • Writing operations to get population and temperature.

  • Creating a dataflow from the operations we have written.

  • Using the merge command for data preprocessing.

  • Training a model on the preprocessed data.

  • Evaluating the accuracy of the model.

  • Getting predictions from the model using the test dataset.

Dataset

We have two datasets, one for training and one for testing. Each has the fields city, state, month, and sales. Both are dummy datasets.

dataset.csv

city,state,month,sales
Las Vegas city,Nevada,5,77131
Las Vegas city,Nevada,6,667467
Las Vegas city,Nevada,7,674107
Las Vegas city,Nevada,8,674667
Las Vegas city,Nevada,9,77131
Chicago city,Illinois,10,12000
Phoenix city,Arizona,7,1604012
Phoenix city,Arizona,8,1604172
Las Vegas city,Nevada,10,77131
Chicago city,Illinois,8,73440
Seattle city,Washington,10,87367
Chicago city,Illinois,11,12000
Las Vegas city,Nevada,11,77131
Miami city,Florida,9,58796
Phoenix city,Arizona,6,1598492
Chicago city,Illinois,12,12000
Seattle city,Washington,11,87367
Orlando city,Florida,7,337737
Phoenix city,Arizona,9,180099
Miami city,Florida,10,58796
Chicago city,Illinois,9,12000
Phoenix city,Arizona,10,180099
Seattle city,Washington,12,87367
Orlando city,Florida,8,337737
Las Vegas city,Nevada,12,77131
Miami city,Florida,11,58796
Orlando city,Florida,9,40744
Phoenix city,Arizona,11,180099
Chicago city,Illinois,1,12000
Orlando city,Florida,10,40744
Seattle city,Washington,1,87367
Miami city,Florida,12,58796
Orlando city,Florida,11,40744
Phoenix city,Arizona,12,180099
Chicago city,Illinois,2,12000
Orlando city,Florida,12,40744
Seattle city,Washington,2,87367
Orlando city,Florida,1,40744
Portland city (pt.),Oregon,1,12179
Miami city,Florida,1,58796
Chicago city,Illinois,3,12000
Portland city (pt.),Oregon,2,12179
Seattle city,Washington,3,87367
Phoenix city,Arizona,1,180099
Portland city (pt.),Oregon,3,12179
Orlando city,Florida,2,40744
Portland city (pt.),Oregon,4,12179
Miami city,Florida,2,58796
Seattle city,Washington,4,87367
Chicago city,Illinois,4,12000
Portland city (pt.),Oregon,5,12179
Miami city,Florida,3,58796
Phoenix city,Arizona,2,180099
Portland city (pt.),Oregon,6,64972
Seattle city,Washington,5,87367
Orlando city,Florida,3,40744
Portland city (pt.),Oregon,7,69772
Miami city,Florida,4,58796
Chicago city,Illinois,5,12000
Portland city (pt.),Oregon,8,70492
Seattle city,Washington,6,739747
Phoenix city,Arizona,3,180099
Portland city (pt.),Oregon,9,12179
Miami city,Florida,5,58796
Orlando city,Florida,4,40744
Portland city (pt.),Oregon,10,12179
Seattle city,Washington,7,743907
Chicago city,Illinois,6,71200
Portland city (pt.),Oregon,11,12179
Miami city,Florida,6,500206
Phoenix city,Arizona,4,180099
Portland city (pt.),Oregon,12,12179
Seattle city,Washington,8,744627
Orlando city,Florida,5,40744
Las Vegas city,Nevada,1,77131
Seattle city,Washington,9,87367
Chicago city,Illinois,7,75360
Las Vegas city,Nevada,2,77131
Miami city,Florida,7,500846
Phoenix city,Arizona,5,180099
Las Vegas city,Nevada,3,77131
Miami city,Florida,8,501726
Orlando city,Florida,6,336537
Las Vegas city,Nevada,4,77131

test_dataset.csv

city,state,month,sales
Salem city,Oregon,10,29436
Salem city,Oregon,7,224048
Salem city,Oregon,4,29436
Salem city,Oregon,1,29436
Salem city,Oregon,11,29436
Salem city,Oregon,8,224528
Salem city,Oregon,5,29436
Salem city,Oregon,2,29436
Salem city,Oregon,12,29436
Salem city,Oregon,9,29436
Salem city,Oregon,6,218928
Salem city,Oregon,3,29436
Buffalo city,New York,12,37528
Buffalo city,New York,3,37528
Buffalo city,New York,9,37528
Buffalo city,New York,6,295435
Buffalo city,New York,4,37528
Buffalo city,New York,10,37528
Buffalo city,New York,1,37528
Buffalo city,New York,7,303835
Buffalo city,New York,5,37528
Buffalo city,New York,2,37528
Buffalo city,New York,8,300555
Buffalo city,New York,11,37528

Operations

We will be writing two operations:

  1. An operation to get the temperature, given the city and month.

  2. An operation to get the population, given the city and state.

First we need to find official sources for the temperature and population of the cities in our dataset.

We will be using https://www.ncdc.noaa.gov/cdo-web/ for the temperature datasets and https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/cities/totals/ for the population datasets.

Since the datasets are huge, we use only the specific files that contain the values we need.

We will be using the cached_download function, which caches downloaded files and speeds up development.
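
To get a feel for what a caching download helper does, here is a rough sketch (an assumption about the behavior, not DFFML's actual implementation): download once, verify a SHA-384 hash, and reuse the cached file on later calls.

```python
import hashlib
import pathlib
import urllib.request


def cached_download_sketch(
    url: str, target: pathlib.Path, expected_sha384: str
) -> pathlib.Path:
    """Download url to target unless a file with the expected hash is cached."""
    if target.exists():
        if hashlib.sha384(target.read_bytes()).hexdigest() == expected_sha384:
            return target  # cache hit: skip the network entirely
    target.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, target)
    if hashlib.sha384(target.read_bytes()).hexdigest() != expected_sha384:
        raise ValueError(f"SHA-384 mismatch for {url}")
    return target
```

The hash check means a corrupted or tampered download fails loudly instead of silently poisoning the cache.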

operations.py

import json
import pathlib

from dffml import (
    op,
    load,
    export,
    Definition,
    cached_download,
)


temperature_dataset_urls = {
    "Phoenix city": {
        "url": "https://www.ncdc.noaa.gov/cag/city/time-series/USW00023183-tavg-all-1-2020-2021.json?base_prd=true&begbaseyear=1933&endbaseyear=2000",
        "expected_sha384_hash": "f26d3d8cb691f2c5544d05da35995329028cf04356f8018a94102701bc49edd34911a0a076eed376a11f1514ce06b277",
    },
    "Orlando city": {
        "url": "https://www.ncdc.noaa.gov/cag/city/time-series/USW00012815-tavg-all-2-2020-2021.json?base_prd=true&begbaseyear=1952&endbaseyear=2000",
        "expected_sha384_hash": "e239a9245188797da6c2b73a837785f075c462a8de6fce135a3db94c4155b586456263961c68ead2758f8185ef0a70c0",
    },
    "Miami city": {
        "url": "https://www.ncdc.noaa.gov/cag/city/time-series/USW00012839-tavg-all-1-2020-2021.json?base_prd=true&begbaseyear=1948&endbaseyear=2000",
        "expected_sha384_hash": "797c5e1645b381c242fa1defcd0cd63549770a2e68481da9253bfc04e6ece826899d19112b52ae03e84bc7ad25d4063f",
    },
    "Portland city (pt.)": {
        "url": "https://www.ncdc.noaa.gov/cag/city/time-series/USW00024229-tavg-all-1-2020-2021.json?base_prd=true&begbaseyear=1938&endbaseyear=2000",
        "expected_sha384_hash": "a0f2a87bca37cfe7ea79a25d58d0280af1c4347360ab7bcadb3be4a71963ec234d14c763c12dafd9facf5fe75f3a9502",
    },
    "New York city": {
        "url": "https://www.ncdc.noaa.gov/cag/city/time-series/USW00094789-tavg-all-1-2020-2021.json?base_prd=true&begbaseyear=1948&endbaseyear=2000",
        "expected_sha384_hash": "4e029455185fbed76e9e9bd71344dd424a3de767a4e0fc0439fbadc5cc8bfa203bbfa8e5545df8ac7812124d3d5674f0",
    },
    "Las Vegas city": {
        "url": "https://www.ncdc.noaa.gov/cag/city/time-series/USW00023169-tavg-all-1-2020-2021.json?base_prd=true&begbaseyear=1948&endbaseyear=2000",
        "expected_sha384_hash": "d7082f31f97c56a36f579f6fd0d361caff0aa585dcadf3a63294d74bd31881517d7a0e970537993cb75c17ccba1ceb4e",
    },
    "Seattle city": {
        "url": "https://www.ncdc.noaa.gov/cag/city/time-series/USW00024233-tavg-all-1-2020-2021.json?base_prd=true&begbaseyear=1948&endbaseyear=2000",
        "expected_sha384_hash": "d4c291ead4a5f4a15b7d4ef456b73e3754b4bbbff3614b56d7bdff4631d2af3205795cc39ca132091b413f95e56768c0",
    },
    "Chicago city": {
        "url": "https://www.ncdc.noaa.gov/cag/city/time-series/USW00094846-tavg-all-1-2020-2021.json?base_prd=true&begbaseyear=1958&endbaseyear=2000",
        "expected_sha384_hash": "1ee80e616b70f16c193e565215f216a354324f58f12373502fb7763edb304af4bbd56d06c3c8bc0f0a23f7e7e85a5a0e",
    },
    "Buffalo city": {
        "url": "https://www.ncdc.noaa.gov/cag/city/time-series/USH00301012-tavg-all-1-2020-2021.json?base_prd=true&begbaseyear=1958&endbaseyear=2000",
        "expected_sha384_hash": "8ffbaf046ac1b3538d1a825942beb1f1a5420d18fe69fce090f34167df78ea3c863202b520558ffd5bf99425ffd8b232",
    },
    "Salem city": {
        "url": "https://www.ncdc.noaa.gov/cag/city/time-series/USW00024232-tavg-all-1-2020-2021.json?base_prd=true&begbaseyear=1958&endbaseyear=2000",
        "expected_sha384_hash": "d1df691b45ea66be3329e31d5f7d84a583164c8293842490c8a32c46dd89bb776182ed4c4a0e51968b3bf60d76f1998b",
    },
}


population_dataset_urls = {
    "Arizona": {
        "url": "https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/cities/totals/sub-est2019_4.csv",
        "expected_sha384_hash": "56f6dc515f42e584df8f2806dea2fc0f955916bc9b697975b5bb62a4efd5efa1065738a56f049053164ecd150cf84f5c",
    },
    "Florida": {
        "url": "https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/cities/totals/sub-est2019_12.csv",
        "expected_sha384_hash": "969cc1abd2f3f30a9208277fcb760d4cc9e9deb67a605ca0017daf784782d5180a1e2d5c6982a84783ec310c3dde13d2",
    },
    "Oregon": {
        "url": "https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/cities/totals/sub-est2019_41.csv",
        "expected_sha384_hash": "c11bb1be135eb9f2521e6397339c10c87baae4669aab46198f87716cc652a2f8c85a96c477aabe7e11240d5049b1149a",
    },
    "New York": {
        "url": "https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/cities/totals/sub-est2019_36.csv",
        "expected_sha384_hash": "cd77a3c4ab0099a8353e5118e82c184622274484328883f86eae59755a6a81ad039c5c55339f730a990b0bf412acc6c2",
    },
    "Nevada": {
        "url": "https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/cities/totals/sub-est2019_32.csv",
        "expected_sha384_hash": "d8518c388107483dc0da65e33769ea8e8d7983ce17f79f7020201cf3754cd4a59b9ce57e479b4c1208a46c2dd8be4fde",
    },
    "Washington": {
        "url": "https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/cities/totals/sub-est2019_53.csv",
        "expected_sha384_hash": "b426810c8438585c67057c5073281fce6b20d6bf013370256d6dbdcc4ad0b92c7d673c1e7d6e2a1d14e59f7bbc6599ad",
    },
    "Illinois": {
        "url": "https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/cities/totals/sub-est2019_17.csv",
        "expected_sha384_hash": "a55edf7f31ccdc792d183bb0c1dccbc55f6cfb5d518502e3fc5278d230a0174a741ae625d2b00e650dc1d8cd39f2e989",
    },
}


temperature_def = Definition(name="temperature", primitive="generic")
population_def = Definition(name="population", primitive="generic")


@op(outputs={"temperature": temperature_def})
async def lookup_temperature(self, city: str, month: int):
    if city not in temperature_dataset_urls:
        raise Exception(f"City: {city} not found in dataset")

    cache_dir = (
        pathlib.Path("~", ".cache", "dffml", "datasets", "temperature")
        .expanduser()
        .resolve()
    )

    filepath = await cached_download(
        temperature_dataset_urls[city]["url"],
        cache_dir / f"{city}.json",
        temperature_dataset_urls[city]["expected_sha384_hash"],
    )
    dataset = json.loads(pathlib.Path(filepath).read_text())
    temperature = dataset["data"][f"2020{month:02d}"]["value"]
    return {"temperature": float(temperature)}


@op(outputs={"population": population_def})
async def lookup_population(self, city: str, state: str):
    # temperature_dataset_urls doubles as the list of known cities
    if city not in temperature_dataset_urls:
        raise Exception(f"City: {city} not found in dataset")

    if state not in population_dataset_urls:
        raise Exception(f"State: {state} not found in dataset")

    cache_dir = (
        pathlib.Path("~", ".cache", "dffml", "datasets", "population")
        .expanduser()
        .resolve()
    )

    filepath = await cached_download(
        population_dataset_urls[state]["url"],
        cache_dir / f"{state}.csv",
        population_dataset_urls[state]["expected_sha384_hash"],
    )
    async for record in load(filepath):
        if export(record)["features"]["NAME"] == city:
            population = export(record)["features"]["POPESTIMATE2019"]
            return {"population": population}

The lookup_temperature operation takes city and month as inputs, downloads the temperature file for that city, and returns the average temperature for that month.
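
The NOAA time-series JSON is indexed by zero-padded "YYYYMM" keys (an assumption based on how the operation reads it). A tiny sketch with made-up data:

```python
import json

# Dummy payload shaped like the NOAA time-series JSON; the value is invented.
payload = json.loads('{"data": {"202007": {"value": "94.2"}}}')
month = 7
temperature = float(payload["data"][f"2020{month:02d}"]["value"])
```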

The lookup_population operation takes city and state as inputs, downloads the census dataset for that state, and returns the population for that particular city.
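
Similarly, the census CSVs contain one row per place; the operation matches on the NAME column and reads POPESTIMATE2019. A sketch against a made-up two-row fragment (the numbers are invented):

```python
import csv
import io

# Fragment shaped like the sub-est2019 census CSVs, reduced to the two
# columns the operation reads; the numbers are made up.
sample = "NAME,POPESTIMATE2019\nSeattle city,753675\nSpokane city,222081\n"
population = None
for row in csv.DictReader(io.StringIO(sample)):
    if row["NAME"] == "Seattle city":
        population = int(row["POPESTIMATE2019"])
        break
```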

Dataflow

Now we will create a dataflow from the operations we wrote above. The dataflow runs those operations over our dummy dataset and outputs the temperature and population for each record.

dataflow.sh

$ dffml dataflow create \
    -flow \
        '[{"seed": ["city"]}]'=operations:lookup_temperature.inputs.city \
        '[{"seed": ["month"]}]'=operations:lookup_temperature.inputs.month \
        '[{"seed": ["city"]}]'=operations:lookup_population.inputs.city \
        '[{"seed": ["state"]}]'=operations:lookup_population.inputs.state \
    -inputs \
        '["temperature", "population"]'=get_single_spec \
        -- \
        operations:lookup_population \
        operations:lookup_temperature \
        get_single | \
        tee preprocess_ops.json

The dataflow create command creates a dataflow for us in JSON format. We can now use this dataflow with the merge command.

Merge Command

The merge command reads dataset.csv, runs each record through the dataflow we created above, and writes the records, now including population and temperature, to a new file.

processing.sh

$ dffml merge text=dfpreprocess temp=csv \
    -source-text-dataflow preprocess_ops.json \
    -source-text-features city:str:1 state:str:1 month:int:1 \
    -source-text-source csv \
    -source-text-source-filename dataset.csv \
    -source-temp-filename preprocessed.csv \
    -source-temp-allowempty \
    -source-temp-readwrite \
    -log debug
$ cat preprocessed.csv
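
The exact layout of preprocessed.csv depends on the source configuration, but it should carry the original columns plus the computed population and temperature (the column names here assume the operation output names, and the row is made up). A quick sanity check:

```python
import csv
import io

# A made-up row shaped like what the merge step should produce.
sample = (
    "city,state,month,sales,population,temperature\n"
    "Seattle city,Washington,7,743907,753675,70.1\n"
)
rows = list(csv.DictReader(io.StringIO(sample)))
assert {"population", "temperature"}.issubset(rows[0])
```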

Model Install

We need to install the DFFML TensorFlow models since we'll be using the TensorFlow DNN Regressor.

$ python -m pip install -U dffml-model-tensorflow

Model Training

Now we will train our model. We will use the tfdnnr model, though any other regression model could be used for this purpose.

Our model is trained on the temperature and population features, with sales as the target.
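
To see the shape of the learning problem without TensorFlow, here is a stand-in using a plain one-feature least-squares fit of sales on temperature. This is not the DNN the tutorial uses, and the data is invented and exactly linear, so the point is only the feature-to-target mapping:

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


# Made-up, exactly linear data: sales = 20 * temperature.
temps = [60.0, 70.0, 80.0, 90.0]
sales = [1200.0, 1400.0, 1600.0, 1800.0]
slope, intercept = fit_line(temps, sales)
```

The real model learns a nonlinear function of both population and temperature, but the inputs and output play the same roles.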

train.sh

$ dffml train \
    -model tfdnnr \
    -model-epochs 300 \
    -model-steps 20 \
    -model-hidden 9 18 9 \
    -model-features population:int:1 temperature:float:1 \
    -model-predict sales:int:1 \
    -model-location tempdir \
    -sources f=csv \
    -source-filename preprocessed.csv \
    -log debug

Model Accuracy

The accuracy of the trained model can be assessed using the accuracy command.
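
For a regressor, "accuracy" is typically an error score rather than a classification rate. As an illustration (my example values, not necessarily the metric DFFML reports), mean squared error looks like this:

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented targets and predictions, just to show the calculation.
score = mean_squared_error([77131.0, 12000.0], [80000.0, 10000.0])
```

Lower is better: a perfect model scores 0.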

accuracy.sh

$ dffml accuracy \
    -model tfdnnr \
    -model-hidden 9 18 9 \
    -model-features population:int:1 temperature:float:1 \
    -model-predict sales:int:1 \
    -model-location tempdir \
    -sources f=csv \
    -source-filename preprocessed.csv \
    -log debug

Prediction

For prediction we will use test_dataset.csv; this data was not present in the training dataset.

Here, instead of creating an intermediary file, we feed the output of the dataflow (temperature and population) directly to the model for the prediction of sales.

predict.sh

$ dffml predict all \
    -model tfdnnr \
    -model-hidden 9 18 9 \
    -model-features population:int:1 temperature:float:1 \
    -model-predict sales:int:1 \
    -model-location tempdir \
    -sources preprocess=dfpreprocess \
    -source-preprocess-dataflow preprocess_ops.json \
    -source-preprocess-features city:str:1 state:str:1 month:int:1 \
    -source-preprocess-source csv \
    -source-preprocess-source-filename test_dataset.csv \
    -log debug