Source Dataset Base

dffml.source.dataset.base.dataset_source(entrypoint_name) -> ContextManagedWrapperSource

Allows us to quickly and programmatically provide access to existing datasets via existing or custom sources.

Under the hood this is an alias for dffml.source.wrapper.context_managed_wrapper_source() with qualname_suffix set to “DatasetSource”.
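For reference, a minimal sketch of what that alias amounts to, based on the description above rather than the actual implementation (it assumes context_managed_wrapper_source() accepts qualname_suffix as a keyword argument, as stated above):

from dffml.source.wrapper import context_managed_wrapper_source


def dataset_source(entrypoint_name):
    # Sketch only: delegate to the wrapper source decorator factory, tagging
    # the generated Source class with the "DatasetSource" qualname suffix
    return context_managed_wrapper_source(
        entrypoint_name, qualname_suffix="DatasetSource"
    )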

Examples

Say we have the following dataset hosted at http://example.com/my_training.csv

Let’s test this locally by creating a file and serving it from our local machine. Write the following file.

my_training.csv

feed,face,dead,beef
0.0,0,0,0
0.1,1,10,100
0.2,2,20,200
0.3,3,30,300
0.4,4,40,400

We can start an HTTP server using Python

$ python3 -m http.server 8080

We could write a dataset source to download and cache the contents locally as follows. We want to make sure that we validate the contents of datasets using SHA 384 hashes (see cached_download for more details). Without hash validation we risk downloading the wrong file or potentially malicious files.
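If you need the expected SHA 384 hash of a file you control, one way to compute it is with Python's standard library hashlib (this snippet is a convenience for the example, not part of DFFML; sha384sum from coreutils gives the same result).

import hashlib
import pathlib

# Print the SHA 384 hex digest of the local copy of the dataset
print(hashlib.sha384(pathlib.Path("my_training.csv").read_bytes()).hexdigest())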

my_training.py

import pathlib

from dffml.source.csv import CSVSource
from dffml.source.dataset import dataset_source
from dffml.util.net import cached_download, DEFAULT_PROTOCOL_ALLOWLIST


@dataset_source("my.training")
async def my_training_dataset(
    url: str = "http://download.example.com/data/my_training.csv",
    expected_sha384_hash: str = "db9ec70abdc8b74bcf91a7399144dd15fc01e3dad91bbbe3c41fbbe33065b98a3e06e8e0ba053d850d7dc19e6837310e",
    cache_dir: pathlib.Path = (
        pathlib.Path("~", ".cache", "dffml", "datasets", "my")
        .expanduser()
        .resolve()
    ),
):
    # Download the file from the given URL and place the downloaded file at
    # ~/.cache/dffml/datasets/my/training.csv. Ensure the SHA 384 hash
    # of the download's contents is equal to the expected value
    filepath = await cached_download(
        url,
        cache_dir / "training.csv",
        expected_sha384_hash,
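        # Allow plain HTTP in addition to the default allowed protocols,
        # since this example serves the file over http://localhost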
        protocol_allowlist=["http://"] + DEFAULT_PROTOCOL_ALLOWLIST,
    )
    # Create a source using downloaded file
    yield CSVSource(filename=str(filepath))

We can use it from Python in two different ways as follows: via the Source class exposed as the function's .source property, or by using the function itself as an async context manager

run.py

import sys
import pathlib
import unittest
import asyncio

from dffml import load

from my_training import my_training_dataset


async def main():
    # Grab arguments from command line
    url = sys.argv[1]
    cache_dir = pathlib.Path(sys.argv[2])

    # Usage via Source class set as property .source of function
    records = [
        record
        async for record in load(
            my_training_dataset.source(url=url, cache_dir=cache_dir)
        )
    ]

    # Create a test case to do comparisons
    tc = unittest.TestCase()

    tc.assertEqual(len(records), 5)
    tc.assertDictEqual(
        records[0].export(),
        {
            "key": "0",
            "features": {"feed": 0.0, "face": 0, "dead": 0, "beef": 0},
            "extra": {},
        },
    )

    # Usage as context manager to create source
    async with my_training_dataset(url=url, cache_dir=cache_dir) as source:
        records = [record async for record in load(source)]
        tc.assertEqual(len(records), 5)
        tc.assertDictEqual(
            records[2].export(),
            {
                "key": "2",
                "features": {"feed": 0.2, "face": 2, "dead": 20, "beef": 200},
                "extra": {},
            },
        )


if __name__ == "__main__":
    asyncio.run(main())

Run the script, passing the URL to download from and the directory to use as the download cache

$ python3 run.py http://localhost:8080/my_training.csv cache_dir

Or we can use it from the command line

$ dffml list records \
    -sources training=my_training:my_training_dataset.source \
    -source-training-cache_dir cache_dir \
    -source-training-url http://localhost:8080/my_training.csv