Source Dataset Base¶
- dffml.source.dataset.base.dataset_source(entrypoint_name) → ContextManagedWrapperSource [source]¶
Allows us to quickly and programmatically provide access to existing datasets via existing or custom sources.
Under the hood this is an alias for dffml.source.wrapper.context_managed_wrapper_source() with qualname_suffix set to “DatasetSource”.
Examples
Say we have the following dataset hosted at http://example.com/my_training.csv
Let’s test this locally by creating a file and serving it from our local machine. Write the following file.
my_training.csv
feed,face,dead,beef
0.0,0,0,0
0.1,1,10,100
0.2,2,20,200
0.3,3,30,300
0.4,4,40,400
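If you would rather generate my_training.csv than type it by hand, here is a minimal sketch using only Python's standard library csv module. This is just one way to produce the file; any method that yields the same bytes works.

```python
import csv

# Rows mirroring the my_training.csv table above
rows = [
    {"feed": 0.0, "face": 0, "dead": 0, "beef": 0},
    {"feed": 0.1, "face": 1, "dead": 10, "beef": 100},
    {"feed": 0.2, "face": 2, "dead": 20, "beef": 200},
    {"feed": 0.3, "face": 3, "dead": 30, "beef": 300},
    {"feed": 0.4, "face": 4, "dead": 40, "beef": 400},
]

# newline="" prevents the csv module from writing extra blank lines on Windows
with open("my_training.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["feed", "face", "dead", "beef"])
    writer.writeheader()
    writer.writerows(rows)
```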
We can start an HTTP server using Python
$ python3 -m http.server 8080
We could write a dataset source to download and cache the contents locally as follows. We want to make sure that we validate the contents of datasets using SHA 384 hashes (see cached_download for more details). Without hash validation we risk downloading the wrong file, or potentially a malicious file.
my_training.py
import pathlib

from dffml.source.csv import CSVSource
from dffml.source.dataset import dataset_source
from dffml.util.net import cached_download, DEFAULT_PROTOCOL_ALLOWLIST


@dataset_source("my.training")
async def my_training_dataset(
    url: str = "http://download.example.com/data/my_training.csv",
    expected_sha384_hash: str = "db9ec70abdc8b74bcf91a7399144dd15fc01e3dad91bbbe3c41fbbe33065b98a3e06e8e0ba053d850d7dc19e6837310e",
    cache_dir: pathlib.Path = (
        pathlib.Path("~", ".cache", "dffml", "datasets", "my")
        .expanduser()
        .resolve()
    ),
):
    # Download the file from the given URL, placing the downloaded file at
    # ~/.cache/dffml/datasets/my/training.csv. Ensure the SHA 384 hash
    # of the download's contents is equal to the expected value
    filepath = await cached_download(
        url,
        cache_dir / "training.csv",
        expected_sha384_hash,
        protocol_allowlist=["http://"] + DEFAULT_PROTOCOL_ALLOWLIST,
    )
    # Create a source using the downloaded file
    yield CSVSource(filename=str(filepath))
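The expected_sha384_hash value above has to come from somewhere: one way to obtain it is to hash a known-good copy of the file yourself. Here is a minimal sketch using only the standard library; the helper name sha384_of_file is our own for illustration, not part of DFFML.

```python
import hashlib


def sha384_of_file(path: str) -> str:
    # Read the file in chunks so large datasets need not fit in memory
    digest = hashlib.sha384()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Running sha384_of_file("my_training.csv") against your known-good copy prints the hex digest to paste into expected_sha384_hash.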
We can use it from Python in two different ways, as follows
run.py
import sys
import pathlib
import unittest
import asyncio

from dffml import load

from my_training import my_training_dataset


async def main():
    # Grab arguments from the command line
    url = sys.argv[1]
    cache_dir = pathlib.Path(sys.argv[2])
    # Usage via the Source class set as the .source property of the function
    records = [
        record
        async for record in load(
            my_training_dataset.source(url=url, cache_dir=cache_dir)
        )
    ]
    # Create a test case to do comparisons
    tc = unittest.TestCase()
    tc.assertEqual(len(records), 5)
    tc.assertDictEqual(
        records[0].export(),
        {
            "key": "0",
            "features": {"feed": 0.0, "face": 0, "dead": 0, "beef": 0},
            "extra": {},
        },
    )
    # Usage as a context manager to create the source
    async with my_training_dataset(url=url, cache_dir=cache_dir) as source:
        records = [record async for record in load(source)]
        tc.assertEqual(len(records), 5)
        tc.assertDictEqual(
            records[2].export(),
            {
                "key": "2",
                "features": {"feed": 0.2, "face": 2, "dead": 20, "beef": 200},
                "extra": {},
            },
        )


if __name__ == "__main__":
    asyncio.run(main())
$ python3 run.py http://localhost:8080/my_training.csv cache_dir
Or we can use it from the command line
$ dffml list records \
    -sources training=my_training:my_training_dataset.source \
    -source-training-cache_dir cache_dir \
    -source-training-url http://localhost:8080/my_training.csv