Changelog¶
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[Unreleased]¶
Added¶
Consolidate test case classes
Tutorial on how to load models dynamically https://intel.github.io/dffml/tutorials/models/load.html
log_time
decorator for logging running time of functions/coroutines.Download progressbar in
util/net.py
Usecase example notebook for “Moving between models”
Documentation and testing support for notebooks
Commit message linting in CI tests
Example on how to create operations and use data preprocessing source to train models https://intel.github.io/dffml/examples/ice_cream.html
Operations for zip and tar file creation and extraction
Operations for file (de)compression
Usecase example notebook for “Evaluating Model Performance”
Tests for all notebooks auto created and run via
test_notebooks.py
Support for additional layers in pytorch pretrained models via Python API
Pandas DataFrame can now be passed directly to high level APIs
Usecase example notebook for “Saving and loading models”
Usecase example notebook for “Transfer Learning”
Usecase example notebook for “Ensemble by stacking”
Support for Archive Storage of Models
Support for Multi-Output models.
Usecase example notebook for “Working with Multi-Output models”
Optimizer
parameter_grid
for tuning models.Usecase example notebook for “Tuning Models”
Operations for data cleanup
Examples of using data cleanup operations https://intel.github.io/dffml/examples/data_cleanup/index.html
Dev CMD to remove unused imports,
$ dffml service dev lint imports
Helper for creating a blank generic Python package
$ dffml service dev create blank mypackage
is_trained
flag to all modelsDynamic
location
property toModel
baseclass.
Changed¶
Calls to hashlib now go through helper functions
Build docs using
dffml service dev docs
cached_download/unpack_archive()
are now functionsModel
directory
property tolocation
high_level
split intoml
,dataflow
&source
submodulesConfig objects now support mutability/immutability at the property scope. See
docs/arch/0003-Config-Property-Mutable-vs-Immutable
for details.high_level
accuracy()
now takes predict features as parameter.Spacy
model_name_or_path
was changed tomodel_name
. Functionality is the same, it still accepts a name or a path.Renamed
accuracy()
toscore()
.Renamed
Optimizer
toTuner
.High-level functions now accept list for data.
Fixed¶
Record object key properties are now always strings
High level functions (
train()
, etc.) now work on existing open contextsIssue of download progress being logged only on first download
Removed¶
dffml/skel/common/REPLACE_IMPORT_PACKAGE_NAME/version.py
[0.4.0] - 2021-02-18¶
Added¶
New model for Anomaly Detection
Ablity to specify maximum number of contexts running at a time
CLI and Python example usage of Custom Neural Network
PyTorch loss function entrypoint style loading
Custom Neural Network, last layer support for pre-trained models
Example usage of sklearn operations
Example Flower17 species image classification
Configloading ablity from CLI using “@” before filename
Docstrings and doctestable example for DataFlowPreprocessSource
XGBoost Regression Model
Pre-Trained PyTorch torchvision Models
Spacy model for NER
Ability to rename outputs using GetSingle
Tutorial for using NLP operations with models
Operations plugin for NLP wrapping spacy and scikit functions
Support for default value in a Definition
Source for reading images in directories
Operations plugin for image preprocessing
-pretty
flag tolist records
andpredict
commandsdaal4py based linear regression model
DataFlowPreprocessSource can take a config file as dataflow via the CLI.
Support for link on conditions in dataflow diagrams
edit all
command to edit records in bulkSupport for Tensorflow 2.2
Vowpal Wabbit Models
Python 3.8 support
binsec branch to
operations/binsec
Doctestable example for
model_predict
operation.Doctestable examples to
operation/mapping.py
shouldi got an operation to run Dependency-check on java code.
load
andrun
functions in high level APIDoctestable examples to
db
operations.Source for parsing
.ini
file formatsTests for noasync high level API.
Tests for load and save functions in high level API.
Operation
inputs and outputs default to emptydict
if not given.Ability to export any object with
dffml service dev export
Complete example for dataflow run cli command
Tests for default configs instantiation.
Example ffmpeg operation.
Operations to deploy docker container on receiving github webhook.
New use case
Redeploying dataflow on webhook
in docs.Documentation for creating Source for new File types taking
.ini
as an example.New input modes, output modes for HTTP API dataflow registration.
Usage example for tfhub text classifier.
AssociateDefinition
output operation to map definition names to values produced as a result of passing Inputs with those definitions to operations.DataFlows now have a syntax for providing a set of definitions that will override the operations default definition for a given input.
Source which modifies record features as they are read from another source. Useful for modifying datasets as they are used with ML commands or editing in bulk.
Auto create Definition for the
op
when they might have a spec, subspec.shouldi use
command which detects the language of the codebase given via path to directory or Git repo URL and runs the appropriate static analyzers.Support for entrypoint style loading of operations and seed inputs in
dataflow create
.Definition for output of the function that
op
wraps.Expose high level load, run and save functions to noasync.
Operation to verify secret for GitHub webhook.
Option to modify flow and add config in
dataflow create
.Ability to use a function as a data source via the
op
sourceMake every model’s directory property required
New model AutoClassifierModel based on
AutoSklearn
.New model AutoSklearnRegressorModel based on
AutoSklearn
.Example showing usage of locks in dataflow.
-skip
flag toservice dev install
command to let users not install certain core pluginsHTTP service got a
-redirect
flag which allows for URL redirection via a HTTP 307 responseSupport for immediate response in HTTP service
Daal4py example usage.
Gitter chatbot tutorial.
Option to run dataflow without sources from cli.
Sphinx extension for automated testing of tutorials (consoletest)
Example of software portal using DataFlows and HTTP service
Retry parameter to
Operation
. Allows for setting number of times operation should be retried before it’s exception should be raised.
Changed¶
Renamed
-seed
to-inputs
indataflow create
commandRenamed configloader/png to configloader/image and added support for loading JPEG and TIFF file formats
Update record
__str__
method to output in tabular formatUpdate MNIST use case to normalize image arrays.
arg_
notation replaced withCONFIG = ExampleConfig
style syntax for parsing all command line arguments.Moved usage/io.rst to docs/tutorials/dataflows/io.rst
edit
command substituted withedit record
Edit on Github
button now hidden for plugins.Doctests now run via unittests
Every class and function can now be imported from the top level module
op
attempts to createDefinition
s for each argument if aninputs
are not given.Classes now use
CONFIG
if it has a default for every field andconfig
isNone
Models now dynamically import third party modules.
Memory dataflow classes now use auto args and config infrastructure
dffml list records
command prints Records as JSON using.export()
Feature class in
dffml/feature/feature.py
initialize a feature objectAll DefFeatures() functions are substituted with Features()
All feature.type() and feature.lenght() are substituted with feature.type and feature.length
FileSource takes pathlib.Path as filename
Tensorflow tests re-run themselves up to 6 times to stop them from failing the CI due to their randomly initialized weights making them fail ~2% of the time
Any plugin can now be loaded via it’s entrypoint style path
with_features
now raises a helpful error message if no records with matching features were foundSplit out model tutorial into writing the model, and another tutorial for packaging the model.
IntegrationCLITestCase creates a new directory and chdir into it for each test
Automated testing of Automating Classification tutorial
dffml version
command now prints git repo hash and if the repo is dirty
Fixed¶
export_value
now converts numpy array to JSON serializable datatypeCSV source overwriting configloaded data to every row
Race condition in
MemoryRedundancyChecker
when more than 4 possible parameter sets for an operation.Typing of config values for numpy parsed docstrings where type should be tuple or list
Model predict methods now use
SourcesContext.with_features
Removed¶
Monitor class and associated tests (unused)
DefinedFeature class in
dffml/feature/feature.py
DefFeature function in
dffml/feature/feature.py
load_def function in Feature class in
dffml/feature/feature.py
[0.3.7] - 2020-04-14¶
Added¶
IO operations demo and
literal_eval
operation.Python prompts
>>>
can now be enabled or disabled for easy copying of code into interactive sessions.Whitespace check now checks .rst and .md files too.
GetMulti
operation which gets all Inputs of a given definitionPython usage example for LogisticRegression and its related tests.
Support for async generator operations
Example CLI commands and Python code for
SLRModel
save
function in high level API to quickly save all given records to a sourceAbility to configure sources and models for HTTP API from command line when starting server
Documentation page for command line usage of HTTP API
Usage of HTTP API to the quickstart to use trained model
Changed¶
Renamed
"arg"
to"plugin"
.CSV source sorts feature names within headers when saving
Moved HTTP service testing code to HTTP service
util.testing
Fixed¶
Exporting plugins
Issue parsing string values when using the
dataflow run
command and specifying extra inputs.
Removed¶
Unused imports
[0.3.6] - 2020-04-04¶
Added¶
Operations for taking input from the user
AcceptUserInput
and for printing the outputprint_output
PNG ConfigLoader for reading images as arrays to predict using MNIST trained models
Docstrings and doctestable examples to
record.py
.Inputs can be validated using operations
validate
parameter inInput
takesOperation.instance_name
New db source can utilize any database that inherits from
BaseDatabase
Logistic Regression with SAG optimizer
Test tensorflow DNNEstimator documentation examples in CI
shouldi got an operation to run cargo-audit on rust code.
Moved all the downloads to tests/downloads to speed the CI test.
Test tensorflow DNNEstimator documentation exaples in CI
Add python code for tensorflow DNNEstimator
Ability to run a subflow as if it were an operation using the
dffml.dataflow.run
operation.Support for operations without inputs.
Partial doctestable examples to
features.py
Doctestable examples for
BaseSource
Instructions for setting up debugging environment in VSCode
Fixed¶
New model tutorial mentions file paths that should be edited.
DataFlow is no longer a dataclass to prevent it from being exported incorrectly.
operations_parameter_set_pairs
moved toMemoryOrchestratorContext
Ignore generated files in
docs/plugins/
Treat
"~"
as the the home directory rather than a literalWindows support by selecting
asyncio.ProactorEventLoop
and not usingasyncio.FastChildWatcher
.Moved SLR into the main dffml package and removed
scratch:slr
.
Changed¶
Refactor
model/tensroflow
[0.3.5] - 2020-03-10¶
Added¶
Parent flows can now forward inputs to active contexts of subflows.
forward
parameter inDataFlow
subflow
inOperationImplementationContext
Documentation on writing examples and running doctests
Doctestable Examples to high-level API.
Shouldi got an operation to run npm-audit on JavaScript code
Docstrings and doctestable examples for
record.py
(features and evaluated)Simplified model API with SimpleModel
Documentation on how DataFlows work conceptually.
Style guide now contains information on class, variable, and function naming.
Changed¶
Restructured contributing documentation
Use randomly generated data for scikit tests
Change Core to Official to clarify who maintains each plugin
Name of output of unsupervised model from “Prediction” to “cluster”
Test scikit LR documentation examples in CI
Create a fresh archive of the git repo for release instead of cleaning existing repo with
git clean
for development service release command.Simplified SLR tests for scratch model
Test tensorflow DNNClassifier documentation examples in CI
config directories and files associated with ConfigLoaders have been renamed to configloader.
Model config directory parameters are now
pathlib.Path
objectsNew model tutorial and
skel/model
use simplifeid model API.
[0.3.4] - 2020-02-28¶
Added¶
Tensorflow hub NLP models.
Notes on development dependencies in
setup.py
files to codebase notes.Test for
cached_download
dffml.util.net.cached_download_unpack_archive
to run a cached download and unpack the archive, very useful for testing. Documented on the Networking Helpers API docs page.Directions on how to read the CI under the Git and GitHub page of the contributing documentation.
HTTP API
Static file serving from a directory with
-static
api.js
file serving with the-js
flagDocs page for JavaScript example
shouldi got an operation to run golangci-lint on Golang code
Note about using black via VSCode
Fixed¶
Port assignment for the HTTP API via the
-port
flag
Changed¶
repo
/Repo
torecord
/Record
Definitions with a
spec
can use thesubspec
parameter to declare that they are a list or a dict where the values are of thespec
type. Rather than the list or dict itself being of thespec
type.Fixed the URL mentioned in example to configure a model.
Sphinx doctests are now run in the CI in the DOCS task.
Lint JavaScript files with js-beautify and enforce with CI
Removed¶
Unused imports
[0.3.3] - 2020-02-10¶
Added¶
Moved from TensorFlow 1 to TensorFlow 2.
IDX Sources to read binary data files and train models on MNIST Dataset
scikit models
Clusterers
KMeans
Birch
MiniBatchKMeans
AffinityPropagation
MeanShift
SpectralClustering
AgglomerativeClustering
OPTICS
allowempty
added to source config parameters.Quickstart document to show how to use models from Python.
The latest release of the documentation now includes a link to the documentation for the main branch (on GitHub pages).
Virtual environment, GitPod, and Docker development environment setup notes to the CONTRIBUTING.md file.
Changelog now included in documentation website.
Database abstraction
dffml.db
SQLite connector
MySQL connector
Documented style for imports.
Documented use of numpy docstrings.
Inputs
can now be sanitized using function passed invalidate
parameterHelper utilities to take callables with numpy style docstrings and create config classes out of them using
make_config
.File listing endpoint to HTTP service.
When an operation throws an exception the name of the instance and the parameters it was executed with will be thrown via an
OperationException
.Network utilities to peformed cached downloads with hash validation.
Development service got a new command, which can retrieve an argument passed to setuptools
setup
function within asetup.py
file.
Changed¶
All instances of
src_url
changed tokey
.readonly
parameter in source config is now changed toreadwrite
.predict
parameter of all model config classes has been changed fromstr
toFeature
.Defining features on the command line no longer requires that defined features be prefixed with
def:
The model predict operation will now raise an exception if the model it is passed via it’s config is a class rather than an instance.
entry_point
and friends have been renamed toentrypoint
.Use
FastChildWatcher
when run via the CLI to preventBlockingIOError
s.TensorFlow based neural network classifier had the
classification
parameter in it’s config changed topredict
.SciKit models use
make_config_numpy
.Predictions in
repos
are now dictionary.All instances of
label
changed totag
Subclasses of
BaseConfigurable
will now auto instantiate their respective config classes usingkwargs
if the config argument isn’t given and keyword arguments are.The quickstart documentation was improved as well as the structure of docs.
Fixed¶
CONTRIBUTING.md has
-e
in the wrong place in the getting setup section.Since moving to auto
args()
andconfig()
, BaseConfigurable no longer produces odd typenames in conjunction with docs.py.Autoconvert Definitions with spec into their spec
Removed¶
The model predict operation erroneously had a
msg
parameter in it’s config.Unused imports identified by deepsource.io
Evaluation code from feature.py file as well as tests for those evaluations.
[0.3.2] - 2020-01-03¶
Added¶
scikit models
Classifiers
LogisticRegression
GradientBoostingClassifier
BernoulliNB
ExtraTreesClassifier
BaggingClassifier
LinearDiscriminantAnalysis
MultinomialNB
Regressors
ElasticNet
BayesianRidge
Lasso
ARDRegression
RANSACRegressor
DecisionTreeRegressor
GaussianProcessRegressor
OrthogonalMatchingPursuit
Lars
Ridge
AsyncExitStackTestCase
which instantiates and enters async and non-asynccontextlib
exit stacks. Provides temporary file creation.Automatic releases to PyPi via GitHub Actions
Automatic documentation deployment to GitHub Pages
Function to create a config class dynamically, analogous to
make_dataclass
ConfigLoaders
class which loads config files from a file or directory to a dictionary.
Changed¶
CLI tests and integration tests derive from
AsyncExitStackTestCase
SciKit models now use the auto args and config methods.
Fixed¶
Correctly identify when functions decorated with
op
useself
to reference theOperationImplementationContext
.shouldi safety operation uses subprocess communicate method instead of stdin pipe writes.
Negative values are correctly parsed when input via the command line.
Do not lowercase development mode install location when reporting version.
[0.3.1] - 2019-12-12¶
Added¶
Integration tests using the command line interface.
Operation
run_dataflow
to run a dataflow and test for the same.
Changed¶
Features were moved from ModelContext to ModelConfig
CI is now run via GitHub Actions
CI testing script is now verbose
args and config methods of all classes no longer require implementation. BaseConfigurable handles exporting of arguments and creation of config objects for each class based off of the CONFIG property of that class. The CONFIG property is a class which has been decorated with dffml.base.config to make it a dataclass.
Speed up development service install of all plugins in development mode
Speed up named plugin load times
Fixed¶
DataFlows with multiple possibilities for a source for an input, now correctly look through all possible sources instead of just the first one.
DataFlow MemoryRedundancyCheckerContext was using all inputs in an input set and all their ancestors to check redundancy (a hold over from pre uid days). It now correctly only uses the inputs in the parameter set. This fixes a major performance issue.
MySQL packaging issue.
Develop service running one off operations correctly json-loads dict types.
Operations with configs can be run via the development service
JSON dumping numpy int* and float* caused crash on dump.
CSV source always loads
src_urls
as strings.
Removed¶
CLI command
operations
removed in favor ofdataflow run
Duplicate dataflow diagram code from development service
[0.3.0] - 2019-10-26¶
Added¶
Real DataFlows, see operations tutorial and usage examples
Async helper concurrently nocancel optional keyword argument which, if set is a set of tasks not to cancel when the concurrently execution loop completes.
FileSourceTest has a
test_label
method which checks that a FileSource knows how to properly load and save repos under a given label.Test case for Merge CLI command
Repo.feature method to select a single piece of feature data within a repo.
Dev service to help with hacking on DFFML and to create models from templates in the skel/ directory.
Classification type parameter to DNNClassifierModelConfig to specify data type of given classification options.
util.cli CMD classes have their argparse description set to their docstring.
util.cli CMD classes can specify the formatter class used in
argparse.ArgumentParser
via theCLI_FORMATTER_CLASS
property.Skeleton for service creation was added
Simple Linear Regression model from scratch
Scikit Linear Regression model
Community link in CONTRIBUTING.md.
Explained three main parts of DFFML on docs homepage
Documentation on how to use ML models on docs Models plugin page.
Mailing list info
Issue template for questions
Multiple Scikit Models with dynamic config
Entrypoint listing command to development service to aid in debugging issues with entrypoints.
HTTP API service to enable interacting with DFFML over HTTP. Currently includes APIs for configuring and using Sources and Models.
MySQL protocol source to work with data from a MySQL protocol compatible db
shouldi example got a bandit operation which tells users not to install if there are more than 5 issues of high severity and confidence.
dev service got the ability to run a single operation in a standalone fashion.
About page to docs.
Tensorflow DNNEstimator based regression model.
Changed¶
feature/codesec became it’s own branch, binsec
BaseOrchestratorContext
run_operations
strict is default to true. With strict as true errors will be raised and not just logged.MemoryInputNetworkContext got an
sadd
method which is shorthand for creating a MemoryInputSet with a StringInputSetContext.MemoryOrchestrator
basic_config
method takes list of operations and optional config for them.shouldi example uses updated
MemoryOrchestrator.basic_config
method and includes more explanation in comments.CSVSource allows for setting the Repo’s
src_url
from a csv columnutil Entrypoint defines a new class for each loaded class and sets the
ENTRY_POINT_LABEL
parameter within the newly defined class.Tensorflow model removed usages of repo.classifications methods.
Entrypoint prints traceback of loaded classes to standard error if they fail to load.
Updated Tensorflow model README.md to match functionality of DNNClassifierModel.
DNNClassifierModel no longer splits data for the user.
Update
pip
in Dockerfile.Restructured documentation
Ran
black
on whole codebase, including all submodulesCI style check now checks whole codebase
Merged HACKING.md into CONTRIBUTING.md
shouldi example runs bandit now in addition to safety
The way safety gets called
Switched documentation to Read The Docs theme
Models yield only a repo object instead of the value and confidence of the prediction as well. Models are not responsible for calling the predicted method on the repo. This will ease the process of making predict feature specific.
Updated Tensorflow model README.md to include usage of regression model
Fixed¶
Docs get version from dffml.version.VERSION.
FileSource zipfiles are wrapped with TextIOWrapper because CSVSource expects the underlying file object to return str instances rather than bytes.
FileSourceTest inherits from SourceTest and is used to test json and csv sources.
A temporary directory is used to replicate
mktemp -u
functionality so as to provide tests using a FileSource with a valid tempfile name.Labels for JSON sources
Labels for CSV sources
util.cli CMD’s correcly set the description of subparsers instead of their help, they also accept the
CLI_FORMATTER_CLASS
property.CSV source now has
entrypoint
decorationJSON source now has
entrypoint
decorationStrict flag in df.memory is now on by default
Dynamically created scikit models get config args correctly
Renamed
DNNClassifierModelContext
first init arg fromconfig
tofeatures
BaseSource now has
base_entry_point
decoration
Removed¶
Repo objects are no longer classification specific. Their
classify
,classified
, andclassification
methods were removed.
[0.2.1] - 2019-06-07¶
Added¶
Definition spec field to specify a class representative of key value pairs for definitions with primitives which are dictionaries
Auto generation of documentation for operation implementations, models, and sources. Generated docs include information on configuration options and inputs and outputs for operation implementations.
Async helpers got an
aenter_stack
method which creates and returns andcontextlib.AsyncExitStack
after entering all the context’s passed to it.Example of how to use Data Flow Facilitator / Orchestrator / Operations by writing a Python meta static analysis tool, shouldi
Changed¶
OperationImplementation
add_label
andadd_orig_label
methods now use op.name instead ofENTRY_POINT_ORIG_LABEL
andENTRY_POINT_NAME
.Make output specs and remap arguments optional for Operations CLI commands.
Feature skeleton project is now operations skeleton project
Fixed¶
MemoryOperationImplementationNetwork instantiates OperationImplementations using their
withconfig()
method.MemorySource now decorated with
entrypoint
MemorySource takes arguments correctly via
config_set
andconfig_get
skel modules have
long_description_content_type
set to “text/markdown”Base Orchestrator
__aenter__
and__aexit__
methods were moved to the Memory Orchestrator because they are specific to that config.Async helper
aenter_stack
usesinspect.isfunction
so it will bind lambdas
[0.2.0] - 2019-05-23¶
Added¶
Support for zip file source
Async helper for running tasks concurrently
Gitter badge to README
Documentation on the Data Flow Facilitator subsystem
codesec plugin containing operations which gather security related metrics on code and binaries.
auth plugin containing an scrypt operation as an example of thread pool usage.
Changed¶
Standardized the API for most classes in DFFML via inheritance from dffml.base
Configuration of classes is now done via the args() and config() methods
Documentation is now generated using Sphinx
Fixed¶
Corrected maxsplit in util.cli.parser
Check that dtype is a class in Tensorlfow DNN
CI script no longer always exits 0 for plugin tests
Corrected render type in setup.py to markdown
[0.1.2] - 2019-03-29¶
Added¶
Contribution guidelines
Logging documentation
Example usage of Git features
New Model and Feature creation script
New Feature skeleton directory
New Model skeleton directory
New Feature creation tutorial
New Model creation tutorial
Update functionality to the CSV source
Support for Gzip file source
Support for bz2 file source
Travis checks for additions to CHANGELOG.md
Travis checks for trailing whitespace
Support for lzma file source
Support for xz file source
Data Flow Facilitator
Changed¶
Restructured documentation to docs folder and moved from rST to markdown
Git feature cloc logs if no binaries are in path
Fixed¶
Enable source.file to read from /dev/fd/XX
[0.1.1] - 2019-03-08¶
Changed¶
Corrected formatting in README for PyPi
[0.1.0] - 2019-03-07¶
Added¶
Feature class to collect a feature in a dataset
Git features to collect feature data from Git repos
Model class to wrap implementations of machine learning models
Tensorflow DNN model for generic usage of the DNN estimator
CLI interface and framework
Source class to manage dataset storage