Models¶
Models are implementations of dffml.model.model.Model; they abstract the usage of machine learning models.
If you want to get started creating your own model, check out the model tutorials.
You can load any of the models seen here using the Model.load (dffml.model.model.Model.load) function. See the Load Models Dynamically tutorial for more details.
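For instance, loading the slr model class by its entrypoint name looks roughly like this (a minimal sketch using the Model.load function referenced above; the config values mirror the slr example below):
from dffml import Features, Feature
from dffml.model.model import Model
# Look up the model class registered under the "slr" entrypoint
SLRModel = Model.load("slr")
# Instantiate it the same way as if it had been imported directly
model = SLRModel(
    features=Features(Feature("f1", float, 1)),
    predict=Feature("ans", int, 1),
    location="tempdir",
)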
dffml¶
pip install dffml
slr¶
Official
Simple Linear Regression, training on one variable to predict another.
The dataset used for training
dataset.csv
f1,ans
0.1,0
0.7,1
0.6,1
0.2,0
0.8,1
Train the model
$ dffml train \
-model slr \
-model-features f1:float:1 \
-model-predict ans:int:1 \
-model-location tempdir \
-sources f=csv \
-source-filename dataset.csv
Assess the accuracy
$ dffml accuracy \
-model slr \
-model-features f1:float:1 \
-model-predict ans:int:1 \
-model-location tempdir \
-features ans:int:1 \
-sources f=csv \
-source-filename dataset.csv \
-scorer mse
Output
1.0
Make a prediction
predict.csv
f1
0.8
$ dffml predict all \
-model slr \
-model-features f1:float:1 \
-model-predict ans:int:1 \
-model-location tempdir \
-sources f=csv \
-source-filename predict.csv
[
{
"extra": {},
"features": {
"f1": 0.8
},
"key": "0",
"last_updated": "2020-11-15T16:22:25Z",
"prediction": {
"ans": {
"confidence": 0.9355670103092784,
"value": 1
}
}
}
]
Example usage of the SLR model using Python
slr.py
from dffml import Features, Feature, SLRModel
from dffml.noasync import train, score, predict
from dffml.accuracy import MeanSquaredErrorAccuracy
model = SLRModel(
features=Features(Feature("f1", float, 1)),
predict=Feature("ans", int, 1),
location="tempdir",
)
# Train the model
train(model, "dataset.csv")
# Assess accuracy (alternate way of specifying data source)
scorer = MeanSquaredErrorAccuracy()
print("Accuracy:", score(model, scorer, Feature("ans", int, 1), "dataset.csv"))
# Make prediction
for i, features, prediction in predict(model, {"f1": 0.8, "ans": 0}):
features["ans"] = prediction["ans"]["value"]
print(features)
$ python slr.py
Accuracy: 0.9355670103092784
{'f1': 0.8, 'ans': 1}
Args
predict: Feature
Label or the value to be predicted
features: List of features
Features to train on. For SLR only 1 allowed
location: Path
Location where state should be saved
dffml_model_scratch¶
pip install dffml-model-scratch
anomalydetection¶
Official
Anomaly detection model that fits a multivariate Gaussian distribution to the training data, computes the probability of each record in the dataset, and flags low-probability records as outliers. F1 score is used as the evaluation metric for this model. The model captures dependencies across features, and works particularly well when the features follow a Gaussian distribution.
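The underlying idea can be sketched with NumPy and SciPy (illustrative only, not the model's internal code; the threshold name epsilon is our own):
import numpy as np
from scipy.stats import multivariate_normal
# Fit a multivariate Gaussian to the training data
X_train = np.array([[0.65, 0.2], [0.24, 0.3], [0.93, 0.1], [0.87, 0.25]])
dist = multivariate_normal(
    mean=X_train.mean(axis=0), cov=np.cov(X_train, rowvar=False)
)
# Records whose probability density falls below epsilon are outliers
X_new = np.array([[0.5, 0.2], [7.0, 3.0]])
epsilon = 1e-3
print(dist.pdf(X_new) < epsilon)  # [False  True]: the second record is an outlier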
Examples¶
Command line usage
Create training and test datasets
trainex.csv
A,Y
0.65,0
0.24,0
0.93,0
0.87,0
0.23,0
7,1
0.86,0
0.45,0
0.55,0
0.29,0
5,1
0.51,0
0.88,0
0.24,0
0.51,0
0.17,0
9,1
0.37,0
0.23,0
0.44,0
0.62,0
3,1
0.87,0
testex.csv
A,Y
0.45,0
0.23,0
0.67,0
8,1
0.19,0
0.34,0
0.49,0
0.31,0
0.47,0
4,1
Train the model
$ dffml train \
-sources f=csv \
-source-filename trainex.csv \
-model anomalydetection \
-model-features A:float:2 \
-model-predict Y:int:1 \
-model-location tempdir
Assess the accuracy
$ dffml accuracy \
-sources f=csv \
-source-filename testex.csv \
-model anomalydetection \
-model-features A:float:2 \
-model-predict Y:int:1 \
-model-location tempdir \
-features Y:int:1 \
-scorer anomalyscore
Make predictions
$ dffml predict all \
-sources f=csv \
-source-filename testex.csv \
-model anomalydetection \
-model-features A:float:2 \
-model-predict Y:int:1 \
-model-location tempdir
Python usage
detectoutliers.py
from dffml import Feature, Features
from dffml.noasync import score, train
from dffml_model_scratch.anomalydetection import AnomalyModel
from dffml_model_scratch.anomaly_detection_scorer import (
AnomalyDetectionAccuracy,
)
# Configure the model
model = AnomalyModel(
features=Features(Feature("A", int, 2),),
predict=Feature("Y", int, 1),
location="model",
)
# Train the model
train(model, "trainex.csv")
# Assess accuracy for test set
scorer = AnomalyDetectionAccuracy()
print(
"Test set F1 score :",
score(model, scorer, Feature("Y", int, 1), "testex.csv"),
)
# Assess accuracy for training set
print(
"Training set F1 score :",
score(model, scorer, Feature("Y", int, 1), "trainex.csv"),
)
Output
$ python detectoutliers.py
Test set F1 score : 0.8
Training set F1 score : 0.888888888888889
Args
features: List of features
Features to train on
predict: Feature
Label or the value to be predicted
location: Path
Location where state should be saved
k: float
default: 0.8
Validation set size
scratchlgrsag¶
Official
Logistic regression trained with a stochastic average gradient (SAG) descent optimizer.
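The core idea of SAG can be sketched on the toy dataset below (illustrative only, not the model's actual code): keep a table of per-sample gradients, refresh one entry per step, and update the weights with the running average.
import numpy as np
X = np.array([0.1, 0.7, 0.6, 0.2, 0.8])
y = np.array([0, 1, 1, 0, 1])
n = len(X)
w, b, lr = 0.0, 0.0, 0.5
grad_w = np.zeros(n)  # per-sample gradient memory (the "average" in SAG)
grad_b = np.zeros(n)
rng = np.random.default_rng(0)
for _ in range(2000):
    i = rng.integers(n)
    p = 1.0 / (1.0 + np.exp(-(w * X[i] + b)))  # sigmoid
    grad_w[i] = (p - y[i]) * X[i]  # refresh this sample's stored gradient
    grad_b[i] = p - y[i]
    w -= lr * grad_w.mean()  # step using the average of all stored gradients
    b -= lr * grad_b.mean()
print(w, b)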
The dataset used for training
cat > dataset.csv << EOF
f1,ans
0.1,0
0.7,1
0.6,1
0.2,0
0.8,1
EOF
Train the model
dffml train \
-model scratchlgrsag \
-model-features f1:float:1 \
-model-predict ans:int:1 \
-model-location tempdir \
-sources f=csv \
-source-filename dataset.csv \
-log debug
Assess the accuracy
dffml accuracy \
-model scratchlgrsag \
-model-features f1:float:1 \
-model-predict ans:int:1 \
-model-location tempdir \
-features ans:int:1 \
-sources f=csv \
-source-filename dataset.csv \
-scorer mse \
-log debug
Output
1.0
Make a prediction
echo -e 'f1,ans\n0.8,0\n' | \
dffml predict all \
-model scratchlgrsag \
-model-features f1:float:1 \
-model-predict ans:int:1 \
-model-location tempdir \
-sources f=csv \
-source-filename /dev/stdin \
-log debug
Output
[
{
"extra": {},
"features": {
"ans": 0,
"f1": 0.8
},
"last_updated": "2020-03-19T13:41:08Z",
"prediction": {
"ans": {
"confidence": 1.0,
"value": 1
}
},
"key": "0"
}
]
Example usage of Logistic Regression using Python
from dffml import CSVSource, Features, Feature
from dffml.noasync import train, score, predict
from dffml.accuracy import MeanSquaredErrorAccuracy
from dffml_model_scratch.logisticregression import LogisticRegression
model = LogisticRegression(
features=Features(Feature("f1", float, 1)),
predict=Feature("ans", int, 1),
location="tempdir",
)
# Train the model
train(model, "dataset.csv")
# Assess accuracy (alternate way of specifying data source)
scorer = MeanSquaredErrorAccuracy()
print(
"Accuracy:",
score(
model,
scorer,
Feature("ans", int, 1),
CSVSource(filename="dataset.csv"),
),
)
# Make prediction
for i, features, prediction in predict(model, {"f1": 0.8, "ans": 0}):
features["ans"] = prediction["ans"]["value"]
print(features)
Args
predict: Feature
Label or the value to be predicted
features: List of features
Features to train on
location: Path
Location where state should be saved
dffml_model_xgboost¶
pip install dffml-model-xgboost
OSX Installation
XGBoost on OSX requires libomp
$ brew install libomp
xgbclassifier¶
Official
Model using xgboost to perform classification prediction via gradient boosted trees. XGBoost is a leading software library for working with standard tabular data (the type of data you store in Pandas DataFrames, as opposed to more exotic types of data like images and videos). With careful parameter tuning, you can train highly accurate models.
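For orientation, this model is built on the xgboost package's scikit-learn style API; a minimal direct-use sketch (parameters chosen to mirror the CLI example below):
import xgboost as xgb
from sklearn.datasets import load_iris
# Fit a gradient boosted tree classifier directly with xgboost
X, y = load_iris(return_X_y=True)
clf = xgb.XGBClassifier(max_depth=3, learning_rate=0.01, n_estimators=200)
clf.fit(X, y)
print(clf.score(X, y))  # mean accuracy on the training data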
Examples¶
Command line usage
First download the training and test files and change the headers to the format DFFML expects. The first row of each file encodes the classifications; we replace it with CSV headers naming the columns.
$ wget http://download.tensorflow.org/data/iris_training.csv
$ wget http://download.tensorflow.org/data/iris_test.csv
$ sed -i 's/.*setosa,versicolor,virginica/SepalLength,SepalWidth,PetalLength,PetalWidth,classification/g' iris_training.csv iris_test.csv
Run the train command
$ dffml train \
-sources train=csv \
-source-filename iris_training.csv \
-model xgbclassifier \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
PetalWidth:float:1 \
-model-predict classification \
-model-location model \
-model-max_depth 3 \
-model-learning_rate 0.01 \
-model-n_estimators 200 \
-model-reg_lambda 1 \
-model-reg_alpha 0 \
-model-gamma 0 \
-model-colsample_bytree 0 \
-model-subsample 1
Assess the accuracy
$ dffml accuracy \
-sources train=csv \
-source-filename iris_test.csv \
-model xgbclassifier \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
PetalWidth:float:1 \
-model-predict classification \
-model-location model \
-features classification \
-scorer clf
Make predictions
$ dffml predict all \
-sources train=csv \
-source-filename iris_test.csv \
-model xgbclassifier \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
PetalWidth:float:1 \
-model-predict classification \
-model-location model
Python usage
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from dffml import Feature, Features
from dffml.noasync import train, score
from dffml.accuracy import ClassificationAccuracy
from dffml_model_xgboost.xgbclassifier import (
XGBClassifierModel,
XGBClassifierModelConfig,
)
iris = load_iris()
y = iris["target"]
X = iris["data"]
trainX, testX, trainy, testy = train_test_split(
X, y, test_size=0.1, random_state=123
)
# Configure the model
model = XGBClassifierModel(
XGBClassifierModelConfig(
features=Features(Feature("data", float,)),
predict=Feature("target", float, 1),
location="model",
max_depth=3,
learning_rate=0.01,
n_estimators=200,
reg_lambda=1,
reg_alpha=0,
gamma=0,
colsample_bytree=0,
subsample=1,
)
)
# Train the model
train(model, *[{"data": x, "target": y} for x, y in zip(trainX, trainy)])
# Assess accuracy
scorer = ClassificationAccuracy()
print(
"Test accuracy:",
score(
model,
scorer,
Feature("target", float, 1),
*[{"data": x, "target": y} for x, y in zip(testX, testy)],
),
)
print(
"Training accuracy:",
score(
model,
scorer,
Feature("target", float, 1),
*[{"data": x, "target": y} for x, y in zip(trainX, trainy)],
),
)
Output
Test accuracy: 0.933333333333333
Training accuracy: 0.9703703703703703
Args
location: Path
Location where model should be saved
features: List of features
Features on which we train the model
predict: Feature
Value to be predicted
learning_rate: float
default: 0.3
Learning rate to train with
n_estimators: Integer
default: 100
Number of gradient boosted trees. Equivalent to the number of boosting rounds
max_depth: Integer
default: 6
Maximum tree depth for base learners
objective: String
default: multi:softmax
Objective in training
subsample: float
default: 1
Subsample ratio of the training instance
gamma: float
default: 0
Minimum loss reduction required to make a further partition on a leaf node
n_jobs: Integer
default: -1
Number of parallel threads used to run xgboost
colsample_bytree: float
default: 1
Subsample ratio of columns when constructing each tree
booster: String
default: gbtree
Specify which booster to use: gbtree, gblinear or dart
min_child_weight: float
default: 1
Minimum sum of instance weight (hessian) needed in a child
reg_lambda: float
default: 1
L2 regularization term on weights. Increasing this value will make model more conservative
reg_alpha: float
default: 0
L1 regularization term on weights. Increasing this value will make model more conservative
xgbregressor¶
Official
Model using xgboost to perform regression prediction via gradient boosted trees. XGBoost is a leading software library for working with standard tabular data (the type of data you store in Pandas DataFrames, as opposed to more exotic types of data like images and videos). With careful parameter tuning, you can train highly accurate models.
Examples¶
Command line usage
First download the training and test files, change the headers to DFFML format.
$ wget http://download.tensorflow.org/data/iris_training.csv
$ wget http://download.tensorflow.org/data/iris_test.csv
$ sed -i 's/.*setosa,versicolor,virginica/SepalLength,SepalWidth,PetalLength,PetalWidth,classification/g' iris_training.csv iris_test.csv
Run the train command
$ dffml train \
-sources train=csv \
-source-filename iris_training.csv \
-model xgbregressor \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
PetalWidth:float:1 \
-model-predict classification \
-model-location model \
-model-max_depth 3 \
-model-learning_rate 0.01 \
-model-n_estimators 200 \
-model-reg_lambda 1 \
-model-reg_alpha 0 \
-model-gamma 0 \
-model-colsample_bytree 0 \
-model-subsample 1
Assess the accuracy
$ dffml accuracy \
-sources train=csv \
-source-filename iris_test.csv \
-model xgbregressor \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
PetalWidth:float:1 \
-model-predict classification \
-model-location model \
-features classification \
-scorer mse
Output
accuracy: 0.8841466984766406
Make predictions
$ dffml predict all \
-sources train=csv \
-source-filename iris_test.csv \
-model xgbregressor \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
PetalWidth:float:1 \
-model-predict classification \
-model-location model
Python usage
run.py
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from dffml import Feature, Features
from dffml.noasync import train, score
from dffml_model_xgboost.xgbregressor import (
XGBRegressorModel,
XGBRegressorModelConfig,
)
from dffml.accuracy import MeanSquaredErrorAccuracy
diabetes = load_diabetes()
y = diabetes["target"]
X = diabetes["data"]
trainX, testX, trainy, testy = train_test_split(
X, y, test_size=0.1, random_state=123
)
# Configure the model
model = XGBRegressorModel(
XGBRegressorModelConfig(
features=Features(Feature("data", float, 10)),
predict=Feature("target", float, 1),
location="model",
max_depth=3,
learning_rate=0.05,
n_estimators=400,
reg_lambda=10,
reg_alpha=0,
gamma=10,
colsample_bytree=0.3,
subsample=0.8,
)
)
# Train the model
train(model, *[{"data": x, "target": y} for x, y in zip(trainX, trainy)])
# Assess accuracy
scorer = MeanSquaredErrorAccuracy()
print(
"Test accuracy:",
score(
model,
scorer,
Feature("target", float, 1),
*[{"data": x, "target": y} for x, y in zip(testX, testy)],
),
)
print(
"Training accuracy:",
score(
model,
scorer,
Feature("target", float, 1),
*[{"data": x, "target": y} for x, y in zip(trainX, trainy)],
),
)
Output
$ python run.py
Test accuracy: 0.6669655406927468
Training accuracy: 0.819782501866115
Args
location: Path
Location where model should be saved
features: List of features
Features on which we train the model
predict: Feature
Value to be predicted
learning_rate: float
default: 0.05
Learning rate to train with
n_estimators: Integer
default: 1000
Number of gradient boosted trees. Equivalent to the number of boosting rounds
max_depth: Integer
default: 6
Maximum tree depth for base learners
subsample: float
default: 1
Subsample ratio of the training instance
gamma: float
default: 0
Minimum loss reduction required to make a further partition on a leaf node
n_jobs: Integer
default: -1
Number of parallel threads used to run xgboost
colsample_bytree: float
default: 1
Subsample ratio of columns when constructing each tree
booster: String
default: gbtree
Specify which booster to use: gbtree, gblinear or dart
min_child_weight: float
default: 0
Minimum sum of instance weight (hessian) needed in a child
reg_lambda: float
default: 1
L2 regularization term on weights. Increasing this value will make model more conservative
reg_alpha: float
default: 0
L1 regularization term on weights. Increasing this value will make model more conservative
dffml_model_vowpalWabbit¶
pip install dffml-model-vowpalWabbit
vwmodel¶
Official
Implemented using Vowpal Wabbit.
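Vowpal Wabbit consumes examples in its own text format, label | feature:value feature:value .... In the datasets below, column A already holds the VW-formatted feature string (note the leading |) and column B holds the label, which is why the commands pass -model-noconvert to skip DFFML's conversion of record features into VW input format.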
First we create the training and testing datasets
cat > train.csv << EOF
A,B
| price:.23 sqft:.25 age:.05 2006,-1
| price:.18 sqft:.15 age:.35 1976,1
| price:.53 sqft:.32 age:.87 1924,-1
EOF
cat > test.csv << EOF
A
| price:.46 sqft:.4 age:.10 1924
EOF
Train the model
dffml train \
-model vwmodel \
-model-features \
A:str:1 \
-model-predict \
B:int:1 \
-model-noconvert \
-sources f=csv \
-source-filename train.csv \
-model-location tempdir
Assess the accuracy
dffml accuracy \
-model vwmodel \
-model-features \
A:str:1 \
-model-predict \
B:int:1 \
-model-noconvert \
-features B:int:1 \
-scorer mse \
-sources f=csv \
-source-filename train.csv \
-model-location tempdir
Output
0.38683876649129145
Make a prediction
dffml predict all \
-model vwmodel \
-model-features \
A:str:1 \
-model-predict \
B:int:1 \
-model-noconvert \
-sources f=csv \
-source-filename test.csv \
-model-location tempdir
Output
[
{
"extra": {},
"features": {
"A": "| price:.46 sqft:.4 age:.10 1924"
},
"key": "0",
"last_updated": "2020-05-29T16:36:57Z",
"prediction": {
"B": {
"confidence": 0.38683876649129145,
"value": 0.0
}
}
}
]
Args
features: List of features
predict: Feature
Feature to predict
location: Path
Location where state should be saved
class_cost: List of features
default: None
Features named Cost_{class}, containing the cost of that class for each input example; used when csoaa is used
task: String
default: regression
Task to perform, possible values are classification, regression
use_binary_label: String
default: False
Convert target labels to -1 and 1 for binary classification
vwcmd: List of strings
default: []
Command Line Arguments as per vowpal wabbit convention
namespace: List of strings
default: []
Namespace for input features. Should be in format {namespace}_{feature name}
importance: Feature
default: None
Feature containing importance of each example, used in conversion of input data to vowpal wabbit input format
base: Feature
default: None
Feature containing base for each example, used for residual regression
tag: Feature
default: None
Feature to be used as tag in conversion of data to vowpal wabbit input format
noconvert: String
default: False
Do not convert record features to vowpal wabbit input format
dffml_model_scikit¶
pip install dffml-model-scikit
Machine learning models implemented with scikit-learn. Models are saved under the model location directory, in subdirectories named after the hash of their feature names.
General Usage:
Training:
$ dffml train \
-model SCIKIT_MODEL_ENTRYPOINT \
-model-features FEATURE_DEFINITION \
-model-predict TO_PREDICT \
-model-location MODEL_DIRECTORY \
-model-SCIKIT_PARAMETER_NAME SCIKIT_PARAMETER_VALUE \
-sources f=TRAINING_DATA_SOURCE_TYPE \
-source-filename TRAINING_DATA_FILE_NAME \
-log debug
Testing and Accuracy:
$ dffml accuracy \
-model SCIKIT_MODEL_ENTRYPOINT \
-model-features FEATURE_DEFINITION \
-model-predict TO_PREDICT \
-model-location MODEL_DIRECTORY \
-features TO_PREDICT \
-sources f=TESTING_DATA_SOURCE_TYPE \
-source-filename TESTING_DATA_FILE_NAME \
-scorer ACCURACY_SCORER \
-log debug
Predicting with trained model:
$ dffml predict all \
-model SCIKIT_MODEL_ENTRYPOINT \
-model-features FEATURE_DEFINITION \
-model-predict TO_PREDICT \
-model-location MODEL_DIRECTORY \
-sources f=PREDICT_DATA_SOURCE_TYPE \
-source-filename PREDICT_DATA_FILE_NAME \
-log debug
Models Available:

Type | Model | Entrypoint | Multi-Output
---|---|---|---
Regression | LinearRegression | scikitlr | Yes
 | ElasticNet | scikiteln | Yes
 | RandomForestRegressor | scikitrfr | Yes
 | BayesianRidge | scikitbyr | Yes
 | Lasso | scikitlas | Yes
 | ARDRegression | scikitard | Yes
 | RANSACRegressor | scikitrsc | Yes
 | DecisionTreeRegressor | scikitdtr | Yes
 | GaussianProcessRegressor | scikitgpr | Yes
 | OrthogonalMatchingPursuit | scikitomp | Yes
 | Lars | scikitlars | Yes
 | Ridge | scikitridge | Yes
Classification | KNeighborsClassifier | scikitknn | Yes
 | AdaBoostClassifier | scikitadaboost | Yes
 | GaussianProcessClassifier | scikitgpc | Yes
 | DecisionTreeClassifier | scikitdtc | Yes
 | RandomForestClassifier | scikitrfc | Yes
 | QuadraticDiscriminantAnalysis | scikitqda | Yes
 | MLPClassifier | scikitmlp | Yes
 | GaussianNB | scikitgnb | Yes
 | SVC | scikitsvc | Yes
 | LogisticRegression | scikitlor | Yes
 | GradientBoostingClassifier | scikitgbc | Yes
 | BernoulliNB | scikitbnb | Yes
 | ExtraTreesClassifier | scikitetc | Yes
 | BaggingClassifier | scikitbgc | Yes
 | LinearDiscriminantAnalysis | scikitlda | Yes
 | MultinomialNB | scikitmnb | Yes
Clustering | KMeans | scikitkmeans | No
 | Birch | scikitbirch | No
 | MiniBatchKMeans | scikitmbkmeans | No
 | AffinityPropagation | scikitap | No
 | MeanShift | scikitms | No
 | SpectralClustering | scikitsc | No
 | AgglomerativeClustering | scikitac | No
 | OPTICS | scikitoptics | No
Scorers Available:

Type | Scorer | Entrypoint | Multi-Output
---|---|---|---
Regression | Explained Variance Score | exvscore | Yes
 | Max Error | maxerr | No
 | Mean Absolute Error | meanabserr | Yes
 | Mean Squared Error | meansqrerr | Yes
 | Mean Squared Log Error | meansqrlogerr | Yes
 | Median Absolute Error | medabserr | Yes
 | R2 Score | r2score | Yes
 | Mean Poisson Deviance | meanpoidev | No
 | Mean Gamma Deviance | meangammadev | No
 | Mean Absolute Percentage Error | meanabspererr | Yes
Classification | Accuracy Score | acscore | Yes
 | Balanced Accuracy Score | bacscore | Yes
 | Top K Accuracy Score | topkscore | Yes
 | Average Precision Score | avgprescore | Yes
 | Brier Score Loss | brierscore | Yes
 | F1 Score | f1score | Yes
 | Log Loss | logloss | Yes
 | Precision Score | prescore | Yes
 | Recall Score | recallscore | Yes
 | Jaccard Score | jacscore | Yes
 | Roc Auc Score | rocaucscore | Yes
Clustering | Adjusted Mutual Info Score | adjmutinfoscore | No
 | Adjusted Rand Score | adjrandscore | No
 | Completeness Score | complscore | No
 | Fowlkes Mallows Score | fowlmalscore | No
 | Homogeneity Score | homoscore | No
 | Mutual Info Score | mutinfoscore | No
 | Normalized Mutual Info Score | normmutinfoscore | No
 | Rand Score | randscore | No
 | V Measure Score | vmscore | No
Supervised | Model’s Default Score | skmodelscore | Yes
Usage Example:
The example below uses the LinearRegression model from the command line. Consider a simple dataset:
Years of Experience | Expertise | Trust Factor | Salary
---|---|---|---
0 | 1 | 0.2 | 10
1 | 3 | 0.4 | 20
2 | 5 | 0.6 | 30
3 | 7 | 0.8 | 40
4 | 9 | 1.0 | 50
5 | 11 | 1.2 | 60
First we create the files
cat > train.csv << EOF
Years,Expertise,Trust,Salary
0,1,0.1,10
1,3,0.2,20
2,5,0.3,30
3,7,0.4,40
EOF
cat > test.csv << EOF
Years,Expertise,Trust,Salary
4,9,0.5,50
5,11,0.6,60
EOF
Train the model
dffml train \
-model scikitlr \
-model-features Years:int:1 Expertise:int:1 Trust:float:1 \
-model-predict Salary:float:1 \
-model-location tempdir \
-sources f=csv \
-source-filename train.csv
Assess accuracy
dffml accuracy \
-model scikitlr \
-model-features Years:int:1 Expertise:int:1 Trust:float:1 \
-model-predict Salary:float:1 \
-model-location tempdir \
-features Salary:float:1 \
-scorer mse \
-sources f=csv \
-source-filename test.csv
Output:
1.0
Make a prediction
echo -e 'Years,Expertise,Trust\n6,13,0.7\n' | \
dffml predict all \
-model scikitlr \
-model-features Years:int:1 Expertise:int:1 Trust:float:1 \
-model-predict Salary:float:1 \
-model-location tempdir \
-sources f=csv \
-source-filename /dev/stdin
Output:
[
{
"extra": {},
"features": {
"Expertise": 13,
"Trust": 0.7,
"Years": 6
},
"key": "0",
"last_updated": "2020-03-01T22:26:46Z",
"prediction": {
"Salary": {
"confidence": 1.0,
"value": 70.0
}
}
}
]
Example usage of Linear Regression Model using python API:
from dffml import CSVSource, Features, Feature
from dffml.noasync import train, score, predict
from dffml_model_scikit import LinearRegressionModel
from dffml.accuracy import MeanSquaredErrorAccuracy
model = LinearRegressionModel(
features=Features(
Feature("Years", int, 1),
Feature("Expertise", int, 1),
Feature("Trust", float, 1),
),
predict=Feature("Salary", int, 1),
location="tempdir",
)
# Train the model
train(model, "train.csv")
# Assess accuracy (alternate way of specifying data source)
scorer = MeanSquaredErrorAccuracy()
print(
"Accuracy:",
score(
model,
scorer,
Feature("Salary", int, 1),
CSVSource(filename="test.csv"),
),
)
# Make prediction
for i, features, prediction in predict(
model,
{"Years": 6, "Expertise": 13, "Trust": 0.7},
{"Years": 7, "Expertise": 15, "Trust": 0.8},
):
features["Salary"] = prediction["Salary"]["value"]
print(features)
Example below uses KMeans Clustering Model on a small randomly generated dataset.
$ cat > train.csv << EOF
Col1, Col2, Col3, Col4
5.05776417, 8.55128116, 6.15193196, -8.67349666
3.48864265, -7.25952218, -4.89216256, 4.69308946
-8.16207603, 5.16792984, -2.66971993, 0.2401882
6.09809669, 8.36434181, 6.70940915, -7.91491768
-9.39122566, 5.39133807, -2.29760281, -1.69672981
0.48311336, 8.19998973, 7.78641979, 7.8843821
2.22409135, -7.73598586, -4.02660224, 2.82101794
2.8137247 , 8.36064298, 7.66196849, 3.12704676
EOF
$ cat > test.csv << EOF
Col1, Col2, Col3, Col4, cluster
-10.16770144, 2.73057215, -1.49351481, 2.43005691, 6
3.59705381, -4.76520663, -3.34916068, 5.72391486, 1
4.01612313, -4.641852 , -4.77333308, 5.87551683, 0
EOF
$ dffml train \
-model scikitkmeans \
-model-features Col1:float:1 Col2:float:1 Col3:float:1 Col4:float:1 \
-model-location tempdir \
-sources f=csv \
-source-filename train.csv \
-source-readonly \
-log debug
$ dffml accuracy \
-model scikitkmeans \
-model-features Col1:float:1 Col2:float:1 Col3:float:1 Col4:float:1 \
-model-predict cluster:int:1 \
-model-location tempdir \
-features cluster:int:1 \
-sources f=csv \
-source-filename test.csv \
-source-readonly \
-scorer skmodelscore \
-log debug
0.6365141682948129
$ echo -e 'Col1,Col2,Col3,Col4\n6.09809669,8.36434181,6.70940915,-7.91491768\n' | \
dffml predict all \
-model scikitkmeans \
-model-features Col1:float:1 Col2:float:1 Col3:float:1 Col4:float:1 \
-model-location tempdir \
-sources f=csv \
-source-filename /dev/stdin \
-source-readonly \
-log debug
[
{
"extra": {},
"features": {
"Col1": 6.09809669,
"Col2": 8.36434181,
"Col3": 6.70940915,
"Col4": -7.91491768
},
"last_updated": "2020-01-12T22:51:15Z",
"prediction": {
"confidence": 0.6365141682948129,
"value": 2
},
"key": "0"
}
]
Example usage of KMeans Clustering Model using python API:
from dffml import CSVSource, Features, Feature
from dffml.noasync import train, score, predict
from dffml_model_scikit import KMeansModel
from dffml_model_scikit import MutualInfoScoreScorer
model = KMeansModel(
features=Features(
Feature("Col1", float, 1),
Feature("Col2", float, 1),
Feature("Col3", float, 1),
Feature("Col4", float, 1),
),
predict=Feature("cluster", int, 1),
location="tempdir",
)
# Train the model
train(model, "train.csv")
# Assess accuracy (alternate way of specifying data source)
scorer = MutualInfoScoreScorer()
print("Accuracy:", score(model, scorer, Feature("cluster", int, 1), CSVSource(filename="test.csv")))
# Make prediction
for i, features, prediction in predict(
model,
{"Col1": 6.09809669, "Col2": 8.36434181, "Col3": 6.70940915, "Col4": -7.91491768},
):
features["cluster"] = prediction["cluster"]["value"]
print(features)
NOTE: Transductive clusterers (scikitsc, scikitac, scikitoptics) cannot handle unseen data. Ensure that predict and accuracy for these algorithms use the training data.
Args
predict: Feature
Label or the value to be predicted
Only used by classification and regression models
features: List of features
Features to train on
location: Path
Location where state should be saved
dffml_model_daal4py¶
pip install dffml-model-daal4py
daal4pylr¶
Official
Implemented using daal4py.
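For orientation, the wrapper's training and prediction map onto daal4py's linear regression API; a minimal sketch of the direct calls (following daal4py's own linear regression examples):
import numpy as np
import daal4py as d4p
# A few rows from train.csv below
X = np.array([[12.4], [14.3], [14.5], [14.9]])
y = np.array([[11.2], [12.5], [12.7], [13.1]])
# Train, then predict using the fitted model object
train_result = d4p.linear_regression_training().compute(X, y)
predict_result = d4p.linear_regression_prediction().compute(
    np.array([[16.0]]), train_result.model
)
print(predict_result.prediction)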
First we create the training and testing datasets
train.csv
f1,ans
12.4,11.2
14.3,12.5
14.5,12.7
14.9,13.1
16.1,14.1
16.9,14.8
16.5,14.4
15.4,13.4
17.0,14.9
17.9,15.6
18.8,16.4
20.3,17.7
22.4,19.6
19.4,16.9
15.5,14.0
16.7,14.6
test.csv
f1,ans
18.8,16.4
20.3,17.7
22.4,19.6
19.4,16.9
15.5,14.0
16.7,14.6
Train the model
$ dffml train \
-model daal4pylr \
-model-features f1:float:1 \
-model-predict ans:int:1 \
-model-location tempdir \
-sources f=csv \
-source-filename train.csv
Assess the accuracy
$ dffml accuracy \
-model daal4pylr \
-model-features f1:float:1 \
-model-predict ans:int:1 \
-model-location tempdir \
-features ans:int:1 \
-sources f=csv \
-source-filename test.csv \
-scorer mse
Output
0.6666666666666666
Make a prediction
$ echo -e 'f1,ans\n0.8,1\n' | \
dffml predict all \
-model daal4pylr \
-model-features f1:float:1 \
-model-predict ans:int:1 \
-model-location tempdir \
-sources f=csv \
-source-filename /dev/stdin
[
{
"extra": {},
"features": {
"ans": 1,
"f1": 0.8
},
"key": "0",
"last_updated": "2020-07-22T02:53:11Z",
"prediction": {
"ans": {
"confidence": null,
"value": 1.1907472649730522
}
}
}
]
Example usage of daal4py Linear Regression model using python API
run.py
from dffml import CSVSource, Features, Feature
from dffml.noasync import train, score, predict
from dffml_model_daal4py.daal4pylr import DAAL4PyLRModel
from dffml.accuracy import MeanSquaredErrorAccuracy
model = DAAL4PyLRModel(
features=Features(Feature("f1", float, 1)),
predict=Feature("ans", int, 1),
location="tempdir",
)
# Train the model
train(model, "train.csv")
# Assess accuracy (alternate way of specifying data source)
scorer = MeanSquaredErrorAccuracy()
print(
"Accuracy:",
score(
model, scorer, Feature("ans", int, 1), CSVSource(filename="test.csv")
),
)
# Make prediction
for i, features, prediction in predict(model, {"f1": 0.8, "ans": 0}):
features["ans"] = prediction["ans"]["value"]
print(features)
Run the file
$ python run.py
Args
predict: Feature
Label or the value to be predicted
features: List of features
Features to train on. Only 1 allowed
location: Path
Location where state should be saved
dffml_model_pytorch¶
pip install dffml-model-pytorch
Machine learning models implemented with PyTorch. Models are saved as model.pt under the model location directory.
General Usage:
Training:
$ dffml train \
-model PYTORCH_MODEL_ENTRYPOINT \
-model-features FEATURE_DEFINITION \
-model-predict TO_PREDICT \
-model-location MODEL_LOCATION \
-model-CONFIGS CONFIG_VALUES \
-sources f=TRAINING_DATA_SOURCE_TYPE \
-source-CONFIGS TRAINING_DATA \
-log debug
Testing and Accuracy:
$ dffml accuracy \
-model PYTORCH_MODEL_ENTRYPOINT \
-model-features FEATURE_DEFINITION \
-model-predict TO_PREDICT \
-model-location MODEL_LOCATION \
-model-CONFIGS CONFIG_VALUES \
-features TO_PREDICT \
-sources f=TESTING_DATA_SOURCE_TYPE \
-source-CONFIGS TESTING_DATA \
-log debug
Predicting with trained model:
$ dffml predict all \
-model PYTORCH_MODEL_ENTRYPOINT \
-model-features FEATURE_DEFINITION \
-model-predict TO_PREDICT \
-model-location MODEL_LOCATION \
-model-CONFIGS CONFIG_VALUES \
-sources f=PREDICT_DATA_SOURCE_TYPE \
-source-CONFIGS PREDICTION_DATA \
-log debug
Pre-Trained Models Available:

Type | Model | Entrypoint
---|---|---
Classification | AlexNet | alexnet
 | DenseNet-121 | densenet121
 | DenseNet-161 | densenet161
 | DenseNet-169 | densenet169
 | DenseNet-201 | densenet201
 | MnasNet 0.5 | mnasnet0_5
 | MnasNet 1.0 | mnasnet1_0
 | MobileNet V2 | mobilenet_v2
 | VGG-11 | vgg11
 | VGG-11 with batch normalization | vgg11_bn
 | VGG-13 | vgg13
 | VGG-13 with batch normalization | vgg13_bn
 | VGG-16 | vgg16
 | VGG-16 with batch normalization | vgg16_bn
 | VGG-19 | vgg19
 | VGG-19 with batch normalization | vgg19_bn
 | GoogleNet | googlenet
 | Inception V3 | inception_v3
 | ResNet-18 | resnet18
 | ResNet-34 | resnet34
 | ResNet-50 | resnet50
 | ResNet-101 | resnet101
 | ResNet-152 | resnet152
 | Wide ResNet-101-2 | wide_resnet101_2
 | Wide ResNet-50-2 | wide_resnet50_2
 | ShuffleNet V2 0.5 | shufflenet_v2_x0_5
 | ShuffleNet V2 1.0 | shufflenet_v2_x1_0
 | ResNext-101-32x8D | resnext101_32x8d
 | ResNext-50-32x4D | resnext50_32x4d
Usage Example:
The example below uses the ResNet-18 model from the command line.
Let us take a simple example: classifying images of ants and bees.
First, we download the dataset and verify it with sha384sum
curl -LO https://download.pytorch.org/tutorial/hymenoptera_data.zip
sha384sum -c - << EOF
491db45cfcab02d99843fbdcf0574ecf99aa4f056d52c660a39248b5524f9e6e8f896d9faabd27ffcfc2eaca0cec6f39  hymenoptera_data.zip
EOF
hymenoptera_data.zip: OK
Unzip the file
unzip hymenoptera_data.zip
We first create a YAML file defining the layer(s) that will replace the last layer of the network architecture (a sketch of the equivalent PyTorch module follows the YAML)
layers.yaml
linear1:
layer_type: Linear
in_features: 512
out_features: 256
relu:
layer_type: ReLU
dropout:
layer_type: Dropout
p: 0.2
linear2:
layer_type: Linear
in_features: 256
out_features: 2
logsoftmax:
layer_type: LogSoftmax
dim: 1
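For reference, the YAML above corresponds roughly to this PyTorch module (a sketch of the replacement head, not code the model generates):
import torch.nn as nn
# Classifier head replacing ResNet-18's final 512-feature linear layer
head = nn.Sequential(
    nn.Linear(in_features=512, out_features=256),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(in_features=256, out_features=2),
    nn.LogSoftmax(dim=1),
)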
Train the model
dffml train \
-model resnet18 \
-model-add_layers \
-model-layers @layers.yaml \
-model-clstype str \
-model-classifications ants bees \
-model-location resnet18_model \
-model-imageSize 224 \
-model-epochs 5 \
-model-batch_size 32 \
-model-enableGPU \
-model-features image:int:$((500*500)) \
-model-predict label:str:1 \
-sources f=dir \
-source-foldername hymenoptera_data/train \
-source-feature image \
-source-labels ants bees \
-log critical
Assess accuracy
dffml accuracy \
-model resnet18 \
-model-add_layers \
-model-layers @layers.yaml \
-model-clstype str \
-model-classifications ants bees \
-model-location resnet18_model \
-model-imageSize 224 \
-model-batch_size 32 \
-model-enableGPU \
-model-features image:int:$((500*500)) \
-model-predict label:str:1 \
-features label:str:1 \
-sources f=dir \
-source-foldername hymenoptera_data/val \
-source-feature image \
-source-labels ants bees \
-scorer pytorchscore \
-log critical
Output:
0.9215686274509803
Create a CSV file with the names of the images to predict on, i.e. whether each is an ant or a bee.
cat > unknown_images.csv << EOF
key,image
ants1,hymenoptera_data/val/ants/Ant-1818.jpg
bee1,hymenoptera_data/val/bees/10870992_eebeeb3a12.jpg
bee2,hymenoptera_data/val/bees/abeja.jpg
ants2,hymenoptera_data/val/ants/desert_ant.jpg
EOF
Make the predictions
dffml predict all \
-model resnet18 \
-model-add_layers \
-model-layers @layers.yaml \
-model-clstype str \
-model-classifications ants bees \
-model-location resnet18_model \
-model-imageSize 224 \
-model-enableGPU \
-model-features image:int:$((500*500)) \
-model-predict label:str:1 \
-sources f=csv \
-source-filename unknown_images.csv \
-source-loadfiles image \
-log critical \
-pretty
Output:
Key: ants1
Record Features
+----------------------------------------------------------------------------------------------------------------------------------------------+
| image | 59, 66, 83, 60, 70, 87, 57, 72, 88, 53, 74, 89 ... (length:263250) |
+----------------------------------------------------------------------------------------------------------------------------------------------+
Prediction
+----------------------------------------------------------------------------------------------------------------------------------------------+
| label |
+----------------------------------------------------------------------------------------------------------------------------------------------+
| Value: ants | Confidence: 0.9920881390571594 |
+----------------------------------------------------------------------------------------------------------------------------------------------+
Key: bee1
Record Features
+----------------------------------------------------------------------------------------------------------------------------------------------+
| image | 63, 114, 146, 63, 114, 146, 63, 114, 146, 63, ... (length:696000) |
+----------------------------------------------------------------------------------------------------------------------------------------------+
Prediction
+----------------------------------------------------------------------------------------------------------------------------------------------+
| label |
+----------------------------------------------------------------------------------------------------------------------------------------------+
| Value: bees | Confidence: 0.6108130216598511 |
+----------------------------------------------------------------------------------------------------------------------------------------------+
Key: bee2
Record Features
+----------------------------------------------------------------------------------------------------------------------------------------------+
| image | 103, 253, 254, 98, 254, 254, 91, 255, 254, 89, ... (length:359100) |
+----------------------------------------------------------------------------------------------------------------------------------------------+
Prediction
+----------------------------------------------------------------------------------------------------------------------------------------------+
| label |
+----------------------------------------------------------------------------------------------------------------------------------------------+
| Value: bees | Confidence: 0.9162276387214661 |
+----------------------------------------------------------------------------------------------------------------------------------------------+
Key: ants2
Record Features
+----------------------------------------------------------------------------------------------------------------------------------------------+
| image | 69, 121, 162, 44, 96, 137, 41, 90, 130, 68, 11 ... (length:1563912) |
+----------------------------------------------------------------------------------------------------------------------------------------------+
Prediction
+----------------------------------------------------------------------------------------------------------------------------------------------+
| label |
+----------------------------------------------------------------------------------------------------------------------------------------------+
| Value: ants | Confidence: 0.9368477463722229 |
+----------------------------------------------------------------------------------------------------------------------------------------------+
alexnet¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
densenet121¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
densenet161¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
densenet169¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
densenet201¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
googlenet¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
inception_v3¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
mnasnet0_5¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
mnasnet1_0¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
mobilenet_v2¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
pytorchnet¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
network: typing.Union[dffml_model_pytorch.pytorch_net.Network, torch.nn.modules.module.Module]
default: None
Model
resnet101¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
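When add_layers is set, the layers config supplies the replacement for the pretrained network's final layer. A minimal sketch of such a head in plain torch.nn; the value 2048 matches the input width of ResNet-101's final fully connected layer, while the hidden width and class count are illustrative assumptions:
import torch.nn as nn

# A replacement classification head of the kind `layers` accepts.
# ResNet-101's final fully connected layer takes 2048 input features;
# the hidden width (256) and the two output classes are made up here.
head = nn.Sequential(
    nn.Linear(2048, 256),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, 2),
)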
resnet152¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
resnet18¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
resnet34¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
resnet50¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
resnext101_32x8d¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
resnext50_32x4d¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
shufflenet_v2_x0_5¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
shufflenet_v2_x1_0¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
vgg11¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
vgg11_bn¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
vgg13¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
vgg13_bn¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
vgg16¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
vgg16_bn¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
vgg19¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
vgg19_bn¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
wide_resnet101_2¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
wide_resnet50_2¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
dffml_model_tensorflow¶
pip install dffml-model-tensorflow
Note
It’s important to keep the hidden layer config and feature config the same across invocations of train, predict, and accuracy methods.
Models are saved under the directory parameter in subdirectories named after the hash of their feature names and hidden layer config. This means that if any of those parameters change between invocations, the model is told to look for a different saved model.
tfdnnc¶
Official
Implemented using Tensorflow’s DNNClassifier.
First we create the training and testing datasets
wget http://download.tensorflow.org/data/iris_training.csv
echo '376c8ea3b7f85caff195b4abe62f34e8f4e7aece8bd087bbd746518a9d1fd60ae3b4274479f88ab0aa5c839460d535ef iris_training.csv' | sha384sum -c -
sed -i 's/.*setosa,versicolor,virginica/SepalLength,SepalWidth,PetalLength,PetalWidth,classification/g' *.csv
wget http://download.tensorflow.org/data/iris_test.csv
echo '8c2cda42ce5ce6f977d17d668b1c98a45bfe320175f33e97293c62ab543b3439eab934d8e11b1208de1e4a9eb1957714 iris_test.csv' | sha384sum -c -
sed -i 's/.*setosa,versicolor,virginica/SepalLength,SepalWidth,PetalLength,PetalWidth,classification/g' *.csv
Train the model
dffml train \
-model tfdnnc \
-model-epochs 3000 \
-model-steps 20000 \
-model-predict classification:int:1 \
-model-location tempdir \
-model-classifications 0 1 2 \
-model-clstype int \
-sources iris=csv \
-source-filename iris_training.csv \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
PetalWidth:float:1 \
-log debug
Assess the accuracy
dffml accuracy \
-model tfdnnc \
-model-predict classification:int:1 \
-model-location tempdir \
-model-classifications 0 1 2 \
-model-clstype int \
-features classification:int:1 \
-scorer clf \
-sources iris=csv \
-source-filename iris_test.csv \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
PetalWidth:float:1 \
-log critical
Output
0.99996233782
Make a prediction
echo -e 'SepalLength,SepalWidth,PetalLength,PetalWidth\n5.9,3.0,4.2,1.5\n' | \
dffml predict all \
-model tfdnnc \
-model-predict classification:int:1 \
-model-location tempdir \
-model-classifications 0 1 2 \
-model-clstype int \
-sources iris=csv \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
PetalWidth:float:1 \
-source-filename /dev/stdin
Output
[
{
"extra": {},
"features": {
"PetalLength": 4.2,
"PetalWidth": 1.5,
"SepalLength": 5.9,
"SepalWidth": 3.0,
"classification": 1
},
"last_updated": "2019-07-31T02:00:12Z",
"prediction": {
"classification":
{
"confidence": 0.9999997615814209,
"value": 1
}
},
"key": "0"
},
]
Example usage of Tensorflow DNNClassifier model using python API
from dffml import CSVSource, Features, Feature
from dffml.noasync import train, score, predict
from dffml_model_tensorflow.dnnc import DNNClassifierModel
from dffml.accuracy import ClassificationAccuracy

model = DNNClassifierModel(
    features=Features(
        Feature("SepalLength", float, 1),
        Feature("SepalWidth", float, 1),
        Feature("PetalLength", float, 1),
        Feature("PetalWidth", float, 1),
    ),
    predict=Feature("classification", int, 1),
    epochs=3000,
    steps=20000,
    classifications=[0, 1, 2],
    clstype=int,
    location="tempdir",
)

# Train the model
train(model, "iris_training.csv")

# Assess accuracy (alternate way of specifying data source)
scorer = ClassificationAccuracy()
print(
    "Accuracy:",
    score(
        model,
        scorer,
        Feature("classification", int, 1),
        CSVSource(filename="iris_test.csv"),
    ),
)

# Make prediction
for i, features, prediction in predict(
    model,
    {
        "PetalLength": 4.2,
        "PetalWidth": 1.5,
        "SepalLength": 5.9,
        "SepalWidth": 3.0,
    },
    {
        "PetalLength": 5.4,
        "PetalWidth": 2.1,
        "SepalLength": 6.9,
        "SepalWidth": 3.1,
    },
):
    features["classification"] = prediction["classification"]["value"]
    print(features)
Args
predict: Feature
Feature name holding target values
features: List of features
Features to train on
location: Path
Location where state should be saved
steps: Integer
default: 3000
Number of steps to train the model
epochs: Integer
default: 30
Number of iterations to pass over all records in a source
hidden: List of integers
default: [12, 40, 15]
List length is the number of hidden layers in the network. Each entry in the list is the number of nodes in that hidden layer
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
batchsize: Integer
default: 20
Number records to pass through in an epoch
shuffle: String
default: True
Randomise order of records in a batch
tfdnnr¶
Official
Implemented using Tensorflow’s DNNEstimator.
Usage:
predict: Name of the feature we are trying to predict (the label used during training).
Generating train and test data
This creates the files train.csv and test.csv. Take a backup of any files with the same names in the directory where these commands are run, as they will be overwritten.
cat > train.csv << EOF
Feature1,Feature2,TARGET
0.93,0.68,3.89
0.24,0.42,1.75
0.36,0.68,2.75
0.53,0.31,2.00
0.29,0.25,1.32
0.29,0.52,2.14
EOF
cat > test.csv << EOF
Feature1,Feature2,TARGET
0.57,0.84,3.65
0.95,0.19,2.46
0.23,0.15,0.93
EOF
Train the model
dffml train \
-model tfdnnr \
-model-epochs 300 \
-model-steps 2000 \
-model-predict TARGET:float:1 \
-model-location tempdir \
-model-hidden 8 16 8 \
-sources s=csv \
-source-filename train.csv \
-model-features \
Feature1:float:1 \
Feature2:float:1 \
-log debug
Assess the accuracy
dffml accuracy \
-model tfdnnr \
-model-predict TARGET:float:1 \
-model-location tempdir \
-model-hidden 8 16 8 \
-features TARGET:float:1 \
-sources s=csv \
-source-filename test.csv \
-model-features \
Feature1:float:1 \
Feature2:float:1 \
-scorer mse \
-log critical
Output
0.9468210011
Make a prediction
echo -e 'Feature1,Feature2,TARGET\n0.21,0.18,0.84\n' | \
dffml predict all \
-model tfdnnr \
-model-predict TARGET:float:1 \
-model-location tempdir \
-model-hidden 8 16 8 \
-sources s=csv \
-source-filename /dev/stdin \
-model-features \
Feature1:float:1 \
Feature2:float:1 \
-log critical
Output
[
{
"extra": {},
"features": {
"Feature1": 0.21,
"Feature2": 0.18,
"TARGET": 0.84
},
"last_updated": "2019-10-24T15:26:41Z",
"prediction": {
"TARGET" : {
"confidence": null,
"value": 1.1983429193496704
}
},
"key": 0
}
]
Example usage of Tensorflow DNNEstimator model using python API
from dffml import CSVSource, Features, Feature
from dffml.noasync import train, score, predict
from dffml_model_tensorflow.dnnr import DNNRegressionModel
from dffml.accuracy import MeanSquaredErrorAccuracy

model = DNNRegressionModel(
    features=Features(
        Feature("Feature1", float, 1), Feature("Feature2", float, 1)
    ),
    predict=Feature("TARGET", float, 1),
    epochs=300,
    steps=2000,
    hidden=[8, 16, 8],
    location="tempdir",
)

# Train the model
train(model, "train.csv")

# Assess accuracy (alternate way of specifying data source)
scorer = MeanSquaredErrorAccuracy()
print(
    "Accuracy:",
    score(
        model,
        scorer,
        Feature("TARGET", float, 1),
        CSVSource(filename="test.csv"),
    ),
)

# Make prediction
for i, features, prediction in predict(
    model, {"Feature1": 0.21, "Feature2": 0.18, "TARGET": 0.84}
):
    features["TARGET"] = prediction["TARGET"]["value"]
    print(features)
The null in confidence is the expected behavior. (See TODO in predict.)
Args
predict: Feature
Feature name holding target values
features: List of features
Features to train on
location: Path
Location where state should be saved
steps: Integer
default: 3000
Number of steps to train the model
epochs: Integer
default: 30
Number of iterations to pass over all records in a source
hidden: List of integers
default: [12, 40, 15]
List length is the number of hidden layers in the network. Each entry in the list is the number of nodes in that hidden layer
dffml_model_tensorflow_hub¶
pip install dffml-model-tensorflow-hub
text_classifier¶
Official
Implemented using Tensorflow hub pretrained models.
cat > train.csv << EOF
sentence,sentiment
Life is good,1
This book is amazing,1
It's a terrible movie,2
Global warming is bad,0
I hate you!!,2
This movie is horrible,2
EOF
cat > test.csv << EOF
sentence,sentiment
I am not feeling good,0
Our trip was full of adventures,1
EOF
Train the model
dffml train \
-model text_classifier \
-model-epochs 30 \
-model-predict sentiment:int:1 \
-model-location tempdir \
-model-classifications 0 1 \
-model-clstype int \
-sources f=csv \
-source-filename train.csv \
-model-features \
sentence:str:1 \
-model-model_path "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1" \
-model-add_layers \
-model-layers "Dense(units=512, activation='relu')" "Dense(units=2, activation='softmax')" \
-log debug
Assess the accuracy
dffml accuracy \
-model text_classifier \
-model-predict sentiment:int:1 \
-model-location tempdir \
-model-classifications 0 1 \
-model-model_path "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1" \
-model-clstype int \
-features sentiment:int:1 \
-sources f=csv \
-source-filename test.csv \
-model-features \
sentence:str:1 \
-scorer textclf \
-log critical
Output
0.5
Make a prediction
dffml predict all \
-model text_classifier \
-model-predict sentiment:int:1 \
-model-location tempdir \
-model-classifications 0 1 \
-model-model_path "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1" \
-model-clstype int \
-sources f=csv \
-source-filename test.csv \
-model-features \
sentence:str:1 \
-log debug
Output
[
{
"extra": {},
"features": {
"sentence": "I am not feeling good",
"sentiment": 0
},
"key": "0",
"last_updated": "2020-05-14T20:14:30Z",
"prediction": {
"sentiment": {
"confidence": 0.9999992847442627,
"value": 1
}
}
},
{
"extra": {},
"features": {
"sentence": "Our trip was full of adventures",
"sentiment": 1
},
"key": "1",
"last_updated": "2020-05-14T20:14:30Z",
"prediction": {
"sentiment": {
"confidence": 0.9999088048934937,
"value": 1
}
}
}
]
Example usage of Tensorflow_hub Text Classifier model using python API
from dffml import CSVSource, Features, Feature
from dffml.noasync import train, score, predict
from dffml_model_tensorflow_hub.text_classifier import TextClassificationModel
from dffml_model_tensorflow_hub.text_classifier_accuracy import (
    TextClassifierAccuracy,
)

model = TextClassificationModel(
    features=Features(Feature("sentence", str, 1)),
    predict=Feature("sentiment", int, 1),
    classifications=[0, 1, 2],
    clstype=int,
    location="tempdir",
)

# Train the model
train(model, "train.csv")

# Assess accuracy (alternate way of specifying data source)
scorer = TextClassifierAccuracy()
print(
    "Accuracy:",
    score(
        model,
        scorer,
        Feature("sentiment", int, 1),
        CSVSource(filename="test.csv"),
    ),
)

# Make prediction
for i, features, prediction in predict(
    model, {"sentence": "This track is horrible"},
):
    features["sentiment"] = prediction["sentiment"]["value"]
    print(features)
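The CLI example above also passed -model-add_layers and -model-layers; the Python equivalent is sketched below. The keyword names are taken from the Args list that follows, but treating add_layers as a boolean in Python is an assumption:
from dffml import Features, Feature
from dffml_model_tensorflow_hub.text_classifier import TextClassificationModel

# Mirrors -model-add_layers / -model-layers from the CLI example.
# add_layers=True (assumed boolean form) appends the listed Keras layer
# specs, given as strings, on top of the pretrained embedding.
model = TextClassificationModel(
    features=Features(Feature("sentence", str, 1)),
    predict=Feature("sentiment", int, 1),
    classifications=[0, 1],
    clstype=int,
    location="tempdir",
    add_layers=True,
    layers=[
        "Dense(units=512, activation='relu')",
        "Dense(units=2, activation='softmax')",
    ],
)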
Args
predict: Feature
Feature name holding classification value
classifications: List of strings
Options for value of classification
features: List of features
Features to train on
location: Path
Location where state should be saved
trainable: String
default: True
Tweak pretrained model by training again
batch_size: Integer
default: 120
Batch size
max_seq_length: Integer
default: 256
Length of sentence, used in preprocessing of input for bert embedding
add_layers: String
default: False
Add layers on top of the pretrained model/layer
embedType: String
default: None
Type of pretrained embedding model, required to be set to bert to use bert pretrained embedding
layers: List of strings
default: None
Extra layers to be added on top of pretrained model
model_path: String
default: https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1
Pretrained model path/url
optimizer: String
default: adam
Optimizer used by model
metrics: String
default: accuracy
Metric used to evaluate model
clstype: Type
default: <class ‘str’>
Data type of classifications values
epochs: Integer
default: 10
Number of iterations to pass over all records in a source
dffml_model_spacy¶
pip install dffml-model-spacy
spacyner¶
Official
Implemented using spaCy statistical models.
Note
You must download en_core_web_sm
before using this model
$ python -m spacy download en_core_web_sm
First we create the training and testing datasets.
Training data:
train.json
{
"data": [
{
"sentence": "I went to London and Berlin.",
"entities": [
{
"start":10,
"end": 16,
"tag": "LOC"
},
{
"start":21,
"end": 27,
"tag": "LOC"
}
]
},
{
"sentence": "Who is Alex?",
"entities": [
{
"start":7,
"end": 11,
"tag": "PERSON"
}
]
}
]
}
Testing data:
test.json
{
"data": [
{
"sentence": "Alex went to London?"
}
]
}
Train the model
$ dffml train \
-model spacyner \
-sources s=op \
-source-opimp dffml_model_spacy.ner.utils:parser \
-source-args train.json False \
-model-model_name en_core_web_sm \
-model-location temp \
-model-n_iter 5 \
-log debug
Assess the accuracy
$ dffml accuracy \
-model spacyner \
-sources s=op \
-source-opimp dffml_model_spacy.ner.utils:parser \
-source-args train.json False \
-model-model_name en_core_web_sm \
-model-location temp \
-model-n_iter 5 \
-features tag:str:1 \
-scorer sner \
-log debug
0.0
Make a prediction
$ dffml predict all \
-model spacyner \
-sources s=op \
-source-opimp dffml_model_spacy.ner.utils:parser \
-source-args test.json True \
-model-model_name en_core_web_sm \
-model-location temp \
-model-n_iter 5 \
-log debug
[
{
"extra": {},
"features": {
"entities": [],
"sentence": "Alex went to London?"
},
"key": 0,
"last_updated": "2020-07-27T16:26:18Z",
"prediction": {
"Answer": {
"confidence": null,
"value": [
[
"Alex",
"PERSON"
],
[
"London",
"GPE"
]
]
}
}
}
]
The model can be trained on larger datasets to get the expected output. The example shown above demonstrates the command line usage of the model.
In the above train, accuracy, and predict commands, the op source is used to read and parse data from the JSON file before feeding it to the model. The function used by the op source to parse the JSON data is:
import ast
import json


def parser(json_file: str, is_predicting: bool) -> dict:
    with open(json_file) as f:
        parsed_data = {}
        data = json.load(f)["data"]
        for id, entry in enumerate(data):
            entities = []
            sentence = entry["sentence"]
            if not ast.literal_eval(is_predicting):
                for entity in entry["entities"]:
                    start = entity["start"]
                    end = entity["end"]
                    tag = entity["tag"]
                    entities.append((start, end, tag))
            parsed_data[id] = {
                "features": {"sentence": sentence, "entities": entities}
            }
    return parsed_data
The location of the function is passed using:
-source-opimp dffml_model_spacy.ner.utils:parser
And the arguments to parser are passed by:
-source-args train.json False
where train.json is the name of the file containing the training data and the bool False is the value of the flag is_predicting.
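Because the CLI passes flag values through as strings, is_predicting arrives as "False" or "True", which is why parser runs it through ast.literal_eval. A quick sketch of calling the parser directly with the same arguments the commands above use:
from dffml_model_spacy.ner.utils import parser

# The second argument is the string "False", exactly as the CLI passes it;
# parser() converts it to a bool with ast.literal_eval.
records = parser("train.json", "False")
for key, record in records.items():
    print(key, record["features"]["sentence"], record["features"]["entities"])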
Args
location: String
Output location.
model_name: String
default: None
Name of one of the trained pipelines provided by spaCy. You can find the complete list at https://spacy.io/models. Defaults to a blank 'en' model.
n_iter: Integer
default: 10
Number of training iterations
dropout: float
default: 0.5
Dropout rate to be used during training
dffml_model_autosklearn¶
pip install dffml-model-autosklearn
Follow these instructions before running the above install command to ensure that auto-sklearn installs correctly.
Ubuntu Installation
To provide a C++11 building environment and the latest SWIG version on Ubuntu, run:
$ sudo apt-get install build-essential swig
Install other PyPI dependencies with
$ python3 -m pip install cython liac-arff psutil
$ curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 python3 -m pip install
For more information about installation visit https://automl.github.io/auto-sklearn/master/installation.html#installation
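After these steps, it can be worth confirming that auto-sklearn itself imports cleanly before using the DFFML wrapper; a one-off sanity check:
# Quick sanity check that the auto-sklearn installation succeeded.
import autosklearn

print(autosklearn.__version__)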
autoclassifier¶
Official
No description
Args
features: List of features
Features to train on
predict: Feature
Label or the value to be predicted
location: Path
Location where state should be saved
time_left_for_this_task: Integer
default: 3600
Time limit in seconds for the search of appropriate models. By increasing this value, auto-sklearn has a higher chance of finding better models.
per_run_time_limit: Integer
default: None
Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.
initial_configurations_via_metalearning: Integer
default: 25
Initialize the hyperparameter optimization algorithm with this many configurations which worked well on previously seen datasets. Disable if the hyperparameter optimization algorithm should start from scratch.
ensemble_size: Integer
default: 50
Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement.
ensemble_nbest: Integer
default: 50
Only consider the ensemble_nbest models when building an ensemble.
max_models_on_disc: Integer
default: 50
Defines the maximum number of models that are kept on disk; any additional models are permanently deleted. Because of this, it sets the upper limit on how many models can be used for an ensemble. It must be an integer greater than or equal to 1. If set to None, all models are kept on disk.
seed: Integer
default: 1
Used to seed SMAC. Will determine the output file names.
memory_limit: Integer
default: 3072
Memory limit in MB for the machine learning algorithm. auto-sklearn will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB. If None is provided, no memory limit is set. In case of multi-processing, memory_limit will be per job. This memory limit also applies to the ensemble creation process.
include_estimators: typing.Any
default: None
If None, all possible estimators are used. Otherwise specifies set of estimators to use.
exclude_estimators: typing.Any
default: None
If None, all possible estimators are used. Otherwise specifies set of estimators not to use. Incompatible with include_estimators.
include_preprocessors: typing.Any
default: None
If None all possible preprocessors are used. Otherwise specifies set of preprocessors to use.
exclude_preprocessors: typing.Any
default: None
If None all possible preprocessors are used. Otherwise specifies set of preprocessors not to use. Incompatible with include_preprocessors.
resampling_strategy: String
default: holdout
How to handle overfitting; might need 'resampling_strategy_arguments'. Accepts holdout or cross-validation strategies (cross-validation requires 'folds'), or a splitter class from the scikit-learn model_selection module.
resampling_strategy_arguments: dict
default: None
train_size should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split; shuffle determines whether the data is shuffled prior to splitting it into train and validation. Other arguments are those required by the chosen class, as specified in the scikit-learn documentation. If arguments are not provided, scikit-learn defaults are used. If no defaults are available, an exception is raised. Refer to the 'n_splits' argument as 'folds'.
tmp_folder: String
default: None
Folder to store configuration output and log files; if None, automatically uses /tmp/autosklearn_tmp_$pid_$random_number.
output_folder: String
default: None
Folder to store predictions for an optional test set; if None, no output will be generated.
delete_tmp_folder_after_terminate: String
default: True
Remove tmp_folder when finished. If tmp_folder is None, tmp_dir will always be deleted.
delete_output_folder_after_terminate: String
default: True
Remove output_folder when finished. If output_folder is None, output_dir will always be deleted.
n_jobs: Integer
default: None
The number of jobs to run in parallel for fit(). -1 means using all processors. By default, Auto-sklearn uses a single core for fitting the machine learning model and a single core for fitting an ensemble. Ensemble building is not affected by n_jobs but can be controlled by the number of models in the ensemble. In contrast to most scikit-learn models, n_jobs given in the constructor is not applied to the predict() method. If dask_client is None, a new dask client is created.
dask_client: typing.Any
default: None
User-created dask client, can be used to start a dask cluster and then attach auto-sklearn to it.
disable_evaluator_output: String
default: False
If True, disable model and prediction output. Cannot be used together with ensemble building; predict() cannot be used when setting this to True. Can also be used as a list to pass more fine-grained information on what to save; the allowed elements control, for example, whether predictions on the optimization/validation set (which would later on be used to build an ensemble) are saved.
smac_scenario_args: dict
default: None
Additional arguments inserted into the scenario of SMAC. See the SMAC documentation for a list of available arguments.
get_smac_object_callback: typing.Any
default: None
Callback function to create an object of class smac.optimizer.smbo.SMBO. The function must accept the arguments scenario_dict, instances, num_params, runhistory, seed and ta. This is an advanced feature. Use only if you are familiar with SMAC.
logging_config: dict
default: None
Dictionary object specifying the logger configuration. If None, the default logging.yaml file is used, which can be found in the directory util/logging.yaml relative to the installation.
metadata_directory: String
default: None
path to the metadata directory. If None, the default directory (autosklearn.metalearning.files) is used.
metric: typing.Any
default: None
An auto-sklearn metric (see the 'Metrics' section of the auto-sklearn documentation). If None is provided, a default metric is selected depending on the task.
scoring_functions: typing.Any
default: None
List of scorers which will be calculated for each pipeline; results will be available via cv_results.
load_models: String
default: True
Whether to load the models after fitting Auto-sklearn.
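autoclassifier has no worked example on this page. By analogy with the autoregressor example below, a hedged sketch of Python usage; the class name AutoSklearnClassifierModel is inferred from the package's AutoSklearnRegressorModel, and the file and feature names are made up for illustration:
from dffml import Features, Feature
from dffml.noasync import train, predict

# Assumed import, by analogy with AutoSklearnRegressorModel below.
from dffml_model_autosklearn import AutoSklearnClassifierModel

model = AutoSklearnClassifierModel(
    features=Features(
        Feature("Feature1", float, 1), Feature("Feature2", float, 1),
    ),
    predict=Feature("TARGET", int, 1),
    location="tempdir-autoclassifier",
    time_left_for_this_task=120,
)

# Train on a CSV with Feature1, Feature2 and an integer TARGET column.
train(model, "train.csv")

# Predict on a CSV containing only Feature1 and Feature2.
for i, features, prediction in predict(model, "predict.csv"):
    features["TARGET"] = prediction["TARGET"]["value"]
    print(features)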
autoregressor¶
Official
autoregressor / AutoSklearnRegressorModel will use auto-sklearn to train a scikit-learn model for you.
This is AutoML; it will tune hyperparameters for you.
Implemented using AutoSklearnRegressor.
First we create the training and testing datasets
train.csv
Feature1,Feature2,TARGET
0.93,0.68,3.89
0.24,0.42,1.75
0.36,0.68,2.75
0.53,0.31,2.00
0.29,0.25,1.32
0.29,0.52,2.14
test.csv
Feature1,Feature2,TARGET
0.57,0.84,3.65
0.95,0.19,2.46
0.23,0.15,0.93
Train the model
$ dffml train \
-model autoregressor \
-model-predict TARGET:float:1 \
-model-clstype int \
-sources f=csv \
-source-filename train.csv \
-model-features \
Feature1:float:1 \
Feature2:float:1 \
-model-time_left_for_this_task 120 \
-model-per_run_time_limit 30 \
-model-ensemble_size 50 \
-model-delete_tmp_folder_after_terminate False \
-model-location tempdir \
-log debug
Assess the accuracy
$ dffml accuracy \
-model autoregressor \
-model-predict TARGET:float:1 \
-model-location tempdir \
-features TARGET:float:1 \
-sources f=csv \
-source-filename test.csv \
-model-features \
Feature1:float:1 \
Feature2:float:1 \
-scorer mse \
-log critical
0.9961211434899032
Make a file containing the data to predict on
predict.csv
Feature1,Feature2
0.57,0.84
Make a prediction
$ dffml predict all \
-model autoregressor \
-model-location tempdir \
-model-predict TARGET:float:1 \
-sources iris=csv \
-model-features \
Feature1:float:1 \
Feature2:float:1 \
-source-filename predict.csv
[
{
"extra": {},
"features": {
"Feature1": 0.57,
"Feature2": 0.84
},
"key": "0",
"last_updated": "2020-11-23T05:52:13Z",
"prediction": {
"TARGET": {
"confidence": NaN,
"value": 3.566799074411392
}
}
}
]
The model can be trained on larger datasets to get better accuracy. The example shown above demonstrates the command line usage of the model.
Example usage of using the model from Python
run.py
from dffml import Features, Feature
from dffml.noasync import train, score, predict
from dffml_model_autosklearn import AutoSklearnRegressorModel
from dffml.accuracy import MeanSquaredErrorAccuracy

model = AutoSklearnRegressorModel(
    features=Features(
        Feature("Feature1", float, 1), Feature("Feature2", float, 1),
    ),
    predict=Feature("TARGET", float, 1),
    location="tempdir-python",
    time_left_for_this_task=120,
)


def main():
    # Train the model
    train(model, "train.csv")

    # Assess accuracy
    scorer = MeanSquaredErrorAccuracy()
    print(
        "Accuracy:",
        score(model, scorer, Feature("TARGET", float, 1), "test.csv"),
    )

    # Make prediction
    for i, features, prediction in predict(model, "predict.csv"):
        features["TARGET"] = prediction["TARGET"]["value"]
        print(features)


if __name__ == "__main__":
    main()
Run the file
$ python run.py
Accuracy: 0.9961211434899032
{'Feature1': 0.57, 'Feature2': 0.84, 'TARGET': 3.6180416345596313}
Args
features: List of features
Features to train on
predict: Feature
Label or the value to be predicted
location: Path
Location where state should be saved
time_left_for_this_task: Integer
default: 3600
Time limit in seconds for the search of appropriate models. By increasing this value, auto-sklearn has a higher chance of finding better models.
per_run_time_limit: Integer
default: None
Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.
initial_configurations_via_metalearning: Integer
default: 25
Initialize the hyperparameter optimization algorithm with this many configurations which worked well on previously seen datasets. Disable if the hyperparameter optimization algorithm should start from scratch.
ensemble_size: Integer
default: 50
Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement.
ensemble_nbest: Integer
default: 50
Only consider the ensemble_nbest models when building an ensemble.
max_models_on_disc: Integer
default: 50
Defines the maximum number of models that are kept on disk; any additional models are permanently deleted. Because of this, it sets the upper limit on how many models can be used for an ensemble. It must be an integer greater than or equal to 1. If set to None, all models are kept on disk.
seed: Integer
default: 1
Used to seed SMAC. Will determine the output file names.
memory_limit: Integer
default: 3072
Memory limit in MB for the machine learning algorithm. auto-sklearn will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB. If None is provided, no memory limit is set. In case of multi-processing, memory_limit will be per job. This memory limit also applies to the ensemble creation process.
include_estimators: typing.Any
default: None
If None, all possible estimators are used. Otherwise specifies the set of estimators to use.
exclude_estimators: typing.Any
default: None
If None, all possible estimators are used. Otherwise specifies the set of estimators not to use. Incompatible with include_estimators.
include_preprocessors: typing.Any
default: None
If None, all possible preprocessors are used. Otherwise specifies the set of preprocessors to use.
exclude_preprocessors: typing.Any
default: None
If None, all possible preprocessors are used. Otherwise specifies the set of preprocessors not to use. Incompatible with include_preprocessors.
resampling_strategy: String
default: holdout
How to handle overfitting; might need 'resampling_strategy_arguments'. Options include 'holdout' (67:33 train/test split), 'holdout-iterative-fit' (same split, calls iterative fit where possible), 'cv' and 'partial-cv' (cross validation, require 'folds'), or any BaseCrossValidator, _RepeatedSplits, or BaseShuffleSplit object found in the scikit-learn model_selection module. A configuration sketch combining several of these arguments follows this list.
resampling_strategy_arguments: dict
default: None
Additional arguments for resampling_strategy. train_size should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the train split; shuffle determines whether the data is shuffled prior to splitting it into train and validation. When a scikit-learn cross validator object is used, pass all arguments required by the chosen class as specified in the scikit-learn documentation. If arguments are not provided, scikit-learn defaults are used; if no defaults are available, an exception is raised. Refer to the 'n_splits' argument as 'folds'.
tmp_folder: String
default: None
Folder to store configuration output and log files; if None, automatically uses /tmp/autosklearn_tmp_$pid_$random_number.
output_folder: String
default: None
Folder to store predictions for an optional test set; if None, no output will be generated.
delete_tmp_folder_after_terminate: String
default: True
Remove tmp_folder when finished. If tmp_folder is None, tmp_dir will always be deleted.
delete_output_folder_after_terminate: String
default: True
Remove output_folder when finished. If output_folder is None, output_dir will always be deleted.
n_jobs: Integer
default: None
The number of jobs to run in parallel for fit(). -1 means using all processors. By default, Auto-sklearn uses a single core for fitting the machine learning model and a single core for fitting an ensemble. Ensemble building is not affected by n_jobs but can be controlled by the number of models in the ensemble. In contrast to most scikit-learn models, n_jobs given in the constructor is not applied to the predict() method. If dask_client is None, a new dask client is created.
dask_client: typing.Any
default: None
User-created dask client; can be used to start a dask cluster and then attach auto-sklearn to it.
disable_evaluator_output: String
default: False
If True, disable model and prediction output. Cannot be used together with ensemble building; predict() cannot be used when setting this True. Can also be used as a list to pass more fine-grained information on what to save. Allowed elements in the list are 'y_optimization' (do not save predictions for the optimization/validation set, which would later on be used to build an ensemble) and 'model' (do not save any model files).
smac_scenario_args: dict
default: None
Additional arguments inserted into the scenario of SMAC. See the SMAC documentation for a list of available arguments.
get_smac_object_callback: typing.Any
default: None
Callback function to create an object of class smac.optimizer.smbo.SMBO. The function must accept the arguments scenario_dict, instances, num_params, runhistory, seed and ta. This is an advanced feature. Use only if you are familiar with SMAC.
logging_config: dict
default: None
Dictionary object specifying the logger configuration. If None, the default logging.yaml file is used, which can be found in the directory util/logging.yaml relative to the installation.
metadata_directory: String
default: None
path to the metadata directory. If None, the default directory (autosklearn.metalearning.files) is used.
metric: typing.Any
default: None
An instance of autosklearn.metrics.Scorer, as created by autosklearn.metrics.make_scorer(); these are the auto-sklearn built-in metrics. If None is provided, a default metric is selected depending on the task.
scoring_functions: typing.Any
default: None
List of scorers which will be calculated for each pipeline; results will be available via cv_results.
load_models: String
default: True
Whether to load the models after fitting Auto-sklearn.
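To see how these arguments fit together, here is a configuration sketch; every keyword below is documented in the Args list above, but the values are purely illustrative, not recommendations (tuned.py is an arbitrary filename):
tuned.py
from dffml import Feature, Features
from dffml_model_autosklearn import AutoSklearnRegressorModel

# All keyword arguments below are documented in the Args list above;
# the values are illustrative choices, not recommendations.
model = AutoSklearnRegressorModel(
    features=Features(
        Feature("Feature1", float, 1), Feature("Feature2", float, 1),
    ),
    predict=Feature("TARGET", float, 1),
    location="tempdir-tuned",
    time_left_for_this_task=600,  # total search budget in seconds
    per_run_time_limit=60,  # cap each single model fit
    ensemble_size=25,  # number of models in the final ensemble
    memory_limit=4096,  # MB per job
    resampling_strategy="cv",
    resampling_strategy_arguments={"folds": 5},
    n_jobs=2,  # parallel fits
    seed=42,
)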