Models¶
Models are implementations of dffml.model.model.Model; they abstract the usage of machine learning models.
If you want to get started creating your own model, check out the model tutorials.
You can load any of the models seen here using the Model.load (dffml.model.model.Model.load) function. See the Load Models Dynamically tutorial for more details.
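For instance, loading the slr model class by its entrypoint name looks roughly like this (a minimal sketch using the Model.load function referenced above; the config values mirror the slr example below):
from dffml import Features, Feature
from dffml.model.model import Model
# Look up the model class registered under the "slr" entrypoint
SLRModel = Model.load("slr")
# Instantiate it the same way as if it had been imported directly
model = SLRModel(
    features=Features(Feature("f1", float, 1)),
    predict=Feature("ans", int, 1),
    location="tempdir",
)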
dffml¶
pip install dffml
slr¶
Official
Simple Linear Regression, training on one variable to predict another.
The dataset used for training
dataset.csv
f1,ans
0.1,0
0.7,1
0.6,1
0.2,0
0.8,1
Train the model
$ dffml train \
-model slr \
-model-features f1:float:1 \
-model-predict ans:int:1 \
-model-location tempdir \
-sources f=csv \
-source-filename dataset.csv
Assess the accuracy
$ dffml accuracy \
-model slr \
-model-features f1:float:1 \
-model-predict ans:int:1 \
-model-location tempdir \
-features ans:int:1 \
-sources f=csv \
-source-filename dataset.csv \
-scorer mse
Output
1.0
Make a prediction
predict.csv
f1
0.8
$ dffml predict all \
-model slr \
-model-features f1:float:1 \
-model-predict ans:int:1 \
-model-location tempdir \
-sources f=csv \
-source-filename predict.csv
[
{
"extra": {},
"features": {
"f1": 0.8
},
"key": "0",
"last_updated": "2020-11-15T16:22:25Z",
"prediction": {
"ans": {
"confidence": 0.9355670103092784,
"value": 1
}
}
}
]
Example usage of the SLR model using Python
slr.py
from dffml import Features, Feature, SLRModel
from dffml.noasync import train, score, predict
from dffml.accuracy import MeanSquaredErrorAccuracy
model = SLRModel(
features=Features(Feature("f1", float, 1)),
predict=Feature("ans", int, 1),
location="tempdir",
)
# Train the model
train(model, "dataset.csv")
# Assess accuracy (alternate way of specifying data source)
scorer = MeanSquaredErrorAccuracy()
print("Accuracy:", score(model, scorer, Feature("ans", int, 1), "dataset.csv"))
# Make prediction
for i, features, prediction in predict(model, {"f1": 0.8, "ans": 0}):
features["ans"] = prediction["ans"]["value"]
print(features)
$ python slr.py
Accuracy: 0.9355670103092784
{'f1': 0.8, 'ans': 1}
Args
predict: Feature
Label or the value to be predicted
features: List of features
Features to train on. For SLR only 1 allowed
location: Path
Location where state should be saved
dffml_model_scratch¶
pip install dffml-model-scratch
anomalydetection¶
Official
Anomaly detection model that fits a multivariate Gaussian distribution to the training data, computes the probability of each record in the dataset, and flags low-probability records as outliers. F1 score is used as the evaluation metric for this model. The model captures dependencies across features, and works particularly well when the features follow a Gaussian distribution.
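The underlying idea can be sketched with NumPy and SciPy (illustrative only, not the model's internal code; the threshold name epsilon is our own):
import numpy as np
from scipy.stats import multivariate_normal
# Fit a multivariate Gaussian to the training data
X_train = np.array([[0.65, 0.2], [0.24, 0.3], [0.93, 0.1], [0.87, 0.25]])
dist = multivariate_normal(
    mean=X_train.mean(axis=0), cov=np.cov(X_train, rowvar=False)
)
# Records whose probability density falls below epsilon are outliers
X_new = np.array([[0.5, 0.2], [7.0, 3.0]])
epsilon = 1e-3
print(dist.pdf(X_new) < epsilon)  # [False  True]: the second record is an outlier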
Examples¶
Command line usage
Create training and test datasets
trainex.csv
A,Y
0.65,0
0.24,0
0.93,0
0.87,0
0.23,0
7,1
0.86,0
0.45,0
0.55,0
0.29,0
5,1
0.51,0
0.88,0
0.24,0
0.51,0
0.17,0
9,1
0.37,0
0.23,0
0.44,0
0.62,0
3,1
0.87,0
testex.csv
A,Y
0.45,0
0.23,0
0.67,0
8,1
0.19,0
0.34,0
0.49,0
0.31,0
0.47,0
4,1
Train the model
$ dffml train \
-sources f=csv \
-source-filename trainex.csv \
-model anomalydetection \
-model-features A:float:2 \
-model-predict Y:int:1 \
-model-location tempdir
Assess the accuracy
$ dffml accuracy \
-sources f=csv \
-source-filename testex.csv \
-model anomalydetection \
-model-features A:float:2 \
-model-predict Y:int:1 \
-model-location tempdir \
-features Y:int:1 \
-scorer anomalyscore
Make predictions
$ dffml predict all \
-sources f=csv \
-source-filename testex.csv \
-model anomalydetection \
-model-features A:float:2 \
-model-predict Y:int:1 \
-model-location tempdir
Python usage
detectoutliers.py
from dffml import Feature, Features
from dffml.noasync import score, train
from dffml_model_scratch.anomalydetection import AnomalyModel
from dffml_model_scratch.anomaly_detection_scorer import (
AnomalyDetectionAccuracy,
)
# Configure the model
model = AnomalyModel(
features=Features(Feature("A", int, 2),),
predict=Feature("Y", int, 1),
location="model",
)
# Train the model
train(model, "trainex.csv")
# Assess accuracy for test set
scorer = AnomalyDetectionAccuracy()
print(
"Test set F1 score :",
score(model, scorer, Feature("Y", int, 1), "testex.csv"),
)
# Assess accuracy for training set
print(
"Training set F1 score :",
score(model, scorer, Feature("Y", int, 1), "trainex.csv"),
)
Output
$ python detectoutliers.py
Test set F1 score : 0.8
Training set F1 score : 0.888888888888889
Args
features: List of features
Features to train on
predict: Feature
Label or the value to be predicted
location: Path
Location where state should be saved
k: float
default: 0.8
Validation set size
scratchlgrsag¶
Official
Logistic regression trained with a stochastic average gradient (SAG) descent optimizer.
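The core idea of SAG can be sketched on the toy dataset below (illustrative only, not the model's actual code): keep a table of per-sample gradients, refresh one entry per step, and update the weights with the running average.
import numpy as np
X = np.array([0.1, 0.7, 0.6, 0.2, 0.8])
y = np.array([0, 1, 1, 0, 1])
n = len(X)
w, b, lr = 0.0, 0.0, 0.5
grad_w = np.zeros(n)  # per-sample gradient memory (the "average" in SAG)
grad_b = np.zeros(n)
rng = np.random.default_rng(0)
for _ in range(2000):
    i = rng.integers(n)
    p = 1.0 / (1.0 + np.exp(-(w * X[i] + b)))  # sigmoid
    grad_w[i] = (p - y[i]) * X[i]  # refresh this sample's stored gradient
    grad_b[i] = p - y[i]
    w -= lr * grad_w.mean()  # step using the average of all stored gradients
    b -= lr * grad_b.mean()
print(w, b)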
The dataset used for training
cat > dataset.csv << EOF
f1,ans
0.1,0
0.7,1
0.6,1
0.2,0
0.8,1
EOF
Train the model
dffml train \
-model scratchlgrsag \
-model-features f1:float:1 \
-model-predict ans:int:1 \
-model-location tempdir \
-sources f=csv \
-source-filename dataset.csv \
-log debug
Assess the accuracy
dffml accuracy \
-model scratchlgrsag \
-model-features f1:float:1 \
-model-predict ans:int:1 \
-model-location tempdir \
-features ans:int:1 \
-sources f=csv \
-source-filename dataset.csv \
-scorer mse \
-log debug
Output
1.0
Make a prediction
echo -e 'f1,ans\n0.8,0\n' | \
dffml predict all \
-model scratchlgrsag \
-model-features f1:float:1 \
-model-predict ans:int:1 \
-model-location tempdir \
-sources f=csv \
-source-filename /dev/stdin \
-log debug
Output
[
{
"extra": {},
"features": {
"ans": 0,
"f1": 0.8
},
"last_updated": "2020-03-19T13:41:08Z",
"prediction": {
"ans": {
"confidence": 1.0,
"value": 1
}
},
"key": "0"
}
]
Example usage of Logistic Regression using Python
from dffml import CSVSource, Features, Feature
from dffml.noasync import train, score, predict
from dffml.accuracy import MeanSquaredErrorAccuracy
from dffml_model_scratch.logisticregression import LogisticRegression
model = LogisticRegression(
features=Features(Feature("f1", float, 1)),
predict=Feature("ans", int, 1),
location="tempdir",
)
# Train the model
train(model, "dataset.csv")
# Assess accuracy (alternate way of specifying data source)
scorer = MeanSquaredErrorAccuracy()
print(
"Accuracy:",
score(
model,
scorer,
Feature("ans", int, 1),
CSVSource(filename="dataset.csv"),
),
)
# Make prediction
for i, features, prediction in predict(model, {"f1": 0.8, "ans": 0}):
features["ans"] = prediction["ans"]["value"]
print(features)
Args
predict: Feature
Label or the value to be predicted
features: List of features
Features to train on
location: Path
Location where state should be saved
dffml_model_xgboost¶
pip install dffml-model-xgboost
OSX Installation
XGBoost on OSX requires libomp
$ brew install libomp
xgbclassifier¶
Official
Model using xgboost to perform classification prediction via gradient boosted trees. XGBoost is a leading software library for working with standard tabular data (the type of data you store in Pandas DataFrames, as opposed to more exotic types of data like images and videos). With careful parameter tuning, you can train highly accurate models.
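For orientation, this model is built on the xgboost package's scikit-learn style API; a minimal direct-use sketch (parameters chosen to mirror the CLI example below):
import xgboost as xgb
from sklearn.datasets import load_iris
# Fit a gradient boosted tree classifier directly with xgboost
X, y = load_iris(return_X_y=True)
clf = xgb.XGBClassifier(max_depth=3, learning_rate=0.01, n_estimators=200)
clf.fit(X, y)
print(clf.score(X, y))  # mean accuracy on the training data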
Examples¶
Command line usage
First download the training and test files and change the headers to the format DFFML expects. The first row of each file encodes the classifications; we replace it with CSV headers naming the columns.
$ wget http://download.tensorflow.org/data/iris_training.csv
$ wget http://download.tensorflow.org/data/iris_test.csv
$ sed -i 's/.*setosa,versicolor,virginica/SepalLength,SepalWidth,PetalLength,PetalWidth,classification/g' iris_training.csv iris_test.csv
Run the train command
$ dffml train \
-sources train=csv \
-source-filename iris_training.csv \
-model xgbclassifier \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
PetalWidth:float:1 \
-model-predict classification \
-model-location model \
-model-max_depth 3 \
-model-learning_rate 0.01 \
-model-n_estimators 200 \
-model-reg_lambda 1 \
-model-reg_alpha 0 \
-model-gamma 0 \
-model-colsample_bytree 0 \
-model-subsample 1
Assess the accuracy
$ dffml accuracy \
-sources train=csv \
-source-filename iris_test.csv \
-model xgbclassifier \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
PetalWidth:float:1 \
-model-predict classification \
-model-location model \
-features classification \
-scorer clf
Make predictions
$ dffml predict all \
-sources train=csv \
-source-filename iris_test.csv \
-model xgbclassifier \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
PetalWidth:float:1 \
-model-predict classification \
-model-location model
Python usage
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from dffml import Feature, Features
from dffml.noasync import train, score
from dffml.accuracy import ClassificationAccuracy
from dffml_model_xgboost.xgbclassifier import (
XGBClassifierModel,
XGBClassifierModelConfig,
)
iris = load_iris()
y = iris["target"]
X = iris["data"]
trainX, testX, trainy, testy = train_test_split(
X, y, test_size=0.1, random_state=123
)
# Configure the model
model = XGBClassifierModel(
XGBClassifierModelConfig(
features=Features(Feature("data", float,)),
predict=Feature("target", float, 1),
location="model",
max_depth=3,
learning_rate=0.01,
n_estimators=200,
reg_lambda=1,
reg_alpha=0,
gamma=0,
colsample_bytree=0,
subsample=1,
)
)
# Train the model
train(model, *[{"data": x, "target": y} for x, y in zip(trainX, trainy)])
# Assess accuracy
scorer = ClassificationAccuracy()
print(
"Test accuracy:",
score(
model,
scorer,
Feature("target", float, 1),
*[{"data": x, "target": y} for x, y in zip(testX, testy)],
),
)
print(
"Training accuracy:",
score(
model,
scorer,
Feature("target", float, 1),
*[{"data": x, "target": y} for x, y in zip(trainX, trainy)],
),
)
Output
Test accuracy: 0.933333333333333
Training accuracy: 0.9703703703703703
Args
location: Path
Location where model should be saved
features: List of features
Features on which we train the model
predict: Feature
Value to be predicted
learning_rate: float
default: 0.3
Learning rate to train with
n_estimators: Integer
default: 100
Number of gradient boosted trees. Equivalent to the number of boosting rounds
max_depth: Integer
default: 6
Maximum tree depth for base learners
objective: String
default: multi:softmax
Objective in training
subsample: float
default: 1
Subsample ratio of the training instance
gamma: float
default: 0
Minimum loss reduction required to make a further partition on a leaf node
n_jobs: Integer
default: -1
Number of parallel threads used to run xgboost
colsample_bytree: float
default: 1
Subsample ratio of columns when constructing each tree
booster: String
default: gbtree
Specify which booster to use: gbtree, gblinear or dart
min_child_weight: float
default: 1
Minimum sum of instance weight (hessian) needed in a child
reg_lambda: float
default: 1
L2 regularization term on weights. Increasing this value will make model more conservative
reg_alpha: float
default: 0
L1 regularization term on weights. Increasing this value will make model more conservative
xgbregressor¶
Official
Model using xgboost to perform regression prediction via gradient boosted trees. XGBoost is a leading software library for working with standard tabular data (the type of data you store in Pandas DataFrames, as opposed to more exotic types of data like images and videos). With careful parameter tuning, you can train highly accurate models.
Examples¶
Command line usage
First download the training and test files, change the headers to DFFML format.
$ wget http://download.tensorflow.org/data/iris_training.csv
$ wget http://download.tensorflow.org/data/iris_test.csv
$ sed -i 's/.*setosa,versicolor,virginica/SepalLength,SepalWidth,PetalLength,PetalWidth,classification/g' iris_training.csv iris_test.csv
Run the train command
$ dffml train \
-sources train=csv \
-source-filename iris_training.csv \
-model xgbregressor \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
PetalWidth:float:1 \
-model-predict classification \
-model-location model \
-model-max_depth 3 \
-model-learning_rate 0.01 \
-model-n_estimators 200 \
-model-reg_lambda 1 \
-model-reg_alpha 0 \
-model-gamma 0 \
-model-colsample_bytree 0 \
-model-subsample 1
Assess the accuracy
$ dffml accuracy \
-sources train=csv \
-source-filename iris_test.csv \
-model xgbregressor \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
PetalWidth:float:1 \
-model-predict classification \
-model-location model \
-features classification \
-scorer mse
Output
accuracy: 0.8841466984766406
Make predictions
$ dffml predict all \
-sources train=csv \
-source-filename iris_test.csv \
-model xgbregressor \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
PetalWidth:float:1 \
-model-predict classification \
-model-location model
Python usage
run.py
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from dffml import Feature, Features
from dffml.noasync import train, score
from dffml_model_xgboost.xgbregressor import (
XGBRegressorModel,
XGBRegressorModelConfig,
)
from dffml.accuracy import MeanSquaredErrorAccuracy
diabetes = load_diabetes()
y = diabetes["target"]
X = diabetes["data"]
trainX, testX, trainy, testy = train_test_split(
X, y, test_size=0.1, random_state=123
)
# Configure the model
model = XGBRegressorModel(
XGBRegressorModelConfig(
features=Features(Feature("data", float, 10)),
predict=Feature("target", float, 1),
location="model",
max_depth=3,
learning_rate=0.05,
n_estimators=400,
reg_lambda=10,
reg_alpha=0,
gamma=10,
colsample_bytree=0.3,
subsample=0.8,
)
)
# Train the model
train(model, *[{"data": x, "target": y} for x, y in zip(trainX, trainy)])
# Assess accuracy
scorer = MeanSquaredErrorAccuracy()
print(
"Test accuracy:",
score(
model,
scorer,
Feature("target", float, 1),
*[{"data": x, "target": y} for x, y in zip(testX, testy)],
),
)
print(
"Training accuracy:",
score(
model,
scorer,
Feature("target", float, 1),
*[{"data": x, "target": y} for x, y in zip(trainX, trainy)],
),
)
Output
$ python run.py
Test accuracy: 0.6669655406927468
Training accuracy: 0.819782501866115
Args
location: Path
Location where model should be saved
features: List of features
Features on which we train the model
predict: Feature
Value to be predicted
learning_rate: float
default: 0.05
Learning rate to train with
n_estimators: Integer
default: 1000
Number of gradient boosted trees. Equivalent to the number of boosting rounds
max_depth: Integer
default: 6
Maximum tree depth for base learners
subsample: float
default: 1
Subsample ratio of the training instance
gamma: float
default: 0
Minimum loss reduction required to make a further partition on a leaf node
n_jobs: Integer
default: -1
Number of parallel threads used to run xgboost
colsample_bytree: float
default: 1
Subsample ratio of columns when constructing each tree
booster: String
default: gbtree
Specify which booster to use: gbtree, gblinear or dart
min_child_weight: float
default: 0
Minimum sum of instance weight (hessian) needed in a child
reg_lambda: float
default: 1
L2 regularization term on weights. Increasing this value will make model more conservative
reg_alpha: float
default: 0
L1 regularization term on weights. Increasing this value will make model more conservative
dffml_model_vowpalWabbit¶
pip install dffml-model-vowpalWabbit
vwmodel¶
Official
Implemented using Vowpal Wabbit.
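Vowpal Wabbit consumes examples in its own text format, label | feature:value feature:value .... In the datasets below, column A already holds the VW-formatted feature string (note the leading |) and column B holds the label, which is why the commands pass -model-noconvert to skip DFFML's conversion of record features into VW input format.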
First we create the training and testing datasets
cat > train.csv << EOF
A,B
| price:.23 sqft:.25 age:.05 2006,-1
| price:.18 sqft:.15 age:.35 1976,1
| price:.53 sqft:.32 age:.87 1924,-1
EOF
cat > test.csv << EOF
A
| price:.46 sqft:.4 age:.10 1924
EOF
Train the model
dffml train \
-model vwmodel \
-model-features \
A:str:1 \
-model-predict \
B:int:1 \
-model-noconvert \
-sources f=csv \
-source-filename train.csv \
-model-location tempdir
Assess the accuracy
dffml accuracy \
-model vwmodel \
-model-features \
A:str:1 \
-model-predict \
B:int:1 \
-model-noconvert \
-features B:int:1 \
-scorer mse \
-sources f=csv \
-source-filename train.csv \
-model-location tempdir
Output
0.38683876649129145
Make a prediction
dffml predict all \
-model vwmodel \
-model-features \
A:str:1 \
-model-predict \
B:int:1 \
-model-noconvert \
-sources f=csv \
-source-filename test.csv \
-model-location tempdir
Output
[
{
"extra": {},
"features": {
"A": "| price:.46 sqft:.4 age:.10 1924"
},
"key": "0",
"last_updated": "2020-05-29T16:36:57Z",
"prediction": {
"B": {
"confidence": 0.38683876649129145,
"value": 0.0
}
}
}
]
Args
features: List of features
predict: Feature
Feature to predict
location: Path
Location where state should be saved
class_cost: List of features
default: None
Features named Cost_{class}, containing the cost of that class for each input example; used when csoaa is used
task: String
default: regression
Task to perform, possible values are classification, regression
use_binary_label: String
default: False
Convert target labels to -1 and 1 for binary classification
vwcmd: List of strings
default: []
Command Line Arguments as per vowpal wabbit convention
namespace: List of strings
default: []
Namespace for input features. Should be in format {namespace}_{feature name}
importance: Feature
default: None
Feature containing importance of each example, used in conversion of input data to vowpal wabbit input format
base: Feature
default: None
Feature containing base for each example, used for residual regression
tag: Feature
default: None
Feature to be used as tag in conversion of data to vowpal wabbit input format
noconvert: String
default: False
Do not convert record features to vowpal wabbit input format
dffml_model_scikit¶
pip install dffml-model-scikit
Machine learning models implemented with scikit-learn. Models are saved under the model location directory, in subdirectories named after the hash of their feature names.
General Usage:
Training:
$ dffml train \
-model SCIKIT_MODEL_ENTRYPOINT \
-model-features FEATURE_DEFINITION \
-model-predict TO_PREDICT \
-model-location MODEL_DIRECTORY \
-model-SCIKIT_PARAMETER_NAME SCIKIT_PARAMETER_VALUE \
-sources f=TRAINING_DATA_SOURCE_TYPE \
-source-filename TRAINING_DATA_FILE_NAME \
-log debug
Testing and Accuracy:
$ dffml accuracy \
-model SCIKIT_MODEL_ENTRYPOINT \
-model-features FEATURE_DEFINITION \
-model-predict TO_PREDICT \
-model-location MODEL_DIRECTORY \
-features TO_PREDICT \
-sources f=TESTING_DATA_SOURCE_TYPE \
-source-filename TESTING_DATA_FILE_NAME \
-scorer ACCURACY_SCORER \
-log debug
Predicting with trained model:
$ dffml predict all \
-model SCIKIT_MODEL_ENTRYPOINT \
-model-features FEATURE_DEFINITION \
-model-predict TO_PREDICT \
-model-location MODEL_DIRECTORY \
-sources f=PREDICT_DATA_SOURCE_TYPE \
-source-filename PREDICT_DATA_FILE_NAME \
-log debug
Models Available:

Type | Model | Entrypoint | Multi-Output
---|---|---|---
Regression | LinearRegression | scikitlr | Yes
 | ElasticNet | scikiteln | Yes
 | RandomForestRegressor | scikitrfr | Yes
 | BayesianRidge | scikitbyr | Yes
 | Lasso | scikitlas | Yes
 | ARDRegression | scikitard | Yes
 | RANSACRegressor | scikitrsc | Yes
 | DecisionTreeRegressor | scikitdtr | Yes
 | GaussianProcessRegressor | scikitgpr | Yes
 | OrthogonalMatchingPursuit | scikitomp | Yes
 | Lars | scikitlars | Yes
 | Ridge | scikitridge | Yes
Classification | KNeighborsClassifier | scikitknn | Yes
 | AdaBoostClassifier | scikitadaboost | Yes
 | GaussianProcessClassifier | scikitgpc | Yes
 | DecisionTreeClassifier | scikitdtc | Yes
 | RandomForestClassifier | scikitrfc | Yes
 | QuadraticDiscriminantAnalysis | scikitqda | Yes
 | MLPClassifier | scikitmlp | Yes
 | GaussianNB | scikitgnb | Yes
 | SVC | scikitsvc | Yes
 | LogisticRegression | scikitlor | Yes
 | GradientBoostingClassifier | scikitgbc | Yes
 | BernoulliNB | scikitbnb | Yes
 | ExtraTreesClassifier | scikitetc | Yes
 | BaggingClassifier | scikitbgc | Yes
 | LinearDiscriminantAnalysis | scikitlda | Yes
 | MultinomialNB | scikitmnb | Yes
Clustering | KMeans | scikitkmeans | No
 | Birch | scikitbirch | No
 | MiniBatchKMeans | scikitmbkmeans | No
 | AffinityPropagation | scikitap | No
 | MeanShift | scikitms | No
 | SpectralClustering | scikitsc | No
 | AgglomerativeClustering | scikitac | No
 | OPTICS | scikitoptics | No
Scorers Available:

Type | Scorer | Entrypoint | Multi-Output
---|---|---|---
Regression | Explained Variance Score | exvscore | Yes
 | Max Error | maxerr | No
 | Mean Absolute Error | meanabserr | Yes
 | Mean Squared Error | meansqrerr | Yes
 | Mean Squared Log Error | meansqrlogerr | Yes
 | Median Absolute Error | medabserr | Yes
 | R2 Score | r2score | Yes
 | Mean Poisson Deviance | meanpoidev | No
 | Mean Gamma Deviance | meangammadev | No
 | Mean Absolute Percentage Error | meanabspererr | Yes
Classification | Accuracy Score | acscore | Yes
 | Balanced Accuracy Score | bacscore | Yes
 | Top K Accuracy Score | topkscore | Yes
 | Average Precision Score | avgprescore | Yes
 | Brier Score Loss | brierscore | Yes
 | F1 Score | f1score | Yes
 | Log Loss | logloss | Yes
 | Precision Score | prescore | Yes
 | Recall Score | recallscore | Yes
 | Jaccard Score | jacscore | Yes
 | Roc Auc Score | rocaucscore | Yes
Clustering | Adjusted Mutual Info Score | adjmutinfoscore | No
 | Adjusted Rand Score | adjrandscore | No
 | Completeness Score | complscore | No
 | Fowlkes Mallows Score | fowlmalscore | No
 | Homogeneity Score | homoscore | No
 | Mutual Info Score | mutinfoscore | No
 | Normalized Mutual Info Score | normmutinfoscore | No
 | Rand Score | randscore | No
 | V Measure Score | vmscore | No
Supervised | Model’s Default Score | skmodelscore | Yes
Usage Example:
The example below uses the LinearRegression model from the command line. Consider a simple dataset:
Years of Experience | Expertise | Trust Factor | Salary
---|---|---|---
0 | 1 | 0.2 | 10
1 | 3 | 0.4 | 20
2 | 5 | 0.6 | 30
3 | 7 | 0.8 | 40
4 | 9 | 1.0 | 50
5 | 11 | 1.2 | 60
First we create the files
cat > train.csv << EOF
Years,Expertise,Trust,Salary
0,1,0.1,10
1,3,0.2,20
2,5,0.3,30
3,7,0.4,40
EOF
cat > test.csv << EOF
Years,Expertise,Trust,Salary
4,9,0.5,50
5,11,0.6,60
EOF
Train the model
dffml train \
-model scikitlr \
-model-features Years:int:1 Expertise:int:1 Trust:float:1 \
-model-predict Salary:float:1 \
-model-location tempdir \
-sources f=csv \
-source-filename train.csv
Assess accuracy
dffml accuracy \
-model scikitlr \
-model-features Years:int:1 Expertise:int:1 Trust:float:1 \
-model-predict Salary:float:1 \
-model-location tempdir \
-features Salary:float:1 \
-scorer mse \
-sources f=csv \
-source-filename test.csv
Output:
1.0
Make a prediction
echo -e 'Years,Expertise,Trust\n6,13,0.7\n' | \
dffml predict all \
-model scikitlr \
-model-features Years:int:1 Expertise:int:1 Trust:float:1 \
-model-predict Salary:float:1 \
-model-location tempdir \
-sources f=csv \
-source-filename /dev/stdin
Output:
[
{
"extra": {},
"features": {
"Expertise": 13,
"Trust": 0.7,
"Years": 6
},
"key": "0",
"last_updated": "2020-03-01T22:26:46Z",
"prediction": {
"Salary": {
"confidence": 1.0,
"value": 70.0
}
}
}
]
Example usage of Linear Regression Model using python API:
from dffml import CSVSource, Features, Feature
from dffml.noasync import train, score, predict
from dffml_model_scikit import LinearRegressionModel
from dffml.accuracy import MeanSquaredErrorAccuracy
model = LinearRegressionModel(
features=Features(
Feature("Years", int, 1),
Feature("Expertise", int, 1),
Feature("Trust", float, 1),
),
predict=Feature("Salary", int, 1),
location="tempdir",
)
# Train the model
train(model, "train.csv")
# Assess accuracy (alternate way of specifying data source)
scorer = MeanSquaredErrorAccuracy()
print(
"Accuracy:",
score(
model,
scorer,
Feature("Salary", int, 1),
CSVSource(filename="test.csv"),
),
)
# Make prediction
for i, features, prediction in predict(
model,
{"Years": 6, "Expertise": 13, "Trust": 0.7},
{"Years": 7, "Expertise": 15, "Trust": 0.8},
):
features["Salary"] = prediction["Salary"]["value"]
print(features)
Example below uses KMeans Clustering Model on a small randomly generated dataset.
$ cat > train.csv << EOF
Col1, Col2, Col3, Col4
5.05776417, 8.55128116, 6.15193196, -8.67349666
3.48864265, -7.25952218, -4.89216256, 4.69308946
-8.16207603, 5.16792984, -2.66971993, 0.2401882
6.09809669, 8.36434181, 6.70940915, -7.91491768
-9.39122566, 5.39133807, -2.29760281, -1.69672981
0.48311336, 8.19998973, 7.78641979, 7.8843821
2.22409135, -7.73598586, -4.02660224, 2.82101794
2.8137247 , 8.36064298, 7.66196849, 3.12704676
EOF
$ cat > test.csv << EOF
Col1, Col2, Col3, Col4, cluster
-10.16770144, 2.73057215, -1.49351481, 2.43005691, 6
3.59705381, -4.76520663, -3.34916068, 5.72391486, 1
4.01612313, -4.641852 , -4.77333308, 5.87551683, 0
EOF
$ dffml train \
-model scikitkmeans \
-model-features Col1:float:1 Col2:float:1 Col3:float:1 Col4:float:1 \
-model-location tempdir \
-sources f=csv \
-source-filename train.csv \
-source-readonly \
-log debug
$ dffml accuracy \
-model scikitkmeans \
-model-features Col1:float:1 Col2:float:1 Col3:float:1 Col4:float:1 \
-model-predict cluster:int:1 \
-model-location tempdir \
-features cluster:int:1 \
-sources f=csv \
-source-filename test.csv \
-source-readonly \
-scorer skmodelscore \
-log debug
0.6365141682948129
$ echo -e 'Col1,Col2,Col3,Col4\n6.09809669,8.36434181,6.70940915,-7.91491768\n' | \
dffml predict all \
-model scikitkmeans \
-model-features Col1:float:1 Col2:float:1 Col3:float:1 Col4:float:1 \
-model-location tempdir \
-sources f=csv \
-source-filename /dev/stdin \
-source-readonly \
-log debug
[
{
"extra": {},
"features": {
"Col1": 6.09809669,
"Col2": 8.36434181,
"Col3": 6.70940915,
"Col4": -7.91491768
},
"last_updated": "2020-01-12T22:51:15Z",
"prediction": {
"confidence": 0.6365141682948129,
"value": 2
},
"key": "0"
}
]
Example usage of KMeans Clustering Model using python API:
from dffml import CSVSource, Features, Feature
from dffml.noasync import train, score, predict
from dffml_model_scikit import KMeansModel
from dffml_model_scikit import MutualInfoScoreScorer
model = KMeansModel(
features=Features(
Feature("Col1", float, 1),
Feature("Col2", float, 1),
Feature("Col3", float, 1),
Feature("Col4", float, 1),
),
predict=Feature("cluster", int, 1),
location="tempdir",
)
# Train the model
train(model, "train.csv")
# Assess accuracy (alternate way of specifying data source)
scorer = MutualInfoScoreScorer()
print("Accuracy:", score(model, scorer, Feature("cluster", int, 1), CSVSource(filename="test.csv")))
# Make prediction
for i, features, prediction in predict(
model,
{"Col1": 6.09809669, "Col2": 8.36434181, "Col3": 6.70940915, "Col4": -7.91491768},
):
features["cluster"] = prediction["cluster"]["value"]
print(features)
NOTE: Transductive clusterers (scikitsc, scikitac, scikitoptics) cannot handle unseen data. Ensure that predict and accuracy for these algorithms use the training data.
Args
predict: Feature
Label or the value to be predicted
Only used by classification and regression models
features: List of features
Features to train on
location: Path
Location where state should be saved
dffml_model_daal4py¶
pip install dffml-model-daal4py
daal4pylr¶
Official
Implemented using daal4py.
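For orientation, the wrapper's training and prediction map onto daal4py's linear regression API; a minimal sketch of the direct calls (following daal4py's own linear regression examples):
import numpy as np
import daal4py as d4p
# A few rows from train.csv below
X = np.array([[12.4], [14.3], [14.5], [14.9]])
y = np.array([[11.2], [12.5], [12.7], [13.1]])
# Train, then predict using the fitted model object
train_result = d4p.linear_regression_training().compute(X, y)
predict_result = d4p.linear_regression_prediction().compute(
    np.array([[16.0]]), train_result.model
)
print(predict_result.prediction)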
First we create the training and testing datasets
train.csv
f1,ans
12.4,11.2
14.3,12.5
14.5,12.7
14.9,13.1
16.1,14.1
16.9,14.8
16.5,14.4
15.4,13.4
17.0,14.9
17.9,15.6
18.8,16.4
20.3,17.7
22.4,19.6
19.4,16.9
15.5,14.0
16.7,14.6
test.csv
f1,ans
18.8,16.4
20.3,17.7
22.4,19.6
19.4,16.9
15.5,14.0
16.7,14.6
Train the model
$ dffml train \
-model daal4pylr \
-model-features f1:float:1 \
-model-predict ans:int:1 \
-model-location tempdir \
-sources f=csv \
-source-filename train.csv
Assess the accuracy
$ dffml accuracy \
-model daal4pylr \
-model-features f1:float:1 \
-model-predict ans:int:1 \
-model-location tempdir \
-features ans:int:1 \
-sources f=csv \
-source-filename test.csv \
-scorer mse
Output
0.6666666666666666
Make a prediction
$ echo -e 'f1,ans\n0.8,1\n' | \
dffml predict all \
-model daal4pylr \
-model-features f1:float:1 \
-model-predict ans:int:1 \
-model-location tempdir \
-sources f=csv \
-source-filename /dev/stdin
[
{
"extra": {},
"features": {
"ans": 1,
"f1": 0.8
},
"key": "0",
"last_updated": "2020-07-22T02:53:11Z",
"prediction": {
"ans": {
"confidence": null,
"value": 1.1907472649730522
}
}
}
]
Example usage of daal4py Linear Regression model using python API
run.py
from dffml import CSVSource, Features, Feature
from dffml.noasync import train, score, predict
from dffml_model_daal4py.daal4pylr import DAAL4PyLRModel
from dffml.accuracy import MeanSquaredErrorAccuracy
model = DAAL4PyLRModel(
features=Features(Feature("f1", float, 1)),
predict=Feature("ans", int, 1),
location="tempdir",
)
# Train the model
train(model, "train.csv")
# Assess accuracy (alternate way of specifying data source)
scorer = MeanSquaredErrorAccuracy()
print(
"Accuracy:",
score(
model, scorer, Feature("ans", int, 1), CSVSource(filename="test.csv")
),
)
# Make prediction
for i, features, prediction in predict(model, {"f1": 0.8, "ans": 0}):
features["ans"] = prediction["ans"]["value"]
print(features)
Run the file
$ python run.py
Args
predict: Feature
Label or the value to be predicted
features: List of features
Features to train on. Only 1 allowed
location: Path
Location where state should be saved
dffml_model_pytorch¶
pip install dffml-model-pytorch
Machine learning models implemented with PyTorch. Models are saved as model.pt under the model location directory.
General Usage:
Training:
$ dffml train \
-model PYTORCH_MODEL_ENTRYPOINT \
-model-features FEATURE_DEFINITION \
-model-predict TO_PREDICT \
-model-location MODEL_LOCATION \
-model-CONFIGS CONFIG_VALUES \
-sources f=TRAINING_DATA_SOURCE_TYPE \
-source-CONFIGS TRAINING_DATA \
-log debug
Testing and Accuracy:
$ dffml accuracy \
-model PYTORCH_MODEL_ENTRYPOINT \
-model-features FEATURE_DEFINITION \
-model-predict TO_PREDICT \
-model-location MODEL_LOCATION \
-model-CONFIGS CONFIG_VALUES \
-features TO_PREDICT \
-sources f=TESTING_DATA_SOURCE_TYPE \
-source-CONFIGS TESTING_DATA \
-log debug
Predicting with trained model:
$ dffml predict all \
-model PYTORCH_MODEL_ENTRYPOINT \
-model-features FEATURE_DEFINITION \
-model-predict TO_PREDICT \
-model-location MODEL_LOCATION \
-model-CONFIGS CONFIG_VALUES \
-sources f=PREDICT_DATA_SOURCE_TYPE \
-source-CONFIGS PREDICTION_DATA \
-log debug
Pre-Trained Models Available:

Type | Model | Entrypoint
---|---|---
Classification | AlexNet | alexnet
 | DenseNet-121 | densenet121
 | DenseNet-161 | densenet161
 | DenseNet-169 | densenet169
 | DenseNet-201 | densenet201
 | MnasNet 0.5 | mnasnet0_5
 | MnasNet 1.0 | mnasnet1_0
 | MobileNet V2 | mobilenet_v2
 | VGG-11 | vgg11
 | VGG-11 with batch normalization | vgg11_bn
 | VGG-13 | vgg13
 | VGG-13 with batch normalization | vgg13_bn
 | VGG-16 | vgg16
 | VGG-16 with batch normalization | vgg16_bn
 | VGG-19 | vgg19
 | VGG-19 with batch normalization | vgg19_bn
 | GoogleNet | googlenet
 | Inception V3 | inception_v3
 | ResNet-18 | resnet18
 | ResNet-34 | resnet34
 | ResNet-50 | resnet50
 | ResNet-101 | resnet101
 | ResNet-152 | resnet152
 | Wide ResNet-101-2 | wide_resnet101_2
 | Wide ResNet-50-2 | wide_resnet50_2
 | ShuffleNet V2 0.5 | shufflenet_v2_x0_5
 | ShuffleNet V2 1.0 | shufflenet_v2_x1_0
 | ResNext-101-32x8D | resnext101_32x8d
 | ResNext-50-32x4D | resnext50_32x4d
Usage Example:
The example below uses the ResNet-18 model from the command line.
Let us take a simple example: classifying images of ants and bees.
First, we download the dataset and verify it with sha384sum
curl -LO https://download.pytorch.org/tutorial/hymenoptera_data.zip
sha384sum -c - << EOF
491db45cfcab02d99843fbdcf0574ecf99aa4f056d52c660a39248b5524f9e6e8f896d9faabd27ffcfc2eaca0cec6f39  hymenoptera_data.zip
EOF
hymenoptera_data.zip: OK
Unzip the file
unzip hymenoptera_data.zip
We first create a YAML file defining the layer(s) that will replace the last layer of the network architecture (a sketch of the equivalent PyTorch module follows the YAML)
layers.yaml
linear1:
layer_type: Linear
in_features: 512
out_features: 256
relu:
layer_type: ReLU
dropout:
layer_type: Dropout
p: 0.2
linear2:
layer_type: Linear
in_features: 256
out_features: 2
logsoftmax:
layer_type: LogSoftmax
dim: 1
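For reference, the YAML above corresponds roughly to this PyTorch module (a sketch of the replacement head, not code the model generates):
import torch.nn as nn
# Classifier head replacing ResNet-18's final 512-feature linear layer
head = nn.Sequential(
    nn.Linear(in_features=512, out_features=256),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(in_features=256, out_features=2),
    nn.LogSoftmax(dim=1),
)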
Train the model
dffml train \
-model resnet18 \
-model-add_layers \
-model-layers @layers.yaml \
-model-clstype str \
-model-classifications ants bees \
-model-location resnet18_model \
-model-imageSize 224 \
-model-epochs 5 \
-model-batch_size 32 \
-model-enableGPU \
-model-features image:int:$((500*500)) \
-model-predict label:str:1 \
-sources f=dir \
-source-foldername hymenoptera_data/train \
-source-feature image \
-source-labels ants bees \
-log critical
Assess accuracy
dffml accuracy \
-model resnet18 \
-model-add_layers \
-model-layers @layers.yaml \
-model-clstype str \
-model-classifications ants bees \
-model-location resnet18_model \
-model-imageSize 224 \
-model-batch_size 32 \
-model-enableGPU \
-model-features image:int:$((500*500)) \
-model-predict label:str:1 \
-features label:str:1 \
-sources f=dir \
-source-foldername hymenoptera_data/val \
-source-feature image \
-source-labels ants bees \
-scorer pytorchscore \
-log critical
Output:
0.9215686274509803
Create a CSV file with the names of the images to predict on, i.e. whether each is an ant or a bee.
cat > unknown_images.csv << EOF
key,image
ants1,hymenoptera_data/val/ants/Ant-1818.jpg
bee1,hymenoptera_data/val/bees/10870992_eebeeb3a12.jpg
bee2,hymenoptera_data/val/bees/abeja.jpg
ants2,hymenoptera_data/val/ants/desert_ant.jpg
EOF
Make the predictions
dffml predict all \
-model resnet18 \
-model-add_layers \
-model-layers @layers.yaml \
-model-clstype str \
-model-classifications ants bees \
-model-location resnet18_model \
-model-imageSize 224 \
-model-enableGPU \
-model-features image:int:$((500*500)) \
-model-predict label:str:1 \
-sources f=csv \
-source-filename unknown_images.csv \
-source-loadfiles image \
-log critical \
-pretty
Output:
Key: ants1
Record Features
+----------------------------------------------------------------------------------------------------------------------------------------------+
| image | 59, 66, 83, 60, 70, 87, 57, 72, 88, 53, 74, 89 ... (length:263250) |
+----------------------------------------------------------------------------------------------------------------------------------------------+
Prediction
+----------------------------------------------------------------------------------------------------------------------------------------------+
| label |
+----------------------------------------------------------------------------------------------------------------------------------------------+
| Value: ants | Confidence: 0.9920881390571594 |
+----------------------------------------------------------------------------------------------------------------------------------------------+
Key: bee1
Record Features
+----------------------------------------------------------------------------------------------------------------------------------------------+
| image | 63, 114, 146, 63, 114, 146, 63, 114, 146, 63, ... (length:696000) |
+----------------------------------------------------------------------------------------------------------------------------------------------+
Prediction
+----------------------------------------------------------------------------------------------------------------------------------------------+
| label |
+----------------------------------------------------------------------------------------------------------------------------------------------+
| Value: bees | Confidence: 0.6108130216598511 |
+----------------------------------------------------------------------------------------------------------------------------------------------+
Key: bee2
Record Features
+----------------------------------------------------------------------------------------------------------------------------------------------+
| image | 103, 253, 254, 98, 254, 254, 91, 255, 254, 89, ... (length:359100) |
+----------------------------------------------------------------------------------------------------------------------------------------------+
Prediction
+----------------------------------------------------------------------------------------------------------------------------------------------+
| label |
+----------------------------------------------------------------------------------------------------------------------------------------------+
| Value: bees | Confidence: 0.9162276387214661 |
+----------------------------------------------------------------------------------------------------------------------------------------------+
Key: ants2
Record Features
+----------------------------------------------------------------------------------------------------------------------------------------------+
| image | 69, 121, 162, 44, 96, 137, 41, 90, 130, 68, 11 ... (length:1563912) |
+----------------------------------------------------------------------------------------------------------------------------------------------+
Prediction
+----------------------------------------------------------------------------------------------------------------------------------------------+
| label |
+----------------------------------------------------------------------------------------------------------------------------------------------+
| Value: ants | Confidence: 0.9368477463722229 |
+----------------------------------------------------------------------------------------------------------------------------------------------+
alexnet¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
densenet121¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
densenet161¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
densenet169¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
densenet201¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
googlenet¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
inception_v3¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
mnasnet0_5¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
mnasnet1_0¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
mobilenet_v2¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
pytorchnet¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
network: typing.Union[dffml_model_pytorch.pytorch_net.Network, torch.nn.modules.module.Module]
default: None
Model
resnet101¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Stops training early if the validation loss doesn’t improve for the given number of epochs (patience)
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
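When add_layers is set, the layers config supplies the replacement for the pretrained network's final layer. A minimal sketch of such a head in plain torch.nn; the value 2048 matches the input width of ResNet-101's final fully connected layer, while the hidden width and class count are illustrative assumptions:
import torch.nn as nn

# A replacement classification head of the kind `layers` accepts.
# ResNet-101's final fully connected layer takes 2048 input features;
# the hidden width (256) and the two output classes are made up here.
head = nn.Sequential(
    nn.Linear(2048, 256),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, 2),
)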
resnet152¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
resnet18¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
resnet34¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
resnet50¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
resnext101_32x8d¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
resnext50_32x4d¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
shufflenet_v2_x0_5¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
shufflenet_v2_x1_0¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
vgg11¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
vgg11_bn¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
vgg13¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
vgg13_bn¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
vgg16¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
vgg16_bn¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
vgg19¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
vgg19_bn¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
wide_resnet101_2¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
wide_resnet50_2¶
Official
No description
Args
predict: Feature
Feature name holding classification value
features: List of features
Features to train on
location: Path
Location where state should be saved
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
imageSize: Integer
default: None
Common size for all images to resize and crop to
enableGPU: String
default: False
Utilize GPUs for processing
epochs: Integer
default: 20
Number of iterations to pass over all records in a source
batch_size: Integer
default: 32
Batch size
validation_split: float
default: 0.0
Split training data for Validation
patience: Integer
default: 5
Early stops the training if validation loss doesn’t improve after a given patience
loss: PyTorchLoss
default: <class ‘dffml.base.CrossEntropyLossFunction’>
Loss Functions available in PyTorch
optimizer: String
default: SGD
Optimizer Algorithms available in PyTorch
normalize_mean: List of floats
default: None
Mean values for normalizing Tensor image
normalize_std: List of floats
default: None
Standard Deviation values for normalizing Tensor image
pretrained: String
default: True
Load Pre-trained model weights
trainable: String
default: False
Tweak pretrained model by training again
add_layers: String
default: False
Replace the last layer of the pretrained model
layers: typing.Union[dict, torch.nn.modules.container.ModuleDict, torch.nn.modules.container.Sequential, torch.nn.modules.container.ModuleList, torch.nn.modules.module.Module]
default: None
Extra layers to replace the last layer of the pretrained model
dffml_model_tensorflow¶
pip install dffml-model-tensorflow
Note
It’s important to keep the hidden layer config and feature config the same across invocations of train, predict, and accuracy methods.
Models are saved under the directory parameter in subdirectories named after the hash of their feature names and hidden layer config. This means that if any of those parameters change between invocations, the model is told to look for a different saved model.
tfdnnc¶
Official
Implemented using Tensorflow’s DNNClassifier.
First we create the training and testing datasets
wget http://download.tensorflow.org/data/iris_training.csv
echo '376c8ea3b7f85caff195b4abe62f34e8f4e7aece8bd087bbd746518a9d1fd60ae3b4274479f88ab0aa5c839460d535ef iris_training.csv' | sha384sum -c -
sed -i 's/.*setosa,versicolor,virginica/SepalLength,SepalWidth,PetalLength,PetalWidth,classification/g' *.csv
wget http://download.tensorflow.org/data/iris_test.csv
echo '8c2cda42ce5ce6f977d17d668b1c98a45bfe320175f33e97293c62ab543b3439eab934d8e11b1208de1e4a9eb1957714 iris_test.csv' | sha384sum -c -
sed -i 's/.*setosa,versicolor,virginica/SepalLength,SepalWidth,PetalLength,PetalWidth,classification/g' *.csv
Train the model
dffml train \
-model tfdnnc \
-model-epochs 3000 \
-model-steps 20000 \
-model-predict classification:int:1 \
-model-location tempdir \
-model-classifications 0 1 2 \
-model-clstype int \
-sources iris=csv \
-source-filename iris_training.csv \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
PetalWidth:float:1 \
-log debug
Assess the accuracy
dffml accuracy \
-model tfdnnc \
-model-predict classification:int:1 \
-model-location tempdir \
-model-classifications 0 1 2 \
-model-clstype int \
-features classification:int:1 \
-scorer clf \
-sources iris=csv \
-source-filename iris_test.csv \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
PetalWidth:float:1 \
-log critical
Output
0.99996233782
Make a prediction
echo -e 'SepalLength,SepalWidth,PetalLength,PetalWidth\n5.9,3.0,4.2,1.5\n' | \
dffml predict all \
-model tfdnnc \
-model-predict classification:int:1 \
-model-location tempdir \
-model-classifications 0 1 2 \
-model-clstype int \
-sources iris=csv \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
PetalWidth:float:1 \
-source-filename /dev/stdin
Output
[
{
"extra": {},
"features": {
"PetalLength": 4.2,
"PetalWidth": 1.5,
"SepalLength": 5.9,
"SepalWidth": 3.0,
"classification": 1
},
"last_updated": "2019-07-31T02:00:12Z",
"prediction": {
"classification":
{
"confidence": 0.9999997615814209,
"value": 1
}
},
"key": "0"
},
]
Example usage of Tensorflow DNNClassifier model using python API
from dffml import CSVSource, Features, Feature
from dffml.noasync import train, score, predict
from dffml_model_tensorflow.dnnc import DNNClassifierModel
from dffml.accuracy import ClassificationAccuracy

model = DNNClassifierModel(
    features=Features(
        Feature("SepalLength", float, 1),
        Feature("SepalWidth", float, 1),
        Feature("PetalLength", float, 1),
        Feature("PetalWidth", float, 1),
    ),
    predict=Feature("classification", int, 1),
    epochs=3000,
    steps=20000,
    classifications=[0, 1, 2],
    clstype=int,
    location="tempdir",
)

# Train the model
train(model, "iris_training.csv")

# Assess accuracy (alternate way of specifying data source)
scorer = ClassificationAccuracy()
print(
    "Accuracy:",
    score(
        model,
        scorer,
        Feature("classification", int, 1),
        CSVSource(filename="iris_test.csv"),
    ),
)

# Make prediction
for i, features, prediction in predict(
    model,
    {
        "PetalLength": 4.2,
        "PetalWidth": 1.5,
        "SepalLength": 5.9,
        "SepalWidth": 3.0,
    },
    {
        "PetalLength": 5.4,
        "PetalWidth": 2.1,
        "SepalLength": 6.9,
        "SepalWidth": 3.1,
    },
):
    features["classification"] = prediction["classification"]["value"]
    print(features)
Args
predict: Feature
Feature name holding target values
features: List of features
Features to train on
location: Path
Location where state should be saved
steps: Integer
default: 3000
Number of steps to train the model
epochs: Integer
default: 30
Number of iterations to pass over all records in a source
hidden: List of integers
default: [12, 40, 15]
List length is the number of hidden layers in the network. Each entry in the list is the number of nodes in that hidden layer
classifications: List of strings
default: None
Options for value of classification
clstype: Type
default: <class ‘str’>
Data type of classifications values
batchsize: Integer
default: 20
Number records to pass through in an epoch
shuffle: String
default: True
Randomise order of records in a batch
tfdnnr¶
Official
Implemented using Tensorflow’s DNNEstimator.
Usage:
predict: Name of the feature we are trying to predict (the label used during training).
Generating train and test data
This creates the files train.csv and test.csv. Take a backup of any files with the same names in the directory where these commands are run, as they will be overwritten.
cat > train.csv << EOF
Feature1,Feature2,TARGET
0.93,0.68,3.89
0.24,0.42,1.75
0.36,0.68,2.75
0.53,0.31,2.00
0.29,0.25,1.32
0.29,0.52,2.14
EOF
cat > test.csv << EOF
Feature1,Feature2,TARGET
0.57,0.84,3.65
0.95,0.19,2.46
0.23,0.15,0.93
EOF
Train the model
dffml train \
-model tfdnnr \
-model-epochs 300 \
-model-steps 2000 \
-model-predict TARGET:float:1 \
-model-location tempdir \
-model-hidden 8 16 8 \
-sources s=csv \
-source-filename train.csv \
-model-features \
Feature1:float:1 \
Feature2:float:1 \
-log debug
Assess the accuracy
dffml accuracy \
-model tfdnnr \
-model-predict TARGET:float:1 \
-model-location tempdir \
-model-hidden 8 16 8 \
-features TARGET:float:1 \
-sources s=csv \
-source-filename test.csv \
-model-features \
Feature1:float:1 \
Feature2:float:1 \
-scorer mse \
-log critical
Output
0.9468210011
Make a prediction
echo -e 'Feature1,Feature2,TARGET\n0.21,0.18,0.84\n' | \
dffml predict all \
-model tfdnnr \
-model-predict TARGET:float:1 \
-model-location tempdir \
-model-hidden 8 16 8 \
-sources s=csv \
-source-filename /dev/stdin \
-model-features \
Feature1:float:1 \
Feature2:float:1 \
-log critical
Output
[
{
"extra": {},
"features": {
"Feature1": 0.21,
"Feature2": 0.18,
"TARGET": 0.84
},
"last_updated": "2019-10-24T15:26:41Z",
"prediction": {
"TARGET" : {
"confidence": null,
"value": 1.1983429193496704
}
},
"key": 0
}
]
Example usage of Tensorflow DNNEstimator model using python API
from dffml import CSVSource, Features, Feature
from dffml.noasync import train, score, predict
from dffml_model_tensorflow.dnnr import DNNRegressionModel
from dffml.accuracy import MeanSquaredErrorAccuracy

model = DNNRegressionModel(
    features=Features(
        Feature("Feature1", float, 1), Feature("Feature2", float, 1)
    ),
    predict=Feature("TARGET", float, 1),
    epochs=300,
    steps=2000,
    hidden=[8, 16, 8],
    location="tempdir",
)

# Train the model
train(model, "train.csv")

# Assess accuracy (alternate way of specifying data source)
scorer = MeanSquaredErrorAccuracy()
print(
    "Accuracy:",
    score(
        model,
        scorer,
        Feature("TARGET", float, 1),
        CSVSource(filename="test.csv"),
    ),
)

# Make prediction
for i, features, prediction in predict(
    model, {"Feature1": 0.21, "Feature2": 0.18, "TARGET": 0.84}
):
    features["TARGET"] = prediction["TARGET"]["value"]
    print(features)
The null in confidence is the expected behavior. (See TODO in predict.)
Args
predict: Feature
Feature name holding target values
features: List of features
Features to train on
location: Path
Location where state should be saved
steps: Integer
default: 3000
Number of steps to train the model
epochs: Integer
default: 30
Number of iterations to pass over all records in a source
hidden: List of integers
default: [12, 40, 15]
List length is the number of hidden layers in the network. Each entry in the list is the number of nodes in that hidden layer
dffml_model_tensorflow_hub¶
pip install dffml-model-tensorflow-hub
text_classifier¶
Official
Implemented using Tensorflow hub pretrained models.
cat > train.csv << EOF
sentence,sentiment
Life is good,1
This book is amazing,1
It's a terrible movie,2
Global warming is bad,0
I hate you!!,2
This movie is horrible,2
EOF
cat > test.csv << EOF
sentence,sentiment
I am not feeling good,0
Our trip was full of adventures,1
EOF
Train the model
dffml train \
-model text_classifier \
-model-epochs 30 \
-model-predict sentiment:int:1 \
-model-location tempdir \
-model-classifications 0 1 \
-model-clstype int \
-sources f=csv \
-source-filename train.csv \
-model-features \
sentence:str:1 \
-model-model_path "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1" \
-model-add_layers \
-model-layers "Dense(units=512, activation='relu')" "Dense(units=2, activation='softmax')" \
-log debug
Assess the accuracy
dffml accuracy \
-model text_classifier \
-model-predict sentiment:int:1 \
-model-location tempdir \
-model-classifications 0 1 \
-model-model_path "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1" \
-model-clstype int \
-features sentiment:int:1 \
-sources f=csv \
-source-filename test.csv \
-model-features \
sentence:str:1 \
-scorer textclf \
-log critical
Output
0.5
Make a prediction
dffml predict all \
-model text_classifier \
-model-predict sentiment:int:1 \
-model-location tempdir \
-model-classifications 0 1 \
-model-model_path "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1" \
-model-clstype int \
-sources f=csv \
-source-filename test.csv \
-model-features \
sentence:str:1 \
-log debug
Output
[
{
"extra": {},
"features": {
"sentence": "I am not feeling good",
"sentiment": 0
},
"key": "0",
"last_updated": "2020-05-14T20:14:30Z",
"prediction": {
"sentiment": {
"confidence": 0.9999992847442627,
"value": 1
}
}
},
{
"extra": {},
"features": {
"sentence": "Our trip was full of adventures",
"sentiment": 1
},
"key": "1",
"last_updated": "2020-05-14T20:14:30Z",
"prediction": {
"sentiment": {
"confidence": 0.9999088048934937,
"value": 1
}
}
}
]
Example usage of Tensorflow_hub Text Classifier model using python API
from dffml import CSVSource, Features, Feature
from dffml.noasync import train, score, predict
from dffml_model_tensorflow_hub.text_classifier import TextClassificationModel
from dffml_model_tensorflow_hub.text_classifier_accuracy import (
    TextClassifierAccuracy,
)

model = TextClassificationModel(
    features=Features(Feature("sentence", str, 1)),
    predict=Feature("sentiment", int, 1),
    classifications=[0, 1, 2],
    clstype=int,
    location="tempdir",
)

# Train the model
train(model, "train.csv")

# Assess accuracy (alternate way of specifying data source)
scorer = TextClassifierAccuracy()
print(
    "Accuracy:",
    score(
        model,
        scorer,
        Feature("sentiment", int, 1),
        CSVSource(filename="test.csv"),
    ),
)

# Make prediction
for i, features, prediction in predict(
    model, {"sentence": "This track is horrible"},
):
    features["sentiment"] = prediction["sentiment"]["value"]
    print(features)
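The CLI example above also passed -model-add_layers and -model-layers; the Python equivalent is sketched below. The keyword names are taken from the Args list that follows, but treating add_layers as a boolean in Python is an assumption:
from dffml import Features, Feature
from dffml_model_tensorflow_hub.text_classifier import TextClassificationModel

# Mirrors -model-add_layers / -model-layers from the CLI example.
# add_layers=True (assumed boolean form) appends the listed Keras layer
# specs, given as strings, on top of the pretrained embedding.
model = TextClassificationModel(
    features=Features(Feature("sentence", str, 1)),
    predict=Feature("sentiment", int, 1),
    classifications=[0, 1],
    clstype=int,
    location="tempdir",
    add_layers=True,
    layers=[
        "Dense(units=512, activation='relu')",
        "Dense(units=2, activation='softmax')",
    ],
)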
Args
predict: Feature
Feature name holding classification value
classifications: List of strings
Options for value of classification
features: List of features
Features to train on
location: Path
Location where state should be saved
trainable: String
default: True
Tweak pretrained model by training again
batch_size: Integer
default: 120
Batch size
max_seq_length: Integer
default: 256
Length of sentence, used in preprocessing of input for bert embedding
add_layers: String
default: False
Add layers on top of the pretrained model/layer
embedType: String
default: None
Type of pretrained embedding model, required to be set to bert to use bert pretrained embedding
layers: List of strings
default: None
Extra layers to be added on top of pretrained model
model_path: String
default: https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1
Pretrained model path/url
optimizer: String
default: adam
Optimizer used by model
metrics: String
default: accuracy
Metric used to evaluate model
clstype: Type
default: <class ‘str’>
Data type of classifications values
epochs: Integer
default: 10
Number of iterations to pass over all records in a source
dffml_model_spacy¶
pip install dffml-model-spacy
spacyner¶
Official
Implemented using spaCy statistical models.
Note
You must download en_core_web_sm
before using this model
$ python -m spacy download en_core_web_sm
First we create the training and testing datasets.
Training data:
train.json
{
"data": [
{
"sentence": "I went to London and Berlin.",
"entities": [
{
"start":10,
"end": 16,
"tag": "LOC"
},
{
"start":21,
"end": 27,
"tag": "LOC"
}
]
},
{
"sentence": "Who is Alex?",
"entities": [
{
"start":7,
"end": 11,
"tag": "PERSON"
}
]
}
]
}
Testing data:
test.json
{
"data": [
{
"sentence": "Alex went to London?"
}
]
}
Train the model
$ dffml train \
-model spacyner \
-sources s=op \
-source-opimp dffml_model_spacy.ner.utils:parser \
-source-args train.json False \
-model-model_name en_core_web_sm \
-model-location temp \
-model-n_iter 5 \
-log debug
Assess the accuracy
$ dffml accuracy \
-model spacyner \
-sources s=op \
-source-opimp dffml_model_spacy.ner.utils:parser \
-source-args train.json False \
-model-model_name en_core_web_sm \
-model-location temp \
-model-n_iter 5 \
-features tag:str:1 \
-scorer sner \
-log debug
0.0
Make a prediction
$ dffml predict all \
-model spacyner \
-sources s=op \
-source-opimp dffml_model_spacy.ner.utils:parser \
-source-args test.json True \
-model-model_name en_core_web_sm \
-model-location temp \
-model-n_iter 5 \
-log debug
[
{
"extra": {},
"features": {
"entities": [],
"sentence": "Alex went to London?"
},
"key": 0,
"last_updated": "2020-07-27T16:26:18Z",
"prediction": {
"Answer": {
"confidence": null,
"value": [
[
"Alex",
"PERSON"
],
[
"London",
"GPE"
]
]
}
}
}
]
The model can be trained on larger datasets to get the expected output. The example shown above demonstrates the command line usage of the model.
In the above train, accuracy, and predict commands, the op source is used to read and parse data from the JSON file before feeding it to the model. The function used by the op source to parse the JSON data is:
import ast
import json


def parser(json_file: str, is_predicting: bool) -> dict:
    with open(json_file) as f:
        parsed_data = {}
        data = json.load(f)["data"]
        for id, entry in enumerate(data):
            entities = []
            sentence = entry["sentence"]
            if not ast.literal_eval(is_predicting):
                for entity in entry["entities"]:
                    start = entity["start"]
                    end = entity["end"]
                    tag = entity["tag"]
                    entities.append((start, end, tag))
            parsed_data[id] = {
                "features": {"sentence": sentence, "entities": entities}
            }
    return parsed_data
The location of the function is passed using:
-source-opimp dffml_model_spacy.ner.utils:parser
And the arguments to parser are passed by:
-source-args train.json False
where train.json is the name of the file containing the training data and the bool False is the value of the flag is_predicting.
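Because the CLI passes flag values through as strings, is_predicting arrives as "False" or "True", which is why parser runs it through ast.literal_eval. A quick sketch of calling the parser directly with the same arguments the commands above use:
from dffml_model_spacy.ner.utils import parser

# The second argument is the string "False", exactly as the CLI passes it;
# parser() converts it to a bool with ast.literal_eval.
records = parser("train.json", "False")
for key, record in records.items():
    print(key, record["features"]["sentence"], record["features"]["entities"])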
Args
location: String
Output location.
model_name: String
default: None
Name of one of the trained pipelines provided by spaCy. You can find the complete list at https://spacy.io/models. Defaults to a blank 'en' model.
n_iter: Integer
default: 10
Number of training iterations
dropout: float
default: 0.5
Dropout rate to be used during training
dffml_model_autosklearn¶
pip install dffml-model-autosklearn
Follow these instructions before running the above install command to ensure that auto-sklearn installs correctly.
Ubuntu Installation
To provide a C++11 building environment and the latest SWIG version on Ubuntu, run:
$ sudo apt-get install build-essential swig
Install other PyPI dependencies with
$ python3 -m pip install cython liac-arff psutil
$ curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 python3 -m pip install
For more information about installation visit https://automl.github.io/auto-sklearn/master/installation.html#installation
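After these steps, it can be worth confirming that auto-sklearn itself imports cleanly before using the DFFML wrapper; a one-off sanity check:
# Quick sanity check that the auto-sklearn installation succeeded.
import autosklearn

print(autosklearn.__version__)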
autoclassifier¶
Official
No description
Args
features: List of features
Features to train on
predict: Feature
Label or the value to be predicted
location: Path
Location where state should be saved
time_left_for_this_task: Integer
default: 3600
Time limit in seconds for the search of appropriate models. By increasing this value, auto-sklearn has a higher chance of finding better models.
per_run_time_limit: Integer
default: None
Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.
initial_configurations_via_metalearning: Integer
default: 25
Initialize the hyperparameter optimization algorithm with this many configurations which worked well on previously seen datasets. Disable if the hyperparameter optimization algorithm should start from scratch.
ensemble_size: Integer
default: 50
Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement.
ensemble_nbest: Integer
default: 50
Only consider the ensemble_nbest models when building an ensemble.
max_models_on_disc: Integer
default: 50
Defines the maximum number of models that are kept on disk; any additional models are permanently deleted. Because of this, it sets the upper limit on how many models can be used for an ensemble. It must be an integer greater than or equal to 1. If set to None, all models are kept on disk.
seed: Integer
default: 1
Used to seed SMAC. Will determine the output file names.
memory_limit: Integer
default: 3072
Memory limit in MB for the machine learning algorithm. auto-sklearn will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB. If None is provided, no memory limit is set. In case of multi-processing, memory_limit will be per job. This memory limit also applies to the ensemble creation process.
include_estimators: typing.Any
default: None
If None, all possible estimators are used. Otherwise specifies set of estimators to use.
exclude_estimators: typing.Any
default: None
If None, all possible estimators are used. Otherwise specifies set of estimators not to use. Incompatible with include_estimators.
include_preprocessors: typing.Any
default: None
If None all possible preprocessors are used. Otherwise specifies set of preprocessors to use.
exclude_preprocessors: typing.Any
default: None
If None all possible preprocessors are used. Otherwise specifies set of preprocessors not to use. Incompatible with include_preprocessors.
resampling_strategy: String
default: holdout
How to handle overfitting; might need 'resampling_strategy_arguments'. Accepts holdout or cross-validation strategies (cross-validation requires 'folds'), or a splitter class from the scikit-learn model_selection module.
resampling_strategy_arguments: dict
default: None
train_size should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split; shuffle determines whether the data is shuffled prior to splitting it into train and validation. Other arguments are those required by the chosen class, as specified in the scikit-learn documentation. If arguments are not provided, scikit-learn defaults are used. If no defaults are available, an exception is raised. Refer to the 'n_splits' argument as 'folds'.
tmp_folder: String
default: None
Folder to store configuration output and log files; if None, automatically uses /tmp/autosklearn_tmp_$pid_$random_number.
output_folder: String
default: None
Folder to store predictions for an optional test set; if None, no output will be generated.
delete_tmp_folder_after_terminate: String
default: True
Remove tmp_folder when finished. If tmp_folder is None, tmp_dir will always be deleted.
delete_output_folder_after_terminate: String
default: True
Remove output_folder when finished. If output_folder is None, output_dir will always be deleted.
n_jobs: Integer
default: None
The number of jobs to run in parallel for fit(). -1 means using all processors. By default, Auto-sklearn uses a single core for fitting the machine learning model and a single core for fitting an ensemble. Ensemble building is not affected by n_jobs but can be controlled by the number of models in the ensemble. In contrast to most scikit-learn models, n_jobs given in the constructor is not applied to the predict() method. If dask_client is None, a new dask client is created.
dask_client: typing.Any
default: None
User-created dask client, can be used to start a dask cluster and then attach auto-sklearn to it.
disable_evaluator_output: String
default: False
If True, disable model and prediction output. Cannot be used together with ensemble building; predict() cannot be used when setting this to True. Can also be used as a list to pass more fine-grained information on what to save; the allowed elements control, for example, whether predictions on the optimization/validation set (which would later on be used to build an ensemble) are saved.
smac_scenario_args: dict
default: None
Additional arguments inserted into the scenario of SMAC. See the SMAC documentation for a list of available arguments.
get_smac_object_callback: typing.Any
default: None
Callback function to create an object of class smac.optimizer.smbo.SMBO. The function must accept the arguments scenario_dict, instances, num_params, runhistory, seed and ta. This is an advanced feature. Use only if you are familiar with SMAC.
logging_config: dict
default: None
Dictionary object specifying the logger configuration. If None, the default logging.yaml file is used, which can be found in the directory util/logging.yaml relative to the installation.
metadata_directory: String
default: None
path to the metadata directory. If None, the default directory (autosklearn.metalearning.files) is used.
metric: typing.Any
default: None
An auto-sklearn metric (see the 'Metrics' section of the auto-sklearn documentation). If None is provided, a default metric is selected depending on the task.
scoring_functions: typing.Any
default: None
List of scorers which will be calculated for each pipeline; results will be available via cv_results.
load_models: String
default: True
Whether to load the models after fitting Auto-sklearn.
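autoclassifier has no worked example on this page. By analogy with the autoregressor example below, a hedged sketch of Python usage; the class name AutoSklearnClassifierModel is inferred from the package's AutoSklearnRegressorModel, and the file and feature names are made up for illustration:
from dffml import Features, Feature
from dffml.noasync import train, predict

# Assumed import, by analogy with AutoSklearnRegressorModel below.
from dffml_model_autosklearn import AutoSklearnClassifierModel

model = AutoSklearnClassifierModel(
    features=Features(
        Feature("Feature1", float, 1), Feature("Feature2", float, 1),
    ),
    predict=Feature("TARGET", int, 1),
    location="tempdir-autoclassifier",
    time_left_for_this_task=120,
)

# Train on a CSV with Feature1, Feature2 and an integer TARGET column.
train(model, "train.csv")

# Predict on a CSV containing only Feature1 and Feature2.
for i, features, prediction in predict(model, "predict.csv"):
    features["TARGET"] = prediction["TARGET"]["value"]
    print(features)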
autoregressor¶
Official
autoregressor / AutoSklearnRegressorModel will use auto-sklearn to train a scikit-learn model for you.
This is AutoML; it will tune hyperparameters for you.
Implemented using AutoSklearnRegressor.
First we create the training and testing datasets
train.csv
Feature1,Feature2,TARGET
0.93,0.68,3.89
0.24,0.42,1.75
0.36,0.68,2.75
0.53,0.31,2.00
0.29,0.25,1.32
0.29,0.52,2.14
test.csv
Feature1,Feature2,TARGET
0.57,0.84,3.65
0.95,0.19,2.46
0.23,0.15,0.93
Train the model
$ dffml train \
-model autoregressor \
-model-predict TARGET:float:1 \
-model-clstype int \
-sources f=csv \
-source-filename train.csv \
-model-features \
Feature1:float:1 \
Feature2:float:1 \
-model-time_left_for_this_task 120 \
-model-per_run_time_limit 30 \
-model-ensemble_size 50 \
-model-delete_tmp_folder_after_terminate False \
-model-location tempdir \
-log debug
Assess the accuracy
$ dffml accuracy \
-model autoregressor \
-model-predict TARGET:float:1 \
-model-location tempdir \
-features TARGET:float:1 \
-sources f=csv \
-source-filename test.csv \
-model-features \
Feature1:float:1 \
Feature2:float:1 \
-scorer mse \
-log critical
0.9961211434899032
Make a file containing the data to predict on
predict.csv
Feature1,Feature2
0.57,0.84
Make a prediction
$ dffml predict all \
-model autoregressor \
-model-location tempdir \
-model-predict TARGET:float:1 \
-sources iris=csv \
-model-features \
Feature1:float:1 \
Feature2:float:1 \
-source-filename predict.csv
[
{
"extra": {},
"features": {
"Feature1": 0.57,
"Feature2": 0.84
},
"key": "0",
"last_updated": "2020-11-23T05:52:13Z",
"prediction": {
"TARGET": {
"confidence": NaN,
"value": 3.566799074411392
}
}
}
]
The model can be trained on larger datasets to get better accuracy. The example shown above demonstrates the command line usage of the model.
Example usage of using the model from Python
run.py
from dffml import Features, Feature
from dffml.noasync import train, score, predict
from dffml_model_autosklearn import AutoSklearnRegressorModel
from dffml.accuracy import MeanSquaredErrorAccuracy

model = AutoSklearnRegressorModel(
    features=Features(
        Feature("Feature1", float, 1), Feature("Feature2", float, 1),
    ),
    predict=Feature("TARGET", float, 1),
    location="tempdir-python",
    time_left_for_this_task=120,
)


def main():
    # Train the model
    train(model, "train.csv")

    # Assess accuracy
    scorer = MeanSquaredErrorAccuracy()
    print(
        "Accuracy:",
        score(model, scorer, Feature("TARGET", float, 1), "test.csv"),
    )

    # Make prediction
    for i, features, prediction in predict(model, "predict.csv"):
        features["TARGET"] = prediction["TARGET"]["value"]
        print(features)


if __name__ == "__main__":
    main()
Run the file
$ python run.py
Accuracy: 0.9961211434899032
{'Feature1': 0.57, 'Feature2': 0.84, 'TARGET': 3.6180416345596313}
Args
features: List of features
Features to train on
predict: Feature
Label or the value to be predicted
location: Path
Location where state should be saved
time_left_for_this_task: Integer
default: 3600
Time limit in seconds for the search of appropriate models. By increasing this value, auto-sklearn has a higher chance of finding better models.
per_run_time_limit: Integer
default: None
Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.
initial_configurations_via_metalearning: Integer
default: 25
Initialize the hyperparameter optimization algorithm with this many configurations which worked well on previously seen datasets. Disable if the hyperparameter optimization algorithm should start from scratch.
ensemble_size: Integer
default: 50
Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement.
ensemble_nbest: Integer
default: 50
Only consider the ensemble_nbest models when building an ensemble.
max_models_on_disc: Integer
default: 50
Defines the maximum number of models that are kept on disk; any additional models are permanently deleted. Because of this, it sets the upper limit on how many models can be used for an ensemble. It must be an integer greater than or equal to 1. If set to None, all models are kept on disk.
seed: Integer
default: 1
Used to seed SMAC. Will determine the output file names.
memory_limit: Integer
default: 3072
Memory limit in MB for the machine learning algorithm. auto-sklearn will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB. If None is provided, no memory limit is set. In case of multi-processing, memory_limit will be per job. This memory limit also applies to the ensemble creation process.
include_estimators: typing.Any
default: None
If None, all possible estimators are used. Otherwise specifies the set of estimators to use.
exclude_estimators: typing.Any
default: None
If None, all possible estimators are used. Otherwise specifies the set of estimators not to use. Incompatible with include_estimators.
include_preprocessors: typing.Any
default: None
If None, all possible preprocessors are used. Otherwise specifies the set of preprocessors to use.
exclude_preprocessors: typing.Any
default: None
If None, all possible preprocessors are used. Otherwise specifies the set of preprocessors not to use. Incompatible with include_preprocessors.
resampling_strategy: String
default: holdout
How to handle overfitting; might need 'resampling_strategy_arguments'. Options include 'holdout' (67:33 train/test split), 'holdout-iterative-fit' (same split, calls iterative fit where possible), 'cv' and 'partial-cv' (cross validation, require 'folds'), or any BaseCrossValidator, _RepeatedSplits, or BaseShuffleSplit object found in the scikit-learn model_selection module. A configuration sketch combining several of these arguments follows this list.
resampling_strategy_arguments: dict
default: None
Additional arguments for resampling_strategy. train_size should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the train split; shuffle determines whether the data is shuffled prior to splitting it into train and validation. When a scikit-learn cross validator object is used, pass all arguments required by the chosen class as specified in the scikit-learn documentation. If arguments are not provided, scikit-learn defaults are used; if no defaults are available, an exception is raised. Refer to the 'n_splits' argument as 'folds'.
tmp_folder: String
default: None
Folder to store configuration output and log files; if None, automatically uses /tmp/autosklearn_tmp_$pid_$random_number.
output_folder: String
default: None
Folder to store predictions for an optional test set; if None, no output will be generated.
delete_tmp_folder_after_terminate: String
default: True
Remove tmp_folder when finished. If tmp_folder is None, tmp_dir will always be deleted.
delete_output_folder_after_terminate: String
default: True
Remove output_folder when finished. If output_folder is None, output_dir will always be deleted.
n_jobs: Integer
default: None
The number of jobs to run in parallel for fit(). -1 means using all processors. By default, Auto-sklearn uses a single core for fitting the machine learning model and a single core for fitting an ensemble. Ensemble building is not affected by n_jobs but can be controlled by the number of models in the ensemble. In contrast to most scikit-learn models, n_jobs given in the constructor is not applied to the predict() method. If dask_client is None, a new dask client is created.
dask_client: typing.Any
default: None
User-created dask client; can be used to start a dask cluster and then attach auto-sklearn to it.
disable_evaluator_output: String
default: False
If True, disable model and prediction output. Cannot be used together with ensemble building; predict() cannot be used when setting this True. Can also be used as a list to pass more fine-grained information on what to save. Allowed elements in the list are 'y_optimization' (do not save predictions for the optimization/validation set, which would later on be used to build an ensemble) and 'model' (do not save any model files).
smac_scenario_args: dict
default: None
Additional arguments inserted into the scenario of SMAC. See the SMAC documentation for a list of available arguments.
get_smac_object_callback: typing.Any
default: None
Callback function to create an object of class smac.optimizer.smbo.SMBO. The function must accept the arguments scenario_dict, instances, num_params, runhistory, seed and ta. This is an advanced feature. Use only if you are familiar with SMAC.
logging_config: dict
default: None
Dictionary object specifying the logger configuration. If None, the default logging.yaml file is used, which can be found in the directory util/logging.yaml relative to the installation.
metadata_directory: String
default: None
path to the metadata directory. If None, the default directory (autosklearn.metalearning.files) is used.
metric: typing.Any
default: None
An instance of autosklearn.metrics.Scorer, as created by autosklearn.metrics.make_scorer(); these are the auto-sklearn built-in metrics. If None is provided, a default metric is selected depending on the task.
scoring_functions: typing.Any
default: None
List of scorers which will be calculated for each pipeline; results will be available via cv_results.
load_models: String
default: True
Whether to load the models after fitting Auto-sklearn.
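To see how these arguments fit together, here is a configuration sketch; every keyword below is documented in the Args list above, but the values are purely illustrative, not recommendations (tuned.py is an arbitrary filename):
tuned.py
from dffml import Feature, Features
from dffml_model_autosklearn import AutoSklearnRegressorModel

# All keyword arguments below are documented in the Args list above;
# the values are illustrative choices, not recommendations.
model = AutoSklearnRegressorModel(
    features=Features(
        Feature("Feature1", float, 1), Feature("Feature2", float, 1),
    ),
    predict=Feature("TARGET", float, 1),
    location="tempdir-tuned",
    time_left_for_this_task=600,  # total search budget in seconds
    per_run_time_limit=60,  # cap each single model fit
    ensemble_size=25,  # number of models in the final ensemble
    memory_limit=4096,  # MB per job
    resampling_strategy="cv",
    resampling_strategy_arguments={"folds": 5},
    n_jobs=2,  # parallel fits
    seed=42,
)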