Operations¶
Operation Implementations are subclasses of dffml.df.base.OperationImplementation. They are functions or classes that can do anything: make HTTP requests, run inference, etc.
They don’t necessarily have to be written in Python. Although DFFML isn’t quite to the point where it can use operations written in other languages yet, it’s on the roadmap.
dffml¶
pip install dffml
AcceptUserInput¶
Official
Accept input from stdin using Python’s input()
Returns¶
- dict
A dictionary containing user input.
Examples¶
The following example shows how to use AcceptUserInput. (It assumes that the input from stdin is “Data flow is awesome”!)
>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(AcceptUserInput, GetSingle)
>>> dataflow.seed.append(
... Input(
... value=[AcceptUserInput.op.outputs["InputData"].name],
... definition=GetSingle.op.inputs["spec"],
... )
... )
>>>
>>> async def main():
... async for ctx, results in MemoryOrchestrator.run(dataflow, {"input": []}):
... print(results)
>>>
>>> asyncio.run(main())
Enter the value: {'UserInput': 'Data flow is awesome'}
Stage: processing
Outputs
InputData: UserInput(type: str)
associate¶
Official
No description
Stage: output
Inputs
spec: associate_spec(type: List[str])
Outputs
output: associate_output(type: Dict[str, Any])
associate_definition¶
Official
Examples¶
>>> import asyncio
>>> from dffml import *
>>>
>>> feed_def = Definition(name="feed", primitive="string")
>>> dead_def = Definition(name="dead", primitive="string")
>>> output = Definition(name="output", primitive="string")
>>>
>>> feed_input = Input(value="my favorite value", definition=feed_def)
>>> face_input = Input(
... value="face", definition=output, parents=[feed_input]
... )
>>>
>>> dead_input = Input(
... value="my second favorite value", definition=dead_def
... )
>>> beef_input = Input(
... value="beef", definition=output, parents=[dead_input]
... )
>>>
>>> async def main():
... for value in ["feed", "dead"]:
... async for ctx, results in MemoryOrchestrator.run(
... DataFlow.auto(AssociateDefinition),
... [
... feed_input,
... face_input,
... dead_input,
... beef_input,
... Input(
... value={value: "output"},
... definition=AssociateDefinition.op.inputs["spec"],
... ),
... ],
... ):
... print(results)
>>>
>>> asyncio.run(main())
{'feed': 'face'}
{'dead': 'beef'}
Stage: output
Inputs
spec: associate_spec(type: List[str])
Outputs
output: associate_output(type: Dict[str, Any])
bz2_compress¶
Official
No description
Stage: processing
Inputs
input_file_path: decompressed_bz2_file_path(type: str)
output_file_path: compressed_bz2_file_path(type: str)
Outputs
output_path: compressed_output_bz2_file_path(type: str)
bz2_decompress¶
Official
No description
Stage: processing
Inputs
input_file_path: compressed_bz2_file_path(type: str)
output_file_path: decompressed_bz2_file_path(type: str)
Outputs
output_path: decompressed_output_bz2_file_path(type: str)
convert_list_to_records¶
Official
No description
Stage: processing
Inputs
matrix: matrix(type: List[List[Any]])
features: features(type: List[str])
keys: keys(type: List[str])
predict_features: predict_features(type: List[str])
unprocessed_matrix: unprocessed_matrix(type: List[List[Any]])
Outputs
records: records(type: Dict[str, Any])
convert_records_to_list¶
Official
No description
Stage: processing
Inputs
features: features(type: List[str])
predict_features: predict_features(type: List[str])
Outputs
matrix: matrix(type: List[List[Any]])
keys: keys(type: List[str])
unprocessed_matrix: unprocessed_matrix(type: List[List[Any]])
Args
source: Entrypoint
db_query_create_table¶
Official
Generates a create table query in the database.
Parameters¶
- table_name: str
The name of the table to be created.
- cols: list[str]
Columns of the table.
Examples¶
>>> import asyncio
>>> from dffml import *
>>>
>>> sdb = SqliteDatabase(SqliteDatabaseConfig(filename="examples.db"))
>>>
>>> dataflow = DataFlow(
... operations={"db_query_create": db_query_create_table.op,},
... configs={"db_query_create": DatabaseQueryConfig(database=sdb),},
... seed=[],
... )
>>>
>>> inputs = [
... Input(
... value="myTable1",
... definition=db_query_create_table.op.inputs["table_name"],
... ),
... Input(
... value={
... "key": "real",
... "firstName": "text",
... "lastName": "text",
... "age": "real",
... },
... definition=db_query_create_table.op.inputs["cols"],
... ),
... ]
>>>
>>> async def main():
... async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
... pass
>>>
>>> asyncio.run(main())
Stage: processing
Inputs
table_name: query_table(type: str)
cols: query_cols(type: Dict[str, str])
Args
database: Entrypoint
db_query_insert¶
Official
Generates an insert query in the database.
Parameters¶
- table_name: str
The name of the table to insert data into.
- data: dict
Data to be inserted into the table.
Examples¶
>>> import asyncio
>>> from dffml import *
>>>
>>> sdb = SqliteDatabase(SqliteDatabaseConfig(filename="examples.db"))
>>>
>>> dataflow = DataFlow(
... operations={
... "db_query_insert": db_query_insert.op,
... "db_query_lookup": db_query_lookup.op,
... "get_single": GetSingle.imp.op,
... },
... configs={
... "db_query_lookup": DatabaseQueryConfig(database=sdb),
... "db_query_insert": DatabaseQueryConfig(database=sdb),
... },
... seed=[],
... )
>>>
>>> inputs = {
... "insert": [
... Input(
... value="myTable", definition=db_query_insert.op.inputs["table_name"],
... ),
... Input(
... value={"key": 10, "firstName": "John", "lastName": "Doe", "age": 16},
... definition=db_query_insert.op.inputs["data"],
... ),
... ],
... "lookup": [
... Input(
... value="myTable", definition=db_query_lookup.op.inputs["table_name"],
... ),
... Input(
... value=["firstName", "lastName", "age"],
... definition=db_query_lookup.op.inputs["cols"],
... ),
... Input(value=[], definition=db_query_lookup.op.inputs["conditions"],),
... Input(
... value=[db_query_lookup.op.outputs["lookups"].name],
... definition=GetSingle.op.inputs["spec"],
... ),
... ]
... }
>>>
>>> async def main():
... async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
... if result:
... print(result)
>>>
>>> asyncio.run(main())
{'query_lookups': [{'firstName': 'John', 'lastName': 'Doe', 'age': 16}]}
Stage: processing
Inputs
table_name: query_table(type: str)
data: query_data(type: Dict[str, Any])
Args
database: Entrypoint
db_query_insert_or_update¶
Official
Automatically uses whichever operation is better suited: an insert query or an update query.
Parameters¶
- table_name: str
The name of the table to insert or update data in.
- data: dict
Data to be inserted or updated in the table.
Examples¶
>>> import asyncio
>>> from dffml import *
>>>
>>> sdb = SqliteDatabase(SqliteDatabaseConfig(filename="examples.db"))
>>>
>>> person = {"key": 11, "firstName": "John", "lastName": "Wick", "age": 38}
>>>
>>> dataflow = DataFlow(
... operations={
... "db_query_insert_or_update": db_query_insert_or_update.op,
... "db_query_lookup": db_query_lookup.op,
... "get_single": GetSingle.imp.op,
... },
... configs={
... "db_query_insert_or_update": DatabaseQueryConfig(database=sdb),
... "db_query_lookup": DatabaseQueryConfig(database=sdb),
... },
... seed=[],
... )
>>>
>>> inputs = {
... "insert_or_update": [
... Input(
... value="myTable", definition=db_query_update.op.inputs["table_name"],
... ),
... Input(
... value=person,
... definition=db_query_update.op.inputs["data"],
... ),
... ],
... "lookup": [
... Input(
... value="myTable",
... definition=db_query_lookup.op.inputs["table_name"],
... ),
... Input(
... value=["firstName", "lastName", "age"],
... definition=db_query_lookup.op.inputs["cols"],
... ),
... Input(value=[], definition=db_query_lookup.op.inputs["conditions"],),
... Input(
... value=[db_query_lookup.op.outputs["lookups"].name],
... definition=GetSingle.op.inputs["spec"],
... ),
... ],
... }
>>>
>>> async def main():
... async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
... if result:
... print(result)
>>>
>>> asyncio.run(main())
{'query_lookups': [{'firstName': 'John', 'lastName': 'Wick', 'age': 38}]}
>>>
>>> person["age"] += 1
>>>
>>> asyncio.run(main())
{'query_lookups': [{'firstName': 'John', 'lastName': 'Wick', 'age': 39}]}
Stage: processing
Inputs
table_name: query_table(type: str)
data: query_data(type: Dict[str, Any])
Args
database: Entrypoint
db_query_lookup¶
Official
Generates a lookup query in the database.
Parameters¶
- table_name: str
The name of the table.
- cols: list[str]
Columns of the table.
- conditions: Conditions
Query conditions.
Examples¶
>>> import asyncio
>>> from dffml import *
>>>
>>> sdb = SqliteDatabase(SqliteDatabaseConfig(filename="examples.db"))
>>>
>>> dataflow = DataFlow(
... operations={
... "db_query_lookup": db_query_lookup.op,
... "get_single": GetSingle.imp.op,
... },
... configs={"db_query_lookup": DatabaseQueryConfig(database=sdb),},
... seed=[],
... )
>>>
>>> inputs = {
... "lookup": [
... Input(
... value="myTable",
... definition=db_query_lookup.op.inputs["table_name"],
... ),
... Input(
... value=["firstName", "lastName", "age"],
... definition=db_query_lookup.op.inputs["cols"],
... ),
... Input(value=[], definition=db_query_lookup.op.inputs["conditions"],),
... Input(
... value=[db_query_lookup.op.outputs["lookups"].name],
... definition=GetSingle.op.inputs["spec"],
... ),
... ],
... }
>>>
>>> async def main():
... async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
... if result:
... print(result)
>>>
>>> asyncio.run(main())
{'query_lookups': [{'firstName': 'John', 'lastName': 'Doe', 'age': 16}, {'firstName': 'John', 'lastName': 'Wick', 'age': 39}]}
Stage: processing
Inputs
table_name: query_table(type: str)
cols: query_cols(type: Dict[str, str])
conditions: query_conditions(type: Conditions)
Outputs
lookups: query_lookups(type: Dict[str, Any])
Args
database: Entrypoint
db_query_remove¶
Official
Generates a remove (delete) query in the database.
Parameters¶
- table_name: str
The name of the table to remove data from.
- conditions: Conditions
Query conditions.
Examples¶
>>> import asyncio
>>> from dffml import *
>>>
>>> sdb = SqliteDatabase(SqliteDatabaseConfig(filename="examples.db"))
>>>
>>> dataflow = DataFlow(
... operations={
... "db_query_lookup": db_query_lookup.op,
... "db_query_remove": db_query_remove.op,
... "get_single": GetSingle.imp.op,
... },
... configs={
... "db_query_remove": DatabaseQueryConfig(database=sdb),
... "db_query_lookup": DatabaseQueryConfig(database=sdb),
... },
... seed=[],
... )
>>>
>>> inputs = {
... "remove": [
... Input(
... value="myTable",
... definition=db_query_remove.op.inputs["table_name"],
... ),
... Input(value=[],
... definition=db_query_remove.op.inputs["conditions"],),
... ],
... "lookup": [
... Input(
... value="myTable",
... definition=db_query_lookup.op.inputs["table_name"],
... ),
... Input(
... value=["firstName", "lastName", "age"],
... definition=db_query_lookup.op.inputs["cols"],
... ),
... Input(value=[], definition=db_query_lookup.op.inputs["conditions"],),
... Input(
... value=[db_query_lookup.op.outputs["lookups"].name],
... definition=GetSingle.op.inputs["spec"],
... ),
... ],
... }
>>>
>>> async def main():
... async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
... if result:
... print(result)
>>>
>>> asyncio.run(main())
{'query_lookups': []}
Stage: processing
Inputs
table_name: query_table(type: str)
conditions: query_conditions(type: Conditions)
Args
database: Entrypoint
db_query_update¶
Official
Generates an update query in the database.
Parameters¶
- table_name: str
The name of the table to update data in.
- data: dict
Data to be updated in the table.
- conditions: list
List of query conditions.
Examples¶
>>> import asyncio
>>> from dffml import *
>>>
>>> sdb = SqliteDatabase(SqliteDatabaseConfig(filename="examples.db"))
>>>
>>> dataflow = DataFlow(
... operations={
... "db_query_update": db_query_update.op,
... "db_query_lookup": db_query_lookup.op,
... "get_single": GetSingle.imp.op,
... },
... configs={
... "db_query_update": DatabaseQueryConfig(database=sdb),
... "db_query_lookup": DatabaseQueryConfig(database=sdb),
... },
... seed=[],
... )
>>>
>>> inputs = {
... "update": [
... Input(
... value="myTable",
... definition=db_query_update.op.inputs["table_name"],
... ),
... Input(
... value={
... "key": 10,
... "firstName": "John",
... "lastName": "Doe",
... "age": 17,
... },
... definition=db_query_update.op.inputs["data"],
... ),
... Input(value=[], definition=db_query_update.op.inputs["conditions"],),
... ],
... "lookup": [
... Input(
... value="myTable",
... definition=db_query_lookup.op.inputs["table_name"],
... ),
... Input(
... value=["firstName", "lastName", "age"],
... definition=db_query_lookup.op.inputs["cols"],
... ),
... Input(value=[], definition=db_query_lookup.op.inputs["conditions"],),
... Input(
... value=[db_query_lookup.op.outputs["lookups"].name],
... definition=GetSingle.op.inputs["spec"],
... ),
... ],
... }
>>>
>>> async def main():
... async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
... if result:
... print(result)
>>>
>>> asyncio.run(main())
{'query_lookups': [{'firstName': 'John', 'lastName': 'Doe', 'age': 17}]}
Stage: processing
Inputs
table_name: query_table(type: str)
data: query_data(type: Dict[str, Any])
conditions: query_conditions(type: Conditions)
Args
database: Entrypoint
dffml.dataflow.run¶
Official
Starts a subflow using self.config.dataflow and adds inputs to it.
Parameters¶
- inputs: dict
The inputs to add to the subflow. These should be a key value mapping of the context string to the inputs which should be seeded for that context string.
Returns¶
- dict
Maps context strings in inputs to output after running through dataflow.
Examples¶
The following shows how to use run dataflow in its default behavior.
>>> import asyncio
>>> from dffml import *
>>>
>>> URL = Definition(name="URL", primitive="string")
>>>
>>> subflow = DataFlow.auto(GetSingle)
>>> subflow.definitions[URL.name] = URL
>>> subflow.seed.append(
... Input(
... value=[URL.name],
... definition=GetSingle.op.inputs["spec"]
... )
... )
>>>
>>> dataflow = DataFlow.auto(run_dataflow, GetSingle)
>>> dataflow.configs[run_dataflow.op.name] = RunDataFlowConfig(subflow)
>>> dataflow.seed.append(
... Input(
... value=[run_dataflow.op.outputs["results"].name],
... definition=GetSingle.op.inputs["spec"]
... )
... )
>>>
>>> async def main():
... async for ctx, results in MemoryOrchestrator.run(dataflow, {
... "run_subflow": [
... Input(
... value={
... "dffml": [
... {
... "value": "https://github.com/intel/dffml",
... "definition": URL.name
... }
... ]
... },
... definition=run_dataflow.op.inputs["inputs"]
... )
... ]
... }):
... print(results)
>>>
>>> asyncio.run(main())
{'flow_results': {'dffml': {'URL': 'https://github.com/intel/dffml'}}}
The following shows how to use run dataflow with custom inputs and outputs. This allows you to run a subflow as if it were an operation.
>>> import asyncio
>>> from dffml import *
>>>
>>> URL = Definition(name="URL", primitive="string")
>>>
>>> @op(
... inputs={"url": URL},
... outputs={"last": Definition("last_element_in_path", primitive="string")},
... )
... def last_path(url):
... return {"last": url.split("/")[-1]}
>>>
>>> subflow = DataFlow.auto(last_path, GetSingle)
>>> subflow.seed.append(
... Input(
... value=[last_path.op.outputs["last"].name],
... definition=GetSingle.op.inputs["spec"],
... )
... )
>>>
>>> dataflow = DataFlow.auto(run_dataflow, GetSingle)
>>> dataflow.operations[run_dataflow.op.name] = run_dataflow.op._replace(
... inputs={"URL": URL},
... outputs={last_path.op.outputs["last"].name: last_path.op.outputs["last"]},
... expand=[],
... )
>>> dataflow.configs[run_dataflow.op.name] = RunDataFlowConfig(subflow)
>>> dataflow.seed.append(
... Input(
... value=[last_path.op.outputs["last"].name],
... definition=GetSingle.op.inputs["spec"],
... )
... )
>>> dataflow.update(auto_flow=True)
>>>
>>> async def main():
... async for ctx, results in MemoryOrchestrator.run(
... dataflow,
... {
... "run_subflow": [
... Input(value="https://github.com/intel/dffml", definition=URL)
... ]
... },
... ):
... print(results)
>>>
>>> asyncio.run(main())
{'last_element_in_path': 'dffml'}
Stage: processing
Inputs
inputs: flow_inputs(type: Dict[str,Any])
Outputs
results: flow_results(type: Dict[str,Any])
Args
dataflow: DataFlow
dffml.mapping.create¶
Official
Creates a mapping of a given key and value.
Parameters¶
- key: str
The key for the mapping.
- value: Any
The value for the mapping.
Returns¶
- dict
A dictionary containing the mapping created.
Examples¶
>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(create_mapping, GetSingle)
>>> dataflow.seed.append(
... Input(
... value=[create_mapping.op.outputs["mapping"].name],
... definition=GetSingle.op.inputs["spec"],
... )
... )
>>> inputs = [
... Input(
... value="key1", definition=create_mapping.op.inputs["key"],
... ),
... Input(
... value=42, definition=create_mapping.op.inputs["value"],
... ),
... ]
>>>
>>> async def main():
... async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
... print(result)
>>>
>>> asyncio.run(main())
{'mapping': {'key1': 42}}
Stage: processing
Inputs
key: key(type: str)
value: value(type: generic)
Outputs
mapping: mapping(type: map)
dffml.mapping.extract¶
Official
Extracts value from a given mapping.
Parameters¶
- mapping: dict
The mapping to extract the value from.
- traverse: list[str]
A list of keys to traverse through the mapping dictionary and extract the values.
Returns¶
- dict
A dictionary containing the value of the keys.
Examples¶
>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(mapping_extract_value, GetSingle)
>>>
>>> dataflow.seed.append(
... Input(
... value=[mapping_extract_value.op.outputs["value"].name],
... definition=GetSingle.op.inputs["spec"],
... )
... )
>>> inputs = [
... Input(
... value={"key1": {"key2": 42}},
... definition=mapping_extract_value.op.inputs["mapping"],
... ),
... Input(
... value=["key1", "key2"],
... definition=mapping_extract_value.op.inputs["traverse"],
... ),
... ]
>>>
>>> async def main():
... async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
... print(result)
>>>
>>> asyncio.run(main())
{'value': 42}
Stage: processing
Inputs
mapping: mapping(type: map)
traverse: mapping_traverse(type: List[str])
Outputs
value: value(type: generic)
dffml.model.predict¶
Official
Predict using dffml models.
Parameters¶
- features: dict
A dictionary containing feature names and feature values.
Returns¶
- dict
A dictionary containing prediction.
Examples¶
The following example shows how to use model_predict.
>>> import asyncio
>>> from dffml import *
>>>
>>> slr_model = SLRModel(
... features=Features(Feature("Years", int, 1)),
... predict=Feature("Salary", int, 1),
... location="tempdir",
... )
>>> dataflow = DataFlow(
... operations={
... "prediction_using_model": model_predict,
... "get_single": GetSingle,
... },
... configs={"prediction_using_model": ModelPredictConfig(model=slr_model)},
... )
>>> dataflow.seed.append(
... Input(
... value=[model_predict.op.outputs["prediction"].name],
... definition=GetSingle.op.inputs["spec"],
... )
... )
>>>
>>> async def main():
... await train(
... slr_model,
... {"Years": 0, "Salary": 10},
... {"Years": 1, "Salary": 20},
... {"Years": 2, "Salary": 30},
... {"Years": 3, "Salary": 40},
... )
... inputs = [
... Input(
... value={"Years": 4}, definition=model_predict.op.inputs["features"],
... )
... ]
... async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
... print(results)
>>>
>>> asyncio.run(main())
{'model_predictions': {'Salary': {'confidence': 1.0, 'value': 50}}}
Stage: processing
Inputs
features: record_features(type: Dict[str, Any])
Outputs
prediction: model_predictions(type: Dict[str, Any])
Args
model: Entrypoint
extract_tar_archive¶
Official
Extracts a given tar file.
Parameters¶
- input_file_path: str
Path to the tar file.
- output_directory_path: str
Path where all the files should be extracted.
Returns¶
- dict
Path to the directory where the archive has been extracted
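The following is a minimal, unverified sketch of wiring extract_tar_archive into a dataflow, following the pattern of the examples above. The file and directory names and the printed result are placeholder assumptions; the archive must already exist.
>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(extract_tar_archive, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[extract_tar_archive.op.outputs["output_path"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value="files.tar",  # placeholder: an existing tar archive
...         definition=extract_tar_archive.op.inputs["input_file_path"],
...     ),
...     Input(
...         value="files",  # placeholder: directory to extract into
...         definition=extract_tar_archive.op.inputs["output_directory_path"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(results)
>>>
>>> asyncio.run(main())
{'output_directory_path': 'files'}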
Stage: processing
Inputs
input_file_path: tar_file(type: str)
output_directory_path: directory(type: str)
Outputs
output_path: output_directory_path(type: str)
extract_zip_archive¶
Official
Extracts a given zip file.
Parameters¶
- input_file_path: str
Path to the zip file.
- output_directory_path: str
Path where all the files should be extracted.
Returns¶
- dict
Path to the directory where the archive has been extracted
Stage: processing
Inputs
input_file_path: zip_file(type: str)
output_directory_path: directory(type: str)
Outputs
output_path: output_directory_path(type: str)
get_multi¶
Official
Output operation to get all Inputs matching given definitions.
Parameters¶
- spec: list
List of definition names. Any Inputs with matching definition will be returned.
Returns¶
- dict
Maps definition names to all the Inputs of that definition
Examples¶
The following shows how to grab all Inputs with the URL definition. If we had run an operation which output a URL, that output URL would have also been returned to us.
>>> import asyncio
>>> from dffml import *
>>>
>>> URL = Definition(name="URL", primitive="string")
>>>
>>> dataflow = DataFlow.auto(GetMulti)
>>> dataflow.seed.append(
... Input(
... value=[URL.name],
... definition=GetMulti.op.inputs["spec"]
... )
... )
>>>
>>> async def main():
... async for ctx, results in MemoryOrchestrator.run(dataflow, [
... Input(
... value="https://github.com/intel/dffml",
... definition=URL
... ),
... Input(
... value="https://github.com/intel/cve-bin-tool",
... definition=URL
... )
... ]):
... print(results)
...
>>> asyncio.run(main())
{'URL': ['https://github.com/intel/dffml', 'https://github.com/intel/cve-bin-tool']}
Stage: output
Inputs
spec: get_multi_spec(type: array)
Outputs
output: get_multi_output(type: map)
get_single¶
Official
Output operation to get a single Input for each definition given.
Parameters¶
- spec: list
List of definition names. An Input with matching definition will be returned.
Returns¶
- dict
Maps definition names to an Input of that definition
Examples¶
The following shows how to grab an Input with the URL definition. If we had run an operation which output a URL, that output URL could have also been returned to us.
>>> import asyncio
>>> from dffml import *
>>>
>>> URL = Definition(name="URL", primitive="string")
>>> ORG = Definition(name="ORG", primitive="string")
>>>
>>> dataflow = DataFlow.auto(GetSingle)
>>> dataflow.seed.append(
... Input(
... value=[{"Repo Link": URL.name}, ORG.name],
... definition=GetSingle.op.inputs["spec"]
... )
... )
>>>
>>> async def main():
... async for ctx, results in MemoryOrchestrator.run(dataflow, [
... Input(
... value="https://github.com/intel/dffml",
... definition=URL
... ),
... Input(
... value="Intel",
... definition=ORG
... )
... ]):
... print(results)
...
>>> asyncio.run(main())
{'ORG': 'Intel', 'Repo Link': 'https://github.com/intel/dffml'}
Stage: output
Inputs
spec: get_single_spec(type: array)
Outputs
output: get_single_output(type: map)
group_by¶
Official
No description
Stage: output
Inputs
spec: group_by_spec(type: Dict[str, Any])
Outputs
output: group_by_output(type: Dict[str, List[Any]])
gz_compress¶
Official
No description
Stage: processing
Inputs
input_file_path: decompressed_gz_file_path(type: str)
output_file_path: compressed_gz_file_path(type: str)
Outputs
output_path: compressed_output_gz_file_path(type: str)
gz_decompress¶
Official
No description
Stage: processing
Inputs
input_file_path: compressed_gz_file_path(type: str)
output_file_path: decompressed_gz_file_path(type: str)
Outputs
output_path: decompressed_output_gz_file_path(type: str)
literal_eval¶
Official
Evaluate the input using ast.literal_eval()
Parameters¶
- str_to_eval: str
A string to be evaluated.
Returns¶
- dict
A dict containing the evaluated Python literal.
Examples¶
The following example shows how to use literal_eval.
>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(literal_eval, GetSingle)
>>> dataflow.seed.append(
... Input(
... value=[literal_eval.op.outputs["str_after_eval"].name,],
... definition=GetSingle.op.inputs["spec"],
... )
... )
>>> inputs = [
... Input(
... value="[1,2,3]",
... definition=literal_eval.op.inputs["str_to_eval"],
... parents=None,
... )
... ]
>>>
>>> async def main():
... async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
... print(results)
>>>
>>> asyncio.run(main())
{'EvaluatedStr': [1, 2, 3]}
Stage: processing
Inputs
str_to_eval: InputStr(type: str)
Outputs
str_after_eval: EvaluatedStr(type: generic)
make_tar_archive¶
Official
Creates tar file of a directory.
Parameters¶
- input_directory_path: str
Path to the directory to be archived as a tarfile.
- output_file_path: str
Path where the output archive should be saved (should include the file name).
Returns¶
- dict
Path to the created tar file.
Stage: processing
Inputs
input_directory_path: directory(type: str)
output_file_path: tar_file(type: str)
Outputs
output_path: output_tarfile_path(type: str)
make_zip_archive¶
Official
Creates zip file of a directory.
Parameters¶
- input_directory_path: str
Path to the directory to be archived.
- output_file_path: str
Path where the output archive should be saved (should include the file name).
Returns¶
- dict
Path to the output zip file
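Below is a minimal, unverified sketch of using make_zip_archive in a dataflow, modeled on the examples above. The directory and file names and the printed result are placeholder assumptions.
>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(make_zip_archive, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[make_zip_archive.op.outputs["output_path"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value="myfiles",  # placeholder: an existing directory
...         definition=make_zip_archive.op.inputs["input_directory_path"],
...     ),
...     Input(
...         value="myfiles.zip",  # placeholder: where to write the archive
...         definition=make_zip_archive.op.inputs["output_file_path"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(results)
>>>
>>> asyncio.run(main())
{'output_zipfile_path': 'myfiles.zip'}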
Stage: processing
Inputs
input_directory_path: directory(type: str)
output_file_path: zip_file(type: str)
Outputs
output_path: output_zipfile_path(type: str)
multiply¶
Official
Multiply record values
Parameters¶
- multiplicand: generic
An arithmetic type value.
- multiplier: generic
An arithmetic type value.
Returns¶
- dict
A dict containing the product.
Examples¶
The following example shows how to use multiply.
>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(multiply, GetSingle)
>>> dataflow.seed.append(
... Input(
... value=[multiply.op.outputs["product"].name,],
... definition=GetSingle.op.inputs["spec"],
... )
... )
>>> inputs = [
... Input(
... value=12,
... definition=multiply.op.inputs["multiplicand"],
... ),
... Input(
... value=3,
... definition=multiply.op.inputs["multiplier"],
... ),
... ]
>>>
>>> async def main():
... async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
... print(results)
>>>
>>> asyncio.run(main())
{'product': 36}
Stage: processing
Inputs
multiplicand: multiplicand_def(type: generic)
multiplier: multiplier_def(type: generic)
Outputs
product: product(type: generic)
print_output¶
Official
Print the output on stdout using Python’s print()
Parameters¶
- data: Any
A Python literal to be printed.
Examples¶
The following example shows how to use print_output.
>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(print_output)
>>> inputs = [
... Input(
... value="print_output example", definition=print_output.op.inputs["data"]
... )
... ]
>>>
>>> async def main():
... async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
... pass
>>>
>>> asyncio.run(main())
print_output example
Stage: processing
Inputs
data: DataToPrint(type: generic)
xz_compress¶
Official
No description
Stage: processing
Inputs
input_file_path: decompressed_xz_file_path(type: str)
output_file_path: compressed_xz_file_path(type: str)
Outputs
output_path: compressed_output_xz_file_path(type: str)
xz_decompress¶
Official
No description
Stage: processing
Inputs
input_file_path: compressed_xz_file_path(type: str)
output_file_path: decompressed_xz_file_path(type: str)
Outputs
output_path: decompressed_output_xz_file_path(type: str)
dffml_operations_image¶
pip install dffml-operations-image
Haralick¶
Official
Computes Haralick texture features
Stage: processing
Inputs
f: Haralick.inputs.f(type: array)
ignore_zeros: Haralick.inputs.ignore_zeros(type: bool)
preserve_haralick_bug: Haralick.inputs.preserve_haralick_bug(type: bool)
compute_14th_feature: Haralick.inputs.compute_14th_feature(type: bool)
return_mean: Haralick.inputs.return_mean(type: bool)
return_mean_ptp: Haralick.inputs.return_mean_ptp(type: bool)
use_x_minus_y_variance: Haralick.inputs.use_x_minus_y_variance(type: bool)
distance: Haralick.inputs.distance(type: int)
Outputs
result: Haralick.outputs.result(type: array)
HuMoments¶
Official
Calculates seven Hu invariants
Stage: processing
Inputs
m: HuMoments.inputs.m(type: array)
Outputs
result: HuMoments.outputs.result(type: array)
calcHist¶
Official
Calculates a histogram
Stage: processing
Inputs
images: calcHist.inputs.images(type: array)
channels: calcHist.inputs.channels(type: array)
mask: calcHist.inputs.mask(type: array)
histSize: calcHist.inputs.histSize(type: array)
ranges: calcHist.inputs.ranges(type: array)
Outputs
result: calcHist.outputs.result(type: array)
convert_color¶
Official
Converts images from one color space to another
Stage: processing
Inputs
src: convert_color.inputs.src(type: array)
code: convert_color.inputs.code(type: str)
Outputs
result: convert_color.outputs.result(type: array)
flatten¶
Official
No description
Stage: processing
Inputs
array: flatten.inputs.array(type: array)
Outputs
result: flatten.outputs.result(type: array)
normalize¶
Official
Normalizes arrays
Stage: processing
Inputs
src: normalize.inputs.src(type: array)
alpha: normalize.inputs.alpha(type: int)
beta: normalize.inputs.beta(type: int)
norm_type: normalize.inputs.norm_type(type: int)
dtype: normalize.inputs.dtype(type: int)
mask: normalize.inputs.mask(type: array)
Outputs
result: normalize.outputs.result(type: array)
resize¶
Official
Resizes an image array to the specified new dimensions.
If the new dimensions are 2D, the image is converted to grayscale.
- To enlarge the image (src dimensions < dsize), it resizes with INTER_CUBIC interpolation.
- To shrink the image (src dimensions > dsize), it resizes with INTER_AREA interpolation.
Stage: processing
Inputs
src: resize.inputs.src(type: array)
dsize: resize.inputs.dsize(type: array)
fx: resize.inputs.fx(type: float)
fy: resize.inputs.fy(type: float)
interpolation: resize.inputs.interpolation(type: int)
Outputs
result: resize.outputs.result(type: array)
dffml_feature_git¶
pip install dffml-feature-git
check_if_valid_git_repository_URL¶
Official
No description
Stage: processing
Inputs
URL: URL(type: string)
Outputs
valid: valid_git_repository_URL(type: boolean)
cleanup_git_repo¶
Official
No description
Stage: cleanup
Inputs
repo: git_repository(type: Dict[str, str])
directory: str
URL: str(default: None)
clone_git_repo¶
Official
No description
Stage: processing
Inputs
URL: URL(type: string)
Outputs
repo: git_repository(type: Dict[str, str])
directory: str
URL: str(default: None)
Conditions
valid_git_repository_URL: boolean
git_commits¶
Official
No description
Stage: processing
Inputs
repo: git_repository(type: Dict[str, str])
directory: str
URL: str(default: None)
branch: git_branch(type: str)
start_end: date_pair(type: List[date])
Outputs
commits: commit_count(type: int)
git_repo_checkout¶
Official
No description
Stage: processing
Inputs
repo: git_repository(type: Dict[str, str])
directory: str
URL: str(default: None)
commit: git_commit(type: string)
Outputs
repo: git_repository_checked_out(type: Dict[str, str])
directory: str
URL: str(default: None)
commit: str(default: None)
git_repo_commit_from_date¶
Official
No description
Stage: processing
Inputs
repo: git_repository(type: Dict[str, str])
directory: str
URL: str(default: None)
branch: git_branch(type: str)
date: date(type: string)
Outputs
commit: git_commit(type: string)
git_repo_default_branch¶
Official
No description
Stage: processing
Inputs
repo: git_repository(type: Dict[str, str])
directory: str
URL: str(default: None)
Outputs
branch: git_branch(type: str)
Conditions
no_git_branch_given: boolean
git_repo_release¶
Official
Checks whether there was a release within this date range.
Stage: processing
Inputs
repo: git_repository(type: Dict[str, str])
directory: str
URL: str(default: None)
branch: git_branch(type: str)
start_end: date_pair(type: List[date])
Outputs
present: release_within_period(type: bool)
lines_of_code_by_language¶
Official
This operation relies on tokei. Here’s how to install version 10.1.1; check its releases page to make sure you’re installing the latest version.
On Linux
$ curl -sSL 'https://github.com/XAMPPRocky/tokei/releases/download/v10.1.1/tokei-v10.1.1-x86_64-unknown-linux-gnu.tar.gz' \
| tar -xvz && \
echo '22699e16e71f07ff805805d26ee86ecb9b1052d7879350f7eb9ed87beb0e6b84fbb512963d01b75cec8e80532e4ea29a tokei' | sha384sum -c - && \
sudo mv tokei /usr/local/bin/
On OSX
$ curl -sSL 'https://github.com/XAMPPRocky/tokei/releases/download/v10.1.1/tokei-v10.1.1-x86_64-apple-darwin.tar.gz' \
| tar -xvz && \
echo '8c8a1d8d8dd4d8bef93dabf5d2f6e27023777f8553393e269765d7ece85e68837cba4374a2615d83f071dfae22ba40e2 tokei' | sha384sum -c - && \
sudo mv tokei /usr/local/bin/
Stage: processing
Inputs
repo: git_repository_checked_out(type: Dict[str, str])
directory: str
URL: str(default: None)
commit: str(default: None)
Outputs
lines_by_language: lines_by_language_count(type: Dict[str, Dict[str, int]])
lines_of_code_to_comments¶
Official
No description
Stage: processing
Inputs
langs: lines_by_language_count(type: Dict[str, Dict[str, int]])
Outputs
code_to_comment_ratio: language_to_comment_ratio(type: int)
make_quarters¶
Official
No description
Stage: processing
Inputs
number: quarters(type: int)
Outputs
quarters: quarter(type: int)
quarters_back_to_date¶
Official
No description
Stage: processing
Inputs
date: quarter_start_date(type: int)
number: quarter(type: int)
Outputs
date: date(type: string)
start_end: date_pair(type: List[date])
work¶
Official
No description
Stage: processing
Inputs
author_lines: author_line_count(type: Dict[str, int])
Outputs
work: work_spread(type: int)
dffml_feature_auth¶
pip install dffml-feature-auth
scrypt¶
Official
No description
Stage: processing
Inputs
password: UnhashedPassword(type: string)
Outputs
password: ScryptPassword(type: string)
dffml_operations_deploy¶
pip install dffml-operations-deploy
check_if_default_branch¶
Official
No description
Stage: processing
Inputs
payload: git_payload(type: Dict[str,Any])
Outputs
is_default_branch: is_default_branch(type: bool)
check_secret_match¶
Official
No description
Stage: processing
Inputs
headers: webhook_headers(type: Dict[str,Any])
body: payload(type: bytes)
Outputs
git_payload: git_payload(type: Dict[str,Any])
Args
secret: Entrypoint
docker_build_image¶
Official
No description
Stage: processing
Inputs
docker_commands: docker_commands(type: Dict[str,Any])
Outputs
build_status: is_image_built(type: bool)
Conditions
is_default_branch: bool
got_running_containers: bool
get_image_tag¶
Official
No description
Stage: processing
Inputs
payload: git_payload(type: Dict[str,Any])
Outputs
image_tag: docker_image_tag(type: str)
get_running_containers¶
Official
No description
Stage: processing
Inputs
tag: docker_image_tag(type: str)
Outputs
running_containers: docker_running_containers(type: List[str])
get_status_running_containers¶
Official
No description
Stage: processing
Inputs
containers: docker_running_containers(type: List[str])
Outputs
status: got_running_containers(type: bool)
get_url_from_payload¶
Official
No description
Stage: processing
Inputs
payload: git_payload(type: Dict[str,Any])
Outputs
url: URL(type: string)
parse_docker_commands¶
Official
No description
Stage: processing
Inputs
repo: git_repository(type: Dict[str, str])
directory: str
URL: str(default: None)
image_tag: docker_image_tag(type: str)
Outputs
docker_commands: docker_commands(type: Dict[str,Any])
restart_running_containers¶
Official
No description
Stage: processing
Inputs
docker_commands: docker_commands(type: Dict[str,Any])
containers: docker_running_containers(type: List[str])
Outputs
containers: docker_restarted_containers(type: str)
Conditions
is_image_built: bool
dffml_operations_data¶
pip install dffml-operations-data
one_hot_encoder¶
Official
One hot encoding for categorical data columns
Parameters¶
- data: List[List[int]]
Data to be encoded.
- categories: List[List[str]]
Categorical values which need to be encoded.
Returns¶
result: Encoded data for categorical values
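The following is an unverified sketch of running one_hot_encoder through a dataflow, following the pattern used for the dffml operations above. The import path, the input values, and the printed result are assumptions; scikit-learn must be installed.
>>> import asyncio
>>> from dffml import *
>>> from dffml_operations_data.operations import one_hot_encoder  # assumed import path
>>>
>>> dataflow = DataFlow.auto(one_hot_encoder, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[one_hot_encoder.op.outputs["result"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value=[[0], [1], [2]],
...         definition=one_hot_encoder.op.inputs["data"],
...     ),
...     Input(
...         value=[[0, 1, 2]],
...         definition=one_hot_encoder.op.inputs["categories"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(results)
>>>
>>> asyncio.run(main())
{'output_data': [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]}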
Stage: processing
Inputs
data: input_data(type: List[List[int]])
categories: categories(type: List[List[Any]])
Outputs
result: output_data(type: List[List[int]])
ordinal_encoder¶
Official
Ordinal encoding for categorical data columns
Parameters¶
- data: List[List[int]]
Data to be encoded.
- categories: List[List[str]]
Categorical values which need to be encoded.
Returns¶
result: Encoded data for categorical values
Stage: processing
Inputs
data: input_data(type: List[List[int]])
Outputs
result: output_data(type: List[List[int]])
principal_component_analysis¶
Official
Decomposes the data into (n_samples, n_components) using PCA method
Parameters¶
- data: List[List[int]]
Data to be decomposed.
- n_components: int
Number of columns the data should have after decomposition.
Returns¶
result: Data having dimensions (n_samples, n_components)
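A minimal, unverified sketch of principal_component_analysis in a dataflow is shown below. The import path and input values are assumptions; the projected values depend on scikit-learn, so only the shape of the result is checked here.
>>> import asyncio
>>> from dffml import *
>>> from dffml_operations_data.operations import principal_component_analysis  # assumed import path
>>>
>>> dataflow = DataFlow.auto(principal_component_analysis, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[principal_component_analysis.op.outputs["result"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value=[[1, 2], [3, 4], [5, 6], [7, 8]],
...         definition=principal_component_analysis.op.inputs["data"],
...     ),
...     Input(
...         value=1,
...         definition=principal_component_analysis.op.inputs["n_components"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         decomposed = results["output_data"]
...         print(len(decomposed), len(decomposed[0]))
>>>
>>> asyncio.run(main())
4 1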
Stage: processing
Inputs
data: input_data(type: List[List[int]])
n_components: n_components(type: int)
Outputs
result: output_data(type: List[List[int]])
remove_whitespaces¶
Official
Removes whitespace from the dataset
Parameters¶
- data: List[List[int]]
The dataset.
Returns¶
result: dataset having whitespaces removed
Stage: processing
Inputs
data: input_data(type: List[List[int]])
Outputs
result: output_data(type: List[List[int]])
simple_imputer¶
Official
Imputation method for missing values
Parameters¶
- data: List[List[int]]
Data in which missing values are present.
- missing_values: Any (str, int, float, or None), default = np.nan
The value that is present in place of a missing value.
- strategy: str, one of “mean”, “median”, “constant”, “most_frequent”, default = “mean”
The imputation strategy.
Returns¶
result: Dataset having missing values imputed with the strategy
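Below is an unverified sketch of simple_imputer in a dataflow. The import path, the use of float("nan") to mark missing entries, and the printed result are assumptions.
>>> import asyncio
>>> from dffml import *
>>> from dffml_operations_data.operations import simple_imputer  # assumed import path
>>>
>>> dataflow = DataFlow.auto(simple_imputer, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[simple_imputer.op.outputs["result"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value=[[1, 2], [float("nan"), 3], [7, 6]],
...         definition=simple_imputer.op.inputs["data"],
...     ),
...     Input(
...         value=float("nan"),
...         definition=simple_imputer.op.inputs["missing_values"],
...     ),
...     Input(
...         value="mean",
...         definition=simple_imputer.op.inputs["strategy"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(results)
>>>
>>> asyncio.run(main())
{'output_data': [[1.0, 2.0], [4.0, 3.0], [7.0, 6.0]]}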
Stage: processing
Inputs
data: input_data(type: List[List[int]])
missing_values: missing_values(type: Any)
strategy: strategy(type: str)
Outputs
result: output_data(type: List[List[int]])
singular_value_decomposition¶
Official
Decomposes the data into (n_samples, n_components) using SVD method.
Parameters¶
- data: List[List[int]]
Data to be decomposed.
- n_components: int
Number of columns the data should have after decomposition.
Returns¶
result: Data having dimensions (n_samples, n_components)
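A minimal, unverified sketch of singular_value_decomposition in a dataflow follows. The import path and input values are assumptions; as with the PCA sketch above, only the shape of the decomposed data is checked.
>>> import asyncio
>>> from dffml import *
>>> from dffml_operations_data.operations import singular_value_decomposition  # assumed import path
>>>
>>> dataflow = DataFlow.auto(singular_value_decomposition, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[singular_value_decomposition.op.outputs["result"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value=[[1, 2], [3, 4], [5, 6], [7, 8]],
...         definition=singular_value_decomposition.op.inputs["data"],
...     ),
...     Input(
...         value=1,
...         definition=singular_value_decomposition.op.inputs["n_components"],
...     ),
...     Input(
...         value=5,
...         definition=singular_value_decomposition.op.inputs["n_iter"],
...     ),
...     Input(
...         value=42,
...         definition=singular_value_decomposition.op.inputs["random_state"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         decomposed = results["output_data"]
...         print(len(decomposed), len(decomposed[0]))
>>>
>>> asyncio.run(main())
4 1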
Stage: processing
Inputs
data: input_data(type: List[List[int]])
n_components: n_components(type: int)
n_iter: n_iter(type: int)
random_state: random_state(type: int)
Outputs
result: output_data(type: List[List[int]])
standard_scaler¶
Official
Standardize features by removing the mean and scaling to unit variance.
Parameters¶
- data: List[List[int]]
data that needs to be standardized
Returns¶
result: Standardized data
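The following is an unverified sketch of standard_scaler in a dataflow. The import path and the printed values (each column scaled to zero mean and unit variance) are assumptions.
>>> import asyncio
>>> from dffml import *
>>> from dffml_operations_data.operations import standard_scaler  # assumed import path
>>>
>>> dataflow = DataFlow.auto(standard_scaler, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[standard_scaler.op.outputs["result"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value=[[0, 0], [1, 1], [2, 2]],
...         definition=standard_scaler.op.inputs["data"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(results)
>>>
>>> asyncio.run(main())
{'output_data': [[-1.224744871391589, -1.224744871391589], [0.0, 0.0], [1.224744871391589, 1.224744871391589]]}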
Stage: processing
Inputs
data: input_data(type: List[List[int]])
Outputs
result: output_data(type: List[List[int]])
dffml_operations_binsec¶
pip install dffml-operations-binsec
cleanup_rpm¶
Official
No description
Stage: cleanup
Inputs
rpm: RPMObject(type: python_obj)
files_in_rpm¶
Official
No description
Stage: processing
Inputs
rpm: RPMObject(type: python_obj)
Outputs
files: rpm_filename(type: str)
is_binary_pie¶
Official
No description
Stage: processing
Inputs
rpm: RPMObject(type: python_obj)
filename: rpm_filename(type: str)
Outputs
is_pie: binary_is_PIE(type: bool)
url_to_urlbytes¶
Official
No description
Stage: processing
Inputs
URL: URL(type: string)
Outputs
download: URLBytes(type: python_obj)
urlbytes_to_rpmfile¶
Official
No description
Stage: processing
Inputs
download: URLBytes(type: python_obj)
Outputs
rpm: RPMObject(type: python_obj)
urlbytes_to_tarfile¶
Official
No description
Stage: processing
Inputs
download: URLBytes(type: python_obj)
Outputs
rpm: RPMObject(type: python_obj)
shouldi¶
pip install shouldi
cleanup_pypi_package¶
Official
Remove the directory containing the source code release.
Stage: cleanup
Inputs
directory: run_bandit.inputs.pkg(type: str)
pypi_package_contents¶
Official
Download a source code release and extract it to a temporary directory.
Stage: processing
Inputs
url: pypi_package_contents.inputs.url(type: str)
Outputs
directory: run_bandit.inputs.pkg(type: str)
pypi_package_json¶
Official
Download the information on the package in JSON format.
Stage: processing
Inputs
package: safety_check.inputs.package(type: str)
Outputs
version: safety_check.inputs.version(type: str)
url: pypi_package_contents.inputs.url(type: str)
run_bandit¶
Official
CLI usage: dffml service dev run -log debug shouldi.bandit:run_bandit -pkg .
Stage: processing
Inputs
pkg: run_bandit.inputs.pkg(type: str)
Outputs
result: run_bandit.outputs.result(type: map)
safety_check¶
Official
No description
Stage: processing
Inputs
package: safety_check.inputs.package(type: str)
version: safety_check.inputs.version(type: str)
Outputs
result: safety_check.outputs.result(type: int)
dffml_operations_nlp¶
pip install dffml-operations-nlp
collect_output¶
Official
No description
Stage: processing
Inputs
sentence: sentence(type: string)
length: source_length(type: string)
Outputs
all: all_sentences(type: List[string])
count_vectorizer¶
Official
Converts a collection of text documents to a matrix of token counts using sklearn CountVectorizer’s fit_transform method. For details on parameters check https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html. Parameters specific to this operation are described below.
Parameters¶
- text: list
A list of strings.
- get_feature_names: bool
If True return feature names using get_feature_names method of CountVectorizer.
Returns¶
- result: list
A list containing token counts and feature names if get_feature_names is True.
Stage: processing
Inputs
text: count_vectorizer.inputs.text(type: array)
encoding: count_vectorizer.inputs.encoding(type: str)
decode_error: count_vectorizer.inputs.decode_error(type: str)
strip_accents: count_vectorizer.inputs.strip_accents(type: str)
lowercase: count_vectorizer.inputs.lowercase(type: bool)
stop_words: count_vectorizer.inputs.stop_words(type: str)
token_pattern: count_vectorizer.inputs.token_pattern(type: str)
ngram_range: count_vectorizer.inputs.ngram_range(type: array)
analyzer: count_vectorizer.inputs.analyzer(type: str)
max_df: count_vectorizer.inputs.max_df(type: float)
min_df: count_vectorizer.inputs.min_df(type: float)
max_features: count_vectorizer.inputs.max_features(type: int)
vocabulary: count_vectorizer.inputs.vocabulary(type: map)
binary: count_vectorizer.inputs.binary(type: bool)
get_feature_names: count_vectorizer.inputs.get_feature_names(type: bool)
Outputs
result: count_vectorizer.outputs.result(type: array)
extract_array_from_matrix¶
Official
Returns row from input_matrix based on index of single_text_example in collected_text.
Parameters¶
- single_text_example: str
String to be used for indexing into collected_text.
- collected_text: list
List of strings.
- input_matrix: list
A 2-D matrix where each row represents vector corresponding to single_text_example.
Returns¶
result: A 1-d array.
Stage: processing
Inputs
single_text_example: extract_array_from_matrix.inputs.single_text_example(type: str)
collected_text: extract_array_from_matrix.inputs.collected_text(type: array)
input_matrix: extract_array_from_matrix.inputs.input_matrix(type: array)
Outputs
result: extract_array_from_matrix.outputs.result(type: array)
get_embedding¶
Official
Maps words of text data to their corresponding word vectors.
Parameters¶
- text: str
String to be converted to word vectors.
- max_len: int
Maximum length of sentence. If the length of text > max_len, text is truncated to have length = max_len. If the length of text < max_len, text is padded with pad_token such that len(text) = max_len.
- pad_token: str
Token to be used for padding text if len(text) < max_len
- spacy_model: str
Spacy model to be used for assigning vectors to tokens.
Returns¶
result: A 2-d array of shape (max_len, embedding_size of vectors).
Stage: processing
Inputs
text: text_def(type: str)
spacy_model: spacy_model_name_def(type: str)
max_len: max_len_def(type: int)
pad_token: pad_token_def(type: str)
Outputs
embedding: embedding(type: generic)
get_noun_chunks¶
Official
Extracts the noun chunks from text.
Parameters¶
- text: str
String to extract noun chunks from.
- spacy_model: str
A spacy model with the capability of parsing.
Returns¶
- result: list
A list containing noun chunks.
Stage: processing
Inputs
text: get_noun_chunks.inputs.text(type: str)
spacy_model: get_noun_chunks.inputs.spacy_model(type: str)
Outputs
result: get_noun_chunks.outputs.result(type: array)
get_sentences¶
Official
Extracts the sentences from text.
Parameters¶
- text: str
String to extract sentences from.
- spacy_model: str
A spacy model with the capability of parsing. Sentence boundaries are calculated from the syntactic dependency parse.
Returns¶
- result: list
A list containing sentences.
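Below is a minimal, unverified sketch of get_sentences in a dataflow. The import path is an assumption, the spacy model (en_core_web_sm) must be downloaded separately, and the printed result is illustrative.
>>> import asyncio
>>> from dffml import *
>>> from dffml_operations_nlp.operations import get_sentences  # assumed import path
>>>
>>> dataflow = DataFlow.auto(get_sentences, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[get_sentences.op.outputs["result"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value="The quick brown fox jumps over the lazy dog. It was very quick.",
...         definition=get_sentences.op.inputs["text"],
...     ),
...     Input(
...         value="en_core_web_sm",  # assumed: model must already be installed
...         definition=get_sentences.op.inputs["spacy_model"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(results)
>>>
>>> asyncio.run(main())
{'get_sentences.outputs.result': ['The quick brown fox jumps over the lazy dog.', 'It was very quick.']}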
Stage: processing
Inputs
text: get_sentences.inputs.text(type: str)
spacy_model: get_sentences.inputs.spacy_model(type: str)
Outputs
result: get_sentences.outputs.result(type: array)
get_similarity¶
Official
Calculates similarity between two text strings as a score between 0 and 1.
Parameters¶
- text_1: str
First string to compare.
- text_2: str
Second string to compare.
- spacy_model: str
Spacy model to be used for extracting word vectors which are used for calculating similarity.
Returns¶
- result: float
A similarity score between 0 and 1.
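A minimal, unverified sketch of get_similarity in a dataflow is shown below. The import path is an assumption, a spacy model with word vectors (en_core_web_md here) must be downloaded, and the printed score is illustrative.
>>> import asyncio
>>> from dffml import *
>>> from dffml_operations_nlp.operations import get_similarity  # assumed import path
>>>
>>> dataflow = DataFlow.auto(get_similarity, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[get_similarity.op.outputs["result"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value="I like cats",
...         definition=get_similarity.op.inputs["text_1"],
...     ),
...     Input(
...         value="I like dogs",
...         definition=get_similarity.op.inputs["text_2"],
...     ),
...     Input(
...         value="en_core_web_md",  # assumed: model with vectors must be installed
...         definition=get_similarity.op.inputs["spacy_model"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(round(results["get_similarity.outputs.result"], 2))
>>>
>>> asyncio.run(main())
0.94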
Stage: processing
Inputs
text_1: get_similarity.inputs.text_1(type: str)
text_2: get_similarity.inputs.text_2(type: str)
spacy_model: get_similarity.inputs.spacy_model(type: str)
Outputs
result: get_similarity.outputs.result(type: float)
lemmatizer¶
Official
Reduce words in the text to their dictionary form (lemma)
Parameters¶
- text: str
String to lemmatize.
- spacy_model: str
Spacy model to be used for lemmatization.
Returns¶
- result: list
A list containing base form of the words.
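Below is an unverified sketch of lemmatizer in a dataflow. The import path is an assumption, en_core_web_sm must be downloaded, and the printed lemmas are illustrative.
>>> import asyncio
>>> from dffml import *
>>> from dffml_operations_nlp.operations import lemmatizer  # assumed import path
>>>
>>> dataflow = DataFlow.auto(lemmatizer, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[lemmatizer.op.outputs["result"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value="cats are running",
...         definition=lemmatizer.op.inputs["text"],
...     ),
...     Input(
...         value="en_core_web_sm",  # assumed: model must already be installed
...         definition=lemmatizer.op.inputs["spacy_model"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(results)
>>>
>>> asyncio.run(main())
{'lemmatizer.outputs.result': ['cat', 'be', 'run']}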
Stage: processing
Inputs
text: lemmatizer.inputs.text(type: str)
spacy_model: lemmatizer.inputs.spacy_model(type: str)
Outputs
result: lemmatizer.outputs.result(type: array)
pos_tagger¶
Official
Assigns part-of-speech tags to text.
Parameters¶
- text: str
Text to be tagged.
- spacy_model: str
A spacy model with tagger and parser.
Returns¶
- result: list
A list containing tuples of word and their respective pos tag.
Stage: processing
Inputs
text: pos_tagger.inputs.text(type: str)
spacy_model: pos_tagger.inputs.spacy_model(type: str)
tag_type: pos_tagger.inputs.tag_type(type: str)
Outputs
result: pos_tagger.outputs.result(type: array)
remove_stopwords¶
Official
Removes stopwords from text data.
Parameters¶
- text: str
String to be cleaned.
- custom_stop_words: List[str], default = None
List of words to be considered as stop words.
Returns¶
result: A string without stop words.
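A minimal, unverified sketch of remove_stopwords in a dataflow follows. The import path is an assumption; a custom stop word list is passed explicitly here, and the printed result is illustrative.
>>> import asyncio
>>> from dffml import *
>>> from dffml_operations_nlp.operations import remove_stopwords  # assumed import path
>>>
>>> dataflow = DataFlow.auto(remove_stopwords, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[remove_stopwords.op.outputs["result"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value="dffml is a framework for dataflows",
...         definition=remove_stopwords.op.inputs["text"],
...     ),
...     Input(
...         value=["is", "a", "for"],  # assumed: words treated as stop words
...         definition=remove_stopwords.op.inputs["custom_stop_words"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(results)
>>>
>>> asyncio.run(main())
{'remove_stopwords.outputs.result': 'dffml framework dataflows'}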
Stage: processing
Inputs
text: remove_stopwords.inputs.text(type: str)
custom_stop_words: remove_stopwords.inputs.custom_stop_words(type: array)
Outputs
result: remove_stopwords.outputs.result(type: str)
tfidf_vectorizer¶
Official
Convert a collection of raw documents to a matrix of TF-IDF features using sklearn TfidfVectorizer’s fit_transform method. For details on parameters check https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html. Parameters specific to this operation are described below.
Parameters¶
- text: list
A list of strings.
- get_feature_names: bool
If True return feature names using get_feature_names method of TfidfVectorizer.
Returns¶
- result: list
A list containing token counts and feature names if get_feature_names is True.
Stage: processing
Inputs
text: tfidf_vectorizer.inputs.text(type: array)
encoding: tfidf_vectorizer.inputs.encoding(type: str)
decode_error: tfidf_vectorizer.inputs.decode_error(type: str)
strip_accents: tfidf_vectorizer.inputs.strip_accents(type: str)
lowercase: tfidf_vectorizer.inputs.lowercase(type: bool)
analyzer: tfidf_vectorizer.inputs.analyzer(type: str)
stop_words: tfidf_vectorizer.inputs.stop_words(type: str)
token_pattern: tfidf_vectorizer.inputs.token_pattern(type: str)
ngram_range: tfidf_vectorizer.inputs.ngram_range(type: array)
max_df: tfidf_vectorizer.inputs.max_df(type: str)
min_df: tfidf_vectorizer.inputs.min_df(type: str)
max_features: tfidf_vectorizer.inputs.max_features(type: str)
vocabulary: tfidf_vectorizer.inputs.vocabulary(type: str)
binary: tfidf_vectorizer.inputs.binary(type: bool)
norm: tfidf_vectorizer.inputs.norm(type: str)
use_idf: tfidf_vectorizer.inputs.use_idf(type: bool)
smooth_idf: tfidf_vectorizer.inputs.smooth_idf(type: bool)
sublinear_tf: tfidf_vectorizer.inputs.sublinear_tf(type: bool)
get_feature_names: tfidf_vectorizer.inputs.get_feature_names(type: bool)
Outputs
result: tfidf_vectorizer.outputs.result(type: array)