Operations¶
Operation Implementations are subclasses of dffml.df.base.OperationImplementation. They are functions or classes that can do anything: make HTTP requests, run inference, etc.
They don’t necessarily have to be written in Python. Although DFFML isn’t quite to the point where it can use operations written in other languages yet, it’s on the roadmap.
dffml¶
pip install dffml
AcceptUserInput¶
Official
Accept input from stdin using Python’s input()
Returns¶
- dict
A dictionary containing user input.
Examples¶
The following example shows how to use AcceptUserInput. (It assumes that the input from stdin is “Data flow is awesome”!)
>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(AcceptUserInput, GetSingle)
>>> dataflow.seed.append(
... Input(
... value=[AcceptUserInput.op.outputs["InputData"].name],
... definition=GetSingle.op.inputs["spec"],
... )
... )
>>>
>>> async def main():
... async for ctx, results in MemoryOrchestrator.run(dataflow, {"input": []}):
... print(results)
>>>
>>> asyncio.run(main())
Enter the value: {'UserInput': 'Data flow is awesome'}
Stage: processing
Outputs
InputData: UserInput(type: str)
associate¶
Official
No description
Stage: output
Inputs
spec: associate_spec(type: List[str])
Outputs
output: associate_output(type: Dict[str, Any])
associate_definition¶
Official
Examples¶
>>> import asyncio
>>> from dffml import *
>>>
>>> feed_def = Definition(name="feed", primitive="string")
>>> dead_def = Definition(name="dead", primitive="string")
>>> output = Definition(name="output", primitive="string")
>>>
>>> feed_input = Input(value="my favorite value", definition=feed_def)
>>> face_input = Input(
... value="face", definition=output, parents=[feed_input]
... )
>>>
>>> dead_input = Input(
... value="my second favorite value", definition=dead_def
... )
>>> beef_input = Input(
... value="beef", definition=output, parents=[dead_input]
... )
>>>
>>> async def main():
... for value in ["feed", "dead"]:
... async for ctx, results in MemoryOrchestrator.run(
... DataFlow.auto(AssociateDefinition),
... [
... feed_input,
... face_input,
... dead_input,
... beef_input,
... Input(
... value={value: "output"},
... definition=AssociateDefinition.op.inputs["spec"],
... ),
... ],
... ):
... print(results)
>>>
>>> asyncio.run(main())
{'feed': 'face'}
{'dead': 'beef'}
Stage: output
Inputs
spec: associate_spec(type: List[str])
Outputs
output: associate_output(type: Dict[str, Any])
bz2_compress¶
Official
No description
Stage: processing
Inputs
input_file_path: decompressed_bz2_file_path(type: str)
output_file_path: compressed_bz2_file_path(type: str)
Outputs
output_path: compressed_output_bz2_file_path(type: str)
bz2_decompress¶
Official
No description
Stage: processing
Inputs
input_file_path: compressed_bz2_file_path(type: str)
output_file_path: decompressed_bz2_file_path(type: str)
Outputs
output_path: decompressed_output_bz2_file_path(type: str)
convert_list_to_records¶
Official
No description
Stage: processing
Inputs
matrix: matrix(type: List[List[Any]])
features: features(type: List[str])
keys: keys(type: List[str])
predict_features: predict_features(type: List[str])
unprocessed_matrix: unprocessed_matrix(type: List[List[Any]])
Outputs
records: records(type: Dict[str, Any])
convert_records_to_list¶
Official
No description
Stage: processing
Inputs
features: features(type: List[str])
predict_features: predict_features(type: List[str])
Outputs
matrix: matrix(type: List[List[Any]])
keys: keys(type: List[str])
unprocessed_matrix: unprocessed_matrix(type: List[List[Any]])
Args
source: Entrypoint
db_query_create_table¶
Official
Generates a create table query in the database.
Parameters¶
- table_name: str
The name of the table to be created.
- cols: list[str]
Columns of the table.
Examples¶
>>> import asyncio
>>> from dffml import *
>>>
>>> sdb = SqliteDatabase(SqliteDatabaseConfig(filename="examples.db"))
>>>
>>> dataflow = DataFlow(
... operations={"db_query_create": db_query_create_table.op,},
... configs={"db_query_create": DatabaseQueryConfig(database=sdb),},
... seed=[],
... )
>>>
>>> inputs = [
... Input(
... value="myTable1",
... definition=db_query_create_table.op.inputs["table_name"],
... ),
... Input(
... value={
... "key": "real",
... "firstName": "text",
... "lastName": "text",
... "age": "real",
... },
... definition=db_query_create_table.op.inputs["cols"],
... ),
... ]
>>>
>>> async def main():
... async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
... pass
>>>
>>> asyncio.run(main())
Stage: processing
Inputs
table_name: query_table(type: str)
cols: query_cols(type: Dict[str, str])
Args
database: Entrypoint
db_query_insert¶
Official
Generates an insert query in the database.
Parameters¶
- table_name: str
The name of the table to insert data into.
- data: dict
Data to be inserted into the table.
Examples¶
>>> import asyncio
>>> from dffml import *
>>>
>>> sdb = SqliteDatabase(SqliteDatabaseConfig(filename="examples.db"))
>>>
>>> dataflow = DataFlow(
... operations={
... "db_query_insert": db_query_insert.op,
... "db_query_lookup": db_query_lookup.op,
... "get_single": GetSingle.imp.op,
... },
... configs={
... "db_query_lookup": DatabaseQueryConfig(database=sdb),
... "db_query_insert": DatabaseQueryConfig(database=sdb),
... },
... seed=[],
... )
>>>
>>> inputs = {
... "insert": [
... Input(
... value="myTable", definition=db_query_insert.op.inputs["table_name"],
... ),
... Input(
... value={"key": 10, "firstName": "John", "lastName": "Doe", "age": 16},
... definition=db_query_insert.op.inputs["data"],
... ),
... ],
... "lookup": [
... Input(
... value="myTable", definition=db_query_lookup.op.inputs["table_name"],
... ),
... Input(
... value=["firstName", "lastName", "age"],
... definition=db_query_lookup.op.inputs["cols"],
... ),
... Input(value=[], definition=db_query_lookup.op.inputs["conditions"],),
... Input(
... value=[db_query_lookup.op.outputs["lookups"].name],
... definition=GetSingle.op.inputs["spec"],
... ),
... ]
... }
>>>
>>> async def main():
... async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
... if result:
... print(result)
>>>
>>> asyncio.run(main())
{'query_lookups': [{'firstName': 'John', 'lastName': 'Doe', 'age': 16}]}
Stage: processing
Inputs
table_name: query_table(type: str)
data: query_data(type: Dict[str, Any])
Args
database: Entrypoint
db_query_insert_or_update¶
Official
Automatically uses whichever operation is better suited: an insert query or an update query.
Parameters¶
- table_name: str
The name of the table to insert or update data in.
- data: dict
Data to be inserted or updated in the table.
Examples¶
>>> import asyncio
>>> from dffml import *
>>>
>>> sdb = SqliteDatabase(SqliteDatabaseConfig(filename="examples.db"))
>>>
>>> person = {"key": 11, "firstName": "John", "lastName": "Wick", "age": 38}
>>>
>>> dataflow = DataFlow(
... operations={
... "db_query_insert_or_update": db_query_insert_or_update.op,
... "db_query_lookup": db_query_lookup.op,
... "get_single": GetSingle.imp.op,
... },
... configs={
... "db_query_insert_or_update": DatabaseQueryConfig(database=sdb),
... "db_query_lookup": DatabaseQueryConfig(database=sdb),
... },
... seed=[],
... )
>>>
>>> inputs = {
... "insert_or_update": [
... Input(
... value="myTable", definition=db_query_update.op.inputs["table_name"],
... ),
... Input(
... value=person,
... definition=db_query_update.op.inputs["data"],
... ),
... ],
... "lookup": [
... Input(
... value="myTable",
... definition=db_query_lookup.op.inputs["table_name"],
... ),
... Input(
... value=["firstName", "lastName", "age"],
... definition=db_query_lookup.op.inputs["cols"],
... ),
... Input(value=[], definition=db_query_lookup.op.inputs["conditions"],),
... Input(
... value=[db_query_lookup.op.outputs["lookups"].name],
... definition=GetSingle.op.inputs["spec"],
... ),
... ],
... }
>>>
>>> async def main():
... async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
... if result:
... print(result)
>>>
>>> asyncio.run(main())
{'query_lookups': [{'firstName': 'John', 'lastName': 'Wick', 'age': 38}]}
>>>
>>> person["age"] += 1
>>>
>>> asyncio.run(main())
{'query_lookups': [{'firstName': 'John', 'lastName': 'Wick', 'age': 39}]}
Stage: processing
Inputs
table_name: query_table(type: str)
data: query_data(type: Dict[str, Any])
Args
database: Entrypoint
db_query_lookup¶
Official
Generates a lookup query in the database.
Parameters¶
- table_name: str
The name of the table.
- cols: list[str]
Columns of the table.
- conditions: Conditions
Query conditions.
Examples¶
>>> import asyncio
>>> from dffml import *
>>>
>>> sdb = SqliteDatabase(SqliteDatabaseConfig(filename="examples.db"))
>>>
>>> dataflow = DataFlow(
... operations={
... "db_query_lookup": db_query_lookup.op,
... "get_single": GetSingle.imp.op,
... },
... configs={"db_query_lookup": DatabaseQueryConfig(database=sdb),},
... seed=[],
... )
>>>
>>> inputs = {
... "lookup": [
... Input(
... value="myTable",
... definition=db_query_lookup.op.inputs["table_name"],
... ),
... Input(
... value=["firstName", "lastName", "age"],
... definition=db_query_lookup.op.inputs["cols"],
... ),
... Input(value=[], definition=db_query_lookup.op.inputs["conditions"],),
... Input(
... value=[db_query_lookup.op.outputs["lookups"].name],
... definition=GetSingle.op.inputs["spec"],
... ),
... ],
... }
>>>
>>> async def main():
... async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
... if result:
... print(result)
>>>
>>> asyncio.run(main())
{'query_lookups': [{'firstName': 'John', 'lastName': 'Doe', 'age': 16}, {'firstName': 'John', 'lastName': 'Wick', 'age': 39}]}
Stage: processing
Inputs
table_name: query_table(type: str)
cols: query_cols(type: Dict[str, str])
conditions: query_conditions(type: Conditions)
Outputs
lookups: query_lookups(type: Dict[str, Any])
Args
database: Entrypoint
db_query_remove¶
Official
Generates a remove (delete) query in the database.
Parameters¶
- table_name: str
The name of the table to remove data from.
- conditions: Conditions
Query conditions.
Examples¶
>>> import asyncio
>>> from dffml import *
>>>
>>> sdb = SqliteDatabase(SqliteDatabaseConfig(filename="examples.db"))
>>>
>>> dataflow = DataFlow(
... operations={
... "db_query_lookup": db_query_lookup.op,
... "db_query_remove": db_query_remove.op,
... "get_single": GetSingle.imp.op,
... },
... configs={
... "db_query_remove": DatabaseQueryConfig(database=sdb),
... "db_query_lookup": DatabaseQueryConfig(database=sdb),
... },
... seed=[],
... )
>>>
>>> inputs = {
... "remove": [
... Input(
... value="myTable",
... definition=db_query_remove.op.inputs["table_name"],
... ),
... Input(value=[],
... definition=db_query_remove.op.inputs["conditions"],),
... ],
... "lookup": [
... Input(
... value="myTable",
... definition=db_query_lookup.op.inputs["table_name"],
... ),
... Input(
... value=["firstName", "lastName", "age"],
... definition=db_query_lookup.op.inputs["cols"],
... ),
... Input(value=[], definition=db_query_lookup.op.inputs["conditions"],),
... Input(
... value=[db_query_lookup.op.outputs["lookups"].name],
... definition=GetSingle.op.inputs["spec"],
... ),
... ],
... }
>>>
>>> async def main():
... async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
... if result:
... print(result)
>>>
>>> asyncio.run(main())
{'query_lookups': []}
Stage: processing
Inputs
table_name: query_table(type: str)
conditions: query_conditions(type: Conditions)
Args
database: Entrypoint
db_query_update¶
Official
Generates an update query in the database.
Parameters¶
- table_name: str
The name of the table to update data in.
- data: dict
Data to be updated in the table.
- conditions: list
List of query conditions.
Examples¶
>>> import asyncio
>>> from dffml import *
>>>
>>> sdb = SqliteDatabase(SqliteDatabaseConfig(filename="examples.db"))
>>>
>>> dataflow = DataFlow(
... operations={
... "db_query_update": db_query_update.op,
... "db_query_lookup": db_query_lookup.op,
... "get_single": GetSingle.imp.op,
... },
... configs={
... "db_query_update": DatabaseQueryConfig(database=sdb),
... "db_query_lookup": DatabaseQueryConfig(database=sdb),
... },
... seed=[],
... )
>>>
>>> inputs = {
... "update": [
... Input(
... value="myTable",
... definition=db_query_update.op.inputs["table_name"],
... ),
... Input(
... value={
... "key": 10,
... "firstName": "John",
... "lastName": "Doe",
... "age": 17,
... },
... definition=db_query_update.op.inputs["data"],
... ),
... Input(value=[], definition=db_query_update.op.inputs["conditions"],),
... ],
... "lookup": [
... Input(
... value="myTable",
... definition=db_query_lookup.op.inputs["table_name"],
... ),
... Input(
... value=["firstName", "lastName", "age"],
... definition=db_query_lookup.op.inputs["cols"],
... ),
... Input(value=[], definition=db_query_lookup.op.inputs["conditions"],),
... Input(
... value=[db_query_lookup.op.outputs["lookups"].name],
... definition=GetSingle.op.inputs["spec"],
... ),
... ],
... }
>>>
>>> async def main():
... async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
... if result:
... print(result)
>>>
>>> asyncio.run(main())
{'query_lookups': [{'firstName': 'John', 'lastName': 'Doe', 'age': 17}]}
Stage: processing
Inputs
table_name: query_table(type: str)
data: query_data(type: Dict[str, Any])
conditions: query_conditions(type: Conditions)
Args
database: Entrypoint
dffml.dataflow.run¶
Official
Starts a subflow using self.config.dataflow and adds inputs to it.
Parameters¶
- inputs: dict
The inputs to add to the subflow. These should be a key value mapping of the context string to the inputs which should be seeded for that context string.
Returns¶
- dict
Maps context strings in inputs to output after running through dataflow.
Examples¶
The following shows how to use run dataflow in its default behavior.
>>> import asyncio
>>> from dffml import *
>>>
>>> URL = Definition(name="URL", primitive="string")
>>>
>>> subflow = DataFlow.auto(GetSingle)
>>> subflow.definitions[URL.name] = URL
>>> subflow.seed.append(
... Input(
... value=[URL.name],
... definition=GetSingle.op.inputs["spec"]
... )
... )
>>>
>>> dataflow = DataFlow.auto(run_dataflow, GetSingle)
>>> dataflow.configs[run_dataflow.op.name] = RunDataFlowConfig(subflow)
>>> dataflow.seed.append(
... Input(
... value=[run_dataflow.op.outputs["results"].name],
... definition=GetSingle.op.inputs["spec"]
... )
... )
>>>
>>> async def main():
... async for ctx, results in MemoryOrchestrator.run(dataflow, {
... "run_subflow": [
... Input(
... value={
... "dffml": [
... {
... "value": "https://github.com/intel/dffml",
... "definition": URL.name
... }
... ]
... },
... definition=run_dataflow.op.inputs["inputs"]
... )
... ]
... }):
... print(results)
>>>
>>> asyncio.run(main())
{'flow_results': {'dffml': {'URL': 'https://github.com/intel/dffml'}}}
The following shows how to use run dataflow with custom inputs and outputs. This allows you to run a subflow as if it were an operation.
>>> import asyncio
>>> from dffml import *
>>>
>>> URL = Definition(name="URL", primitive="string")
>>>
>>> @op(
... inputs={"url": URL},
... outputs={"last": Definition("last_element_in_path", primitive="string")},
... )
... def last_path(url):
... return {"last": url.split("/")[-1]}
>>>
>>> subflow = DataFlow.auto(last_path, GetSingle)
>>> subflow.seed.append(
... Input(
... value=[last_path.op.outputs["last"].name],
... definition=GetSingle.op.inputs["spec"],
... )
... )
>>>
>>> dataflow = DataFlow.auto(run_dataflow, GetSingle)
>>> dataflow.operations[run_dataflow.op.name] = run_dataflow.op._replace(
... inputs={"URL": URL},
... outputs={last_path.op.outputs["last"].name: last_path.op.outputs["last"]},
... expand=[],
... )
>>> dataflow.configs[run_dataflow.op.name] = RunDataFlowConfig(subflow)
>>> dataflow.seed.append(
... Input(
... value=[last_path.op.outputs["last"].name],
... definition=GetSingle.op.inputs["spec"],
... )
... )
>>> dataflow.update(auto_flow=True)
>>>
>>> async def main():
... async for ctx, results in MemoryOrchestrator.run(
... dataflow,
... {
... "run_subflow": [
... Input(value="https://github.com/intel/dffml", definition=URL)
... ]
... },
... ):
... print(results)
>>>
>>> asyncio.run(main())
{'last_element_in_path': 'dffml'}
Stage: processing
Inputs
inputs: flow_inputs(type: Dict[str,Any])
Outputs
results: flow_results(type: Dict[str,Any])
Args
dataflow: DataFlow
dffml.mapping.create¶
Official
Creates a mapping of a given key and value.
Parameters¶
- key: str
The key for the mapping.
- value: Any
The value for the mapping.
Returns¶
- dict
A dictionary containing the mapping created.
Examples¶
>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(create_mapping, GetSingle)
>>> dataflow.seed.append(
... Input(
... value=[create_mapping.op.outputs["mapping"].name],
... definition=GetSingle.op.inputs["spec"],
... )
... )
>>> inputs = [
... Input(
... value="key1", definition=create_mapping.op.inputs["key"],
... ),
... Input(
... value=42, definition=create_mapping.op.inputs["value"],
... ),
... ]
>>>
>>> async def main():
... async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
... print(result)
>>>
>>> asyncio.run(main())
{'mapping': {'key1': 42}}
Stage: processing
Inputs
key: key(type: str)
value: value(type: generic)
Outputs
mapping: mapping(type: map)
dffml.mapping.extract¶
Official
Extracts value from a given mapping.
Parameters¶
- mapping: dict
The mapping to extract the value from.
- traverse: list[str]
A list of keys to traverse through the mapping dictionary and extract the values.
Returns¶
- dict
A dictionary containing the value of the keys.
Examples¶
>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(mapping_extract_value, GetSingle)
>>>
>>> dataflow.seed.append(
... Input(
... value=[mapping_extract_value.op.outputs["value"].name],
... definition=GetSingle.op.inputs["spec"],
... )
... )
>>> inputs = [
... Input(
... value={"key1": {"key2": 42}},
... definition=mapping_extract_value.op.inputs["mapping"],
... ),
... Input(
... value=["key1", "key2"],
... definition=mapping_extract_value.op.inputs["traverse"],
... ),
... ]
>>>
>>> async def main():
... async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
... print(result)
>>>
>>> asyncio.run(main())
{'value': 42}
Stage: processing
Inputs
mapping: mapping(type: map)
traverse: mapping_traverse(type: List[str])
Outputs
value: value(type: generic)
dffml.model.predict¶
Official
Predict using dffml models.
Parameters¶
- features: dict
A dictionary containing feature names and feature values.
Returns¶
- dict
A dictionary containing prediction.
Examples¶
The following example shows how to use model_predict.
>>> import asyncio
>>> from dffml import *
>>>
>>> slr_model = SLRModel(
... features=Features(Feature("Years", int, 1)),
... predict=Feature("Salary", int, 1),
... location="tempdir",
... )
>>> dataflow = DataFlow(
... operations={
... "prediction_using_model": model_predict,
... "get_single": GetSingle,
... },
... configs={"prediction_using_model": ModelPredictConfig(model=slr_model)},
... )
>>> dataflow.seed.append(
... Input(
... value=[model_predict.op.outputs["prediction"].name],
... definition=GetSingle.op.inputs["spec"],
... )
... )
>>>
>>> async def main():
... await train(
... slr_model,
... {"Years": 0, "Salary": 10},
... {"Years": 1, "Salary": 20},
... {"Years": 2, "Salary": 30},
... {"Years": 3, "Salary": 40},
... )
... inputs = [
... Input(
... value={"Years": 4}, definition=model_predict.op.inputs["features"],
... )
... ]
... async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
... print(results)
>>>
>>> asyncio.run(main())
{'model_predictions': {'Salary': {'confidence': 1.0, 'value': 50}}}
Stage: processing
Inputs
features: record_features(type: Dict[str, Any])
Outputs
prediction: model_predictions(type: Dict[str, Any])
Args
model: Entrypoint
extract_tar_archive¶
Official
Extracts a given tar file.
Parameters¶
- input_file_path: str
Path to the tar file.
- output_directory_path: str
Path where all the files should be extracted.
Returns¶
- dict
Path to the directory where the archive has been extracted
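The following is a minimal, unverified sketch of wiring extract_tar_archive into a dataflow, following the pattern of the examples above. The file and directory names and the printed result are placeholder assumptions; the archive must already exist.
>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(extract_tar_archive, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[extract_tar_archive.op.outputs["output_path"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value="files.tar",  # placeholder: an existing tar archive
...         definition=extract_tar_archive.op.inputs["input_file_path"],
...     ),
...     Input(
...         value="files",  # placeholder: directory to extract into
...         definition=extract_tar_archive.op.inputs["output_directory_path"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(results)
>>>
>>> asyncio.run(main())
{'output_directory_path': 'files'}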
Stage: processing
Inputs
input_file_path: tar_file(type: str)
output_directory_path: directory(type: str)
Outputs
output_path: output_directory_path(type: str)
extract_zip_archive¶
Official
Extracts a given zip file.
Parameters¶
- input_file_path: str
Path to the zip file.
- output_directory_path: str
Path where all the files should be extracted.
Returns¶
- dict
Path to the directory where the archive has been extracted
Stage: processing
Inputs
input_file_path: zip_file(type: str)
output_directory_path: directory(type: str)
Outputs
output_path: output_directory_path(type: str)
get_multi¶
Official
Output operation to get all Inputs matching given definitions.
Parameters¶
- spec: list
List of definition names. Any Inputs with matching definition will be returned.
Returns¶
- dict
Maps definition names to all the Inputs of that definition
Examples¶
The following shows how to grab all Inputs with the URL definition. If we had run an operation which output a URL, that output URL would have also been returned to us.
>>> import asyncio
>>> from dffml import *
>>>
>>> URL = Definition(name="URL", primitive="string")
>>>
>>> dataflow = DataFlow.auto(GetMulti)
>>> dataflow.seed.append(
... Input(
... value=[URL.name],
... definition=GetMulti.op.inputs["spec"]
... )
... )
>>>
>>> async def main():
... async for ctx, results in MemoryOrchestrator.run(dataflow, [
... Input(
... value="https://github.com/intel/dffml",
... definition=URL
... ),
... Input(
... value="https://github.com/intel/cve-bin-tool",
... definition=URL
... )
... ]):
... print(results)
...
>>> asyncio.run(main())
{'URL': ['https://github.com/intel/dffml', 'https://github.com/intel/cve-bin-tool']}
Stage: output
Inputs
spec: get_multi_spec(type: array)
Outputs
output: get_multi_output(type: map)
get_single¶
Official
Output operation to get a single Input for each definition given.
Parameters¶
- spec: list
List of definition names. An Input with matching definition will be returned.
Returns¶
- dict
Maps definition names to an Input of that definition
Examples¶
The following shows how to grab an Input with the URL definition. If we had run an operation which output a URL, that output URL could have also been returned to us.
>>> import asyncio
>>> from dffml import *
>>>
>>> URL = Definition(name="URL", primitive="string")
>>> ORG = Definition(name="ORG", primitive="string")
>>>
>>> dataflow = DataFlow.auto(GetSingle)
>>> dataflow.seed.append(
... Input(
... value=[{"Repo Link": URL.name}, ORG.name],
... definition=GetSingle.op.inputs["spec"]
... )
... )
>>>
>>> async def main():
... async for ctx, results in MemoryOrchestrator.run(dataflow, [
... Input(
... value="https://github.com/intel/dffml",
... definition=URL
... ),
... Input(
... value="Intel",
... definition=ORG
... )
... ]):
... print(results)
...
>>> asyncio.run(main())
{'ORG': 'Intel', 'Repo Link': 'https://github.com/intel/dffml'}
Stage: output
Inputs
spec: get_single_spec(type: array)
Outputs
output: get_single_output(type: map)
group_by¶
Official
No description
Stage: output
Inputs
spec: group_by_spec(type: Dict[str, Any])
Outputs
output: group_by_output(type: Dict[str, List[Any]])
gz_compress¶
Official
No description
Stage: processing
Inputs
input_file_path: decompressed_gz_file_path(type: str)
output_file_path: compressed_gz_file_path(type: str)
Outputs
output_path: compressed_output_gz_file_path(type: str)
gz_decompress¶
Official
No description
Stage: processing
Inputs
input_file_path: compressed_gz_file_path(type: str)
output_file_path: decompressed_gz_file_path(type: str)
Outputs
output_path: decompressed_output_gz_file_path(type: str)
literal_eval¶
Official
Evaluate the input using ast.literal_eval()
Parameters¶
- str_to_eval: str
A string to be evaluated.
Returns¶
- dict
A dict containing the evaluated Python literal.
Examples¶
The following example shows how to use literal_eval.
>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(literal_eval, GetSingle)
>>> dataflow.seed.append(
... Input(
... value=[literal_eval.op.outputs["str_after_eval"].name,],
... definition=GetSingle.op.inputs["spec"],
... )
... )
>>> inputs = [
... Input(
... value="[1,2,3]",
... definition=literal_eval.op.inputs["str_to_eval"],
... parents=None,
... )
... ]
>>>
>>> async def main():
... async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
... print(results)
>>>
>>> asyncio.run(main())
{'EvaluatedStr': [1, 2, 3]}
Stage: processing
Inputs
str_to_eval: InputStr(type: str)
Outputs
str_after_eval: EvaluatedStr(type: generic)
make_tar_archive¶
Official
Creates tar file of a directory.
Parameters¶
- input_directory_path: str
Path to the directory to be archived as a tarfile.
- output_file_path: str
Path where the output archive should be saved (should include the file name).
Returns¶
- dict
Path to the created tar file.
Stage: processing
Inputs
input_directory_path: directory(type: str)
output_file_path: tar_file(type: str)
Outputs
output_path: output_tarfile_path(type: str)
make_zip_archive¶
Official
Creates zip file of a directory.
Parameters¶
- input_directory_path: str
Path to the directory to be archived.
- output_file_path: str
Path where the output archive should be saved (should include the file name).
Returns¶
- dict
Path to the output zip file
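Below is a minimal, unverified sketch of using make_zip_archive in a dataflow, modeled on the examples above. The directory and file names and the printed result are placeholder assumptions.
>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(make_zip_archive, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[make_zip_archive.op.outputs["output_path"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value="myfiles",  # placeholder: an existing directory
...         definition=make_zip_archive.op.inputs["input_directory_path"],
...     ),
...     Input(
...         value="myfiles.zip",  # placeholder: where to write the archive
...         definition=make_zip_archive.op.inputs["output_file_path"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(results)
>>>
>>> asyncio.run(main())
{'output_zipfile_path': 'myfiles.zip'}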
Stage: processing
Inputs
input_directory_path: directory(type: str)
output_file_path: zip_file(type: str)
Outputs
output_path: output_zipfile_path(type: str)
multiply¶
Official
Multiply record values
Parameters¶
- multiplicand: generic
An arithmetic type value.
- multiplier: generic
An arithmetic type value.
Returns¶
- dict
A dict containing the product.
Examples¶
The following example shows how to use multiply.
>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(multiply, GetSingle)
>>> dataflow.seed.append(
... Input(
... value=[multiply.op.outputs["product"].name,],
... definition=GetSingle.op.inputs["spec"],
... )
... )
>>> inputs = [
... Input(
... value=12,
... definition=multiply.op.inputs["multiplicand"],
... ),
... Input(
... value=3,
... definition=multiply.op.inputs["multiplier"],
... ),
... ]
>>>
>>> async def main():
... async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
... print(results)
>>>
>>> asyncio.run(main())
{'product': 36}
Stage: processing
Inputs
multiplicand: multiplicand_def(type: generic)
multiplier: multiplier_def(type: generic)
Outputs
product: product(type: generic)
print_output¶
Official
Print the output on stdout using Python’s print()
Parameters¶
- data: Any
A Python literal to be printed.
Examples¶
The following example shows how to use print_output.
>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(print_output)
>>> inputs = [
... Input(
... value="print_output example", definition=print_output.op.inputs["data"]
... )
... ]
>>>
>>> async def main():
... async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
... pass
>>>
>>> asyncio.run(main())
print_output example
Stage: processing
Inputs
data: DataToPrint(type: generic)
xz_compress¶
Official
No description
Stage: processing
Inputs
input_file_path: decompressed_xz_file_path(type: str)
output_file_path: compressed_xz_file_path(type: str)
Outputs
output_path: compressed_output_xz_file_path(type: str)
xz_decompress¶
Official
No description
Stage: processing
Inputs
input_file_path: compressed_xz_file_path(type: str)
output_file_path: decompressed_xz_file_path(type: str)
Outputs
output_path: decompressed_output_xz_file_path(type: str)
dffml_operations_image¶
pip install dffml-operations-image
Haralick¶
Official
Computes Haralick texture features
Stage: processing
Inputs
f: Haralick.inputs.f(type: array)
ignore_zeros: Haralick.inputs.ignore_zeros(type: bool)
preserve_haralick_bug: Haralick.inputs.preserve_haralick_bug(type: bool)
compute_14th_feature: Haralick.inputs.compute_14th_feature(type: bool)
return_mean: Haralick.inputs.return_mean(type: bool)
return_mean_ptp: Haralick.inputs.return_mean_ptp(type: bool)
use_x_minus_y_variance: Haralick.inputs.use_x_minus_y_variance(type: bool)
distance: Haralick.inputs.distance(type: int)
Outputs
result: Haralick.outputs.result(type: array)
HuMoments¶
Official
Calculates seven Hu invariants
Stage: processing
Inputs
m: HuMoments.inputs.m(type: array)
Outputs
result: HuMoments.outputs.result(type: array)
calcHist¶
Official
Calculates a histogram
Stage: processing
Inputs
images: calcHist.inputs.images(type: array)
channels: calcHist.inputs.channels(type: array)
mask: calcHist.inputs.mask(type: array)
histSize: calcHist.inputs.histSize(type: array)
ranges: calcHist.inputs.ranges(type: array)
Outputs
result: calcHist.outputs.result(type: array)
convert_color¶
Official
Converts images from one color space to another
Stage: processing
Inputs
src: convert_color.inputs.src(type: array)
code: convert_color.inputs.code(type: str)
Outputs
result: convert_color.outputs.result(type: array)
flatten¶
Official
No description
Stage: processing
Inputs
array: flatten.inputs.array(type: array)
Outputs
result: flatten.outputs.result(type: array)
normalize¶
Official
Normalizes arrays
Stage: processing
Inputs
src: normalize.inputs.src(type: array)
alpha: normalize.inputs.alpha(type: int)
beta: normalize.inputs.beta(type: int)
norm_type: normalize.inputs.norm_type(type: int)
dtype: normalize.inputs.dtype(type: int)
mask: normalize.inputs.mask(type: array)
Outputs
result: normalize.outputs.result(type: array)
resize¶
Official
Resizes an image array to the specified new dimensions.
If the new dimensions are 2D, the image is converted to grayscale.
- To enlarge the image (src dimensions < dsize), it resizes with INTER_CUBIC interpolation.
- To shrink the image (src dimensions > dsize), it resizes with INTER_AREA interpolation.
Stage: processing
Inputs
src: resize.inputs.src(type: array)
dsize: resize.inputs.dsize(type: array)
fx: resize.inputs.fx(type: float)
fy: resize.inputs.fy(type: float)
interpolation: resize.inputs.interpolation(type: int)
Outputs
result: resize.outputs.result(type: array)
dffml_feature_git¶
pip install dffml-feature-git
check_if_valid_git_repository_URL¶
Official
No description
Stage: processing
Inputs
URL: URL(type: string)
Outputs
valid: valid_git_repository_URL(type: boolean)
cleanup_git_repo¶
Official
No description
Stage: cleanup
Inputs
repo: git_repository(type: Dict[str, str])
directory: str
URL: str(default: None)
clone_git_repo¶
Official
No description
Stage: processing
Inputs
URL: URL(type: string)
Outputs
repo: git_repository(type: Dict[str, str])
directory: str
URL: str(default: None)
Conditions
valid_git_repository_URL: boolean
git_commits¶
Official
No description
Stage: processing
Inputs
repo: git_repository(type: Dict[str, str])
directory: str
URL: str(default: None)
branch: git_branch(type: str)
start_end: date_pair(type: List[date])
Outputs
commits: commit_count(type: int)
git_repo_checkout¶
Official
No description
Stage: processing
Inputs
repo: git_repository(type: Dict[str, str])
directory: str
URL: str(default: None)
commit: git_commit(type: string)
Outputs
repo: git_repository_checked_out(type: Dict[str, str])
directory: str
URL: str(default: None)
commit: str(default: None)
git_repo_commit_from_date¶
Official
No description
Stage: processing
Inputs
repo: git_repository(type: Dict[str, str])
directory: str
URL: str(default: None)
branch: git_branch(type: str)
date: date(type: string)
Outputs
commit: git_commit(type: string)
git_repo_default_branch¶
Official
No description
Stage: processing
Inputs
repo: git_repository(type: Dict[str, str])
directory: str
URL: str(default: None)
Outputs
branch: git_branch(type: str)
Conditions
no_git_branch_given: boolean
git_repo_release¶
Official
Checks whether there was a release within this date range.
Stage: processing
Inputs
repo: git_repository(type: Dict[str, str])
directory: str
URL: str(default: None)
branch: git_branch(type: str)
start_end: date_pair(type: List[date])
Outputs
present: release_within_period(type: bool)
lines_of_code_by_language¶
Official
This operation relies on tokei. Here’s how to install version 10.1.1; check its releases page to make sure you’re installing the latest version.
On Linux
$ curl -sSL 'https://github.com/XAMPPRocky/tokei/releases/download/v10.1.1/tokei-v10.1.1-x86_64-unknown-linux-gnu.tar.gz' \
| tar -xvz && \
echo '22699e16e71f07ff805805d26ee86ecb9b1052d7879350f7eb9ed87beb0e6b84fbb512963d01b75cec8e80532e4ea29a tokei' | sha384sum -c - && \
sudo mv tokei /usr/local/bin/
On OSX
$ curl -sSL 'https://github.com/XAMPPRocky/tokei/releases/download/v10.1.1/tokei-v10.1.1-x86_64-apple-darwin.tar.gz' \
| tar -xvz && \
echo '8c8a1d8d8dd4d8bef93dabf5d2f6e27023777f8553393e269765d7ece85e68837cba4374a2615d83f071dfae22ba40e2 tokei' | sha384sum -c - && \
sudo mv tokei /usr/local/bin/
Stage: processing
Inputs
repo: git_repository_checked_out(type: Dict[str, str])
directory: str
URL: str(default: None)
commit: str(default: None)
Outputs
lines_by_language: lines_by_language_count(type: Dict[str, Dict[str, int]])
lines_of_code_to_comments¶
Official
No description
Stage: processing
Inputs
langs: lines_by_language_count(type: Dict[str, Dict[str, int]])
Outputs
code_to_comment_ratio: language_to_comment_ratio(type: int)
make_quarters¶
Official
No description
Stage: processing
Inputs
number: quarters(type: int)
Outputs
quarters: quarter(type: int)
quarters_back_to_date¶
Official
No description
Stage: processing
Inputs
date: quarter_start_date(type: int)
number: quarter(type: int)
Outputs
date: date(type: string)
start_end: date_pair(type: List[date])
work¶
Official
No description
Stage: processing
Inputs
author_lines: author_line_count(type: Dict[str, int])
Outputs
work: work_spread(type: int)
dffml_feature_auth¶
pip install dffml-feature-auth
scrypt¶
Official
No description
Stage: processing
Inputs
password: UnhashedPassword(type: string)
Outputs
password: ScryptPassword(type: string)
dffml_operations_deploy¶
pip install dffml-operations-deploy
check_if_default_branch¶
Official
No description
Stage: processing
Inputs
payload: git_payload(type: Dict[str,Any])
Outputs
is_default_branch: is_default_branch(type: bool)
check_secret_match¶
Official
No description
Stage: processing
Inputs
headers: webhook_headers(type: Dict[str,Any])
body: payload(type: bytes)
Outputs
git_payload: git_payload(type: Dict[str,Any])
Args
secret: Entrypoint
docker_build_image¶
Official
No description
Stage: processing
Inputs
docker_commands: docker_commands(type: Dict[str,Any])
Outputs
build_status: is_image_built(type: bool)
Conditions
is_default_branch: bool
got_running_containers: bool
get_image_tag¶
Official
No description
Stage: processing
Inputs
payload: git_payload(type: Dict[str,Any])
Outputs
image_tag: docker_image_tag(type: str)
get_running_containers¶
Official
No description
Stage: processing
Inputs
tag: docker_image_tag(type: str)
Outputs
running_containers: docker_running_containers(type: List[str])
get_status_running_containers¶
Official
No description
Stage: processing
Inputs
containers: docker_running_containers(type: List[str])
Outputs
status: got_running_containers(type: bool)
get_url_from_payload¶
Official
No description
Stage: processing
Inputs
payload: git_payload(type: Dict[str,Any])
Outputs
url: URL(type: string)
parse_docker_commands¶
Official
No description
Stage: processing
Inputs
repo: git_repository(type: Dict[str, str])
directory: str
URL: str(default: None)
image_tag: docker_image_tag(type: str)
Outputs
docker_commands: docker_commands(type: Dict[str,Any])
restart_running_containers¶
Official
No description
Stage: processing
Inputs
docker_commands: docker_commands(type: Dict[str,Any])
containers: docker_running_containers(type: List[str])
Outputs
containers: docker_restarted_containers(type: str)
Conditions
is_image_built: bool
dffml_operations_data¶
pip install dffml-operations-data
one_hot_encoder¶
Official
One hot encoding for categorical data columns
Parameters¶
- data: List[List[int]]
Data to be encoded.
- categories: List[List[str]]
Categorical values which need to be encoded.
Returns¶
result: Encoded data for categorical values
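The following is an unverified sketch of running one_hot_encoder through a dataflow, following the pattern used for the dffml operations above. The import path, the input values, and the printed result are assumptions; scikit-learn must be installed.
>>> import asyncio
>>> from dffml import *
>>> from dffml_operations_data.operations import one_hot_encoder  # assumed import path
>>>
>>> dataflow = DataFlow.auto(one_hot_encoder, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[one_hot_encoder.op.outputs["result"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value=[[0], [1], [2]],
...         definition=one_hot_encoder.op.inputs["data"],
...     ),
...     Input(
...         value=[[0, 1, 2]],
...         definition=one_hot_encoder.op.inputs["categories"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(results)
>>>
>>> asyncio.run(main())
{'output_data': [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]}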
Stage: processing
Inputs
data: input_data(type: List[List[int]])
categories: categories(type: List[List[Any]])
Outputs
result: output_data(type: List[List[int]])
ordinal_encoder¶
Official
Ordinal encoding for categorical data columns
Parameters¶
- data: List[List[int]]
Data to be encoded.
- categories: List[List[str]]
Categorical values which need to be encoded.
Returns¶
result: Encoded data for categorical values
Stage: processing
Inputs
data: input_data(type: List[List[int]])
Outputs
result: output_data(type: List[List[int]])
principal_component_analysis¶
Official
Decomposes the data into (n_samples, n_components) using PCA method
Parameters¶
- data: List[List[int]]
Data to be decomposed.
- n_components: int
Number of columns the data should have after decomposition.
Returns¶
result: Data having dimensions (n_samples, n_components)
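A minimal, unverified sketch of principal_component_analysis in a dataflow is shown below. The import path and input values are assumptions; the projected values depend on scikit-learn, so only the shape of the result is checked here.
>>> import asyncio
>>> from dffml import *
>>> from dffml_operations_data.operations import principal_component_analysis  # assumed import path
>>>
>>> dataflow = DataFlow.auto(principal_component_analysis, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[principal_component_analysis.op.outputs["result"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value=[[1, 2], [3, 4], [5, 6], [7, 8]],
...         definition=principal_component_analysis.op.inputs["data"],
...     ),
...     Input(
...         value=1,
...         definition=principal_component_analysis.op.inputs["n_components"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         decomposed = results["output_data"]
...         print(len(decomposed), len(decomposed[0]))
>>>
>>> asyncio.run(main())
4 1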
Stage: processing
Inputs
data: input_data(type: List[List[int]])
n_components: n_components(type: int)
Outputs
result: output_data(type: List[List[int]])
remove_whitespaces¶
Official
Removes whitespace from the dataset
Parameters¶
- data: List[List[int]]
The dataset.
Returns¶
result: dataset having whitespaces removed
Stage: processing
Inputs
data: input_data(type: List[List[int]])
Outputs
result: output_data(type: List[List[int]])
simple_imputer¶
Official
Imputation method for missing values
Parameters¶
- data: List[List[int]]
Data in which missing values are present.
- missing_values: Any (str, int, float, or None), default = np.nan
The value that is present in place of a missing value.
- strategy: str, one of “mean”, “median”, “constant”, “most_frequent”, default = “mean”
The imputation strategy.
Returns¶
result: Dataset having missing values imputed with the strategy
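Below is an unverified sketch of simple_imputer in a dataflow. The import path, the use of float("nan") to mark missing entries, and the printed result are assumptions.
>>> import asyncio
>>> from dffml import *
>>> from dffml_operations_data.operations import simple_imputer  # assumed import path
>>>
>>> dataflow = DataFlow.auto(simple_imputer, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[simple_imputer.op.outputs["result"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value=[[1, 2], [float("nan"), 3], [7, 6]],
...         definition=simple_imputer.op.inputs["data"],
...     ),
...     Input(
...         value=float("nan"),
...         definition=simple_imputer.op.inputs["missing_values"],
...     ),
...     Input(
...         value="mean",
...         definition=simple_imputer.op.inputs["strategy"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(results)
>>>
>>> asyncio.run(main())
{'output_data': [[1.0, 2.0], [4.0, 3.0], [7.0, 6.0]]}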
Stage: processing
Inputs
data: input_data(type: List[List[int]])
missing_values: missing_values(type: Any)
strategy: strategy(type: str)
Outputs
result: output_data(type: List[List[int]])
singular_value_decomposition¶
Official
Decomposes the data into (n_samples, n_components) using SVD method.
Parameters¶
- data: List[List[int]]
Data to be decomposed.
- n_components: int
Number of columns the data should have after decomposition.
Returns¶
result: Data having dimensions (n_samples, n_components)
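A minimal, unverified sketch of singular_value_decomposition in a dataflow follows. The import path and input values are assumptions; as with the PCA sketch above, only the shape of the decomposed data is checked.
>>> import asyncio
>>> from dffml import *
>>> from dffml_operations_data.operations import singular_value_decomposition  # assumed import path
>>>
>>> dataflow = DataFlow.auto(singular_value_decomposition, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[singular_value_decomposition.op.outputs["result"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value=[[1, 2], [3, 4], [5, 6], [7, 8]],
...         definition=singular_value_decomposition.op.inputs["data"],
...     ),
...     Input(
...         value=1,
...         definition=singular_value_decomposition.op.inputs["n_components"],
...     ),
...     Input(
...         value=5,
...         definition=singular_value_decomposition.op.inputs["n_iter"],
...     ),
...     Input(
...         value=42,
...         definition=singular_value_decomposition.op.inputs["random_state"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         decomposed = results["output_data"]
...         print(len(decomposed), len(decomposed[0]))
>>>
>>> asyncio.run(main())
4 1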
Stage: processing
Inputs
data: input_data(type: List[List[int]])
n_components: n_components(type: int)
n_iter: n_iter(type: int)
random_state: random_state(type: int)
Outputs
result: output_data(type: List[List[int]])
standard_scaler¶
Official
Standardize features by removing the mean and scaling to unit variance.
Parameters¶
- data: List[List[int]]
data that needs to be standardized
Returns¶
result: Standardized data
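The following is an unverified sketch of standard_scaler in a dataflow. The import path and the printed values (each column scaled to zero mean and unit variance) are assumptions.
>>> import asyncio
>>> from dffml import *
>>> from dffml_operations_data.operations import standard_scaler  # assumed import path
>>>
>>> dataflow = DataFlow.auto(standard_scaler, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[standard_scaler.op.outputs["result"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value=[[0, 0], [1, 1], [2, 2]],
...         definition=standard_scaler.op.inputs["data"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(results)
>>>
>>> asyncio.run(main())
{'output_data': [[-1.224744871391589, -1.224744871391589], [0.0, 0.0], [1.224744871391589, 1.224744871391589]]}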
Stage: processing
Inputs
data: input_data(type: List[List[int]])
Outputs
result: output_data(type: List[List[int]])
dffml_operations_binsec¶
pip install dffml-operations-binsec
cleanup_rpm¶
Official
No description
Stage: cleanup
Inputs
rpm: RPMObject(type: python_obj)
files_in_rpm¶
Official
No description
Stage: processing
Inputs
rpm: RPMObject(type: python_obj)
Outputs
files: rpm_filename(type: str)
is_binary_pie¶
Official
No description
Stage: processing
Inputs
rpm: RPMObject(type: python_obj)
filename: rpm_filename(type: str)
Outputs
is_pie: binary_is_PIE(type: bool)
url_to_urlbytes¶
Official
No description
Stage: processing
Inputs
URL: URL(type: string)
Outputs
download: URLBytes(type: python_obj)
urlbytes_to_rpmfile¶
Official
No description
Stage: processing
Inputs
download: URLBytes(type: python_obj)
Outputs
rpm: RPMObject(type: python_obj)
urlbytes_to_tarfile¶
Official
No description
Stage: processing
Inputs
download: URLBytes(type: python_obj)
Outputs
rpm: RPMObject(type: python_obj)
shouldi¶
pip install shouldi
cleanup_pypi_package¶
Official
Remove the directory containing the source code release.
Stage: cleanup
Inputs
directory: run_bandit.inputs.pkg(type: str)
pypi_package_contents¶
Official
Download a source code release and extract it to a temporary directory.
Stage: processing
Inputs
url: pypi_package_contents.inputs.url(type: str)
Outputs
directory: run_bandit.inputs.pkg(type: str)
pypi_package_json¶
Official
Download the information on the package in JSON format.
Stage: processing
Inputs
package: safety_check.inputs.package(type: str)
Outputs
version: safety_check.inputs.version(type: str)
url: pypi_package_contents.inputs.url(type: str)
run_bandit¶
Official
CLI usage: dffml service dev run -log debug shouldi.bandit:run_bandit -pkg .
Stage: processing
Inputs
pkg: run_bandit.inputs.pkg(type: str)
Outputs
result: run_bandit.outputs.result(type: map)
safety_check¶
Official
No description
Stage: processing
Inputs
package: safety_check.inputs.package(type: str)
version: safety_check.inputs.version(type: str)
Outputs
result: safety_check.outputs.result(type: int)
dffml_operations_nlp¶
pip install dffml-operations-nlp
collect_output¶
Official
No description
Stage: processing
Inputs
sentence: sentence(type: string)
length: source_length(type: string)
Outputs
all: all_sentences(type: List[string])
count_vectorizer¶
Official
Converts a collection of text documents to a matrix of token counts using sklearn CountVectorizer’s fit_transform method. For details on parameters check https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html. Parameters specific to this operation are described below.
Parameters¶
- text: list
A list of strings.
- get_feature_names: bool
If True return feature names using get_feature_names method of CountVectorizer.
Returns¶
- result: list
A list containing token counts and feature names if get_feature_names is True.
Stage: processing
Inputs
text: count_vectorizer.inputs.text(type: array)
encoding: count_vectorizer.inputs.encoding(type: str)
decode_error: count_vectorizer.inputs.decode_error(type: str)
strip_accents: count_vectorizer.inputs.strip_accents(type: str)
lowercase: count_vectorizer.inputs.lowercase(type: bool)
stop_words: count_vectorizer.inputs.stop_words(type: str)
token_pattern: count_vectorizer.inputs.token_pattern(type: str)
ngram_range: count_vectorizer.inputs.ngram_range(type: array)
analyzer: count_vectorizer.inputs.analyzer(type: str)
max_df: count_vectorizer.inputs.max_df(type: float)
min_df: count_vectorizer.inputs.min_df(type: float)
max_features: count_vectorizer.inputs.max_features(type: int)
vocabulary: count_vectorizer.inputs.vocabulary(type: map)
binary: count_vectorizer.inputs.binary(type: bool)
get_feature_names: count_vectorizer.inputs.get_feature_names(type: bool)
Outputs
result: count_vectorizer.outputs.result(type: array)
extract_array_from_matrix¶
Official
Returns row from input_matrix based on index of single_text_example in collected_text.
Parameters¶
- single_text_example: str
String to be used for indexing into collected_text.
- collected_text: list
List of strings.
- input_matrix: list
A 2-D matrix where each row represents vector corresponding to single_text_example.
Returns¶
result: A 1-d array.
Stage: processing
Inputs
single_text_example: extract_array_from_matrix.inputs.single_text_example(type: str)
collected_text: extract_array_from_matrix.inputs.collected_text(type: array)
input_matrix: extract_array_from_matrix.inputs.input_matrix(type: array)
Outputs
result: extract_array_from_matrix.outputs.result(type: array)
get_embedding¶
Official
Maps words of text data to their corresponding word vectors.
Parameters¶
- text: str
String to be converted to word vectors.
- max_len: int
Maximum length of sentence. If the length of text > max_len, text is truncated to have length = max_len. If the length of text < max_len, text is padded with pad_token such that len(text) = max_len.
- pad_token: str
Token to be used for padding text if len(text) < max_len
- spacy_model: str
Spacy model to be used for assigning vectors to tokens.
Returns¶
result: A 2-d array of shape (max_len, embedding_size of vectors).
Stage: processing
Inputs
text: text_def(type: str)
spacy_model: spacy_model_name_def(type: str)
max_len: max_len_def(type: int)
pad_token: pad_token_def(type: str)
Outputs
embedding: embedding(type: generic)
get_noun_chunks¶
Official
Extracts the noun chunks from text.
Parameters¶
- text: str
String to extract noun chunks from.
- spacy_model: str
A spacy model with the capability of parsing.
Returns¶
- result: list
A list containing noun chunks.
Stage: processing
Inputs
text: get_noun_chunks.inputs.text(type: str)
spacy_model: get_noun_chunks.inputs.spacy_model(type: str)
Outputs
result: get_noun_chunks.outputs.result(type: array)
get_sentences¶
Official
Extracts the sentences from text.
Parameters¶
- text: str
String to extract sentences from.
- spacy_model: str
A spacy model with the capability of parsing. Sentence boundaries are calculated from the syntactic dependency parse.
Returns¶
- result: list
A list containing sentences.
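Below is a minimal, unverified sketch of get_sentences in a dataflow. The import path is an assumption, the spacy model (en_core_web_sm) must be downloaded separately, and the printed result is illustrative.
>>> import asyncio
>>> from dffml import *
>>> from dffml_operations_nlp.operations import get_sentences  # assumed import path
>>>
>>> dataflow = DataFlow.auto(get_sentences, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[get_sentences.op.outputs["result"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value="The quick brown fox jumps over the lazy dog. It was very quick.",
...         definition=get_sentences.op.inputs["text"],
...     ),
...     Input(
...         value="en_core_web_sm",  # assumed: model must already be installed
...         definition=get_sentences.op.inputs["spacy_model"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(results)
>>>
>>> asyncio.run(main())
{'get_sentences.outputs.result': ['The quick brown fox jumps over the lazy dog.', 'It was very quick.']}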
Stage: processing
Inputs
text: get_sentences.inputs.text(type: str)
spacy_model: get_sentences.inputs.spacy_model(type: str)
Outputs
result: get_sentences.outputs.result(type: array)
get_similarity¶
Official
Calculates similarity between two text strings as a score between 0 and 1.
Parameters¶
- text_1: str
First string to compare.
- text_2: str
Second string to compare.
- spacy_model: str
Spacy model to be used for extracting word vectors which are used for calculating similarity.
Returns¶
- result: float
A similarity score between 0 and 1.
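A minimal, unverified sketch of get_similarity in a dataflow is shown below. The import path is an assumption, a spacy model with word vectors (en_core_web_md here) must be downloaded, and the printed score is illustrative.
>>> import asyncio
>>> from dffml import *
>>> from dffml_operations_nlp.operations import get_similarity  # assumed import path
>>>
>>> dataflow = DataFlow.auto(get_similarity, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[get_similarity.op.outputs["result"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value="I like cats",
...         definition=get_similarity.op.inputs["text_1"],
...     ),
...     Input(
...         value="I like dogs",
...         definition=get_similarity.op.inputs["text_2"],
...     ),
...     Input(
...         value="en_core_web_md",  # assumed: model with vectors must be installed
...         definition=get_similarity.op.inputs["spacy_model"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(round(results["get_similarity.outputs.result"], 2))
>>>
>>> asyncio.run(main())
0.94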
Stage: processing
Inputs
text_1: get_similarity.inputs.text_1(type: str)
text_2: get_similarity.inputs.text_2(type: str)
spacy_model: get_similarity.inputs.spacy_model(type: str)
Outputs
result: get_similarity.outputs.result(type: float)
lemmatizer¶
Official
Reduce words in the text to their dictionary form (lemma)
Parameters¶
- text: str
String to lemmatize.
- spacy_model: str
Spacy model to be used for lemmatization.
Returns¶
- result: list
A list containing base form of the words.
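Below is an unverified sketch of lemmatizer in a dataflow. The import path is an assumption, en_core_web_sm must be downloaded, and the printed lemmas are illustrative.
>>> import asyncio
>>> from dffml import *
>>> from dffml_operations_nlp.operations import lemmatizer  # assumed import path
>>>
>>> dataflow = DataFlow.auto(lemmatizer, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[lemmatizer.op.outputs["result"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value="cats are running",
...         definition=lemmatizer.op.inputs["text"],
...     ),
...     Input(
...         value="en_core_web_sm",  # assumed: model must already be installed
...         definition=lemmatizer.op.inputs["spacy_model"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(results)
>>>
>>> asyncio.run(main())
{'lemmatizer.outputs.result': ['cat', 'be', 'run']}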
Stage: processing
Inputs
text: lemmatizer.inputs.text(type: str)
spacy_model: lemmatizer.inputs.spacy_model(type: str)
Outputs
result: lemmatizer.outputs.result(type: array)
pos_tagger¶
Official
Assigns part-of-speech tags to text.
Parameters¶
- text: str
Text to be tagged.
- spacy_model: str
A spacy model with tagger and parser.
Returns¶
- result: list
A list containing tuples of word and their respective pos tag.
Stage: processing
Inputs
text: pos_tagger.inputs.text(type: str)
spacy_model: pos_tagger.inputs.spacy_model(type: str)
tag_type: pos_tagger.inputs.tag_type(type: str)
Outputs
result: pos_tagger.outputs.result(type: array)
remove_stopwords¶
Official
Removes stopwords from text data.
Parameters¶
- text: str
String to be cleaned.
- custom_stop_words: List[str], default = None
List of words to be considered as stop words.
Returns¶
result: A string without stop words.
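A minimal, unverified sketch of remove_stopwords in a dataflow follows. The import path is an assumption; a custom stop word list is passed explicitly here, and the printed result is illustrative.
>>> import asyncio
>>> from dffml import *
>>> from dffml_operations_nlp.operations import remove_stopwords  # assumed import path
>>>
>>> dataflow = DataFlow.auto(remove_stopwords, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[remove_stopwords.op.outputs["result"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value="dffml is a framework for dataflows",
...         definition=remove_stopwords.op.inputs["text"],
...     ),
...     Input(
...         value=["is", "a", "for"],  # assumed: words treated as stop words
...         definition=remove_stopwords.op.inputs["custom_stop_words"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(results)
>>>
>>> asyncio.run(main())
{'remove_stopwords.outputs.result': 'dffml framework dataflows'}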
Stage: processing
Inputs
text: remove_stopwords.inputs.text(type: str)
custom_stop_words: remove_stopwords.inputs.custom_stop_words(type: array)
Outputs
result: remove_stopwords.outputs.result(type: str)
tfidf_vectorizer¶
Official
Convert a collection of raw documents to a matrix of TF-IDF features using sklearn TfidfVectorizer’s fit_transform method. For details on parameters check https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html. Parameters specific to this operation are described below.
Parameters¶
- text: list
A list of strings.
- get_feature_names: bool
If True return feature names using get_feature_names method of TfidfVectorizer.
Returns¶
- result: list
A list containing token counts and feature names if get_feature_names is True.
Stage: processing
Inputs
text: tfidf_vectorizer.inputs.text(type: array)
encoding: tfidf_vectorizer.inputs.encoding(type: str)
decode_error: tfidf_vectorizer.inputs.decode_error(type: str)
strip_accents: tfidf_vectorizer.inputs.strip_accents(type: str)
lowercase: tfidf_vectorizer.inputs.lowercase(type: bool)
analyzer: tfidf_vectorizer.inputs.analyzer(type: str)
stop_words: tfidf_vectorizer.inputs.stop_words(type: str)
token_pattern: tfidf_vectorizer.inputs.token_pattern(type: str)
ngram_range: tfidf_vectorizer.inputs.ngram_range(type: array)
max_df: tfidf_vectorizer.inputs.max_df(type: str)
min_df: tfidf_vectorizer.inputs.min_df(type: str)
max_features: tfidf_vectorizer.inputs.max_features(type: str)
vocabulary: tfidf_vectorizer.inputs.vocabulary(type: str)
binary: tfidf_vectorizer.inputs.binary(type: bool)
norm: tfidf_vectorizer.inputs.norm(type: str)
use_idf: tfidf_vectorizer.inputs.use_idf(type: bool)
smooth_idf: tfidf_vectorizer.inputs.smooth_idf(type: bool)
sublinear_tf: tfidf_vectorizer.inputs.sublinear_tf(type: bool)
get_feature_names: tfidf_vectorizer.inputs.get_feature_names(type: bool)
Outputs
result: tfidf_vectorizer.outputs.result(type: array)