Operations

Operation Implementations are subclasses of dffml.df.base.OperationImplementation. They are functions or classes which can do anything: make HTTP requests, run inference, etc.

They don’t necessarily have to be written in Python. DFFML isn’t yet able to use operations written in other languages, but it’s on the roadmap.
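For orientation, here is a minimal sketch of defining an operation as a plain function with the @op decorator and running it; the definitions and the function itself are hypothetical and are not part of any package documented below.

import asyncio
from dffml import op, Definition, DataFlow, Input, GetSingle, MemoryOrchestrator

# Hypothetical definitions used only for this sketch
PACKAGE = Definition(name="package_name", primitive="string")
VERSION = Definition(name="package_version", primitive="string")

@op(inputs={"package": PACKAGE}, outputs={"version": VERSION})
def lookup_version(package: str) -> dict:
    # An operation body can do anything: HTTP requests, inference, etc.
    # Here it just returns a constant for illustration.
    return {"version": "1.0.0"}

# Wire the operation into a flow and grab its output with GetSingle
dataflow = DataFlow.auto(lookup_version, GetSingle)
dataflow.seed.append(
    Input(value=[VERSION.name], definition=GetSingle.op.inputs["spec"])
)

async def main():
    async for ctx, results in MemoryOrchestrator.run(
        dataflow, [Input(value="dffml", definition=PACKAGE)]
    ):
        print(results)

asyncio.run(main())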

dffml

pip install dffml

AcceptUserInput

Official

Accept input from stdin using Python’s input()

Returns

dict

A dictionary containing user input.

Examples

The following example shows how to use AcceptUserInput. (It assumes that the input from stdin is “Data flow is awesome”!)

>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(AcceptUserInput, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[AcceptUserInput.op.outputs["InputData"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, {"input": []}):
...         print(results)
>>>
>>> asyncio.run(main())
Enter the value: {'UserInput': 'Data flow is awesome'}

Stage: processing

Outputs

  • InputData: UserInput(type: str)

associate

Official

No description

Stage: output

Inputs

  • spec: associate_spec(type: List[str])

Outputs

  • output: associate_output(type: Dict[str, Any])

associate_definition

Official

Examples

>>> import asyncio
>>> from dffml import *
>>>
>>> feed_def = Definition(name="feed", primitive="string")
>>> dead_def = Definition(name="dead", primitive="string")
>>> output = Definition(name="output", primitive="string")
>>>
>>> feed_input = Input(value="my favorite value", definition=feed_def)
>>> face_input = Input(
...     value="face", definition=output, parents=[feed_input]
... )
>>>
>>> dead_input = Input(
...     value="my second favorite value", definition=dead_def
... )
>>> beef_input = Input(
...     value="beef", definition=output, parents=[dead_input]
... )
>>>
>>> async def main():
...     for value in ["feed", "dead"]:
...         async for ctx, results in MemoryOrchestrator.run(
...             DataFlow.auto(AssociateDefinition),
...             [
...                 feed_input,
...                 face_input,
...                 dead_input,
...                 beef_input,
...                 Input(
...                     value={value: "output"},
...                     definition=AssociateDefinition.op.inputs["spec"],
...                 ),
...             ],
...         ):
...             print(results)
>>>
>>> asyncio.run(main())
{'feed': 'face'}
{'dead': 'beef'}

Stage: output

Inputs

  • spec: associate_spec(type: List[str])

Outputs

  • output: associate_output(type: Dict[str, Any])

bz2_compress

Official

No description

Stage: processing

Inputs

  • input_file_path: decompressed_bz2_file_path(type: str)

  • output_file_path: compressed_bz2_file_path(type: str)

Outputs

  • output_path: compressed_output_bz2_file_path(type: str)

bz2_decompress

Official

No description

Stage: processing

Inputs

  • input_file_path: compressed_bz2_file_path(type: str)

  • output_file_path: decompressed_bz2_file_path(type: str)

Outputs

  • output_path: decompressed_output_bz2_file_path(type: str)

convert_list_to_records

Official

No description

Stage: processing

Inputs

  • matrix: matrix(type: List[List[Any]])

  • features: features(type: List[str])

  • keys: keys(type: List[str])

  • predict_features: predict_features(type: List[str])

  • unprocessed_matrix: unprocessed_matrix(type: List[List[Any]])

Outputs

  • records: records(type: Dict[str, Any])

convert_records_to_list

Official

No description

Stage: processing

Inputs

  • features: features(type: List[str])

  • predict_features: predict_features(type: List[str])

Outputs

  • matrix: matrix(type: List[List[Any]])

  • keys: keys(type: List[str])

  • unprocessed_matrix: unprocessed_matrix(type: List[List[Any]])

Args

  • source: Entrypoint

db_query_create_table

Official

Generates a create table query in the database.

Parameters

table_name: str

The name of the table to be created.

cols: list[str]

Columns of the table.

Examples

>>> import asyncio
>>> from dffml import *
>>>
>>> sdb = SqliteDatabase(SqliteDatabaseConfig(filename="examples.db"))
>>>
>>> dataflow = DataFlow(
...     operations={"db_query_create": db_query_create_table.op,},
...     configs={"db_query_create": DatabaseQueryConfig(database=sdb),},
...     seed=[],
... )
>>>
>>> inputs = [
...     Input(
...         value="myTable1",
...         definition=db_query_create_table.op.inputs["table_name"],
...     ),
...     Input(
...         value={
...             "key": "real",
...             "firstName": "text",
...             "lastName": "text",
...             "age": "real",
...         },
...         definition=db_query_create_table.op.inputs["cols"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
...         pass
>>>
>>> asyncio.run(main())

Stage: processing

Inputs

  • table_name: query_table(type: str)

  • cols: query_cols(type: Dict[str, str])

Args

  • database: Entrypoint

db_query_insert

Official

Generates an insert query in the database.

Parameters

table_name: str

The name of the table to insert data into.

data: dict

Data to be inserted into the table.

Examples

>>> import asyncio
>>> from dffml import *
>>>
>>> sdb = SqliteDatabase(SqliteDatabaseConfig(filename="examples.db"))
>>>
>>> dataflow = DataFlow(
...     operations={
...         "db_query_insert": db_query_insert.op,
...         "db_query_lookup": db_query_lookup.op,
...         "get_single": GetSingle.imp.op,
...     },
...     configs={
...         "db_query_lookup": DatabaseQueryConfig(database=sdb),
...         "db_query_insert": DatabaseQueryConfig(database=sdb),
...     },
...     seed=[],
... )
>>>
>>> inputs = {
...     "insert": [
...         Input(
...             value="myTable", definition=db_query_insert.op.inputs["table_name"],
...         ),
...         Input(
...            value={"key": 10, "firstName": "John", "lastName": "Doe", "age": 16},
...             definition=db_query_insert.op.inputs["data"],
...         ),
...     ],
...     "lookup": [
...         Input(
...             value="myTable", definition=db_query_lookup.op.inputs["table_name"],
...         ),
...         Input(
...             value=["firstName", "lastName", "age"],
...             definition=db_query_lookup.op.inputs["cols"],
...         ),
...         Input(value=[], definition=db_query_lookup.op.inputs["conditions"],),
...         Input(
...             value=[db_query_lookup.op.outputs["lookups"].name],
...             definition=GetSingle.op.inputs["spec"],
...         ),
...     ]
... }
>>>
>>> async def main():
...     async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
...         if result:
...             print(result)
>>>
>>> asyncio.run(main())
{'query_lookups': [{'firstName': 'John', 'lastName': 'Doe', 'age': 16}]}

Stage: processing

Inputs

  • table_name: query_table(type: str)

  • data: query_data(type: Dict[str, Any])

Args

  • database: Entrypoint

db_query_insert_or_update

Official

Automatically uses the better suited operation: an insert query or an update query.

Parameters

table_name: str

The name of the table to insert data into.

data: dict

Data to be inserted or updated in the table.

Examples

>>> import asyncio
>>> from dffml import *
>>>
>>> sdb = SqliteDatabase(SqliteDatabaseConfig(filename="examples.db"))
>>>
>>> person = {"key": 11, "firstName": "John", "lastName": "Wick", "age": 38}
>>>
>>> dataflow = DataFlow(
...     operations={
...         "db_query_insert_or_update": db_query_insert_or_update.op,
...         "db_query_lookup": db_query_lookup.op,
...         "get_single": GetSingle.imp.op,
...     },
...     configs={
...         "db_query_insert_or_update": DatabaseQueryConfig(database=sdb),
...         "db_query_lookup": DatabaseQueryConfig(database=sdb),
...     },
...     seed=[],
... )
>>>
>>> inputs = {
...     "insert_or_update": [
...         Input(
...             value="myTable", definition=db_query_update.op.inputs["table_name"],
...         ),
...         Input(
...             value=person,
...             definition=db_query_update.op.inputs["data"],
...         ),
...     ],
...     "lookup": [
...         Input(
...             value="myTable",
...             definition=db_query_lookup.op.inputs["table_name"],
...         ),
...         Input(
...             value=["firstName", "lastName", "age"],
...             definition=db_query_lookup.op.inputs["cols"],
...         ),
...         Input(value=[], definition=db_query_lookup.op.inputs["conditions"],),
...         Input(
...             value=[db_query_lookup.op.outputs["lookups"].name],
...             definition=GetSingle.op.inputs["spec"],
...         ),
...     ],
... }
>>>
>>> async def main():
...     async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
...         if result:
...             print(result)
>>>
>>> asyncio.run(main())
{'query_lookups': [{'firstName': 'John', 'lastName': 'Wick', 'age': 38}]}
>>>
>>> person["age"] += 1
>>>
>>> asyncio.run(main())
{'query_lookups': [{'firstName': 'John', 'lastName': 'Wick', 'age': 39}]}

Stage: processing

Inputs

  • table_name: query_table(type: str)

  • data: query_data(type: Dict[str, Any])

Args

  • database: Entrypoint

db_query_lookup

Official

Generates a lookup query in the database.

Parameters

table_name: str

The name of the table.

cols: list[str]

Columns of the table.

conditions: Conditions

Query conditions.

Examples

>>> import asyncio
>>> from dffml import *
>>>
>>> sdb = SqliteDatabase(SqliteDatabaseConfig(filename="examples.db"))
>>>
>>> dataflow = DataFlow(
...     operations={
...         "db_query_lookup": db_query_lookup.op,
...         "get_single": GetSingle.imp.op,
...     },
...     configs={"db_query_lookup": DatabaseQueryConfig(database=sdb),},
...     seed=[],
... )
>>>
>>> inputs = {
...     "lookup": [
...         Input(
...             value="myTable",
...             definition=db_query_lookup.op.inputs["table_name"],
...         ),
...         Input(
...             value=["firstName", "lastName", "age"],
...             definition=db_query_lookup.op.inputs["cols"],
...         ),
...         Input(value=[], definition=db_query_lookup.op.inputs["conditions"],),
...         Input(
...             value=[db_query_lookup.op.outputs["lookups"].name],
...             definition=GetSingle.op.inputs["spec"],
...         ),
...     ],
... }
>>>
>>> async def main():
...     async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
...         if result:
...             print(result)
>>>
>>> asyncio.run(main())
{'query_lookups': [{'firstName': 'John', 'lastName': 'Doe', 'age': 16}, {'firstName': 'John', 'lastName': 'Wick', 'age': 39}]}

Stage: processing

Inputs

  • table_name: query_table(type: str)

  • cols: query_cols(type: Dict[str, str])

  • conditions: query_conditions(type: Conditions)

Outputs

  • lookups: query_lookups(type: Dict[str, Any])

Args

  • database: Entrypoint

db_query_remove

Official

Generates a remove query in the database.

Parameters

table_name: str

The name of the table to remove data from.

conditions: Conditions

Query conditions.

Examples

>>> import asyncio
>>> from dffml import *
>>>
>>> sdb = SqliteDatabase(SqliteDatabaseConfig(filename="examples.db"))
>>>
>>> dataflow = DataFlow(
...     operations={
...         "db_query_lookup": db_query_lookup.op,
...         "db_query_remove": db_query_remove.op,
...         "get_single": GetSingle.imp.op,
...     },
...     configs={
...         "db_query_remove": DatabaseQueryConfig(database=sdb),
...         "db_query_lookup": DatabaseQueryConfig(database=sdb),
...     },
...     seed=[],
... )
>>>
>>> inputs = {
...     "remove": [
...         Input(
...             value="myTable",
...             definition=db_query_remove.op.inputs["table_name"],
...         ),
...         Input(value=[],
...         definition=db_query_remove.op.inputs["conditions"],),
...     ],
...     "lookup": [
...         Input(
...             value="myTable",
...             definition=db_query_lookup.op.inputs["table_name"],
...         ),
...         Input(
...             value=["firstName", "lastName", "age"],
...             definition=db_query_lookup.op.inputs["cols"],
...         ),
...         Input(value=[], definition=db_query_lookup.op.inputs["conditions"],),
...         Input(
...             value=[db_query_lookup.op.outputs["lookups"].name],
...             definition=GetSingle.op.inputs["spec"],
...         ),
...     ],
... }
>>>
>>> async def main():
...     async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
...         if result:
...             print(result)
>>>
>>> asyncio.run(main())
{'query_lookups': []}

Stage: processing

Inputs

  • table_name: query_table(type: str)

  • conditions: query_conditions(type: Conditions)

Args

  • database: Entrypoint

db_query_update

Official

Generates an update query in the database.

Parameters

table_name: str

The name of the table to update data in.

data: dict

Data to be updated in the table.

conditions: list

List of query conditions.

Examples

>>> import asyncio
>>> from dffml import *
>>>
>>> sdb = SqliteDatabase(SqliteDatabaseConfig(filename="examples.db"))
>>>
>>> dataflow = DataFlow(
...     operations={
...         "db_query_update": db_query_update.op,
...         "db_query_lookup": db_query_lookup.op,
...         "get_single": GetSingle.imp.op,
...     },
...     configs={
...         "db_query_update": DatabaseQueryConfig(database=sdb),
...         "db_query_lookup": DatabaseQueryConfig(database=sdb),
...     },
...     seed=[],
... )
>>>
>>> inputs = {
...     "update": [
...         Input(
...             value="myTable",
...             definition=db_query_update.op.inputs["table_name"],
...         ),
...         Input(
...             value={
...                 "key": 10,
...                 "firstName": "John",
...                 "lastName": "Doe",
...                 "age": 17,
...             },
...             definition=db_query_update.op.inputs["data"],
...         ),
...         Input(value=[], definition=db_query_update.op.inputs["conditions"],),
...     ],
...     "lookup": [
...         Input(
...             value="myTable",
...             definition=db_query_lookup.op.inputs["table_name"],
...         ),
...         Input(
...             value=["firstName", "lastName", "age"],
...             definition=db_query_lookup.op.inputs["cols"],
...         ),
...         Input(value=[], definition=db_query_lookup.op.inputs["conditions"],),
...         Input(
...             value=[db_query_lookup.op.outputs["lookups"].name],
...             definition=GetSingle.op.inputs["spec"],
...         ),
...     ],
... }
>>>
>>> async def main():
...     async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
...         if result:
...             print(result)
>>>
>>> asyncio.run(main())
{'query_lookups': [{'firstName': 'John', 'lastName': 'Doe', 'age': 17}]}

Stage: processing

Inputs

  • table_name: query_table(type: str)

  • data: query_data(type: Dict[str, Any])

  • conditions: query_conditions(type: Conditions)

Args

  • database: Entrypoint

dffml.dataflow.run

Official

Starts a subflow using self.config.dataflow and adds the given inputs to it.

Parameters

inputs: dict

The inputs to add to the subflow. This should be a mapping of context strings to the lists of inputs which should be seeded for those contexts.

Returns

dict

Maps context strings in inputs to output after running through dataflow.

Examples

The following shows how to use run_dataflow with its default behavior.

>>> import asyncio
>>> from dffml import *
>>>
>>> URL = Definition(name="URL", primitive="string")
>>>
>>> subflow = DataFlow.auto(GetSingle)
>>> subflow.definitions[URL.name] = URL
>>> subflow.seed.append(
...     Input(
...         value=[URL.name],
...         definition=GetSingle.op.inputs["spec"]
...     )
... )
>>>
>>> dataflow = DataFlow.auto(run_dataflow, GetSingle)
>>> dataflow.configs[run_dataflow.op.name] = RunDataFlowConfig(subflow)
>>> dataflow.seed.append(
...     Input(
...         value=[run_dataflow.op.outputs["results"].name],
...         definition=GetSingle.op.inputs["spec"]
...     )
... )
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, {
...         "run_subflow": [
...             Input(
...                 value={
...                     "dffml": [
...                         {
...                             "value": "https://github.com/intel/dffml",
...                             "definition": URL.name
...                         }
...                     ]
...                 },
...                 definition=run_dataflow.op.inputs["inputs"]
...             )
...         ]
...     }):
...         print(results)
>>>
>>> asyncio.run(main())
{'flow_results': {'dffml': {'URL': 'https://github.com/intel/dffml'}}}

The following shows how to use run_dataflow with custom inputs and outputs. This allows you to run a subflow as if it were an operation.

>>> import asyncio
>>> from dffml import *
>>>
>>> URL = Definition(name="URL", primitive="string")
>>>
>>> @op(
...     inputs={"url": URL},
...     outputs={"last": Definition("last_element_in_path", primitive="string")},
... )
... def last_path(url):
...     return {"last": url.split("/")[-1]}
>>>
>>> subflow = DataFlow.auto(last_path, GetSingle)
>>> subflow.seed.append(
...     Input(
...         value=[last_path.op.outputs["last"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>>
>>> dataflow = DataFlow.auto(run_dataflow, GetSingle)
>>> dataflow.operations[run_dataflow.op.name] = run_dataflow.op._replace(
...     inputs={"URL": URL},
...     outputs={last_path.op.outputs["last"].name: last_path.op.outputs["last"]},
...     expand=[],
... )
>>> dataflow.configs[run_dataflow.op.name] = RunDataFlowConfig(subflow)
>>> dataflow.seed.append(
...     Input(
...         value=[last_path.op.outputs["last"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> dataflow.update(auto_flow=True)
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(
...         dataflow,
...         {
...             "run_subflow": [
...                 Input(value="https://github.com/intel/dffml", definition=URL)
...             ]
...         },
...     ):
...         print(results)
>>>
>>> asyncio.run(main())
{'last_element_in_path': 'dffml'}

Stage: processing

Inputs

  • inputs: flow_inputs(type: Dict[str,Any])

Outputs

  • results: flow_results(type: Dict[str,Any])

Args

  • dataflow: DataFlow

dffml.mapping.create

Official

Creates a mapping of a given key and value.

Parameters

key: str

The key for the mapping.

value: Any

The value for the mapping.

Returns

dict

A dictionary containing the mapping created.

Examples

>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(create_mapping, GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[create_mapping.op.outputs["mapping"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value="key1", definition=create_mapping.op.inputs["key"],
...     ),
...     Input(
...         value=42, definition=create_mapping.op.inputs["value"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
...         print(result)
>>>
>>> asyncio.run(main())
{'mapping': {'key1': 42}}

Stage: processing

Inputs

  • key: key(type: str)

  • value: value(type: generic)

Outputs

  • mapping: mapping(type: map)

dffml.mapping.extract

Official

Extracts value from a given mapping.

Parameters

mapping: dict

The mapping to extract the value from.

traverse: list[str]

A list of keys to traverse through the mapping dictionary and extract the values.

Returns

dict

A dictionary containing the value of the keys.

Examples

>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(mapping_extract_value, GetSingle)
>>>
>>> dataflow.seed.append(
...     Input(
...         value=[mapping_extract_value.op.outputs["value"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>> inputs = [
...     Input(
...         value={"key1": {"key2": 42}},
...         definition=mapping_extract_value.op.inputs["mapping"],
...     ),
...     Input(
...         value=["key1", "key2"],
...         definition=mapping_extract_value.op.inputs["traverse"],
...     ),
... ]
>>>
>>> async def main():
...     async for ctx, result in MemoryOrchestrator.run(dataflow, inputs):
...         print(result)
>>>
>>> asyncio.run(main())
{'value': 42}

Stage: processing

Inputs

  • mapping: mapping(type: map)

  • traverse: mapping_traverse(type: List[str])

Outputs

  • value: value(type: generic)

dffml.model.predict

Official

Predict using dffml models.

Parameters

features: dict

A dictionary containing feature names and feature values.

Returns

dict

A dictionary containing prediction.

Examples

The following example shows how to use model_predict.

>>> import asyncio
>>> from dffml import *
>>>
>>> slr_model = SLRModel(
...     features=Features(Feature("Years", int, 1)),
...     predict=Feature("Salary", int, 1),
...     location="tempdir",
... )
>>> dataflow = DataFlow(
...     operations={
...         "prediction_using_model": model_predict,
...         "get_single": GetSingle,
...     },
...     configs={"prediction_using_model": ModelPredictConfig(model=slr_model)},
... )
>>> dataflow.seed.append(
...     Input(
...         value=[model_predict.op.outputs["prediction"].name],
...         definition=GetSingle.op.inputs["spec"],
...     )
... )
>>>
>>> async def main():
...     await train(
...         slr_model,
...         {"Years": 0, "Salary": 10},
...         {"Years": 1, "Salary": 20},
...         {"Years": 2, "Salary": 30},
...         {"Years": 3, "Salary": 40},
...     )
...     inputs = [
...        Input(
...            value={"Years": 4}, definition=model_predict.op.inputs["features"],
...        )
...     ]
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(results)
>>>
>>> asyncio.run(main())
{'model_predictions': {'Salary': {'confidence': 1.0, 'value': 50}}}

Stage: processing

Inputs

  • features: record_features(type: Dict[str, Any])

Outputs

  • prediction: model_predictions(type: Dict[str, Any])

Args

  • model: Entrypoint

extract_tar_archive

Official

Extracts a given tar file.

Parameters

input_file_path: str

Path to the tar file

output_directory_path: str

Path where all the files should be extracted

Returns

dict

Path to the directory where the archive has been extracted
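A minimal usage sketch, assuming extract_tar_archive is importable from the top-level dffml package; the archive and directory paths below are placeholders.

import asyncio
from dffml import *

dataflow = DataFlow.auto(extract_tar_archive, GetSingle)
dataflow.seed.append(
    Input(
        value=[extract_tar_archive.op.outputs["output_path"].name],
        definition=GetSingle.op.inputs["spec"],
    )
)
inputs = [
    # Placeholder path to an existing tar archive
    Input(
        value="files.tar",
        definition=extract_tar_archive.op.inputs["input_file_path"],
    ),
    # Placeholder directory to extract into
    Input(
        value="extracted",
        definition=extract_tar_archive.op.inputs["output_directory_path"],
    ),
]

async def main():
    async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
        print(results)

asyncio.run(main())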

Stage: processing

Inputs

  • input_file_path: tar_file(type: str)

  • output_directory_path: directory(type: str)

Outputs

  • output_path: output_directory_path(type: str)

extract_zip_archive

Official

Extracts a given zip file.

Parameters

input_file_path: str

Path to the zip file

output_directory_path: str

Path where all the files should be extracted

Returns

dict

Path to the directory where the archive has been extracted

Stage: processing

Inputs

  • input_file_path: zip_file(type: str)

  • output_directory_path: directory(type: str)

Outputs

  • output_path: output_directory_path(type: str)

get_multi

Official

Output operation to get all Inputs matching given definitions.

Parameters

spec: list

List of definition names. Any Inputs with matching definition will be returned.

Returns

dict

Maps definition names to all the Inputs of that definition

Examples

The following shows how to grab all Inputs with the URL definition. If we had run an operation which output a URL, that output URL would also have been returned to us.

>>> import asyncio
>>> from dffml import *
>>>
>>> URL = Definition(name="URL", primitive="string")
>>>
>>> dataflow = DataFlow.auto(GetMulti)
>>> dataflow.seed.append(
...     Input(
...         value=[URL.name],
...         definition=GetMulti.op.inputs["spec"]
...     )
... )
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, [
...         Input(
...             value="https://github.com/intel/dffml",
...             definition=URL
...         ),
...         Input(
...             value="https://github.com/intel/cve-bin-tool",
...             definition=URL
...         )
...     ]):
...         print(results)
...
>>> asyncio.run(main())
{'URL': ['https://github.com/intel/dffml', 'https://github.com/intel/cve-bin-tool']}

Stage: output

Inputs

  • spec: get_multi_spec(type: array)

Outputs

  • output: get_multi_output(type: map)

get_single

Official

Output operation to get a single Input for each definition given.

Parameters

spec: list

List of definition names. An Input with a matching definition will be returned.

Returns

dict

Maps definition names to an Input of that definition

Examples

The following shows how to grab an Input with the URL definition. If we had run an operation which output a URL, that output URL could also have been returned to us.

>>> import asyncio
>>> from dffml import *
>>>
>>> URL = Definition(name="URL", primitive="string")
>>> ORG = Definition(name="ORG", primitive="string")
>>>
>>> dataflow = DataFlow.auto(GetSingle)
>>> dataflow.seed.append(
...     Input(
...         value=[{"Repo Link": URL.name}, ORG.name],
...         definition=GetSingle.op.inputs["spec"]
...     )
... )
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, [
...         Input(
...             value="https://github.com/intel/dffml",
...             definition=URL
...         ),
...         Input(
...             value="Intel",
...             definition=ORG
...         )
...     ]):
...         print(results)
...
>>> asyncio.run(main())
{'ORG': 'Intel', 'Repo Link': 'https://github.com/intel/dffml'}

Stage: output

Inputs

  • spec: get_single_spec(type: array)

Outputs

  • output: get_single_output(type: map)

group_by

Official

No description

Stage: output

Inputs

  • spec: group_by_spec(type: Dict[str, Any])

Outputs

  • output: group_by_output(type: Dict[str, List[Any]])

gz_compress

Official

No description

Stage: processing

Inputs

  • input_file_path: decompressed_gz_file_path(type: str)

  • output_file_path: compressed_gz_file_path(type: str)

Outputs

  • output_path: compressed_output_gz_file_path(type: str)

gz_decompress

Official

No description

Stage: processing

Inputs

  • input_file_path: compressed_gz_file_path(type: str)

  • output_file_path: decompressed_gz_file_path(type: str)

Outputs

  • output_path: decompressed_output_gz_file_path(type: str)

literal_eval

Official

Evaluate the input using ast.literal_eval()

Parameters

str_to_eval: str

A string to be evaluated.

Returns

dict

A dict containing the Python literal.

Examples

The following example shows how to use literal_eval.

>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(literal_eval, GetSingle)
>>> dataflow.seed.append(
...    Input(
...        value=[literal_eval.op.outputs["str_after_eval"].name,],
...        definition=GetSingle.op.inputs["spec"],
...    )
... )
>>> inputs = [
...    Input(
...        value="[1,2,3]",
...        definition=literal_eval.op.inputs["str_to_eval"],
...        parents=None,
...    )
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(results)
>>>
>>> asyncio.run(main())
{'EvaluatedStr': [1, 2, 3]}

Stage: processing

Inputs

  • str_to_eval: InputStr(type: str)

Outputs

  • str_after_eval: EvaluatedStr(type: generic)

make_tar_archive

Official

Creates a tar file of a directory.

Parameters

input_directory_path: str

Path to directory to be archived as a tarfile.

output_file_path: str

Path where the output archive should be saved (should include file name)

Returns

dict

Path to the created tar file.

Stage: processing

Inputs

  • input_directory_path: directory(type: str)

  • output_file_path: tar_file(type: str)

Outputs

  • output_path: output_tarfile_path(type: str)

make_zip_archive

Official

Creates a zip file of a directory.

Parameters

input_directory_path: str

Path to directory to be archived

output_file_path: str

Path where the output archive should be saved (should include file name)

Returns

dict

Path to the output zip file
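The archive-creation operations follow the same pattern as the extraction example above; here is a sketch for make_zip_archive, again with placeholder paths and assuming the operation is importable from the top-level dffml package.

import asyncio
from dffml import *

dataflow = DataFlow.auto(make_zip_archive, GetSingle)
dataflow.seed.append(
    Input(
        value=[make_zip_archive.op.outputs["output_path"].name],
        definition=GetSingle.op.inputs["spec"],
    )
)
inputs = [
    # Placeholder directory to archive
    Input(
        value="project_files",
        definition=make_zip_archive.op.inputs["input_directory_path"],
    ),
    # Placeholder path (including file name) for the resulting zip
    Input(
        value="project_files.zip",
        definition=make_zip_archive.op.inputs["output_file_path"],
    ),
]

async def main():
    async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
        print(results)

asyncio.run(main())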

Stage: processing

Inputs

  • input_directory_path: directory(type: str)

  • output_file_path: zip_file(type: str)

Outputs

  • output_path: output_zipfile_path(type: str)

multiply

Official

Multiply record values

Parameters

multiplicand: generic

An arithmetic type value.

multiplier: generic

An arithmetic type value.

Returns

dict

A dict containing the product.

Examples

The following example shows how to use multiply.

>>> import asyncio
>>> from dffml import *
>>>
>>> dataflow = DataFlow.auto(multiply, GetSingle)
>>> dataflow.seed.append(
...    Input(
...        value=[multiply.op.outputs["product"].name,],
...        definition=GetSingle.op.inputs["spec"],
...    )
... )
>>> inputs = [
...    Input(
...        value=12,
...        definition=multiply.op.inputs["multiplicand"],
...    ),
...    Input(
...        value=3,
...        definition=multiply.op.inputs["multiplier"],
...    ),
... ]
>>>
>>> async def main():
...     async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
...         print(results)
>>>
>>> asyncio.run(main())
{'product': 36}

Stage: processing

Inputs

  • multiplicand: multiplicand_def(type: generic)

  • multiplier: multiplier_def(type: generic)

Outputs

  • product: product(type: generic)

xz_compress

Official

No description

Stage: processing

Inputs

  • input_file_path: decompressed_xz_file_path(type: str)

  • output_file_path: compressed_xz_file_path(type: str)

Outputs

  • output_path: compressed_output_xz_file_path(type: str)

xz_decompress

Official

No description

Stage: processing

Inputs

  • input_file_path: compressed_xz_file_path(type: str)

  • output_file_path: decompressed_xz_file_path(type: str)

Outputs

  • output_path: decompressed_output_xz_file_path(type: str)

dffml_operations_image

pip install dffml-operations-image

Haralick

Official

Computes Haralick texture features

Stage: processing

Inputs

  • f: Haralick.inputs.f(type: array)

  • ignore_zeros: Haralick.inputs.ignore_zeros(type: bool)

  • preserve_haralick_bug: Haralick.inputs.preserve_haralick_bug(type: bool)

  • compute_14th_feature: Haralick.inputs.compute_14th_feature(type: bool)

  • return_mean: Haralick.inputs.return_mean(type: bool)

  • return_mean_ptp: Haralick.inputs.return_mean_ptp(type: bool)

  • use_x_minus_y_variance: Haralick.inputs.use_x_minus_y_variance(type: bool)

  • distance: Haralick.inputs.distance(type: int)

Outputs

  • result: Haralick.outputs.result(type: array)

HuMoments

Official

Calculates seven Hu invariants

Stage: processing

Inputs

  • m: HuMoments.inputs.m(type: array)

Outputs

  • result: HuMoments.outputs.result(type: array)

calcHist

Official

Calculates a histogram

Stage: processing

Inputs

  • images: calcHist.inputs.images(type: array)

  • channels: calcHist.inputs.channels(type: array)

  • mask: calcHist.inputs.mask(type: array)

  • histSize: calcHist.inputs.histSize(type: array)

  • ranges: calcHist.inputs.ranges(type: array)

Outputs

  • result: calcHist.outputs.result(type: array)

convert_color

Official

Converts images from one color space to another

Stage: processing

Inputs

  • src: convert_color.inputs.src(type: array)

  • code: convert_color.inputs.code(type: str)

Outputs

  • result: convert_color.outputs.result(type: array)

flatten

Official

No description

Stage: processing

Inputs

  • array: flatten.inputs.array(type: array)

Outputs

  • result: flatten.outputs.result(type: array)

normalize

Official

Normalizes arrays

Stage: processing

Inputs

  • src: normalize.inputs.src(type: array)

  • alpha: normalize.inputs.alpha(type: int)

  • beta: normalize.inputs.beta(type: int)

  • norm_type: normalize.inputs.norm_type(type: int)

  • dtype: normalize.inputs.dtype(type: int)

  • mask: normalize.inputs.mask(type: array)

Outputs

  • result: normalize.outputs.result(type: array)

resize

Official

Resizes image array to the specified new dimensions

  • If the new dimensions are in 2D, the image is converted to grayscale.

  • To enlarge the image (src dimensions < dsize), it will resize the image with INTER_CUBIC interpolation.

  • To shrink the image (src dimensions > dsize), it will resize the image with INTER_AREA interpolation.

Stage: processing

Inputs

  • src: resize.inputs.src(type: array)

  • dsize: resize.inputs.dsize(type: array)

  • fx: resize.inputs.fx(type: float)

  • fy: resize.inputs.fy(type: float)

  • interpolation: resize.inputs.interpolation(type: int)

Outputs

  • result: resize.outputs.result(type: array)

dffml_feature_git

pip install dffml-feature-git

check_if_valid_git_repository_URL

Official

No description

Stage: processing

Inputs

  • URL: URL(type: string)

Outputs

  • valid: valid_git_repository_URL(type: boolean)

cleanup_git_repo

Official

No description

Stage: cleanup

Inputs

  • repo: git_repository(type: Dict[str, str])

    • directory: str

    • URL: str(default: None)

clone_git_repo

Official

No description

Stage: processing

Inputs

  • URL: URL(type: string)

Outputs

  • repo: git_repository(type: Dict[str, str])

    • directory: str

    • URL: str(default: None)

Conditions

  • valid_git_repository_URL: boolean

count_authors

Official

No description

Stage: processing

Inputs

  • author_lines: author_line_count(type: Dict[str, int])

Outputs

  • authors: author_count(type: int)

git_commits

Official

No description

Stage: processing

Inputs

  • repo: git_repository(type: Dict[str, str])

    • directory: str

    • URL: str(default: None)

  • branch: git_branch(type: str)

  • start_end: date_pair(type: List[date])

Outputs

  • commits: commit_count(type: int)

git_repo_author_lines_for_dates

Official

No description

Stage: processing

Inputs

  • repo: git_repository(type: Dict[str, str])

    • directory: str

    • URL: str(default: None)

  • branch: git_branch(type: str)

  • start_end: date_pair(type: List[date])

Outputs

  • author_lines: author_line_count(type: Dict[str, int])

git_repo_checkout

Official

No description

Stage: processing

Inputs

  • repo: git_repository(type: Dict[str, str])

    • directory: str

    • URL: str(default: None)

  • commit: git_commit(type: string)

Outputs

  • repo: git_repository_checked_out(type: Dict[str, str])

    • directory: str

    • URL: str(default: None)

    • commit: str(default: None)

git_repo_commit_from_date

Official

No description

Stage: processing

Inputs

  • repo: git_repository(type: Dict[str, str])

    • directory: str

    • URL: str(default: None)

  • branch: git_branch(type: str)

  • date: date(type: string)

Outputs

  • commit: git_commit(type: string)

git_repo_default_branch

Official

No description

Stage: processing

Inputs

  • repo: git_repository(type: Dict[str, str])

    • directory: str

    • URL: str(default: None)

Outputs

  • branch: git_branch(type: str)

Conditions

  • no_git_branch_given: boolean

git_repo_release

Official

Was there a release within this date range

Stage: processing

Inputs

  • repo: git_repository(type: Dict[str, str])

    • directory: str

    • URL: str(default: None)

  • branch: git_branch(type: str)

  • start_end: date_pair(type: List[date])

Outputs

  • present: release_within_period(type: bool)

lines_of_code_by_language

Official

This operation relies on tokei. Here’s how to install version 10.1.1; check its releases page to make sure you’re installing the latest version.

On Linux

$ curl -sSL 'https://github.com/XAMPPRocky/tokei/releases/download/v10.1.1/tokei-v10.1.1-x86_64-unknown-linux-gnu.tar.gz' \
  | tar -xvz && \
  echo '22699e16e71f07ff805805d26ee86ecb9b1052d7879350f7eb9ed87beb0e6b84fbb512963d01b75cec8e80532e4ea29a tokei' | sha384sum -c - && \
  sudo mv tokei /usr/local/bin/

On OSX

$ curl -sSL 'https://github.com/XAMPPRocky/tokei/releases/download/v10.1.1/tokei-v10.1.1-x86_64-apple-darwin.tar.gz' \
  | tar -xvz && \
  echo '8c8a1d8d8dd4d8bef93dabf5d2f6e27023777f8553393e269765d7ece85e68837cba4374a2615d83f071dfae22ba40e2 tokei' | sha384sum -c - && \
  sudo mv tokei /usr/local/bin/

Stage: processing

Inputs

  • repo: git_repository_checked_out(type: Dict[str, str])

    • directory: str

    • URL: str(default: None)

    • commit: str(default: None)

Outputs

  • lines_by_language: lines_by_language_count(type: Dict[str, Dict[str, int]])

lines_of_code_to_comments

Official

No description

Stage: processing

Inputs

  • langs: lines_by_language_count(type: Dict[str, Dict[str, int]])

Outputs

  • code_to_comment_ratio: language_to_comment_ratio(type: int)

make_quarters

Official

No description

Stage: processing

Inputs

  • number: quarters(type: int)

Outputs

  • quarters: quarter(type: int)

quarters_back_to_date

Official

No description

Stage: processing

Inputs

  • date: quarter_start_date(type: int)

  • number: quarter(type: int)

Outputs

  • date: date(type: string)

  • start_end: date_pair(type: List[date])

work

Official

No description

Stage: processing

Inputs

  • author_lines: author_line_count(type: Dict[str, int])

Outputs

  • work: work_spread(type: int)

dffml_feature_auth

pip install dffml-feature-auth

scrypt

Official

No description

Stage: processing

Inputs

  • password: UnhashedPassword(type: string)

Outputs

  • password: ScryptPassword(type: string)

dffml_operations_deploy

pip install dffml-operations-deploy

check_if_default_branch

Official

No description

Stage: processing

Inputs

  • payload: git_payload(type: Dict[str,Any])

Outputs

  • is_default_branch: is_default_branch(type: bool)

check_secret_match

Official

No description

Stage: processing

Inputs

  • headers: webhook_headers(type: Dict[str,Any])

  • body: payload(type: bytes)

Outputs

  • git_payload: git_payload(type: Dict[str,Any])

Args

  • secret: Entrypoint

docker_build_image

Official

No description

Stage: processing

Inputs

  • docker_commands: docker_commands(type: Dict[str,Any])

Outputs

  • build_status: is_image_built(type: bool)

Conditions

  • is_default_branch: bool

  • got_running_containers: bool

get_image_tag

Official

No description

Stage: processing

Inputs

  • payload: git_payload(type: Dict[str,Any])

Outputs

  • image_tag: docker_image_tag(type: str)

get_running_containers

Official

No description

Stage: processing

Inputs

  • tag: docker_image_tag(type: str)

Outputs

  • running_containers: docker_running_containers(type: List[str])

get_status_running_containers

Official

No description

Stage: processing

Inputs

  • containers: docker_running_containers(type: List[str])

Outputs

  • status: got_running_containers(type: bool)

get_url_from_payload

Official

No description

Stage: processing

Inputs

  • payload: git_payload(type: Dict[str,Any])

Outputs

  • url: URL(type: string)

parse_docker_commands

Official

No description

Stage: processing

Inputs

  • repo: git_repository(type: Dict[str, str])

    • directory: str

    • URL: str(default: None)

  • image_tag: docker_image_tag(type: str)

Outputs

  • docker_commands: docker_commands(type: Dict[str,Any])

restart_running_containers

Official

No description

Stage: processing

Inputs

  • docker_commands: docker_commands(type: Dict[str,Any])

  • containers: docker_running_containers(type: List[str])

Outputs

  • containers: docker_restarted_containers(type: str)

Conditions

  • is_image_built: bool

dffml_operations_data

pip install dffml-operations-data

one_hot_encoder

Official

One hot encoding for categorical data columns

Parameters

data: List[List[int]]

Data to be encoded.

categories: List[List[str]]

Categorical values which need to be encoded

Returns

result: Encoded data for categorical values
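A usage sketch, assuming dffml-operations-data is installed and that one_hot_encoder is importable as shown; the import path and the sample values are assumptions.

import asyncio
from dffml import DataFlow, Input, GetSingle, MemoryOrchestrator
from dffml_operations_data.operations import one_hot_encoder

dataflow = DataFlow.auto(one_hot_encoder, GetSingle)
dataflow.seed.append(
    Input(
        value=[one_hot_encoder.op.outputs["result"].name],
        definition=GetSingle.op.inputs["spec"],
    )
)
inputs = [
    # Sample categorical column: three rows, one column
    Input(
        value=[[0], [1], [2]],
        definition=one_hot_encoder.op.inputs["data"],
    ),
    # The categories present in that column
    Input(
        value=[[0, 1, 2]],
        definition=one_hot_encoder.op.inputs["categories"],
    ),
]

async def main():
    async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
        print(results)

asyncio.run(main())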

Stage: processing

Inputs

  • data: input_data(type: List[List[int]])

  • categories: categories(type: List[List[Any]])

Outputs

  • result: output_data(type: List[List[int]])

ordinal_encoder

Official

Ordinal encoding for categorical data columns

Parameters

data: List[List[int]]

Data to be encoded.

categories: List[List[str]]

Categorical values which need to be encoded

Returns

result: Encoded data for categorical values

Stage: processing

Inputs

  • data: input_data(type: List[List[int]])

Outputs

  • result: output_data(type: List[List[int]])

principal_component_analysis

Official

Decomposes the data into (n_samples, n_components) using PCA method

Parameters

data: List[List[int]]

Data to be decomposed.

n_components: int

Number of columns the data should have after decomposition.

Returns

result: Data having dimensions (n_samples, n_components)
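A sketch along the same lines as the one_hot_encoder example above, with made-up numeric data; the import path is an assumption.

import asyncio
from dffml import DataFlow, Input, GetSingle, MemoryOrchestrator
from dffml_operations_data.operations import principal_component_analysis

dataflow = DataFlow.auto(principal_component_analysis, GetSingle)
dataflow.seed.append(
    Input(
        value=[principal_component_analysis.op.outputs["result"].name],
        definition=GetSingle.op.inputs["spec"],
    )
)
inputs = [
    # Made-up 4 x 3 matrix to decompose
    Input(
        value=[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
        definition=principal_component_analysis.op.inputs["data"],
    ),
    # Keep two components per sample
    Input(
        value=2,
        definition=principal_component_analysis.op.inputs["n_components"],
    ),
]

async def main():
    async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
        print(results)

asyncio.run(main())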

Stage: processing

Inputs

  • data: input_data(type: List[List[int]])

  • n_components: n_components(type: int)

Outputs

  • result: output_data(type: List[List[int]])

remove_whitespaces

Official

Removes whitespace from the dataset

Parameters

data: List[List[int]]

The dataset.

Returns

result: dataset having whitespaces removed

Stage: processing

Inputs

  • data: input_data(type: List[List[int]])

Outputs

  • result: output_data(type: List[List[int]])

simple_imputer

Official

Imputation method for missing values

Parameters

data: List[List[int]]

Data in which missing values are present

missing_values: Any (str, int, float, None), default = np.nan

The value present in place of a missing value

strategy: str (“mean”, “median”, “constant”, “most_frequent”), default = “mean”

The imputation strategy used to fill in missing values

Returns

result: Dataset having missing values imputed with the strategy

Stage: processing

Inputs

  • data: input_data(type: List[List[int]])

  • missing_values: missing_values(type: Any)

  • strategy: strategy(type: str)

Outputs

  • result: output_data(type: List[List[int]])

singular_value_decomposition

Official

Decomposes the data into (n_samples, n_components) using SVD method.

Parameters

data: List[List[int]]

Data to be decomposed.

n_components: int

Number of columns the data should have after decomposition.

Returns

result: Data having dimensions (n_samples, n_components)

Stage: processing

Inputs

  • data: input_data(type: List[List[int]])

  • n_components: n_components(type: int)

  • n_iter: n_iter(type: int)

  • random_state: random_state(type: int)

Outputs

  • result: output_data(type: List[List[int]])

standard_scaler

Official

Standardize features by removing the mean and scaling to unit variance.

Parameters

data: List[List[int]]

data that needs to be standardized

Returns

result: Standardized data

Stage: processing

Inputs

  • data: input_data(type: List[List[int]])

Outputs

  • result: output_data(type: List[List[int]])

dffml_operations_binsec

pip install dffml-operations-binsec

cleanup_rpm

Official

No description

Stage: cleanup

Inputs

  • rpm: RPMObject(type: python_obj)

files_in_rpm

Official

No description

Stage: processing

Inputs

  • rpm: RPMObject(type: python_obj)

Outputs

  • files: rpm_filename(type: str)

is_binary_pie

Official

No description

Stage: processing

Inputs

  • rpm: RPMObject(type: python_obj)

  • filename: rpm_filename(type: str)

Outputs

  • is_pie: binary_is_PIE(type: bool)

url_to_urlbytes

Official

No description

Stage: processing

Inputs

  • URL: URL(type: string)

Outputs

  • download: URLBytes(type: python_obj)

urlbytes_to_rpmfile

Official

No description

Stage: processing

Inputs

  • download: URLBytes(type: python_obj)

Outputs

  • rpm: RPMObject(type: python_obj)

urlbytes_to_tarfile

Official

No description

Stage: processing

Inputs

  • download: URLBytes(type: python_obj)

Outputs

  • rpm: RPMObject(type: python_obj)

shouldi

pip install shouldi

cleanup_pypi_package

Official

Remove the directory containing the source code release.

Stage: cleanup

Inputs

  • directory: run_bandit.inputs.pkg(type: str)

pypi_package_contents

Official

Download a source code release and extract it to a temporary directory.

Stage: processing

Inputs

  • url: pypi_package_contents.inputs.url(type: str)

Outputs

  • directory: run_bandit.inputs.pkg(type: str)

pypi_package_json

Official

Download the information on the package in JSON format.

Stage: processing

Inputs

  • package: safety_check.inputs.package(type: str)

Outputs

  • version: safety_check.inputs.version(type: str)

  • url: pypi_package_contents.inputs.url(type: str)

run_bandit

Official

CLI usage: dffml service dev run -log debug shouldi.bandit:run_bandit -pkg .

Stage: processing

Inputs

  • pkg: run_bandit.inputs.pkg(type: str)

Outputs

  • result: run_bandit.outputs.result(type: map)

safety_check

Official

No description

Stage: processing

Inputs

  • package: safety_check.inputs.package(type: str)

  • version: safety_check.inputs.version(type: str)

Outputs

  • result: safety_check.outputs.result(type: int)

dffml_operations_nlp

pip install dffml-operations-nlp

collect_output

Official

No description

Stage: processing

Inputs

  • sentence: sentence(type: string)

  • length: source_length(type: string)

Outputs

  • all: all_sentences(type: List[string])

count_vectorizer

Official

Converts a collection of text documents to a matrix of token counts using sklearn CountVectorizer’s fit_transform method. For details on parameters, check https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html. Parameters specific to this operation are described below.

Parameters

text: list

A list of strings.

get_feature_names: bool

If True return feature names using get_feature_names method of CountVectorizer.

Returns

result: list

A list containing token counts and feature names if get_feature_names is True.

Stage: processing

Inputs

  • text: count_vectorizer.inputs.text(type: array)

  • encoding: count_vectorizer.inputs.encoding(type: str)

  • decode_error: count_vectorizer.inputs.decode_error(type: str)

  • strip_accents: count_vectorizer.inputs.strip_accents(type: str)

  • lowercase: count_vectorizer.inputs.lowercase(type: bool)

  • stop_words: count_vectorizer.inputs.stop_words(type: str)

  • token_pattern: count_vectorizer.inputs.token_pattern(type: str)

  • ngram_range: count_vectorizer.inputs.ngram_range(type: array)

  • analyzer: count_vectorizer.inputs.analyzer(type: str)

  • max_df: count_vectorizer.inputs.max_df(type: float)

  • min_df: count_vectorizer.inputs.min_df(type: float)

  • max_features: count_vectorizer.inputs.max_features(type: int)

  • vocabulary: count_vectorizer.inputs.vocabulary(type: map)

  • binary: count_vectorizer.inputs.binary(type: bool)

  • get_feature_names: count_vectorizer.inputs.get_feature_names(type: bool)

Outputs

  • result: count_vectorizer.outputs.result(type: array)

extract_array_from_matrix

Official

Returns row from input_matrix based on index of single_text_example in collected_text.

Parameters

single_text_example: str

String to be used for indexing into collected_text.

collected_text: list

List of strings.

input_matrix: list

A 2-D matrix where each row represents vector corresponding to single_text_example.

Returns

result: A 1-d array.

Stage: processing

Inputs

  • single_text_example: extract_array_from_matrix.inputs.single_text_example(type: str)

  • collected_text: extract_array_from_matrix.inputs.collected_text(type: array)

  • input_matrix: extract_array_from_matrix.inputs.input_matrix(type: array)

Outputs

  • result: extract_array_from_matrix.outputs.result(type: array)

get_embedding

Official

Maps words of text data to their corresponding word vectors.

Parameters

text: str

String to be converted to word vectors.

max_len: int

Maximum length of sentence. If the length of text > max_len, text is truncated to have length = max_len. If the length of text < max_len, text is padded with pad_token such that len(text) = max_len.

pad_token: str

Token to be used for padding text if len(text) < max_len

spacy_model: str

Spacy model to be used for assigning vectors to tokens.

Returns

result: A 2-d array of shape (max_len, embedding_size of vectors).

Stage: processing

Inputs

  • text: text_def(type: str)

  • spacy_model: spacy_model_name_def(type: str)

  • max_len: max_len_def(type: int)

  • pad_token: pad_token_def(type: str)

Outputs

  • embedding: embedding(type: generic)

get_noun_chunks

Official

Extracts the noun chunks from text.

Parameters

text: str

String to extract noun chunks from.

spacy_model: str

A spacy model with the capability of parsing.

Returns

result: list

A list containing noun chunks.

Stage: processing

Inputs

  • text: get_noun_chunks.inputs.text(type: str)

  • spacy_model: get_noun_chunks.inputs.spacy_model(type: str)

Outputs

  • result: get_noun_chunks.outputs.result(type: array)

get_sentences

Official

Extracts the sentences from text.

Parameters

text: str

String to extract sentences from.

spacy_model: str

A spacy model with the capability of parsing. Sentence boundaries are calculated from the syntactic dependency parse.

Returns

result: list

A list containing sentences.

Stage: processing

Inputs

  • text: get_sentences.inputs.text(type: str)

  • spacy_model: get_sentences.inputs.spacy_model(type: str)

Outputs

  • result: get_sentences.outputs.result(type: array)

get_similarity

Official

Calculates similarity between two text strings as a score between 0 and 1.

Parameters

text_1: str

First string to compare.

text_2: str

Second string to compare.

spacy_model: str

Spacy model to be used for extracting word vectors which are used for calculating similarity.

Returns

result: float

A similarity score between 0 and 1.

Stage: processing

Inputs

  • text_1: get_similarity.inputs.text_1(type: str)

  • text_2: get_similarity.inputs.text_2(type: str)

  • spacy_model: get_similarity.inputs.spacy_model(type: str)

Outputs

  • result: get_similarity.outputs.result(type: float)

lemmatizer

Official

Reduce words in the text to their dictionary form (lemma)

Parameters

text: str

String to lemmatize.

spacy_model: str

Spacy model to be used for lemmatization.

Returns

result: list

A list containing base form of the words.

Stage: processing

Inputs

  • text: lemmatizer.inputs.text(type: str)

  • spacy_model: lemmatizer.inputs.spacy_model(type: str)

Outputs

  • result: lemmatizer.outputs.result(type: array)

pos_tagger

Official

Assigns part-of-speech tags to text.

Parameters

text: str

Text to be tagged.

spacy_model: str

A spacy model with tagger and parser.

Returns

result: list

A list containing tuples of word and their respective pos tag.

Stage: processing

Inputs

  • text: pos_tagger.inputs.text(type: str)

  • spacy_model: pos_tagger.inputs.spacy_model(type: str)

  • tag_type: pos_tagger.inputs.tag_type(type: str)

Outputs

  • result: pos_tagger.outputs.result(type: array)

remove_stopwords

Official

Removes stopword from text data.

Parameters

text: str

String to be cleaned.

custom_stop_words: List[str], default = None

List of words to be considered as stop words.

Returns

result: A string without stop words.
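A usage sketch, assuming dffml-operations-nlp and its NLP dependencies are installed and that remove_stopwords is importable as shown; the import path and the sample sentence are assumptions.

import asyncio
from dffml import DataFlow, Input, GetSingle, MemoryOrchestrator
from dffml_operations_nlp.operations import remove_stopwords

dataflow = DataFlow.auto(remove_stopwords, GetSingle)
dataflow.seed.append(
    Input(
        value=[remove_stopwords.op.outputs["result"].name],
        definition=GetSingle.op.inputs["spec"],
    )
)
inputs = [
    # Sample sentence to clean
    Input(
        value="The quick brown fox jumps over the lazy dog",
        definition=remove_stopwords.op.inputs["text"],
    ),
    # Words to treat as stop words for this run
    Input(
        value=["the", "over"],
        definition=remove_stopwords.op.inputs["custom_stop_words"],
    ),
]

async def main():
    async for ctx, results in MemoryOrchestrator.run(dataflow, inputs):
        print(results)

asyncio.run(main())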

Stage: processing

Inputs

  • text: remove_stopwords.inputs.text(type: str)

  • custom_stop_words: remove_stopwords.inputs.custom_stop_words(type: array)

Outputs

  • result: remove_stopwords.outputs.result(type: str)

tfidf_vectorizer

Official

Convert a collection of raw documents to a matrix of TF-IDF features using sklearn TfidfVectorizer’s fit_transform method. For details on parameters, check https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html. Parameters specific to this operation are described below.

Parameters

text: list

A list of strings.

get_feature_names: bool

If True return feature names using get_feature_names method of TfidfVectorizer.

Returns

result: list

A list containing token counts and feature names if get_feature_names is True.

Stage: processing

Inputs

  • text: tfidf_vectorizer.inputs.text(type: array)

  • encoding: tfidf_vectorizer.inputs.encoding(type: str)

  • decode_error: tfidf_vectorizer.inputs.decode_error(type: str)

  • strip_accents: tfidf_vectorizer.inputs.strip_accents(type: str)

  • lowercase: tfidf_vectorizer.inputs.lowercase(type: bool)

  • analyzer: tfidf_vectorizer.inputs.analyzer(type: str)

  • stop_words: tfidf_vectorizer.inputs.stop_words(type: str)

  • token_pattern: tfidf_vectorizer.inputs.token_pattern(type: str)

  • ngram_range: tfidf_vectorizer.inputs.ngram_range(type: array)

  • max_df: tfidf_vectorizer.inputs.max_df(type: str)

  • min_df: tfidf_vectorizer.inputs.min_df(type: str)

  • max_features: tfidf_vectorizer.inputs.max_features(type: str)

  • vocabulary: tfidf_vectorizer.inputs.vocabulary(type: str)

  • binary: tfidf_vectorizer.inputs.binary(type: bool)

  • norm: tfidf_vectorizer.inputs.norm(type: str)

  • use_idf: tfidf_vectorizer.inputs.use_idf(type: bool)

  • smooth_idf: tfidf_vectorizer.inputs.smooth_idf(type: bool)

  • sublinear_tf: tfidf_vectorizer.inputs.sublinear_tf(type: bool)

  • get_feature_names: tfidf_vectorizer.inputs.get_feature_names(type: bool)

Outputs

  • result: tfidf_vectorizer.outputs.result(type: array)