Data Nodes

Dataset Node

The Dataset node lets you either generate a new dataset or leverage an existing one as a filter or storage container within your data pipeline. You can incorporate it at the beginning, middle, or end of your pipeline, depending on your data processing workflow.

Use the Dataset node at the beginning of the pipeline: it filters the triggered items, ensuring that the pipeline works exclusively with items from the selected dataset (and folder, if specified).

Executing Pipeline over Existing Data

When using the Dataset node at the start of the pipeline, the items in the dataset won't be automatically triggered when the pipeline activates. To trigger these items, use an event trigger, manually execute the pipeline via the Dataset Browser (by filtering the relevant items), or use the Dataloop SDK.
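For example, here is a minimal sketch of the SDK option, using the Dataloop Python SDK (dtlpy). The project, dataset, pipeline, and folder names are placeholders to replace with your own:

    import dtlpy as dl

    # Placeholder names -- replace with your own project, dataset, and pipeline.
    project = dl.projects.get(project_name='My Project')
    dataset = project.datasets.get(dataset_name='My Dataset')
    pipeline = project.pipelines.get(pipeline_name='my-pipeline')

    # Select the existing items to feed into the pipeline (here: everything under /train).
    filters = dl.Filters(field='dir', values='/train')

    # Execute the pipeline once per matching item.
    for item in dataset.items.list(filters=filters).all():
        pipeline.execute(execution_input={'item': item.id})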

Use the Dataset node as an intermediate or end point in the pipeline: it clones the delivered items to the root folder of the specified dataset, or to the selected folder if one is chosen. If an item already exists at the target location, it is skipped.

Details

When you click on a Dataset node, its details, such as Configuration, Executions, Logs, and available Actions, are shown on the right-side panel.

For the actions available on each node in the right-side panel, see the Pipeline Node Actions.

The Dataset node details are presented in three tabs as follows:

Config Tab

  • Dataset: Select an existing dataset, or click Create Dataset to create a new dataset. A Dataset node can only have one output channel and one input channel.
  • Set Fixed Dataset or Set Variable: Allows you to set the selected dataset either as a fixed dataset or as a pipeline variable.
  • Folder (Optional): Select a folder within the selected dataset. This option will not be accessible if no dataset is selected.
  • Trigger Existing Dataset and Folder Data to the Pipeline:
    • Enable this option to automatically load existing data into the pipeline's Dataset node upon activation, based on the chosen dataset, folder, and any DQL filter in the trigger (a minimal filter sketch follows this list).
    • This option is only available when this node is the start node.
    • Note: This is a one-time action. It does not re-trigger after changes to the dataset, folder, or filters, or if the pipeline is paused and resumed.
  • Node Input: Input channels are of type item by default. Click Set Parameter to set an input parameter for the Dataset node. For more information, see the Node Inputs article.
  • Node Output: Output channels are of type item by default.
  • Trigger (Optional): An Event/Cron trigger can be set on this node, enabling you to initiate the pipeline run from this specific point.
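As referenced above, here is a minimal sketch of building such a DQL filter with the Dataloop Python SDK; the /train folder and the annotated-items condition are illustrative assumptions:

    import dtlpy as dl

    # Build a DQL filter like the one you would attach to the trigger:
    # only annotated items under the /train folder.
    filters = dl.Filters(resource=dl.FiltersResource.ITEM)
    filters.add(field='dir', values='/train')
    filters.add(field='annotated', values=True)

    # prepare() returns the raw DQL JSON, which you can inspect or adapt.
    print(filters.prepare())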

For information on the Executions and Logs tabs, see the Node Details article.

Update Variable

The Update Variable node allows you to manage pipeline variables and update their values dynamically during pipeline execution.

  • You can select the required variables from the dropdown list.
  • The node input/output will be updated automatically according to your selection.
  • When the Update Variable node executes, its delivered input is set as the new value of the variable.
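Pipeline variables can also be read and updated from the Dataloop Python SDK. The following is a minimal sketch, assuming the SDK exposes a pipeline.variables list of name/value objects (verify against your SDK version); the variable name target_dataset is hypothetical:

    import dtlpy as dl

    pipeline = dl.pipelines.get(pipeline_id='<pipeline-id>')  # placeholder ID

    # Inspect the current variables and their values.
    for variable in pipeline.variables:
        print(variable.name, variable.value)

    # Set a new value outside a run. The Update Variable node does the runtime
    # equivalent, using its delivered input as the variable's new value.
    for variable in pipeline.variables:
        if variable.name == 'target_dataset':  # hypothetical variable name
            variable.value = '<new-dataset-id>'
    pipeline.update()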

Details

When you click on an Update Variable node, its details, such as Configuration, Executions, Logs, Instances, and available Actions, are shown on the right-side panel.

For the actions available on each node in the right-side panel, see the Pipeline Node Actions.

The Update Variable node details are presented in four tabs as follows:

Config Tab

  • Node Name: By default, the node is named Update Variable. Edit it as needed.
  • Variables: Select the required variables from the dropdown list, or create a new one.
  • Node Input: Set automatically after selecting a variable. Click Set Parameter to set an input parameter for the Update Variable node. For more information, see the Node Inputs article.
  • Node Output: Set automatically after selecting a variable.
  • Trigger (Optional): An Event/Cron trigger can be set on this node, enabling you to initiate the pipeline run from this specific point.

For information on the Executions, Logs, and Instances tabs, see the Node Details article.

Data Split Node

The Data Split node is a powerful data processing tool that allows you to randomly split your data into multiple groups at runtime. Whether you need to sample items for QA tasks or allocate your ground truth into training, test, and validation sets, the Data Split node simplifies the process.

Simply define the groups, set their distribution, and optionally tag each item with its assigned group. The tag will be appended to the item's metadata under metadata.system.tags (list). Use the Data Split node at any point in the pipeline to tailor the data processing.

Group limitations

  • Minimum groups: 2
  • Maximum groups: 5
  • Distribution must sum to 100%

For instance, to sample 20% of the annotated data for review (QA Task), create two groups ("Sampled"/"Not_Sampled") and set the required distribution (20-80). Afterward, add a node connection from the "Sampled" group to the QA task, ensuring that only 20% of the data is directed for QA during runtime.
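Downstream, the assigned tag can be used to query items with the SDK. The following is a minimal sketch, assuming tags are stored as a list under metadata.system.tags as described above and a group named "Sampled"; verify the exact operator against the DQL documentation:

    import dtlpy as dl

    dataset = dl.datasets.get(dataset_id='<dataset-id>')  # placeholder ID

    # Match items whose system tag list contains the group name assigned by the split.
    filters = dl.Filters(
        field='metadata.system.tags',
        values=['Sampled'],
        operator=dl.FiltersOperations.IN,
    )

    for item in dataset.items.list(filters=filters).all():
        print(item.name)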

Node Actions Menu

For the actions available on each node in the right-side panel, see the Pipeline Node Actions.

The Data Split node details are presented in four tabs as follows:

Config Tab

  • Node Name: Display name on the canvas.
  • Groups and Distribution: Allows you to create groups and manage the data distribution (%). You must define at least 2 and at most 5 groups.
  • Distribute equally: Select this option to force equal distribution across the groups.
  • Group Name and Distribution fields: Enter the name for the groups and add distribution percentages.
  • Item Tags:
    • Tag items based on their assigned group name: Selected by default, this option adds a metadata tag to each item once it is assigned to a group. The tag is the group name and is added to the item's metadata field metadata.system.tags (list).
    • Override existing item tags: When you select this option, tags that already exist on an item are replaced with the newly assigned tag. This option is disabled if you clear the option above.

  • Node Input: The item that will be automatically assigned to a group (randomly, based on the required distribution). Click Set Parameter to set an input parameter for the Data Split node. For more information, see the Node Inputs article.
  • Node Output: The output is set automatically according to the defined groups.
  • Trigger (Optional): An Event/Cron trigger can be set on this node, enabling you to initiate the pipeline run from this specific point.

For information on the Executions, Logs, and Instances tabs, see the Node Details article.


