Active Learning Pipeline

Overview

In this era of rapidly advancing technology, machine-learning models have become essential tools for various industries. However, training these models requires large labeled datasets, which can be expensive and time-consuming to obtain.

Active learning is a supervised machine learning approach that aims to minimize annotation effort by training on a small number of carefully selected samples. One of the biggest challenges in building machine learning (ML) models is annotating large datasets, and active learning helps you overcome it.

This documentation serves as a comprehensive guide for users who wish to implement the Active Learning Pipeline using Dataloop's platform.


Active Learning Pipeline (ALP)

  • Dataloop's Active Learning Pipeline (ALP) is a powerful and customizable feature designed to automate and streamline the iterative model-training process on unstructured data.
  • Active learning is a machine learning method that selects the most informative data points to label, prioritizing those that would provide the most valuable information.
  • By selectively labeling only informative examples, active learning improves the efficiency of the learning process and achieves high accuracy with fewer labeled samples.
  • It automates the active learning workflow, making it accessible to users without a strong technical background.
  • By automating data ingestion, filtering, annotation task creation, model training, evaluation, and comparison, the pipeline simplifies and accelerates the training of high-quality machine learning models.
  • You can customize the active learning pipeline to fit your requirements.

Benefits

The Active Learning Pipeline gives you several significant benefits:

  • Speed: It trains models more quickly than traditional methods by automating the model-training process, freeing teams to focus on other tasks.
  • Efficiency: It saves resources by training models with less human intervention, which can save time and money.
  • Flexibility: It is a flexible tool that you can customize to meet the specific needs of any business.

If you are looking for a way to improve your AIOps capabilities, then the Dataloop Active Learning Pipeline is the way to go.


Active Learning Pipeline Flow


Dataloop's active learning pipeline is divided into two flows:

  1. Ground Truth Enrichment: collecting training data into the Ground Truth dataset (the upper flow).
  2. Model Training: training a new version of your model on the collected data (the lower flow).

For more information about the pipeline nodes, see the Pipeline Nodes article.

(1) Upper Flow: Collection of Ground-Truth Data

The collection of ground truth data consists of the following steps:

Trigger The Pipeline

Trigger

By default, an 'item_created' event trigger is set on the first node (the Dataset node), so the ALP runs on every item created (following data upload/sync) in the dataset selected on that node. You can modify this trigger to use different events or to add a DQL filter.
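For illustration, here is a minimal sketch of building such a DQL filter with the dtlpy Python SDK; the mimetype field value is only an example of a condition you might add:

```python
import dtlpy as dl

# Sketch (assumes the dtlpy Python SDK): a filter that restricts the
# trigger to JPEG images only. In the pipeline UI you would paste the
# equivalent DQL JSON into the trigger's filter field.
filters = dl.Filters(resource=dl.FiltersResource.ITEM)
filters.add(field='metadata.system.mimetype', values='image/jpeg')

# prepare() returns the raw DQL JSON that the trigger expects.
print(filters.prepare())
```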

Dataset Node

Use a Dataset node as the starting point of your pipeline to ensure that only events in that specific dataset trigger it. For example, when setting the 'item_created' event as a trigger, only items created in the selected dataset will trigger the pipeline.
:::(Info) (Annotation Tasks Dataset)
The Dataset selected in this node must match the dataset configured in the Annotation workflow task nodes after the predict node.
:::

Create Annotations on New Data

Predict Node

The active learning pipeline begins with the Predict node, where a pre-trained model generates annotations on unlabeled data, using the model set in the 'Best model' variable.
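For illustration, a minimal sketch of running a model over items with the dtlpy Python SDK, assuming the model is already deployed; IDs are placeholders, and parameter names may vary between SDK versions:

```python
import dtlpy as dl

# Sketch: run the 'Best model' manually over a few items.
# '<best-model-id>' and the item IDs below are placeholders.
model = dl.models.get(model_id='<best-model-id>')
execution = model.predict(item_ids=['<item-id-1>', '<item-id-2>'])
execution.wait()  # block until the prediction execution completes
```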

Human in The Loop (HITL) - Human Annotation Workflow

Once pre-annotation by your model is done, data items move directly into human annotation workflow tasks. The default ALP consists of a labeling task and a QA (review) task in which humans correct the model's pre-annotations. Customize the workflow section according to your needs and fully set up the tasks.

Split Data into Ground Truth Subsets

ML Data Split Node

The ML Data Split node randomly splits the annotated data into three subsets based on the given distribution percentages. By default, the distribution is 80-10-10 for the Train-Validation-Test subsets. Metadata tags are added to the items' metadata under metadata.system.tags, for example:
{ "metadata": { "system": { "tags": { "train": true } } } }
These metadata tags can later be easily found and used as DQL filters for training and evaluation.

Here is an example of a DQL query to filter the train subset from the dataset: [GitHub Link](https://github.com/dataloop-ai-apps/active-learning/blob/main/pipeline_configs/train_subset_filter.json){target=`_blank`}
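When working from the SDK, a minimal sketch of the same filter, assuming the dtlpy Python SDK (the dataset ID is a placeholder):

```python
import dtlpy as dl

# Sketch: retrieve the train subset from the Ground Truth dataset using
# the metadata tag added by the ML Data Split node.
dataset = dl.datasets.get(dataset_id='<ground-truth-dataset-id>')

filters = dl.Filters()
filters.add(field='metadata.system.tags.train', values=True)

pages = dataset.items.list(filters=filters)
print(f'{pages.items_count} items in the train subset')
```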

:::(Info) (GitHub Docs:)
Click to see the full Node configuration & code implementation
[https://github.com/dataloop-ai-apps/active-learning?tab=readme-ov-file#model-data-split-node](https://github.com/dataloop-ai-apps/active-learning?tab=readme-ov-file#model-data-split-node){target=`_blank`}
:::

Dataset Node - Ground Truth

Select the Ground Truth dataset in which to store the new data. The data will be cloned from the dataset selected in the first pipeline node.
You can use the Dataset Browser at any time to see the latest ML subset divisions, using the smart search to filter items based on their metadata.
:::(Info) (Utilizing same dataset for start & end nodes)
If you want to use the same dataset as you set in the pipeline starting point (the first node), then you can either remove the current Dataset node or keep it there; it won't affect the data.
:::
:::(Info) (Splitting data into separate folders in Ground-Truth dataset)
To distribute data into distinct folders within the Ground-Truth dataset, use three Dataset Nodes, each linked to a different output port of the ML Data Split node.
Make sure not to use the same dataset as both the start and end nodes in this process. Default subset filters (variables) offered by Dataloop are compatible with this setup.
:::

(2) Lower Flow: Create, Train, Deploy New Model Version

Creating and training a new model version on the ground truth data consists of the following steps:

Trigger New Model Version

Trigger

It is recommended to initiate the new model version creation and training flow using a cron trigger (once a week, for example). Alternatively, you can use an event trigger to initiate the flow or add your own logic using a code node (for example, trigger it based on event count).
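For example, assuming standard five-field cron syntax, a weekly schedule might look like this:

```python
# Standard cron fields: minute hour day-of-month month day-of-week.
# '0 0 * * 0' fires at 00:00 every Sunday, i.e., once a week.
WEEKLY_CRON = '0 0 * * 0'
```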

Create a New Model

The Create New Model node clones a base model using the variables set on the node inputs (dataset, train and validation subset filters, and model configuration). This node outputs a model that is ready for training. The name of the new model entity is taken from the text box in the Pipelines panel. If a model with the same name already exists, an incremental number is automatically added as a suffix.
:::(Info) (GitHub Docs:)
Click to see the full Node configuration & code implementation
https://github.com/dataloop-ai-apps/active-learning?tab=readme-ov-file#create-new-model-node
:::
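For illustration, a rough sketch of what the Create New Model node does, expressed with the dtlpy Python SDK; the exact clone parameters may differ between SDK versions, and all IDs and names are placeholders:

```python
import dtlpy as dl

# Sketch: clone a base model onto the Ground Truth dataset with the
# train/validation subset filters. Parameter names follow recent dtlpy
# versions and may differ; IDs and names are placeholders.
base_model = dl.models.get(model_id='<best-model-id>')
dataset = dl.datasets.get(dataset_id='<ground-truth-dataset-id>')

train_filter = dl.Filters(field='metadata.system.tags.train', values=True)
validation_filter = dl.Filters(field='metadata.system.tags.validation', values=True)

new_model = base_model.clone(
    model_name='my-model-v2',  # hypothetical name; a numeric suffix is added on conflict
    dataset=dataset,
    train_filter=train_filter,
    validation_filter=validation_filter,
)
```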

Train & Evaluate

Train Model

The Train Model node executes the training process for the newly created model version. The model will be trained over the Ground-Truth dataset using the train and validation subsets that were saved on the model during its creation.
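For illustration, a minimal sketch of the equivalent SDK call, assuming dtlpy (the model ID is a placeholder):

```python
import dtlpy as dl

# Sketch: kick off training for the newly created model version and
# block until the training execution finishes.
new_model = dl.models.get(model_id='<new-model-id>')
execution = new_model.train()
execution.wait()
```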

Evaluate Node - Base Model

The Evaluate Model node evaluates the base model (the 'Best model' variable) to make sure its evaluation is up to date and based on the entire test subset of the Ground Truth dataset (as defined in the variables). This evaluation is used later for comparison with the newly created model.

  • To evaluate a model, a test subset with ground-truth annotations is compared to the predictions made by that model.
  • The model makes predictions on the test items, and scores are created based on the annotation types. Currently supported types for scoring include classification, bounding box, polygon, segmentation, and point annotations.
  • By default, the annotation type(s) to be compared are defined by the model output type. Scores include a label agreement score, an attribute agreement score, and a geometry score (for example, IoU; see the sketch below). Scores are uploaded to the platform and available for other uses (for example, comparing models).
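For intuition, here is a minimal, illustrative computation of an IoU-style geometry score for two bounding boxes; this is a sketch, not Dataloop's internal scoring code:

```python
# Illustrative sketch: intersection over union (IoU) of two bounding boxes.
def bbox_iou(a, b):
    """a, b: boxes as (left, top, right, bottom) tuples."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(bbox_iou((0, 0, 10, 10), (5, 5, 15, 15)))  # -> 0.1428...
```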

Evaluate Node - New Model

The Evaluate Model node evaluates the newly created model on the entire test subset of the Ground Truth dataset (as defined in the variables). This evaluation is used later for comparison with the base model (the 'Best model' variable).

Compare Models & Deploy the Best Version

Compare Models

The Compare Models node compares two models: the previously trained model and the newly trained one. The default Dataloop compare node determines the winning model based on the comparison configuration and the models' evaluation results (for example, precision-recall).

Update Variable

If the winning model is the new model (based on the comparison config), the Update Variable node automatically deploys it and updates the 'Best model' variable. Your pipeline starts using the new model immediately in every node that uses this variable (for example, Predict and Create New Model).


How to Create Your Active Learning Pipeline?

  1. Navigate to the Pipelines page.
  2. Click Create Pipeline to start creating a new pipeline.
  3. Select the Use a Template option. The Select Pipeline Template page opens.
  4. Select the Active Learning Pipeline template.
  5. Click Create Pipeline. The template will be created.
  6. Name the pipeline and click Create.
  7. Configuration:
    • Adjust the variables according to your needs (details below).
    • Adjust the node configurations according to your requirements (described above).
    • Adjust the cron trigger (schedule) on the lower flow (on the 'Create New Model' node).
  8. Click Start to activate the pipeline.

Configure Your Active Learning Pipeline

Configure the pipeline variables and nodes according to your specific requirements.

Manage Your Pipeline Variables

To control your pipeline in real-time, the Active Learning template provides pipeline variables, enabling you to execute the pipeline with your preferred base model, dataset, configurations, and other specifications.

To customize the variables based on your requirements:

  1. From the top menu bar, click on the Pipeline Variables icon.
  2. Click Manage Variables.
  3. Identify the variables and click the Edit icon to set values according to your needs. The Model and Dataset variables have no default values and require your input; see the details below.

Active Learning Variables

Here are the variables managed within the Active Learning Pipeline:

  1. Best model (Model) - Set the ID of the trained model version you want to start your active learning pipeline with. To get the model ID: Dataloop main menu > Models > Versions tab > 3-dots actions menu of the required model > 'Copy Model ID'.

    The model must be located within your project scope. You can import any public model provided to your active project from the Model Management page.

  2. Ground Truth Dataset (Dataset) - Set the ID of your Ground Truth dataset.
  3. Train/Validation Subset Filters (JSON) - Set a DQL filter to retrieve Train/Validation set items from the Ground Truth dataset for model training. We recommend using the default filters, which target items marked with the train/validation metadata tags added by the ML Data Split node.
  4. Test Subset Filter (JSON) - Set a DQL filter to retrieve Test set items from the Ground Truth dataset for model evaluation.
    We recommend using the default filter, which targets items marked with the "test" metadata tag added by the ML Data Split node.
  5. Model Configuration (JSON) - Set the configuration for new models (used by the 'Create New Model' node). The configuration is used for both training and prediction. If no value is provided (empty JSON '{}'), the base model's configuration is applied (see the illustrative sketch after this list).
  6. Model Comparison Configuration (JSON) - Set the comparison config for comparing the models (new model vs. best model) based on their evaluation results (precision-recall).
    For more details, read here: GitHub
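For illustration only, a model configuration might look like the following; the valid keys depend entirely on the specific model adapter, so the keys shown here are hypothetical:

```python
# Hypothetical model configuration; valid keys depend on the model adapter.
model_configuration = {
    'epochs': 25,
    'batch_size': 4,
    'learning_rate': 1e-4,
}
# An empty configuration ({}) means the base model's configuration is reused.
```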

Customizing the Active Learning Pipeline

  • Modify the pipeline composition using drag and drop to add custom processing steps or rearrange the existing nodes; every step and piece of logic within the pipeline is fully configurable.
  • Use the Code node to incorporate your own code and introduce custom processing steps within the pipeline. For example:
    • Filter data for the annotation tasks (after inferencing).
    • Trigger the model creation & train flow based on event counts.
    • Nonrandom ML data split.

Executing the Pipeline

Once you have configured your Active Learning Pipeline, click Start Pipeline to activate it. Any new trigger event will execute the pipeline.
To manually trigger the pipeline over existing data, use the SDK (see the sketch after these steps) or:

  1. Go to the Browser of the Dataset you set in your task nodes.
  2. Filter the required data.
  3. Click Dataset Actions > Run with pipeline > Select your pipeline.
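
For illustration, a minimal sketch of triggering the pipeline via the dtlpy Python SDK; all names and IDs are placeholders, and parameter names may differ between SDK versions:

```python
import dtlpy as dl

# Sketch: execute the pipeline manually over an existing item.
project = dl.projects.get(project_name='<my-project>')
pipeline = project.pipelines.get(pipeline_name='<my-active-learning-pipeline>')

dataset = dl.datasets.get(dataset_id='<dataset-id>')
item = dataset.items.get(item_id='<item-id>')

# Execute the pipeline with a single item as input.
pipeline.execute(execution_input={'item': item.id})
```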