Active Learning Pipeline
- Updated On 16 Oct 2024
Overview
In this era of rapidly advancing technology, machine-learning models have become essential tools for various industries. However, training these models requires large labeled datasets, which can be expensive and time-consuming to obtain.
Active learning is a supervised machine-learning approach that optimizes annotation effort by labeling only small, informative training samples. One of the biggest challenges in building machine learning (ML) models is annotating large datasets; active learning helps you overcome this challenge.
This documentation serves as a comprehensive guide for users who wish to implement the Active Learning Pipeline using Dataloop's platform.
Active Learning Pipeline (ALP)
- Dataloop's Active Learning Pipeline (ALP) is a powerful and customizable feature designed to automate and streamline the iterative model-training process on unstructured data.
- Active learning is a machine-learning method that selects the most informative data points to label, prioritizing those that would provide the most valuable information.
- By selectively labeling only informative examples, active learning improves the efficiency of the learning process and achieves high accuracy with fewer labeled samples.
- It automates the active learning workflow, making it accessible to users without a strong technical background.
- By automating the data ingestion, filtering, annotation task creation, model training, evaluation, and comparison processes, the pipeline simplifies and accelerates the process of training high-quality machine learning models.
- You can customize the active learning pipeline to fit your requirements.
Benefits
The Active Learning Pipeline gives you several significant benefits:
- Speed: It trains models more quickly than traditional methods by automating the model-training process, freeing teams to focus on other tasks.
- Efficiency: It saves resources by training models with less human intervention, reducing both time and cost.
- Flexibility: It is a flexible tool that you can customize to meet the specific needs of any business.
If you are looking for a way to improve your AIOps capabilities, then the Dataloop Active Learning Pipeline is the way to go.
Active Learning Pipeline Flow
Dataloop's active learning pipeline is divided into two flows:
- Ground Truth Enrichment (the upper flow): collecting training data into the Ground Truth dataset.
- Model Training (the lower flow): training a new version of your model based on the collected data.
For more information about the pipeline nodes, see the Pipeline Nodes article.
(1) Upper Flow: Collection of Ground-Truth Data
The collection of ground truth data consists of the following steps:
Trigger The Pipeline
Trigger
By default, an 'item_created' event trigger is set on the first node (the Dataset node), so the ALP runs on every item created (following data upload/sync) in the dataset selected on that node. You can modify this trigger to select different events or add a DQL filter, as in the sketch below.
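For example, a DQL filter added to the trigger could restrict the pipeline to image items uploaded to a specific folder. A minimal sketch, expressed as a Python dict (the folder path and mimetype are placeholder values):

```python
# Sketch of a DQL filter to attach to the 'item_created' trigger, so the
# pipeline only fires for JPEG images uploaded to a specific folder.
# Both values below are placeholders; adjust them to your dataset.
trigger_filter = {
    "$and": [
        {"dir": "/incoming"},
        {"metadata.system.mimetype": "image/jpeg"},
    ]
}
```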
Dataset Node
Use a Dataset node as the starting point of your pipeline to ensure it only triggers events in that specific dataset. For example, when setting the 'item_created' event as a trigger, only items created in the selected dataset will trigger the pipeline.
:::(Info) (Annotation Tasks Dataset)
The Dataset selected in this node must match the dataset configured in the Annotation workflow task nodes after the predict node.
:::
Create Annotations on New Data
Predict Node
The active learning pipeline begins with the Predict node, where a pre-trained model, the one set in the 'Best model' variable, generates annotations on unlabeled data.
Human in The Loop (HITL) - Human Annotation Workflow
Once pre-annotation by your model is done, data items move directly into human annotation workflow tasks. The default ALP consists of a labeling task and a QA (review) task for correcting the model's pre-annotations. Customize this workflow section according to your needs and fully configure the tasks.
Split Data into Ground Truth Subsets
ML Data Split Node
The ML Data Split node randomly splits the annotated data into three subset groups based on the given distribution percentages. By default, the distribution is 80-10-10 for the Train-Validation-Test subsets. Metadata tags are added to the items' metadata under metadata.system.tags, for example:
```json
{ "metadata": { "system": { "tags": { "train": true } } } }
```
These metadata tags can later be easily found and used as DQL filters for training and evaluation.
Here is an example of a DQL query to filter the train subset from the dataset: [GitHub Link](https://github.com/dataloop-ai-apps/active-learning/blob/main/pipeline_configs/train_subset_filter.json){target=`_blank`}
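For instance, with the Python SDK (dtlpy), an equivalent query can be built with a filter on the subset tag; a minimal sketch, with placeholder project and dataset names:

```python
import dtlpy as dl

# Sketch: fetch the items that the ML Data Split node tagged as the train
# subset. Project and dataset names below are placeholders.
project = dl.projects.get(project_name='my-project')
dataset = project.datasets.get(dataset_name='ground-truth')

filters = dl.Filters()
filters.add(field='metadata.system.tags.train', values=True)

pages = dataset.items.list(filters=filters)
print(f'{pages.items_count} items in the train subset')
```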
:::(Info) (GitHub Docs:)
Click to see the full Node configuration & code implementation
[https://github.com/dataloop-ai-apps/active-learning?tab=readme-ov-file#model-data-split-node](https://github.com/dataloop-ai-apps/active-learning?tab=readme-ov-file#model-data-split-node){target=`_blank`}
:::
Dataset Node - Ground Truth
Select the Ground Truth dataset to store the new data in. The data will be cloned from the dataset selected in the first pipeline node.
You can use the Dataset Browser at any time to see the latest ML subset divisions, using smart search to filter items by their metadata (item.metadata).
:::(Info) (Utilizing same dataset for start & end nodes)
If you want to use the same dataset you set as the pipeline's starting point (the first node), you can either remove this Dataset node or keep it; it won't affect the data.
:::
:::(Info) (Splitting data into separate folders in Ground-Truth dataset)
To distribute data into distinct folders within the Ground-Truth dataset, use three Dataset Nodes, each linked to a different output port of the ML Data Split node.
Make sure not to use the same dataset as both the start and end nodes in this process. Default subset filters (variables) offered by Dataloop are compatible with this setup.
:::
(2) Lower Flow: Create, Train, Deploy New Model Version
Creating and training new models using ground truth data requires you to complete the following process:
Trigger New Model Version
Trigger
It is recommended to initiate the new-model creation and training flow with a cron trigger (once a week, for example). Alternatively, you can use an event trigger to initiate the flow, or add your own logic using a Code node (for example, triggering based on an event count).
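For reference, a weekly schedule in standard cron syntax looks like this (the exact schedule is up to you):

```python
# Hypothetical cron expression for a weekly schedule: every Sunday at 00:00.
# Fields: minute, hour, day-of-month, month, day-of-week.
WEEKLY_CRON = "0 0 * * 0"
```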
Create a New Model
The Create New Model node clones a base model using the variables set on the node inputs (dataset, training and validation subsets, and model configuration). This node outputs a model that is ready for training. The name of the new model entity is taken from the text box in the Pipelines panel; if a model with the same name already exists, an incremental number is automatically added as a suffix.
:::(Info) (GitHub Docs:)
Click to see the full Node configuration & code implementation
https://github.com/dataloop-ai-apps/active-learning?tab=readme-ov-file#create-new-model-node
:::
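As a rough SDK equivalent of what this node does, you could clone the base model with the subset filters; a sketch assuming dtlpy's Model.clone API, with placeholder IDs and names:

```python
import dtlpy as dl

# Sketch: clone the base model onto the Ground Truth dataset with the
# train/validation subset filters. All IDs and names are placeholders.
base_model = dl.models.get(model_id='<best-model-id>')
dataset = dl.datasets.get(dataset_id='<ground-truth-dataset-id>')

train_filter = dl.Filters(field='metadata.system.tags.train', values=True)
validation_filter = dl.Filters(field='metadata.system.tags.validation', values=True)

new_model = base_model.clone(
    model_name='my-model-v2',  # an incremental suffix is added on name conflicts
    dataset=dataset,
    train_filter=train_filter,
    validation_filter=validation_filter,
)
```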
Train & Evaluate
Train Model
The Train Model node executes the training process for the newly created model version. The model will be trained over the Ground-Truth dataset using the train and validation subsets that were saved on the model during its creation.
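A minimal SDK sketch of this step, assuming dtlpy's model.train API (the model ID is a placeholder):

```python
import dtlpy as dl

# Sketch: start training for the newly created model and wait for it to
# finish. The model ID is a placeholder.
model = dl.models.get(model_id='<new-model-id>')
execution = model.train()
execution.wait()                          # block until the training execution completes
model = dl.models.get(model_id=model.id)  # refresh to pick up the new status/artifacts
```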
Evaluate Node - Base Model
The Evaluate Model node creates an evaluation for the base model to make sure its evaluation is up to date and based on the entire test subset of the Ground-Truth dataset (as defined in the variables). The evaluation will be used later for comparison with the newly created model (the 'Best model' variable).
- To evaluate a model, a test subset with ground-truth annotations is compared to the predictions made by that model.
- The model will make predictions on the test items, and scores will be created based on the annotation types. Currently, supported types for scoring include classification, bounding boxes, polygons, segmentation, and point annotations.
- By default, the annotation type(s) to be compared are defined by the model output type. Scores include a label agreement score, an attribute agreement score, and a geometry score (for example, IOU). Scores will be uploaded to the platform and available for other uses (for example, comparing models).
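A minimal SDK sketch of this step, assuming dtlpy's model.evaluate API (IDs are placeholders):

```python
import dtlpy as dl

# Sketch: evaluate the base model over the test subset so its metrics are
# current before comparison. The filter mirrors the Test Subset Filter
# variable; the IDs are placeholders.
model = dl.models.get(model_id='<best-model-id>')
test_filter = dl.Filters(field='metadata.system.tags.test', values=True)
model.evaluate(dataset_id='<ground-truth-dataset-id>', filters=test_filter)
```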
Evaluate Node - New Model
The Evaluate Model node creates an evaluation for the newly created model based on the entire test subset of the Ground-Truth dataset (as defined in the variables). The evaluation will be used later for comparison with the base model (the 'Best model' variable).
Compare Models & Deploy the Best Version
Compare Models
The Compare Models node compares two models: a previously trained model and a newly trained model. The default Dataloop compare model node can compare any two models that have either:
- Uploaded metrics to model management during model training, or
- Been evaluated on a common test subset.
The Compare Models node uses the Comparison Config variable.
:::(Info) (GitHub Docs:)
Click to see the full Node configuration & code implementation
[https://github.com/dataloop-ai-apps/active-learning?tab=readme-ov-file#compare-models-node](https://github.com/dataloop-ai-apps/active-learning?tab=readme-ov-file#compare-models-node){target=`_blank`}
:::
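For illustration only, a comparison config could express minimum-improvement thresholds along these lines (the field names here are hypothetical, not the actual schema; see the GitHub docs above for the real configuration):

```python
# Hypothetical comparison-config sketch: the new model wins only if both
# precision and recall improve by at least the given margin on the shared
# test subset. Field names are illustrative, not the actual schema.
comparison_config = {
    "checks": [
        {"metric": "precision", "min_delta": 0.01},
        {"metric": "recall", "min_delta": 0.01},
    ]
}
```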
Update Variable
If the winning model is the new model (based on the comparison config), the Update Variable node automatically deploys it and updates the 'Best model' variable. The pipeline immediately starts using the new model in every node that uses this variable (Predict, Create New Model).
How to Create Your Active Learning Pipeline?
- Navigate to the Pipelines page.
- Click Create Pipeline to start creating a new pipeline.
- Select the Use a Template option. It opens the Select Pipeline Template page.
- Select the Active Learning Pipeline template.
- Click Create Pipeline. The template will be created.
- Name the pipeline and click Create.
- Configuration:
- Adjust the variables according to your needs (details below).
- Adjust the node configurations according to your requirements (described above).
- Adjust the cron trigger (schedule) on the lower flow (on the 'Create New Model' node).
- Click Start to activate the pipeline.
Configure Your Active Learning Pipeline
Configure the pipeline variables and nodes according to your specific requirements.
Manage Your Pipeline Variables
To control your pipeline in real-time, the Active Learning template provides pipeline variables, enabling you to execute the pipeline with your preferred base model, dataset, configurations, and other specifications.
To customize the variables based on your requirements:
- From the top menu bar, click on the Pipeline Variables icon.
- Click Manage Variables.
- Identify the variables and click the Edit icon to set values according to your needs. The Model and Dataset variables are missing values and require your action; please read below.
Active Learning Variables
Here are the variables managed within the active learning pipeline:
- Best model (Model) - Set the ID of the trained model version you want to start your active learning pipeline with. To get the model ID: Dataloop main menu > Models > Versions tab > three-dots actions menu of the required model > 'Copy Model ID'.
The model must be located within your project scope. You can import any public model provided to your active project from the Model Management page.
- Ground Truth Dataset (Dataset) - Set the ID of your Ground Truth dataset.
- Train/Validation Subset Filters (JSON) - Set a DQL filter to retrieve Train/Validation set items from the Ground Truth dataset for model training. We recommend using the default filters, which target items marked with the train/validation metadata tags added by the ML Data Split node.
- Test Subset Filter (JSON) - Set a DQL filter to retrieve Test set items from the Ground Truth dataset for model evaluation. We recommend using the default filter, which targets items marked with the "test" metadata tag added by the ML Data Split node.
- Model Configuration (JSON) - Set the configuration for new models (used by the 'Create New Model' node). The configuration is used for both training and prediction. If no value is provided (an empty JSON '{}'), the base model's configuration is applied. See the sketch below.
- Model Comparison Configuration (JSON) - Set the comparison config for the models' comparison (new model vs. best model) based on the evaluation results (precision-recall).
For more details, read here: GitHub
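For example, a model configuration could set common training hyperparameters (a sketch; the accepted keys depend on the specific model adapter):

```python
# Illustrative model-configuration sketch. The accepted keys depend on the
# model adapter you cloned; these names are hypothetical examples.
model_configuration = {
    "epochs": 25,
    "batch_size": 8,
    "learning_rate": 1e-4,
}
```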
Customizing the Active Learning Pipeline
- Configurable logic: every step within the pipeline is fully configurable. Modify the pipeline composition with simple drag-and-drop actions to add custom processing steps or rearrange the existing nodes.
- Use the Code node to incorporate your own code to introduce custom processing steps within the pipeline. For example:
- Filter data for the Annotation tasks (after inferencing).
- Trigger the model creation & train flow based on event counts.
- Non-random ML data split.
Executing the Pipeline
Once you configure your Active Learning Pipeline, click Start Pipeline to activate it. Any new event triggers will execute the pipeline.
To manually trigger the pipeline over existing data, use the SDK (see the sketch after this list) or:
- Go to the browser of the dataset you set in your task nodes.
- Filter the required data.
- Click Dataset Actions > Run with pipeline > Select your pipeline.
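With the SDK, a manual run over filtered items could look like this (a sketch; project, pipeline, and dataset names are placeholders):

```python
import dtlpy as dl

# Sketch: run the pipeline on existing items via the SDK.
# Project, pipeline, and dataset names are placeholders.
project = dl.projects.get(project_name='my-project')
pipeline = project.pipelines.get(pipeline_name='active-learning')
dataset = project.datasets.get(dataset_name='my-dataset')

filters = dl.Filters(field='annotated', values=False)  # e.g. only unannotated items
for item in dataset.items.list(filters=filters).all():
    pipeline.execute(execution_input={'item': item.id})
```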