Manage Datasets

05 Feb 2025

Overview

The Manage Datasets documentation provides a comprehensive guide on how to efficiently organize, structure, and manipulate datasets within the Dataloop platform. Datasets serve as the foundation for data annotation, machine learning model training, and AI-driven workflows. Dataloop provides the capability to perform a variety of dataset management actions, as described below.


Add Items to a Collection

You can create a new collection by selecting items from your dataset and adding them to a designated collection.

  1. Open the Data Browser.
  2. Select the items you want to add to a collection.
  3. Right-click on the selected items.
  4. Select Collections and choose your desired collection. The selected items will now be added to the chosen collection.

Add Custom Metadata

Adding custom metadata involves attaching additional information or tags to various types of data items. Custom metadata can be user-defined and is not limited to the predefined categories or attributes provided by the Dataloop platform.

To attach metadata to any entity, such as Datasets, you can use the SDK's update function. To learn how to upload items with metadata, read here.

# Example: attach custom metadata to a dataset and persist it
dataset.metadata["MyBoolean"] = True
dataset.metadata["Mycontext"] = "Blue"
dataset.update()
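
The same pattern works for individual items with the Python SDK (dtlpy). A minimal sketch, assuming a hypothetical dataset ID and item path; user-defined fields are commonly grouped under a 'user' key:

# Example: add custom metadata to a single item and persist it
import dtlpy as dl

dataset = dl.datasets.get(dataset_id='my-dataset-id')  # hypothetical dataset ID
item = dataset.items.get(filepath='/dog.jpg')  # hypothetical item path
item.metadata['user'] = {'MyBoolean': True, 'Mycontext': 'Blue'}
item.update()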

Display Custom Metadata

The Datasets page provides a list of all the datasets present within the project. The table contains default columns, including the dataset name, the count of items, the percentage of annotated items, and additional information.

To include and display columns with your custom context (metadata fields):

  1. From the Project Overview, click on Settings.
  2. Select Configuration.
  3. Select the Dataset Columns from the left-side menu.
  4. Click Update Setting.
  5. Click Add column.
  6. Enter the required information as follows.
    1. Name: A general name for this column (not visible outside the project-settings).
    2. Label: The column header displayed on the Datasets page.
    3. Field: The Metadata field to map to this column.
  7. Configure the desired feature settings as needed:
    1. Link: If the field value is a URL and should open in a new tab, select this option.
    2. Resizable: Check this option if the column needs to be resizable, useful for displaying long values.
    3. Sortable: Enable this option to allow sorting the table by clicking the column header.
  8. Click Apply. A successful message is displayed.

After completing the above steps, the Datasets table on the Datasets page will display the custom column and the data you've populated there.

  • To ensure that any new data added via SDK is reflected, refresh the page.
  • You can use the search box to search for datasets that match your search term, provided that the search term is included in any of the custom columns you've added to the table. This allows you to filter datasets based on the custom metadata you've defined.

Add a Dataset to an Existing Task

  1. In the Dataset Browser, click Dataset Actions.
  2. Select Labeling Tasks -> Add to an Existing Task from the list.
Items to Task or Model

When creating a task or model from the Dataset Browser, all items in the dataset are included.


Assign an Item to a Model's Test Dataset

You can assign selected items to a model's test dataset. When you assign an item, a Test tag is added to the item details.

  1. In the Dataset Browser, select the item.
  2. Click Dataset Actions.
  3. Select Models -> Assign to Subset.
  4. Select Test. The test dataset is used to evaluate the performance of a trained model on new, unseen data.

Assign an Item to a Model's Train Dataset

You can assign selected items to a model's train dataset. When you assign an item, a Train tag is added to the item details.

  1. In the Dataset Browser, select the item.
  2. Click Dataset Actions.
  3. Select Models -> Assign to Subset.
  4. Select Train. The train dataset is used to train the machine learning model, helping it learn patterns and make predictions.

Assign an Item to a Model's Validation Dataset

You can assign selected items to a model's validation dataset. When you assign an item, a Validation tag is added to the item details.

  1. In the Dataset Browser, select the item.
  2. Click Dataset Actions.
  3. Select Models -> Assign to Subset.
  4. Select Validation. The validation dataset is used to fine-tune the model and optimize its hyperparameters, helping prevent overfitting.
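
These subset assignments can also be scripted with the Python SDK. The sketch below is an assumption based on the tag behavior described above (the Train/Validation/Test tags appear to live under the item's system metadata); verify the exact field against the SDK documentation before relying on it.

# Assumption: subset tags are stored under metadata['system']['tags']
import dtlpy as dl

item = dl.items.get(item_id='my-item-id')  # hypothetical item ID
item.metadata.setdefault('system', {}).setdefault('tags', {})['train'] = True
item.update(system_metadata=True)  # system metadata is only persisted with this flag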

Clone Datasets

Refer to the Clone Datasets article for more information.


Clone a Collection

  1. Open the Data Browser.
  2. In the left-side panel, click on the Collections icon located below the Folder icon.
  3. Hover over the collection you want to clone.
  4. Click on the three dots and select Clone from the list.
  5. Click Yes to confirm the cloning process. The cloned collection will be created and named original_name-clone-1.

Clone an Item

  1. In the Dataset Browser, select the item you want to clone.
  2. Click Dataset Actions.
  3. Select File Actions > Clone. Learn more about the cloning process.
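
Item cloning is also available from the Python SDK. A minimal sketch, assuming hypothetical item and dataset IDs:

# Clone an item into another (or the same) dataset
import dtlpy as dl

item = dl.items.get(item_id='source-item-id')  # hypothetical item ID
item.clone(dst_dataset_id='target-dataset-id',  # hypothetical destination dataset ID
           with_annotations=True)  # copy the item's annotations along with the file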

Classify an Item

  1. In the Dataset Browser, select the item you want to classify.
  2. Click Dataset Actions.
  3. Select File Actions > Classification from the list. Learn more about classification.

Copy a Dataset ID

You can copy the Dataset ID by using one of the following options:

  • Click Dataset Details on the Dataset Browser page, then click the Copy icon next to the Dataset ID field.
  • On the Data page (left-side panel), click the Ellipsis (three-dots) icon of the dataset and select the Copy Dataset ID option from the list.

Create Collections

Collections can be customized to match the requirements of your specific task, such as grouping items by type, project phase, or other relevant attributes.

Limitations:
- You can create up to 10 collection folders.
- Each item can be tagged in a maximum of 10 collections at once.

  1. Open the Data Browser.
  2. In the left-side panel, click on the Collections icon located below the Folder icon.
  3. Click on Create a Collection.
  4. Type your desired collection's name, and press the Enter key. The new collection will now be created and displayed in Collections.

Create a Dataset Using Dataloop Storage

Dataloop storage is the platform's internal dataset storage. It allows you to store digital files, such as images, videos, audio, text files, and other data for the annotation process.

  1. Log in to the Dataloop platform.
  2. Select Data from the left-side panel.
  3. In the Datasets tab, click Create Dataset, or click on the down-arrow and select Create Dataset from the list. The Data Management Resource Creation right-side panel is displayed.
  4. Dataset Name: Enter a Name for the dataset.
  5. Recipe (Optional): Select a recipe from the list.
  6. Provider: Dataloop is selected by default. If not, select the Dataloop option from the list.
  7. Click Create Dataset. The new dataset will be created.
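
The same dataset can be created with the Python SDK. A minimal sketch, assuming a hypothetical project name:

# Create a dataset on Dataloop's internal storage
import dtlpy as dl

if dl.token_expired():
    dl.login()
project = dl.projects.get(project_name='My Project')  # hypothetical project name
dataset = project.datasets.create(dataset_name='my-dataset')
print(dataset.id)  # useful whenever a Dataset ID is required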

Create a Dataset Based on an External Cloud Storage

Cloud storage services are online platforms that allow organizations to store and manage their data. Dataloop supports the following cloud storage services:

  • Amazon Web Services (AWS) S3: Amazon S3 (Simple Storage Service) is a highly scalable, object storage service offered by AWS.
  • Microsoft Azure Blob Storage: Microsoft Azure provides Blob Storage for storing and managing unstructured data. It integrates well with other Azure services.
  • Google Cloud Storage: Google Cloud Storage is part of the Google Cloud Platform and offers object storage, archival storage, and data transfer services. It's often used alongside other GCP services.
Prerequisites

To create a dataset based on external cloud storage, you first need to:

  1. Create a Storage Driver to connect to the cloud-storage resource. For more information, see the Storage Driver Overview.
  2. Create an integration. For more information, see the Integration Overview.

Once the prerequisites are in place, create the dataset:

  1. Log in to the Dataloop platform.
  2. Select Data from the left-side panel.
  3. Select the Datasets tab, if it is not selected by default.
  4. Click Create Dataset, or click on the down-arrow and select Create Dataset from the list. The Data Management Resource Creation right-side panel is displayed.
  5. Dataset Name: Enter a Name for the dataset.
  6. Recipe (Optional): Select a recipe from the list.
  7. Provider: Dataloop is selected by default. Select one of the following external providers from the list:
    1. AWS
    2. GCP
    3. Azure
  8. Storage Driver: Select a Storage Driver from the list. If not available, create a new Storage Driver​.
  9. Click Create Dataset.
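
Creating a dataset bound to an existing storage driver can also be scripted. The sketch below assumes the driver already exists and that datasets.create accepts a driver reference; treat the parameter name as an assumption and check the SDK reference before using it.

# Create a dataset backed by an existing external storage driver
import dtlpy as dl

project = dl.projects.get(project_name='My Project')  # hypothetical project name
driver = project.drivers.get(driver_name='my-s3-driver')  # hypothetical driver name
dataset = project.datasets.create(dataset_name='my-external-dataset',
                                  driver=driver.id)  # assumption: the driver ID is accepted here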

Create a Task

  1. In the Dataset Browser, click Dataset Actions.
  2. Select Labeling Tasks -> Create a New Task from the list.
Items to Task or Model

When creating a task or model from the Dataset Browser, all items in the dataset are included.


Delete a Collection

  1. Open the Data Browser.
  2. In the left-side panel, click on the Collections icon located below the Folder icon.
  3. Hover over the collection you want to delete.
  4. Click on the three dots and select Delete from the list.
  5. Click Yes to confirm the deletion process.

Delete a Dataset

Dataloop provides the capability to delete datasets stored in both the internal file system and external cloud storage.

Important

Deleting a dataset removes all of its items, along with any related tasks and assignments associated with that dataset.

To delete a dataset, perform the following instructions:

  1. Go to the Data page using the left-side navigation.
  2. In the Datasets tab, find the dataset that you want to delete.
  3. Click on the Ellipsis (3-dots) icon and select the Delete Dataset option from the list. A confirmation message is displayed.
  4. Click Yes. A confirmation message indicating the successful deletion of the dataset is displayed.
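
Datasets can also be deleted with the Python SDK. Deletion is irreversible, so the call requires explicit confirmation flags. A minimal sketch with a hypothetical dataset name:

# Delete a dataset by name (irreversible; both confirmation flags are required)
import dtlpy as dl

project = dl.projects.get(project_name='My Project')  # hypothetical project name
dataset = project.datasets.get(dataset_name='my-dataset')
dataset.delete(sure=True, really=True)  # the double confirmation guards against accidental deletion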

Delete Annotations from an Item

  1. In the Dataset Browser, select the item whose annotations you want to delete.
  2. Click Dataset Actions.
  3. Select File Actions > Delete Annotations. A confirmation message is displayed.
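
Annotations can also be removed with the Python SDK. A minimal sketch that deletes every annotation on a single item, assuming a hypothetical item ID:

# Delete all annotations from one item
import dtlpy as dl

item = dl.items.get(item_id='my-item-id')  # hypothetical item ID
for annotation in item.annotations.list():
    annotation.delete()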

Delete Dataset Items

  1. In the Dataset Browser, select the item you want to delete.
  2. Click Dataset Actions.
  3. Select File Actions > Delete Items.
  4. Click Yes. A confirmation message is displayed.
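
Item deletion can also be scripted, either one item at a time or in bulk with a filter. A minimal sketch with hypothetical IDs and paths:

# Delete a single item, or a filtered set of items
import dtlpy as dl

dataset = dl.datasets.get(dataset_id='my-dataset-id')  # hypothetical dataset ID
dataset.items.get(filepath='/dog.jpg').delete()  # hypothetical item path

filters = dl.Filters()
filters.add(field='dir', values='/old-folder')  # hypothetical folder
dataset.items.delete(filters=filters)  # bulk delete of everything matching the filter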

Download Annotations

  1. In the Dataset Browser, select the item whose annotations you want to download.
  2. Click Dataset Actions.
  3. Select Download Annotations from the list. The annotations of the selected file will be downloaded as a JSON file.
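
Annotation downloads can also be performed with the Python SDK. A minimal sketch, assuming the item-level annotations.download helper and a hypothetical output path:

# Download an item's annotations as a JSON file
import dtlpy as dl

item = dl.items.get(item_id='my-item-id')  # hypothetical item ID
item.annotations.download(filepath='/tmp/annotations.json',  # illustrative output path
                          annotation_format=dl.ViewAnnotationOptions.JSON)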

Download Files

Download up to 100 files
  • You can download up to 100 files per selection. To download more, use the SDK (see the sketch after these steps).
  • Only users with the Developer or Owner role can download files.
  1. In the Dataset Browser, select the item(s) you want to export.
  2. Click Dataset Actions.
  3. Select File Actions > Download Files. The selected items will be downloaded. For example, a JPG image will be downloaded as a JPG file.
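
To go beyond the 100-file limit, here is a minimal SDK sketch that downloads a whole dataset, or a filtered subset, to a local folder (IDs and paths are illustrative):

# Download dataset items to a local folder
import dtlpy as dl

dataset = dl.datasets.get(dataset_id='my-dataset-id')  # hypothetical dataset ID
dataset.items.download(local_path='/tmp/my-dataset')  # downloads the binaries of all items

filters = dl.Filters()
filters.add(field='dir', values='/images')  # hypothetical folder filter
dataset.items.download(filters=filters, local_path='/tmp/images-only')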

Download Items by Using Pipelines

A pipeline can include a step for automatic data and metadata export to a connected location. This can be done as a function (FaaS) that exports all data via a remote API or a connected driver. For example, use a Dataset node and a FaaS node, and select an export function. For more information, refer to the Create Pipelines article.


Export Items as JSON file

  1. In the Dataset Browser, select the item you want to export.
  2. Click Dataset Actions.
  3. Select File Actions > Export JSON. The selected items will be exported as JSON files within a ZIP file, including annotations if available. For example, a JPG image will be exported as a JSON file.

Export the Entire Dataset

Use either the Dashboard > Data Management widget, or the Data > Datasets tab to export the data:

  1. Select the dataset from the list.
  2. Click on the Ellipsis (3-dots) icon and select Download data from the list.
  3. Select the export Scope:
    • Entire dataset: The ZIP file includes JSON files for all items in the dataset.
  4. (Optional) Include the Annotations JSON files. By default, the Item JSON file is selected.
  5. Selecting the Annotations JSON files option enables you to Include PNG per semantic label.
  6. Click Export.
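
The dataset-wide export also has an SDK counterpart. A minimal sketch that pulls all annotation JSON files to a local folder (ID and path are illustrative):

# Export all annotation JSON files of a dataset
import dtlpy as dl

dataset = dl.datasets.get(dataset_id='my-dataset-id')  # hypothetical dataset ID
dataset.download_annotations(local_path='/tmp/my-dataset-annotations')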

Export Datasets in COCO/YOLO/VOC Formats

The Dataset browser incorporates significant automation capabilities, enabling you to export dataset items in industry-standard formats through the following functions. Any function available within this application can be applied to selected items or an active query.

In addition to the Dataloop format, annotations can be exported in industry-standard formats. These are provided as functions deployed to the UI slot of the Dataset Browser.

To learn more about the converters, their input, and output structures, visit their Git repo.

  • COCO Converter: This tool is used to convert data annotations from other formats into the COCO (Common Objects in Context) format or vice versa. COCO is a popular dataset format for object detection, segmentation, and image captioning tasks.
  • YOLO Converter: YOLO (You Only Look Once) is a popular object detection algorithm. A YOLO Converter is used to convert annotations between YOLO format and other annotation formats, making it easier to work with YOLO-based models and datasets.
  • VOC Converter: VOC (Visual Object Classes) is another dataset format commonly used in computer vision tasks. A VOC Converter allows you to convert annotations between VOC format and other formats, facilitating compatibility with different tools and models.
Info

Develop a custom converter and deploy it to a UI-Slot anywhere on the platform, or embed it as a Pipeline node. To learn more, contact Dataloop support.

  1. In the Dataset Browser, select the item(s).
  2. Click Dataset Actions.
  3. Select Deployment Slot and select one of the following formats:
    1. COCO Converter.
    2. YOLO Converter.
    3. VOC Converter.
  4. A message is displayed indicating that the execution of the <global-converter> function was created successfully; check the activity bell. A ZIP file will be created and downloaded.

Extract Embeddings from a Dataset

Extracting embeddings is the process of generating numerical representations (vectors) of data, such as text, images, or other types of content, in a lower-dimensional space. Dataloop allows you to extract embeddings using a model (trained, pre-trained, and deployed) from the Marketplace. These embeddings capture the essential features or meanings of the original data in a way that makes it easier for machine learning models to process and analyze.

  1. Navigate to the Data page using the left-side navigation.
  2. In the Datasets tab, find the dataset from which you want to extract embeddings.
  3. Click on the Ellipsis (3-dots) icon and select the Extract Embeddings option from the list. An Extract Embeddings pop-up with a list of all the models (trained, pre-trained, and deployed) is displayed.
  4. Select a Model from the Deployed section, or click on the Marketplace to install a new model.
  5. Choose the option Automatically run on new dataset items to enable automatic extraction for newly added items. This creates a trigger that automatically generates embeddings for the new items.
  6. Click Embed. The Extracting Embeddings process will be initiated. Use the Notifications bell icon to view status.

Extract Embeddings from an Item(s)

  1. In the Dataset Browser, select the items from which you want to extract embeddings.
  2. Click Dataset Actions and select the Models -> Extract Embeddings option from the list. An Extract Embeddings pop-up is displayed.
  3. Select a Model from the Deployed section, or click on the Marketplace to install a new model. If there are no models available, click Install Model.
    1. Once the model is installed, click Deploy.
  4. Click Embed to initiate the embeddings extraction. Use the Notifications bell icon to view the status.

Find Similar Items

  1. In the Dataset Browser, select the item.
  2. Click Dataset Actions.
  3. Select Similarity -> Find Similar Items.
  4. Click on the Feature Set name. The Clustering tab is displayed with similar items selected.

Find Collections Using Smart Search

  1. Open the Data Browser.
  2. Click on the Items search field.
  3. Enter the query as metadata.system.collections.c0 = true, where c0 is the collection ID. The available collections will be listed in a dropdown.
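
The same query can be expressed with an SDK filter. A sketch that reuses the metadata field shown above, with a hypothetical dataset ID and collection ID c0:

# List items tagged with a given collection
import dtlpy as dl

dataset = dl.datasets.get(dataset_id='my-dataset-id')  # hypothetical dataset ID
filters = dl.Filters()
filters.add(field='metadata.system.collections.c0', values=True)  # c0 is the collection ID
pages = dataset.items.list(filters=filters)
print(pages.items_count)  # number of items in the collection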

Generate Predictions with a Model

You can use the dataset items to generate predictions by using a trained or pre-trained model that has been deployed.

  1. In the Dataset Browser, select the item.
  2. Click Dataset Actions.
  3. Select Models > Predict.
  4. Search and select a trained and deployed model from the list.
  5. Click Predict. A confirmation message is displayed.
Additional actions
  • Search models by model name, project name, application name, and status.
  • Use the filter to sort the models by scope and model status.
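
Predictions can also be triggered from the Python SDK against a deployed model. A minimal sketch, assuming the model.predict helper and hypothetical names:

# Run prediction on a selected item with a deployed model
import dtlpy as dl

project = dl.projects.get(project_name='My Project')  # hypothetical project name
model = project.models.get(model_name='my-deployed-model')  # hypothetical model name
item = dl.items.get(item_id='my-item-id')  # hypothetical item ID
model.predict(item_ids=[item.id])  # creates an execution you can monitor in the UI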

Generate Predictions by Using a Trained Model

You can use only trained and deployed models for generating predictions. To deploy a trained model, perform the following instructions:

  1. In the Dataset Browser, select the item.
  2. Click Dataset Actions.
  3. Select Models > Predict.
  4. Identify the trained model, and click Deploy. The Model Version Deployment page is displayed.
  5. In the Deployment and Service Fields tabs, make changes in the available fields as needed.
  6. Click Deploy. A confirmation message is displayed.

Move Items to a Folder

  1. In the Dataset Browser, select the item you want to move.
  2. Click Dataset Actions.
  3. Select File Actions > Move to Folder.
  4. Select a folder from the list.
  5. Click Move. A confirmation message is displayed.
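
Moving items can also be done programmatically. A minimal sketch, assuming the item.move helper and a hypothetical target folder:

# Move an item into another folder of the same dataset
import dtlpy as dl

item = dl.items.get(item_id='my-item-id')  # hypothetical item ID
item.move(new_path='/archive')  # hypothetical destination folder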

Merge Datasets

Refer to the Merge Datasets article for more information.


Open an Item in a New Browser Tab

This allows you to view images, play audio files, and more, in a new browser tab.

  1. In the Dataset Browser, select the item.
  2. Click Dataset Actions.
  3. Select File Actions > Open File in New Tab. The selected file will be opened in a new browser tab.
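
The Python SDK offers an equivalent shortcut that opens the item in your default browser. A one-line sketch with a hypothetical item ID:

# Open an item in the web UI from a script
import dtlpy as dl

dl.items.get(item_id='my-item-id').open_in_web()  # hypothetical item ID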

Open an Item in a Specific Annotation Studio

  1. In the Dataset Browser, select the item.
  2. Click Dataset Actions.
  3. Select File Actions > Open With.
  4. Select the Annotation Studio. The item will be opened in the annotation studio based on the type of the item, such as image, audio, video, etc.

Open an Item in the Annotation Studio

You can view and annotate an item by opening it in the Annotation Studio. In the Dataset Browser, double-click the item; it will open in the default annotation studio for its type (image, audio, video, etc.). Alternatively, follow these steps:

  1. In the Dataset Browser, select the item.
  2. Click Dataset Actions.
  3. Select File Actions > Open File in Studio. The selected file will be opened in the default annotation studio.

Perform Bulk Operations

The Dataset Browser supports bulk operations within the specified context, such as Move to Folder, Export, Clone, and Classification. To carry out bulk operations:

  1. Manually select one or more items using the Ctrl/Command key + mouse left-click.
  2. Perform the available actions on the items, such as Move to Folder, Export, Clone, Classification, etc.

Remove an Item from the Model Test, Train, or Validation Datasets

  1. In the Dataset Browser, select the item.
  2. Click Dataset Actions.
  3. Select Models.
  4. Select the following options as per requirement. A confirmation message will be displayed, and the respective tag will be deleted from the item details.
    1. Remove from Test Set.
    2. Remove from Train Set.
    3. Remove from Validation Set.

Rename a Dataset

  1. Navigate to the Data page using the left-side navigation.
  2. In the Datasets tab, find the dataset that you want to rename.
  3. Click on the Ellipsis (3-dots) icon and select the Rename Dataset option from the list. A Change Dataset Name pop-up is displayed.
  4. Edit the dataset name.
  5. Click Rename. A name change message is displayed.

Rename a Collection

  1. Open the Data Browser.
  2. In the left-side panel, click on the Collections icon located below the Folder icon.
  3. Hover over the collection to be renamed.
  4. Click on the three dots and select Rename from the list.
  5. Make the changes and press the Enter key.

Remove Items from a Collection

  1. Open the Data Browser.
  2. In the left-side panel, click on the Collections icon located below the Folder icon.
  3. Click on the collection containing the items you want to remove.
  4. Select the items, then right-click on them.
  5. Select Collections -> Remove From Collections option from the list.
  6. Select the specific collection from which you want to remove the items (if they belong to multiple collections).
  7. Click Remove. A successful deletion message will be displayed.

Remove Collections from Items

  1. Open the Data Browser.
  2. Select Item(s) from the browser.
  3. Right-click and select Collections -> Remove from Collections.
  4. Select the Collection(s) that are to be removed.
  5. Click Remove. A confirmation message is displayed.

Rename an Item

  1. In the Dataset Browser, select the item you want to rename.
  2. Click Dataset Actions.
  3. Select File Actions -> Rename.
  4. Edit the name, and click Rename. A confirmation message is displayed.

Run an Item with a FaaS or Pipeline

Run a selected item through a function from a running service (FaaS) or through a running pipeline.

  1. In the Dataset Browser, select the item.
  2. Click Dataset Actions.
  3. Select:
    1. Run with FaaS: It allows you to select a function to execute with the selected items.
    2. Run with Pipeline: It allows you to select a pipeline to execute with the selected items.
  4. Select a function or pipeline from the list.
  5. Click Execute. A confirmation message will be displayed.
Additional actions
  • Search functions by function name, project name, and service name.
  • Search pipelines by pipeline name.
  • Filter functions by public functions, project functions and all functions in the user’s projects.
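
Executions can also be launched from the Python SDK. In the sketch below, the service, function, and pipeline names are hypothetical, and the exact input parameters are assumptions to verify against the SDK reference:

# Run an item through a FaaS function and through a pipeline
import dtlpy as dl

item = dl.items.get(item_id='my-item-id')  # hypothetical item ID

service = dl.services.get(service_name='my-service')  # hypothetical service name
service.execute(function_name='run', item_id=item.id)  # assumption: the item input is passed by ID

pipeline = dl.pipelines.get(pipeline_name='my-pipeline')  # hypothetical pipeline name
pipeline.execute(execution_input={'item': item.id})  # assumption: input keyed by the pipeline's input name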

Automation Info and Warning Messages

The following information and warning messages are displayed when you run the item with a FaaS, Pipeline, or Model predictions.

  • When you select more than one item with a function/pipeline/model that takes a single item input: each item is executed separately, resulting in the creation of multiple executions.
  • When you select more than one item with a function/pipeline that takes an item[] input: all items are executed in a single execution.
  • When you select more than 1000 items with a function/pipeline that takes an item[] input: the functions are disabled, and a warning message is displayed stating that a function with item[] input cannot be executed with more than 1000 items in the list.

Show Hidden Files

  1. In the Dataset Browser, click on the Settings icon.
  2. Enable the Show Hidden Files option.

Hidden files and folders are marked with a hidden icon (crossed eye) in the corner, and their thumbnails are grayed out.


Split Items Into Subsets

The Split Items Into Subsets feature allows you to divide a dataset into multiple subsets, such as train, validation, and test, based on a specified distribution. Splitting is important for ensuring that the dataset is well prepared for machine learning or data analysis tasks. By default, the items are divided as follows:

  • Train set: 80% of the data, which is used to train the machine learning model.
  • Validation set: 10% of the data, which is used during training to fine-tune model hyperparameters and prevent overfitting.
  • Test set: 10% of the data, which is used to evaluate the final model performance after training.
  1. In the Dataset Browser, select one or more items.
  2. Click Dataset Actions.
  3. Select Models -> Split Into Subsets. The ML Data Split pop-up is displayed.
  4. Customize the distribution by moving the slider. By default, the items are divided as mentioned above.
  5. Click Split Data. A confirmation message is displayed, and the selected items are divided into respective subsets.
  6. Click on the ML Data Split section in the right-side panel to view the items' distribution.

For example, with the default distribution, items are split into subsets as follows:

Number of Items | Train Set | Validation Set | Test Set
1 | 1 | 0 | 0
2 | 2 | 0 | 0
3 | 3 | 0 | 0
4 | 4 | 0 | 0
5 | 5 | 0 | 0
6 | 5 | 1 | 0
7 | 6 | 1 | 0
8 | 7 | 1 | 0
9 | 8 | 1 | 0
10 | 8 | 1 | 1

Switch Recipe

Refer to the Switch Recipe article for more information.


Use SDK to Export Datasets in COCO/YOLO/VOC Formats

To export by SDK, refer to the Download Annotations in COCO/YOLO/VOC Format page.


Use SDK to Download Items

To learn how to download data using the SDK, read this tutorial.