Manage Datasets
- Updated On 05 Feb 2025
Overview
The Manage Datasets documentation provides a comprehensive guide on how to efficiently organize, structure, and manipulate datasets within the Dataloop platform. Datasets serve as the foundation for data annotation, machine learning model training, and AI-driven workflows. Dataloop provides the capability to perform a variety of dataset management actions, as described below.
Add Items to a Collection
You can create a new collection by selecting items from your dataset and adding them to a designated collection.
- Open the Data Browser.
- Select the items you want to add to a collection.
- Right-click on the selected items.
- Select Collections and choose your desired collection. The selected items will now be added to the chosen collection.
Add Custom Metadata
Adding custom metadata involves attaching additional information or tags to various types of data items. Custom metadata can be user-defined and is not limited to the predefined categories or attributes provided by the Dataloop platform.
To attach metadata to any entity, such as a Dataset, use the SDK's `update` function. To learn how to upload items with metadata, read here.

```python
# Example (assumes `dataset` was already retrieved via the SDK)
dataset.metadata["MyBoolean"] = True
dataset.metadata["MyContext"] = "Blue"  # string values must be quoted
dataset.update()
```
Display Custom Metadata
The datasets page provides a list of all the datasets present within the project. The table contains default columns, including dataset name, the count of items, the percentage of annotated items, and additional information.
To include and display columns with your custom context (metadata fields):
- From the Project Overview, click on Settings.
- Select Configuration.
- Select the Dataset Columns from the left-side menu.
- Click .
- Click .
- Enter the required information as follows.
- Name: A general name for this column (not visible outside the project settings).
- Label: The column header displayed on the Datasets page.
- Field: The Metadata field to map to this column.
- Configure the desired feature settings as needed:
- Link: If the field value is a URL and should open in a new tab, select this option.
- Resizable: Check this option if the column needs to be resizable, useful for displaying long values.
- Sortable: Enable this option to allow sorting the table by clicking the column header.
- Click . A successful message is displayed.
After completing the above steps, the Datasets table on the Datasets page will display the custom column and the data you've populated there.
- To ensure that any new data added via SDK is reflected, refresh the page.
- You can use the search box to search for datasets that match your search term, provided that the search term is included in any of the custom columns you've added to the table. This allows you to filter datasets based on the custom metadata you've defined.
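As an illustration of the search behavior described above (plain Python, not the Dataloop SDK; the dataset records and field name are hypothetical), it amounts to a case-insensitive substring match over custom column values:

```python
# Hypothetical dataset records with a custom metadata field "MyContext"
datasets = [
    {"name": "street-scenes", "metadata": {"MyContext": "Blue"}},
    {"name": "warehouse", "metadata": {"MyContext": "Red"}},
]

def search_by_custom_field(records, field, term):
    """Return names of datasets whose custom field contains the search term."""
    return [r["name"] for r in records
            if term.lower() in str(r["metadata"].get(field, "")).lower()]

print(search_by_custom_field(datasets, "MyContext", "blue"))  # ['street-scenes']
```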
Add a Dataset to an Existing Task
- In the Dataset Browser, click .
- Select Labeling Tasks -> Add to an Existing Task from the list.

When creating a task or model from the Dataset browser, it includes all items in the dataset.
Assign an Item to a Model's Test Datasets
You can assign the selected items to a model's test subset. When assigned, a Test tag is added to the item details.
- In the Dataset Browser, select the item.
- Click .
- Select Models -> Assign to Subset.
- Select Test. The Test Dataset is used to evaluate the performance of a trained model on new, unseen data.
Assign an Item to a Model's Train Datasets
You can assign the selected items to a model's train subset. When assigned, a Train tag is added to the item details.
- In the Dataset Browser, select the item.
- Click .
- Select Models -> Assign to Subset.
- Select Train. The Train Dataset is used to train the machine learning model, helping it learn patterns and make predictions.
Assign an Item to a Model's Validation Datasets
You can assign the selected items to a model's validation subset. When assigned, a Validation tag is added to the item details.
- In the Dataset Browser, select the item.
- Click .
- Select Models -> Assign to Subset.
- Select Validation. The Validation Dataset is used to fine-tune the model and optimize its hyperparameters, helping prevent overfitting.
Clone Datasets
Refer to the Clone Datasets article for more information.
Clone a Collection
- Open the Data Browser.
- In the left-side panel, click on the Collections icon located below the Folder icon.
- Hover over the collection you want to clone.
- Click on the three dots and select Clone from the list.
- Click to confirm the cloning process. The cloned collection will be created and named original_name-clone-1.
Clone an Item
- In the Dataset Browser, select the item you want to clone.
- Click .
- Select File Actions > Clone. Learn more about the cloning process.
Classify an Item
- In the Dataset Browser, select the item you want to classify.
- Click .
- Select File Actions > Classification from the list. Learn more about the classification.
Copy a Dataset ID
You can copy the Dataset ID by using one of the following options:
- Click on Dataset Details from the Dataset Browser page, then click the Copy icon next to the Dataset ID field.
- Select the Data page from the left-side panel, click the Ellipsis (three-dots) icon of the dataset, and select the Copy Dataset ID option from the list.
Create Collections
Creating Collections can be customized to match the requirements of your specific task, such as grouping items by type, project phase, or other relevant attributes.
Limitations:
- You can create up to 10 collection folders.
- Each item can be tagged in a maximum of 10 collections at once.
- Open the Data Browser.
- In the left-side panel, click on the Collections icon located below the Folder icon.

- Click on Create a Collection.
- Type your desired collection's name, and press the Enter key. The new collection will now be created and displayed in Collections.
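The two limitations above can be expressed as simple client-side checks. This is an illustrative plain-Python sketch, not part of the Dataloop platform or SDK:

```python
MAX_COLLECTIONS_PER_DATASET = 10  # up to 10 collection folders per dataset
MAX_COLLECTIONS_PER_ITEM = 10     # an item can be tagged in at most 10 collections

def can_create_collection(existing_collections):
    """True if the dataset still has room for one more collection."""
    return len(existing_collections) < MAX_COLLECTIONS_PER_DATASET

def can_tag_item(item_collections):
    """True if the item can be tagged in one more collection."""
    return len(item_collections) < MAX_COLLECTIONS_PER_ITEM
```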
Create a Dataset Using Dataloop Storage
Dataloop storage is the internal dataset storage of the Dataloop platform. Internal file storage allows you to store digital files, such as images, videos, audio, text files, and other data for the annotation process.
- Log in to the Dataloop platform.
- Select Data from the left-side panel.
- In the Datasets tab, click , or click on the down-arrow and select Create Dataset from the list. The Data Management Resource Creation right-side panel is displayed.
- Dataset Name: Enter a Name for the dataset.
- Recipe (Optional): Select a recipe from the list.
- Provider: Ensure that, by default, Dataloop is selected. If not, select the Dataloop option from the list.
- Click . The new dataset will be created.
Create a Dataset Based on an External Cloud Storage
Cloud storage services are online platforms that allow organizations to store and manage their data. Dataloop supports the following cloud storage services:
- Amazon Web Services (AWS) S3: Amazon S3 (Simple Storage Service) is a highly scalable, object storage service offered by AWS.
- Microsoft Azure Blob Storage: Microsoft Azure provides Blob Storage for storing and managing unstructured data. It integrates well with other Azure services.
- Google Cloud Storage: Google Cloud Storage is part of the Google Cloud Platform and offers object storage, archival storage, and data transfer services. It's often used alongside other GCP services.
To create a Dataset based on external cloud storage, the process requires:
- Create a Storage-Driver to connect to the cloud-storage resource. For more information, see the Storage Driver Overview.
- Create an integration. For more information, see the Integration Overview.

- Log in to the Dataloop platform.
- Select Data from the left-side panel.
- Select the Datasets tab, if it is not selected by default.
- Click , or click on the down-arrow and select Create Dataset from the list. The Data Management Resource Creation right-side panel is displayed.
- Dataset Name: Enter a Name for the dataset.
- Recipe (Optional): Select a recipe from the list.
- Provider: Select one of the following external providers from the list (Dataloop is selected by default):
- AWS
- GCP
- Azure
- Storage Driver: Select a Storage Driver from the list. If not available, create a new Storage Driver.
- Click .
Create a Task
- In the Dataset Browser, Click .
- Select Labeling Tasks -> Create a New Task from the list.

When creating a task or model from the Dataset browser, it includes all items in the dataset.
Delete a Collection
- Open the Data Browser.
- In the left-side panel, click on the Collections icon located below the Folder icon.
- Hover over the collection you want to delete.
- Click on the three dots and select Delete from the list.
- Click to confirm the deletion process.
Delete a Dataset
Dataloop provides the capability to delete datasets stored in both the internal file system and external cloud storage.
Deleting a dataset removes its items, along with any related tasks and assignments associated with that dataset.
To delete a dataset, perform the following instructions:
- Go to the Data page using the left-side navigation.
- In the Datasets tab, find the dataset that you want to delete.
- Click on the Ellipsis (3-dots) icon and select the Delete Dataset option from the list. A confirmation message is displayed.
- Click . A confirmation message indicating the successful deletion of the dataset is displayed.
Delete Annotations from an Item
- In the Dataset Browser, select the item whose annotations you want to delete.
- Click .
- Select File Actions > Delete Annotations. A confirmation message is displayed.
Delete Dataset Items
- In the Dataset Browser, select the item you want to delete.
- Click .
- Select File Actions > Delete Items.
- Click . A confirmation message is displayed.
Download Annotations
- In the Dataset Browser, select the item whose annotations you want to download.
- Click .
- Select Download Annotations from the list. The annotations of the selected file will be downloaded as a JSON file.
Download Files
- You can download up to 100 files per selection. To download more, use the SDK.
- Only users with a Developer or Owner role can download files.
- In the Dataset Browser, select the item(s) you want to export.
- Click .
- Select File Actions > Download Files. The selected item will be downloaded. For example, JPG image will be downloaded as a JPG file.
Download Items by Using Pipelines
A pipeline can include a phase for automatic data and metadata export to a connected location. This can be done as a function (FaaS) to export all data via a remote API or a connected Driver. For example, use a dataset node and a FaaS Node, and select an export function. For more information, refer to the Create Pipelines.
Export Items as JSON file
- In the Dataset Browser, select the item you want to export.
- Click .
- Select File Actions > Export JSON. The selected dataset or items will be exported as JSON files in a ZIP file, containing annotations if available. For example, a JPG image will be downloaded as a JSON file.
Export the Entire Dataset
Use either the Dashboard > Data Management widget, or the Data > Datasets tab to export the data:

- Select the dataset from the list.
- Click on the Ellipsis (3-dots) icon and select Download data from the list.
- Select the export Scope:
- Entire dataset: The ZIP file includes JSON files for all items in the dataset.
- (Optional) include the Annotations JSON files. By default, the Item JSON file is selected.
- Selecting the Annotations JSON files option enables you to Include PNG per semantic label.
- Click .
Export Datasets in COCO/YOLO/VOC Formats
The Dataset browser incorporates significant automation capabilities, enabling you to export dataset items in industry-standard formats through the following functions. Any function available within this application can be applied to selected items or an active query.
In addition to the Dataloop format, annotations can be exported in industry-standard formats. These are facilitated as functions deployed to the UI Slot of the Dataset-browser.
To learn more about the converters, their input, and output structures, visit their Git repo.
- COCO Converter: This tool is used to convert data annotations from other formats into the COCO (Common Objects in Context) format or vice versa. COCO is a popular dataset format for object detection, segmentation, and image captioning tasks.
- YOLO Converter: YOLO (You Only Look Once) is a popular object detection algorithm. A YOLO Converter is used to convert annotations between YOLO format and other annotation formats, making it easier to work with YOLO-based models and datasets.
- VOC Converter: VOC (Visual Object Classes) is another dataset format commonly used in computer vision tasks. A VOC Converter allows you to convert annotations between VOC format and other formats, facilitating compatibility with different tools and models.
Develop a custom converter and deploy it to a UI-Slot anywhere on the platform, or embed it as a Pipeline node. To learn more, contact Dataloop support.
- In the Dataset Browser, select the item(s).
- Click .
- Select Deployment Slot and select one of the following formats:
- COCO Converter.
- YOLO Converter.
- VOC Converter.
- A message is displayed: "the execution of function <global-converter> was created successfully, please check activity bell". A ZIP file will be created and downloaded.
Extract Embeddings from a Dataset
Extracting embeddings is the process of generating numerical representations (vectors) of data, such as text, images, or other types of content, in a lower-dimensional space. Dataloop allows you to extract embeddings using a model (trained, pre-trained, and deployed) from the Marketplace. These embeddings capture the essential features or meanings of the original data in a way that makes it easier for machine learning models to process and analyze.

- Navigate to the Data page using the left-side navigation.
- In the Datasets tab, find the dataset from which you want to extract embeddings.
- Click on the Ellipsis (3-dots) icon and select the Extract Embeddings option from the list. An Extract Embeddings pop-up with a list of all the models (trained, pre-trained, and deployed) is displayed.
- Select a Model from the Deployed section, or click on the Marketplace to install a new model.
- Choose the option Automatically run on new dataset items to enable automatic extraction for newly added items. This creates a trigger that automatically generates embeddings for the new items.
- Click . The Extracting Embeddings process will be initiated. Use the Notifications bell icon to view status.
Extract Embeddings from an Item(s)
- In the Dataset Browser, select the items from which you want to extract embeddings.
- Click and select the Models -> Extract Embeddings option from the list. An Extract Embeddings pop-up is displayed.

- Select a Model from the Deployed section, or click on the Marketplace to install a new model. If no models are available, click Install Model.
- Once the model is installed, click .
- Click to initiate the embeddings' extraction. Use the Notifications bell icon to view status.
Find Similar Items
- In the Dataset Browser, select the item for which you want to find similar items.
- Click .

- Select Similarity -> Find similar items.
- Click on the Feature Set name. The Clustering tab is displayed with the similar items selected.
Find Collections Using Smart Search
- Open the Data Browser.
- Click on the Items search field.
- Enter the query code metadata.system.collections.c0 = true, where c0 is the collection ID. The available collections will be listed in a dropdown.
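The query pattern follows directly from the collection ID. A trivial helper to build it (illustrative only, not a Dataloop SDK function):

```python
def collection_query(collection_id):
    """Build the smart-search query that matches items tagged in a collection."""
    return f"metadata.system.collections.{collection_id} = true"

print(collection_query("c0"))  # metadata.system.collections.c0 = true
```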
Generate Predictions with a Model
You can generate predictions on dataset items by using a trained or pre-trained model that is deployed.
- In the Dataset Browser, select the item.
- Click .
- Select Models > Predict.
- Search and select a trained and deployed model from the list.
- Click . A confirmation message is displayed.
- Search models by model name, project name, application name, and status.
- Use the filter to sort the models by scope and model status.
Generate Predictions by Using a Trained Model
You can use only trained and deployed models for generating predictions. To deploy a trained model, perform the following instructions:
- In the Dataset Browser, select the item.
- Click .
- Select Models > Predict.
- Identify the trained model, and click . The Model Version Deployment page is displayed.
- In the Deployment and Service Fields tabs, make changes in the available fields as needed.
- Click . A confirmation message is displayed.
Move Items to a Folder
- In the Dataset Browser, select the item you want to move.
- Click .
- Select File Actions > Move to Folder.
- Select a folder from the list.
- Click . A confirmation message is displayed.
Merge Datasets
Refer to the Merge Datasets article for more information.
Open an Item in a New Browser Tab
This allows you to view images, play audio files, and more in a new browser tab.
- In the Dataset Browser, select the item.
- Click .
- Select File Actions > Open File in New Tab. The selected file will be opened in a new browser tab.
Open an Item in a Specific Annotation Studio
- In the Dataset Browser, select the item.
- Click .
- Select File Actions > Open With.
- Select the Annotation Studio. The item will be opened in the annotation studio based on the type of the item, such as image, audio, video, etc.
Open an Item in the Annotation Studio
This opens the item in the default annotation studio for its type (image, audio, video, etc.). In the Dataset Browser, double-click the item to open it, or follow these steps:
- In the Dataset Browser, select the item.
- Click .
- Select File Actions > Open File in Studio. The selected file will be opened in the default annotation studio.
Perform Bulk Operations
The Dataset browser supports bulk operations within the specified context, such as Move to Folder, Export, Clone, and Classification. To carry out bulk operations:
- Manually select one or more items using Command (macOS) or Ctrl (Windows) + left-click.
- Perform the available actions for the selected items, such as Move to Folder, Export, Clone, or Classification.
Remove an Item from the Model Test, Train, or Validation Datasets
- In the Dataset Browser, select the item.
- Click .
- Select Models.
- Select one of the following options as required. A confirmation message is displayed, and the respective tag is removed from the item details.
- Remove from Test Set.
- Remove from Train Set.
- Remove from Validation Set.
Rename a Dataset
- Navigate to the Data page using the left-side navigation.
- In the Datasets tab, find the dataset that you want to rename.
- Click on the Ellipsis (3-dots) icon and select the Rename Dataset option from the list. A Change Dataset Name pop-up is displayed.
- Edit the dataset name.
- Click . A name change message is displayed.
Rename a Collection
- Open the Data Browser.
- In the left-side panel, click on the Collections icon located below the Folder icon.
- Hover over the collection you want to rename.
- Click on the three dots and select Rename from the list.
- Make the changes and press the Enter key.
Remove Items from a Collection
- Open the Data Browser.
- In the left-side panel, click on the Collections icon located below the Folder icon.
- Click on the collection containing the items you want to remove.
- Select the items, then right-click on them.
- Select Collections -> Remove From Collections option from the list.
- Select the specific collection from which you want to remove the items (if they belong to multiple collections).
- Click . A successful deletion message will be displayed.
Remove Collections from Items
- Open the Data Browser.
- Select Item(s) from the browser.
- Right-click and select Collections -> Remove from Collections.
- Select the Collection(s) that are to be removed.
- Click . A confirmation message is displayed.
Rename an Item
- In the Dataset Browser, select the item you want to rename.
- Click .
- Select File Actions -> Rename.
- Edit the name, and click . A confirmation message is displayed.
Run an Item with a FaaS or Pipeline
Send a selected item to a function from a running service (FaaS) or to a running pipeline.
- In the Dataset Browser, select the item.
- Click .
- Select:
- Run with FaaS: Select a function to execute on the selected items.
- Run with Pipeline: Select a pipeline to execute on the selected items.
- Select a function or pipeline from the list.
- Click . A confirmation message will be displayed.
- Search functions by function name, project name, and service name.
- Search pipelines by pipeline name.
- Filter functions by public functions, project functions and all functions in the user’s projects.
Automation Info and Warning Messages
The following information and warning messages are displayed when you run the item with a FaaS, Pipeline, or Model predictions.
- When you select more than one item for a function/pipeline/model with a single-item input: each item is executed separately, resulting in multiple executions.
- When you select more than one item for a function/pipeline with an item[] input: all items are executed together in a single execution.
- When you select more than 1000 items for a function/pipeline with an item[] input: functions with item[] input are disabled, and a warning message is displayed stating that a function with item[] input cannot be executed with more than 1000 items in the list.
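The three rules above can be summarized in a short sketch. This is illustrative plain Python; the function name and structure are hypothetical, not part of the platform:

```python
MAX_LIST_ITEMS = 1000  # platform limit for item[] inputs

def plan_executions(n_items, input_type):
    """Return how many executions a selection of n_items produces."""
    if input_type == "item":
        return n_items  # each item runs as a separate execution
    if input_type == "item[]":
        if n_items > MAX_LIST_ITEMS:
            raise ValueError(
                "A function with item[] input cannot be executed "
                "with more than 1000 items in the list")
        return 1  # all items run together in a single execution
    raise ValueError(f"Unknown input type: {input_type}")
```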
Show Hidden Files
- In the Dataset Browser, click on the Settings icon.
- Enable the Show Hidden Files option.

Hidden files show a hidden icon (crossed eye) in the corner of the item/folder, and their thumbnails are grayed out.
Split Items Into Subsets
The Split Data Into Subsets feature allows you to divide a dataset into multiple subsets, such as train, validation, and test, based on a specified distribution. Splitting is important for preparing the dataset for machine learning or data analysis tasks. By default, the items are divided as follows:
- Train set: 80% of the data, which is used to train the machine learning model.
- Validation set: 10% of the data, which is used during training to fine-tune model hyperparameters and prevent overfitting.
- Test set: 10% of the data, which is used to evaluate the final model performance after training.
- In the Dataset Browser, select one or more items.
- Click .
- Select Models -> Split Into Subsets. The ML Data Split pop-up is displayed.
- Customize the distribution by moving the slider. By default, the items are divided as mentioned above.
- Click . A confirmation message is displayed, and the selected items are divided into respective subsets.
- Click on the ML Data Split section in the right-side panel to view the items' distribution.
For example, with the default distribution, items are split into subsets as follows:
| Number of Items | Train Set | Validation Set | Test Set |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 2 | 0 | 0 |
| 3 | 3 | 0 | 0 |
| 4 | 4 | 0 | 0 |
| 5 | 5 | 0 | 0 |
| 6 | 5 | 1 | 0 |
| 7 | 6 | 1 | 0 |
| 8 | 7 | 1 | 0 |
| 9 | 8 | 1 | 0 |
| 10 | 8 | 1 | 1 |
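The example table can be reproduced by a simple rounding scheme. Note that this scheme is an assumption chosen to match the table values, not Dataloop's documented splitting algorithm:

```python
import math

def default_split(n, val_frac=0.1, test_frac=0.1):
    """Split n items into (train, validation, test) counts.

    Assumption: the test count rounds down, the validation count rounds
    to nearest, and train receives the remainder -- this reproduces the
    example table above but may differ from the platform's actual logic.
    """
    n_test = math.floor(n * test_frac)
    n_val = round(n * val_frac)
    n_train = n - n_val - n_test
    return n_train, n_val, n_test

print(default_split(10))  # (8, 1, 1)
```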
Switch Recipe
Refer to the Switch Recipe article for more information.
Use SDK to Export Datasets in COCO/YOLO/VOC Formats
To export by SDK, refer to the Download Annotations in COCO/YOLO/VOC Format page.
Use SDK to Download Items
To learn how to download data using the SDK, read this tutorial.