Manage Datasets

Overview

Dataloop lets you perform a variety of dataset management actions, which are described in the sections below. Additional actions are available directly from the dataset browser.


How to Create a Dataset Using Dataloop Storage?

Dataloop storage is the internal dataset storage of the Dataloop platform. It allows you to store digital files, such as images, videos, audio, text files, and other data for the annotation process.

  1. Log in to the Dataloop platform.
  2. Select Data from the left-side panel.
  3. Select the Datasets tab, if it is not selected by default.
  4. Click Create Dataset, or click on the down-arrow and select Create Dataset from the list. The Data Management Resource Creation right-side panel is displayed.
  5. Dataset Name: Enter a Name for the dataset.
  6. Recipe (Optional): Select a recipe from the list.
  7. Provider: Dataloop is selected by default. If not, select the Dataloop option from the list.
  8. Click Create Dataset. The new dataset will be created.
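
The same flow is available from the Python SDK. A minimal sketch, assuming the dtlpy package is installed and 'My Project' / 'My Dataset' are placeholder names:

import dtlpy as dl

# Log in if the cached token has expired
if dl.token_expired():
    dl.login()

# Get the project and create a dataset on Dataloop's internal storage
project = dl.projects.get(project_name='My Project')
dataset = project.datasets.create(dataset_name='My Dataset')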

How to Create a Dataset Based on an External Cloud Storage?

Cloud storage services are online platforms that allow organizations to store and manage their data. Dataloop supports the following cloud storage services:

  • Amazon Web Services (AWS) S3: Amazon S3 (Simple Storage Service) is a highly scalable, object storage service offered by AWS.
  • Microsoft Azure Blob Storage: Microsoft Azure provides Blob Storage for storing and managing unstructured data. It integrates well with other Azure services.
  • Google Cloud Storage: Google Cloud Storage is part of the Google Cloud Platform and offers object storage, archival storage, and data transfer services. It's often used alongside other GCP services.
Prerequisites

To create a dataset based on external cloud storage, you must first:

  1. Create a Storage Driver to connect to the cloud-storage resource. For more information, see the Storage Driver Overview.
  2. Create an integration. For more information, see the Integration Overview.

To create the dataset:

  1. Log in to the Dataloop platform.
  2. Select Data from the left-side panel.
  3. Select the Datasets tab, if it is not selected by default.
  4. Click Create Dataset, or click on the down-arrow and select Create Dataset from the list. The Data Management Resource Creation right-side panel is displayed.
  5. Dataset Name: Enter a Name for the dataset.
  6. Recipe (Optional): Select a recipe from the list.
  7. Provider: Select one of the following external providers from the list (Dataloop is selected by default):
    1. AWS
    2. GCP
    3. Azure
  8. Storage Driver: Select a Storage Driver from the list. If none is available, create a new Storage Driver.
  9. Click Create Dataset.
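
The same can be done with the Python SDK. A minimal sketch, assuming a storage driver named 'my-s3-driver' already exists in the project (all names are placeholders):

import dtlpy as dl

project = dl.projects.get(project_name='My Project')
# Fetch the existing storage driver that points at the external bucket
driver = project.drivers.get(driver_name='my-s3-driver')
# Create the dataset on top of that driver
dataset = project.datasets.create(dataset_name='My External Dataset', driver=driver)
# Sync to index the files already present in the external storage
dataset.sync()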

How to Rename a Dataset in the Dataset Browser?

  1. Navigate to the Data page using the left-side navigation.
  2. In the Datasets tab, find the dataset that you want to rename.
  3. Click on the Ellipsis (3-dots) icon and select the Rename Dataset option from the list. A Change Dataset Name pop-up is displayed.
  4. Edit the dataset name.
  5. Click Rename. A name change message is displayed.
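
Renaming is also possible with the Python SDK. A minimal sketch, assuming the project and dataset names are placeholders:

import dtlpy as dl

project = dl.projects.get(project_name='My Project')
dataset = project.datasets.get(dataset_name='Old Dataset Name')
# Change the name attribute and persist it
dataset.name = 'New Dataset Name'
dataset.update()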

How to Copy a Dataset ID in the Dataset Browser?

You can copy the Dataset ID by using one of the following options:

  • In the Dataset Browser, click Dataset Details and then click the Copy icon next to the Dataset ID field.
  • On the Data page (left-side panel), click the Ellipsis (three-dots) icon of the dataset and select Copy Dataset ID from the list.

How to Extract Embeddings from a Dataset?

Extracting embeddings is the process of generating numerical representations (vectors) of data, such as text, images, or other types of content, in a lower-dimensional space. Dataloop allows you to extract embeddings using a trained or pre-trained, deployed model from the Marketplace. These embeddings capture the essential features or meanings of the original data in a way that makes it easier for machine learning models to process and analyze.

  1. Navigate to the Data page using the left-side navigation.
  2. In the Datasets tab, find the dataset from which you want to extract embeddings.
  3. Click on the Ellipsis (3-dots) icon and select the Extract Embeddings option from the list. An Extract Embeddings pop-up with a list of all the models (trained, pre-trained, and deployed) is displayed.
  4. Select a Model from the Deployed section, or click on the Marketplace to install a new model.
  5. Choose the option Automatically run on new dataset items to enable automatic extraction for newly added items. This creates a trigger that automatically generates embeddings for the new items.
  6. Click Embed. The Extracting Embeddings process will be initiated. Use the Notifications bell icon to view status.

How to Extract Embeddings from an Item or Items?

  1. In the Dataset Browser, select the items from which you want to extract embeddings.
  2. Click Dataset Actions and select the Models -> Extract Embeddings option from the list. An Extract Embeddings pop-up is displayed.
  3. Select a Model from the Deployed section, or click on the Marketplace to install a new model. If there are no models available, click Install Model.
    1. Once the model is installed, click Deploy.
  4. Click Embed to initiate the embedding extraction. Use the Notifications bell icon to view the status.

How to Add Custom Metadata in the Dataset Browser?

Adding custom metadata involves attaching additional information or tags to various types of data items. Custom metadata can be user-defined and is not limited to the predefined categories or attributes provided by the Dataloop platform.

To attach metadata to any entity, such as Datasets, you can utilize the SDK's 'Update' function. To learn how to upload items with metadata, read here.

# Example: set custom metadata fields on a dataset, then persist the change
dataset.metadata["MyBoolean"] = True
dataset.metadata["Mycontext"] = "Blue"
dataset.update()
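
Continuing the example above, the same pattern applies to individual items. A minimal sketch, assuming '/folder/image.jpg' and the metadata keys are placeholder examples:

item = dataset.items.get(filepath='/folder/image.jpg')
# User metadata on items is conventionally nested under the 'user' key
item.metadata['user'] = {'Mycontext': 'Blue'}
item.update()

# Items can then be queried by that metadata field
filters = dl.Filters()
filters.add(field='metadata.user.Mycontext', values='Blue')
pages = dataset.items.list(filters=filters)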

Display Custom Metadata in the Dataset Table Columns

The datasets page provides a list of all the datasets present within the project. The table contains default columns, including dataset name, the count of items, the percentage of annotated items, and additional information.

To include and display columns with your custom context (metadata fields):

  1. From the Project Overview, click on Settings.
  2. Select Configuration.
  3. Select the Dataset Columns from the left-side menu.
  4. Click Update Setting.
  5. Click Add column.
  6. Enter the required information as follows.
    1. Name: A general name for this column (not visible outside the project settings).
    2. Label: The column header displayed on the Datasets page.
    3. Field: The Metadata field to map to this column.
  7. Configure the desired feature settings as needed:
    1. Link: If the field value is a URL and should open in a new tab, select this option.
    2. Resizable: Check this option if the column needs to be resizable, useful for displaying long values.
    3. Sortable: Enable this option to allow sorting the table by clicking the column header.
  8. Click Apply. A successful message is displayed.

After completing the above steps, the Datasets table on the Datasets page will display the custom column and the data you've populated there.

  • To ensure that any new data added via SDK is reflected, refresh the page.
  • You can use the search box to search for datasets that match your search term, provided that the search term is included in any of the custom columns you've added to the table. This allows you to filter datasets based on the custom metadata you've defined.

How to Create a Task in the Dataset Browser?

  1. In the Dataset Browser, click Dataset Actions.
  2. Select Labeling Tasks -> Create a New Task from the list.
Items to Task or Model

When you create a task or model from the Dataset Browser, all items in the dataset are included.


How to Add a Dataset to an Existing Task in the Dataset Browser?

  1. In the Dataset Browser, click Dataset Actions.
  2. Select Labeling Tasks -> Add to an Existing Task from the list.
Items to Task or Model

When you create a task or model from the Dataset Browser, all items in the dataset are included.


How to Rename an Item in the Dataset Browser?

  1. In the Dataset Browser, select the item you want to rename.
  2. Click Dataset Actions.
  3. Select File Actions > Rename.
  4. Edit the name and click Rename. A confirmation message is displayed.

How to Export Datasets or Items as JSON files from the Dataset Browser?

  1. In the Dataset Browser, select the item you want to export.
  2. Click Dataset Actions.
  3. Select File Actions > Export JSON. The selected dataset or items are exported as JSON files in a ZIP file, including annotations if available. For example, a JPG image is downloaded as a JSON file.

How to Export the Entire Dataset?

Use either the Dashboard > Data Management widget, or the Data > Datasets tab to export the data:

  1. Select the dataset from the list.
  2. Click on the Ellipsis (3-dots) icon and select Download data from the list.
  3. Select the export Scope:
    • Entire dataset: The ZIP file includes JSON files for all items in the dataset.
  4. (Optional) Include the Annotations JSON files. By default, only the Item JSON files are selected.
  5. Selecting the Annotations JSON files option enables you to Include PNG per semantic label.
  6. Click Export.
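
The dataset's annotation JSON files can also be exported with the Python SDK. A minimal sketch, assuming the dataset ID and local path are placeholders:

import dtlpy as dl

dataset = dl.datasets.get(dataset_id='your-dataset-id')  # placeholder ID
# Download the annotation JSON files for the whole dataset to a local folder
dataset.download_annotations(local_path='/path/to/export')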

How to Download Files from the Dataset Browser?

Download up to 100 files
  • You can download up to 100 files per selection. To download more, use the SDK (see the sketch below).
  • Only a Developer or Owner can download files.

  1. In the Dataset Browser, select the items you want to download.
  2. Click Dataset Actions.
  3. Select File Actions > Download Files. The selected items are downloaded in their original format. For example, a JPG image is downloaded as a JPG file.
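
To download more than 100 files, use the Python SDK. A minimal sketch, assuming the dataset ID, folder, and local path are placeholders:

import dtlpy as dl

dataset = dl.datasets.get(dataset_id='your-dataset-id')  # placeholder ID
# Optionally narrow the download with a filter, e.g. a specific folder
filters = dl.Filters()
filters.add(field='dir', values='/my-folder')
dataset.items.download(local_path='/path/to/downloads', filters=filters)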

How to Clone an Item in the Dataset Browser?

  1. In the Dataset Browser, select the item you want to clone.
  2. Click Dataset Actions.
  3. Select File Actions > Clone. Refer to the link for more information.
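
Cloning is also available from the Python SDK. A minimal sketch, assuming the IDs and file paths are placeholders:

import dtlpy as dl

dataset = dl.datasets.get(dataset_id='your-dataset-id')  # placeholder ID
item = dataset.items.get(filepath='/folder/image.jpg')
# Clone the item (with its annotations) into a destination dataset
item.clone(dst_dataset_id=dataset.id,
           remote_filepath='/clones/image.jpg',
           with_annotations=True)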

How to Classify an Item in the Dataset Browser?

  1. In the Dataset Browser, select the item you want to classify.
  2. Click Dataset Actions.
  3. Select File Actions > Classification from the list. For more information, see the Classification article.

How to Move Items to a Folder in the Dataset Browser?

  1. In the Dataset Browser, select the item you want to move.
  2. Click Dataset Actions.
  3. Select File Actions > Move to Folder.
  4. Select a folder from the list.
  5. Click Move. A confirmation message is displayed.
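
Moving an item is also possible with the Python SDK. A minimal sketch, assuming the dataset ID and paths are placeholders:

import dtlpy as dl

dataset = dl.datasets.get(dataset_id='your-dataset-id')  # placeholder ID
item = dataset.items.get(filepath='/folder/image.jpg')
# Move the item to a different remote folder within the dataset
item.move(new_path='/target-folder/image.jpg')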

How to Open an Item in a New Browser Tab in the Dataset Browser?

This allows you to view images, play audio files, and so on, in a new browser tab.

  1. In the Dataset Browser, select the item.
  2. Click Dataset Actions.
  3. Select File Actions > Open File in New Tab. The selected file will be opened in a new browser tab.

How to Open an Item in a Specific Annotation Studio Version from the Dataset Browser?

  1. In the Dataset Browser, select the item.
  2. Click Dataset Actions.
  3. Select File Actions > Open With.
  4. Select the Annotation Studio. The item will be opened in the annotation studio based on the type of the item, such as image, audio, video, etc.

How to Open an Item in the Annotation Studio from the Dataset Browser?

In the Dataset Browser, double-click an item to open it in the default annotation studio for its type (image, audio, video, etc.), or follow these steps:

  1. In the Dataset Browser, select the item.
  2. Click Dataset Actions.
  3. Select File Actions > Open File in Studio. The selected file will be opened in the default annotation studio.

How to Show Hidden Files in the Dataset Browser?

  1. In the Dataset Browser, click on the Settings icon.
  2. Enable the Show Hidden Files option.

Hidden items and folders are marked with a hidden icon (crossed eye) in the corner of the thumbnail, and the thumbnail is grayed out.


How to Delete Annotations from an Item in the Dataset Browser?

  1. In the Dataset Browser, select the item whose annotations you want to delete.
  2. Click Dataset Actions.
  3. Select File Actions > Delete Annotations. A confirmation message is displayed.
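
Annotations can also be deleted with the Python SDK. A minimal sketch, assuming the dataset ID and item path are placeholders:

import dtlpy as dl

dataset = dl.datasets.get(dataset_id='your-dataset-id')  # placeholder ID
item = dataset.items.get(filepath='/folder/image.jpg')
# Delete every annotation on the item
for annotation in item.annotations.list():
    annotation.delete()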

How to Delete Items in the Dataset Browser?

  1. In the Dataset Browser, select the item you want to delete.
  2. Click Dataset Actions.
  3. Select File Actions > Delete Items.
  4. Click Yes. A confirmation message is displayed.
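
Items can also be deleted with the Python SDK. A minimal sketch, assuming the dataset ID and file path are placeholders:

import dtlpy as dl

dataset = dl.datasets.get(dataset_id='your-dataset-id')  # placeholder ID
# Delete a single item by its remote path; a dl.Filters query can be passed instead for bulk deletion
dataset.items.delete(filename='/folder/image.jpg')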

How to Export Datasets in COCO/YOLO/VOC Formats from the Dataset Browser?

The Dataset browser incorporates significant automation capabilities, enabling you to export dataset items in industry-standard formats through the following functions. Any function available within this application can be applied to selected items or an active query.

In addition to the Dataloop format, annotations can be exported in industry-standard formats. These are facilitated as functions deployed to the UI Slot of the Dataset-browser.

To learn more about the converters, their input, and output structures, visit their Git repo.

  • COCO Converter: This tool is used to convert data annotations from other formats into the COCO (Common Objects in Context) format or vice versa. COCO is a popular dataset format for object detection, segmentation, and image captioning tasks.
  • YOLO Converter: YOLO (You Only Look Once) is a popular object detection algorithm. A YOLO Converter is used to convert annotations between YOLO format and other annotation formats, making it easier to work with YOLO-based models and datasets.
  • VOC Converter: VOC (Visual Object Classes) is another dataset format commonly used in computer vision tasks. A VOC Converter allows you to convert annotations between VOC format and other formats, facilitating compatibility with different tools and models.
Info

Develop a custom converter and deploy it to a UI-Slot anywhere on the platform, or embed it as a Pipeline node. To learn more, contact Dataloop support.

  1. In the Dataset Browser, select the items you want to export.
  2. Click Dataset Actions.
  3. Select Deployment Slot and select one of the following formats. A message indicates that the execution of the <global-converter> function was created successfully; check the activity bell. A ZIP file will be created and downloaded.
    1. COCO Converter.
    2. YOLO Converter.
    3. VOC Converter.

How to Export Datasets in COCO/YOLO/VOC Formats from the Dataset Browser by SDK?

To export by SDK, refer to the Download Annotations in COCO/YOLO/VOC Format page.

How to Use an Item to Generate Predictions with a Model in the Dataset Browser?

You can use dataset items to generate predictions by using a trained or pre-trained model that has been deployed.

  1. In the Dataset Browser, select the item.
  2. Click Dataset Actions.
  3. Select Models > Predict.
  4. Search and select a trained and deployed model from the list.
  5. Click Predict. A confirmation message is displayed.
Additional actions
  • Search models by model name, project name, application name, and status.
  • Use the filter to sort the models by scope and model status.
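
Predictions can also be triggered from the Python SDK. A minimal sketch, assuming the model is already deployed and the IDs are placeholders:

import dtlpy as dl

model = dl.models.get(model_id='your-model-id')  # placeholder ID; must be a deployed model
item = dl.items.get(item_id='your-item-id')      # placeholder ID
# Run prediction on the selected item; an execution is returned for tracking
execution = model.predict(item_ids=[item.id])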

How to Generate Predictions by Using a Trained Model from the Dataset Browser?

You can use only trained and deployed models for generating predictions. To deploy a trained model, perform the following instructions:

  1. In the Dataset Browser, select the item.
  2. Click Dataset Actions.
  3. Select Models > Predict.
  4. Identify the trained model and click Deploy. The Model Version Deployment page is displayed.
  5. In the Deployment and Service Fields tabs, make changes in the available fields as needed.
  6. Click Deploy. A confirmation message is displayed.

How to Download Data Items by Using SDK?

To learn how to download data using the SDK, read this tutorial or see the minimal sketch below.
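
A minimal sketch, assuming the dataset ID and local path are placeholders:

import dtlpy as dl

dataset = dl.datasets.get(dataset_id='your-dataset-id')  # placeholder ID
# Download the items together with their annotation JSON files
dataset.items.download(local_path='/path/to/downloads',
                       annotation_options=dl.ViewAnnotationOptions.JSON)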

How to Download Data Items by Using Pipeline?

A pipeline can include a phase for automatic data and metadata export to a connected location. This can be done as a function (FaaS) to export all data via a remote API or a connected Driver. For example, use a dataset node and a FaaS Node, and select an export function. For more information, refer to the Create Pipelines.


How to Split Items Into Subsets: Test, Train, or Validation in the Dataset Browser?

The Split Data Into Subsets feature allows you to divide a dataset into multiple subsets, such as train, validation, and test, based on a specified distribution. Splitting is important for ensuring that the dataset is well-prepared for machine learning or data analysis tasks. By default, the items are divided as follows:

  • Train set: 80% of the data, which is used to train the machine learning model.
  • Validation set: 10% of the data, which is used during training to fine-tune model hyperparameters and prevent overfitting.
  • Test set: 10% of the data, which is used to evaluate the final model performance after training.

To split items into subsets:

  1. In the Dataset Browser, select one or more items.
  2. Click Dataset Actions.
  3. Select Models -> Split Into Subsets. The ML Data Split pop-up is displayed.
  4. Customize the distribution by moving the slider. By default, the items are divided as mentioned above.
  5. Click Split Data. A confirmation message is displayed, and the selected items are divided into respective subsets.
  6. Click on the ML Data Split section in the right-side panel to view the items' distribution.

For example, with the default distribution, items are split into subsets as follows:

Number of Items | Train Set | Validation Set | Test Set
1               | 1         | 0              | 0
2               | 2         | 0              | 0
3               | 3         | 0              | 0
4               | 4         | 0              | 0
5               | 5         | 0              | 0
6               | 5         | 1              | 0
7               | 6         | 1              | 0
8               | 7         | 1              | 0
9               | 8         | 1              | 0
10              | 8         | 1              | 1

How to Assign an Item to a Model Test, Train, or Validation Datasets in the Dataset Browser?

You can assign the selected items to model datasets, such as test, train, and validation. When you assign, a tag (Test, Train, or Validation) will be added to the item details.

  1. In the Dataset Browser, select the item.
  2. Click Dataset Actions.
  3. Select Models -> Assign to Subset.
  4. Select one of the following options as required:
    1. Test. The Test Dataset is used to evaluate the performance of a trained model on new, unseen data.
    2. Train. The Train Dataset is used to train the machine learning model, helping it learn patterns and make predictions.
    3. Validation. The Validation Dataset is used to fine-tune the model and optimize its hyperparameters, helping prevent overfitting.

How to Remove an Item from the Model Test, Train, or Validation Datasets in the Dataset Browser?

  1. In the Dataset Browser, select the item.
  2. Click Dataset Actions.
  3. Select Models.
  4. Select one of the following options as required. A confirmation message will be displayed, and the respective tag will be deleted from the item details.
    1. Remove from Test Set.
    2. Remove from Train Set.
    3. Remove from Validation Set.

How to Run an Item with a FaaS or Pipeline in the Dataset Browser?

Run selected items through a function from a running service (FaaS) or through a running pipeline.

  1. In the Dataset Browser, select the item.
  2. Click Dataset Actions.
  3. Select:
    1. Run with FaaS: It allows you to select a function to execute with the selected items.
    2. Run with Pipeline: It allows you to select a pipeline to execute with the selected items.
  4. Select a function or pipeline from the list.
  5. Click Execute. A confirmation message will be displayed.
Additional actions
  • Search functions by function name, project name, and service name.
  • Search pipelines by pipeline name.
  • Filter functions by public functions, project functions and all functions in the user’s projects.
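
Executions can also be created with the Python SDK. A minimal sketch, assuming the item ID, service name, function name, and pipeline name are placeholders:

import dtlpy as dl

item = dl.items.get(item_id='your-item-id')                # placeholder ID

# Run a function from a running service (FaaS) on the item
service = dl.services.get(service_name='my-service')       # placeholder name
execution = service.execute(function_name='run', item_id=item.id)

# Or run a pipeline on the item
pipeline = dl.pipelines.get(pipeline_name='my-pipeline')   # placeholder name
pipeline_execution = pipeline.execute(execution_input={'item': item.id})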

Automation Info and Warning Messages

The following information and warning messages are displayed when you run the item with a FaaS, Pipeline, or Model predictions.

  • When you select more than one item for a function/pipeline/model with a single-item input: each item is executed separately, resulting in multiple executions.
  • When you select more than one item for a function/pipeline with an item[] input: all items are executed in a single execution.
  • When you select more than 1000 items for a function/pipeline with an item[] input: the functions with the item[] input are disabled, and a warning message indicates that such functions cannot be executed with more than 1000 items in the list.

How to Download Annotations in the Dataset Browser?

  1. In the Dataset Browser, select the item.
  2. Click Dataset Actions.
  3. Select Download Annotations from the list. The annotations of the selected file will be downloaded as a JSON file.

How to Perform Bulk Operations in the Dataset Browser?

The Dataset browser facilitates bulk operations within the specified context. To carry out bulk operations:

  1. Manually select one or more items using Ctrl (Windows) or Command (Mac) + left-click.
  2. Perform the available actions for the items, such as Move to Folder, Export, Clone, Classification, etc.
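
Bulk operations can also be scripted with the Python SDK by iterating over a filtered query. A minimal sketch, assuming the dataset ID, folder, and metadata field are placeholder examples:

import dtlpy as dl

dataset = dl.datasets.get(dataset_id='your-dataset-id')  # placeholder ID
filters = dl.Filters()
filters.add(field='dir', values='/my-folder')
pages = dataset.items.list(filters=filters)
# Apply the same change to every matched item, e.g. tag it with custom metadata
for item in pages.all():
    item.metadata['user'] = {'reviewed': True}  # hypothetical field
    item.update()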

How to Create Collections?

Collections can be customized to match the requirements of your specific task, such as grouping items by type, project phase, or other relevant attributes.

Limitations:
- You can create up to 10 collection folders.
- Each item can be tagged in a maximum of 10 collections at once.

  1. Open the Data Browser.
  2. In the left-side panel, click on the Collections icon located below the Folder icon.
  3. Click on Create New Collection.
  4. Type your desired collection name and press the Enter key. The new collection will now be created and displayed in Collections.

How to Add Items to a Collection?

You can add items from your dataset to a designated collection.

  1. Open the Data Browser.
  2. Select the items you want to add to a collection.
  3. Right-click on the selected items.
  4. Select Collections and choose your desired collection. The selected items will now be added to the chosen collection.

How to Rename a Collection?

  1. Open the Data Browser.
  2. In the left-side panel, click on the Collections icon located below the Folder icon.
  3. Hover over the collection you want to rename.
  4. Click on the three dots and select Rename from the list.
  5. Make the changes and press the Enter key.

How to Find Collections Using Smart Search?

  1. Open the Data Browser.
  2. Click on the Items search field.
  3. Enter the query metadata.system.collections.c0 = true, where c0 is the collection ID. The available collections are listed in a dropdown.
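
The same query can be expressed with the Python SDK. A minimal sketch, assuming the dataset ID is a placeholder and c0 is the collection ID from the query above:

import dtlpy as dl

dataset = dl.datasets.get(dataset_id='your-dataset-id')  # placeholder ID
filters = dl.Filters()
# Replace 'c0' with your own collection ID
filters.add(field='metadata.system.collections.c0', values=True)
pages = dataset.items.list(filters=filters)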

How to Clone a Collection?

  1. Open the Data Browser.
  2. In the left-side panel, click on the Collections icon located below the Folder icon.
  3. Hover over the collection you want to clone.
  4. Click on the three dots and select Clone from the list.
  5. Click Yes to confirm the cloning process. The cloned collection will be created and named as original_name-clone-1.

How to Remove Items from a Collection?

  1. Open the Data Browser.
  2. In the left-side panel, click on the Collections icon located below the Folder icon.
  3. Click on the collection containing the items you want to remove.
  4. Select the items, then right-click on them.
  5. Select Collections -> Remove From Collections option from the list.
  6. Select the specific collection from which you want to remove the items (if they belong to multiple collections).
  7. Click Remove. A successful deletion message will be displayed.

How to Delete a Collection?

  1. Open the Data Browser.
  2. In the left-side panel, click on the Collections icon located below the Folder icon.
  3. Hover over the collection you want to delete.
  4. Click on the three dots and select Delete from the list.
  5. Click Yes to confirm the deletion process.

How to Delete a Dataset in the Dataset Browser?

Dataloop provides the capability to delete datasets stored in both the internal file system and external cloud storage.

To delete a dataset, perform the following instructions:

Important

Deleting a dataset removes the following:

  • The items in the dataset.
  • Any related tasks and assignments associated with that dataset.

  1. Go to the Data page using the left-side navigation.
  2. In the Datasets tab, find the dataset that you want to delete.
  3. Click on the Ellipsis (3-dots) icon and select the Delete Dataset option from the list. A confirmation message is displayed.
  4. Click Yes. A confirmation message indicating the successful deletion of the dataset is displayed.
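
Datasets can also be deleted with the Python SDK. A minimal sketch, assuming the project and dataset names are placeholders:

import dtlpy as dl

project = dl.projects.get(project_name='My Project')
# Both confirmation flags must be set to True for the deletion to run
project.datasets.delete(dataset_name='My Dataset', sure=True, really=True)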