Manage datasets
  • 17 Feb 2025

Article summary

This article explains how to organize, structure, and manipulate datasets in the Dataloop platform. Datasets are the foundation for data annotation, machine learning model training, and AI-driven workflows. The dataset management actions that Dataloop supports are described below.


Dataset creation

Create a dataset using Dataloop storage

Dataloop storage is the platform's internal dataset storage. It lets you store digital files, such as images, videos, audio, and text files, along with other data for the annotation process.

  1. Log in to the Dataloop platform.
  2. Select Data from the left-side panel.
  3. In the Datasets tab, click Create Dataset, or click on the down-arrow and select Create Dataset from the list. The Data Management Resource Creation right-side panel is displayed.
  4. Dataset Name: Enter a Name for the dataset.
  5. Recipe (Optional): Select a recipe from the list.
  6. Provider: Dataloop is selected by default. If it is not, select Dataloop from the list.
  7. Click Create Dataset. The new dataset will be created.
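
The same dataset can also be created from code. The following is a minimal sketch, assuming the Dataloop Python SDK (`dtlpy`) and its `project.datasets.create(dataset_name=...)` call. The helper name `create_dataset` and the project and dataset names are hypothetical and used only for illustration.

```python
def create_dataset(project, name: str):
    """Create a dataset in Dataloop storage (the default provider).

    `project` is assumed to be a dtlpy Project entity. Any object that
    exposes `datasets.create(dataset_name=...)` works, which keeps this
    sketch runnable without a platform login.
    """
    name = name.strip()
    if not name:
        raise ValueError("Dataset name must not be empty")
    # Assumption: the SDK's datasets repository accepts `dataset_name`.
    return project.datasets.create(dataset_name=name)


# Hypothetical usage against the platform (requires `pip install dtlpy`):
#   import dtlpy as dl
#   if dl.token_expired():
#       dl.login()
#   project = dl.projects.get(project_name="my-project")
#   dataset = create_dataset(project, "my-dataset")
```

No recipe or provider is passed here, so the sketch relies on the platform defaults described in the steps above.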

Create a dataset based on an External Cloud Storage

Cloud storage services are online platforms that allow organizations to store and manage their data. Dataloop supports the following cloud storage services:

  • Amazon Web Services (AWS) S3: Amazon S3 (Simple Storage Service) is a highly scalable, object storage service offered by AWS.
  • Microsoft Azure Blob Storage: Microsoft Azure provides Blob Storage for storing and managing unstructured data. It integrates well with other Azure services.
  • Google Cloud Storage: Google Cloud Storage is part of the Google Cloud Platform and offers object storage, archival storage, and data transfer services. It's often used alongside other GCP services.

Prerequisites

Before you create a dataset based on external cloud storage, complete the following:

  1. Create a Storage Driver to connect to the cloud-storage resource. For more information, see the Storage Driver Overview.
  2. Create an integration. For more information, see the Integration Overview.

To create the dataset, follow these steps:

  1. Log in to the Dataloop platform.
  2. Select Data from the left-side panel.
  3. Select the Datasets tab, if it is not selected by default.
  4. Click Create Dataset, or click on the down-arrow and select Create Dataset from the list. The Data Management Resource Creation right-side panel is displayed.
  5. Dataset Name: Enter a Name for the dataset.
  6. Recipe (Optional): Select a recipe from the list.
  7. Provider: Select one of the following external providers from the list (Dataloop is selected by default):
    1. AWS
    2. GCP
    3. Azure
  8. Storage Driver: Select a Storage Driver from the list. If none is available, create a new Storage Driver.
  9. Click Create Dataset.

Dataset versioning

Dataloop lets you manage datasets and items by cloning, merging, and moving them, as well as by refining and segmenting your files.
You can clone datasets or items together with their annotations, their metadata, or both.

Important
  1. Item statuses, such as approved, completed, or discarded, are not cloned.
  2. Cloned datasets are generated using the same recipe as the original ones.
  3. Do not make any changes to items during the cloning process. This includes adding, editing, or deleting annotations, and moving items.

Clone a dataset

To clone an entire dataset, follow these instructions:

  1. From the left-side menu, go to Data.
  2. Find the desired dataset from the list, and click on the three-dots icon.
  3. Choose Clone Dataset from the list.
  4. In the Clone Dataset/Items window, choose whether to clone into an existing dataset or into a new one:
    1. Existing Dataset:
      1. Select a dataset from the list.
      2. Search for and select the folder within the dataset where you want to clone the dataset (root folder, subfolders, etc.).
    2. New Dataset:
      1. Enter a name for the new dataset.
  5. Choose your cloning options:
    1. Whether you want to clone with item annotations.
    2. Whether you want to clone with item metadata.
  6. Once you've configured your options, click Clone to initiate the cloning process. A confirmation message is displayed.
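
The two cloning options above map naturally onto keyword arguments when cloning from code. This is a minimal sketch, assuming the Dataloop Python SDK (`dtlpy`) and that its `Dataset.clone` method accepts `clone_name`, `with_items_annotations`, and `with_metadata`. The wrapper name `clone_dataset` and the dataset names are hypothetical.

```python
def clone_dataset(dataset, clone_name: str,
                  with_annotations: bool = True,
                  with_metadata: bool = True):
    """Clone `dataset` into a new dataset called `clone_name`.

    `dataset` is assumed to be a dtlpy Dataset entity. The two flags
    mirror the "with item annotations" and "with item metadata" options
    of the Clone Dataset/Items window.
    """
    # Assumption: these keyword names match the SDK's Dataset.clone.
    return dataset.clone(
        clone_name=clone_name,
        with_items_annotations=with_annotations,
        with_metadata=with_metadata,
    )


# Hypothetical usage (requires `pip install dtlpy` and a login):
#   import dtlpy as dl
#   project = dl.projects.get(project_name="my-project")
#   source = project.datasets.get(dataset_name="my-dataset")
#   copy = clone_dataset(source, "my-dataset-clone", with_metadata=False)
```

As noted in the Important box, avoid editing items while the clone is running.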

Clone a dataset's items

Dataloop lets you clone items into a target dataset in the following cases:

  • From internal storage (e.g., Dataloop cloud storage) to internal storage.
  • From external storage (e.g., S3) to external storage, provided that the target storage also uses the same storage driver (e.g., using the same integration secret and storage driver pointing at the same location).

To clone items, follow these steps:

  1. From the left portal menu, select Data.
  2. Click on the dataset in the list.
  3. Select one or more items, then right-click and select Clone, or select File Actions > Clone from the list.
  4. In the Clone Dataset/Items window, choose whether to clone into an existing dataset or into a new one:
    1. Existing Dataset:
      1. Select a dataset from the list.
      2. Search for and select the folder within the dataset where you want to clone the items (root folder, subfolders, etc.).
    2. New Dataset:
      1. Enter a name for the new dataset.
  5. Choose your cloning settings:
    1. Whether you want to clone with item annotations.
    2. Whether you want to clone with item metadata.
  6. Once you've configured your options, click Clone to initiate the cloning process.

Parent ID or dataset ID of the cloned items

After cloning an item, the metadata (JSON) of the cloned item will display both the parent item ID (srcItem) and parent dataset ID (srcDataset). However, in the Details tab, only the parent item ID is shown.
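
To read these two IDs from an item's JSON in a script, a small lookup helper is enough. This is a minimal sketch: only the key names `srcItem` and `srcDataset` come from the text above. Where they sit inside the JSON is not stated here, so the helper searches the whole structure. The function names, the nesting, and the ID values in the example are invented for illustration.

```python
from typing import Any, Optional


def find_key(node: Any, key: str) -> Optional[Any]:
    """Depth-first search for `key` in nested dicts and lists."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def clone_origin(item_json: dict) -> dict:
    """Return the parent item and dataset IDs of a cloned item, if present."""
    return {
        "parent_item_id": find_key(item_json, "srcItem"),
        "parent_dataset_id": find_key(item_json, "srcDataset"),
    }


# Hypothetical item JSON; the nesting shown here is an assumption.
example = {
    "id": "item-2",
    "metadata": {"system": {"srcItem": "item-1", "srcDataset": "dataset-1"}},
}
print(clone_origin(example))
# {'parent_item_id': 'item-1', 'parent_dataset_id': 'dataset-1'}
```

For an item that was not cloned, both values are `None`.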

Merge datasets

Dataloop provides the capability to merge datasets. The result of a merge depends on how the datasets relate to each other:

  • Cloned datasets: The items, annotations, and metadata of the clones are merged. Annotations from several datasets can then be associated with the same item, so you can view and work with the combined annotations on a single item. Items from cloned datasets can be merged only if they originate from the same master item, that is, the cloned items must reference the same source.
  • Different datasets (not clones) with similar recipes: The merged dataset contains the items of all merged datasets, and similar items appear as duplicates.
  • Datasets with different recipes: Datasets with different default recipes cannot be merged. To merge them, first use the Switch Recipe option at the dataset level (accessible through the ellipsis icon) to align the recipes.

To merge datasets, follow these steps:

  1. From the left portal menu, select Data.
  2. Choose the datasets you want to merge from the list.
  3. Click Merge Datasets.
  4. In the Merge Datasets window, enter a name for the newly merged dataset in the Dataset Name field.
  5. Indicate whether you wish to merge With Items Annotations? and/or With Items Metadata? (i.e., including information added by annotators).

Upon successful completion of the merge process, the newly created dataset will be listed with the Dataset type labeled as Merge.
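
A merge can also be scripted. The sketch below assumes the Dataloop Python SDK (`dtlpy`) and that its datasets repository exposes `merge(merge_name=..., dataset_ids=..., project_ids=..., with_items_annotations=..., with_metadata=...)`. Treat the exact parameter names, and the choice to pass the project ID as a one-element list, as assumptions to verify against the SDK reference. The wrapper name `merge_datasets` is hypothetical.

```python
def merge_datasets(project, datasets, merge_name: str,
                   with_annotations: bool = True,
                   with_metadata: bool = True):
    """Merge two or more datasets of `project` into a new dataset.

    `project` and each entry of `datasets` are assumed to be dtlpy
    entities exposing an `id` attribute.
    """
    if len(datasets) < 2:
        raise ValueError("Select at least two datasets to merge")
    # Assumption: keyword names follow the SDK's datasets.merge call.
    return project.datasets.merge(
        merge_name=merge_name,
        dataset_ids=[dataset.id for dataset in datasets],
        project_ids=[project.id],
        with_items_annotations=with_annotations,
        with_metadata=with_metadata,
    )


# Hypothetical usage (requires `pip install dtlpy` and a login):
#   import dtlpy as dl
#   project = dl.projects.get(project_name="my-project")
#   first = project.datasets.get(dataset_name="batch-1")
#   second = project.datasets.get(dataset_name="batch-2")
#   merged = merge_datasets(project, [first, second], "batch-1-2")
```

The recipe rule described above still applies: datasets with different default recipes need their recipes aligned before the merge.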


Copy a dataset ID

You can copy the Dataset ID by using one of the following options:

  • On the Dataset Browser page, click Dataset Details, and then click the Copy icon next to the Dataset ID field.
  • On the Data page (from the left-side panel), click the Ellipsis (three-dots) icon of the dataset and select Copy Dataset ID from the list.

Delete a dataset

Dataloop provides the capability to delete datasets stored in both the internal file system and external cloud storage.

Important

Deleting a dataset removes its items, as well as any tasks and assignments associated with the dataset.

To delete a dataset, follow these steps:

  1. Go to the Data page using the left-side navigation.
  2. In the Datasets tab, find the dataset that you want to delete.
  3. Click on the Ellipsis (3-dots) icon and select the Delete Dataset option from the list. A confirmation message is displayed.
  4. Click Yes. A confirmation message indicating the successful deletion of the dataset is displayed.
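
Because deletion also removes items, tasks, and assignments, a scripted delete benefits from its own confirmation step. This is a minimal sketch, assuming the Dataloop Python SDK (`dtlpy`) and its `Dataset.delete(sure=True, really=True)` safety flags. The wrapper `delete_dataset` and its retype-the-name check are hypothetical additions that stand in for the confirmation dialog.

```python
def delete_dataset(dataset, confirm_name: str) -> bool:
    """Delete `dataset` only if the caller retypes its exact name.

    Returns True when the delete call was issued, False otherwise.
    `dataset` is assumed to be a dtlpy Dataset entity.
    """
    if confirm_name != dataset.name:
        return False
    # Assumption: the SDK requires both flags to be True before deleting.
    dataset.delete(sure=True, really=True)
    return True


# Hypothetical usage (requires `pip install dtlpy` and a login):
#   import dtlpy as dl
#   project = dl.projects.get(project_name="my-project")
#   dataset = project.datasets.get(dataset_name="old-data")
#   delete_dataset(dataset, confirm_name="old-data")
```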

Download items by using Pipelines

A pipeline can include a phase that automatically exports data and metadata to a connected location. The export can run as a function (FaaS) that sends all data through a remote API or a connected Driver. For example, use a Dataset node and a FaaS node, and select an export function. For more information, refer to the Create Pipelines article.


Export the entire dataset

Use either the Dashboard > Data Management widget, or the Data > Datasets tab to export the data:

  1. Select the dataset from the list.
  2. Click on the Ellipsis (3-dots) icon and select Download data from the list.
  3. Select the export Scope:
    • Entire dataset: The ZIP file includes JSON files for all items in the dataset.
  4. (Optional) Select Annotations JSON files to include them in the export. By default, only the Item JSON file is selected.
  5. (Optional) If you selected Annotations JSON files, you can also select Include PNG per semantic label.
  6. Click Export.
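
Once the export ZIP is downloaded, its JSON files can be loaded with the Python standard library. This is a minimal sketch: the text above states only that the ZIP contains JSON files, so the function makes no assumption about folder layout and reads every `.json` member. The function name and the file names in the demo are invented for illustration.

```python
import io
import json
import zipfile


def load_exported_json(zip_source) -> dict:
    """Return {member name: parsed JSON} for every .json file in the ZIP.

    `zip_source` can be a file path or a binary file-like object.
    """
    records = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".json"):
                with archive.open(name) as handle:
                    records[name] = json.load(handle)
    return records


# Demo with an in-memory ZIP standing in for a real export.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("items/cat.jpg.json", json.dumps({"name": "cat.jpg"}))
    archive.writestr("notes.txt", "not JSON, so it is skipped")
buffer.seek(0)
print(sorted(load_exported_json(buffer)))
# ['items/cat.jpg.json']
```

For a real export, pass the path of the downloaded ZIP file instead of the in-memory buffer.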

Extract Embeddings from a dataset

Extracting embeddings is the process of generating numerical representations (vectors) of data, such as text, images, or other types of content, in a lower-dimensional space. Dataloop allows you to extract embeddings using a model (trained, pre-trained, and deployed) from the Marketplace. These embeddings capture the essential features or meanings of the original data in a way that makes it easier for machine learning models to process and analyze.

  1. Navigate to the Data page using the left-side navigation.
  2. In the Datasets tab, find the dataset from which you want to extract embeddings.
  3. Click on the Ellipsis (3-dots) icon and select the Extract Embeddings option from the list. An Extract Embeddings pop-up with a list of all the models (trained, pre-trained, and deployed) is displayed.

Or

  1. Select a dataset, click the down-arrow (next to Create Dataset), and select Extract Embeddings. This option is enabled only after you select a dataset.

Then, in the Extract Embeddings pop-up:

  1. Select a Model from the Deployed section, or click Marketplace to install a new model.
  2. Choose the option Automatically run on new dataset items to enable automatic extraction for newly added items. This creates a trigger that automatically generates embeddings for the new items.
  3. Click Embed. The Extracting Embeddings process will be initiated. Use the Notifications bell icon to view status.

Perform bulk operations

The Dataset browser supports bulk operations on selected items, such as Move to Folder, Export, Clone, and Classification. To carry out bulk operations:

  1. Manually select one or more items using the Command or Windows key + mouse left-click.
  2. Perform one of the available actions on the selected items, such as Move to Folder, Export, Clone, or Classification.

Rename a dataset

  1. Navigate to the Data page using the left-side navigation.
  2. In the Datasets tab, find the dataset that you want to rename.
  3. Click on the Ellipsis (3-dots) icon and select the Rename Dataset option from the list. A Change Dataset Name pop-up is displayed.
  4. Edit the dataset name.
  5. Click Rename. A name change message is displayed.

Switch recipe

Refer to the Switch Recipe article for more information.


