Overview
A dataset is a collection of Items (files), along with their metadata and annotations. It can have a file-system-like structure, with folders and subfolders at any level. A dataset can be mapped to a Driver (derived from an Integration) so that it contains items synced from external cloud storage. Cloning and merging are examples of dataset versioning operations.
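The same entities are accessible programmatically through the Dataloop Python SDK (dtlpy). Below is a minimal sketch for connecting and fetching a dataset; the project and dataset names are placeholders, and the calls reflect common dtlpy usage rather than an exhaustive reference.

```python
import dtlpy as dl

# Authenticate (opens a browser login if the cached token has expired)
if dl.token_expired():
    dl.login()

# Fetch an existing project and one of its datasets by name (placeholder names)
project = dl.projects.get(project_name='My Project')
dataset = project.datasets.get(dataset_name='My Dataset')

print(dataset.name, dataset.id)
```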
Section 1: Datasets tab
In the Datasets tab, datasets are displayed in a list view. You can search and filter them using the following fields and criteria:
- To search: You can search datasets by Dataset Name.
- To Filter: You can filter the listed datasets by the following criteria:
- Type: The type of the datasets, whether the dataset is cloned, merged, or the original (master).
- Master
- Clone
- Merge
- Provider: The available storage providers for the datasets.
- Dataloop
- AWS
- GCP
- Azure
- Driver Type: The type of driver used from the storage provider.
- File System
- S3 Bucket
- GCS Bucket
- Blob Storage
- Data Lake Storage Gen2
- Select Creators: It allows you to filter datasets based on the creator.
List of Fields
The column values are populated according to the datasets.
| Column Name | Description |
|---|---|
| Provider | It displays the name of the storage provider. |
| Dataset Name | The name of the dataset. Clicking on it opens the Data Browser page. |
| Items | The number of items available in the dataset. |
| Feature Sets | The number of Feature Sets available in the dataset. |
| Annotated | It displays the percentage of items that are annotated. |
| Type | It displays the type of the dataset, whether it is master (original), cloned, or merged. |
| Driver Type | It displays the name of the storage driver type. |
| Open Tasks | It displays the number of open tasks. |
| Created at | The creation date of the dataset. |
| Created by | The avatar of the user who created the dataset. Hover over it to see the user's email address. |
Clicking on a dataset displays the following features of the dataset:
The Dataset page provides access to all Datasets in the project. Datasets are listed in a customizable table:
- Show/hide standard columns according to fields used.
- Add custom columns to better manage datasets.
Custom Dataset Fields
You can add your context to Datasets to manage them in your projects according to your needs. Context is added as user Meta-Data in the Dataset entity (any Meta-data field outside the System area). These fields can then be reflected as columns in the Datasets page, presenting the information and context, allowing for Datasets to be sorted and filtered by these fields.
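Via the SDK, this context is written into the dataset's user metadata (outside the System area). A minimal sketch, assuming the common dtlpy pattern of editing `dataset.metadata` and persisting with `update()`; the `user` key and the field names are illustrative.

```python
import dtlpy as dl

project = dl.projects.get(project_name='My Project')       # placeholder
dataset = project.datasets.get(dataset_name='My Dataset')   # placeholder

# Add user (non-system) metadata fields that can be surfaced as custom columns
dataset.metadata.setdefault('user', {})
dataset.metadata['user']['team'] = 'vision-qa'       # illustrative field
dataset.metadata['user']['priority'] = 'high'        # illustrative field
dataset.update()
```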
Section 2: Total Numbers of Datasets, Drivers, Feature Sets, and Items
You can view the number of Datasets, Storage Drivers, Feature Sets, and the total number of items available in all the datasets.
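Similar totals can be derived with the SDK by listing the project's datasets and storage drivers. This sketch assumes the dtlpy `project.datasets` and `project.drivers` repositories and an `items_count` attribute on the Dataset entity; the summed item count is illustrative and may differ slightly from the UI aggregate.

```python
import dtlpy as dl

project = dl.projects.get(project_name='My Project')  # placeholder name

datasets = project.datasets.list()
drivers = project.drivers.list()

print(f'Datasets: {len(datasets)}')
print(f'Storage drivers: {len(drivers)}')

# items_count is assumed to be populated on each Dataset entity
print(f'Total items: {sum(ds.items_count for ds in datasets)}')
```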
Section 3: Dataset Actions
This page allows you to execute various tasks specific to your datasets. The following actions are available without opening each dataset's detailed pages.
- Merge Datasets: It allows you to merge two or more datasets after entering the necessary details in the Merge Datasets window.
- Upload Items: Clicking on the Upload Items icon allows you to upload files and folders to the selected dataset.
- Dataset Recipe: Clicking on the Dataset Recipe icon opens the Recipe page, where you can make changes.
- Dataset Analytics: Clicking on the Dataset Analytics icon allows you to open and view the Analytics page of the selected dataset.
When you click on the Ellipsis (three dots) icon, the following options are displayed. Clicking on the link provides you with more information.
- Rename Dataset: It allows you to rename the dataset.
- Copy Dataset ID: It allows you to copy the ID of the dataset.
- Download Data: It allows you to download the dataset after entering necessary details on the Export window.
- Clone Dataset: It allows you to clone the selected dataset after entering the necessary details in the Clone Datasets/Items window.
- Open Annotation Studio: It opens the annotation studio based on the item type, including audio, video, image, etc.
- Extract Embeddings: It allows you to extract embeddings from the selected dataset.
- Switch Recipe: It allows you to select a new recipe for your dataset.
- Rescan Cloud Storage: It allows you to sync data between your cloud storage driver and Dataloop's storage.
- Delete Dataset: It allows you to delete the selected dataset.
Learn more about managing datasets.
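Most of these actions also have SDK counterparts. The following is a hedged sketch of a few of them; the argument names follow common dtlpy usage and should be verified against the SDK reference, and the project/dataset names are placeholders.

```python
import dtlpy as dl

project = dl.projects.get(project_name='My Project')       # placeholder
dataset = project.datasets.get(dataset_name='My Dataset')   # placeholder

# Upload Items: a local file or folder into the dataset
dataset.items.upload(local_path='/path/to/local/folder')

# Copy Dataset ID
print(dataset.id)

# Rename Dataset (assumed pattern: mutate the entity, then update it)
dataset.name = 'My Dataset - renamed'
dataset.update()

# Delete Dataset (dtlpy requires explicit confirmation flags)
dataset.delete(sure=True, really=True)
```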
Dataset Types
Derived from its data-versioning operations, there are different types of Datasets:
- Master: Original dataset that manages the actual binaries.
- Clone: Contains pointers to original files, enabling management of virtual items that do not replicate the binaries of the underlying storage once cloned or copied. When you clone a dataset, you can decide whether the new copy will contain metadata and annotations created over the original.
- Merge: Multiple datasets can be merged into one, which enables multiple annotations to be merged onto the same item.
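Cloning is also available from the SDK. A minimal sketch, assuming a `dataset.clone` method with the flags shown below for carrying annotations and metadata into the clone; parameter names should be checked against the SDK reference.

```python
import dtlpy as dl

project = dl.projects.get(project_name='My Project')       # placeholder
dataset = project.datasets.get(dataset_name='My Dataset')   # placeholder

# Clone the dataset: the clone points to the original binaries, and you
# decide whether annotations and metadata are carried over.
cloned = dataset.clone(clone_name='My Dataset - clone',
                       with_items_annotations=True,
                       with_metadata=True)
print(cloned.name, cloned.id)
```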
The Binaries dataset visible in your Dataloop project is a system-generated dataset designed for storing binary files associated with the project, such as model binaries. While this dataset is created automatically and is not intended for direct user interaction, it can be viewed through the SDK or API.
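Although it is hidden in the UI, this system dataset can be fetched with the SDK like any other dataset. A sketch, assuming the dataset is retrievable by the name 'Binaries' (verify the exact name in your project):

```python
import dtlpy as dl

project = dl.projects.get(project_name='My Project')  # placeholder
# Assumption: the system-generated dataset is named 'Binaries'
binaries = project.datasets.get(dataset_name='Binaries')
print(binaries.id)
```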
Storage Providers
Connect your data to the Dataloop storage system without copying it, keeping a single source of truth for your files and complying with various regulations.
The Dataloop platform has a flexible storage engine, which enables you to attach different binary storage providers, such as:
- Cloud storage services (External):
  - AWS
  - GCP
  - Azure
- Dataloop's Storage (Internal)
Limitations & Considerations
- Empty folders synced from external storage will not be shown in the dataset.
- Moving or renaming files in the external storage will result in new instances (duplications) on the next sync. This can be prevented by not working directly on files in the external storage, or by setting up 'upstream sync' to reflect such changes in the dataset.
- New files generated by the Dataloop platform are saved inside the bucket used in the storage-driver, under the folder "/.dataloop" (including video thumbnails, .webm files, snapshots, etc.).
- Items on the Dataloop platform cannot be renamed or moved when they originate from external storage.
- Items can only be cloned from internal storage (i.e., the Dataloop's File System) to internal storage or from external storage to the same external storage.
Create Datasets
Dataloop allows you to create datasets on the platform based on the following storage options:
Create a Dataset:
- Based on Dataloop's storage.
- Based on Cloud Storage.
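Both options can also be exercised from the SDK. In this sketch the first call creates a dataset on Dataloop's internal storage; the second assumes an existing storage driver passed via a `driver` argument, which is an assumption to verify against the SDK reference (driver and dataset names are placeholders).

```python
import dtlpy as dl

project = dl.projects.get(project_name='My Project')  # placeholder

# 1. Dataset backed by Dataloop's internal storage
internal = project.datasets.create(dataset_name='my-internal-dataset')

# 2. Dataset backed by an existing cloud-storage driver
#    ('my-s3-driver' is a placeholder; the 'driver' parameter is assumed)
external = project.datasets.create(dataset_name='my-external-dataset',
                                   driver='my-s3-driver')
```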
Read-Only Datasets
Dataloop allows you to set specific Datasets as read-only. For more information, please contact us.
Folder Structure
Datasets allow you to organize files in a nested folder structure. The following folder actions are supported in the platform, via the user interface and SDK/API:
- Create folder
- Move item to folder (single or Bulk)
- Clone item(s) to folder
- Delete folder
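These folder actions map to SDK calls as well. A hedged sketch, assuming `items.make_dir` and `item.move` behave as shown (both names should be checked against the SDK reference; paths are illustrative).

```python
import dtlpy as dl

project = dl.projects.get(project_name='My Project')       # placeholder
dataset = project.datasets.get(dataset_name='My Dataset')   # placeholder

# Create a folder (assumed helper on the items repository)
dataset.items.make_dir(directory='/images/train')

# Upload directly into the folder
dataset.items.upload(local_path='/path/to/file.jpg',
                     remote_path='/images/train')

# Move an existing item into the folder (item.move is assumed here)
item = dataset.items.get(filepath='/file.jpg')
item.move(new_path='/images/train/file.jpg')
```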