Overview
A dataset is a collection of Items (files), along with their metadata and annotations. It can have a file-system-like structure, with folders and subfolders at any level. A dataset can be mapped to a Driver (derived from an Integration) so that it contains items synced from external cloud storage. Cloning and merging are examples of dataset versioning operations.
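The same entities are accessible programmatically through the Dataloop Python SDK (dtlpy). Below is a minimal sketch for connecting and fetching a dataset; the project and dataset names are placeholders, and the calls reflect common dtlpy usage rather than an exhaustive reference.

```python
import dtlpy as dl

# Authenticate (opens a browser login if the cached token has expired)
if dl.token_expired():
    dl.login()

# Fetch an existing project and one of its datasets by name (placeholder names)
project = dl.projects.get(project_name='My Project')
dataset = project.datasets.get(dataset_name='My Dataset')

print(dataset.name, dataset.id)
```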
Section 1: Datasets tab
In the Datasets tab, datasets are displayed in a list view. You can search and filter them using the following fields and criteria:
- To search: You can search datasets by Dataset Name.
- To Filter: You can filter the listed datasets by the following criteria:
- Type: The type of the datasets, whether the dataset is cloned, merged, or the original (master).
- Master
- Clone
- Merge
- Provider: The available storage providers for the datasets.
- Dataloop
- AWS
- GCP
- Azure
- Driver Type: The type of driver used from the storage provider.
- File System
- S3 Bucket
- GCS Bucket
- Blob Storage
- Data Lake Storage Gen2
- Select Creators: It allows you to filter datasets based on the creator.
List of Fields
The column values are populated according to the datasets.
| Column Name | Description |
|---|---|
| Provider | It displays the name of the storage provider. |
| Dataset Name | The name of the dataset. Clicking on it opens the Data Browser page. |
| Items | The number of items available in the dataset. |
| Feature Sets | The number of Feature Sets available in the dataset. |
| Annotated | It displays the percentage of items that are annotated. |
| Type | It displays the type of the dataset, whether it is master (original), cloned, or merged. |
| Driver Type | It displays the name of the storage driver type. |
| Open Tasks | It displays the number of open tasks. |
| Created at | The creation date of the dataset. |
| Created by | The avatar of the user who created the dataset. Hover over it to see the user's email address. |
Clicking on a dataset displays the following features of the dataset:
The Dataset page provides access to all Datasets in the project. Datasets are listed in a customizable table:
- Show/hide standard columns according to fields used.
- Add custom columns to better manage datasets.
Custom Dataset Fields
You can add your context to Datasets to manage them in your projects according to your needs. Context is added as user Meta-Data in the Dataset entity (any Meta-data field outside the System area). These fields can then be reflected as columns in the Datasets page, presenting the information and context, allowing for Datasets to be sorted and filtered by these fields.
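Via the SDK, this context is written into the dataset's user metadata (outside the System area). A minimal sketch, assuming the common dtlpy pattern of editing `dataset.metadata` and persisting with `update()`; the `user` key and the field names are illustrative.

```python
import dtlpy as dl

project = dl.projects.get(project_name='My Project')       # placeholder
dataset = project.datasets.get(dataset_name='My Dataset')   # placeholder

# Add user (non-system) metadata fields that can be surfaced as custom columns
dataset.metadata.setdefault('user', {})
dataset.metadata['user']['team'] = 'vision-qa'       # illustrative field
dataset.metadata['user']['priority'] = 'high'        # illustrative field
dataset.update()
```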
Section 2: Total Numbers of Datasets, Drivers, Feature Sets, and Items
You can view the number of Datasets, Storage Drivers, Feature Sets, and the total number of items available in all the datasets.
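Similar totals can be derived with the SDK by listing the project's datasets and storage drivers. This sketch assumes the dtlpy `project.datasets` and `project.drivers` repositories and an `items_count` attribute on the Dataset entity; the summed item count is illustrative and may differ slightly from the UI aggregate.

```python
import dtlpy as dl

project = dl.projects.get(project_name='My Project')  # placeholder name

datasets = project.datasets.list()
drivers = project.drivers.list()

print(f'Datasets: {len(datasets)}')
print(f'Storage drivers: {len(drivers)}')

# items_count is assumed to be populated on each Dataset entity
print(f'Total items: {sum(ds.items_count for ds in datasets)}')
```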
Section 3: Dataset Actions
This page allows you to execute various tasks specific to your datasets. The following actions are available without opening each dataset's detailed pages.
- Merge Datasets: It allows you to merge two or more datasets after entering the necessary details in the Merge Datasets window.
- Upload Items: Clicking on the Upload Items icon allows you to upload files and folders to the selected dataset.
- Dataset Recipe: Clicking on the Dataset Recipe icon opens the Recipe page, where you can make changes.
- Dataset Analytics: Clicking on the Dataset Analytics icon allows you to open and view the Analytics page of the selected dataset.
When you click on the Ellipsis (three dots) icon, the following options are displayed. Clicking on the link provides you with more information.
- Rename Dataset: It allows you to rename the dataset.
- Copy Dataset ID: It allows you to copy the ID of the dataset.
- Download Data: It allows you to download the dataset after entering necessary details on the Export window.
- Clone Dataset: It allows you to clone the selected dataset after entering the necessary details in the Clone Datasets/Items window.
- Open Annotation Studio: It opens the annotation studio based on the item type, including audio, video, image, etc.
- Extract Embeddings: It allows you to extract embeddings from the selected dataset.
- Switch Recipe: It allows you to select a new recipe for your dataset.
- Rescan Cloud Storage: It allows you to sync data between your cloud storage driver and Dataloop's storage.
- Delete Dataset: It allows you to delete the selected dataset.
Learn more about managing datasets.
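Most of these actions also have SDK counterparts. The following is a hedged sketch of a few of them; the argument names follow common dtlpy usage and should be verified against the SDK reference, and the project/dataset names are placeholders.

```python
import dtlpy as dl

project = dl.projects.get(project_name='My Project')       # placeholder
dataset = project.datasets.get(dataset_name='My Dataset')   # placeholder

# Upload Items: a local file or folder into the dataset
dataset.items.upload(local_path='/path/to/local/folder')

# Copy Dataset ID
print(dataset.id)

# Rename Dataset (assumed pattern: mutate the entity, then update it)
dataset.name = 'My Dataset - renamed'
dataset.update()

# Delete Dataset (dtlpy requires explicit confirmation flags)
dataset.delete(sure=True, really=True)
```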
Dataset Types
Derived from its data-versioning operations, there are different types of Datasets:
- Master: Original dataset that manages the actual binaries.
- Clone: Contains pointers to original files, enabling management of virtual items that do not replicate the binaries of the underlying storage once cloned or copied. When you clone a dataset, you can decide whether the new copy will contain metadata and annotations created over the original.
- Merge: Multiple datasets can be merged into one, which enables multiple annotations to be merged onto the same item.
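Cloning is also available from the SDK. A minimal sketch, assuming a `dataset.clone` method with the flags shown below for carrying annotations and metadata into the clone; parameter names should be checked against the SDK reference.

```python
import dtlpy as dl

project = dl.projects.get(project_name='My Project')       # placeholder
dataset = project.datasets.get(dataset_name='My Dataset')   # placeholder

# Clone the dataset: the clone points to the original binaries, and you
# decide whether annotations and metadata are carried over.
cloned = dataset.clone(clone_name='My Dataset - clone',
                       with_items_annotations=True,
                       with_metadata=True)
print(cloned.name, cloned.id)
```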
The Binaries dataset visible in your Dataloop project is a system-generated dataset designed for storing binary files associated with the project, such as model binaries. While this dataset is created automatically and is not intended for direct user interaction, it can be viewed through the SDK or API.
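Although it is hidden in the UI, this system dataset can be fetched with the SDK like any other dataset. A sketch, assuming the dataset is retrievable by the name 'Binaries' (verify the exact name in your project):

```python
import dtlpy as dl

project = dl.projects.get(project_name='My Project')  # placeholder
# Assumption: the system-generated dataset is named 'Binaries'
binaries = project.datasets.get(dataset_name='Binaries')
print(binaries.id)
```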
Storage Providers
Connect your data to the Dataloop storage system without copying it, keeping a single source of truth for your files and complying with various regulations.
The Dataloop platform has a flexible storage engine, which enables you to attach different binary storage providers, such as:
- Cloud storage services (External):
  - AWS
  - GCP
  - Azure
- Dataloop's Storage (Internal)
Limitations & Considerations
- Empty folders synced from external storage will not be shown in the dataset.
- Moving or renaming files in the external storage will result in new instances (duplications) on the next sync. This can be prevented by not working directly on files in the external storage, or by setting up 'upstream sync' to reflect such changes in the dataset.
- New files generated by the Dataloop platform are saved inside the bucket used in the storage-driver, under the folder "/.dataloop" (including video thumbnails, .webm files, snapshots, etc.).
- Items on the Dataloop platform cannot be renamed or moved when they originate from external storage.
- Items can only be cloned from internal storage (i.e., the Dataloop's File System) to internal storage or from external storage to the same external storage.
Create Datasets
Dataloop allows you to create datasets on the platform based on the following storage options:
Create a Dataset:
- Based on Dataloop's storage.
- Based on Cloud Storage.
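Both options can also be exercised from the SDK. In this sketch the first call creates a dataset on Dataloop's internal storage; the second assumes an existing storage driver passed via a `driver` argument, which is an assumption to verify against the SDK reference (driver and dataset names are placeholders).

```python
import dtlpy as dl

project = dl.projects.get(project_name='My Project')  # placeholder

# 1. Dataset backed by Dataloop's internal storage
internal = project.datasets.create(dataset_name='my-internal-dataset')

# 2. Dataset backed by an existing cloud-storage driver
#    ('my-s3-driver' is a placeholder; the 'driver' parameter is assumed)
external = project.datasets.create(dataset_name='my-external-dataset',
                                   driver='my-s3-driver')
```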
Read-Only Datasets
Dataloop allows you to set specific Datasets as read-only. For more information, please contact us.
Folder Structure
Datasets allow you to organize files in a nested folder structure. The following folder actions are supported in the platform, via the user interface and SDK/API:
- Create folder
- Move item to folder (single or Bulk)
- Clone item(s) to folder
- Delete folder
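These folder actions map to SDK calls as well. A hedged sketch, assuming `items.make_dir` and `item.move` behave as shown (both names should be checked against the SDK reference; paths are illustrative).

```python
import dtlpy as dl

project = dl.projects.get(project_name='My Project')       # placeholder
dataset = project.datasets.get(dataset_name='My Dataset')   # placeholder

# Create a folder (assumed helper on the items repository)
dataset.items.make_dir(directory='/images/train')

# Upload directly into the folder
dataset.items.upload(local_path='/path/to/file.jpg',
                     remote_path='/images/train')

# Move an existing item into the folder (item.move is assumed here)
item = dataset.items.get(filepath='/file.jpg')
item.move(new_path='/images/train/file.jpg')
```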