Overview
  • 26 Aug 2024

A dataset is a collection of Items (files), along with their metadata and annotations. It can have a file-system-like structure, with folders and subfolders nested to any depth. To contain items synced from external cloud storage, a dataset is mapped to a Driver, which derives from an Integration. Cloning and merging are examples of dataset versioning operations.


Dataset Types

Data versioning gives rise to different types of datasets:

  • Master: Original dataset that manages the actual binaries.
  • Clone: Contains pointers to the original files, so its items are virtual and the binaries in the underlying storage are not replicated when a dataset is cloned or copied. When you clone a dataset, you can decide whether the new copy contains the metadata and annotations created on the original.
  • Merge: Multiple datasets can be merged into one, which enables multiple annotations to be merged onto the same item.

Binaries dataset

The Binaries dataset visible in your Dataloop project is a system-generated dataset that stores binary files associated with the project, such as model binaries. It is created automatically and is not intended for direct user interaction, but it can be viewed through the SDK or API.
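The relationship between the master, clone, and merge dataset types described in this section can be illustrated with a small in-memory model. This is only a sketch and not the Dataloop SDK: the `Item` and `Dataset` classes, the `clone_dataset` and `merge_datasets` helpers, and the choice to match items by name during a merge are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    name: str
    binary_ref: str  # pointer to the stored binary, not the bytes themselves
    annotations: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


@dataclass
class Dataset:
    name: str
    kind: str  # "master", "clone", or "merge"
    items: dict = field(default_factory=dict)


def clone_dataset(source, clone_name, with_annotations=True, with_metadata=True):
    """Create a clone whose items point at the same binaries as the source."""
    clone = Dataset(name=clone_name, kind="clone")
    for name, item in source.items.items():
        clone.items[name] = Item(
            name=name,
            binary_ref=item.binary_ref,  # same pointer: the binary is not copied
            annotations=list(item.annotations) if with_annotations else [],
            metadata=dict(item.metadata) if with_metadata else {},
        )
    return clone


def merge_datasets(merge_name, *datasets):
    """Merge datasets; items with the same name collect all of their annotations."""
    merged = Dataset(name=merge_name, kind="merge")
    for ds in datasets:
        for name, item in ds.items.items():
            target = merged.items.setdefault(
                name, Item(name=name, binary_ref=item.binary_ref)
            )
            target.annotations.extend(item.annotations)
            target.metadata.update(item.metadata)
    return merged


# Usage: clone without annotations, annotate the clone, then merge both.
master = Dataset("master", "master")
master.items["cat.jpg"] = Item("cat.jpg", "bucket/cat.jpg", annotations=["box:cat"])
clone = clone_dataset(master, "clone-a", with_annotations=False)
clone.items["cat.jpg"].annotations.append("label:animal")
merged = merge_datasets("merged", master, clone)
print(merged.items["cat.jpg"].annotations)
```

In this model the clone never stores a second copy of the binary; only the pointer and, optionally, the metadata and annotations are carried over.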


Storage Providers

Connect your data to the Dataloop storage system without copying it, so that your files have a single point of truth and you can comply with various regulations.
The Dataloop platform has a flexible storage engine, which enables you to attach different binary storage providers.

Limitations & Considerations

  • Empty folders synced from external storage will not be shown in the dataset.
  • Moving or renaming files in the external storage results in new instances (duplicates) on the next sync. To avoid this, do not work directly on files in the external storage, or set up 'upstream sync' to reflect such changes in the dataset.
  • New files generated by the Dataloop platform are saved inside the bucket used in the storage-driver, under the folder "/.dataloop" (including video thumbnails, .webm files, snapshots, etc.).
  • Items on the Dataloop platform cannot be renamed or moved when originating from an external storage.
  • Items can only be cloned from internal storage (i.e., the Dataloop file system) to internal storage, or from external storage to the same external storage.
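Two of the limitations above can be made concrete with a short sketch: why a rename in the external storage produces a duplicate on the next sync, and which clone combinations are allowed. The code is an illustrative model, not Dataloop's implementation; the `sync` and `can_clone` functions, the `INTERNAL` label, and the idea that items are keyed by their storage path are assumptions.

```python
INTERNAL = "dataloop-internal"  # hypothetical label for Dataloop's own file system


def sync(dataset_items, bucket_paths):
    """One-way sync without upstream sync: add unseen paths, never remove or rename."""
    for path in bucket_paths:
        # A renamed file arrives under a path the dataset has never seen,
        # so it is added as a new item while the old item stays in place.
        dataset_items.setdefault(path, {"source": path})
    return dataset_items


def can_clone(source_storage, target_storage):
    """Allow internal -> internal, or external -> the same external storage."""
    return source_storage == target_storage


# Usage: the file is renamed directly in the bucket between two syncs.
items = sync({}, ["/images/a.jpg"])
items = sync(items, ["/images/a_renamed.jpg"])
print(sorted(items))  # the old path and the renamed path are both listed

print(can_clone(INTERNAL, INTERNAL))
print(can_clone("s3://bucket-a", INTERNAL))
```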

Create Datasets

Dataloop allows you to create datasets on the Dataloop platform according to your storage requirements.


Read-Only Datasets

Dataloop allows you to set specific datasets as read-only. For more information, please contact us.


Folder Structure

Datasets allow you to organize files in a nested folder structure. The folder actions supported in the platform, through the user interface and the SDK/API, are:

  • Create folder
  • Move item to folder (single or bulk)
  • Clone item(s) to folder
  • Delete folder
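The four folder actions listed above can be sketched against a small in-memory tree. This is a toy model rather than the Dataloop SDK: the `DatasetTree` class and its method names are hypothetical, and the assumption that deleting a folder also removes the items inside it is made only for this example.

```python
from pathlib import PurePosixPath


class DatasetTree:
    """Toy model of a dataset's folders and items (item path -> binary pointer)."""

    def __init__(self):
        self.folders = {PurePosixPath("/")}
        self.items = {}

    def create_folder(self, folder):
        folder = PurePosixPath(folder)
        # Register the folder together with any missing parent folders.
        self.folders.update([folder, *folder.parents])

    def move_items(self, paths, folder):
        """Move a single item or a bulk of items into a folder."""
        folder = PurePosixPath(folder)
        self.create_folder(folder)
        for path in map(PurePosixPath, paths):
            self.items[folder / path.name] = self.items.pop(path)

    def clone_items(self, paths, folder):
        """Clone items into a folder; the clone shares the binary pointer."""
        folder = PurePosixPath(folder)
        self.create_folder(folder)
        for path in map(PurePosixPath, paths):
            self.items[folder / path.name] = self.items[path]

    def delete_folder(self, folder):
        """Delete a folder, its subfolders, and (by assumption) their items."""
        folder = PurePosixPath(folder)
        self.folders = {
            f for f in self.folders if f != folder and folder not in f.parents
        }
        self.items = {
            p: ref for p, ref in self.items.items() if folder not in p.parents
        }


# Usage
tree = DatasetTree()
tree.items[PurePosixPath("/a.jpg")] = "bin-1"
tree.items[PurePosixPath("/b.jpg")] = "bin-2"
tree.move_items(["/a.jpg", "/b.jpg"], "/train/cats")  # bulk move
tree.clone_items(["/train/cats/a.jpg"], "/val")
tree.delete_folder("/train")
print(sorted(str(p) for p in tree.items))
```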

Datasets Page

The Datasets page provides access to all datasets in the project. Datasets are listed in a table that you can customize in two ways:

  1. Show/hide standard columns according to fields used.
  2. Add custom columns to better manage datasets.

Custom Dataset Fields

You can add your own context to datasets to manage them in your projects according to your needs. Context is added as user metadata in the dataset entity (any metadata field outside the System area). These fields can then be shown as columns in the Datasets page, where they present the information and context and allow datasets to be sorted and filtered by them.
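As a rough illustration of how custom fields can drive columns, filtering, and sorting, the sketch below treats each dataset as a plain dictionary. It does not use the Dataloop SDK; the `"system"` key name for the System area, the sample field names (`team`, `priority`), and the helper functions are assumptions made for this example.

```python
SYSTEM_KEY = "system"  # assumed name of the System area inside the metadata


def custom_fields(dataset):
    """Return the user metadata: every metadata field outside the System area."""
    return {k: v for k, v in dataset["metadata"].items() if k != SYSTEM_KEY}


def filter_datasets(datasets, **criteria):
    """Keep datasets whose custom fields match every given value."""
    return [
        d
        for d in datasets
        if all(custom_fields(d).get(k) == v for k, v in criteria.items())
    ]


def sort_datasets(datasets, field_name):
    """Sort by a custom field; datasets without the field are listed last."""
    return sorted(
        datasets,
        key=lambda d: (
            field_name not in custom_fields(d),
            custom_fields(d).get(field_name),
        ),
    )


# Usage with hypothetical datasets and fields
datasets = [
    {"name": "street-scenes", "metadata": {"system": {}, "team": "vision", "priority": 2}},
    {"name": "receipts", "metadata": {"system": {}, "team": "ocr", "priority": 1}},
    {"name": "scratch", "metadata": {"system": {}}},
]
print([d["name"] for d in sort_datasets(datasets, "priority")])
print([d["name"] for d in filter_datasets(datasets, team="vision")])
```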


