Overview
  • 18 Jul 2024

Dataloop delivers enterprise-level performance for unstructured data management and versioning. It enables sub-second queries on millions of files by item attributes, item metadata, or user metadata.

The data management page now enables you to manage your datasets and storage drivers, as well as create integrations.


Data Management Features

The key data management features are listed below.

  • Browser: Browse data in a user-friendly interface that supports different view options.
    • Thumbnails view with adjustable thumbnail size
    • List view with file details
    • Filters based on item data and annotation data.
    • Direct DQL queries
    • Save and reuse DQL queries
    • Folders management: Create, rename, or delete folders
    • File management: Move between folders, clone, delete
    • Create models from selected data
    • Create annotation or QA tasks from selected data
    • Trigger the selected data to a function (FaaS) or Pipeline
    • View item metadata
    • Item function executions log
    • Export data (Item JSON file)
    • Upload data (when using File system storage)
  • Data Insights: The Insights Tab provides deep visibility into your annotations, offering features like an annotation location heat map, a histogram of annotation labels, and detailed attributes per label, among others.
  • Data Clustering: Integrating clustering and visualization tools like UMAP, t-SNE, and PCA into the Dataloop platform enhances data analysis, enabling users to efficiently extract insights from complex datasets through a user-friendly interface.
  • Cloud native: Ingest and sync from popular cloud storage providers, such as AWS, GCP, Azure, etc.
  • Dataloop and cloud storage: Optionally, upload file binaries to Dataloop, or sync cloud storage to Dataloop.
  • Linked items: Create URL items without storing them on the Dataloop platform or even connecting to cloud storage.
  • Metadata layer: Every item has metadata that is populated automatically with item-attributes when the item is added to a dataset. User metadata can be added anytime.
  • DQL: Dataloop Query Language allows querying by:
    • Item attributes: Mime type, file name, creation/update time, size, etc.
    • Item metadata: Annotations, labels & attributes added to items, users working on items, etc.
    • User metadata: Any context added to the item metadata, such as order-number, GEO location, camera number, etc.
  • Performance: Sub-second queries on millions of files by item attributes, item metadata, or user metadata.
  • Version control: Clone and Merge actions let you version the data in line with the model version.
  • Privacy: Meet data privacy standards
  • Developer tools: All data management actions, such as DQL filters, version control, import, and export, are available through the API and SDK.
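
To make the three DQL query scopes concrete, here is a minimal sketch in plain Python. It does not use the Dataloop SDK or the real DQL schema: it assumes a MongoDB-style JSON filter, and the record layout, the field names (`mimetype`, `annotated`, `metadata.user.cameraNumber`), and the `get_path` and `matches` helpers are all hypothetical stand-ins chosen for illustration.

```python
from typing import Any

# Hypothetical in-memory item records; real items live in a Dataloop dataset.
ITEMS = [
    {
        "name": "cam1_0001.jpg",
        "mimetype": "image/jpeg",   # item attribute
        "annotated": True,          # item metadata
        "metadata": {"user": {"cameraNumber": 1, "orderNumber": "A-17"}},  # user metadata
    },
    {
        "name": "cam2_0001.mp4",
        "mimetype": "video/mp4",
        "annotated": False,
        "metadata": {"user": {"cameraNumber": 2}},
    },
    {
        "name": "cam1_0002.jpg",
        "mimetype": "image/jpeg",
        "annotated": False,
        "metadata": {"user": {"cameraNumber": 1}},
    },
]

_MISSING = object()  # sentinel for paths that do not exist in a record


def get_path(record: dict, dotted: str) -> Any:
    """Resolve a dotted path such as 'metadata.user.cameraNumber'."""
    current: Any = record
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def matches(record: dict, query: dict) -> bool:
    """Evaluate a small subset of MongoDB-style operators ($and, $or, $in)."""
    for field, condition in query.items():
        if field == "$and":
            if not all(matches(record, sub) for sub in condition):
                return False
        elif field == "$or":
            if not any(matches(record, sub) for sub in condition):
                return False
        else:
            value = get_path(record, field)
            if isinstance(condition, dict) and "$in" in condition:
                if value not in condition["$in"]:
                    return False
            elif value != condition:
                return False
    return True


# One condition per scope: item attribute, item metadata, user metadata.
query = {
    "$and": [
        {"mimetype": {"$in": ["image/jpeg", "image/png"]}},
        {"annotated": False},
        {"metadata.user.cameraNumber": 1},
    ]
}

unannotated_cam1_images = [item["name"] for item in ITEMS if matches(item, query)]
print(unannotated_cam1_images)
```

On the platform, the query is evaluated by Dataloop rather than by a local loop like this one; the sketch only shows how conditions from the three scopes combine into a single filter.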

Data Management Resource Creation

The Data Management Resource Creation feature of Dataloop enables you to create Integrations, Storage Drivers, and Datasets (both internal and cloud storage) all in one place, streamlining the process and eliminating the need to navigate multiple locations.

To access the Data Management Resource Creation feature:

  1. Open the Data page.
  2. Click Create Dataset. The Data Management Resource Creation window opens on the right side, where you can view the Integrations, Storage Drivers, and Datasets sections.

Data Management Page

The Data Management page displays the Datasets and Storage Drivers available in your project in separate tabs, enabling a more provider-focused view.

The common features of the Data Management page for both the Datasets and Storage Drivers tabs are:

  • Create Dataset
  • Create Storage Driver
  • Create Integration
  • SDK: Displays SDK code snippets for creating Datasets and Storage Drivers, based on your tab selection.
    • For the Datasets tab: The system displays code based on the selected internal or external storage provider.
      • Internal Storage Based Dataset
      • External Storage Based Dataset
    • For the Storage Drivers tab: The system displays code based on the external storage driver.
      • AWS
      • GCP
      • Azure
  • Refresh tabs
  • Pagination

The main sections of the Data Management page are explained below.

Section 1: Datasets and Storage Drivers tabs

The Data Management page displays the Datasets and Storage Drivers available for your project in a list view. By default, the Datasets tab is displayed, and the search and filter criteria shown are those for datasets.

Section 2: Total Numbers of Datasets, Drivers, and Items

The Data Management page displays the number of Datasets, the number of Storage Drivers, and the total number of items across all datasets.

Section 3: Search and Filter

By default, the Datasets tab is displayed, and the search and filter criteria shown are those for datasets. The search and filter criteria for both Datasets and Storage Drivers are listed below:

Datasets

  • To search: You can search datasets by Dataset Name.
  • To filter: You can filter the listed datasets by the following criteria:
    • Type: The type of the datasets, whether the dataset is cloned, merged, or the original (master).
      • Master
      • Clone
      • Merge
    • Provider: The available storage providers for the datasets.
      • Dataloop
      • AWS
      • GCP
      • Azure
    • Driver Type: The type of driver used from the storage provider.
      • File System
      • S3 Bucket
      • GCS Bucket
      • Blob Storage
      • Data Lake Storage Gen2

Storage Drivers

  • To search: You can search storage drivers by Driver Name.
  • To filter: You can filter the listed drivers by the following criteria:
    • Provider:
      • AWS
      • GCP
      • Azure
    • Driver Type:
      • S3 Bucket
      • GCS Bucket
      • Blob Storage
      • Data Lake Storage Gen2
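
The Datasets search and filter criteria above can be modelled as a simple predicate over list rows. The following is a minimal sketch in plain Python, not Dataloop code: the `DatasetRow` class, the `filter_datasets` function, the sample rows, and the case-insensitive substring match on the name are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class DatasetRow:
    """One row of the Datasets tab; a hypothetical stand-in, not an SDK type."""
    name: str
    dataset_type: str   # "Master", "Clone", or "Merge"
    provider: str       # "Dataloop", "AWS", "GCP", or "Azure"
    driver_type: str    # e.g. "File System", "S3 Bucket", "GCS Bucket"


def filter_datasets(
    rows: Iterable[DatasetRow],
    name_contains: str = "",
    dataset_type: Optional[str] = None,
    provider: Optional[str] = None,
    driver_type: Optional[str] = None,
) -> List[DatasetRow]:
    """Return the rows that satisfy every criterion that was supplied."""
    needle = name_contains.lower()  # assumed: search ignores letter case
    return [
        row for row in rows
        if needle in row.name.lower()
        and (dataset_type is None or row.dataset_type == dataset_type)
        and (provider is None or row.provider == provider)
        and (driver_type is None or row.driver_type == driver_type)
    ]


# Sample rows; pairing "Dataloop" with "File System" is an assumption.
ROWS = [
    DatasetRow("street-scenes", "Master", "Dataloop", "File System"),
    DatasetRow("street-scenes-v2", "Clone", "Dataloop", "File System"),
    DatasetRow("drone-footage", "Master", "AWS", "S3 Bucket"),
    DatasetRow("lidar-archive", "Master", "Azure", "Data Lake Storage Gen2"),
]

aws_masters = filter_datasets(ROWS, dataset_type="Master", provider="AWS")
print([row.name for row in aws_masters])
```

Leaving a criterion unset means it is not applied, so criteria combine with AND semantics, which is an assumption about how the page combines its filters.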

Section 4: List of Datasets and Storage Drivers

The Data Management page displays available Datasets and Storage Drivers in your project in a list view. The column values are populated according to the datasets and storage drivers. The following tables provide the available columns for both datasets and storage drivers.

Datasets

| Column Name | Description |
| --- | --- |
| Dataset Name | The name of the dataset. Clicking on it will open the Data Browser page. |
| Items | The number of items available in the dataset. |
| Annotated | It displays the percentage of items that are annotated. |
| Type | It displays the type of the dataset, whether it is master (original), cloned, or merged. |
| Provider | It displays the name of the storage provider. |
| Driver Type | It displays the name of the storage driver type. |
| Open Tasks | It displays the number of the tasks that are open. |
| Created at | The creation date of the dataset. |
| Created by | The Avatar of the user who created the dataset. You can see the email ID of the user when you hover. |

Storage Drivers

| Column Name | Description |
| --- | --- |
| Provider | It displays the name of the storage provider. |
| Driver Name | The name of the storage driver. Click on the Copy Driver ID to copy it. |
| Driver Type | It displays the name of the storage driver type. |
| Resource Name | It displays the name of the driver type. |
| Integration Name | It displays the name of the integration you created on the Dataloop platform. |
| Created at | The creation date of the storage driver. |
| Created by | The Avatar of the user who created the storage driver. You can see the email ID of the user when you hover. |

Section 5: Dataset and Storage Driver Actions

The Data Management page allows you to perform various tasks on your datasets and storage drivers. The actions listed below are available without opening the detailed pages of the datasets and storage drivers.

Datasets

  • Merge Datasets: It allows you to merge two or more datasets after entering necessary details on the Merge Datasets window.
  • Upload Items: Clicking on the Upload Items icon allows you to upload files and folders to the selected dataset.
  • Dataset Recipe: Clicking on the Dataset Recipe icon opens the Recipe page, where you can make changes to the recipe.
  • Dataset Analytics: Clicking on the Dataset Analytics icon allows you to open and view the Analytics page of the selected dataset.

When you click on the Ellipsis (three dots) icon, the following options are displayed. Clicking on the link provides you with more information.

  • Rename Dataset: It allows you to rename the dataset.
  • Copy Dataset ID: It allows you to copy the ID of the dataset.
  • Download Data: It allows you to download the dataset after entering necessary details on the Export window.
  • Clone Dataset: It allows you to clone the selected dataset after entering necessary details on the Clone Datasets/Items window.
  • Open Annotation Studio: It opens the annotation studio based on the item type, including audio, video, image, etc.
  • Switch Recipe: It allows you to select a new recipe for your dataset.
  • Rescan Cloud Storage: It allows you to sync data between your cloud storage driver and Dataloop's storage.
  • Delete Dataset: It allows you to delete the selected dataset.

Storage Drivers

  • Edit Storage Driver: Clicking on the Edit icon allows you to edit storage driver details, such as the driver name and whether deleting items is allowed.
  • Delete Storage Driver: Clicking on the Delete icon allows you to delete the selected storage driver.

Data Management Specifications

Cloud Providers & Features

| Cloud Provider | Resource Type | Integration Type |
| --- | --- | --- |
| AWS | S3 Bucket | Cross Account |
| AWS | S3 Bucket | Access Key |
| AWS | S3 Bucket | STS |
| GCP | GCS Bucket | Private Key |
| GCP | GCS Bucket | Cross Project |
| Azure | Blob | Client Secret |
| Azure | Datalake Gen2 | Client Secret |

  • Dataloop supports sub-folder specific access in buckets, which offers security and versatility in managing your data.
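
The provider, resource type, and integration type combinations in the table above can be captured as a small lookup, for example to validate a configuration before creating a storage driver. This is a minimal sketch in plain Python: the dictionary values come from the table, while the names `SUPPORTED_INTEGRATIONS`, `integration_types`, and `is_supported` are hypothetical and not part of any Dataloop API.

```python
from typing import Dict, Tuple

# (cloud provider, resource type) -> supported integration types, per the table.
SUPPORTED_INTEGRATIONS: Dict[Tuple[str, str], Tuple[str, ...]] = {
    ("AWS", "S3 Bucket"): ("Cross Account", "Access Key", "STS"),
    ("GCP", "GCS Bucket"): ("Private Key", "Cross Project"),
    ("Azure", "Blob"): ("Client Secret",),
    ("Azure", "Datalake Gen2"): ("Client Secret",),
}


def integration_types(provider: str, resource_type: str) -> Tuple[str, ...]:
    """Return the integration types listed for a provider and resource type."""
    return SUPPORTED_INTEGRATIONS.get((provider, resource_type), ())


def is_supported(provider: str, resource_type: str, integration_type: str) -> bool:
    """Check whether a combination appears in the table."""
    return integration_type in integration_types(provider, resource_type)


print(integration_types("AWS", "S3 Bucket"))
print(is_supported("GCP", "GCS Bucket", "STS"))
```

An unknown provider or resource type returns an empty tuple rather than raising, so callers can treat "not listed" and "not supported" the same way.
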

Platform Specifications

Find the Dataloop specifications on the specifications page.


