Create Datasets

Overview

DDOE enables you to create and manage datasets while giving you flexibility in choosing your preferred storage location. You can upload data directly into DDOE as a local upload, ingest data using the SDK, or connect external storage such as cloud storage services or on‑premises storage systems.

On the Data page, clicking Create Dataset lets you choose any of these options.

Learn more about the supported data formats.

Upload Your Data in DDOE (Local Upload)

DDOE storage is the platform’s internal dataset storage, where files are securely stored on DDOE’s GCP‑hosted cloud storage. This internal storage enables you to manage digital files—such as images, videos, audio files, text files, and other data—used for the annotation process.

Log in to the DDOE platform.
From the left‑side navigation panel, select Data. Alternatively, use the Dashboard → Data Management widget.
In the Datasets tab, click Create Dataset, or click on the down-arrow and select Create Dataset from the list. The New Dataset popup is displayed.
Select the Local Upload tile from the list. The Dataset Details section is displayed.

Dataset Name: Enter a Name for the dataset.
Recipe (Optional): Select a recipe based on your data type. If a suitable recipe is not available, click Create New to create a recipe for your dataset.

Information
Refer to the What you need to know section in the right-side panel for additional information.

Click Create Dataset. The new dataset will be created, and the Dataset Browser page is displayed. You can now start uploading your data.

Connect and Sync Your Cloud Storages

Cloud storage services are online platforms that allow organizations to store and manage their data. DDOE supports the following cloud storage services:

Amazon Web Services (AWS) S3: Amazon S3 (Simple Storage Service) is a highly scalable object storage service offered by AWS.
Microsoft Azure Blob Storage: Microsoft Azure provides Blob Storage for storing and managing unstructured data. It integrates well with other Azure services.
Google Cloud Storage: Google Cloud Storage is part of the Google Cloud Platform and offers object storage, archival storage, and data transfer services. It's often used alongside other GCP services.

Prerequisites
Before starting the Cloud Storage Sync, ensure that the following components are set up:
Integration.
Storage Driver.

Log in to the DDOE platform.
From the left‑side navigation panel, select Data. Alternatively, use the Dashboard → Data Management widget.
In the Datasets tab, click Create Dataset, or click on the down-arrow and select Create Dataset from the list. The New Dataset popup is displayed.
Select the Cloud Storage Sync tile from the list.
Select your cloud provider (AWS, GCP, or Azure) from the list.

AWS

Enables DDOE to securely connect to AWS services (such as S3), allowing you to sync, manage, and process datasets stored in AWS cloud environments.

Select the AWS tile from the list.
Select an Integration from the list. If a suitable integration is not available, click Add Integration.
Dataset Name: Enter a Name for the dataset.
Recipe (Optional): Select a recipe based on your data type. If a suitable recipe is not available, click Create New to create a recipe for your dataset.
Bucket Name: Enter your S3 bucket name.
Path Prefix: Specify the directory path prefix within the bucket.
Region: Select the AWS region from the list.
Storage Class: Enter your S3 storage class.
Allow Deletion of Items from Storage:
1. If Yes, deleting items in DDOE will permanently delete them from your external storage.
2. If No, items deleted in DDOE will remain in your storage and won't be restored during re-sync.
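Run initial sync: Enable to start the initial sync. You can also set up Automatic sync.
Click Create Dataset. The new dataset will be created, and the Dataset Browser page is displayed. You can now start syncing your data.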
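
As an illustration of the Path Prefix field, the sketch below shows how a prefix narrows the set of object keys in a bucket. It is a minimal model, not DDOE's implementation: the helper names and the sample keys are hypothetical, and treating the prefix as a directory (by appending a trailing slash) is an assumption about how the field is interpreted.

```python
def normalize_prefix(prefix: str) -> str:
    """Drop leading slashes and end a non-empty prefix with '/'."""
    cleaned = prefix.strip().lstrip("/")
    if cleaned and not cleaned.endswith("/"):
        cleaned += "/"
    return cleaned


def keys_under_prefix(keys, prefix):
    """Return the object keys that fall under the given path prefix."""
    normalized = normalize_prefix(prefix)
    return [key for key in keys if key.startswith(normalized)]


# Hypothetical object keys in a bucket.
bucket_keys = [
    "images/train/cat.jpg",
    "images/val/dog.jpg",
    "images-raw/scan.jpg",
    "docs/readme.txt",
]

# Only keys inside the "images/" directory are selected.
print(keys_under_prefix(bucket_keys, "images"))
```

With the directory-style assumption, "images-raw/scan.jpg" is excluded even though it shares the first six characters with the prefix. An empty prefix selects every key in the bucket.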

GCP

Allows DDOE to access and synchronize data from Google Cloud Platform services (such as GCS), supporting seamless dataset ingestion and management within DDOE workflows.

Select the GCP tile from the list.
Select an Integration from the list. If a suitable integration is not available, click Add Integration.
Dataset Name: Enter a Name for the dataset.
Recipe (Optional): Select a recipe based on your data type. If a suitable recipe is not available, click Create New to create a recipe for your dataset.
Bucket Name: Enter your GCS bucket name.
Path Prefix: Specify the directory path prefix within the bucket.
Allow Deletion of Items from Storage:
1. If Yes, deleting items in DDOE will permanently delete them from your external storage.
2. If No, items deleted in DDOE will remain in your storage and won't be restored during re-sync.
Run initial sync: Enable to start the initial sync. You can also set up Automatic sync.
Click Create Dataset. The new dataset will be created, and the Dataset Browser page is displayed. You can now start syncing your data.
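
The effect of the Allow Deletion of Items from Storage setting can be sketched with a small in-memory model. This is a toy simulation of the behavior described in the steps above, not DDOE code: the class, its fields, and the way removed items are remembered are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SyncedDataset:
    """Toy model of a dataset synced from external storage."""
    allow_deletion: bool
    storage: set = field(default_factory=set)   # files in the external bucket
    items: set = field(default_factory=set)     # items visible in the dataset
    removed: set = field(default_factory=set)   # items deleted while deletion is off

    def sync(self) -> None:
        # Bring in everything from storage except items the user removed.
        self.items |= self.storage - self.removed

    def delete(self, name: str) -> None:
        self.items.discard(name)
        if self.allow_deletion:
            # "Yes": the file is also deleted from external storage.
            self.storage.discard(name)
        else:
            # "No": the file stays in storage but is not restored on re-sync.
            self.removed.add(name)


dataset = SyncedDataset(allow_deletion=False, storage={"a.jpg", "b.jpg"})
dataset.sync()
dataset.delete("a.jpg")
dataset.sync()
print(sorted(dataset.items), sorted(dataset.storage))
```

In this model, with the setting off, "a.jpg" remains in storage but does not reappear in the dataset after the second sync. With the setting on, the same delete call would also remove the file from storage.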

Azure

Connects DDOE with Azure storage services (such as Blob Storage and Data Lake Gen2), enabling secure data synchronization and management directly from Azure cloud storage.

Select the Azure tile from the list.
Select an Integration from the list. If a suitable integration is not available, click Add Integration.
Dataset Name: Enter a Name for the dataset.
Recipe (Optional): Select a recipe based on your data type. If a suitable recipe is not available, click Create New to create a recipe for your dataset.
Storage Type: Select the Azure storage type you want to configure:
1. Blob Storage: Select this option to store data in an Azure Blob container by specifying the container name.
  1. Container Name: Enter the name of the blob container.
2. Data Lake Storage Gen2: Select this option to store data in an Azure Data Lake Gen2 file system by specifying the file system name.
  1. File System Name: Enter the name of the file system.
Path Prefix: Specify the directory path prefix within the container or file system.
Allow Deletion of Items from Storage:
1. If Yes, deleting items in DDOE will permanently delete them from your external storage.
2. If No, items deleted in DDOE will remain in your storage and won't be restored during re-sync.
Run initial sync: Enable to start the initial sync. You can also set up Automatic sync.
Click Create Dataset. The new dataset will be created, and the Dataset Browser page is displayed. You can now start syncing your data.

Information
Refer to the What you need to know section in the right-side panel for additional information.

Connect and Sync Your On-Premise Data

Enables you to connect to storage hosted in your own data center or infrastructure and automatically sync data.

Log in to the DDOE platform.
From the left‑side navigation panel, select Data. Alternatively, use the Dashboard → Data Management widget.
In the Datasets tab, click Create Dataset, or click on the down-arrow and select Create Dataset from the list. The New Dataset popup is displayed.
Select the On-Prem Sync tile from the list.
Select your storage type: Network File System (NFS), NFS with MetadataIQ, S3-Compatible API, or S3 API with MetadataIQ.

Network File System (NFS)

Enables DDOE to connect to on‑premises file storage using standard NFS, allowing datasets to be synced and managed directly from local file systems.

NFS Share Mounting
Make sure your NFS share is mounted before creating the dataset. Contact our support team for setup assistance.

Select the Network File System tile from the list.
Dataset Name: Enter a Name for the dataset.
Recipe (Optional): Select a recipe based on your data type. If a suitable recipe is not available, click Create New to create a recipe for your dataset.
Path Prefix: Specify the directory path prefix within the NFS share.
Allow Deletion of Items from Storage:
1. If Yes, deleting items in DDOE will permanently delete them from your external storage.
2. If No, items deleted in DDOE will remain in your storage and won't be restored during re-sync.
Run initial sync: Enable to start the initial sync. You can also set up Automatic sync.
Click Create Dataset. The new dataset will be created, and the Dataset Browser page is displayed. You can now start syncing your data.
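
The mounting prerequisite above can be checked from a script on the host where the share is expected to be mounted. The sketch below uses only the Python standard library; the mount path is a hypothetical placeholder, and which host needs the mount in a given deployment is something to confirm with the support team.

```python
import os


def is_mounted(path: str) -> bool:
    """Return True if path exists and is itself a mount point."""
    return os.path.isdir(path) and os.path.ismount(path)


# Hypothetical mount location; replace with the path used in your setup.
share_path = "/mnt/ddoe-nfs-share"

if is_mounted(share_path):
    print(f"{share_path} is mounted")
else:
    print(f"{share_path} is not a mount point; mount the NFS share first")
```

Note that os.path.ismount returns True only for the mount point itself, not for subdirectories inside the mounted share.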

NFS with MetadataIQ

Dell PowerScale OneFS: Enables DDOE to connect to on‑premises NFS storage while leveraging MetadataIQ to accelerate data discovery, indexing, and metadata‑driven dataset management.

NFS Share Mounting
Make sure your NFS share is mounted and MetadataIQ is properly configured before creating the dataset. Contact our support team for setup assistance.

Select the NFS with MetadataIQ tile from the list.
Dataset Name: Enter a Name for the dataset.
Recipe (Optional): Select a recipe based on your data type. If a suitable recipe is not available, click Create New to create a recipe for your dataset.
Elastic Index Name: Enter the name of the Elasticsearch index that MetadataIQ created and maintains for the target NFS data source.
Path Prefix: Specify the directory path prefix within the NFS share.
Allow Deletion of Items from Storage:
1. If Yes, deleting items in DDOE will permanently delete them from your external storage.
2. If No, items deleted in DDOE will remain in your storage and won't be restored during re-sync.
Run initial sync: Enable to start the initial sync. You can also set up Automatic sync.
Click Create Dataset. The new dataset will be created, and the Dataset Browser page is displayed. You can now start syncing your data.

S3-Compatible API

Allows DDOE to integrate with on‑premises object storage that exposes an S3‑compatible API, enabling cloud‑like data access and synchronization.

Select the S3-Compatible API tile from the list.
Select an Integration from the list. If a suitable integration is not available, click Add Integration.
Dataset Name: Enter a Name for the dataset.
Recipe (Optional): Select a recipe based on your data type. If a suitable recipe is not available, click Create New to create a recipe for your dataset.
Endpoint URL: Enter the custom S3-compatible endpoint URL.
Bucket Name: Enter your S3 bucket name.
Path Prefix: Specify the directory path prefix within the bucket.
Allow Deletion of Items from Storage:
1. If Yes, deleting items in DDOE will permanently delete them from your external storage.
2. If No, items deleted in DDOE will remain in your storage and won't be restored during re-sync.
Run initial sync: Enable to start the initial sync. You can also set up Automatic sync.
Click Create Dataset. The new dataset will be created, and the Dataset Browser page is displayed. You can now start syncing your data.
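
Before entering the Endpoint URL, it can help to check that the value is a well-formed HTTP(S) URL. The following sketch uses the standard-library urllib.parse module; the function name, the specific checks, and the example host are illustrative assumptions, not rules that DDOE is documented to enforce.

```python
from typing import List
from urllib.parse import urlparse


def check_endpoint_url(url: str) -> List[str]:
    """Return a list of problems found in an S3-compatible endpoint URL."""
    problems = []
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        problems.append("scheme must be http or https")
    if not parsed.hostname:
        problems.append("missing host name")
    if parsed.query or parsed.fragment:
        problems.append("query strings and fragments are not expected")
    return problems


# Hypothetical on-premises endpoint.
print(check_endpoint_url("https://objects.storage.example.internal:9000"))
print(check_endpoint_url("objects.storage.example.internal:9000"))
```

An empty list means no problems were found by these checks. The second call reports a problem because the scheme (http or https) is missing.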

S3 API with MetadataIQ

Combines S3‑compatible object storage access with MetadataIQ to provide enhanced metadata indexing, faster data ingestion, and efficient dataset management in DDOE.

MetadataIQ Configuration
Make sure MetadataIQ is properly configured before creating the dataset. Contact our support team for setup assistance.

Select the S3 API with MetadataIQ tile from the list.
Select an Integration from the list. If a suitable integration is not available, click Add Integration.
Dataset Name: Enter a Name for the dataset.
Recipe (Optional): Select a recipe based on your data type. If a suitable recipe is not available, click Create New to create a recipe for your dataset.
Endpoint URL: Enter the custom S3-compatible endpoint URL.
Bucket Name: Enter your S3 bucket name.
Elastic Index Name: Enter the name of the Elasticsearch index that MetadataIQ created and maintains for the target data source.
Path Prefix: Specify the directory path prefix within the bucket.
Allow Deletion of Items from Storage:
1. If Yes, deleting items in DDOE will permanently delete them from your external storage.
2. If No, items deleted in DDOE will remain in your storage and won't be restored during re-sync.
Run initial sync: Enable to start the initial sync. You can also set up Automatic sync.
Click Create Dataset. The new dataset will be created, and the Dataset Browser page is displayed. You can now start syncing your data.

Information
Refer to the What you need to know section in the right-side panel for additional information.

Create Datasets with Sample Data

Enables you to browse and install datasets from DDOE’s Marketplace.

Log in to the DDOE platform.
From the left‑side navigation panel, select Data. Alternatively, use the Dashboard → Data Management widget.
In the Datasets tab, click Create Dataset, or click on the down-arrow and select Create Dataset from the list. The New Dataset popup is displayed.
Select the Sample Datasets tile from the list. The Marketplace page is displayed.

Select a dataset based on your requirements. You can use the search and filter options in the left-side panel.
After making a selection, a Details panel appears on the right side. Click Install to install the selected dataset. Once installation is complete, the Dataset Browser page is displayed.
1. Installed datasets are marked with a green checkmark ✅ icon.
2. Auto-Update: In the Details panel on the right, enable Auto‑Update to automatically update installed datasets when a newer version becomes available in the Marketplace.

Upload Data in DDOE Using the SDK

Enables programmatic data import into DDOE using the SDK.

Log in to the DDOE platform.
From the left‑side navigation panel, select Data. Alternatively, use the Dashboard → Data Management widget.
In the Datasets tab, click Create Dataset, or click on the down-arrow and select Create Dataset from the list. The New Dataset popup is displayed.
Select the Import via SDK tile from the list. The developer guide page is displayed. Follow the steps that match your requirements.

Integrations

An integration defines the connection between DDOE and an external cloud storage provider, enabling secure access, synchronization, and data management.

Add AWS Integration

When creating datasets to connect and sync data from your AWS cloud storage provider, ensure that a suitable integration already exists.

If the required integration is not available, follow the steps below to create a new one:

Click Add Integration.
Integration Name: Enter the name of the integration.
Integration Type: Select the integration type from the list. Refer to the links for more information.
1. Cross Account
  1. DDOE IAM User ARN
  2. IAM Role ARN
2. STS
  1. AWS Access Key ID
  2. AWS Secret Access Key
  3. IAM Role ARN
3. Access Key
  1. AWS Access Key ID
  2. AWS Secret Access Key
Click Create Integration. The Integration will be created and listed.
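
The three AWS integration types and their fields can be summarized as a lookup table with a small completeness check. This is a sketch for planning which values to gather before opening the form: the function and variable names are hypothetical, and treating every listed field as a value you must supply is an assumption (some fields may be shown by DDOE rather than entered).

```python
# Fields listed for each AWS integration type in the steps above.
AWS_INTEGRATION_FIELDS = {
    "Cross Account": ["DDOE IAM User ARN", "IAM Role ARN"],
    "STS": ["AWS Access Key ID", "AWS Secret Access Key", "IAM Role ARN"],
    "Access Key": ["AWS Access Key ID", "AWS Secret Access Key"],
}


def missing_fields(integration_type: str, values: dict) -> list:
    """Return the listed fields that have no non-empty value."""
    if integration_type not in AWS_INTEGRATION_FIELDS:
        raise ValueError(f"unknown integration type: {integration_type!r}")
    return [
        name
        for name in AWS_INTEGRATION_FIELDS[integration_type]
        if not str(values.get(name, "")).strip()
    ]


# Placeholder values only; never hard-code real credentials.
collected = {"AWS Access Key ID": "<access-key-id>"}
print(missing_fields("STS", collected))
```

For the placeholder input above, the check reports that the secret access key and the IAM role ARN are still missing for the STS type.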

Add GCP Integration

When creating datasets to connect and sync data from your GCP cloud storage provider, ensure that a suitable integration already exists. If the required integration is not available, follow the steps below to create a new one:

Click Add Integration.
Integration Name: Enter the name of the integration.
Integration Type: Select the integration type from the list. Refer to the links for more information.
1. Cross Project
  1. Service Account ID
  2. Bucket Name
2. Workload Identity Federation: It allows DDOE to securely access cloud resources without storing long‑lived credentials (like service account keys or access keys).
  1. Azure Token URL
  2. Azure Client ID
  3. Azure Client Secret
  4. Azure Scope
  5. GCP Credential Configuration File
3. Private Key
  1. JSON Private Key
Click Create Integration. The Integration will be created and listed.
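
For the Private Key type, a quick local check of the JSON key file can catch copy and paste mistakes before the integration is created. The sketch below looks for fields that a Google Cloud service account key file normally contains; the function name and the choice of which fields to check are assumptions, and DDOE may validate the key differently.

```python
import json

# Fields normally present in a Google Cloud service account key file.
EXPECTED_KEYS = ("type", "project_id", "private_key_id", "private_key", "client_email")


def check_service_account_key(text: str) -> list:
    """Return a list of problems found in a JSON private key string."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]
    problems = [f"missing field: {key}" for key in EXPECTED_KEYS if key not in data]
    if data.get("type") != "service_account":
        problems.append('"type" should be "service_account"')
    return problems


# Placeholder content only; a real key file holds secret material.
sample = json.dumps({
    "type": "service_account",
    "project_id": "<project-id>",
    "private_key_id": "<key-id>",
    "private_key": "<private-key>",
    "client_email": "<service-account-email>",
})
print(check_service_account_key(sample))
```

An empty list means the structure looks like a service account key. It does not confirm that the key is active or has access to the bucket.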

Add Azure Integration

When creating datasets to connect and sync data from your Azure cloud storage provider, ensure that a suitable integration already exists. If the required integration is not available, follow the steps below to create a new one:

Click Add Integration.
Integration Name: Enter the name of the integration.
Tenant ID: Go to Azure Active Directory → Overview and copy the Tenant ID.
Client ID: The Application (App Registration) ID.
Client Secret: The secret generated under the app registration.
Storage Account Name: The unique name of the Azure Storage Account, used to identify and access storage resources such as Blob Storage and Data Lake. It is the first part of the account endpoint, for example, https://<storage-account-name>.blob.core.windows.net
Click Create Integration. The Integration will be created and listed.
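
To show how the Storage Account Name relates to the endpoint format in the example above, the sketch below validates a name against Azure's naming rules (3 to 24 characters, lowercase letters and digits only) and builds the public-cloud endpoints for Blob Storage and Data Lake Storage Gen2. The function name and the sample account name are hypothetical, and sovereign or private Azure clouds use different domain suffixes.

```python
import re

# Azure storage account names: 3-24 characters, lowercase letters and digits.
_ACCOUNT_NAME = re.compile(r"[a-z0-9]{3,24}")


def storage_endpoints(account_name: str) -> dict:
    """Build public-cloud endpoint URLs for a storage account name."""
    if not _ACCOUNT_NAME.fullmatch(account_name):
        raise ValueError(
            "storage account names must be 3-24 lowercase letters or digits"
        )
    return {
        "Blob Storage": f"https://{account_name}.blob.core.windows.net",
        "Data Lake Storage Gen2": f"https://{account_name}.dfs.core.windows.net",
    }


# Hypothetical account name.
for storage_type, url in storage_endpoints("mydatasets01").items():
    print(f"{storage_type}: {url}")
```

Only the account name itself goes into the Storage Account Name field; the full URL is shown here to make clear which part of the endpoint the name occupies.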