Upload Items
  • 18 Dec 2024

With Dataloop, you can upload data directly into Dataloop's storage or sync data from external cloud storage sources.

Uploading Data into Dataloop's Storage

Users can directly upload their data files, datasets, or other relevant information into Dataloop's storage infrastructure. This process typically involves selecting the files or data to be uploaded and initiating the upload through Dataloop's user interface or SDK.

File System option for Datasets

When using the File system option for Datasets, files' binaries are uploaded and stored on the Dataloop storage (GCP hosted).

To upload files to your dataset:

  1. In the Dashboard page, go to the Data-management widget and find the dataset.
  2. Click the Upload Items icon. Alternatively, when in the Dataset-browser, click the Upload Items icon.
  3. Select either Upload file or Upload folder from the list if you need to upload individual files or entire folders.
  4. Find the file or folder and click Upload. A confirmation message is displayed.

Browser upload limits

Your browser may crash or hang when you upload files larger than 1 GB or more than 100 folders, because browsers are not designed to manage large upload processes. The exact limits depend on your computer's configuration and Internet connection. To upload many files and folders, use the SDK as detailed in the Upload and Manage Data & Metadata tutorial.
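
As a minimal sketch of the SDK route, assuming the Python SDK (dtlpy) is installed and you are able to log in, the following helper decides whether an upload looks too heavy for the browser and, if so, uploads it with the SDK. The function names, the project and dataset names you pass in, and the thresholds (which only mirror the guidance above) are illustrative assumptions.

```python
import os

# Thresholds mirror the browser guidance above; they are not hard platform limits.
MAX_BROWSER_FILE_BYTES = 1024 ** 3  # 1 GB
MAX_BROWSER_FOLDERS = 100


def should_use_sdk(file_sizes, folder_count):
    """Return True when an upload is likely too heavy for the browser."""
    has_large_file = any(size > MAX_BROWSER_FILE_BYTES for size in file_sizes)
    return has_large_file or folder_count > MAX_BROWSER_FOLDERS


def scan_folder(local_path):
    """Collect file sizes and count sub-folders under local_path."""
    sizes, folders = [], 0
    for root, dirs, files in os.walk(local_path):
        folders += len(dirs)
        sizes.extend(os.path.getsize(os.path.join(root, name)) for name in files)
    return sizes, folders


def upload_with_sdk(project_name, dataset_name, local_path, remote_path="/"):
    """Upload a local file or folder with dtlpy (requires `pip install dtlpy`)."""
    import dtlpy as dl  # imported lazily so the helpers above work without the SDK

    if dl.token_expired():
        dl.login()
    project = dl.projects.get(project_name=project_name)
    dataset = project.datasets.get(dataset_name=dataset_name)
    return dataset.items.upload(local_path=local_path, remote_path=remote_path)
```

A typical flow would call `scan_folder()` on the local directory, pass the result to `should_use_sdk()`, and fall back to `upload_with_sdk()` when it returns True.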


Syncing Data from External Cloud Storage Driver

Dataloop provides the capability to synchronize data from external cloud storage platforms. Users can connect their Dataloop accounts with popular cloud storage services such as Amazon S3, Google Cloud Storage, or Microsoft Azure, and seamlessly transfer data between these platforms and Dataloop.

Syncing data allows users to leverage existing data stored in external cloud services within the Dataloop environment, enabling them to centralize and manage all their data in one place.

Prerequisites for Syncing a Storage Driver

To sync the storage driver:

  1. In the Data Browser page, click the Sync Storage Driver icon available on the right-side panel. The Initiate External Storage Sync popup is displayed.
  2. Click Sync Data.

Key points

Here are the key points to keep in mind when initiating the cloud storage sync process:

  1. Verify Write Access: Make sure you have write access to save thumbnails, modalities, and converted files to a hidden .dataloop folder on your storage.
  2. Permission Validation: As part of the process, a permission test-file is added to your storage folder to validate the necessary permissions.
  3. Annotations and Metadata: It's important to note that annotations and metadata are stored on the Dataloop platform, separate from your external storage.
  4. Deletion Handling: If you delete a file from your external storage, you may need to initiate a file deletion process in Dataloop or set up an Upstream sync in advance to ensure that these events are properly managed and accounted for.

Setup Process

The setup process for external storage includes the following steps:

  1. Prepare External Storage: In your external storage account (AWS S3, Azure Blob, GCP GCS, or Private Container Registry), make sure you have the credentials and permissions that Dataloop will use. Specific instructions are provided for each storage type. The following permissions are involved:
    • List (Mandatory): Allowing Dataloop to list all items in the storage.
    • Get (Mandatory): Allowing retrieval of items and performing pre-processing functions like generating thumbnails and fetching item information.
    • Put/Write (Mandatory): Enabling you to upload your items directly from the Dataloop platform to the external storage.
    • Delete: Allowing you to delete items directly from the external storage using the Dataloop platform.
  2. Create Integration in Dataloop Organization: Within your Dataloop Organization, set up a new Integration to input the credentials prepared in the previous step. These credentials are securely saved in a Vault and can be utilized by projects owned by the organization.
  3. Generate Storage Driver: In a specific project, create a new storage driver. To create a new storage-driver, see the Storage Drivers overview and relevant steps, such as for AWS, GCP, or Azure cloud providers.
  4. Create a Dataset and Configure: In the same project, create a new Dataset and configure it to use external storage by employing the storage driver you've already created. This configuration will ensure that the dataset interacts with and stores data on the specified external storage resource.
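
To illustrate how the List, Get, Put/Write, and Delete permissions from step 1 could map onto AWS S3, here is a hedged sketch that builds an IAM policy document. The S3 action names are standard AWS actions, but the bucket name is a placeholder, the helper is hypothetical, and the exact policy Dataloop expects may include more than this; follow the AWS storage driver instructions for the authoritative version.

```python
import json


def build_s3_policy(bucket_name, allow_delete=False):
    """Build an IAM policy document covering List, Get, Put and, optionally, Delete."""
    object_actions = ["s3:GetObject", "s3:PutObject"]
    if allow_delete:
        object_actions.append("s3:DeleteObject")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # List applies to the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {
                # Get / Put / Delete apply to the objects inside the bucket
                "Effect": "Allow",
                "Action": object_actions,
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }


# "my-example-bucket" is a placeholder name
print(json.dumps(build_s3_policy("my-example-bucket"), indent=2))
```

Passing `allow_delete=True` adds the optional Delete permission, which is only needed if you want deletions in Dataloop to reach your storage.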

Initial Sync

When your dataset is configured correctly, an initial synchronization operation begins automatically. For a successful completion of this process, the following conditions must be met:

  • Ensure that your integration credentials are valid and have the necessary permissions.
  • When creating the dataset, be sure to enable the Sync option. If you don't enable it during dataset creation, you can initiate the sync process manually from the dataset browser.
  • Ensure that your quota has enough available data-points: at least one for each file item you intend to synchronize.

Once the sync process begins, you can monitor its progress in the notifications area to ensure that all data items are properly indexed and synchronized.

Ongoing Sync: Upstream and Downstream

After connecting your cloud storage to a dataset and completing the initial sync, the dataset reflects your directory structure and file content. Managing your files within the dataset, including actions like moving between folders, cloning, merging, etc., acts as an additional layer of management and does not impact the binary files in your cloud storage.

  1. Syncing Empty Folders: Empty folders in your storage are not displayed in the platform.
  2. Moving or Renaming Items: If you move or rename items on your external storage, it will result in duplicates within the platform.
  3. Saving New Files: The Dataloop platform saves newly generated files inside your storage bucket under the folder /.dataloop. This includes items like video thumbnails, .webm files, snapshots, etc.
  4. Limitations on Renaming or Moving Items: You cannot rename or move items within the Dataloop platform if they are linked to an external source.
  5. Cloning Items: Items can only be cloned from internal storage (Dataloop cloud storage) to internal storage or from external storage to the same external storage.

You cannot sync cloned and merged datasets on the Dataloop platform because these datasets are not directly indexed to external storage.

Downstream

Downstream sync propagates file-item changes made on the Dataloop platform (for example, adding or deleting files in a dataset) to your external storage. Downstream sync is always active.

  • New files: If you add new files directly to your Dataloop dataset, bypassing your external storage, Dataloop attempts to write these files to your external storage to keep the two in sync.

  • File deletion: By default, Dataloop does not delete your binary files, so files deleted from Dataloop datasets are not removed from your cloud storage. To have your cloud storage reflect deletions made within the Dataloop platform, do both of the following:

    • Allow deletion permissions in your IAM for AWS, GCS, and Azure.
    • Check the Allow delete option in the storage driver.

Upstream

Upstream sync is the process of updating any file-item changes from your external storage to the Dataloop platform. There are various ways in which upstream sync occurs:

  1. Automatically, Once: This happens at the time the storage driver is created, assuming the option was enabled during dataset creation.
  2. Manually: You can initiate this process each time you click Sync for a specific dataset, accessible through the Dataset browser. This action triggers a scan of the files on your storage, indexing them into Dataloop. Keep in mind that files deleted from the cloud storage might persist as 'ghost' files in your dataset.
  3. Automatic Upstream sync: Dataloop provides code that monitors changes in your bucket and updates the respective datasets accordingly, so that your data is synchronized upstream automatically whenever file items are modified. For more information, see the Dataset Binding with AWS article.
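
The 'ghost' files mentioned under manual sync can be spotted by comparing two listings. The sketch below is plain Python and assumes you have already obtained the item paths in the dataset and the object paths in your bucket; how you fetch those listings is outside its scope, and the function name is hypothetical.

```python
def compare_listings(dataset_paths, storage_paths):
    """Compare dataset item paths with the paths currently in external storage.

    Returns (ghosts, not_indexed):
      ghosts       paths still in the dataset but gone from storage
      not_indexed  paths in storage that the dataset has not indexed yet
    """
    dataset_set = set(dataset_paths)
    storage_set = set(storage_paths)
    ghosts = sorted(dataset_set - storage_set)
    not_indexed = sorted(storage_set - dataset_set)
    return ghosts, not_indexed


ghosts, not_indexed = compare_listings(
    ["/a.jpg", "/b.jpg", "/c.jpg"],  # example dataset listing
    ["/a.jpg", "/c.jpg", "/d.jpg"],  # example storage listing
)
print(ghosts)       # ['/b.jpg']
print(not_indexed)  # ['/d.jpg']
```

Ghost entries are candidates for deletion in Dataloop, while paths that are not yet indexed would be picked up by the next manual or automatic upstream sync.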

User Metadata Upload

Once your items are in a Dataset, you can use the SDK or API to attach your own context to them as user metadata. Examples of such context include:

  • Camera number
  • Area taken
  • LOT/Order number

Developers

To learn how to upload items with user metadata, see the developer documentation.
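
As a rough sketch of what this looks like with the Python SDK (dtlpy), the helpers below store custom context under the `user` key of an item's metadata. The field names (`camera`, `area`, `lot`) and the function names are illustrative assumptions; `dataset` and `item` are expected to be dtlpy Dataset and Item objects that you have already retrieved.

```python
def build_user_metadata(camera_number, area, lot_number):
    """Wrap custom context in the 'user' section of an item's metadata."""
    return {"user": {"camera": camera_number, "area": area, "lot": lot_number}}


def upload_with_metadata(dataset, local_path, user_metadata):
    """Upload a file and attach user metadata in the same call."""
    return dataset.items.upload(local_path=local_path, item_metadata=user_metadata)


def add_user_metadata(item, context):
    """Merge extra context into the user metadata of an existing item."""
    user_section = item.metadata.setdefault("user", {})
    user_section.update(context)
    return item.update()  # persists the change on the platform
```

For example, `upload_with_metadata(dataset, "/path/to/image.jpg", build_user_metadata(3, "north gate", "LOT-17"))` would upload one file with its context attached; the path and values here are placeholders.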


Linked Items / URL Items

Linked items let you use a file on the Dataloop platform without storing it on Dataloop servers or setting up an integration with cloud storage. A linked item is a JSON file that contains a URL pointing to the binary file stored on the customer's storage.

You can use links in any of the following scenarios:

  • To Keep Binaries Out of the Dataloop Platform: By utilizing links, you can avoid storing the binary files directly on Dataloop servers. This is advantageous when you want to reduce data storage costs or maintain data on your own infrastructure while still benefiting from Dataloop's functionality.
  • To Duplicate Items Without Actually Duplicating Their Binaries: Links allow you to create copies or references to items in Dataloop without replicating the associated binary files. This is helpful when you want to organize or categorize items differently within Dataloop without increasing the storage footprint.
  • To Reference Public Images by URL: When working with public images or files hosted externally, you can use links to reference them by their URLs. This simplifies the process of incorporating and displaying these external resources within your Dataloop projects without the need for redundant storage.

Create Linked Item (JSON)

  1. Create a JSON file that contains a URL to an item from your bucket.
  2. Upload the JSON file to the platform.

The JSON representation will be displayed as a thumbnail of the original item, accompanied by a link symbol. Whenever you click to open the file, the Dataloop platform will retrieve the stream of the original item.

For linked items, the video duration is not available, unlike local items where you can retrieve this information.

{
  "type": "link",
  "shebang": "dataloop",
  "metadata": {
    "dltype": "link",
    "linkInfo": {
      "type": "url",
      "ref": "https://www.example.com",
      "mimetype": "video"
    }
  }
}
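
If you need to produce many such files, a small script can generate them. The sketch below reproduces the JSON structure shown above using only the Python standard library; the function names and the output file name are illustrative, and uploading the resulting file to the platform is not shown.

```python
import json
from pathlib import Path


def build_link_item(url, mimetype="video"):
    """Return the linked-item structure shown above for the given URL."""
    return {
        "type": "link",
        "shebang": "dataloop",
        "metadata": {
            "dltype": "link",
            "linkInfo": {"type": "url", "ref": url, "mimetype": mimetype},
        },
    }


def write_link_item(url, out_path, mimetype="video"):
    """Write the linked-item JSON to out_path and return the path."""
    out_path = Path(out_path)
    out_path.write_text(json.dumps(build_link_item(url, mimetype), indent=2))
    return out_path


# Example: creates my_video_link.json in the current directory
write_link_item("https://www.example.com", "my_video_link.json")
```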

Bulk Connection of Multiple Linked Items

To link multiple items to a Dataset in bulk, the platform lets you import them from a CSV file.

Simply upload a CSV file that includes a list of URLs corresponding to the items in your storage bucket. Once the CSV is successfully uploaded, the platform automatically generates a folder using the name of the CSV file and creates JSON files within that folder. These JSON files are then linked to the original items and are readily available within your dataset.

Bulk Upload

CSV files can also serve as a means to bulk-upload file binaries.

CSV File Format

Field            | Required | Description
-----------------|----------|------------------------------------------------------------
ID               | Yes      | File name to use if a name is not provided.
URL              | Yes      | Link to the item. The link must be public for the item to be displayed in the browser.
image_bytes      | Yes      | Mandatory if the URL is not provided.
name             | No       | Item name to use in the Dataset.
action           | No       | link: generate the JSON (linked item). upload: fetch the file and upload it into Dataloop.
mimetype         | No       | Image/jpg: the default mime type. Video: used for items that include videos.
item_metadata    | No       | Update the metadata of the item.
item_description | No       | Add text to the description root property of the item.
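
A CSV in this format can be generated with the Python standard library. The sketch below assumes that the column headers match the field names listed above and uses only a subset of them; the URLs, IDs, and descriptions are placeholders, and you should verify the exact header spelling against a sample file from the platform.

```python
import csv
import io

# Subset of the fields listed above; header spelling is assumed to match them.
FIELDS = ["ID", "URL", "name", "action", "item_description"]


def build_items_csv(rows):
    """Serialize a list of row dictionaries into CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)  # missing optional fields are written as empty cells
    return buffer.getvalue()


rows = [
    {"ID": "1", "URL": "https://www.example.com/a.jpg", "name": "a.jpg", "action": "link"},
    {"ID": "2", "URL": "https://www.example.com/b.jpg", "action": "upload",
     "item_description": "fetched from a public URL"},
]
print(build_items_csv(rows))
```

Rows with `action` set to `link` would produce linked items, while `upload` asks the platform to fetch the binary.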

ETL Pre-Processing

ETL (Extract, Transform, Load) pre-processing, also known as data pre-processing, is a critical stage in the data pipeline that occurs before data is loaded into the Dataloop platform.

A created event trigger on the Dataloop platform fires each time a file is added to a dataset. By default, any image file that is uploaded or added to Dataloop is automatically queued for processing by a global image preprocessing service, which:

  1. Generates a thumbnail for the image.
  2. Extracts information about the file and stores it in the item's metadata entity on the Dataloop platform, including:
    • Name
    • Size
    • Encoding (for example, 7bit)
    • Mimetype (for example, image/jpeg)
    • Exif (image orientation)
    • Height (in pixels)
    • Width (in pixels)
    • Dimensions
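
To give a feel for the kind of information collected in step 2, here is a small standard-library sketch that extracts the name, size, mimetype, and (for PNG files only) width and height of a local file. It is an illustration under simplifying assumptions, not Dataloop's preprocessing service, and the function names are hypothetical.

```python
import mimetypes
import os
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_size(path):
    """Return (width, height) from a PNG's IHDR chunk, or None if not a PNG."""
    with open(path, "rb") as handle:
        header = handle.read(24)
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return None
    # Width and height are two big-endian 32-bit integers right after "IHDR"
    return struct.unpack(">II", header[16:24])


def extract_file_info(path):
    """Collect basic file information similar to the list above."""
    mimetype, _encoding = mimetypes.guess_type(path)
    info = {
        "name": os.path.basename(path),
        "size": os.path.getsize(path),  # in bytes
        "mimetype": mimetype,
    }
    dimensions = read_png_size(path)
    if dimensions is not None:
        info["width"], info["height"] = dimensions
    return info
```

Other formats, Exif orientation, and thumbnail generation would need an imaging library and are left out of this sketch.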

Private Preprocessing Service

Dataloop's global preprocessing service is designed to serve all of its customers as a unified service, capable of scaling in response to varying workloads while maintaining an optimal configuration.

Customers who want to be independent of the global preprocessing service, so that production-level projects are unaffected by load generated by other Dataloop customers, can install and run the service within their own projects using dedicated resources.

To learn more about this option, contact Dataloop.

Additional Preprocessing Services

You can define a created trigger within your projects so that uploaded items are passed to your own functions for tailored preprocessing.
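
A function invoked by such a trigger could look roughly like the sketch below. It assumes the function receives the newly created item as a dtlpy Item object (with `name`, `metadata`, and `update()`); the handler name and the `extension` field are hypothetical, and deploying the function and attaching the trigger are not shown.

```python
def on_item_created(item):
    """Hypothetical handler: record the file extension as user metadata."""
    name = item.name
    extension = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    item.metadata.setdefault("user", {})["extension"] = extension
    return item.update()  # persist the metadata change on the platform
```

Any tailored step, such as validation, format conversion, or tagging, would follow the same pattern: read the item, do the work, and write the result back.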

