Create Dataset & Upload Data
In the Dataloop system, a dataset is a collection of items (files), their metadata and annotations. A dataset can have a file-system-like structure, with folders and subfolders at any level.
There are different types of Datasets:
- Master - Original dataset, managing the actual binaries
- Clone - Contains pointers to the original files, so cloned or copied items are managed as virtual items without replicating the binaries in the underlying storage. When cloning a dataset, users can decide whether the new copy will include the metadata and annotations created on the original.
- Merge - Multiple datasets can be merged into one, enabling multiple annotations to be merged onto the same item.
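The clone option described above can also be exercised from the Python SDK (`dtlpy`). The following is a minimal sketch, not a definitive recipe: the helper name `clone_dataset` and the project and dataset names are hypothetical, and the `clone` keyword arguments reflect the SDK as commonly documented, so verify them against the SDK reference for your installed version.

```python
def clone_dataset(dataset, clone_name, keep_annotations=True, keep_metadata=True):
    """Clone `dataset` and choose whether the copy carries annotations and metadata.

    `dataset` is expected to be a dtlpy Dataset entity. The clone holds pointers
    to the original items rather than new copies of the binaries.
    """
    return dataset.clone(
        clone_name=clone_name,
        with_items_annotations=keep_annotations,  # assumption: SDK keyword name
        with_metadata=keep_metadata,              # assumption: SDK keyword name
    )


def main():
    """Example run; requires `pip install dtlpy` and a Dataloop account."""
    import dtlpy as dl

    if dl.token_expired():
        dl.login()
    project = dl.projects.get(project_name="My Project")      # hypothetical name
    master = project.datasets.get(dataset_name="my-dataset")  # hypothetical name
    clone = clone_dataset(master, "my-dataset-clone", keep_annotations=False)
    print(clone.name)
```

Calling `main()` clones the dataset without its annotations while keeping the item metadata.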
The Dataloop platform has a flexible storage engine that lets you attach different binary storage devices, such as:
- Cloud storage devices such as GCS, S3, Elastifile, etc. Connect your data without copying it to the Dataloop storage system, so that you keep a single source of truth for your files and comply with various regulations.
- File system storage
- Network drives
- Databases, such as MongoDB GridFS
Each storage medium is supported through its own driver, and additional drivers are continuously being added.
Creating a Dataset
- Click “NEW Dataset” from the Project-Overview page, or the Datasets main page of your project (Data-Management->Datasets on the left-side navigation menu).
- Type the dataset’s name in the popup box, select your storage type (Dataloop file system or external storage), and then click "OK".
- New datasets are created with a new recipe linked to them. To connect your new dataset to an existing recipe, enable the “Existing Recipe” option and select the required recipe from the drop-down list.
- If you’ve selected an external/cloud storage option, another step is added to the process:
- Integration selection is pre-populated based on your selection in the previous step. You can still change it, or click 'Add New Integration' to add a new integration/secret.
- Select your storage driver from the list or click 'Add Driver' to start the process of adding a new driver.
- By default, files stored on your external storage begin syncing as soon as you create the dataset. To avoid an immediate sync, uncheck the option 'Start Sync Process Now'.
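The same dataset can be created programmatically. The sketch below assumes the Python SDK (`dtlpy`) and covers only the default case of a dataset on the Dataloop file system; the helper name `create_dataset` and the project and dataset names are hypothetical, and attaching an external storage driver is not shown.

```python
def create_dataset(project, dataset_name):
    """Create a dataset named `dataset_name` inside a dtlpy Project entity."""
    return project.datasets.create(dataset_name=dataset_name)


def main():
    """Example run; requires `pip install dtlpy` and a Dataloop account."""
    import dtlpy as dl

    if dl.token_expired():
        dl.login()  # opens a browser window for authentication
    project = dl.projects.get(project_name="My Project")  # hypothetical name
    dataset = create_dataset(project, "my-dataset")       # hypothetical name
    print(dataset.name, dataset.id)
```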
The dataset browser enables you to organize your files in a file-system-like structure of folders and subfolders. From the dataset browser, click the Upload icon and select whether to upload individual files or entire folders. This selection is required because the browser's pop-up windows for selecting files and for selecting folders are different.
Another option is to drag and drop your files and folders onto the thumbnails area, which starts the upload process.
Browsers are not optimized for managing the upload process, so uploading files larger than 1 GB, or more than 100 folders, is likely to crash or freeze the browser. The actual limits may depend on your PC setup and Internet connection. To upload a large number of files and folders, use the SDK as detailed in our tutorial.
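Before choosing between the browser and the SDK, it can help to check a local folder against the limits mentioned above. The following is a rough sketch under stated assumptions: `scan_upload_dir` and `should_use_sdk` are hypothetical helpers, either limit on its own is treated as a reason to prefer the SDK, and the thresholds are simply the figures quoted in the text. The `main` function then uploads the folder with `dtlpy`, assuming the dataset already exists.

```python
import os
from dataclasses import dataclass, field
from typing import List

ONE_GB = 1024 ** 3


@dataclass
class UploadScan:
    file_count: int = 0
    folder_count: int = 0
    large_files: List[str] = field(default_factory=list)


def scan_upload_dir(root, max_file_bytes=ONE_GB):
    """Count files and subfolders under `root` and list files over the size limit."""
    scan = UploadScan()
    for dirpath, dirnames, filenames in os.walk(root):
        scan.folder_count += len(dirnames)
        for name in filenames:
            path = os.path.join(dirpath, name)
            scan.file_count += 1
            if os.path.getsize(path) > max_file_bytes:
                scan.large_files.append(path)
    return scan


def should_use_sdk(scan, max_folders=100):
    """Return True when the folder is likely too heavy for a browser upload."""
    return bool(scan.large_files) or scan.folder_count > max_folders


def main(local_path="/path/to/data"):
    """Example run; requires `pip install dtlpy` and a Dataloop account."""
    import dtlpy as dl

    scan = scan_upload_dir(local_path)
    print(f"{scan.file_count} files, {scan.folder_count} folders, "
          f"{len(scan.large_files)} over 1 GB")
    if should_use_sdk(scan):
        if dl.token_expired():
            dl.login()
        project = dl.projects.get(project_name="My Project")       # hypothetical name
        dataset = project.datasets.get(dataset_name="my-dataset")  # hypothetical name
        dataset.items.upload(local_path=local_path, remote_path="/")
```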
Browsers may automatically rotate images according to their EXIF orientation when they are viewed in the annotation studio. To avoid unwanted rotations, we strongly advise ensuring that the EXIF values on your images are as intended. For more information, read here.
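One way to check EXIF values before uploading is to read the orientation tag (0x0112) from each JPEG. The sketch below uses only the Python standard library and is illustrative rather than a full EXIF parser: the function names are hypothetical, it handles only JPEG input, and it assumes the metadata segments appear before the image data, as is typical. A value of 1 means no rotation is applied; values 2 to 8 indicate a flip or rotation that a browser may apply on display.

```python
import struct
from typing import Optional

ORIENTATION_TAG = 0x0112  # standard EXIF/TIFF orientation tag


def _orientation_from_tiff(tiff: bytes) -> Optional[int]:
    """Read the orientation entry from the first IFD of a TIFF-formatted EXIF block."""
    if len(tiff) < 8:
        return None
    if tiff[:2] == b"II":
        endian = "<"  # little-endian
    elif tiff[:2] == b"MM":
        endian = ">"  # big-endian
    else:
        return None
    (ifd_offset,) = struct.unpack(endian + "I", tiff[4:8])
    if ifd_offset + 2 > len(tiff):
        return None
    (count,) = struct.unpack(endian + "H", tiff[ifd_offset:ifd_offset + 2])
    for i in range(count):
        start = ifd_offset + 2 + 12 * i
        entry = tiff[start:start + 12]
        if len(entry) < 12:
            return None
        (tag,) = struct.unpack(endian + "H", entry[:2])
        if tag == ORIENTATION_TAG:
            # Orientation is a SHORT stored in the first 2 bytes of the value field.
            (value,) = struct.unpack(endian + "H", entry[8:10])
            return value
    return None


def exif_orientation(jpeg_bytes: bytes) -> Optional[int]:
    """Return the EXIF orientation (1-8) of a JPEG, or None if it has none."""
    if jpeg_bytes[:2] != b"\xff\xd8":
        return None  # not a JPEG
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            return None
        marker = jpeg_bytes[pos + 1]
        if marker == 0xDA:  # start of scan: compressed image data follows
            return None
        (seg_len,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        payload = jpeg_bytes[pos + 4:pos + 2 + seg_len]
        if marker == 0xE1 and payload[:6] == b"Exif\x00\x00":
            return _orientation_from_tiff(payload[6:])
        pos += 2 + seg_len
    return None
```

For example, `exif_orientation(open("photo.jpg", "rb").read())` (with a hypothetical file name) returns the stored orientation, so images with a value other than 1 or None can be reviewed before upload.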