Private cloud integration

Before adding your cloud data to a dataset, you need to integrate your cloud storage with Encord. See the Data integrations section to learn how to create integrations for AWS S3, Azure blob, GCP storage, or Open Telekom Cloud.

To add your cloud-stored data, turn on the Private cloud toggle in the Upload data part of the data creation flow.

To add your cloud data:

  1. Select the relevant integration using the Select integration drop-down.
  2. Upload an appropriately formatted JSON or CSV file specifying the data you would like to add to the dataset, either by clicking the upload rectangle or by dragging the file into it.

The JSON and CSV formats are described below.

Your stored objects may include files that Encord does not support, which can produce errors on upload. If so, turn on the 'Ignore individual file errors' toggle.

Once the JSON or CSV file is uploaded, click the Create dataset button. The data will now be fetched from your cloud storage and processed asynchronously. This processing involves fetching appropriate metadata and other file information to help us render the files appropriately and to check for any framerate inconsistencies. We do not store your files in any way.

You can check the progress of the processing job by clicking the notification bell in the top right. A spinning progress indicator shows that the job is still in progress. A successful job completes with a green tick; a failed job shows a red cross. If the job fails, check that your provider permissions have been set correctly, that the object data format is supported, and that the JSON or CSV file is correctly formatted.

JSON format

The JSON file format is a JSON object with top-level keys specifying the type of data and the object URLs of the content you wish to add to the dataset. The object URLs must not contain any whitespace. You can add one data type at a time, or combine multiple data types in one JSON file according to your preferences or development flows. The supported top-level keys are videos, image_groups, images, and dicom_series. The format for each is described in detail below.
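The rules above can be checked before uploading. The following is a minimal sketch, not part of Encord's tooling: the function name find_spec_problems and the message wording are hypothetical, and it only checks the top-level keys and the whitespace rule described in this section.

```python
# Top-level data keys named in this section.
DATA_KEYS = {"videos", "images", "image_groups", "dicom_series"}


def find_spec_problems(spec):
    """Return a list of human-readable problems found in an upload spec dict."""
    problems = []
    for key, value in spec.items():
        if key == "skip_duplicate_urls":
            continue  # optional top-level flag, not a data type
        if key not in DATA_KEYS:
            problems.append(f"unsupported top-level key: {key!r}")
            continue
        for index, item in enumerate(value):
            for field, field_value in item.items():
                # "objectUrl" is used by videos and images; "objectUrl_<n>"
                # is used by image groups and DICOM series.
                if field.startswith("objectUrl") and any(
                    ch.isspace() for ch in field_value
                ):
                    problems.append(f"{key}[{index}].{field} contains whitespace")
    return problems


example_spec = {
    "videos": [{"objectUrl": "https://example.com/my video.mp4"}],
    "skip_duplicate_urls": True,
}
print(find_spec_problems(example_spec))
```

An empty list means the spec passed these two checks only; it says nothing about permissions or whether the objects exist.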

note

You can optionally add some custom client metadata per data item in the clientMetadata field. See examples below on how to add this.

We enforce a 10MB limit on the client metadata per data item. The metadata is stored internally as a PostgreSQL jsonb type, so please read the relevant PostgreSQL docs about the jsonb type and its behaviours. For example, jsonb does not preserve key order or duplicate keys.
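As a rough pre-flight check, the size of a metadata object can be estimated locally. This sketch makes two assumptions that are not confirmed above: that "10MB" means 10 * 1024 * 1024 bytes, and that the limit applies to the metadata serialised as compact UTF-8 JSON. The helper names are hypothetical.

```python
import json

# Assumption: "10MB" is interpreted as 10 * 1024 * 1024 bytes.
CLIENT_METADATA_LIMIT_BYTES = 10 * 1024 * 1024


def client_metadata_size(metadata):
    """Size in bytes of the metadata serialised as compact UTF-8 JSON."""
    return len(json.dumps(metadata, separators=(",", ":")).encode("utf-8"))


def fits_client_metadata_limit(metadata):
    """True if the estimated size is within the assumed limit."""
    return client_metadata_size(metadata) <= CLIENT_METADATA_LIMIT_BYTES


# Python's JSON parser keeps only the last value for a repeated key, and
# jsonb likewise keeps only the last value, so avoid relying on duplicates.
parsed = json.loads('{"camera": "front", "camera": "rear"}')
print(parsed)  # {'camera': 'rear'}
```

Because the exact accounting is not specified here, treat the result as an estimate and leave some headroom.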

note

Add the top-level "skip_duplicate_urls": true flag to make uploads idempotent. Skipping URLs that are already in the dataset helps you complete large uploads that were interrupted, for example by an unstable network connection. Because previously processed assets are not uploaded again, you can retry the failed operation without editing the upload specification file.

See the example in the Videos section below.

The exact semantics for each data type are described in the relevant sections below. The flag defaults to false and is currently supported only for JSON uploads.

Videos

Each object in the videos array is a JSON object with the key objectUrl specifying the full URL of where to find the video resource. The title field is optional. If not specified, the file name of the video will be used.

Here we add the skip_duplicate_urls flag and set it to true. If set to true, videos that have been previously uploaded to the dataset with the same object URL will be skipped. The default of this flag is set to false.

See the sample below.

{
  "videos": [
    {
      "objectUrl": "<object url>"
    },
    {
      "objectUrl": "<object url>",
      "title": "my-custom-video-title.mp4",
      "clientMetadata": {"optional": "metadata"}
    }
  ],
  "skip_duplicate_urls": true
}
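A file in this shape can also be generated rather than written by hand, which avoids syntax slips such as trailing commas. The sketch below is illustrative only: build_video_spec is a hypothetical helper, and the bucket and file names are placeholders.

```python
import json


def build_video_spec(entries, skip_duplicate_urls=True):
    """Build the upload spec from (object_url, title, client_metadata) tuples.

    title and client_metadata may be None, in which case the optional
    field is left out of the entry.
    """
    videos = []
    for object_url, title, client_metadata in entries:
        video = {"objectUrl": object_url}
        if title is not None:
            video["title"] = title
        if client_metadata is not None:
            video["clientMetadata"] = client_metadata
        videos.append(video)
    return {"videos": videos, "skip_duplicate_urls": skip_duplicate_urls}


# Placeholder URLs; replace them with your own object URLs.
video_spec = build_video_spec([
    ("https://example-bucket.s3.eu-west-2.amazonaws.com/a.mp4", None, None),
    (
        "https://example-bucket.s3.eu-west-2.amazonaws.com/b.mp4",
        "my-custom-video-title.mp4",
        {"optional": "metadata"},
    ),
])
video_spec_text = json.dumps(video_spec, indent=4)
print(video_spec_text)
```

Write video_spec_text to a .json file and upload that file in the data creation flow.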

Single Images

The JSON structure for images parallels that of videos. The title field is optional. If not specified, the file name of the image will be used.

If skip_duplicate_urls is set to true, images that have been previously uploaded to the dataset with the same object URL will be skipped. The flag defaults to false.

See the sample below.

{
  "images": [
    {
      "objectUrl": "<object url>"
    },
    {
      "objectUrl": "<object url>",
      "title": "my-custom-image-title.jpeg",
      "clientMetadata": {"optional": "metadata"}
    }
  ]
}

Image groups

Image groups are a group of images that should be processed as one annotation task. Encord supports representing image groups in two formats. The first format is called the "native" or "original" representation where images are presented unaltered. This means that images of different sizes and resolutions can form one image group, and no data is lost. The second format is what we call the video representation, sometimes also known as an image sequence. For further details, see the relevant editor documentation. For the details on how to select the original/video representation when uploading from private cloud, please consult the documentation below.

Each object in the image_groups array represents an individual image group. In all cases, it is necessary to specify the title of the group and the position of each image in the group by naming the keys according to the sequence number e.g. objectUrl_#{sequence_number}. The objectUrl_#{sequence_number} keys need to be in order for the upload to succeed, as shown in the sample below.

The createVideo flag specifies whether an image group is created using the video representation. It is optional in the JSON format: leave it out, or include it and set it to true, to use the video representation; set it explicitly to false to use the original images.

In the CSV format (described below), the Create video column is mandatory. If the value is left blank in a given row it defaults to true, so set it explicitly to false to opt out of the video representation. Image groups without a video representation do not require write permissions to your private bucket. See the sample below.

If skip_duplicate_urls is set to true, image groups where all object URLs exactly match an existing image group in the dataset will be skipped. The flag defaults to false.

The custom client metadata is per image group, not per image.

{
  "image_groups": [
    {
      "title": "<title 1>",
      "createVideo": true,
      "objectUrl_0": "<object url>"
    },
    {
      "title": "<title 2>",
      "createVideo": false,
      "objectUrl_0": "<object url>",
      "objectUrl_1": "<object url>",
      "objectUrl_2": "<object url>",
      "clientMetadata": {"optional": "metadata"}
    }
  ]
}
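Because the objectUrl_<sequence_number> keys must appear in order, it can be easier to generate each group from an ordered list of URLs. This is a minimal sketch; build_image_group is a hypothetical helper and the URLs are placeholders.

```python
def build_image_group(title, object_urls, create_video=True, client_metadata=None):
    """Build one image_groups entry with sequentially numbered URL keys."""
    if not object_urls:
        raise ValueError("an image group needs at least one object URL")
    group = {"title": title, "createVideo": create_video}
    # enumerate() yields 0, 1, 2, ... so the keys are inserted in sequence order.
    for sequence_number, object_url in enumerate(object_urls):
        group[f"objectUrl_{sequence_number}"] = object_url
    if client_metadata is not None:
        # Client metadata applies to the whole group, not to single images.
        group["clientMetadata"] = client_metadata
    return group


# Placeholder URLs; replace them with your own object URLs.
example_group = build_image_group(
    "my-image-group",
    [
        "https://example-bucket.s3.eu-west-2.amazonaws.com/img-0.jpg",
        "https://example-bucket.s3.eu-west-2.amazonaws.com/img-1.jpg",
    ],
    create_video=False,
)
print(example_group)
```

Python dictionaries preserve insertion order, so serialising the result with json.dump keeps the keys in sequence.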

DICOM

Like image_groups, the dicom_series elements require a title and a sequenced object URL. See the sample below.

If skip_duplicate_urls is set to true, DICOM series where all object URLs exactly match an existing DICOM series in the dataset will be skipped. The flag defaults to false.

The custom client metadata is per DICOM series.

{
  "dicom_series": [
    {
      "title": "<title 1>",
      "objectUrl_0": "<object url>"
    },
    {
      "title": "<title 2>",
      "objectUrl_0": "<object url>",
      "objectUrl_1": "<object url>",
      "objectUrl_2": "<object url>",
      "clientMetadata": {"optional": "metadata"}
    }
  ]
}

CSV format

The CSV file should be structured with three columns with the following headings: ObjectUrl, Image group title, and Create video. ObjectUrl is used for all data modalities, and specifies the URL of the resource. The object URLs are from your cloud provider and cannot contain any whitespace.

The other two columns must be present in the CSV in all cases, but only need to contain a value when creating an image group. Image group title is the name of the group to which to assign the image. You can leave this column blank in other cases. The Create video argument specifies if the image group will be created as the video representation or not. Again, the column heading is necessary in all cases, but a value is only necessary when dealing with image groups. The behavior will default to true (use the video representation) -- set the value to false for all images in an image group to use the original representation. Note that the video representation requires write permissions against your storage bucket to save the compressed video. The original image representation does not require any permissions beyond reading individual objects.

Here is the format if there are 3 videos, and 3 images split into 2 image groups:

| ObjectUrl | Image group title | Create video |
|---|---|---|
| <object url> | | |
| <object url> | | |
| <object url> | | |
| <object url> | <title 1> | true |
| <object url> | <title 1> | true |
| <object url> | <title 2> | false |
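A file with these three columns can be produced with Python's csv module. The sketch below assumes a comma delimiter, lowercase true/false values, and empty cells for videos; write_upload_csv is a hypothetical helper and the URLs are placeholders.

```python
import csv
import io

HEADER = ["ObjectUrl", "Image group title", "Create video"]


def write_upload_csv(out, video_urls, image_groups):
    """Write the upload CSV to a text stream.

    image_groups is an iterable of (title, create_video, urls) tuples.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    for url in video_urls:
        # Videos leave the two image-group columns blank.
        writer.writerow([url, "", ""])
    for title, create_video, urls in image_groups:
        for url in urls:
            writer.writerow([url, title, "true" if create_video else "false"])


# Placeholder URLs; replace them with your own object URLs.
base = "https://example-bucket.s3.eu-west-2.amazonaws.com"
csv_buffer = io.StringIO()
write_upload_csv(
    csv_buffer,
    [f"{base}/video-1.mp4", f"{base}/video-2.mp4"],
    [
        ("group-a", True, [f"{base}/a-0.jpg", f"{base}/a-1.jpg"]),
        ("group-b", False, [f"{base}/b-0.jpg"]),
    ],
)
print(csv_buffer.getvalue())
```

To write to disk instead, pass a file opened with open(path, "w", newline="") in place of the StringIO buffer.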

See below for examples for each of the providers we support.

Examples and Helpful Scripts

Use the following examples and helpful scripts to quickly learn how to create JSON and CSV files formatted for the dataset creation process, by constructing the URLs from the specified path in your private storage.

AWS S3

AWS S3 object URLs can follow a few set patterns:

  • Virtual-hosted style: https://<bucket-name>.s3.<region>.amazonaws.com/<key-name>
  • Path-style: https://s3.<region>.amazonaws.com/<bucket-name>/<key-name>
  • S3 protocol: s3://<bucket-name>/<key-name>
  • Legacy: those without regions or those with s3-<region> in the URL

AWS best practice is to use Virtual-hosted style. Path-style is planned to be deprecated and the legacy URLs are already deprecated.

We support Virtual-hosted style, Path-style and S3 protocol object URLs. We recommend you use Virtual-hosted style object URLs wherever possible.
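If your tooling produces S3 protocol or Path-style URLs, they can be rewritten to Virtual-hosted style with plain string handling. This is a sketch under stated assumptions: to_virtual_hosted_url is a hypothetical helper, the key is copied through unchanged (no URL encoding is applied), and any URL that matches neither pattern is returned as-is.

```python
def to_virtual_hosted_url(object_url, region):
    """Rewrite an s3:// or Path-style URL to Virtual-hosted style."""
    s3_prefix = "s3://"
    path_style_prefix = f"https://s3.{region}.amazonaws.com/"
    if object_url.startswith(s3_prefix):
        rest = object_url[len(s3_prefix):]
    elif object_url.startswith(path_style_prefix):
        rest = object_url[len(path_style_prefix):]
    else:
        # Assumed to be Virtual-hosted style already; leave untouched.
        return object_url
    # The first path segment is the bucket; everything after it is the key.
    bucket, _, key = rest.partition("/")
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"


print(to_virtual_hosted_url("s3://my-bucket/dir/file.jpg", "eu-west-2"))
print(to_virtual_hosted_url(
    "https://s3.eu-west-2.amazonaws.com/my-bucket/dir/file.jpg", "eu-west-2"
))
```

The s3:// form carries no region, so the region has to be supplied by the caller.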

Object URLs can be found in the Properties tab of the object in question. Navigate to AWS S3 > bucket > object > Properties to find the Object URL.

Here is an example of a JSON file with two images, two videos, and three image groups:

{
  "images": [
    {
      "objectUrl": "https://cord-dev.s3.eu-west-2.amazonaws.com/Image1.png"
    },
    {
      "objectUrl": "https://cord-dev.s3.eu-west-2.amazonaws.com/Image2.png"
    }
  ],
  "videos": [
    {
      "objectUrl": "https://cord-dev.s3.eu-west-2.amazonaws.com/Cooking.mp4"
    },
    {
      "objectUrl": "https://cord-dev.s3.eu-west-2.amazonaws.com/Oranges.mp4"
    }
  ],
  "image_groups": [
    {
      "title": "apple-samsung-light",
      "createVideo": true,
      "objectUrl_0": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/1-Samsung-S4-Light+Environment/1+(32).jpg",
      "objectUrl_1": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/1-Samsung-S4-Light+Environment/1+(33).jpg",
      "objectUrl_2": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/1-Samsung-S4-Light+Environment/1+(34).jpg",
      "objectUrl_3": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/1-Samsung-S4-Light+Environment/1+(35).jpg"
    },
    {
      "title": "apple-samsung-dark",
      "createVideo": true,
      "objectUrl_0": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/2-samsung-S4-Dark+Environment/2+(32).jpg",
      "objectUrl_1": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/2-samsung-S4-Dark+Environment/2+(33).jpg",
      "objectUrl_2": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/2-samsung-S4-Dark+Environment/2+(34).jpg",
      "objectUrl_3": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/2-samsung-S4-Dark+Environment/2+(35).jpg"
    },
    {
      "title": "apple-ios-light",
      "createVideo": false,
      "objectUrl_0": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/3-IOS-4-Light+Environment/3+(32).jpg",
      "objectUrl_1": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/3-IOS-4-Light+Environment/3+(33).jpg"
    }
  ]
}

Here are the videos and image groups from the JSON above in a CSV file:

| ObjectUrl | Image group title | Create video |
|---|---|---|
| https://cord-dev.s3.eu-west-2.amazonaws.com/Cooking.mp4 | | |
| https://cord-dev.s3.eu-west-2.amazonaws.com/Oranges.mp4 | | |
| https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/1-Samsung-S4-Light+Environment/1+(32).jpg | apple-samsung-light | true |
| https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/1-Samsung-S4-Light+Environment/1+(33).jpg | apple-samsung-light | true |
| https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/1-Samsung-S4-Light+Environment/1+(34).jpg | apple-samsung-light | true |
| https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/1-Samsung-S4-Light+Environment/1+(35).jpg | apple-samsung-light | true |
| https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/2-samsung-S4-Dark+Environment/2+(32).jpg | apple-samsung-dark | true |
| https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/2-samsung-S4-Dark+Environment/2+(33).jpg | apple-samsung-dark | true |
| https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/2-samsung-S4-Dark+Environment/2+(34).jpg | apple-samsung-dark | true |
| https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/2-samsung-S4-Dark+Environment/2+(35).jpg | apple-samsung-dark | true |
| https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/3-IOS-4-Light+Environment/3+(32).jpg | apple-ios-light | false |
| https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/3-IOS-4-Light+Environment/3+(33).jpg | apple-ios-light | false |

Here's a Python script which creates a JSON file for single images by constructing the URLs from the specified path in a given S3 bucket. You'll need to configure the following variables to match your setup.

  1. region: needs to be the AWS resource region you intend to use. For S3, it's the region where your bucket is.
  2. aws_profile: is the name of the profile in the AWS ~/.aws/credentials file. See AWS Credentials Documentation to set up the credentials file properly.
  3. bucket_name: the name of your S3 bucket you want to pull files from.
  4. s3_directory: the path to the directory where your files are stored inside the S3 bucket. Include all slashes except the trailing slash. For example:
# my file is at my-bucket/some_top_level_dir/video_files/my_video.mp4
# then set s3 directory as follows
s3_directory = 'some_top_level_dir/video_files'

And the script itself:

import json

import boto3

region = 'FILL_ME_IN'
aws_profile = 'FILL_ME_IN'
bucket_name = 'FILL_ME_IN'
s3_directory = 'FILL_ME_IN'

# Build Virtual-hosted style URLs, the style recommended above
root_url = f'https://{bucket_name}.s3.{region}.amazonaws.com'

# Create the S3 resource from the session so the named profile is actually used
session = boto3.Session(profile_name=aws_profile, region_name=region)
bucket = session.resource('s3').Bucket(bucket_name)

images = []
# List only the keys under s3_directory instead of the whole bucket
for object_summary in bucket.objects.filter(Prefix=f'{s3_directory}/'):
    directory, _, file_name = object_summary.key.rpartition('/')
    # Keep files directly inside s3_directory; skip sub-directories
    # and empty "folder" placeholder keys
    if directory == s3_directory and file_name:
        images.append({'objectUrl': f'{root_url}/{object_summary.key}'})

# Slashes in s3_directory would be read as a local path, so replace them
output_name = f"upload_images_{s3_directory.replace('/', '_')}.json"
with open(output_name, 'w') as output_file:
    json.dump({'images': images}, output_file, indent=4)

Azure blob

{
  "videos": [
    {
      "objectUrl": "https://myaccount.blob.core.windows.net/myblob"
    },
    {
      "objectUrl": "https://myaccount.blob.core.windows.net/mycontainer/myblob.jpg"
    },
    {
      "objectUrl": "https://myaccount.blob.core.windows.net/mycontainer/myblobs/myblob.jpg"
    }
  ],
  "image_groups": [
    {
      "title": "image_group_1",
      "objectUrl_0": "https://myaccount.blob.core.windows.net/mycontainer/myblob1.jpg",
      "objectUrl_1": "https://myaccount.blob.core.windows.net/mycontainer/myblob2.jpg"
    },
    {
      "title": "image_group2",
      "objectUrl_0": "https://myaccount.blob.core.windows.net/mycontainer/myblob3.jpg",
      "objectUrl_1": "https://myaccount.blob.core.windows.net/mycontainer/myblob4.jpg"
    }
  ]
}

GCP storage

{
  "videos": [
    {
      "objectUrl": "gs://example-url/object.mp4"
    }
  ],
  "image_groups": [
    {
      "title": "image_group_1",
      "objectUrl_0": "https://storage.cloud.google.com/example-image-bucket/object_1.jpg",
      "objectUrl_1": "https://storage.cloud.google.com/example-image-bucket/object_2.jpg"
    },
    {
      "title": "image_group_2",
      "objectUrl_0": "https://storage.cloud.google.com/example-image-bucket/object_3.jpg",
      "objectUrl_1": "https://storage.cloud.google.com/example-image-bucket/object_4.jpg"
    }
  ]
}

Open Telekom Cloud OSS

{
  "dicom_series": [
    {
      "title": "OPEN_TELEKOM_DICOM_SERIES",
      "objectUrl_0": "https://bucket-name.obs.eu-de.otc.t-systems.com/dicom-file-0",
      "objectUrl_1": "https://bucket-name.obs.eu-de.otc.t-systems.com/dicom-file-1",
      "objectUrl_2": "https://bucket-name.obs.eu-de.otc.t-systems.com/dicom-file-2",
      "objectUrl_3": "https://bucket-name.obs.eu-de.otc.t-systems.com/dicom-file-3"
    }
  ]
}