Private cloud integration

Before adding your cloud data to a dataset, you need to integrate your cloud storage with Encord.

Please see the Data integrations section to learn how to create an integration for your cloud storage provider.

To add your cloud data to a Dataset:

  1. When creating a new dataset, turn on the Private cloud toggle in the Create dataset part of the data creation flow.

  2. Select the relevant integration using the Select integration drop-down.

  3. Upload an appropriately formatted JSON or CSV file specifying the data you would like to add to the dataset. The JSON and CSV formats are described below. Your storage may contain files that Encord does not support, which can produce errors on upload. Turn on the Ignore individual file errors toggle to ignore these.

👍

Tip

We recommend turning on the Ignore individual file errors toggle to ensure that the entire upload doesn't fail if only one file can't be added.

👍

Tip

We recommend setting the expiration time for presigned URLs, in your cloud storage settings, to be greater than the time it takes to complete an annotation task. More information can be found in the documentation of your cloud service provider.

  4. Click Add data.

ℹ️

Note

The data will be fetched from your cloud storage and processed asynchronously. This involves fetching appropriate metadata and other file information to help us render the files appropriately and to check for any framerate inconsistencies. We do not store your files in any way.

Checking upload status

You can check the progress of the processing job by clicking the icon in the top right. A spinning progress indicator shows that the processing job is still in progress.

  • If successful, the processing completes with a success icon.
  • If unsuccessful, a failure icon is shown.

If the upload fails, please check that your provider permissions have been set correctly, that the object data format is supported, and that the JSON or CSV file is correctly formatted.

Check which files failed to upload by clicking the icon to download a CSV log file. Every row in the CSV will correspond to a file which failed to be uploaded.

ℹ️

Note

You will only see failed uploads if the Ignore individual file errors toggle wasn't enabled when uploading your data.


Creating a dataset using cloud data

To create a dataset using data from your private cloud, you will need to upload either a JSON or CSV file, specifying the URLs of all the files you'd like to add.

JSON format

The JSON file format is a JSON object with top-level keys specifying the type of data and object URLs of the content you wish to add to the dataset. Object URLs must not contain any whitespace. You can add one data type at a time, or combine multiple data types in one JSON file according to your preferences or development flows. The supported top-level keys are: videos, image_groups, images, and dicom_series. The details for each data format are given in the sections below.

👍

Tip

Confused about the difference between image groups and image sequences? See our documentation here to learn about different data types in Encord.
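The rules above (supported top-level keys, no whitespace in object URLs) can be checked before uploading. The following is a minimal sketch, not an official Encord tool: the function name check_spec is hypothetical, and it assumes that every URL key starts with objectUrl and that skip_duplicate_urls is the only top-level flag.

```python
SUPPORTED_KEYS = {"videos", "image_groups", "images", "dicom_series"}
TOP_LEVEL_FLAGS = {"skip_duplicate_urls"}  # assumption: the only top-level flag


def check_spec(spec):
    """Return a list of problems found in a parsed upload specification."""
    problems = []
    for key, value in spec.items():
        if key in TOP_LEVEL_FLAGS:
            continue
        if key not in SUPPORTED_KEYS:
            problems.append(f"unsupported top-level key: {key}")
            continue
        for index, item in enumerate(value):
            # Collects objectUrl as well as objectUrl_0, objectUrl_1, ...
            urls = [v for k, v in item.items() if k.startswith("objectUrl")]
            if not urls:
                problems.append(f"{key}[{index}] has no object URL")
            for url in urls:
                if any(ch.isspace() for ch in url):
                    problems.append(f"{key}[{index}] URL contains whitespace: {url!r}")
    return problems


example = {
    "videos": [{"objectUrl": "https://example.com/my video.mp4"}],
    "audio": [],
}
for problem in check_spec(example):
    print(problem)
```

An empty list means none of these particular checks failed; it does not guarantee that the upload itself will succeed.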

Videos

Each object in the videos array is a JSON object with the key objectUrl specifying the full URL of where to find the video resource. The title field is optional. If not specified, the video's file name will be used.

  • Video metadata (separate from client metadata) may be specified for videos. See the Specifying video metadata section below.

  • If skip_duplicate_urls is set to true, all object URLs that exactly match existing videos in the dataset will be skipped.

| Key or Flag | Required? | Default value |
|---|---|---|
| "objectUrl" | Yes | |
| "title" | No | <file title> |
| "clientMetadata" | No | |
| "skip_duplicate_urls" | No | false |
| "createVideo" | No | false |

ℹ️

Note

Keys / Flags that aren't required can be omitted from the JSON file entirely.

{
  "videos": [
    {
      "objectUrl": "<object url_1>"
    },
    {
      "objectUrl": "<object url_2>",
      "title": "my-custom-video-title.mp4",
      "clientMetadata": {"optional": "metadata"}
    }
  ],
  "skip_duplicate_urls": true
}
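If you have many videos, a file like the one above can be generated rather than written by hand. The following is a minimal sketch: the helper name build_videos_spec and the example.com URLs are placeholders, not part of Encord's tooling.

```python
import json


def build_videos_spec(entries, skip_duplicate_urls=False):
    """Build a dict in the "videos" layout shown above.

    entries is a list of (object_url, title, client_metadata) tuples.
    title and client_metadata may be None, in which case they are omitted.
    """
    videos = []
    for object_url, title, client_metadata in entries:
        if any(ch.isspace() for ch in object_url):
            raise ValueError(f"Object URL contains whitespace: {object_url!r}")
        video = {"objectUrl": object_url}
        if title is not None:
            video["title"] = title
        if client_metadata is not None:
            video["clientMetadata"] = client_metadata
        videos.append(video)
    spec = {"videos": videos}
    if skip_duplicate_urls:
        spec["skip_duplicate_urls"] = True
    return spec


spec = build_videos_spec(
    [
        ("https://example.com/video_1.mp4", None, None),
        ("https://example.com/video_2.mp4", "my-custom-video-title.mp4", {"optional": "metadata"}),
    ],
    skip_duplicate_urls=True,
)
print(json.dumps(spec, indent=2))
```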
Specifying video metadata

The JSON format allows you to specify videoMetadata for video files. videoMetadata is essential information used by the Label Editor and is crucial for aligning annotations to the correct frame when using Strict client-only access.

Example JSON including video metadata:

{
  "videos": [
    {
      "objectUrl": "video_file.mp4",
      "videoMetadata": {
        "fps": 23.98,
        "duration": 29.09,
        "width": 1280,
        "height": 720,
        "file_size": 5468354,
        "mime_type": "video/mp4"
      }
    }
  ]
}

  • fps: Frames per second.
  • duration: Duration of the video (in seconds).
  • width / height: Dimensions of the video (in pixels).
  • file_size: The size of the file (in bytes).
  • mime_type: The media type of the file according to the MIME standard (for example, video/mp4).

When videos are supplied with video metadata, Encord assumes the metadata to be correct and our servers will neither download nor pre-process your data. This may be a particularly useful feature for customers with strict data compliance concerns.

One way to find the necessary metadata is to run the following commands in your terminal.

  • ffmpeg -i 'video_title.mp4' to retrieve fps, duration, width, and height.
  • ls -l 'video_title.mp4' to retrieve the file size.
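The same values can also be collected programmatically. The sketch below swaps in ffprobe (shipped with FFmpeg) because its JSON output is easier to parse than the text printed by ffmpeg -i. The helper names are hypothetical, the MIME type is guessed from the file extension, and the frame rate is taken from the stream's average frame rate, so treat the result as a starting point and verify it against your files.

```python
import json
import mimetypes
import subprocess
from fractions import Fraction


def parse_ffprobe_output(probe, path):
    """Convert parsed `ffprobe -of json` output into a videoMetadata dict."""
    stream = probe["streams"][0]
    fmt = probe["format"]
    mime_type, _ = mimetypes.guess_type(path)  # guessed from the extension only
    return {
        # ffprobe reports frame rates as fractions such as "24000/1001".
        "fps": float(Fraction(stream["avg_frame_rate"])),
        "duration": float(fmt["duration"]),
        "width": int(stream["width"]),
        "height": int(stream["height"]),
        "file_size": int(fmt["size"]),
        "mime_type": mime_type,
    }


def probe_video(path):
    """Run ffprobe on a local file (requires ffprobe on the PATH)."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=width,height,avg_frame_rate:format=duration,size",
         "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    return parse_ffprobe_output(json.loads(result.stdout), path)


# Parsing a hand-written sample of ffprobe output, so no video file is needed here.
sample = {
    "streams": [{"width": 1280, "height": 720, "avg_frame_rate": "24000/1001"}],
    "format": {"duration": "29.09", "size": "5468354"},
}
print(parse_ffprobe_output(sample, "video_file.mp4"))
```

Calling probe_video('video_title.mp4') on a local copy of the file returns a dict that can be placed under the videoMetadata key.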

Single images

The JSON structure for single images parallels that of videos.

  • The title field is optional.
  • If not specified, the file name of the image will be used.
  • If skip_duplicate_urls is set to true, images that have been previously uploaded to the dataset with the same object URL will be skipped.

| Key or Flag | Required? | Default value |
|---|---|---|
| "objectUrl" | Yes | |
| "title" | No | <file title> |
| "clientMetadata" | No | |
| "skip_duplicate_urls" | No | false |
| "createVideo" | No | false |

ℹ️

Note

Keys / Flags that aren't required can be omitted from the JSON file entirely.

{
  "images": [
    {
      "objectUrl": "<object url>"
    },
    {
      "objectUrl": "<object url>",
      "title": "my-custom-image-title.jpeg",
      "clientMetadata": {"optional": "metadata"}
    }
  ]
}

Image groups

  • Image groups are collections of images that are processed as one annotation task.
  • Images within image groups remain unaltered, meaning that images of different sizes and resolutions can form an image group without the loss of data.
  • Image groups do not require 'write' permissions to your cloud storage.
  • Custom client metadata is defined per image group, not per image.
  • If skip_duplicate_urls is set to true, all URLs exactly matching existing image groups in the dataset will be skipped.

| Key or Flag | Required? | Default value |
|---|---|---|
| "objectUrl" | Yes | |
| "title" | Yes | <file title> |
| "clientMetadata" | No | |
| "skip_duplicate_urls" | No | false |
| "createVideo" | Yes | true (change this to false for image groups) |

ℹ️

Note

The position of each image within the sequence needs to be specified in the key - e.g. objectUrl_{position_number} as seen in the example below.

ℹ️

Note

Keys / Flags that aren't required can be omitted from the JSON file entirely.

ℹ️

Note

Set the "createVideo" flag to false for image groups.

{
  "image_groups": [
    {
      "title": "<title 1>",
      "createVideo": false,
      "objectUrl_0": "<object url>"
    },
    {
      "title": "<title 2>",
      "createVideo": false,
      "objectUrl_0": "<object url>",
      "objectUrl_1": "<object url>",
      "objectUrl_2": "<object url>",
      "clientMetadata": {"optional": "metadata"}
    }
  ]
}

Image sequences

  • Image sequences are collections of images that are processed as one annotation task and represented as a video.
  • Images within image sequences may be altered, as images of varying sizes and resolutions are made to match that of the first image in the sequence.
  • Creating Image sequences from cloud storage requires 'write' permissions, as new files have to be created in order to be read as a video.
  • Each object in the image_groups array with the createVideo flag set to true represents a single image sequence.
  • Custom client metadata is defined per image sequence, not per image.
  • If skip_duplicate_urls is set to true, all URLs exactly matching existing image sequences in the dataset will be skipped.

👍

Tip

The only difference between adding image groups and image sequences via a JSON is that image sequences require the createVideo flag to be set to true. Both use the key image_groups.

| Key or Flag | Required? | Default value |
|---|---|---|
| "objectUrl" | Yes | |
| "title" | Yes | <file title> |
| "clientMetadata" | No | |
| "skip_duplicate_urls" | No | false |
| "createVideo" | Yes | true |

ℹ️

Note

The position of each image within the sequence needs to be specified in the key, for example objectUrl_{position_number}. See the example below.

ℹ️

Note

Keys / Flags that aren't required can be omitted from the JSON file entirely.

{
  "image_groups": [
    {
      "title": "<title 1>",
      "createVideo": true,
      "objectUrl_0": "<object url>"
    },
    {
      "title": "<title 2>",
      "createVideo": true,
      "objectUrl_0": "<object url>",
      "objectUrl_1": "<object url>",
      "objectUrl_2": "<object url>",
      "clientMetadata": {"optional": "metadata"}
    }
  ]
}
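Because image groups and image sequences differ only in the createVideo flag, one helper can produce entries for both. This is a minimal sketch: build_image_group is a hypothetical name and the example.com URLs are placeholders.

```python
import json


def build_image_group(title, object_urls, create_video, client_metadata=None):
    """Build one entry for the "image_groups" array.

    create_video=False gives an image group, create_video=True an image sequence.
    The position of each URL in object_urls becomes its objectUrl_{position} key.
    """
    if not object_urls:
        raise ValueError("at least one object URL is required")
    entry = {"title": title, "createVideo": create_video}
    for position, url in enumerate(object_urls):
        entry[f"objectUrl_{position}"] = url
    if client_metadata is not None:
        entry["clientMetadata"] = client_metadata
    return entry


spec = {
    "image_groups": [
        build_image_group("group-1", ["https://example.com/a.jpg"], create_video=False),
        build_image_group(
            "sequence-1",
            ["https://example.com/b.jpg", "https://example.com/c.jpg"],
            create_video=True,
            client_metadata={"optional": "metadata"},
        ),
    ]
}
print(json.dumps(spec, indent=2))
```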

DICOM

  • Each dicom_series element can contain one or more DICOM series.
  • Each series requires a title and at least one object URL, as shown in the example below.
  • If skip_duplicate_urls is set to true, all object URLs exactly matching existing DICOM files in the dataset will be skipped.

| Key or Flag | Required? | Default value |
|---|---|---|
| "objectUrl" | Yes | |
| "title" | Yes | <file title> |
| "clientMetadata" | No | |
| "skip_duplicate_urls" | No | false |
| "createVideo" | Yes | false |

ℹ️

Note

Keys / Flags that aren't required, such as clientMetadata, can be omitted from the JSON file entirely. clientMetadata is distinct from patient metadata, which is included in the .dcm file and does not have to be specified during the upload to Encord.

The following is an example JSON for uploading three DICOM series belonging to a study. Each title and object URL correspond to individual DICOM series.

  • The first series contains only a single object URL, as it is composed of a single file.
  • The second series contains 3 object URLs, as it is composed of three separate files.
  • The third series contains 2 object URLs, as it is composed of two separate files.

{
  "dicom_series": [
    {
      "title": "<series-1>",
      "objectUrl_0": "https://my-bucket/.../study1-series1-file.dcm"
    },
    {
      "title": "<series-2>",
      "objectUrl_0": "https://my-bucket/.../study1-series2-file1.dcm",
      "objectUrl_1": "https://my-bucket/.../study1-series2-file2.dcm",
      "objectUrl_2": "https://my-bucket/.../study1-series2-file3.dcm"
    },
    {
      "title": "<series-3>",
      "objectUrl_0": "https://my-bucket/.../study1-series3-file1.dcm",
      "objectUrl_1": "https://my-bucket/.../study1-series3-file2.dcm"
    }
  ]
}


Multiple file types

You can upload multiple file types using a single JSON file. The example below shows 1 image, 2 videos, 2 image sequences, and 1 image group.

ℹ️

Note

Keys / Flags that aren't required can be omitted from the JSON file entirely.


{
  "images": [
    {
      "objectUrl": "https://cord-dev.s3.eu-west-2.amazonaws.com/Image1.png"
    }
  ],
  "videos": [
    {
      "objectUrl": "https://cord-dev.s3.eu-west-2.amazonaws.com/Cooking.mp4"
    },
    {
      "objectUrl": "https://cord-dev.s3.eu-west-2.amazonaws.com/Oranges.mp4"
    }
  ],
  "image_groups": [
    {
      "title": "apple-samsung-light",
      "createVideo": true,
      "objectUrl_0": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/1-Samsung-S4-Light+Environment/1+(32).jpg",
      "objectUrl_1": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/1-Samsung-S4-Light+Environment/1+(33).jpg",
      "objectUrl_2": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/1-Samsung-S4-Light+Environment/1+(34).jpg",
      "objectUrl_3": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/1-Samsung-S4-Light+Environment/1+(35).jpg"
    },
    {
      "title": "apple-samsung-dark",
      "createVideo": true,
      "objectUrl_0": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/2-samsung-S4-Dark+Environment/2+(32).jpg",
      "objectUrl_1": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/2-samsung-S4-Dark+Environment/2+(33).jpg",
      "objectUrl_2": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/2-samsung-S4-Dark+Environment/2+(34).jpg",
      "objectUrl_3": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/2-samsung-S4-Dark+Environment/2+(35).jpg"
    },
    {
      "title": "apple-ios-light",
      "createVideo": false,
      "objectUrl_0": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/3-IOS-4-Light+Environment/3+(32).jpg",
      "objectUrl_1": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/3-IOS-4-Light+Environment/3+(33).jpg"
    }
  ]
}
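When combining data types, keep in mind that image groups and image sequences share the image_groups key, so both kinds of entry belong in a single array. A JSON object that repeats a key is easy to write by accident, and many parsers (Python's json module included) silently keep only the last value. The sketch below uses the standard object_pairs_hook parameter to reject repeated keys when loading a file; the helper names are hypothetical.

```python
import json


def reject_duplicate_keys(pairs):
    """object_pairs_hook that raises if any key is repeated within one JSON object."""
    keys = [key for key, _ in pairs]
    duplicates = sorted({key for key in keys if keys.count(key) > 1})
    if duplicates:
        raise ValueError(f"duplicate keys in JSON object: {duplicates}")
    return dict(pairs)


def load_spec(text):
    """Parse an upload specification, failing loudly on repeated keys."""
    return json.loads(text, object_pairs_hook=reject_duplicate_keys)


repeated = '{"image_groups": [{"title": "a"}], "image_groups": [{"title": "b"}]}'

# The default parser keeps only the last "image_groups" array.
print(json.loads(repeated))

try:
    load_spec(repeated)
except ValueError as error:
    print(error)
```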


Client metadata & skip duplicate URLs

You can optionally add custom client metadata per data item in the clientMetadata field, as shown in the examples above. Client metadata is separate from video metadata, and is intended as an arbitrary store of data you would like to associate with a particular file.

We enforce a 10MB limit on the client metadata per data item. This metadata is stored internally as a PostgreSQL jsonb type, so please read the relevant PostgreSQL docs about the jsonb type and its behaviors. For example, the jsonb type does not preserve key order or duplicate keys.

Add the "skip_duplicate_urls": true flag at the top level to make uploads idempotent. Skipping URLs that are already in the dataset can help speed up large upload operations: since previously processed assets don't have to be uploaded again, you can simply retry a failed operation without editing the upload specification file. The flag's default value is false.

ℹ️

Note

These features are currently only supported for JSON uploads.
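To catch oversized client metadata before uploading, you can estimate its size locally. This is only a rough sketch: how Encord measures the 10MB limit is not stated here, so the code assumes the UTF-8 size of the compact JSON serialization and uses the conservative reading of 10,000,000 bytes. Both the constant and the function names are hypothetical.

```python
import json

# Assumption: "10MB" is read conservatively as 10,000,000 bytes.
MAX_CLIENT_METADATA_BYTES = 10 * 1000 * 1000


def client_metadata_size(metadata):
    """Size in bytes of the compact, UTF-8 encoded JSON form of the metadata."""
    return len(json.dumps(metadata, separators=(",", ":")).encode("utf-8"))


def oversized_items(items):
    """Return the indices of items whose clientMetadata exceeds the assumed limit."""
    return [
        index
        for index, item in enumerate(items)
        if client_metadata_size(item.get("clientMetadata", {})) > MAX_CLIENT_METADATA_BYTES
    ]


items = [
    {"objectUrl": "https://example.com/a.mp4", "clientMetadata": {"optional": "metadata"}},
    {"objectUrl": "https://example.com/b.mp4"},
]
print(client_metadata_size(items[0]["clientMetadata"]))
print(oversized_items(items))
```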

When using a Multi-Region Access Point

When using a Multi-Region Access Point for your AWS S3 buckets, the JSON file differs slightly from the examples above. Instead of an object's URL, objects are specified using the ARN of the Multi-Region Access Point followed by the object name. The example below shows how video files from a Multi-Region Access Point would be specified.

{
  "videos": [
    {
      "objectUrl": "Multi-Region-Access-Point-ARN + <object name_1>"
    },
    {
      "objectUrl": "Multi-Region-Access-Point-ARN + <object name_2>",
      "title": "my-custom-video-title.mp4",
      "clientMetadata": {"optional": "metadata"}
    }
  ],
  "skip_duplicate_urls": true
}

CSV format

In the CSV file format, the column headers specify which type of data is being uploaded. You can add a single data type at a time, or combine multiple data types in a single CSV file.

Details for each data format are given in the sections below.

🚧

Caution

  • Object URLs can't contain whitespace.
  • For backwards compatibility reasons, a single column CSV is supported. A file with the single ObjectUrl column is interpreted as a request for video upload. If your objects are of a different type (for example, images), this error displays: "Expected a video, got a file of type XXX".

Videos

A CSV file containing videos MUST contain two columns with the following mandatory headings: 'ObjectURL' and 'Video title'. All headings are case-insensitive.

  • The 'ObjectURL' column contains the objectUrl. This field is mandatory for each file, as it specifies the full URL of the video resource.

  • The 'Video title' column contains the video_title. If left blank, the original file name is used.

In the example below, files 1, 2, and 4 are assigned the names in the title column, while file 3 keeps its original file name.

| ObjectUrl | Video title |
|---|---|
| https://storage/frame1.mp4 | Video 1 |
| https://storage/frame2.mp4 | Video 2 |
| https://storage/frame3.mp4 | |
| https://storage/frame4.mp4 | Video 3 |
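A file with these two columns can be written with Python's csv module, which takes care of quoting titles that contain commas. This is a minimal sketch with a hypothetical helper name; the rows mirror the example above.

```python
import csv
import io


def write_videos_csv(rows, fileobj):
    """Write (object_url, title) pairs; a title of None leaves the cell blank."""
    writer = csv.writer(fileobj)
    writer.writerow(["ObjectUrl", "Video title"])
    for object_url, title in rows:
        writer.writerow([object_url, title or ""])


rows = [
    ("https://storage/frame1.mp4", "Video 1"),
    ("https://storage/frame2.mp4", "Video 2"),
    ("https://storage/frame3.mp4", None),  # keeps its original file name
    ("https://storage/frame4.mp4", "Video 3"),
]

buffer = io.StringIO()
write_videos_csv(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, open the file with open('videos.csv', 'w', newline='') and pass it in place of the buffer.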
Single images

A CSV file containing single images MUST contain two columns with the following mandatory headings: 'ObjectURL' and 'Image title'. All headings are case-insensitive.

  • The 'ObjectURL' column contains the objectUrl. This field is mandatory for each file, as it specifies the full URL of the image resource.

  • The 'Image title' column contains the image_title. If left blank, the original file name is used.

In the example below, files 1, 2, and 4 are assigned the names in the title column, while file 3 keeps its original file name.

| ObjectUrl | Image title |
|---|---|
| https://storage/frame1.jpg | Image 1 |
| https://storage/frame2.jpg | Image 2 |
| https://storage/frame3.jpg | |
| https://storage/frame4.jpg | Image 3 |

Image groups

A CSV file containing image groups MUST contain three columns with the following mandatory headings: 'ObjectURL', 'Image group title', and 'Create video'. All three headings are case-insensitive.

  • The 'ObjectURL' column contains the objectUrl. This field is mandatory for each file, as it specifies the full URL of the resource.

  • The 'Image group title' column contains the image_group_title. This field is mandatory, as it determines which image group a file will be assigned to.

  • The 'Create video' column. This must be set to 'false' for image groups, as the default value is 'true'.

In the example below, the first two URLs are grouped together into 'Group 1', while the following two files are grouped together into 'Group 2'.

| ObjectUrl | Image group title | Create video |
|---|---|---|
| https://storage/frame1.jpg | Group 1 | false |
| https://storage/frame2.jpg | Group 1 | false |
| https://storage/frame3.jpg | Group 2 | false |
| https://storage/frame4.jpg | Group 2 | false |

ℹ️

Note

Image groups do not require 'write' permissions.

Image sequences

A CSV file containing image sequences MUST contain three columns with the following mandatory headings: 'ObjectURL', 'Image group title', and 'Create video'. All three headings are case-insensitive.

  • The 'ObjectURL' column contains the objectUrl. This field is mandatory for each file, as it specifies the full URL of the resource.

  • The 'Image group title' column contains the image_group_title. This field is mandatory, as it determines which image sequence a file will be assigned to. The dimensions of the image sequence are determined by the first file in the sequence.

  • The 'Create video' column. This can be left blank, as the default value is 'true'.

In the example below, the first two URLs are grouped together into 'Sequence 1', while the following two files are grouped together into 'Sequence 2'.

| ObjectUrl | Image group title | Create video |
|---|---|---|
| https://storage/frame1.jpg | Sequence 1 | true |
| https://storage/frame2.jpg | Sequence 1 | true |
| https://storage/frame3.jpg | Sequence 2 | true |
| https://storage/frame4.jpg | Sequence 2 | true |

👍

Tip

Image groups and image sequences are only distinguished by the value in the 'Create video' column: 'false' for image groups, and 'true' (or blank) for image sequences.

ℹ️

Note

Image sequences require 'write' permissions against your storage bucket to save the compressed video.

DICOM

A CSV file containing DICOM files MUST contain two columns with the following headings: 'ObjectURL' and 'Series title'. Both headings are case-insensitive.

  • The 'ObjectURL' column contains the objectUrl. This field is mandatory for each file, as it specifies the full URL of the resource.

  • The 'Series title' column contains the dicom_title. When two files are given the same title they are grouped into the same DICOM series. If left blank, the original file name is used.

In the example below the first two files are grouped into 'dicom series 1', the next two files are grouped into 'dicom series 2', while the final file will remain separated as 'dicom series 3'.

| ObjectUrl | Series title |
|---|---|
| https://storage/frame1.dcm | dicom series 1 |
| https://storage/frame2.dcm | dicom series 1 |
| https://storage/frame3.dcm | dicom series 2 |
| https://storage/frame4.dcm | dicom series 2 |
| https://storage/frame5.dcm | dicom series 3 |

Multiple file types

You can upload multiple file types with a single CSV file by using a new header each time there is a change of file type. Three headings will be required if image sequences are included.

🚧

Caution

Since the 'Create video' column defaults to "true", all files that aren't image sequences must contain the value "false".

The example below shows a CSV file for the following:

  • Two image sequences composed of 2 files each.
  • One image group composed of 2 files.
  • One single image.
  • One video.

| ObjectUrl | Image group title | Create video |
|---|---|---|
| https://storage/frame1.jpg | Sequence 1 | true |
| https://storage/frame2.jpg | Sequence 1 | true |
| https://storage/frame3.jpg | Sequence 2 | true |
| https://storage/frame4.jpg | Sequence 2 | true |
| https://storage/frame5.jpg | Group 1 | false |
| https://storage/frame6.jpg | Group 1 | false |
| ObjectUrl | Image title | Create video |
| https://storage/frame1.jpg | Image 1 | false |
| ObjectUrl | Video title | Create video |
| https://storage/video.mp4 | Video 1 | false |
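A mixed CSV file can be generated by writing one header row per file type, followed by that type's rows. The following is a minimal sketch with a hypothetical helper name. It assumes the video section takes the 'Video title' heading described in the Videos section, and it writes 'Create video' explicitly on every row because of the default noted in the caution above.

```python
import csv
import io


def write_sections(sections, fileobj):
    """Write (header, rows) pairs; each new header marks a change of file type."""
    writer = csv.writer(fileobj)
    for header, rows in sections:
        writer.writerow(header)
        writer.writerows(rows)


sections = [
    (
        ["ObjectUrl", "Image group title", "Create video"],
        [
            ["https://storage/frame1.jpg", "Sequence 1", "true"],
            ["https://storage/frame2.jpg", "Sequence 1", "true"],
            ["https://storage/frame5.jpg", "Group 1", "false"],
            ["https://storage/frame6.jpg", "Group 1", "false"],
        ],
    ),
    (
        ["ObjectUrl", "Image title", "Create video"],
        [["https://storage/frame1.jpg", "Image 1", "false"]],
    ),
    (
        ["ObjectUrl", "Video title", "Create video"],
        [["https://storage/video.mp4", "Video 1", "false"]],
    ),
]

buffer = io.StringIO()
write_sections(sections, buffer)
print(buffer.getvalue())
```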

Helpful Scripts and Examples

Use the following examples and helpful scripts to quickly learn how to create JSON and CSV files formatted for the dataset creation process, by constructing the URLs from the specified path in your private storage.

AWS S3

AWS S3 object URLs can follow a few set patterns:

  • Virtual-hosted style: https://<bucket-name>.s3.<region>.amazonaws.com/<key-name>
  • Path-style: https://s3.<region>.amazonaws.com/<bucket-name>/<key-name>
  • S3 protocol: S3://<bucket-name>/<key-name>
  • Legacy: those without regions or those with S3-<region> in the URL

AWS best practice is to use Virtual-hosted style. Path-style is planned to be deprecated and the legacy URLs are already deprecated.

We support Virtual-hosted style, Path-style and S3 protocol object URLs. We recommend you use Virtual-hosted style object URLs wherever possible.

Object URLs can be found in the Properties tab of the object in question. Navigate to AWS S3 > bucket > object > Properties to find the Object URL.
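The three supported URL styles can be assembled from a bucket name, region, and key. This is a small sketch with a hypothetical function name that follows the patterns listed above as written. It does not percent-encode special characters, so keys containing whitespace are rejected rather than guessed at.

```python
def s3_object_urls(bucket_name, region, key_name):
    """Return the supported S3 object URL styles for one object."""
    if any(ch.isspace() for ch in key_name):
        raise ValueError("key contains whitespace; copy the encoded Object URL from the S3 console instead")
    return {
        "virtual_hosted": f"https://{bucket_name}.s3.{region}.amazonaws.com/{key_name}",
        "path_style": f"https://s3.{region}.amazonaws.com/{bucket_name}/{key_name}",
        "s3_protocol": f"S3://{bucket_name}/{key_name}",
    }


urls = s3_object_urls("my-bucket", "eu-west-2", "video_files/my_video.mp4")
for style, url in urls.items():
    print(f"{style}: {url}")
```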

Here's a Python script which creates a JSON file for single images by constructing the URLs from the specified path in a given S3 bucket. You'll need to configure the following variables to match your setup.

  1. region: needs to be the AWS resource region you intend to use. For S3, it's the region where your bucket is.
  2. aws_profile: is the name of the profile in the AWS ~/.aws/credentials file. See AWS Credentials Documentation to set up the credentials file properly.
  3. bucket_name: the name of your S3 bucket you want to pull files from.
  4. s3_directory: the path to the directory where your files are stored inside the S3 bucket. Include all slashes except the trailing slash. For example:
# my file is at my-bucket/some_top_level_dir/video_files/my_video.mp4
# then set s3 directory as follows
s3_directory = 'some_top_level_dir/video_files'

And the script itself:

import json

import boto3

region = 'FILL_ME_IN'
aws_profile = 'FILL_ME_IN'
bucket_name = 'FILL_ME_IN'
s3_directory = 'FILL_ME_IN'

domain = f's3.{region}.amazonaws.com'
root_url = f'https://{domain}/{bucket_name}'

# Create the S3 resource from the session so that the named profile is actually used.
session = boto3.Session(profile_name=aws_profile, region_name=region)
bucket = session.resource('s3').Bucket(bucket_name)

images = []
# List only the keys under s3_directory instead of scanning the whole bucket.
for object_summary in bucket.objects.filter(Prefix=f'{s3_directory}/'):
    directory, _, file_name = object_summary.key.rpartition('/')

    # Keep files directly inside s3_directory; skip sub-directories and
    # the empty "folder" placeholder key that ends in a slash.
    if directory == s3_directory and file_name:
        object_url = f'{root_url}/{object_summary.key}'
        images.append({'objectUrl': object_url})

outer_json_dict = {
    "images": images
}

# Slashes in s3_directory would otherwise be treated as path separators.
output_name = f"upload_images_{s3_directory.replace('/', '_')}.json"
with open(output_name, 'w') as output_file:
    json.dump(outer_json_dict, output_file, indent=4)

Azure blob

{
    "videos": [
        {
            "objectUrl": "https://myaccount.blob.core.windows.net/myblob"
        },
        {
            "objectUrl": "https://myaccount.blob.core.windows.net/mycontainer/myblob.jpg"
        },
        {
            "objectUrl": "https://myaccount.blob.core.windows.net/mycontainer/myblobs/myblob.jpg"
        }
    ],
    "image_groups": [
      {
        "title": "image_group_1",
        "objectUrl_0": "https://myaccount.blob.core.windows.net/mycontainer/myblob1.jpg",
        "objectUrl_1": "https://myaccount.blob.core.windows.net/mycontainer/myblob2.jpg"
      },
      {
        "title": "image_group2",
        "objectUrl_0": "https://myaccount.blob.core.windows.net/mycontainer/myblob3.jpg",
        "objectUrl_1": "https://myaccount.blob.core.windows.net/mycontainer/myblob4.jpg"
      }
    ]
}

GCP storage

{
    "videos": [
        {
            "objectUrl": "gs://example-url/object.mp4"
        }
    ],
    "image_groups": [
      {
        "title": "image_group_1",
        "objectUrl_0": "https://storage.cloud.google.com/example-image-bucket/object_1.jpg",
        "objectUrl_1": "https://storage.cloud.google.com/example-image-bucket/object_2.jpg"
      },
      {
        "title": "image_group_2",
        "objectUrl_0": "https://storage.cloud.google.com/example-image-bucket/object_3.jpg",
        "objectUrl_1": "https://storage.cloud.google.com/example-image-bucket/object_4.jpg"
      }
    ]
}

Open Telekom Cloud OSS

{
  "dicom_series": [
    {
      "title": "OPEN_TELEKOM_DICOM_SERIES",
      "objectUrl_0": "https://bucket-name.obs.eu-de.otc.t-systems.com/dicom-file-0",
      "objectUrl_1": "https://bucket-name.obs.eu-de.otc.t-systems.com/dicom-file-1",
      "objectUrl_2": "https://bucket-name.obs.eu-de.otc.t-systems.com/dicom-file-2",
      "objectUrl_3": "https://bucket-name.obs.eu-de.otc.t-systems.com/dicom-file-3"
    }
  ]
}