Private cloud integration
Before adding your cloud data to a Dataset, you need to integrate your cloud storage with Encord.
Please see the Data integrations section to learn how to create integrations for:
To add your cloud data to a Dataset:
- Turn on the Import from integration toggle in the Create dataset part of the data creation flow when creating a new Dataset.
- Select the relevant integration using the Select integration drop-down.
![](https://storage.googleapis.com/docs-media.encord.com/static/img/datasets/private-cloud-integration-toggle.png)
- Upload an appropriately formatted JSON or CSV file specifying the data you would like to add to the Dataset. Your stored objects may contain files that are not supported by Encord, which may produce errors on upload. Turn on the Ignore individual file errors toggle to ignore these.
Tip
We recommend turning on the Ignore individual file errors feature. This ensures that individual file errors do not lead to the whole upload process being aborted.
Tip
We recommend setting the expiration time for pre-signed URLs, in your cloud storage settings, to be greater than the time it takes to complete an annotation task. More information can be found in the documentation of your cloud service provider:
- Click Add data to add data.
![](https://storage.googleapis.com/docs-media.encord.com/static/img/datasets/private-cloud-integration-add-data.png)
Note
The data will be fetched from your cloud storage and processed asynchronously. This involves fetching appropriate metadata and other file information to help us render the files appropriately and to check for any framerate inconsistencies. We do not store your files in any way.
Checking upload status
You can check the progress of the processing job by clicking in the top right.
A spinning progress indicator shows that the processing job is still in progress.
- If successful, the processing completes with a success icon.
- If unsuccessful, a failure icon is shown. Ensure that your provider permissions have been set correctly, that the object data format is supported, and that the JSON or CSV file is correctly formatted.
![](https://storage.googleapis.com/docs-media.encord.com/static/img/datasets/failed-data-upload.png)
Check which files failed to upload by clicking the icon to download a CSV log file. Every row in the CSV will correspond to a file which failed to be uploaded.
Note
You will only see failed uploads if the Ignore individual file errors toggle was not enabled when uploading your data.
Creating a Dataset using cloud data
To create a Dataset using data from your private cloud, you will need to upload either a JSON or CSV file, specifying the URLs of all the files you'd like to add.
Tip
We recommend uploading files in batches not exceeding 2GB, to ensure upload does not exceed 3 hours.
JSON format
The JSON file format is a JSON object with top-level keys specifying the type of data and object URLs of the content you wish to add to the dataset. Object URLs must not contain any whitespace. You can add one data type at a time, or combine multiple data types in one JSON file according to your preferences or development flows. The supported top-level keys are: `videos`, `image_groups`, `images`, and `dicom_series`. The details for each data format are given in the sections below.
CRITICAL INFORMATION
Encord supports up to 10,000 entries in the JSON file when uploading data to Encord.
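The entry limit and the no-whitespace rule can be checked before a file is uploaded. The following is a minimal sketch, assuming that each item in a data type's array counts as one entry; the function name `split_into_specs` and the example URLs are hypothetical and not part of Encord's tooling.

```python
MAX_ENTRIES = 10_000  # per-file limit stated above


def split_into_specs(urls, data_type="videos", max_entries=MAX_ENTRIES):
    """Split a list of object URLs into upload specs of at most max_entries each."""
    # Object URLs must not contain whitespace, so reject them early.
    bad = [url for url in urls if any(ch.isspace() for ch in url)]
    if bad:
        raise ValueError(f"Object URLs contain whitespace: {bad[:3]}")
    return [
        {data_type: [{"objectUrl": url} for url in urls[start:start + max_entries]]}
        for start in range(0, len(urls), max_entries)
    ]


# Hypothetical URLs, for illustration only.
example_urls = [f"https://my-bucket.example.com/video_{n}.mp4" for n in range(25_000)]
specs = split_into_specs(example_urls)
print([len(spec["videos"]) for spec in specs])  # [10000, 10000, 5000]
```

Each returned dictionary can then be written to its own file with `json.dump` and uploaded separately.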
Videos
Each object in the `videos` array is a JSON object with the key `objectUrl` specifying the full URL of where to find the video resource. The `title` field is optional. If not specified, the video's file name will be used.
- Video metadata (separate from client metadata) may be specified for videos. See the Video metadata section below.
- If `skip_duplicate_urls` is set to `true`, all object URLs that exactly match existing videos in the Dataset will be skipped.
Key or Flag | Required? | Default value |
---|---|---|
"objectUrl" | Yes | |
"title" | No | <file title > |
"clientMetadata" | No | |
"skip_duplicate_urls" | No | false |
"createVideo" | No | false |
Note
Keys / Flags that aren't required can be omitted from the JSON file entirely.
{
"videos": [
{
"objectUrl": "<object url_1>"
},
{
"objectUrl": "<object url_2>",
"title": "my-custom-video-title.mp4",
"clientMetadata": {"optional": "metadata"}
}
],
"skip_duplicate_urls": true
}
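A file like the one above can also be generated rather than written by hand. The following is a minimal sketch; `build_video_spec` is a hypothetical helper, and the URLs are placeholders. It copies only the keys listed in the table, so optional keys that are absent are simply omitted.

```python
import json


def build_video_spec(videos, skip_duplicate_urls=False):
    """Build a 'videos' upload spec from dicts holding objectUrl and optional keys."""
    entries = []
    for video in videos:
        entry = {"objectUrl": video["objectUrl"]}  # the only required key
        for optional_key in ("title", "clientMetadata"):
            if optional_key in video:
                entry[optional_key] = video[optional_key]
        entries.append(entry)
    spec = {"videos": entries}
    if skip_duplicate_urls:
        spec["skip_duplicate_urls"] = True  # top-level flag, not per video
    return spec


# Placeholder URLs, for illustration only.
spec = build_video_spec(
    [
        {"objectUrl": "https://my-bucket.example.com/a.mp4"},
        {
            "objectUrl": "https://my-bucket.example.com/b.mp4",
            "title": "my-custom-video-title.mp4",
            "clientMetadata": {"optional": "metadata"},
        },
    ],
    skip_duplicate_urls=True,
)
print(json.dumps(spec, indent=2))
```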
Video metadata
The JSON format allows you to specify `videoMetadata` for video files. `videoMetadata` is essential information used by the Label Editor and is crucial for aligning annotations to the correct frame.
CRITICAL INFORMATION
When the `videoMetadata` flag is present in the JSON file, we directly use the supplied metadata without performing any additional validation, and do not store the file on our servers. To guarantee accurate labels, it is crucial that the metadata you provide is accurate.
Note
`videoMetadata` must be specified when a Strict client-only access integration is used. In all other cases, `videoMetadata` is optional.
Example JSON including video metadata
{
"videos": [
{
"objectUrl": "video_file.mp4",
"videoMetadata": {
"fps": 23.98,
"duration": 29.09,
"width": 1280,
"height": 720,
"file_size": 5468354,
"mime_type": "video/mp4"
}
}
]
}
- fps: Frames per second.
- duration: Duration of the video (in seconds).
- width / height: Dimensions of the video (in pixels).
- file_size: The size of the file (in bytes).
- mime_type: Specifies the file type extension according to the MIME standard.
When videos are supplied with video metadata, Encord assumes the metadata to be correct and our servers will neither download nor pre-process your data. This may be a particularly useful feature for customers with strict data compliance concerns.
One way to find the necessary metadata is shown below. Run the following commands in your terminal.
ffmpeg -i 'video_title.mp4'
to retrieve fps, duration, width, and height - as highlighted below.
![](https://storage.googleapis.com/docs-media.encord.com/static/img/admins/settings/video-metadata-1.png)
ls -l 'video_title.mp4'
to retrieve the file size - as highlighted below.
![](https://storage.googleapis.com/docs-media.encord.com/static/img/admins/settings/video-metadata-2.png)
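The same values can be collected programmatically. The following is a rough sketch that uses `ffprobe` (installed alongside ffmpeg) instead of reading the `ffmpeg -i` output by eye. The helper names are hypothetical, and using the stream's `r_frame_rate` as the `fps` value is an assumption; check the result against your own files before relying on it.

```python
import json
import mimetypes
import subprocess
from fractions import Fraction


def probe(path):
    """Run ffprobe on a local file and return its JSON report as a dict."""
    cmd = [
        "ffprobe", "-v", "error", "-select_streams", "v:0",
        "-show_entries", "stream=width,height,r_frame_rate:format=duration,size",
        "-of", "json", path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def to_video_metadata(probe_output, path):
    """Map an ffprobe report onto the videoMetadata keys shown above."""
    stream = probe_output["streams"][0]
    fmt = probe_output["format"]
    mime_type, _ = mimetypes.guess_type(path)  # guessed from the file extension
    return {
        "fps": float(Fraction(stream["r_frame_rate"])),  # e.g. "24000/1001"
        "duration": float(fmt["duration"]),
        "width": stream["width"],
        "height": stream["height"],
        "file_size": int(fmt["size"]),
        "mime_type": mime_type,
    }


# Usage, assuming ffprobe is on PATH and the file exists locally:
# metadata = to_video_metadata(probe("video_title.mp4"), "video_title.mp4")
```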
Single images
The JSON structure for single images parallels that of videos.
- The `title` field is optional. If not specified, the file name of the image will be used.
- If `skip_duplicate_urls` is set to `true`, images that have been previously uploaded to the dataset with the same object URL will be skipped.
- Image metadata (separate from client metadata) may be specified for images. See the Image metadata section below.
Key or Flag | Required? | Default value |
---|---|---|
"objectUrl" | Yes | |
"title" | No | <file title > |
"clientMetadata" | No | |
"skip_duplicate_urls" | No | false |
"createVideo" | No | false |
Note
Keys / Flags that are not required can be omitted from the JSON file entirely.
{
"images": [
{
"objectUrl": "<object url>"
},
{
"objectUrl": "<object url>",
"title": "my-custom-image-title.jpeg",
"clientMetadata": {"optional": "metadata"}
}
]
}
Image metadata
The JSON format allows you to specify `imageMetadata` for image files. `imageMetadata` contains essential information used by the Label Editor and is crucial for aligning annotations to the correct image properties.
CRITICAL INFORMATION
When the `imageMetadata` flag is present in the JSON file, we directly use the supplied metadata without performing any additional validation and do not store the file on our servers. To guarantee accurate labels, it is crucial that the metadata you provide is accurate.
Note
`imageMetadata` must be specified when a Strict client-only access integration is used. In all other cases, `imageMetadata` is optional.
Example JSON including image metadata
{
"images": [
{
"objectUrl": "s3://my_image.jpg",
"imageMetadata": {
"mimeType": "image/jpg",
"fileSize": 124,
"width": 640,
"height": 480
}
}
]
}
- `objectUrl`: URL or path to the image file.
- `mimeType`: The MIME type of the image file (e.g., `image/jpg`, `image/png`).
- `fileSize`: The size of the image file in bytes.
- `width`: The width of the image in pixels.
- `height`: The height of the image in pixels.
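For PNG files, these values can be read without any third-party library, because the width and height sit at fixed offsets in the file header. The following is a sketch only: `png_image_metadata` is a hypothetical helper, it handles PNG and no other format, and it reads a local copy of the file rather than the object in cloud storage.

```python
import os
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_image_metadata(path):
    """Build an imageMetadata dict for a local PNG file."""
    with open(path, "rb") as fh:
        header = fh.read(24)
    # A PNG starts with an 8-byte signature followed by the IHDR chunk.
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"{path} does not look like a PNG file")
    # IHDR stores width and height as big-endian 32-bit integers.
    width, height = struct.unpack(">II", header[16:24])
    return {
        "mimeType": "image/png",
        "fileSize": os.path.getsize(path),
        "width": width,
        "height": height,
    }
```

For other formats such as JPEG, an imaging library is usually the more practical route.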
Image groups
- Image groups are collections of images that are processed as one annotation task.
- Images within image groups remain unaltered, meaning that images of different sizes and resolutions can form an image group without the loss of data.
- Image groups do not require 'write' permissions to your cloud storage.
- Custom client metadata is defined per image group, not per image.
- If `skip_duplicate_urls` is set to `true`, all URLs exactly matching existing image groups in the dataset will be skipped.
Key or Flag | Required? | Default value |
---|---|---|
"objectUrl" | Yes | |
"title" | Yes | <file title > |
"clientMetadata" | No | |
"skip_duplicate_urls" | No | false |
"createVideo" | Yes | true (change this to false for image groups) |
Note
The position of each image within the sequence needs to be specified in the key, e.g. `objectUrl_{position_number}`, as seen in the example below.
Note
Keys / Flags that aren't required can be omitted from the JSON file entirely.
Note
Set the "createVideo" flag to false for image groups.
{
"image_groups": [
{
"title": "<title 1>",
"createVideo": false,
"objectUrl_0": "<object url>"
},
{
"title": "<title 2>",
"createVideo": false,
"objectUrl_0": "<object url>",
"objectUrl_1": "<object url>",
"objectUrl_2": "<object url>",
"clientMetadata": {"optional": "metadata"}
}
]
}
Image sequences
- Image sequences are collections of images that are processed as one annotation task and represented as a video.
- Images within image sequences may be altered, as images of varying sizes and resolutions are made to match that of the first image in the sequence.
- Creating Image sequences from cloud storage requires 'write' permissions, as new files have to be created in order to be read as a video.
- Each object in the `image_groups` array with the `createVideo` flag set to `true` represents a single image sequence.
- Custom client metadata is defined per image sequence, not per image.
- If `skip_duplicate_urls` is set to `true`, all URLs exactly matching existing image sequences in the Dataset are skipped.
Tip
The only difference between adding image groups and image sequences via a JSON is that image sequences require the `createVideo` flag to be set to `true`. Both use the key `image_groups`.
Key or Flag | Required? | Default value |
---|---|---|
"objectUrl" | Yes | |
"title" | Yes | <file title > |
"clientMetadata" | No | |
"skip_duplicate_urls" | No | false |
"createVideo" | Yes | true |
Note
The position of each image within the sequence needs to be specified in the key, e.g. `objectUrl_{position_number}`. See the example below.
Note
Keys / Flags that are not required can be omitted from the JSON file entirely.
{
"image_groups": [
{
"title": "<title 1>",
"createVideo": true,
"objectUrl_0": "<object url>"
},
{
"title": "<title 2>",
"createVideo": true,
"objectUrl_0": "<object url>",
"objectUrl_1": "<object url>",
"objectUrl_2": "<object url>",
"clientMetadata": {"optional": "metadata"}
}
]
}
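Image groups and image sequences share the `image_groups` key and differ only in the `createVideo` flag, so one helper can produce entries for both. The following is a minimal sketch; `build_image_group` and the URLs are hypothetical. It numbers the `objectUrl_{position_number}` keys from 0 in list order.

```python
import json


def build_image_group(title, urls, create_video, client_metadata=None):
    """Build one image_groups entry; create_video=True makes it an image sequence."""
    if not urls:
        raise ValueError("at least one object URL is required")
    entry = {"title": title, "createVideo": create_video}
    for position, url in enumerate(urls):
        entry[f"objectUrl_{position}"] = url  # position in the group, starting at 0
    if client_metadata is not None:
        entry["clientMetadata"] = client_metadata
    return entry


# Placeholder URLs, for illustration only.
frames = [f"https://my-bucket.example.com/frame_{n}.jpg" for n in range(3)]
spec = {
    "image_groups": [
        build_image_group("my-sequence", frames, create_video=True),
        build_image_group("my-group", frames[:2], create_video=False),
    ]
}
print(json.dumps(spec, indent=2))
```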
DICOM
Note
Ensure your DICOM files and metadata follow the format outlined in the official DICOM specification.
- Each `dicom_series` element can contain one or more DICOM series.
- Each series requires a title and at least one object URL, as shown in the example below.
- If `skip_duplicate_urls` is set to `true`, all object URLs exactly matching existing DICOM files in the Dataset are skipped.
Key or Flag | Required? | Default value |
---|---|---|
"objectUrl" | Yes | |
"title" | Yes | <file title > |
"clientMetadata" | No | |
"skip_duplicate_urls" | No | false |
"createVideo" | Yes | false |
Note
Keys / Flags that are not required, such as `clientMetadata`, can be omitted from the JSON file entirely. `clientMetadata` is distinct from patient metadata, which is included in the `.dcm` file and does not have to be specified during the upload to Encord.
The following is an example JSON for uploading three DICOM series belonging to a study. Each title and object URL correspond to individual DICOM series.
- The first series contains only a single object URL, as it is composed of a single file.
- The second series contains 3 object URLs, as it is composed of three separate files.
- The third series contains 2 object URLs, as it is composed of two separate files.
{
"dicom_series": [
{
"title": "<series-1>",
"objectUrl_0": "https://my-bucket/.../study1-series1-file.dcm"
},
{
"title": "<series-2>",
"objectUrl_0": "https://my-bucket/.../study1-series2-file1.dcm",
"objectUrl_1": "https://my-bucket/.../study1-series2-file2.dcm",
"objectUrl_2": "https://my-bucket/.../study1-series2-file3.dcm"
},
{
"title": "<series-3>",
"objectUrl_0": "https://my-bucket/.../study1-series3-file1.dcm",
"objectUrl_1": "https://my-bucket/.../study1-series3-file2.dcm"
}
]
}
NIfTI
- Each series requires a title and at least one object URL.
- If `skip_duplicate_urls` is set to `true`, all object URLs exactly matching existing NIfTI files in the Dataset are skipped.
Key or Flag | Required? | Default value |
---|---|---|
"objectUrl" | Yes | |
"title" | Yes | <file title > |
"clientMetadata" | No | |
"skip_duplicate_urls" | No | false |
"createVideo" | Yes | false |
The following is an example JSON file for uploading two NIfTI files to Encord.
{
"nifti_files": [
{
"title": "<file-1>",
"objectUrl_0": "https://my-bucket/.../nifti-file1.nii"
},
{
"title": "<file-2>",
"objectUrl_0": "https://my-bucket/.../nifti-file2.nii.gz"
}
]
}
Multiple file types
You can upload multiple file types using a single JSON file. The example below shows 1 image, 2 videos, 2 image sequences, and 1 image group.
Note
Keys / Flags that are not required can be omitted from the JSON file entirely.
{
"images": [
{
"objectUrl": "https://cord-dev.s3.eu-west-2.amazonaws.com/Image1.png"
}
],
"videos": [
{
"objectUrl": "https://cord-dev.s3.eu-west-2.amazonaws.com/Cooking.mp4"
},
{
"objectUrl": "https://cord-dev.s3.eu-west-2.amazonaws.com/Oranges.mp4"
}
],
"image_groups": [
{
"title": "apple-samsung-light",
"createVideo": true,
"objectUrl_0": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/1-Samsung-S4-Light+Environment/1+(32).jpg",
"objectUrl_1": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/1-Samsung-S4-Light+Environment/1+(33).jpg",
"objectUrl_2": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/1-Samsung-S4-Light+Environment/1+(34).jpg",
"objectUrl_3": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/1-Samsung-S4-Light+Environment/1+(35).jpg"
},
{
"title": "apple-samsung-dark",
"createVideo": true,
"objectUrl_0": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/2-samsung-S4-Dark+Environment/2+(32).jpg",
"objectUrl_1": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/2-samsung-S4-Dark+Environment/2+(33).jpg",
"objectUrl_2": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/2-samsung-S4-Dark+Environment/2+(34).jpg",
"objectUrl_3": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/2-samsung-S4-Dark+Environment/2+(35).jpg"
},
{
"title": "apple-ios-light",
"createVideo": false,
"objectUrl_0": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/3-IOS-4-Light+Environment/3+(32).jpg",
"objectUrl_1": "https://cord-dev.s3.eu-west-2.amazonaws.com/food-dataset/Apple/3-IOS-4-Light+Environment/3+(33).jpg"
}
]
}
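When several data types are combined by hand, it is easy to repeat a top-level key or misspell one. Many JSON parsers, including Python's `json` module, silently keep only the last value of a repeated key, so the earlier entries would be lost. The following is a small pre-upload check, offered as a sketch: the set of accepted keys is assembled from the examples on this page and is an assumption, and `check_spec` is a hypothetical helper.

```python
import json

# Assumption: key names collected from the examples on this page.
SUPPORTED_KEYS = {
    "videos", "images", "image_groups", "dicom_series", "nifti_files",
    "skip_duplicate_urls",
}


def _no_duplicate_keys(pairs):
    """Called for every JSON object; fail if any key appears twice."""
    keys = [key for key, _ in pairs]
    repeated = sorted({key for key in keys if keys.count(key) > 1})
    if repeated:
        raise ValueError(f"duplicate keys in JSON object: {repeated}")
    return dict(pairs)


def check_spec(text):
    """Parse an upload spec and return the number of entries per data type."""
    spec = json.loads(text, object_pairs_hook=_no_duplicate_keys)
    unknown = sorted(set(spec) - SUPPORTED_KEYS)
    if unknown:
        raise ValueError(f"unsupported top-level keys: {unknown}")
    return {key: len(value) for key, value in spec.items() if isinstance(value, list)}
```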
Client metadata & skip duplicate URLs
You can optionally add some custom client metadata per data item in the `clientMetadata` field (the examples on this page show how this is done). Client metadata is separate from video metadata, and is intended as an arbitrary store of data you would like to associate with any particular file.
We enforce a 10MB limit on the client metadata per data item. This metadata is stored internally as a PostgreSQL `jsonb` type. Read the relevant PostgreSQL docs about the `jsonb` type and its behaviors. For example, the `jsonb` type does not preserve key order or duplicate keys.
Add the `"skip_duplicate_urls": true` flag at the top level to make the uploads idempotent. Skipping URLs already in the dataset can help speed up large upload operations. Since previously processed assets don't have to be uploaded again, you can simply retry the failed operation without editing the upload specification file. The flag's default value is `false`.
Note
These features are currently only supported for JSON uploads.
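The size of each item's client metadata can be estimated before upload. The following is a rough sketch under two assumptions that this page does not confirm: that the limit applies to the metadata serialised as UTF-8 JSON, and that "10MB" means 10 × 1024 × 1024 bytes. The helper names are hypothetical.

```python
import json

LIMIT_BYTES = 10 * 1024 * 1024  # assumption: "10MB" read as 10 MiB


def metadata_size(metadata):
    """Size in bytes of the metadata when serialised as compact UTF-8 JSON."""
    return len(json.dumps(metadata, separators=(",", ":")).encode("utf-8"))


def oversized_items(spec, limit=LIMIT_BYTES):
    """Return (data_type, index, size) for items whose clientMetadata exceeds limit."""
    found = []
    for data_type, items in spec.items():
        if not isinstance(items, list):
            continue  # skips top-level flags such as skip_duplicate_urls
        for index, item in enumerate(items):
            size = metadata_size(item.get("clientMetadata", {}))
            if size > limit:
                found.append((data_type, index, size))
    return found
```

An empty result means no item exceeded the limit under these assumptions; the server-side measurement may differ.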
When using a Multi-Region Access Point
When using a Multi-Region Access Point for your AWS S3 buckets, objects are specified using the ARN of the Multi-Region Access Point followed by the object name. The following example shows how video files from a Multi-Region Access Point would be specified.
Tip
We provide a number of scripts to create a JSON file for uploading cloud data here. The AWS example includes a multi-region access point.
{
"videos": [
{
"objectUrl": "Multi-Region-Access-Point-ARN + <object name_1>"
},
{
"objectUrl": "Multi-Region-Access-Point-ARN + <object name_2>",
"title": "my-custom-video-title.mp4",
"clientMetadata": {"optional": "metadata"}
}
],
"skip_duplicate_urls": true
}
CSV format
In the CSV file format, the column headers specify which type of data is being uploaded. You can add a single data type at a time, or combine multiple data types in a single CSV file.
CRITICAL INFORMATION
Encord supports up to 10,000 entries in the CSV file when uploading data to Encord.
Caution
- Object URLs cannot contain whitespace.
- For backwards compatibility reasons, a single column CSV is supported. A file with the single `ObjectUrl` column is interpreted as a request for video upload. If your objects are of a different type (for example, images), this error displays: "Expected a video, got a file of type XXX".
Videos
A CSV file containing videos MUST contain two columns with the following mandatory column headings: 'ObjectURL' and 'Video title'. All headings are case-insensitive.
- The 'ObjectURL' column contains the `objectUrl`. This field is mandatory for each file, as it specifies the full URL of the video resource.
- The 'Video title' column contains the `video_title`. If left blank, the original file name is used.
In the example below files 1, 2 and 4 are assigned the names in the title column, while file 3 keeps its original file name.
ObjectUrl | Video title |
---|---|
https://storage/frame1.mp4 | Video 1 |
https://storage/frame2.mp4 | Video 2 |
https://storage/frame3.mp4 | |
https://storage/frame4.mp4 | Video 3 |
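A table like this can be produced with Python's standard `csv` module. The following is a minimal sketch that assumes a plain comma-separated file; `write_video_csv` is a hypothetical helper, and the rows reuse the placeholder URLs from the table above.

```python
import csv
import io


def write_video_csv(rows, csv_file):
    """Write (object_url, title) pairs; a missing title is left blank."""
    writer = csv.writer(csv_file)
    writer.writerow(["ObjectUrl", "Video title"])
    for object_url, title in rows:
        writer.writerow([object_url, title or ""])


rows = [
    ("https://storage/frame1.mp4", "Video 1"),
    ("https://storage/frame2.mp4", "Video 2"),
    ("https://storage/frame3.mp4", None),  # keeps its original file name
    ("https://storage/frame4.mp4", "Video 3"),
]
buffer = io.StringIO()
write_video_csv(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open("videos.csv", "w", newline="")`.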
Single images
A CSV file containing single images MUST contain two columns with the following mandatory headings: 'ObjectURL' and 'Image title'. All headings are case-insensitive.
- The 'ObjectURL' column contains the `objectUrl`. This field is mandatory for each file, as it specifies the full URL of the image resource.
- The 'Image title' column contains the `image_title`. If left blank, the original file name is used.
In the following example files 1, 2 and 4 are assigned the names in the title column, while file 3 keeps its original file name.
ObjectUrl | Image title |
---|---|
https://storage/frame1.jpg | Image 1 |
https://storage/frame2.jpg | Image 2 |
https://storage/frame3.jpg | |
https://storage/frame4.jpg | Image 3 |
Image groups
A CSV file containing image groups MUST contain three columns with the following mandatory headings: 'ObjectURL', 'Image group title', and 'Create video'. All three headings are case-insensitive.
- The 'ObjectURL' column contains the `objectUrl`. This field is mandatory for each file, as it specifies the full URL of the resource.
- The 'Image group title' column contains the `image_group_title`. This field is mandatory, as it determines which image group a file will be assigned to.
- The 'Create video' column. Set this to 'false' for image groups.
In the following example the first two URLs are grouped together into 'Group 1', while the following two files are grouped together into 'Group 2'.
ObjectUrl | Image group title | Create video |
---|---|---|
https://storage/frame1.jpg | Group 1 | false |
https://storage/frame2.jpg | Group 1 | false |
https://storage/frame3.jpg | Group 2 | false |
https://storage/frame4.jpg | Group 2 | false |
Note
Image groups do not require 'write' permissions.
Image sequences
A CSV file containing image sequences MUST contain three columns with the following mandatory headings: 'ObjectURL', 'Image group title', and 'Create video'. All three headings are case-insensitive.
- The 'ObjectURL' column contains the `objectUrl`. This field is mandatory for each file, as it specifies the full URL of the resource.
- The 'Image group title' column contains the `image_group_title`. This field is mandatory, as it determines which image sequence a file will be assigned to. The dimensions of the image sequence are determined by the first file in the sequence.
- The 'Create video' column. This can be left blank, as the default value is 'true'.
In the example below the first two URLs are grouped together into 'Sequence 1', while the second two files are grouped together into 'Sequence 2'.
ObjectUrl | Image group title | Create video |
---|---|---|
https://storage/frame1.jpg | Sequence 1 | true |
https://storage/frame2.jpg | Sequence 1 | true |
https://storage/frame3.jpg | Sequence 2 | true |
https://storage/frame4.jpg | Sequence 2 | true |
Tip
Image groups and image sequences are distinguished only by the value in the 'Create video' column.
Note
Image sequences require 'write' permissions against your storage bucket to save the compressed video.
DICOM
A CSV file containing DICOM files MUST contain two columns with the following headings: 'ObjectURL' and 'Series title'. Both headings are case-insensitive.
- The 'ObjectURL' column contains the `objectUrl`. This field is mandatory for each file, as it specifies the full URL of the resource.
- The 'Series title' column contains the `dicom_title`. When two files are given the same title they are grouped into the same DICOM series. If left blank, the original file name is used.
In the following example the first two files are grouped into 'dicom series 1', the next two files are grouped into 'dicom series 2', while the final file will remain separated as 'dicom series 3'.
ObjectUrl | Series title |
---|---|
https://storage/frame1.dcm | dicom series 1 |
https://storage/frame2.dcm | dicom series 1 |
https://storage/frame3.dcm | dicom series 2 |
https://storage/frame4.dcm | dicom series 2 |
https://storage/frame5.dcm | dicom series 3 |
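To preview how rows will be grouped into series, the CSV can be read back and grouped by title. The following is a sketch that also emits the equivalent `dicom_series` JSON structure. `csv_to_dicom_series` is a hypothetical helper, and treating a blank title as the last path segment of the URL is an assumption about what "original file name" means.

```python
import csv
import io


def csv_to_dicom_series(csv_file):
    """Group CSV rows by series title and return a dicom_series JSON structure."""
    series = {}  # title -> object URLs, in first-seen order
    for row in csv.DictReader(csv_file):
        # Headings are case-insensitive, so normalise them before lookup.
        fields = {k.strip().lower(): (v or "").strip() for k, v in row.items() if k}
        url = fields["objecturl"]
        # Assumption: a blank title falls back to the last path segment of the URL.
        title = fields.get("series title") or url.rsplit("/", 1)[-1]
        series.setdefault(title, []).append(url)
    return {
        "dicom_series": [
            {"title": title, **{f"objectUrl_{i}": u for i, u in enumerate(urls)}}
            for title, urls in series.items()
        ]
    }


example = io.StringIO(
    "ObjectUrl,Series title\n"
    "https://storage/frame1.dcm,dicom series 1\n"
    "https://storage/frame2.dcm,dicom series 1\n"
    "https://storage/frame3.dcm,dicom series 2\n"
    "https://storage/frame4.dcm,dicom series 2\n"
    "https://storage/frame5.dcm,dicom series 3\n"
)
result = csv_to_dicom_series(example)
# Each entry holds one title plus one key per file.
print([(s["title"], len(s) - 1) for s in result["dicom_series"]])
```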
NIfTI
A CSV file containing NIfTI files MUST contain two columns with the following headings: 'ObjectURL' and 'NIfTI title'. Both headings are case-insensitive.
- The 'ObjectURL' column contains the `objectUrl`. This field is mandatory for each file, as it specifies the full URL of the resource.
- The 'NIfTI title' column contains the title of the NIfTI file. If left blank, the original file name is used.
The following example shows how to format the CSV file to upload two NIfTI files to Encord.
ObjectUrl | NIfTI title |
---|---|
https://storage/niftifile1.nii.gz | Brain Image 1 |
https://storage/niftifile2.nii | Brain Image 2 |
Multiple file types
You can upload multiple file types with a single CSV file by using a new header each time there is a change of file type. Three headings will be required if image sequences are included.
Caution
Since the 'Create video' column defaults to "true", all files that aren't image sequences must contain the value "false".
The following example shows a CSV file for the following:
- Two image sequences composed of 2 files each.
- One image group composed of 2 files.
- One single image.
- One video.
ObjectUrl | Image group title | Create video |
---|---|---|
https://storage/frame1.jpg | Sequence 1 | true |
https://storage/frame2.jpg | Sequence 1 | true |
https://storage/frame3.jpg | Sequence 2 | true |
https://storage/frame4.jpg | Sequence 2 | true |
https://storage/frame5.jpg | Group 1 | false |
https://storage/frame6.jpg | Group 1 | false |
ObjectUrl | Image title | Create video |
https://storage/frame1.jpg | Image 1 | false |
ObjectUrl | Image title | Create video |
https://storage/video.mp4 | Video 1 | false |
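A mixed file of this shape can be generated by writing a fresh header row before each block of rows. The following is a minimal sketch that assumes a plain comma-separated file; `write_mixed_csv` is a hypothetical helper, and the rows reuse the placeholder URLs from the table above. It writes 'Create video' explicitly on every row so that nothing falls back to the "true" default.

```python
import csv
import io


def write_mixed_csv(sections, csv_file):
    """sections: list of (title_heading, rows); rows are (url, title, create_video)."""
    writer = csv.writer(csv_file)
    for title_heading, rows in sections:
        # A new header row marks each change of file type.
        writer.writerow(["ObjectUrl", title_heading, "Create video"])
        for url, title, create_video in rows:
            writer.writerow([url, title, "true" if create_video else "false"])


sections = [
    ("Image group title", [
        ("https://storage/frame1.jpg", "Sequence 1", True),
        ("https://storage/frame2.jpg", "Sequence 1", True),
        ("https://storage/frame5.jpg", "Group 1", False),
        ("https://storage/frame6.jpg", "Group 1", False),
    ]),
    ("Image title", [("https://storage/frame1.jpg", "Image 1", False)]),
    ("Video title", [("https://storage/video.mp4", "Video 1", False)]),
]
buffer = io.StringIO()
write_mixed_csv(sections, buffer)
print(buffer.getvalue())
```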
Helpful Scripts and Examples
Use the following scripts to create JSON and CSV files to upload your cloud data to Encord.
AWS S3
- Get an AWS Access Key
  - Sign in to the AWS Management Console.
  - Navigate to the IAM (Identity and Access Management) service.
  - Create a new IAM user (or select an existing user).
  - Assign the necessary permissions (e.g., S3 access) to the user.
  - Generate an access key for the user, which includes an `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.
- Specify the Credentials in a Local File
  - Create a file named `credentials.env` in a secure location on your local machine.
  - Add the following content to the `credentials.env` file, replacing `YOUR_ACCESS_KEY_ID` and `YOUR_SECRET_ACCESS_KEY` with your actual AWS access key values:
    export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
    export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
- Source the Credentials File in Your Shell
  - Open a terminal window.
  - Navigate to the directory where your `credentials.env` file is located.
  - Source the credentials file to set the environment variables:
    source credentials.env
- Run the Script
  - With the environment variables set, you can now run the script below.
  - Execute the script:
    python your_script_name.py
The following Python script creates a JSON file for videos or images by constructing URLs to files in a specific S3 bucket. Ensure that you:
- Replace <bucket-region> with the AWS region where your bucket is located.
- Replace <aws-profile> with the name of the profile in the AWS ~/.aws/credentials file. See AWS Credentials Documentation for information on setting up your credentials file.
- Replace <s3-bucket-name> with the name of the S3 bucket you want to upload files from.
- Replace <s3-directory> with the path to the directory where your files are stored inside the S3 bucket. Include all slashes except for the final slash. For example, the file `my-bucket/some_top_level_dir/video_files/my_video.mp4` is in the S3 directory `some_top_level_dir/video_files`.
- Replace <data-modality> with the modality of the files you want to upload. This can only be `videos` or `images`.
- (If using a Multi-Region Access Point) Replace <global-access-point> with the ARN of the Multi-Region Access Point.
import json

import boto3

REGION = "<bucket-region>"
AWS_PROFILE = "<aws-profile>"
BUCKET_NAME = "<s3-bucket-name>"
S3_DIRECTORY = "<s3-directory>"
DATA_MODALITY = "<data-modality>"  # "videos" or "images"
GLOBAL_ENDPOINT = "<global-access-point>"  # Optional, set to None if not using

# AWS S3 domain and root URL
DOMAIN = f's3.{REGION}.amazonaws.com'
ROOT_URL = GLOBAL_ENDPOINT if GLOBAL_ENDPOINT else f'https://{DOMAIN}/{BUCKET_NAME}'

# AWS session and S3 resource (created from the session so the named profile is used)
session = boto3.Session(profile_name=AWS_PROFILE)
s3 = session.resource('s3')
bucket = s3.Bucket(BUCKET_NAME)


# Function to generate JSON upload specification
def generate_upload_spec(bucket_name, s3_directory, data_modality, root_url):
    files = []
    for object_summary in bucket.objects.all():
        # Keep only objects that sit directly inside s3_directory
        key_split = object_summary.key.split('/')
        key_path = "/".join(key_split[:-1])
        if key_path == s3_directory:
            object_url = f'{root_url}/{object_summary.key}'
            files.append({'objectUrl': object_url})

    # Create the JSON structure based on data modality
    outer_json_dict = {data_modality: files}

    # Write the JSON to a file
    output_filename = f'{bucket_name}-{s3_directory.replace("/", "_")}.json'
    with open(output_filename, 'w') as output_file:
        json.dump(outer_json_dict, output_file, indent=4)
    print(f'JSON upload specification file created: {output_filename}')


# Run the function with provided configuration
generate_upload_spec(BUCKET_NAME, S3_DIRECTORY, DATA_MODALITY, ROOT_URL)
Azure blob
{
"videos": [
{
"objectUrl": "https://myaccount.blob.core.windows.net/myblob"
},
{
"objectUrl": "https://myaccount.blob.core.windows.net/mycontainer/myblob.jpg"
},
{
"objectUrl": "https://myaccount.blob.core.windows.net/mycontainer/myblobs/myblob.jpg"
}
],
"image_groups": [
{
"title": "image_group_1",
"objectUrl_0": "https://myaccount.blob.core.windows.net/mycontainer/myblob1.jpg",
"objectUrl_1": "https://myaccount.blob.core.windows.net/mycontainer/myblob2.jpg"
},
{
"title": "image_group2",
"objectUrl_0": "https://myaccount.blob.core.windows.net/mycontainer/myblob3.jpg",
"objectUrl_1": "https://myaccount.blob.core.windows.net/mycontainer/myblob4.jpg"
}
]
}
GCP storage
{
"videos": [
{
"objectUrl": "gs://example-url/object.mp4"
}
],
"image_groups": [
{
"title": "image_group_1",
"objectUrl_0": "https://storage.cloud.google.com/example-image-bucket/object_1.jpg",
"objectUrl_1": "https://storage.cloud.google.com/example-image-bucket/object_2.jpg"
},
{
"title": "image_group_2",
"objectUrl_0": "https://storage.cloud.google.com/example-image-bucket/object_3.jpg",
"objectUrl_1": "https://storage.cloud.google.com/example-image-bucket/object_4.jpg"
}
]
}
Open Telekom Cloud OSS
{
"dicom_series": [
{
"title": "OPEN_TELEKOM_DICOM_SERIES",
"objectUrl_0": "https://bucket-name.obs.eu-de.otc.t-systems.com/dicom-file-0",
"objectUrl_1": "https://bucket-name.obs.eu-de.otc.t-systems.com/dicom-file-1",
"objectUrl_2": "https://bucket-name.obs.eu-de.otc.t-systems.com/dicom-file-2",
"objectUrl_3": "https://bucket-name.obs.eu-de.otc.t-systems.com/dicom-file-3"
}
]
}