You want to guarantee correctness, completeness, or fairness in your models' predictions. This guide is a quick way to get going with Data Groups in Encord using cloud data.
If you intend to use Encord at scale with Data Groups, we strongly recommend using the Encord SDK.
Once all the videos are re-encoded and you have created an Ontology and Dataset, you are ready to create an Annotate Project. After creating the Project, create your Data Groups, and then your team is ready to annotate your data.

Name: E2E - Project - Data Groups
Creating Data Groups requires mapping your data units to the layout used during annotation and review. Currently, mapping to the layout uses the File ID/UUID that Encord assigns to each data unit. To find the File ID/UUID of your data units, use storage_folder.list_items. The following script gets the file name and ID of your data units and saves the output to a JSON and a CSV file.
List File Name and File ID
```python
from encord import EncordUserClient
import json
import csv

# --- Configuration ---
SSH_PATH = "/Users/chris-encord/ssh-private-key.txt"  # Replace with the file path to your SSH private key
FOLDER_ID = "00000000-0000-0000-0000-000000000000"  # Replace with the Folder ID

# Output file paths
JSON_OUTPUT_PATH = "/file/path/to/save/file_mapping.json"  # Update this as required
CSV_OUTPUT_PATH = "/file/path/to/save/file_mapping.csv"  # Update this as required

# Authenticate with Encord using the path to your private key
user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    ssh_private_key_path=SSH_PATH,
    # For US platform users use "https://api.us.encord.com"
    domain="https://api.encord.com",
)

# Get the storage folder by its ID
storage_folder = user_client.get_storage_folder(FOLDER_ID)

# List all data units
items = list(storage_folder.list_items())

# Create a list of dicts for structured output
file_data = [
    {
        "file_id": str(item.uuid),  # Convert UUID to string
        "file_name": item.name,
        "file_type": item.item_type,
    }
    for item in items
]

# --- Save to JSON file ---
with open(JSON_OUTPUT_PATH, "w") as f:
    json.dump(file_data, f, indent=4)

# --- Save to CSV file ---
with open(CSV_OUTPUT_PATH, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["file_id", "file_name", "file_type"])
    writer.writeheader()
    writer.writerows(file_data)

print(f"Saved output to:\n- {JSON_OUTPUT_PATH}\n- {CSV_OUTPUT_PATH}")
```
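Once the mapping file exists, you can load it back and index it by file name so that looking up a File ID for a given data unit is a single dictionary access. This is a minimal local sketch (no SDK calls); the sample file names and IDs below are placeholders standing in for your real output.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def load_file_mapping(json_path: str) -> dict[str, str]:
    """Load the saved mapping and index it by file name for quick File ID lookups."""
    with open(json_path) as f:
        file_data = json.load(f)
    return {entry["file_name"]: entry["file_id"] for entry in file_data}

# Demo with a stand-in mapping file (the real one comes from the script above)
with TemporaryDirectory() as tmp:
    sample = [
        {"file_id": "11111111-1111-1111-1111-111111111111", "file_name": "00001_normalized.mp4", "file_type": "video"},
        {"file_id": "00000000-0000-0000-0000-000000000000", "file_name": "clustered_event_log_01.txt", "file_type": "plain_text"},
    ]
    path = Path(tmp) / "file_mapping.json"
    path.write_text(json.dumps(sample))

    mapping = load_file_mapping(str(path))
    print(mapping["00001_normalized.mp4"])  # → 11111111-1111-1111-1111-111111111111
```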
Use the output file from the Map Data Units for Data Groups section to map File IDs to their corresponding layout for Data Groups.
Use the script in this section to create Data Groups, add those Data Groups to a Dataset, and add the Dataset to a Project.

The script creates Data Groups with five data units in the following layout:
```
+-------------------------------------------+
|                 text file                 |
+---------------------+---------------------+
|       video 1       |       video 2       |
+---------------------+---------------------+
|       video 3       |       video 4       |
+---------------------+---------------------+
```
To create Data Groups, the File IDs of the data units must be mapped to the Data Group. Refer to the following:
```python
# --- Group definitions (name + UUIDs) ---
groups = [
    {
        "name": "group-001",
        "uuids": {
            "instructions": UUID("00000000-0000-0000-0000-000000000000"),  # Replace with File ID of clustered_event_log_01.txt
            "top-left": UUID("11111111-1111-1111-1111-111111111111"),  # Replace with File ID of 00001_normalized.mp4
            "top-right": UUID("22222222-2222-2222-2222-222222222222"),  # Replace with File ID of 00002_normalized.mp4
            "bottom-left": UUID("33333333-3333-3333-3333-333333333333"),  # Replace with File ID of 00009.mp4
            "bottom-right": UUID("44444444-4444-4444-4444-444444444444"),  # Replace with File ID of 00011_normalized.mp4
        },
    },
    {
        "name": "group-002",
        "uuids": {
            "instructions": UUID("55555555-5555-5555-5555-555555555555"),  # Replace with File ID of clustered_event_log_02.txt
            "top-left": UUID("66666666-6666-6666-6666-666666666666"),  # Replace with File ID of 00012.mp4
            "top-right": UUID("77777777-7777-7777-7777-777777777777"),  # Replace with File ID of 00020.mp4
            "bottom-left": UUID("88888888-8888-8888-8888-888888888888"),  # Replace with File ID of 00030.mp4
            "bottom-right": UUID("99999999-9999-9999-9999-999999999999"),  # Replace with File ID of 00033.mp4
        },
    },
    {
        "name": "group-003",
        "uuids": {
            "instructions": UUID("12312312-3123-1231-2312-312312312312"),  # Replace with File ID of clustered_event_log_03.txt
            "top-left": UUID("23232323-2323-2323-2323-232323232323"),  # Replace with File ID of 00034.mp4
            "top-right": UUID("31313131-3131-3131-3131-313131313131"),  # Replace with File ID of 00035_normalized.mp4
            "bottom-left": UUID("45645645-6456-4564-5645-645645645645"),  # Replace with File ID of 00038_normalized.mp4
            "bottom-right": UUID("56565656-6565-5656-6565-656565656565"),  # Replace with File ID of 00045.mp4
        },
    },
    # More groups...
]
```
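Because a typo in a File ID only fails once you call the API, it can help to sanity-check the group definitions locally first. The following sketch (no SDK calls; the tile-key names match the layout used in this guide) verifies that every group defines exactly the five expected tiles, that every value is a `UUID`, and that no File ID is reused within a group:

```python
from uuid import UUID

# The five tile keys used by the layout in this guide
EXPECTED_KEYS = {"instructions", "top-left", "top-right", "bottom-left", "bottom-right"}

def validate_groups(groups: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the definitions look consistent."""
    problems = []
    for g in groups:
        keys = set(g["uuids"])
        if keys != EXPECTED_KEYS:
            problems.append(f"{g['name']}: keys {sorted(keys)} do not match the layout tiles")
        if not all(isinstance(u, UUID) for u in g["uuids"].values()):
            problems.append(f"{g['name']}: all File IDs must be UUID instances")
        if len(set(g["uuids"].values())) != len(g["uuids"]):
            problems.append(f"{g['name']}: the same File ID is used for more than one tile")
    return problems

# Demo with one well-formed group and one broken group
demo = [
    {"name": "ok", "uuids": {k: UUID(int=i) for i, k in enumerate(sorted(EXPECTED_KEYS))}},
    {"name": "bad", "uuids": {"instructions": UUID(int=0), "top-left": UUID(int=0)}},
]
print(validate_groups(demo))
```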
Run this script to create Data Groups:
```python
from uuid import UUID

from encord.constants.enums import DataType
from encord.objects.metadata import DataGroupMetadata
from encord.orm.storage import DataGroupCustom, StorageItemType
from encord.user_client import EncordUserClient

# --- Configuration ---
SSH_PATH = "/Users/chris-encord/ssh-private-key.txt"  # Replace with the file path to your SSH private key
FOLDER_ID = "00000000-0000-0000-0000-000000000000"  # Replace with the Folder ID
DATASET_ID = "00000000-0000-0000-0000-000000000000"  # Replace with the Dataset ID
PROJECT_ID = "00000000-0000-0000-0000-000000000000"  # Replace with the Project ID

# --- Connect to Encord ---
user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    ssh_private_key_path=SSH_PATH,
    # For US platform users use "https://api.us.encord.com"
    domain="https://api.encord.com",
)

folder = user_client.get_storage_folder(FOLDER_ID)

# --- Reusable layout and settings ---
layout = {
    "direction": "column",
    "first": {"type": "data_unit", "key": "instructions"},
    "second": {
        "direction": "column",
        "first": {
            "direction": "row",
            "first": {"type": "data_unit", "key": "top-left"},
            "second": {"type": "data_unit", "key": "top-right"},
            "splitPercentage": 50,
        },
        "second": {
            "direction": "row",
            "first": {"type": "data_unit", "key": "bottom-left"},
            "second": {"type": "data_unit", "key": "bottom-right"},
            "splitPercentage": 50,
        },
        "splitPercentage": 50,
    },
    "splitPercentage": 20,
}

settings = {"tile_settings": {"instructions": {"is_read_only": True}}}

# --- Group definitions (name + UUIDs) ---
groups = [
    {
        "name": "group-001",
        "uuids": {
            "instructions": UUID("00000000-0000-0000-0000-000000000000"),  # Replace with File ID of clustered_event_log_01.txt
            "top-left": UUID("11111111-1111-1111-1111-111111111111"),  # Replace with File ID of 00001_normalized.mp4
            "top-right": UUID("22222222-2222-2222-2222-222222222222"),  # Replace with File ID of 00002_normalized.mp4
            "bottom-left": UUID("33333333-3333-3333-3333-333333333333"),  # Replace with File ID of 00009.mp4
            "bottom-right": UUID("44444444-4444-4444-4444-444444444444"),  # Replace with File ID of 00011_normalized.mp4
        },
    },
    {
        "name": "group-002",
        "uuids": {
            "instructions": UUID("55555555-5555-5555-5555-555555555555"),  # Replace with File ID of clustered_event_log_02.txt
            "top-left": UUID("66666666-6666-6666-6666-666666666666"),  # Replace with File ID of 00012.mp4
            "top-right": UUID("77777777-7777-7777-7777-777777777777"),  # Replace with File ID of 00020.mp4
            "bottom-left": UUID("88888888-8888-8888-8888-888888888888"),  # Replace with File ID of 00030.mp4
            "bottom-right": UUID("99999999-9999-9999-9999-999999999999"),  # Replace with File ID of 00033.mp4
        },
    },
    {
        "name": "group-003",
        "uuids": {
            "instructions": UUID("12312312-3123-1231-2312-312312312312"),  # Replace with File ID of clustered_event_log_03.txt
            "top-left": UUID("23232323-2323-2323-2323-232323232323"),  # Replace with File ID of 00034.mp4
            "top-right": UUID("31313131-3131-3131-3131-313131313131"),  # Replace with File ID of 00035_normalized.mp4
            "bottom-left": UUID("45645645-6456-4564-5645-645645645645"),  # Replace with File ID of 00038_normalized.mp4
            "bottom-right": UUID("56565656-6565-5656-6565-656565656565"),  # Replace with File ID of 00045.mp4
        },
    },
    # More groups...
]

# Create the Data Groups
for g in groups:
    group = folder.create_data_group(
        DataGroupCustom(
            name=g["name"],
            layout=layout,
            layout_contents=g["uuids"],
            settings=settings,
        )
    )
    print(f"✅ Created group '{g['name']}' with UUID {group}")

# Add all the Data Groups in the folder to a Dataset
group_items = folder.list_items(item_types=[StorageItemType.GROUP])
d = user_client.get_dataset(DATASET_ID)
d.link_items([item.uuid for item in group_items])

# Add the Dataset with the Data Groups to a Project
p = user_client.get_project(PROJECT_ID)
rows = p.list_label_rows_v2(include_children=True)

# Label rows of Data Groups use DataGroupMetadata for the layout to annotate and review
for row in rows:
    if row.data_type == DataType.GROUP:
        row.initialise_labels()
        assert isinstance(row.metadata, DataGroupMetadata)
        print(row.metadata.children)
```
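A common failure mode when building layouts by hand is a mismatch between the `data_unit` keys in the layout tree and the keys in `layout_contents`. The check below walks the layout recursively and collects every `data_unit` key, so you can compare it against a group's keys before calling the API. This is a local sketch (no SDK calls) using the same layout as the script above:

```python
def layout_keys(node: dict) -> set[str]:
    """Recursively collect every data_unit key referenced in a layout tree."""
    if node.get("type") == "data_unit":
        return {node["key"]}
    keys: set[str] = set()
    for child in ("first", "second"):
        if isinstance(node.get(child), dict):
            keys |= layout_keys(node[child])
    return keys

# Same layout shape as the script above (split percentages omitted; they don't affect the keys)
layout = {
    "direction": "column",
    "first": {"type": "data_unit", "key": "instructions"},
    "second": {
        "direction": "column",
        "first": {
            "direction": "row",
            "first": {"type": "data_unit", "key": "top-left"},
            "second": {"type": "data_unit", "key": "top-right"},
        },
        "second": {
            "direction": "row",
            "first": {"type": "data_unit", "key": "bottom-left"},
            "second": {"type": "data_unit", "key": "bottom-right"},
        },
    },
}

expected = {"instructions", "top-left", "top-right", "bottom-left", "bottom-right"}
print(layout_keys(layout) == expected)  # → True
```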
Annotation of videos depends on your Ontology. Our Ontology for E2E Data Groups uses classifications.

In this section, you see the following Collaborators:
Annotators labeling data
Reviewers reviewing labels created by Annotators
Team Manager managing the Annotators and Reviewers
Project Admin managing the Project and exporting labels
1
Prepare to Label
Team Manager or Project Admin
The Team Manager or Project Admin can prioritize certain data to be labeled and reviewed first. Let's prioritize a few Data Groups to be labeled first by setting the priority for those files to 75.

Set Priority to 75
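Priority can also be set programmatically. The sketch below is hedged: it assumes `LabelRowV2.set_priority()` exists in your SDK version and expects a float in [0, 1], while the UI shows priority on a 0–100 scale, so a UI value of 75 maps to 0.75. Verify both assumptions against the Encord SDK reference before relying on this; the group names are examples from this guide.

```python
def ui_priority_to_sdk(ui_priority: int) -> float:
    """Convert a 0-100 UI priority (e.g. 75) to a 0-1 scale."""
    if not 0 <= ui_priority <= 100:
        raise ValueError("priority must be between 0 and 100")
    return ui_priority / 100

def prioritize_groups(project, group_names: list[str], ui_priority: int = 75) -> None:
    """Set a higher priority on the label rows of the named Data Groups.

    `project` is an Encord Project object; set_priority is an assumed SDK
    method — check it exists in your SDK version before using this.
    """
    priority = ui_priority_to_sdk(ui_priority)
    for row in project.list_label_rows_v2():
        if row.data_title in group_names:  # e.g. ["group-001", "group-002"]
            row.set_priority(priority)

print(ui_priority_to_sdk(75))  # → 0.75
```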
Annotators
Annotators can configure the Annotate Label Editor so they can more effectively and efficiently label data.
2
Label Data
Team Manager or Project Admin
The Team Manager or Project Admin can monitor the performance and progress of the annotation team.
Annotators
Annotators use the text file to determine if the Prediction and Summary for each video are correct.