Why do this?

This is a quick way to get started with Data Groups in Encord using data stored in your cloud.

If you intend to use Encord at scale with Data Groups, we strongly recommend using the Encord SDK.

Pros and Cons

Pros

  • Simple way to get data into Encord
  • Able to sync your cloud data with Encord easily
  • Data Groups available for multi-tile, multi-modal functionality

Cons

  • Requires a little bit of technical knowledge to set up integrations
  • No data management or curation (custom metadata needs to be imported separately)

Data Groups can include custom metadata, but for the purposes of this end-to-end example none are included.

Import/Register Data

We’re going to register a dataset of videos (a portion of the Nexar open-source dataset) and text files (events captured for the videos).

1

Create Integration

Select your cloud provider.

2

Download Data

Download the nexar-first-100-osds.zip file and extract its contents.

3

Re-encode Videos

We strongly recommend re-encoding any videos with issues. Re-encoding your videos ensures the best performance when annotating your data.

You can do this locally or in your cloud storage.

For more information on re-encoding videos, see the Encord documentation on re-encoding videos.
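
As a rough illustration of re-encoding locally, the sketch below builds an ffmpeg command for a single video. It assumes ffmpeg is installed and on your PATH, and that H.264 video with AAC audio is an acceptable target. The codec choices, the helper names, and the _normalized suffix (borrowed from the file names used later in this guide) are assumptions for illustration, not settings prescribed by Encord.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_reencode_command(src: Path) -> List[str]:
    """Return an ffmpeg command that re-encodes src next to the original."""
    # Hypothetical naming convention: 00001.mp4 -> 00001_normalized.mp4
    dst = src.with_name(f"{src.stem}_normalized{src.suffix}")
    return [
        "ffmpeg",
        "-y",                   # overwrite the output file if it exists
        "-i", str(src),         # input video
        "-c:v", "libx264",      # assumed target video codec (H.264)
        "-pix_fmt", "yuv420p",  # widely supported pixel format
        "-c:a", "aac",          # assumed target audio codec
        str(dst),
    ]


def reencode(src: Path) -> None:
    """Run ffmpeg for one file; raises if ffmpeg is missing or fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    subprocess.run(build_reencode_command(src), check=True)
```

Calling reencode(Path("00001.mp4")) would write 00001_normalized.mp4 to the same directory, leaving the original file untouched.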

4

Import Data to Cloud Storage

Import the contents of nexar-first-100-osds.zip into your cloud storage.

5

Create Cloud-synced Folder

Syncing the data registers the data in Encord. Your data stays in your cloud storage.

  1. Go to Index > Files.

  2. Click New folder > Cloud-synced folder. The New Cloud-synced folder dialog appears.

  3. Provide the following:

    • Title: E2E - Data Groups - Cloud-synced Folder.
    • Description: OPTIONAL - Provide a meaningful description for the Cloud-synced folder.
    • Select your integration: Select the integration to use from the drop-down.
    • Storage path: Specify the storage/file path to your cloud storage. For example: gs://encord-gcp-bucket/CloudSync/ or s3://encord-aws-bucket/CloudSync.
  4. Click Test to verify that Encord can communicate with your cloud storage.

  5. Click Create. The page for the new Cloud-synced folder appears.
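
Before pasting a Storage path into the dialog, it can help to confirm that it has the expected shape. The following is a minimal sketch using only the Python standard library. The function name and the restriction to the gs and s3 schemes are assumptions based on the two example paths above; other providers may use different formats.

```python
from typing import Tuple
from urllib.parse import urlparse

# Assumption: only the two schemes shown in the example paths are accepted here.
KNOWN_SCHEMES = {"gs", "s3"}


def split_storage_path(storage_path: str) -> Tuple[str, str, str]:
    """Split a storage path into (scheme, bucket, prefix)."""
    parsed = urlparse(storage_path.strip())
    if parsed.scheme not in KNOWN_SCHEMES:
        raise ValueError(f"Unexpected scheme in {storage_path!r}")
    if not parsed.netloc:
        raise ValueError(f"No bucket name in {storage_path!r}")
    # Drop the leading slash so the prefix is relative to the bucket
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")
```

A path with a missing scheme, such as encord-aws-bucket/CloudSync, raises a ValueError instead of being passed on silently.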

Find Storage Path

Finding the Storage path for your folder or object varies across Cloud Storage platforms.

AWS

GCP

6

Sync Data Between Encord and Cloud Storage

  1. Go to Index > Files > E2E - Data Groups - Cloud-synced Folder. The Cloud-synced folder page appears.

  2. Click Initiate sync. The sync between the folder and your cloud storage begins.

Create Ontology

Create the following Ontology for the Project.

Ontology name: E2E - Ontology - Data Groups

Classifications

  • Prediction correct?

    • YES! (Radio button)
    • No (Radio button)
      • What's wrong? (Text)
  • Summary correct?

    • YES! (Radio button)
    • No (Radio button)
      • What's wrong? (Text)

Create Dataset

Create a Dataset for your Data Groups.

Name: E2E - Dataset - Data Groups

Create Project

Once all the videos are re-encoded and you have created an Ontology and a Dataset, you are ready to create an Annotate Project. After you create the Project, create your Data Groups. Your team is then ready to annotate your data.

Name: E2E - Project - Data Groups

Create Data Groups

Use the script in this section to create Data Groups, add those Data Groups to a Dataset, and add the Dataset to a Project.

The script creates Data Groups with five data units in the following layout:

+-------------------------------------------+
|              text file                    |
+------------------+------------------------+
|     video 1      |        video 2         |
+------------------+------------------------+
|     video 3      |        video 4         |
+------------------+------------------------+

To create Data Groups, the File ID of each data unit needs to be mapped to the Data Group.

Refer to the following:


from uuid import UUID

# --- Group definitions (name + UUIDs) ---
groups = [
    {
        "name": "group-001",
        "uuids": {
            "instructions": UUID("00000000-0000-0000-0000-000000000000"), # Replace with File ID of clustered_event_log_01.txt
            "top-left": UUID("11111111-1111-1111-1111-111111111111"), # Replace with File ID of 00001_normalized.mp4
            "top-right": UUID("22222222-2222-2222-2222-222222222222"), # Replace with File ID of 00002_normalized.mp4
            "bottom-left": UUID("33333333-3333-3333-3333-333333333333"), # Replace with File ID of 00009.mp4
            "bottom-right": UUID("44444444-4444-4444-4444-444444444444"), # Replace with File ID of 00011_normalized.mp4
        },
    },
    {
        "name": "group-002",
        "uuids": {
            "instructions": UUID("55555555-5555-5555-5555-555555555555"), # Replace with File ID of clustered_event_log_02.txt
            "top-left": UUID("66666666-6666-6666-6666-666666666666"), # Replace with File ID of 00012.mp4
            "top-right": UUID("77777777-7777-7777-7777-777777777777"), # Replace with File ID of 00020.mp4
            "bottom-left": UUID("88888888-8888-8888-8888-888888888888"), # Replace with File ID of 00030.mp4
            "bottom-right": UUID("99999999-9999-9999-9999-999999999999"), # Replace with File ID of 00033.mp4
        },
    },
    {
        "name": "group-003",
        "uuids": {
            "instructions": UUID("12312312-3123-1231-2312-312312312312"), # Replace with File ID of clustered_event_log_03.txt
            "top-left": UUID("23232323-2323-2323-2323-232323232323"), # Replace with File ID of 00034.mp4
            "top-right": UUID("31313131-3131-3131-3131-313131313131"), # Replace with File ID of 00035_normalized.mp4
            "bottom-left": UUID("45645645-6456-4564-5645-645645645645"), # Replace with File ID of 00038_normalized.mp4
            "bottom-right": UUID("56565656-6565-5656-6565-656565656565"), # Replace with File ID of 00045.mp4
        },
    },
    # More groups...
]
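
File IDs are usually copied by hand, and Python's UUID constructor rejects a string with stray whitespace. The sketch below shows one way to build a group definition from plain strings while trimming whitespace and reporting which tile has a bad File ID. The make_group helper is hypothetical and not part of the Encord SDK.

```python
from typing import Dict
from uuid import UUID


def make_group(name: str, file_ids: Dict[str, str]) -> dict:
    """Build one group definition, converting File ID strings to UUIDs."""
    uuids = {}
    for tile, raw in file_ids.items():
        try:
            # strip() removes accidental leading or trailing whitespace
            uuids[tile] = UUID(raw.strip())
        except ValueError as exc:
            raise ValueError(
                f"{name}: invalid File ID for tile '{tile}': {raw!r}"
            ) from exc
    return {"name": name, "uuids": uuids}
```

The returned dictionary has the same name and uuids keys as the entries in the groups list, so it can be appended to that list directly.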

Run this script to create Data Groups:


from uuid import UUID

from encord.constants.enums import DataType
from encord.objects.metadata import DataGroupMetadata
from encord.orm.storage import DataGroupCustom, StorageItemType
from encord.user_client import EncordUserClient

# --- Configuration ---
SSH_PATH = "/Users/chris-encord/ssh-private-key.txt"  # Replace with the file path to your SSH key
FOLDER_ID = "00000000-0000-0000-0000-000000000000"  # Replace with the Folder ID
DATASET_ID = "00000000-0000-0000-0000-000000000000"  # Replace with the Dataset ID
PROJECT_ID = "00000000-0000-0000-0000-000000000000"  # Replace with the Project ID

# --- Connect to Encord ---
user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    ssh_private_key_path=SSH_PATH,
    # For US platform users use "https://api.us.encord.com"
    domain="https://api.encord.com",
)

folder = user_client.get_storage_folder(FOLDER_ID)

# --- Reusable layout and settings ---
layout = {
    "direction": "column",
    "first": {"type": "data_unit", "key": "instructions"},
    "second": {
        "direction": "column",
        "first": {
            "direction": "row",
            "first": {"type": "data_unit", "key": "top-left"},
            "second": {"type": "data_unit", "key": "top-right"},
            "splitPercentage": 50,
        },
        "second": {
            "direction": "row",
            "first": {"type": "data_unit", "key": "bottom-left"},
            "second": {"type": "data_unit", "key": "bottom-right"},
            "splitPercentage": 50,
        },
        "splitPercentage": 50,
    },
    "splitPercentage": 20,
}
settings = {"tile_settings": {"instructions": {"is_read_only": True}}}

# --- Group definitions (name + UUIDs) ---
groups = [
    {
        "name": "group-001",
        "uuids": {
            "instructions": UUID("00000000-0000-0000-0000-000000000000"), # Replace with File ID of clustered_event_log_01.txt
            "top-left": UUID("11111111-1111-1111-1111-111111111111"), # Replace with File ID of 00001_normalized.mp4
            "top-right": UUID("22222222-2222-2222-2222-222222222222"), # Replace with File ID of 00002_normalized.mp4
            "bottom-left": UUID("33333333-3333-3333-3333-333333333333"), # Replace with File ID of 00009.mp4
            "bottom-right": UUID("44444444-4444-4444-4444-444444444444"), # Replace with File ID of 00011_normalized.mp4
        },
    },
    {
        "name": "group-002",
        "uuids": {
            "instructions": UUID("55555555-5555-5555-5555-555555555555"), # Replace with File ID of clustered_event_log_02.txt
            "top-left": UUID("66666666-6666-6666-6666-666666666666"), # Replace with File ID of 00012.mp4
            "top-right": UUID("77777777-7777-7777-7777-777777777777"), # Replace with File ID of 00020.mp4
            "bottom-left": UUID("88888888-8888-8888-8888-888888888888"), # Replace with File ID of 00030.mp4
            "bottom-right": UUID("99999999-9999-9999-9999-999999999999"), # Replace with File ID of 00033.mp4
        },
    },
    {
        "name": "group-003",
        "uuids": {
            "instructions": UUID("12312312-3123-1231-2312-312312312312"), # Replace with File ID of clustered_event_log_03.txt
            "top-left": UUID("23232323-2323-2323-2323-232323232323"), # Replace with File ID of 00034.mp4
            "top-right": UUID("31313131-3131-3131-3131-313131313131"), # Replace with File ID of 00035_normalized.mp4
            "bottom-left": UUID("45645645-6456-4564-5645-645645645645"), # Replace with File ID of 00038_normalized.mp4
            "bottom-right": UUID("56565656-6565-5656-6565-656565656565"), # Replace with File ID of 00045.mp4
        },
    },
    # More groups...
]

# Create the data groups

for g in groups:
    group = folder.create_data_group(
        DataGroupCustom(
            name=g["name"],
            layout=layout,
            layout_contents=g["uuids"],
            settings=settings,
        )
    )
    print(f"✅ Created group '{g['name']}' with UUID {group}")

# Add all the data groups in a folder to a Dataset
group_items = folder.list_items(item_types=[StorageItemType.GROUP])
d = user_client.get_dataset(DATASET_ID)
d.link_items([item.uuid for item in group_items])

# Fetch the Project and list its label rows, including the children of Data Groups

p = user_client.get_project(PROJECT_ID)
rows = p.list_label_rows_v2(include_children=True)

# Label Rows of Data Groups use DataGroupMetadata for the layout to Annotate and Review
for row in rows:
    if row.data_type == DataType.GROUP:
        row.initialise_labels()
        assert isinstance(row.metadata, DataGroupMetadata)
        print(row.metadata.children)
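
Every key named in the layout needs a matching entry in each group's uuids dictionary. As a sanity check before running the script, the sketch below walks the nested layout and compares its keys with a group's keys. It assumes layout nodes are either leaves of the form {"type": "data_unit", "key": ...} or splits with "first" and "second" children, as in the layout above. Both helper names are hypothetical.

```python
from typing import List, Set, Tuple


def layout_keys(node: dict) -> Set[str]:
    """Collect the data unit keys referenced anywhere in a layout tree."""
    if node.get("type") == "data_unit":
        return {node["key"]}
    # A split node: recurse into both halves
    return layout_keys(node["first"]) | layout_keys(node["second"])


def check_group_keys(layout: dict, uuids: dict) -> Tuple[List[str], List[str]]:
    """Return (keys missing from uuids, keys in uuids that the layout ignores)."""
    expected = layout_keys(layout)
    actual = set(uuids)
    return sorted(expected - actual), sorted(actual - expected)
```

With the layout and groups from the script, check_group_keys(layout, g["uuids"]) returns two empty lists for every group g when each tile has a File ID.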

Annotate Data Groups

Annotation of videos depends on your Ontology. The Ontology in this example, E2E - Ontology - Data Groups, uses classifications.

In this section, you’ll see the following Collaborators:

  • Annotators labelling data
  • Reviewers reviewing labels created by Annotators
  • Team Manager managing the Annotators and Reviewers
  • Project Admin managing the Project and exporting labels

1

Prepare to Label

2

Label Data

3

Review Labels

4

Export Labels

Only Project Admins can export labels from Encord.