You want to guarantee correctness, completeness, or fairness in your models' predictions. This guide is a quick way to get going with Data Groups in Encord using cloud data.
If you intend to use Encord at scale with Data Groups, we strongly recommend using the Encord SDK.
Once all the videos are re-encoded and you have created an Ontology and Dataset, you are ready to create an Annotate Project. After creating the Project, create your Data Groups, and then your team is ready to annotate your data.

Name: E2E - Project - Data Groups
Creating Data Groups requires mapping your data units to the layout used during annotation and review. Currently, mapping to the layout uses the File ID/UUID that Encord assigns to each data unit. To find the File ID/UUID of your data units, use storage_folder.list_items. The following script gets the file name and ID of your data units and saves the output to a JSON and a CSV file.
List File Name and File ID
```python
from encord import EncordUserClient
import json
import csv

# --- Configuration ---
SSH_PATH = "/Users/chris-encord/ssh-private-key.txt"  # Replace with the file path to your SSH private key
FOLDER_ID = "00000000-0000-0000-0000-000000000000"  # Replace with the Folder ID

# Output file paths
JSON_OUTPUT_PATH = "/file/path/to/save/file_mapping.json"  # Update this as required
CSV_OUTPUT_PATH = "/file/path/to/save/file_mapping.csv"  # Update this as required

# Authenticate with Encord using the path to your private key
user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    ssh_private_key_path=SSH_PATH,
    # For US platform users use "https://api.us.encord.com"
    domain="https://api.encord.com",
)

# Get the storage folder by its ID
storage_folder = user_client.get_storage_folder(FOLDER_ID)

# List all data units
items = list(storage_folder.list_items())

# Create a list of dicts for structured output
file_data = [
    {
        "file_id": str(item.uuid),  # Convert UUID to string
        "file_name": item.name,
        "file_type": item.item_type,
    }
    for item in items
]

# --- Save to JSON file ---
with open(JSON_OUTPUT_PATH, "w") as f:
    json.dump(file_data, f, indent=4)

# --- Save to CSV file ---
with open(CSV_OUTPUT_PATH, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["file_id", "file_name", "file_type"])
    writer.writeheader()
    writer.writerows(file_data)

print(f"Saved output to:\n- {JSON_OUTPUT_PATH}\n- {CSV_OUTPUT_PATH}")
```
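Once the mapping file exists, you can load it back and index it by file name so that looking up a File ID for a given data unit is a single dictionary access. This is a minimal local sketch (no SDK calls); the sample file names and IDs below are placeholders standing in for your real output.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def load_file_mapping(json_path: str) -> dict[str, str]:
    """Load the saved mapping and index it by file name for quick File ID lookups."""
    with open(json_path) as f:
        file_data = json.load(f)
    return {entry["file_name"]: entry["file_id"] for entry in file_data}

# Demo with a stand-in mapping file (the real one comes from the script above)
with TemporaryDirectory() as tmp:
    sample = [
        {"file_id": "11111111-1111-1111-1111-111111111111", "file_name": "00001_normalized.mp4", "file_type": "video"},
        {"file_id": "00000000-0000-0000-0000-000000000000", "file_name": "clustered_event_log_01.txt", "file_type": "plain_text"},
    ]
    path = Path(tmp) / "file_mapping.json"
    path.write_text(json.dumps(sample))

    mapping = load_file_mapping(str(path))
    print(mapping["00001_normalized.mp4"])  # → 11111111-1111-1111-1111-111111111111
```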
Use the output file from the Map Data Units for Data Groups section to map File IDs to their corresponding layout for Data Groups.
Use the script in this section to create Data Groups, add those Data Groups to a Dataset, and add the Dataset to a Project.

The script creates Data Groups with five data units in the following layout:
```
+-------------------------------------------+
|                 text file                 |
+---------------------+---------------------+
|       video 1       |       video 2       |
+---------------------+---------------------+
|       video 3       |       video 4       |
+---------------------+---------------------+
```
To create Data Groups, the File IDs of the data units must be mapped to the Data Group. Refer to the following:
```python
# --- Group definitions (name + UUIDs) ---
groups = [
    {
        "name": "group-001",
        "uuids": {
            "instructions": UUID("00000000-0000-0000-0000-000000000000"),  # Replace with File ID of clustered_event_log_01.txt
            "top-left": UUID("11111111-1111-1111-1111-111111111111"),  # Replace with File ID of 00001_normalized.mp4
            "top-right": UUID("22222222-2222-2222-2222-222222222222"),  # Replace with File ID of 00002_normalized.mp4
            "bottom-left": UUID("33333333-3333-3333-3333-333333333333"),  # Replace with File ID of 00009.mp4
            "bottom-right": UUID("44444444-4444-4444-4444-444444444444"),  # Replace with File ID of 00011_normalized.mp4
        },
    },
    {
        "name": "group-002",
        "uuids": {
            "instructions": UUID("55555555-5555-5555-5555-555555555555"),  # Replace with File ID of clustered_event_log_02.txt
            "top-left": UUID("66666666-6666-6666-6666-666666666666"),  # Replace with File ID of 00012.mp4
            "top-right": UUID("77777777-7777-7777-7777-777777777777"),  # Replace with File ID of 00020.mp4
            "bottom-left": UUID("88888888-8888-8888-8888-888888888888"),  # Replace with File ID of 00030.mp4
            "bottom-right": UUID("99999999-9999-9999-9999-999999999999"),  # Replace with File ID of 00033.mp4
        },
    },
    {
        "name": "group-003",
        "uuids": {
            "instructions": UUID("12312312-3123-1231-2312-312312312312"),  # Replace with File ID of clustered_event_log_03.txt
            "top-left": UUID("23232323-2323-2323-2323-232323232323"),  # Replace with File ID of 00034.mp4
            "top-right": UUID("31313131-3131-3131-3131-313131313131"),  # Replace with File ID of 00035_normalized.mp4
            "bottom-left": UUID("45645645-6456-4564-5645-645645645645"),  # Replace with File ID of 00038_normalized.mp4
            "bottom-right": UUID("56565656-6565-5656-6565-656565656565"),  # Replace with File ID of 00045.mp4
        },
    },
    # More groups...
]
```
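Because a typo in a File ID only fails once you call the API, it can help to sanity-check the group definitions locally first. The following sketch (no SDK calls; the tile-key names match the layout used in this guide) verifies that every group defines exactly the five expected tiles, that every value is a `UUID`, and that no File ID is reused within a group:

```python
from uuid import UUID

# The five tile keys used by the layout in this guide
EXPECTED_KEYS = {"instructions", "top-left", "top-right", "bottom-left", "bottom-right"}

def validate_groups(groups: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the definitions look consistent."""
    problems = []
    for g in groups:
        keys = set(g["uuids"])
        if keys != EXPECTED_KEYS:
            problems.append(f"{g['name']}: keys {sorted(keys)} do not match the layout tiles")
        if not all(isinstance(u, UUID) for u in g["uuids"].values()):
            problems.append(f"{g['name']}: all File IDs must be UUID instances")
        if len(set(g["uuids"].values())) != len(g["uuids"]):
            problems.append(f"{g['name']}: the same File ID is used for more than one tile")
    return problems

# Demo with one well-formed group and one broken group
demo = [
    {"name": "ok", "uuids": {k: UUID(int=i) for i, k in enumerate(sorted(EXPECTED_KEYS))}},
    {"name": "bad", "uuids": {"instructions": UUID(int=0), "top-left": UUID(int=0)}},
]
print(validate_groups(demo))
```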
Run this script to create Data Groups:
```python
from uuid import UUID

from encord.constants.enums import DataType
from encord.objects.metadata import DataGroupMetadata
from encord.orm.storage import DataGroupCustom, StorageItemType
from encord.user_client import EncordUserClient

# --- Configuration ---
SSH_PATH = "/Users/chris-encord/ssh-private-key.txt"  # Replace with the file path to your SSH private key
FOLDER_ID = "00000000-0000-0000-0000-000000000000"  # Replace with the Folder ID
DATASET_ID = "00000000-0000-0000-0000-000000000000"  # Replace with the Dataset ID
PROJECT_ID = "00000000-0000-0000-0000-000000000000"  # Replace with the Project ID

# --- Connect to Encord ---
user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    ssh_private_key_path=SSH_PATH,
    # For US platform users use "https://api.us.encord.com"
    domain="https://api.encord.com",
)

folder = user_client.get_storage_folder(FOLDER_ID)

# --- Reusable layout and settings ---
layout = {
    "direction": "column",
    "first": {"type": "data_unit", "key": "instructions"},
    "second": {
        "direction": "column",
        "first": {
            "direction": "row",
            "first": {"type": "data_unit", "key": "top-left"},
            "second": {"type": "data_unit", "key": "top-right"},
            "splitPercentage": 50,
        },
        "second": {
            "direction": "row",
            "first": {"type": "data_unit", "key": "bottom-left"},
            "second": {"type": "data_unit", "key": "bottom-right"},
            "splitPercentage": 50,
        },
        "splitPercentage": 50,
    },
    "splitPercentage": 20,
}

settings = {"tile_settings": {"instructions": {"is_read_only": True}}}

# --- Group definitions (name + UUIDs) ---
groups = [
    {
        "name": "group-001",
        "uuids": {
            "instructions": UUID("00000000-0000-0000-0000-000000000000"),  # Replace with File ID of clustered_event_log_01.txt
            "top-left": UUID("11111111-1111-1111-1111-111111111111"),  # Replace with File ID of 00001_normalized.mp4
            "top-right": UUID("22222222-2222-2222-2222-222222222222"),  # Replace with File ID of 00002_normalized.mp4
            "bottom-left": UUID("33333333-3333-3333-3333-333333333333"),  # Replace with File ID of 00009.mp4
            "bottom-right": UUID("44444444-4444-4444-4444-444444444444"),  # Replace with File ID of 00011_normalized.mp4
        },
    },
    {
        "name": "group-002",
        "uuids": {
            "instructions": UUID("55555555-5555-5555-5555-555555555555"),  # Replace with File ID of clustered_event_log_02.txt
            "top-left": UUID("66666666-6666-6666-6666-666666666666"),  # Replace with File ID of 00012.mp4
            "top-right": UUID("77777777-7777-7777-7777-777777777777"),  # Replace with File ID of 00020.mp4
            "bottom-left": UUID("88888888-8888-8888-8888-888888888888"),  # Replace with File ID of 00030.mp4
            "bottom-right": UUID("99999999-9999-9999-9999-999999999999"),  # Replace with File ID of 00033.mp4
        },
    },
    {
        "name": "group-003",
        "uuids": {
            "instructions": UUID("12312312-3123-1231-2312-312312312312"),  # Replace with File ID of clustered_event_log_03.txt
            "top-left": UUID("23232323-2323-2323-2323-232323232323"),  # Replace with File ID of 00034.mp4
            "top-right": UUID("31313131-3131-3131-3131-313131313131"),  # Replace with File ID of 00035_normalized.mp4
            "bottom-left": UUID("45645645-6456-4564-5645-645645645645"),  # Replace with File ID of 00038_normalized.mp4
            "bottom-right": UUID("56565656-6565-5656-6565-656565656565"),  # Replace with File ID of 00045.mp4
        },
    },
    # More groups...
]

# Create the Data Groups
for g in groups:
    group = folder.create_data_group(
        DataGroupCustom(
            name=g["name"],
            layout=layout,
            layout_contents=g["uuids"],
            settings=settings,
        )
    )
    print(f"✅ Created group '{g['name']}' with UUID {group}")

# Add all the Data Groups in the folder to a Dataset
group_items = folder.list_items(item_types=[StorageItemType.GROUP])
d = user_client.get_dataset(DATASET_ID)
d.link_items([item.uuid for item in group_items])

# Add the Dataset with the Data Groups to a Project
p = user_client.get_project(PROJECT_ID)
rows = p.list_label_rows_v2(include_children=True)

# Label rows of Data Groups use DataGroupMetadata for the layout to annotate and review
for row in rows:
    if row.data_type == DataType.GROUP:
        row.initialise_labels()
        assert isinstance(row.metadata, DataGroupMetadata)
        print(row.metadata.children)
```
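A common failure mode when building layouts by hand is a mismatch between the `data_unit` keys in the layout tree and the keys in `layout_contents`. The check below walks the layout recursively and collects every `data_unit` key, so you can compare it against a group's keys before calling the API. This is a local sketch (no SDK calls) using the same layout as the script above:

```python
def layout_keys(node: dict) -> set[str]:
    """Recursively collect every data_unit key referenced in a layout tree."""
    if node.get("type") == "data_unit":
        return {node["key"]}
    keys: set[str] = set()
    for child in ("first", "second"):
        if isinstance(node.get(child), dict):
            keys |= layout_keys(node[child])
    return keys

# Same layout shape as the script above (split percentages omitted; they don't affect the keys)
layout = {
    "direction": "column",
    "first": {"type": "data_unit", "key": "instructions"},
    "second": {
        "direction": "column",
        "first": {
            "direction": "row",
            "first": {"type": "data_unit", "key": "top-left"},
            "second": {"type": "data_unit", "key": "top-right"},
        },
        "second": {
            "direction": "row",
            "first": {"type": "data_unit", "key": "bottom-left"},
            "second": {"type": "data_unit", "key": "bottom-right"},
        },
    },
}

expected = {"instructions", "top-left", "top-right", "bottom-left", "bottom-right"}
print(layout_keys(layout) == expected)  # → True
```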
Annotation of videos depends on your Ontology. Our Ontology for E2E Data Groups uses classifications.

In this section, you see the following Collaborators:
Annotators labeling data
Reviewers reviewing labels created by Annotators
Team Manager managing the Annotators and Reviewers
Project Admin managing the Project and exporting labels
1
Prepare to Label
Team Manager or Project Admin
The Team Manager or Project Admin can prioritize certain data to be labeled and reviewed first. Let's prioritize a few Data Groups to be labeled first by setting the priority for those files to 75.

Set Priority to 75
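Priority can also be set programmatically. The sketch below is hedged: it assumes `LabelRowV2.set_priority()` exists in your SDK version and expects a float in [0, 1], while the UI shows priority on a 0–100 scale, so a UI value of 75 maps to 0.75. Verify both assumptions against the Encord SDK reference before relying on this; the group names are examples from this guide.

```python
def ui_priority_to_sdk(ui_priority: int) -> float:
    """Convert a 0-100 UI priority (e.g. 75) to a 0-1 scale."""
    if not 0 <= ui_priority <= 100:
        raise ValueError("priority must be between 0 and 100")
    return ui_priority / 100

def prioritize_groups(project, group_names: list[str], ui_priority: int = 75) -> None:
    """Set a higher priority on the label rows of the named Data Groups.

    `project` is an Encord Project object; set_priority is an assumed SDK
    method — check it exists in your SDK version before using this.
    """
    priority = ui_priority_to_sdk(ui_priority)
    for row in project.list_label_rows_v2():
        if row.data_title in group_names:  # e.g. ["group-001", "group-002"]
            row.set_priority(priority)

print(ui_priority_to_sdk(75))  # → 0.75
```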
Annotators
Annotators can configure the Annotate Label Editor so they can more effectively and efficiently label data.
2
Label Data
Team Manager or Project Admin
The Team Manager or Project Admin can monitor the performance and progress of the annotation team.
Annotators
Annotators use the text file to determine if the Prediction and Summary for each video are correct.