Why do this?

You want to understand how to create Projects in Encord that use Tabular Data (CSV files). This example assumes your Tabular Data is stored in cloud storage.

If you intend to use Encord at scale, we strongly recommend using the Encord SDK.

Pros and Cons

Pros

  • Simple way to get data into Encord
  • Tabular data available for annotation

Cons

  • Requires a little bit of technical knowledge to set up integrations
  • No data management or curation of your tabular data in this example

Tabular data currently supports CONSENSUS Projects only.

Import/Register Data

We’re going to register our tiny dataset of CSV files.

1

Create Integration

Select your cloud provider.

2

Download Data

Download and extract the contents of the e2e-tabular-data.zip file.

3

Modify JSON

Modify the tabular-data.json file that you extracted from the e2e-tabular-data.zip file.

  1. Open the tabular-data.json file and replace <file-path> with the file path to the data stored in your cloud storage.

The tabular-data.json file includes the file path and title for each CSV file. It does NOT include clientMetadata.
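If you prefer to script the replacement rather than edit the file by hand, the following is a minimal sketch. It assumes the placeholder appears literally as <file-path> in the file. The function name fill_file_paths, the storage prefix, and the JSON keys in the usage example are hypothetical; the real tabular-data.json may be laid out differently.

```python
import json

PLACEHOLDER = "<file-path>"  # placeholder named in the step above


def fill_file_paths(json_text: str, storage_prefix: str) -> str:
    """Replace every <file-path> placeholder with a cloud storage prefix.

    storage_prefix is whatever path your integration points at
    (hypothetical example: "s3://my-bucket/tabular").
    """
    filled = json_text.replace(PLACEHOLDER, storage_prefix)
    # Parse the result so a broken edit fails here instead of at import time.
    json.loads(filled)
    return filled
```

For example, calling fill_file_paths(text, "s3://my-bucket/tabular") on the file contents turns an entry such as "<file-path>/video_game_annotation_1.csv" into "s3://my-bucket/tabular/video_game_annotation_1.csv". Write the returned text back to tabular-data.json before importing it.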
4

Create a Mirrored Dataset

Create a mirrored Dataset called E2E - Tabular Data - Dataset using the UI. Using mirrored Datasets is a simple way to sync data from folders to Datasets. Mirrored Datasets provide no method of curating or managing your data.

If you want to add more data to your Dataset, add more data to the JSON file. Then re-import the JSON file, and the data is automatically added to your Dataset and Project.

5

Register/Import Data

Use the tabular-data.json file from e2e-tabular-data.zip to register/import the data to the mirrored Dataset.

Create Ontology

For this step you need the following:

  • genre-options.csv and platform-options.csv from the e2e-tabular-data.zip file.
  • One video_game_annotation_X.csv from the e2e-tabular-data.zip file.
  • tabular_create_ontology.py script. You create this.

The tabular_create_ontology.py script does the following:

  • Creates the Ontology based on the structure of any of the video_game_annotation_X.csv files.
  • Creates feature mapping for the genre column using genre-options.csv.
  • Creates feature mapping for the platform column using platform-options.csv.

E2E - Tabular Data - Ontology appears in your Ontology list after running the script.

tabular_create_ontology script

import pandas as pd
from encord.objects import OntologyStructure, Shape, TextAttribute
from encord.objects.attributes import RadioAttribute
from encord.user_client import EncordUserClient

# --- Configuration ---
ENCORD_SSH_KEY = "/Users/chris-encord/ssh-private-key.txt" # Replace with the file path to your SSH private key
TASK_CSV_PATH = "/file/path/to/video_game_annotation_1.csv" # Replace with the file path to any of the video_game_annotation_X.csv files

READ_ONLY_COLUMNS = [0, 1, 2]
ANNOTATION_COLUMNS = [3, 4]

# Map each annotation column name to the file path of its options CSV
MAPPING_FIELD_OPTION_PATHS = {
    "genre": "/file/path/to/genre-options.csv",
    "platform": "/file/path/to/platform-options.csv",
}

ONTOLOGY_NAME = "E2E - Tabular Data - Ontology"
OBJECT_NAME = "Game Row"


def parse_csv():
    csv_df = pd.read_csv(TASK_CSV_PATH)
    readonly_columns = csv_df.columns[READ_ONLY_COLUMNS].tolist()
    mapping_columns = csv_df.columns[ANNOTATION_COLUMNS].tolist()

    return mapping_columns, readonly_columns


def create_ontology(text_attribute_names, radio_option_names):
    ontology_structure = OntologyStructure()
    text_object = ontology_structure.add_object(name=OBJECT_NAME, shape=Shape.TEXT)

    for attribute in text_attribute_names:
        text_object.add_attribute(TextAttribute, attribute)

    for column_name in radio_option_names:
        options_path = MAPPING_FIELD_OPTION_PATHS.get(column_name)
        if options_path is None:
            raise ValueError(f"No options file defined for column '{column_name}'")

        options = pd.read_csv(options_path).iloc[:, 0].dropna().astype(str).tolist()

        radio_attribute = text_object.add_attribute(RadioAttribute, column_name, required=True)
        for option in options:
            radio_attribute.add_option(option)

    user_client = EncordUserClient.create_with_ssh_private_key(
        ssh_private_key_path=ENCORD_SSH_KEY,
        domain="https://api.encord.com",
    )
    return user_client.create_ontology(ONTOLOGY_NAME, structure=ontology_structure)


if __name__ == "__main__":
    mapping_columns, readonly_columns = parse_csv()
    ontology = create_ontology(readonly_columns, mapping_columns)
    print(f"Created ontology {ontology.title}, id: {ontology.ontology_hash}")
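To see what the script derives from the CSV files before you run it against Encord, the sketch below reproduces the two parsing steps using only the standard library. It assumes, as pandas does by default, that the first row of each file is a header. The helper names split_header and load_options are hypothetical, and the first three column names in the usage example are invented; only genre and platform come from the script above.

```python
import csv
import io

READ_ONLY_COLUMNS = [0, 1, 2]   # same indices as in the script above
ANNOTATION_COLUMNS = [3, 4]


def split_header(csv_text):
    """Return (read_only_names, annotation_names) from the header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    read_only = [header[i] for i in READ_ONLY_COLUMNS]
    annotated = [header[i] for i in ANNOTATION_COLUMNS]
    return read_only, annotated


def load_options(csv_text):
    """Return the non-empty values of the first column, skipping the header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return [row[0] for row in rows[1:] if row and row[0].strip()]


if __name__ == "__main__":
    # Hypothetical header; only "genre" and "platform" come from the guide.
    print(split_header("title,year,publisher,genre,platform\n"))
    print(load_options("genre\nAction\nPuzzle\n"))
```

The annotation column names returned by split_header must match the keys of MAPPING_FIELD_OPTION_PATHS exactly, otherwise the script raises the "No options file defined" error.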

Create Project

Create a CONSENSUS Project after you have created the mirrored Dataset, registered/imported the CSV files, and created the Ontology.

  • Tabular data currently supports CONSENSUS Projects only.
  • An AGENT block must be the first block for tabular data.
  • The AGENT block and AGENT pathway MUST use the exact names specified below.
  • Name: E2E - Tabular Data - Project
  • Agent name: Pre-label
  • Agent pathway: Labelled

Run the Agent script

The tabular_run_agent.py script populates tasks in the AGENT block in your workflow.

Create the following Python scripts. Both scripts must be in the same directory.

  • tabular_run_agent.py
  • tabular_utils.py

After creating the scripts, run the tabular_run_agent.py script.

After running the script, tasks that were in the AGENT stage are now in the CONSENSUS - ANNOTATE stage.


tabular_run_agent script

from typing import Annotated
from pathlib import Path
import os

from encord_agents.tasks import Runner
from encord.objects.ontology_labels_impl import LabelRowV2
from encord.project import Project
from encord_agents.tasks.dependencies import dep_asset
from encord_agents.core.dependencies import Depends
from encord.objects.common import Shape

from tabular_utils import parse_csv_and_add_objects

# --- Configuration ---
ENCORD_SSH_KEY = "/Users/chris-encord/ssh-private-key.txt" # Replace with the file path to your SSH private key
PROJECT_HASH = "00000000-0000-0000-0000-000000000000" # Replace with unique Project ID of the tabular data Project
AGENT_STAGE = "Pre-label"
AGENT_PATHWAY = "Labelled"

# Inject into environment so Encord Agents can pick it up
os.environ["ENCORD_SSH_KEY_FILE"] = ENCORD_SSH_KEY

runner = Runner(project_hash=PROJECT_HASH)

@runner.stage(stage=AGENT_STAGE)
def agent_logic(
    lr: LabelRowV2, project: Project, asset: Annotated[Path, Depends(dep_asset)]
):
    ontology = project.ontology_structure
    # Check for an empty Ontology first: indexing an empty list raises IndexError
    if not ontology.objects:
        raise Exception("No objects found")
    text_object = ontology.objects[0]
    if text_object.shape is not Shape.TEXT:
        raise Exception("Text object required")

    parse_csv_and_add_objects(text_object, lr, asset)

    return AGENT_PATHWAY

if __name__ == "__main__":
    runner.run()
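The contents of tabular_utils.py are not shown in this section. As a rough illustration only, and assuming that each CSV row is labelled as one text region identified by its character offsets (an assumption, not something this guide confirms), the offsets could be computed with the standard library as sketched below. The function name row_char_spans is hypothetical and is not part of the Encord SDK or of tabular_utils.py.

```python
def row_char_spans(csv_text):
    """Return a (start, end) character offset pair for each data row.

    The header row (first line) and blank lines are skipped. The end
    offset is exclusive and does not include the line break.
    """
    spans = []
    offset = 0
    for index, line in enumerate(csv_text.splitlines(keepends=True)):
        content = line.rstrip("\r\n")
        if index > 0 and content:
            spans.append((offset, offset + len(content)))
        offset += len(line)  # advance past the line, including its line break
    return spans
```

This simple version splits on line breaks, so it does not handle quoted fields that contain embedded newlines.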

Annotate Tabular Project

How CSV files are annotated depends on your Ontology. The E2E - Tabular Data - Ontology uses text regions, but your annotators and reviewers work with drop-down menus.

In this section, you’ll see the following Collaborators:

  • Annotators labelling data
  • Reviewers reviewing labels created by Annotators
  • Team Manager managing the Annotators and Reviewers
  • Project Admin managing the Project and exporting labels

1

Prepare to Label

2

Label Data

3

Review Labels

4

Export Labels

Only Project Admins can export labels from Encord.