GCP Examples

Basic Geometric Example

A simple example showing how to use objectHashes.

agent.py

from typing import Annotated

from encord.objects.ontology_labels_impl import LabelRowV2
from encord.objects.ontology_object_instance import ObjectInstance

from encord_agents.core.data_model import FrameData
from encord_agents.core.dependencies import Depends
from encord_agents.gcp.dependencies import dep_objects
from encord_agents.gcp.wrappers import editor_agent


@editor_agent()
def handle_object_hashes(
    frame_data: FrameData,
    lr: LabelRowV2,
    object_instances: Annotated[list[ObjectInstance], Depends(dep_objects)],
) -> None:
    for object_inst in object_instances:
        print(object_inst)

Use Case: Selective OCR on Selected Objects

This functionality allows you to apply your own OCR model to specific objects selected directly within the Encord platform. When you trigger your agent from the Encord app after selecting objects, the platform automatically sends a list of objectHashes to your agent. Your agent can then use the dep_objects dependency to access these specific object instances, which simplifies integrating your OCR model for targeted processing.

Test the Agent

Save the above code as agent.py.
Run the following command to run the agent in debug mode in your terminal.

functions-framework --target=handle_object_hashes --debug --source agent.py

Open your Project in the Encord platform and navigate to a frame with an object that you want to act on. Select an object in the bottom-left sidebar and click Copy URL.

The URL should have roughly this format: "https://app.encord.com/label_editor/{project_hash}/{data_hash}/{frame}/0?other_query_params&objectHash={objectHash}".
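
To make the URL structure concrete, the following sketch splits such a URL into its parts using only Python's standard library. The helper name `parse_editor_url` and the example hashes are invented for illustration, the path layout is taken from the format shown above, and the handling of repeated `objectHash` parameters is an assumption. The agent itself does not need this code, since the test command accepts the URL directly.

```python
from urllib.parse import parse_qs, urlparse


def parse_editor_url(url: str) -> dict:
    """Split a label editor URL into its parts (hypothetical helper)."""
    parsed = urlparse(url)
    # Assumed path layout: /label_editor/{project_hash}/{data_hash}/{frame}/...
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 4 or parts[0] != "label_editor":
        raise ValueError(f"Unexpected label editor URL: {url}")
    query = parse_qs(parsed.query)
    return {
        "project_hash": parts[1],
        "data_hash": parts[2],
        "frame": int(parts[3]),
        # Assumption: objectHash may be repeated when several objects are selected.
        "object_hashes": query.get("objectHash", []),
    }


# Made-up hashes, used only to exercise the helper.
example = "https://app.encord.com/label_editor/proj-123/data-456/7/0?objectHash=abc12345"
info = parse_editor_url(example)
print(info)
```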

In another shell operating from the same working directory, source your virtual environment and test the agent.

source venv/bin/activate
encord-agents test local agent '<your_url>'

To see if the test is successful, refresh your browser to see the action taken by the Agent. If the test has run successfully, the agent can be deployed. Visit the deployment documentation to learn more.

Nested Classification using Claude 3.5 Sonnet

The goals of this example are:

Create an editor agent that automatically adds frame-level classifications.
Demonstrate how to use the OntologyDataModel for classifications.

Prerequisites

Before you begin, ensure that you have:

Created a virtual Python environment.
Installed all necessary dependencies.
Obtained an Anthropic API key.
Set up authentication with Encord.

Run the following commands to set up your environment:

python -m venv venv                 # Create a virtual Python environment  
source venv/bin/activate            # Activate the virtual environment  
python -m pip install encord-agents anthropic  # Install required dependencies  
export ANTHROPIC_API_KEY="<your_api_key>"     # Set your Anthropic API key  
export ENCORD_SSH_KEY_FILE="/path/to/your/private/key"  # Define your Encord access key  
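
Before starting the agent, it can help to confirm that the two variables exported above are visible to Python. The following is a minimal sketch; the function name `missing_env_vars` is hypothetical, and it checks only the two variable names used in the commands above.

```python
import os
from typing import Mapping

# The two variable names exported in the setup commands.
REQUIRED_VARS = ("ANTHROPIC_API_KEY", "ENCORD_SSH_KEY_FILE")


def missing_env_vars(env: Mapping[str, str], required=REQUIRED_VARS) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


missing = missing_env_vars(os.environ)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("Environment looks complete.")
```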

Project Setup

Create a Project with visual content (images, image groups, image sequences, or videos) in Encord. This example uses the following Ontology, but any Ontology containing classifications can be used.

Ontology JSON and Script

Ontology JSON

{
  "objects": [],
  "classifications": [
    {
      "id": "1",
      "featureNodeHash": "TTkHMtuD",
      "attributes": [
        {
          "id": "1.1",
          "featureNodeHash": "+1g9I9Sg",
          "type": "text",
          "name": "scene summary",
          "required": false,
          "dynamic": false
        }
      ]
    },
    {
      "id": "2",
      "featureNodeHash": "xGV/wCD0",
      "attributes": [
        {
          "id": "2.1",
          "featureNodeHash": "k3EVexk7",
          "type": "radio",
          "name": "is there a person in the frame?",
          "required": false,
          "options": [
            {
              "id": "2.1.1",
              "featureNodeHash": "EkGwhcO4",
              "label": "yes",
              "value": "yes",
              "options": [
                {
                  "id": "2.1.1.1",
                  "featureNodeHash": "mj9QCDY4",
                  "type": "text",
                  "name": "What is the person doing?",
                  "required": false
                }
              ]
            },
            {
              "id": "2.1.2",
              "featureNodeHash": "37rMLC/v",
              "label": "no",
              "value": "no",
              "options": []
            }
          ],
          "dynamic": false
        }
      ]
    }
  ]
}
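
To make the nesting in the JSON above easier to see, the following sketch walks a trimmed copy of it and prints one indented line per attribute or option. It uses only the standard library; the `outline` helper is hypothetical and is not part of the Encord SDK, and the trimmed JSON keeps only the fields the helper reads.

```python
import json

# Trimmed copy of the Ontology JSON above (names and feature node hashes only).
ontology_json = """
{
  "classifications": [
    {"attributes": [{"featureNodeHash": "+1g9I9Sg", "type": "text", "name": "scene summary"}]},
    {"attributes": [{
      "featureNodeHash": "k3EVexk7", "type": "radio", "name": "is there a person in the frame?",
      "options": [
        {"featureNodeHash": "EkGwhcO4", "label": "yes", "options": [
          {"featureNodeHash": "mj9QCDY4", "type": "text", "name": "What is the person doing?"}
        ]},
        {"featureNodeHash": "37rMLC/v", "label": "no", "options": []}
      ]
    }]}
  ]
}
"""


def outline(nodes, depth=0):
    """Yield one indented line per attribute or option, depth first."""
    for node in nodes:
        title = node.get("name") or node.get("label")
        yield f"{'  ' * depth}{title} [{node['featureNodeHash']}]"
        yield from outline(node.get("options", []), depth + 1)


lines = []
for classification in json.loads(ontology_json)["classifications"]:
    lines.extend(outline(classification["attributes"]))
print("\n".join(lines))
```

The text attribute "What is the person doing?" appears two levels below the radio attribute, under its "yes" option, which is the nesting the agent has to fill in.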

To construct the same Ontology as used in this example, run the following script.

Create Ontology

import json
from encord.objects.ontology_structure import OntologyStructure
from encord_agents.core.utils import get_user_client

encord_client = get_user_client()
structure = OntologyStructure.from_dict(json.loads("{the_json_above}"))
ontology = encord_client.create_ontology(
    title="Your ontology title",
    structure=structure
)
print(ontology.ontology_hash)

The aim is to trigger an agent that transforms a labeling task from Figure A to Figure B.

Figure A: No classification labels.

Figure B: Multiple nested classification labels generated by an LLM.

Create the Agent

This section provides the complete code for creating your editor agent, along with an explanation of its internal workings.

Agent Setup Steps

Import dependencies, authenticate with Encord, and set up the Project. Ensure you insert your Project’s unique identifier.
Create a data model and a system prompt based on the Project Ontology to tell Claude how to structure its response.
Set up an Anthropic API client to establish communication with the Claude model.
Define the Editor Agent. This includes:

Retrieving Frame Content: It automatically fetches the current frame’s image data using the dep_single_frame dependency.
Analyzing with Claude: The frame image is then sent to the Claude AI model for analysis.
Parsing Classifications: Claude’s response is parsed and transformed into structured classification instances using the predefined data model.
Saving Results: The new classifications are added to the active label row, and the updated results are saved within the Project.

# 1. Import dependencies, authenticate with Encord, and set up the Project. Ensure you insert your Project's unique identifier.
import os

from anthropic import Anthropic
from encord.objects.ontology_labels_impl import LabelRowV2
from numpy.typing import NDArray
from typing_extensions import Annotated

from encord_agents.core.ontology import OntologyDataModel
from encord_agents.core.utils import get_user_client
from encord_agents.core.video import Frame
from encord_agents.gcp import Depends, editor_agent
from encord_agents.gcp.dependencies import FrameData, dep_single_frame

client = get_user_client()
project = client.get_project("<your_project_hash>")

# 2. Create a data model and a system prompt based on the Project Ontology to tell Claude how to structure its response
data_model = OntologyDataModel(project.ontology_structure.classifications)

system_prompt = f"""
You're a helpful assistant that's supposed to help fill in json objects 
according to this schema:

    ```json
    {data_model.model_json_schema_str}
    ```

Please only respond with valid json.
"""

# 3. Set up an Anthropic API client to establish communication with Claude 
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
anthropic_client = Anthropic(api_key=ANTHROPIC_API_KEY)

# 4. Define the Editor Agent
@editor_agent()
def agent(
    frame_data: FrameData,
    lr: LabelRowV2,
    content: Annotated[NDArray, Depends(dep_single_frame)],
):
    # Retrieving Frame Content: It automatically fetches the current frame's image data using the `dep_single_frame` dependency
    frame = Frame(frame_data.frame, content=content)
    # Analyzing with Claude: The frame image is then sent to the Claude AI model for analysis
    message = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=system_prompt,
        messages=[
            {
                "role": "user",
                "content": [frame.b64_encoding(output_format="anthropic")],
            }
        ],
    )
    try:
        # Parsing Classifications: Claude's response is parsed and transformed into structured classification instances using the predefined data model
        classifications = data_model(message.content[0].text)
        for clf in classifications:
            clf.set_for_frames(frame_data.frame, confidence=0.5, manual_annotation=False)
            lr.add_classification_instance(clf)
    except Exception:
        import traceback

        traceback.print_exc()
        print(f"Response from model: {message.content[0].text}")
        
    # Saving Results: The new classifications are added to the active label row, and the updated results are saved within the Project.
    lr.save()
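
The system prompt asks for JSON only, but language models sometimes wrap their answer in a Markdown code fence anyway, which would make parsing fail. The following is a minimal sketch of a pre-processing step that could guard against this. The helper `strip_code_fence` is hypothetical and not part of encord-agents, and whether it is needed depends on how the model actually responds.

```python
import json
import re

# A Markdown code fence, built from single backticks to keep this snippet readable.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def strip_code_fence(text: str) -> str:
    """Return the contents of the first fenced block, or the stripped text if none."""
    match = FENCED_BLOCK.search(text)
    return match.group(1) if match else text.strip()


wrapped = FENCE + 'json\n{"scene_summary": "a street"}\n' + FENCE
print(json.loads(strip_code_fence(wrapped)))
```

Under this assumption, the parsing call in the agent could become `data_model(strip_code_fence(message.content[0].text))`.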

The contents of data_model.model_json_schema_str for this example are:

{
  "$defs": {
    "IsThereAPersonInTheFrameRadioModel": {
      "properties": {
        "feature_node_hash": {
          "const": "k3EVexk7",
          "description": "UUID for discrimination. Must be included in json as is.",
          "enum": [
            "k3EVexk7"
          ],
          "title": "Feature Node Hash",
          "type": "string"
        },
        "choice": {
          "description": "Choose exactly one answer from the given options.",
          "discriminator": {
            "mapping": {
              "37rMLC/v": "#/$defs/NoNestedRadioModel",
              "EkGwhcO4": "#/$defs/YesNestedRadioModel"
            },
            "propertyName": "feature_node_hash"
          },
          "oneOf": [
            {
              "$ref": "#/$defs/YesNestedRadioModel"
            },
            {
              "$ref": "#/$defs/NoNestedRadioModel"
            }
          ],
          "title": "Choice"
        }
      },
      "required": [
        "feature_node_hash",
        "choice"
      ],
      "title": "IsThereAPersonInTheFrameRadioModel",
      "type": "object"
    },
    "NoNestedRadioModel": {
      "properties": {
        "feature_node_hash": {
          "const": "37rMLC/v",
          "description": "UUID for discrimination. Must be included in json as is.",
          "enum": [
            "37rMLC/v"
          ],
          "title": "Feature Node Hash",
          "type": "string"
        },
        "title": {
          "const": "no",
          "default": "Constant value - should be included as-is.",
          "enum": [
            "no"
          ],
          "title": "Title",
          "type": "string"
        }
      },
      "required": [
        "feature_node_hash"
      ],
      "title": "NoNestedRadioModel",
      "type": "object"
    },
    "SceneSummaryTextModel": {
      "properties": {
        "feature_node_hash": {
          "const": "+1g9I9Sg",
          "description": "UUID for discrimination. Must be included in json as is.",
          "enum": [
            "+1g9I9Sg"
          ],
          "title": "Feature Node Hash",
          "type": "string"
        },
        "value": {
          "description": "Please describe the image as accurate as possible focusing on 'scene summary'",
          "maxLength": 1000,
          "minLength": 0,
          "title": "Value",
          "type": "string"
        }
      },
      "required": [
        "feature_node_hash",
        "value"
      ],
      "title": "SceneSummaryTextModel",
      "type": "object"
    },
    "WhatIsThePersonDoingTextModel": {
      "properties": {
        "feature_node_hash": {
          "const": "mj9QCDY4",
          "description": "UUID for discrimination. Must be included in json as is.",
          "enum": [
            "mj9QCDY4"
          ],
          "title": "Feature Node Hash",
          "type": "string"
        },
        "value": {
          "description": "Please describe the image as accurate as possible focusing on 'What is the person doing?'",
          "maxLength": 1000,
          "minLength": 0,
          "title": "Value",
          "type": "string"
        }
      },
      "required": [
        "feature_node_hash",
        "value"
      ],
      "title": "WhatIsThePersonDoingTextModel",
      "type": "object"
    },
    "YesNestedRadioModel": {
      "properties": {
        "feature_node_hash": {
          "const": "EkGwhcO4",
          "description": "UUID for discrimination. Must be included in json as is.",
          "enum": [
            "EkGwhcO4"
          ],
          "title": "Feature Node Hash",
          "type": "string"
        },
        "what_is_the_person_doing": {
          "$ref": "#/$defs/WhatIsThePersonDoingTextModel",
          "description": "A text attribute with carefully crafted text to describe the property."
        }
      },
      "required": [
        "feature_node_hash",
        "what_is_the_person_doing"
      ],
      "title": "YesNestedRadioModel",
      "type": "object"
    }
  },
  "properties": {
    "scene_summary": {
      "$ref": "#/$defs/SceneSummaryTextModel",
      "description": "A text attribute with carefully crafted text to describe the property."
    },
    "is_there_a_person_in_the_frame": {
      "$ref": "#/$defs/IsThereAPersonInTheFrameRadioModel",
      "description": "A mutually exclusive radio attribute to choose exactly one option that best matches to the give visual input."
    }
  },
  "required": [
    "scene_summary",
    "is_there_a_person_in_the_frame"
  ],
  "title": "ClassificationModel",
  "type": "object"
}
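
As an illustration of what a response matching this schema could look like, the following sketch builds one by hand as a Python dictionary. The feature node hashes and key names come from the schema above; the two text values are invented, and the check at the end covers only the top-level `required` keys rather than full schema validation.

```python
import json

response = {
    "scene_summary": {
        "feature_node_hash": "+1g9I9Sg",
        "value": "A cyclist rides along a tree-lined street.",  # invented text
    },
    "is_there_a_person_in_the_frame": {
        "feature_node_hash": "k3EVexk7",
        "choice": {
            "feature_node_hash": "EkGwhcO4",  # selects the "yes" option
            "what_is_the_person_doing": {
                "feature_node_hash": "mj9QCDY4",
                "value": "Riding a bicycle.",  # invented text
            },
        },
    },
}

# Top-level keys that the schema lists under "required".
required_top_level = {"scene_summary", "is_there_a_person_in_the_frame"}
missing = required_top_level - response.keys()
print(json.dumps(response, indent=2))
print("Missing top-level keys:", sorted(missing))
```

Choosing the "no" option instead would replace the inner `choice` object with `{"feature_node_hash": "37rMLC/v"}`, since `NoNestedRadioModel` requires only its hash.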

Test the Agent

In your current terminal, run the following command to run the agent in debug mode.

functions-framework --target=agent --debug --source agent.py

Open your Project in the Encord platform and navigate to a frame you want to add a classification to. Copy the URL from your browser.

The URL should have the following format: "https://app.encord.com/label_editor/{project_hash}/{data_hash}/{frame}".

In another shell operating from the same working directory, source your virtual environment and test the agent.

source venv/bin/activate
encord-agents test local agent '<your_url>'

To see if the test is successful, refresh your browser to view the classifications generated by Claude. Once the test runs successfully, you are ready to deploy your agent. Visit the deployment documentation to learn more.

Nested Attributes using Claude 3.5 Sonnet

The goals of this example are:

Create an editor agent that can convert generic object annotations (class-less coordinates) into class-specific annotations with nested attributes like descriptions, radio buttons, and checklists.
Demonstrate how to use both the OntologyDataModel and the dep_object_crops dependency.

Prerequisites

Before you begin, ensure that you have:

Created a virtual Python environment.
Installed all necessary dependencies.
Obtained an Anthropic API key.
Set up authentication with Encord.

Run the following commands to set up your environment:

python -m venv venv                 # Create a virtual Python environment  
source venv/bin/activate            # Activate the virtual environment  
python -m pip install encord-agents anthropic  # Install required dependencies  
export ANTHROPIC_API_KEY="<your_api_key>"     # Set your Anthropic API key  
export ENCORD_SSH_KEY_FILE="/path/to/your/private/key"  # Define your Encord access key  

Ontology JSON and Script

ontology.json

{
  "objects": [
    {
      "id": "1",
      "name": "person",
      "color": "#D33115",
      "shape": "bounding_box",
      "featureNodeHash": "2xlDPPAG",
      "required": false,
      "attributes": [
        {
          "id": "1.1",
          "featureNodeHash": "aFCN9MMm",
          "type": "text",
          "name": "activity",
          "required": false,
          "dynamic": false
        }
      ]
    },
    {
      "id": "2",
      "name": "animal",
      "color": "#E27300",
      "shape": "bounding_box",
      "featureNodeHash": "3y6JxTUX",
      "required": false,
      "attributes": [
        {
          "id": "2.1",
          "featureNodeHash": "2P7LTUZA",
          "type": "radio",
          "name": "type",
          "required": false,
          "options": [
            {
              "id": "2.1.1",
              "featureNodeHash": "gJvcEeLl",
              "label": "dolphin",
              "value": "dolphin",
              "options": []
            },
            {
              "id": "2.1.2",
              "featureNodeHash": "CxrftGS4",
              "label": "monkey",
              "value": "monkey",
              "options": []
            },
            {
              "id": "2.1.3",
              "featureNodeHash": "OQyWm7Sm",
              "label": "dog",
              "value": "dog",
              "options": []
            },
            {
              "id": "2.1.4",
              "featureNodeHash": "CDKmYJK/",
              "label": "cat",
              "value": "cat",
              "options": []
            }
          ],
          "dynamic": false
        },
        {
          "id": "2.2",
          "featureNodeHash": "5fFgrM+E",
          "type": "text",
          "name": "description",
          "required": false,
          "dynamic": false
        }
      ]
    },
    {
      "id": "3",
      "name": "vehicle",
      "color": "#16406C",
      "shape": "bounding_box",
      "featureNodeHash": "llw7qdWW",
      "required": false,
      "attributes": [
        {
          "id": "3.1",
          "featureNodeHash": "79mo1G7Q",
          "type": "text",
          "name": "type - short and concise",
          "required": false,
          "dynamic": false
        },
        {
          "id": "3.2",
          "featureNodeHash": "OFrk07Ds",
          "type": "checklist",
          "name": "visible",
          "required": false,
          "options": [
            {
              "id": "3.2.1",
              "featureNodeHash": "KmX/HjRT",
              "label": "wheels",
              "value": "wheels"
            },
            {
              "id": "3.2.2",
              "featureNodeHash": "H6qbEcdj",
              "label": "frame",
              "value": "frame"
            },
            {
              "id": "3.2.3",
              "featureNodeHash": "gZ9OucoQ",
              "label": "chain",
              "value": "chain"
            },
            {
              "id": "3.2.4",
              "featureNodeHash": "cit3aZSz",
              "label": "head lights",
              "value": "head_lights"
            },
            {
              "id": "3.2.5",
              "featureNodeHash": "qQ3PieJ/",
              "label": "tail lights",
              "value": "tail_lights"
            }
          ],
          "dynamic": false
        }
      ]
    },
    {
      "id": "4",
      "name": "generic",
      "color": "#FE9200",
      "shape": "bounding_box",
      "featureNodeHash": "jootTFfQ",
      "required": false,
      "attributes": []
    }
  ],
  "classifications": []
}

To construct the Ontology used in this example, run the following script:

import json
from encord.objects.ontology_structure import OntologyStructure
from encord_agents.core.utils import get_user_client

encord_client = get_user_client()
structure = OntologyStructure.from_dict(json.loads("{the_json_above}"))
ontology = encord_client.create_ontology(
    title="Your ontology title",
    structure=structure
)
print(ontology.ontology_hash)

The goal is to create an agent that takes a labeling task from Figure A to Figure B.

Figure A: No classification labels.

Import dependencies, authenticate with Encord, and set up the Project. Ensure you insert your Project’s unique identifier.
Extract the generic Ontology object and the specific objects of interest. This example sorts Ontology objects based on whether their title is "generic". The generic object is used to query image crops within the agent. Before that, other_objects is used to pass in the specific context we want Claude to focus on. The OntologyDataModel class helps convert Encord Ontology Objects into a Pydantic model and parse JSON into Encord ObjectInstances.
Prepare the system prompt for each object crop using the data_model to generate the JSON schema. Only other_objects is passed to ensure the model can choose only from non-generic object types.
Set up an Anthropic API client to establish communication with the Claude model. You must include your Anthropic API key.
Define the Editor Agent.

All arguments are automatically injected when the agent is called. For details, see the dependency injection documentation.
The dep_object_crops dependency allows filtering. In this case, it includes only “generic” object crops, excluding those already converted to actual labels.

Query Claude using the image crops. The crop variable has a convenient b64_encoding method to produce an input that Claude understands.
Parse Claude’s message using the data_model. When called with a JSON string, it attempts to parse it with respect to the JSON schema we saw above to create an Encord object instance. If successful, the old generic object can be removed and the newly classified object added.
Save the labels with Encord.
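
The sorting trick from step 2 can be illustrated in isolation. The sketch below uses a stand-in class instead of real Encord ontology objects (only a `title` attribute is assumed). Because the sort key is a boolean and `reverse=True` is set, the object titled "generic" moves to the front, while Python's stable sort keeps the remaining objects in their original order.

```python
from dataclasses import dataclass


@dataclass
class FakeOntologyObject:
    """Stand-in for an Encord ontology object; only `title` matters here."""

    title: str


objects = [FakeOntologyObject(t) for t in ("person", "animal", "generic", "vehicle")]

# True sorts after False, so reverse=True puts the "generic" object first.
generic_obj, *other_objects = sorted(
    objects,
    key=lambda o: o.title.lower() == "generic",
    reverse=True,
)
print(generic_obj.title)
print([o.title for o in other_objects])
```

Note that the unpacking assumes exactly one object titled "generic"; without one, `generic_obj` would simply be the first object in the Ontology.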

# 1. Import dependencies, authenticate with Encord, and set up the Project. Ensure you insert your Project's unique identifier
import os

from anthropic import Anthropic
from encord.objects.ontology_labels_impl import LabelRowV2
from typing_extensions import Annotated

from encord_agents.core.ontology import OntologyDataModel
from encord_agents.core.utils import get_user_client
from encord_agents.gcp import Depends, editor_agent
from encord_agents.gcp.dependencies import FrameData, InstanceCrop, dep_object_crops

# User client
client = get_user_client()
project = client.get_project("<project_hash>")

# 2. Extract the generic Ontology object and the specific objects of interest. This example sorts Ontology objects based on whether their title is `"generic"`
generic_ont_obj, *other_objects = sorted(
    project.ontology_structure.objects,
    key=lambda o: o.title.lower() == "generic",
    reverse=True,
)

# 3. Prepare the system prompt for each object crop using the `data_model` to generate the JSON schema
data_model = OntologyDataModel(other_objects)
system_prompt = f"""
You're a helpful assistant that's supposed to help fill in 
json objects according to this schema:

`{data_model.model_json_schema_str}`

Please only respond with valid json.
"""

# 4. Set up an Anthropic API client to establish communication with the Claude model. You must include your Anthropic API key

ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
anthropic_client = Anthropic(api_key=ANTHROPIC_API_KEY)


# 5. Define the Editor Agent
@editor_agent()
def agent(
    frame_data: FrameData,
    lr: LabelRowV2,
    crops: Annotated[
        list[InstanceCrop],
        Depends(dep_object_crops(filter_ontology_objects=[generic_ont_obj])),
    ],
):
    # 6. Query Claude using the image crops. The `crop` variable has a convenient `b64_encoding` method to produce an input that Claude understands.
    changes = False
    for crop in crops:
        message = anthropic_client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            system=system_prompt,
            messages=[
                {
                    "role": "user",
                    "content": [crop.b64_encoding(output_format="anthropic")],
                }
            ],
        )

        # 7. Parse Claude's message using the `data_model`.
        try:
            instance = data_model(message.content[0].text)

            coordinates = crop.instance.get_annotation(frame=frame_data.frame).coordinates
            instance.set_for_frames(
                coordinates=coordinates,
                frames=frame_data.frame,
                confidence=0.5,
                manual_annotation=False,
            )
            lr.remove_object(crop.instance)
            lr.add_object_instance(instance)
            changes = True
        except Exception:
            import traceback

            traceback.print_exc()
            print(f"Response from model: {message.content[0].text}")

    # 8. Save the labels with Encord.
    if changes:
        lr.save()
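
The crops handed to the agent come from `dep_object_crops`. As a rough mental model only, and not the library's implementation, a crop region can be derived from a bounding box whose coordinates are stored relative to the image size. The sketch assumes normalized `top_left_x`, `top_left_y`, `width`, and `height` values between 0 and 1; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NormalizedBox:
    """Bounding box with coordinates relative to the image size (assumed 0 to 1)."""

    top_left_x: float
    top_left_y: float
    width: float
    height: float


def to_pixel_box(box: NormalizedBox, image_width: int, image_height: int):
    """Return (left, top, right, bottom) in pixels, clamped to the image."""
    left = max(0, round(box.top_left_x * image_width))
    top = max(0, round(box.top_left_y * image_height))
    right = min(image_width, round((box.top_left_x + box.width) * image_width))
    bottom = min(image_height, round((box.top_left_y + box.height) * image_height))
    return left, top, right, bottom


pixel_box = to_pixel_box(NormalizedBox(0.25, 0.5, 0.5, 0.25), 640, 480)
print(pixel_box)
```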

The result of `data_model.model_json_schema_str` for this example is:

{
  "$defs": {
    "ActivityTextModel": {
      "properties": {
        "feature_node_hash": {
          "const": "aFCN9MMm",
          "description": "UUID for discrimination. Must be included in json as is.",
          "enum": [
            "aFCN9MMm"
          ],
          "title": "Feature Node Hash",
          "type": "string"
        },
        "value": {
          "description": "Please describe the image as accurate as possible focusing on 'activity'",
          "maxLength": 1000,
          "minLength": 0,
          "title": "Value",
          "type": "string"
        }
      },
      "required": [
        "feature_node_hash",
        "value"
      ],
      "title": "ActivityTextModel",
      "type": "object"
    },
    "AnimalNestedModel": {
      "properties": {
        "feature_node_hash": {
          "const": "3y6JxTUX",
          "description": "UUID for discrimination. Must be included in json as is.",
          "enum": [
            "3y6JxTUX"
          ],
          "title": "Feature Node Hash",
          "type": "string"
        },
        "type": {
          "$ref": "#/$defs/TypeRadioModel",
          "description": "A mutually exclusive radio attribute to choose exactly one option that best matches to the give visual input."
        },
        "description": {
          "$ref": "#/$defs/DescriptionTextModel",
          "description": "A text attribute with carefully crafted text to describe the property."
        }
      },
      "required": [
        "feature_node_hash",
        "type",
        "description"
      ],
      "title": "AnimalNestedModel",
      "type": "object"
    },
    "DescriptionTextModel": {
      "properties": {
        "feature_node_hash": {
          "const": "5fFgrM+E",
          "description": "UUID for discrimination. Must be included in json as is.",
          "enum": [
            "5fFgrM+E"
          ],
          "title": "Feature Node Hash",
          "type": "string"
        },
        "value": {
          "description": "Please describe the image as accurate as possible focusing on 'description'",
          "maxLength": 1000,
          "minLength": 0,
          "title": "Value",
          "type": "string"
        }
      },
      "required": [
        "feature_node_hash",
        "value"
      ],
      "title": "DescriptionTextModel",
      "type": "object"
    },
    "PersonNestedModel": {
      "properties": {
        "feature_node_hash": {
          "const": "2xlDPPAG",
          "description": "UUID for discrimination. Must be included in json as is.",
          "enum": [
            "2xlDPPAG"
          ],
          "title": "Feature Node Hash",
          "type": "string"
        },
        "activity": {
          "$ref": "#/$defs/ActivityTextModel",
          "description": "A text attribute with carefully crafted text to describe the property."
        }
      },
      "required": [
        "feature_node_hash",
        "activity"
      ],
      "title": "PersonNestedModel",
      "type": "object"
    },
    "TypeRadioEnum": {
      "enum": [
        "dolphin",
        "monkey",
        "dog",
        "cat"
      ],
      "title": "TypeRadioEnum",
      "type": "string"
    },
    "TypeRadioModel": {
      "properties": {
        "feature_node_hash": {
          "const": "2P7LTUZA",
          "description": "UUID for discrimination. Must be included in json as is.",
          "enum": [
            "2P7LTUZA"
          ],
          "title": "Feature Node Hash",
          "type": "string"
        },
        "choice": {
          "$ref": "#/$defs/TypeRadioEnum",
          "description": "Choose exactly one answer from the given options."
        }
      },
      "required": [
        "feature_node_hash",
        "choice"
      ],
      "title": "TypeRadioModel",
      "type": "object"
    },
    "TypeShortAndConciseTextModel": {
      "properties": {
        "feature_node_hash": {
          "const": "79mo1G7Q",
          "description": "UUID for discrimination. Must be included in json as is.",
          "enum": [
            "79mo1G7Q"
          ],
          "title": "Feature Node Hash",
          "type": "string"
        },
        "value": {
          "description": "Please describe the image as accurate as possible focusing on 'type - short and concise'",
          "maxLength": 1000,
          "minLength": 0,
          "title": "Value",
          "type": "string"
        }
      },
      "required": [
        "feature_node_hash",
        "value"
      ],
      "title": "TypeShortAndConciseTextModel",
      "type": "object"
    },
    "VehicleNestedModel": {
      "properties": {
        "feature_node_hash": {
          "const": "llw7qdWW",
          "description": "UUID for discrimination. Must be included in json as is.",
          "enum": [
            "llw7qdWW"
          ],
          "title": "Feature Node Hash",
          "type": "string"
        },
        "type__short_and_concise": {
          "$ref": "#/$defs/TypeShortAndConciseTextModel",
          "description": "A text attribute with carefully crafted text to describe the property."
        },
        "visible": {
          "$ref": "#/$defs/VisibleChecklistModel",
          "description": "A collection of boolean values indicating which concepts are applicable according to the image content."
        }
      },
      "required": [
        "feature_node_hash",
        "type__short_and_concise",
        "visible"
      ],
      "title": "VehicleNestedModel",
      "type": "object"
    },
    "VisibleChecklistModel": {
      "properties": {
        "feature_node_hash": {
          "const": "OFrk07Ds",
          "description": "UUID for discrimination. Must be included in json as is.",
          "enum": [
            "OFrk07Ds"
          ],
          "title": "Feature Node Hash",
          "type": "string"
        },
        "wheels": {
          "description": "Is 'wheels' applicable or not?",
          "title": "Wheels",
          "type": "boolean"
        },
        "frame": {
          "description": "Is 'frame' applicable or not?",
          "title": "Frame",
          "type": "boolean"
        },
        "chain": {
          "description": "Is 'chain' applicable or not?",
          "title": "Chain",
          "type": "boolean"
        },
        "head_lights": {
          "description": "Is 'head lights' applicable or not?",
          "title": "Head Lights",
          "type": "boolean"
        },
        "tail_lights": {
          "description": "Is 'tail lights' applicable or not?",
          "title": "Tail Lights",
          "type": "boolean"
        }
      },
      "required": [
        "feature_node_hash",
        "wheels",
        "frame",
        "chain",
        "head_lights",
        "tail_lights"
      ],
      "title": "VisibleChecklistModel",
      "type": "object"
    }
  },
  "properties": {
    "choice": {
      "description": "Choose exactly one answer from the given options.",
      "discriminator": {
        "mapping": {
          "2xlDPPAG": "#/$defs/PersonNestedModel",
          "3y6JxTUX": "#/$defs/AnimalNestedModel",
          "llw7qdWW": "#/$defs/VehicleNestedModel"
        },
        "propertyName": "feature_node_hash"
      },
      "oneOf": [
        {
          "$ref": "#/$defs/PersonNestedModel"
        },
        {
          "$ref": "#/$defs/AnimalNestedModel"
        },
        {
          "$ref": "#/$defs/VehicleNestedModel"
        }
      ],
      "title": "Choice"
    }
  },
  "required": [
    "choice"
  ],
  "title": "ObjectsRadioModel",
  "type": "object"
}
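
To show how the discriminator in this schema ties a response to one object type, the following sketch builds a hand-written response for a vehicle and looks up the matching model by its `feature_node_hash`. The hashes, key names, and model names come from the schema above; the attribute values are invented, and the lookup is a simplified illustration rather than what the data model does internally.

```python
# Discriminator mapping from the schema, reduced to the model names.
discriminator_mapping = {
    "2xlDPPAG": "PersonNestedModel",
    "3y6JxTUX": "AnimalNestedModel",
    "llw7qdWW": "VehicleNestedModel",
}

# Hand-written example response; the values are invented.
response = {
    "choice": {
        "feature_node_hash": "llw7qdWW",
        "type__short_and_concise": {"feature_node_hash": "79mo1G7Q", "value": "bicycle"},
        "visible": {
            "feature_node_hash": "OFrk07Ds",
            "wheels": True,
            "frame": True,
            "chain": True,
            "head_lights": False,
            "tail_lights": False,
        },
    }
}

selected_model = discriminator_mapping[response["choice"]["feature_node_hash"]]
# Keep only the checklist entries set to True (this skips the hash string).
visible_parts = [k for k, v in response["choice"]["visible"].items() if v is True]
print(selected_model, visible_parts)
```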

Test the Agent

In your current terminal, run the following command to run the agent in debug mode.

functions-framework --target=agent --debug --source agent.py

Open your Project in the Encord platform and navigate to a frame you want to add a generic object to. Copy the URL from your browser.

The URL has the following format: "https://app.encord.com/label_editor/{project_hash}/{data_hash}/{frame}".
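
To sanity-check a copied URL before passing it to the test command, the path can be split into its three parts. This is a rough sketch that assumes the path layout shown above; `parse_editor_url` is a hypothetical helper, not an encord-agents function, and the hashes in the usage note are made up.

```python
from urllib.parse import urlparse


def parse_editor_url(url: str) -> dict:
    """Split a label editor URL into project hash, data hash, and frame.

    Hypothetical helper; assumes the /label_editor/{project_hash}/{data_hash}/{frame} layout.
    """
    # Query parameters are ignored because only the path is inspected.
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 4 or parts[0] != "label_editor":
        raise ValueError(f"Not a label editor URL: {url}")
    return {
        "project_hash": parts[1],
        "data_hash": parts[2],
        "frame": int(parts[3]),
    }
```

For example, `parse_editor_url("https://app.encord.com/label_editor/proj-123/data-456/7")` returns the project hash `proj-123`, the data hash `data-456`, and frame `7`.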

In another shell operating from the same working directory, source your virtual environment and test the agent.

source venv/bin/activate
encord-agents test local agent '<your_url>'

To see if the test is successful, refresh your browser to view the classifications generated by Claude. Once the test runs successfully, you are ready to deploy your agent. Visit the deployment documentation to learn more.

Video Recaptioning using GPT-4o-mini

The goals of this example are:

Create an Editor Agent that automatically generates multiple variations of video captions.
Demonstrate how to use OpenAI’s GPT-4o-mini model to enhance human-created video captions with a FastAPI-based agent.

Prerequisites

Before you begin, ensure that you have:

Created a virtual Python environment.
Installed all necessary dependencies.
Obtained an OpenAI API key.
Set up authentication with Encord.

Run the following commands to set up your environment:

python -m venv venv                 # Create a virtual Python environment  
source venv/bin/activate            # Activate the virtual environment  
python -m pip install encord-agents langchain-openai "fastapi[standard]" openai  # Install required dependencies  
export OPENAI_API_KEY="<your-api-key>"     # Set your OpenAI API key  
export ENCORD_SSH_KEY_FILE="/path/to/your/private/key"  # Define your Encord access key  

Project Setup

Create a Project containing videos in Encord. This example requires an Ontology with four text classifications:

One text classification for human-created summaries of what is happening in the video.
Three text classifications to be automatically filled by the LLM.

Ontology

Ontology JSON and Script

    {
      "objects": [],
      "classifications": [
        {
          "id": "1",
          "featureNodeHash": "GCH8VHIK",
          "attributes": [
            {
              "id": "1.1",
              "name": "Caption",
              "type": "text",
              "required": false,
              "featureNodeHash": "Yg7xXEfC"
            }
          ]
        },
        {
          "id": "2",
          "featureNodeHash": "PwQAwYid",
          "attributes": [
            {
              "id": "2.1",
              "name": "Caption Rephrased 1",
              "type": "text",
              "required": false,
              "featureNodeHash": "aQdXJwbG"
            }
          ]
        },
        {
          "id": "3",
          "featureNodeHash": "3a/aSnHO",
          "attributes": [
            {
              "id": "3.1",
              "name": "Caption Rephrased 2",
              "type": "text",
              "required": false,
              "featureNodeHash": "8zY6H62x"
            }
          ]
        },
        {
          "id": "4",
          "featureNodeHash": "FNjXp5TU",
          "attributes": [
            {
              "id": "4.1",
              "name": "Caption Rephrased 3",
              "type": "text",
              "required": false,
              "featureNodeHash": "sKg1Kq/m"
            }
          ]
        }
      ]
    }

To construct the Ontology used in this example, run the following script:

import json
from encord.objects.ontology_structure import OntologyStructure
from encord.objects.attributes import TextAttribute

structure = OntologyStructure()
caption = structure.add_classification()  # Human-written caption
caption.add_attribute(TextAttribute, "Caption")
re1 = structure.add_classification()  # Filled in by the LLM
re1.add_attribute(TextAttribute, "Caption Rephrased 1")
re2 = structure.add_classification()  # Filled in by the LLM
re2.add_attribute(TextAttribute, "Caption Rephrased 2")
re3 = structure.add_classification()  # Filled in by the LLM
re3.add_attribute(TextAttribute, "Caption Rephrased 3")

print(json.dumps(structure.to_dict(), indent=2))

create_ontology = False
if create_ontology:
    from encord.user_client import EncordUserClient
    client = EncordUserClient.create_with_ssh_private_key()  # Look in auth section for authentication
    client.create_ontology("title", "description", structure)

The workflow for this agent is:

A human watches the video and enters a caption in the first text field.
The agent is then triggered and generates three additional caption variations for review.

Each video is first annotated by a human (ANNOTATE stage).
Next, a data agent automatically generates alternative captions (AGENT stage).
Finally, a human reviews all four captions (REVIEW stage) before the task is marked complete.

If no human caption is present when the agent is triggered, the task is sent back for annotation. If the review stage results in rejection, the task is also returned for re-annotation.

Workflow

Create the Agent

This section provides the complete code for creating your editor agent, along with an explanation of its internal workings.

Agent Setup Steps

Set up imports and create a Pydantic model for our LLM’s structured output.
Create a detailed system prompt for the LLM that explains exactly what kind of rephrasing we want.
Configure the LLM to use structured outputs based on our model.
Create a helper function to prompt the model with both text and image.
Define the agent to handle the recaptioning. This includes:
- Retrieving the existing human-created caption, prioritizing captions from the current frame or falling back to frame zero.
- Sending the first frame of the video along with the human caption to the LLM.
- Processing the response from the LLM, which provides three alternative phrasings of the original caption.
- Updating the label row with the new captions, replacing any existing ones.

# 1. Set up imports and create a Pydantic model for our LLM's structured output.
import os
from typing import Annotated

import numpy as np
from encord.exceptions import LabelRowError
from encord.objects.classification_instance import ClassificationInstance
from encord.objects.ontology_labels_impl import LabelRowV2
from langchain_openai import ChatOpenAI
from numpy.typing import NDArray
from pydantic import BaseModel

from encord_agents import FrameData
from encord_agents.gcp import Depends, editor_agent
from encord_agents.gcp.dependencies import Frame, dep_single_frame



# The response model for the agent to follow.
class AgentCaptionResponse(BaseModel):
    rephrase_1: str
    rephrase_2: str
    rephrase_3: str


# 2. Create a detailed system prompt for the LLM that explains exactly what kind of rephrasing we want.
SYSTEM_PROMPT = """
You are a helpful assistant that rephrases captions.

I will provide you with a video caption and an image of the scene of the video. 

The captions follow this format:

"The droid picks up <cup_0> and puts it on the <table_0>."

The captions that you make should replace the tags, e.g., <cup_0>, with the actual object names.
The replacements should be consistent with the scene.

Here are three rephrases: 

1. The droid picks up the blue mug and puts it on the left side of the table.
2. The droid picks up the cup and puts it to the left of the plate.
3. The droid is picking up the mug on the right side of the table and putting it down next to the plate.

You will rephrase the caption in three different ways, as above, the rephrases should be

1. Diverse in terms of adjectives, object relations, and object positions.
2. Sound in relation to the scene. You cannot talk about objects you cannot see.
3. Short and concise. Keep it within one sentence.

"""

# 3. Configure the LLM to use structured outputs based on our model.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.4, api_key=os.environ["OPENAI_API_KEY"])
llm_structured = llm.with_structured_output(AgentCaptionResponse)


# 4. Create a helper function to prompt the model with both text and image.
def prompt_gpt(caption: str, image: Frame) -> AgentCaptionResponse:
    prompt = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": f"Video caption: `{caption}`"},
                image.b64_encoding(output_format="openai"),
            ],
        },
    ]
    return llm_structured.invoke(prompt)


# 5. Define the agent to handle the recaptioning.
@editor_agent()
def my_agent(
    frame_data: FrameData,  # FrameData is automatically received by the agent
    label_row: LabelRowV2,
    frame_content: Annotated[NDArray[np.uint8], Depends(dep_single_frame)],
) -> None:
    # Retrieve the existing human-created caption, prioritizing captions from the current frame or falling back to frame zero.
    cap, *rs = label_row.ontology_structure.classifications

    # Read the existing human caption
    instances = label_row.get_classification_instances(
        filter_ontology_classification=cap, filter_frames=[0, frame_data.frame]
    )
    if not instances:
        # nothing to do if there are no human labels
        return
    elif len(instances) > 1:

        def order_by_current_frame_else_frame_0(
            instance: ClassificationInstance,
        ) -> int:
            try:
                instance.get_annotation(frame_data.frame)
                return 2  # The best option
            except LabelRowError:
                pass
            try:
                instance.get_annotation(0)
                return 1
            except LabelRowError:
                return 0

        instance = sorted(instances, key=order_by_current_frame_else_frame_0)[-1]
    else:
        instance = instances[0]

    # Read the actual string caption
    caption = instance.get_answer()

    # Send the first frame of the video along with the human caption to the LLM.
    frame = Frame(frame=0, content=frame_content)
    response = prompt_gpt(caption, frame)

    # Process the response from the LLM, which provides three alternative phrasings of the original caption.
    # Upsert the new captions into the label row, replacing any existing ones.
    for r, t in zip(rs, [response.rephrase_1, response.rephrase_2, response.rephrase_3]):
        # Overwrite any existing re-captions
        existing_instances = label_row.get_classification_instances(filter_ontology_classification=r)
        for existing_instance in existing_instances:
            label_row.remove_classification(existing_instance)

        # Create new instances
        ins = r.create_instance()
        ins.set_answer(t, attribute=r.attributes[0])
        ins.set_for_frames(0)
        label_row.add_classification_instance(ins)

    label_row.save()
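
The caption selection in the agent ranks each candidate instance: 2 if it is annotated on the current frame, 1 if it is annotated on frame zero, and 0 otherwise, then keeps the highest-ranked one. The same idea can be sketched without the Encord SDK; the captions and frame sets below are made-up stand-ins for classification instances, and `rank_caption` is a hypothetical name.

```python
def rank_caption(frames_with_caption: set[int], current_frame: int) -> int:
    """Return 2 if the caption is on the current frame, 1 if on frame 0, else 0."""
    if current_frame in frames_with_caption:
        return 2
    if 0 in frames_with_caption:
        return 1
    return 0


# Hypothetical captions, each paired with the frames it is annotated on.
candidates = [
    ("caption on frame 0", {0}),
    ("caption on frame 12", {12}),
]
current_frame = 12

# Keep the candidate with the highest rank.
best_text, _ = max(candidates, key=lambda c: rank_caption(c[1], current_frame))
print(best_text)  # caption on frame 12
```

Using `max` with a key function avoids sorting the whole list when only the best candidate is needed.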

Click here for a concrete Vision Language Action model use-case.

This example requires the following dependencies:

encord-agents
langchain-openai
fastapi[standard]
openai

To set up and test the agent locally:

Save the dependencies above into a requirements.txt file.

Set up your Python environment and run the agent:

python -m venv venv
source venv/bin/activate
python -m pip install -r requirements.txt

ENCORD_SSH_KEY_FILE=/path/to/your_private_key \
OPENAI_API_KEY=<your-api-key> \
fastapi dev main.py

(Replace /path/to/your_private_key and <your-api-key> with your actual credentials.)

In a separate terminal, test the agent:
```
source venv/bin/activate
encord-agents test local my_agent '<url_from_the_label_editor>'
```
(Replace <url_from_the_label_editor> with the URL from your Encord Label Editor session.)

PDF OCR Encord Agent

The goal is to create an Editor Agent that extracts text from target bounding boxes in a PDF using the Document AI API. This Agent performs the following:

Searches for bounding boxes in your PDF that have a Text or OCR text attribute.
Rasterizes PDF pages.
Crops each bounding box.
Sends the crop to Google Document AI OCR.
Writes the extracted text back into the attribute on the object.
Saves the label row after each batch.
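
The last step, saving after each batch rather than after each crop, can be sketched independently of the OCR service. This is an illustrative outline only: `batched` and `process_in_batches` are hypothetical helpers, and `process` and `save` stand in for the OCR call and the label row save.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: list[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # Remaining items that did not fill a whole batch
        yield batch


def process_in_batches(items, batch_size, process, save) -> int:
    """Call `process(batch)` and then `save()` for every batch; return the batch count."""
    count = 0
    for batch in batched(items, batch_size):
        process(batch)
        save()
        count += 1
    return count
```

Saving once per batch keeps the number of write requests low while still persisting partial progress if a later OCR call fails.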

Prerequisites

Before you begin, ensure that you have:

Created a virtual Python environment.
Installed all necessary dependencies.
Set up authentication with Encord.

Run the following commands to set up your environment:

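python -m venv venv       # Create a virtual Python environment
source venv/bin/activate  # Activate the virtual environment
# Install the packages imported by main.py (pdf2image also needs the Poppler utilities on your system)
python -m pip install encord-agents google-cloud-documentai pdf2image opencv-python numpy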

Project Setup

Create Ontology

For the Agent to work, the Ontology for your Project must contain a Bounding Box object with a Text attribute named Text or OCR. For example, create an Ontology with the following:

PDF Document Name (bounding box)
- Text (text attribute)
Error (bounding box)
- OCR (text attribute)
PDF Signature Field (bounding box)
- Signatory Name (text attribute)
- Status (radio button)
  - Signed (radio button option)
  - Unsigned (radio button option)

Create Dataset

Create a Dataset that contains PDFs.

Create Project

Create a Project with the following:

The Ontology you created in the Create Ontology step
Dataset with PDFs
Standard Workflow

Host the Agent

Use the following main.py to host the Agent.

import json
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Annotated, Any

import cv2
import numpy as np
from encord.objects.attributes import TextAttribute
from encord.objects.common import Shape
from encord.objects.coordinates import BoundingBoxCoordinates
from encord.objects.ontology_labels_impl import LabelRowV2
from encord_agents.core.data_model import FrameData
from encord_agents.core.dependencies import Depends
from encord_agents.gcp.dependencies import dep_asset
from encord_agents.gcp.wrappers import editor_agent
from google.cloud import documentai
from pdf2image import convert_from_path

# Set your Google Cloud Document AI API credentials path
DOCUMENT_AI_API_DETAILS = os.environ.get("DOCUMENT_AI_API_KEY_FILE")
if not DOCUMENT_AI_API_DETAILS:
    raise ValueError("No DOCUMENT_AI_API_KEY_FILE defined in environment variable.")

if os.path.exists(DOCUMENT_AI_API_DETAILS):
    with open(DOCUMENT_AI_API_DETAILS, "r") as f:
        document_ai_api_details = json.load(f)

else:
    document_ai_api_details = json.loads(DOCUMENT_AI_API_DETAILS)


@editor_agent()
def extract_all_bbox_text_from_pdf(
    frame_data: FrameData,
    label_row: LabelRowV2,
    asset: Annotated[Path, Depends(dep_asset)],
) -> None:
    bbox_text_extraction(label_row, asset, frame_data.frame, document_ai_api_details)


def encord_bbox_to_opencv(
    bbox: BoundingBoxCoordinates, frame_width: int, frame_height: int
) -> tuple[int, int, int, int]:
    x = bbox.top_left_x
    y = bbox.top_left_y
    w = bbox.width
    h = bbox.height
    return (
        int(x * frame_width),
        int(y * frame_height),
        int(w * frame_width),
        int(h * frame_height),
    )


@dataclass
class CropData:
    """Data class to store information about a cropped region from a document."""

    feature_hash: str
    bbox: BoundingBoxCoordinates
    object: Any
    image_bytes: bytes
    frame_num: int


def get_crops(lr: LabelRowV2, file_path: Path, frame: int) -> dict[str, CropData]:
    """Extract crops from the PDF and keep them in memory."""
    crops: dict[str, CropData] = {}
    object_instances = lr.get_object_instances()

    # Get all bounding boxes first
    for obj in object_instances:
        if obj.ontology_item.shape == Shape.BOUNDING_BOX:
            for attr in obj.ontology_item.attributes:
                if isinstance(attr, TextAttribute):
                    crops[obj.object_hash] = CropData(
                        feature_hash=obj.feature_hash,
                        bbox=obj.get_annotations()[0].coordinates,
                        object=obj,
                        image_bytes=b"",  # Filled in below once the page is rasterized
                        frame_num=obj.get_annotations()[0].frame,
                    )

    # Convert each PDF page to its own RGB array (pages may differ in size)
    pages = [np.array(page) for page in convert_from_path(file_path, fmt="jpeg")]

    # Extract crops and convert to bytes
    for v in crops.values():
        frame_img = pages[v.frame_num]
        frame_height, frame_width = frame_img.shape[:2]
        x, y, w, h = encord_bbox_to_opencv(v.bbox, frame_width, frame_height)
        crop_img = frame_img[y : y + h, x : x + w]
        # pdf2image yields RGB data, while OpenCV expects BGR when encoding
        _, buffer = cv2.imencode(".png", cv2.cvtColor(crop_img, cv2.COLOR_RGB2BGR))
        v.image_bytes = buffer.tobytes()

    return crops


def process_batch_ocr(
    client: documentai.DocumentProcessorServiceClient,
    project_id: str,
    location: str,
    processor_id: str,
    batch: list[tuple[str, bytes, Any]],
) -> None:
    """Process a batch of images for OCR."""
    for object_hash, image_bytes, obj in batch:
        document = {
            "content": image_bytes,
            "mime_type": "image/png",
        }

        request = {
            "name": client.processor_path(project_id, location, processor_id),
            "raw_document": document,
        }

        result = client.process_document(request=request)
        ocr_text = result.document.text
        print(f"Extracted text for object {object_hash}: {ocr_text}")

        for attr in obj.ontology_item.attributes:
            if isinstance(attr, TextAttribute):
                obj.set_answer(attribute=attr, answer=ocr_text, overwrite=True)


def bbox_text_extraction(
    lr: LabelRowV2,
    asset: Path,
    frame_number: int,
    document_ai_api_details: dict[str, Any],
) -> None:
    """Extract text from bounding boxes in a PDF using OCR."""
    project_id = document_ai_api_details["project_id"]
    location = document_ai_api_details["location"]
    processor_id = document_ai_api_details["processor_id"]

    # Get crops in memory
    crops = get_crops(lr, asset, frame_number)

    # Process OCR in batches
    client = documentai.DocumentProcessorServiceClient()
    batch_size = 10  # Adjust based on your needs
    batch: list[tuple[str, bytes, Any]] = []

    for object_hash, crop_data in crops.items():
        batch.append((object_hash, crop_data.image_bytes, crop_data.object))

        if len(batch) >= batch_size:
            process_batch_ocr(client, project_id, location, processor_id, batch)
            batch = []
            lr.save()  # Save after each batch

    # Process remaining items
    if batch:
        process_batch_ocr(client, project_id, location, processor_id, batch)
        lr.save()
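
The `encord_bbox_to_opencv` helper above treats the bounding box coordinates as fractions of the page size and scales them to pixels. A small worked example of that arithmetic follows; `normalized_bbox_to_pixels` is a hypothetical stand-alone version, and the box and page size are made up.

```python
def normalized_bbox_to_pixels(
    x: float, y: float, w: float, h: float, page_width: int, page_height: int
) -> tuple[int, int, int, int]:
    """Scale a normalized (0 to 1) box to integer pixel values, truncating toward zero."""
    return (
        int(x * page_width),
        int(y * page_height),
        int(w * page_width),
        int(h * page_height),
    )


# Hypothetical box: top-left at (25%, 50%), 50% wide, 25% tall, on a 1000 x 2000 pixel page.
px, py, pw, ph = normalized_bbox_to_pixels(0.25, 0.5, 0.5, 0.25, 1000, 2000)
print(px, py, pw, ph)  # 250 1000 500 500
```

The crop is then the slice of rows `py` to `py + ph` and columns `px` to `px + pw` of the page image.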

Annotate PDF

Annotate the PDFs using bounding boxes that have a Text or OCR text attribute.

Run the Agent

After annotating the PDFs, run the Editor Agent.
