This functionality allows you to apply your own OCR model to specific objects selected directly within the Encord platform.
When you trigger your agent from the Encord app after selecting objects, the platform automatically sends a list of objectHashes to your agent. Your agent can then use the dep_objects method to gain immediate access to these specific object instances, which greatly simplifies integrating your OCR model for targeted processing.
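To make this concrete, here is a minimal sketch of that pattern. The /ocr_agent route name and the run_ocr helper are illustrative assumptions, not part of the library; the dependency wiring mirrors the examples later in this section.

from typing_extensions import Annotated

from encord.objects.ontology_labels_impl import LabelRowV2
from encord.objects.ontology_object_instance import ObjectInstance
from fastapi import Depends

from encord_agents.fastapi.cors import get_encord_app
from encord_agents.fastapi.dependencies import FrameData, dep_label_row, dep_objects

app = get_encord_app()


@app.post("/ocr_agent")
def ocr_agent(
    frame_data: FrameData,
    lr: Annotated[LabelRowV2, Depends(dep_label_row)],
    objects: Annotated[list[ObjectInstance], Depends(dep_objects)],
):
    # `objects` holds exactly the instances whose objectHashes the
    # platform sent when you triggered the agent.
    for obj in objects:
        text = run_ocr(lr, obj)  # hypothetical: call your own OCR model here
        # ... write `text` back onto the instance's text attribute ...
    lr.save()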
Test the Agent
Save the above code as agent.py.
Run the following command in your terminal to start the agent in debug mode:
uvicorn agent:app --reload --port 8080
Open your Project in the Encord platform and navigate to a frame containing an object that you want to act on. Select an object in the bottom-left sidebar and click Copy URL.
The URL should have roughly this format: "https://app.encord.com/label_editor/{project_hash}/{data_hash}/{frame}/0?other_query_params&objectHash={objectHash}".
In another shell operating from the same working directory, source your virtual environment and test the agent.
source venv/bin/activate
encord-agents test local agent '<your_url>'
Refresh your browser to see the action taken by the agent. Once the test runs successfully, the agent is ready to be deployed. Visit the deployment documentation to learn more.
Run the following commands to set up your environment:
python -m venv venv                                     # Create a virtual Python environment
source venv/bin/activate                                # Activate the virtual environment
python -m pip install "fastapi[standard]" encord-agents anthropic  # Install required dependencies
export ANTHROPIC_API_KEY="<your_api_key>"               # Set your Anthropic API key
export ENCORD_SSH_KEY_FILE="/path/to/your/private/key"  # Define your Encord SSH key
Project Setup
Create a Project with visual content (images, image groups, image sequences, or videos) in Encord. This example uses the following Ontology, but any Ontology containing classifications can be used.
The aim is to trigger an agent that transforms a labeling task from Figure A to Figure B.
Figure A: No classification labels.
Figure B: Multiple nested classification labels generated by an LLM.
Create the Agent
This section provides the complete code for creating your editor agent, along with an explanation of its internal workings.
Agent Setup Steps
Import dependencies, authenticate with Encord, and set up the Project. Ensure you insert your Project’s unique identifier.
Create a data model and a system prompt based on the Project Ontology to tell Claude how to structure its response.
Set up an Anthropic API client to establish communication with the Claude model.
Define the Editor Agent. This includes:
Receiving frame data using FastAPI’s Form dependency.
Retrieving the associated label row and frame content using Encord Agents’ dependencies.
Constructing a Frame object from the content.
Sending the frame image to Claude for analysis.
Parsing Claude’s response into classification instances.
Adding these classifications to the label row and saving the updated data.
# 1. Import dependencies and set up the Project. The CORS middleware is crucial
#    as it allows the Encord platform to make requests to your API.
import os

import numpy as np
from anthropic import Anthropic
from encord.objects.ontology_labels_impl import LabelRowV2
from fastapi import Depends
from numpy.typing import NDArray
from typing_extensions import Annotated

from encord_agents.core.data_model import Frame
from encord_agents.core.ontology import OntologyDataModel
from encord_agents.core.utils import get_user_client
from encord_agents.fastapi.cors import get_encord_app
from encord_agents.fastapi.dependencies import (
    FrameData,
    dep_label_row,
    dep_single_frame,
)

# Initialize the FastAPI app with Encord's CORS middleware.
app = get_encord_app()

# 2. Set up the Project and create a data model based on the Ontology.
client = get_user_client()
project = client.get_project("<your_project_hash>")
data_model = OntologyDataModel(project.ontology_structure.classifications)

# 3. Set up Claude and create the system prompt that tells Claude how to
#    structure its response.
system_prompt = f"""
You're a helpful assistant that's supposed to help fill in json objects
according to this schema:

```json
{data_model.model_json_schema_str}
```

Please only respond with valid json.
"""
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
anthropic_client = Anthropic(api_key=ANTHROPIC_API_KEY)


# 4. Define the Editor Agent.
@app.post("/frame_classification")
async def classify_frame(
    frame_data: FrameData,
    lr: Annotated[LabelRowV2, Depends(dep_label_row)],
    content: Annotated[NDArray[np.uint8], Depends(dep_single_frame)],
):
    """Classify a frame using Claude.

    FastAPI parses the incoming request body into `frame_data`, and the
    `dep_label_row` and `dep_single_frame` dependencies resolve `lr` and
    `content`.
    """
    # Construct a `Frame` object from the content.
    frame = Frame(frame=frame_data.frame, content=content)
    # Send the frame image to Claude for analysis.
    message = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=system_prompt,
        messages=[
            {
                "role": "user",
                "content": [frame.b64_encoding(output_format="anthropic")],
            }
        ],
    )
    try:
        # Parse Claude's response into classification instances.
        classifications = data_model(message.content[0].text)
        for clf in classifications:
            clf.set_for_frames(frame_data.frame, confidence=0.5, manual_annotation=False)
            # Add the classification to the label row.
            lr.add_classification_instance(clf)
    except Exception:
        import traceback

        traceback.print_exc()
        print(f"Response from model: {message.content[0].text}")

    # Save the updated label row.
    lr.save()
Test the Agent
In your current terminal, run the following command to start the FastAPI server in development mode with auto-reload enabled:
uvicorn main:app --reload --port 8080
Open your Project in the Encord platform and navigate to a frame you want to add a classification to. Copy the URL from your browser.
The URL should have the following format: "https://app.encord.com/label_editor/{project_hash}/{data_hash}/{frame}".
In another shell operating from the same working directory, source your virtual environment and test the agent.
source venv/bin/activate
encord-agents test local frame_classification '<your_url>'
To see if the test is successful, refresh your browser to view the classifications generated by Claude. Once the test runs successfully, you are ready to deploy your agent. Visit the deployment documentation to learn more.
Create an editor agent that can convert generic object annotations (class-less coordinates) into class-specific annotations with nested attributes like descriptions, radio buttons, and checklists.
Run the following commands to set up your environment:
python -m venv venv                                     # Create a virtual Python environment
source venv/bin/activate                                # Activate the virtual environment
python -m pip install encord-agents anthropic           # Install required dependencies
export ANTHROPIC_API_KEY="<your_api_key>"               # Set your Anthropic API key
export ENCORD_SSH_KEY_FILE="/path/to/your/private/key"  # Define your Encord SSH key
Project Setup
Create a Project with visual content (images, image groups, image sequences, or videos) in Encord. This example uses the following Ontology, but any Ontology containing classifications can be used, provided the object types are the same and there is one entry called “generic”.
The goal is to trigger an agent that takes a labeling task from Figure A to Figure B, below:
Figure A: No classification labels.
Figure B: Multiple nested classification labels generated by an LLM.
Create the Agent
This section provides the complete code for creating your editor agent, along with an explanation of its internal workings.
Agent Setup Steps
Import Dependencies and Configure Project: Import necessary dependencies and set up your project. Remember to insert your project’s unique identifier.
Create a data model and a system prompt based on the Project Ontology to tell Claude how to structure its response.
Initialize Anthropic API Client: Set up an API client to establish communication with the Claude model.
Define the Editor Agent:
Arguments are automatically injected when the agent is called. For details, see the documentation on dependency injection.
The dep_object_crops dependency filters to include only “generic” object crops that still need classification.
Call Claude with Image Crops: Use the crop.b64_encoding method to send each image crop to Claude in a format it understands.
Parse Claude’s Response and Update Labels: The data_model parses Claude’s JSON response, creating a new Encord object instance. If successful, the original generic object is replaced with the newly classified instance on the label row.
Save Labels.
# 1. Import dependencies, authenticate with Encord, and set up the Project.
import os

from anthropic import Anthropic
from encord.objects.ontology_labels_impl import LabelRowV2
from fastapi import Depends
from typing_extensions import Annotated

from encord_agents.core.data_model import InstanceCrop
from encord_agents.core.ontology import OntologyDataModel
from encord_agents.core.utils import get_user_client
from encord_agents.fastapi.cors import get_encord_app
from encord_agents.fastapi.dependencies import (
    FrameData,
    dep_label_row,
    dep_object_crops,
)

# Initialize the FastAPI app with Encord's CORS middleware.
app = get_encord_app()

# User client and Ontology setup.
client = get_user_client()
# Ensure you insert your Project's unique identifier.
project = client.get_project("<project_id>")
generic_ont_obj, *other_objects = sorted(
    project.ontology_structure.objects,
    key=lambda o: o.title.lower() == "generic",
    reverse=True,
)

# 2. Create a data model and a system prompt based on the Project Ontology
#    to tell Claude how to structure its response.
data_model = OntologyDataModel(other_objects)
system_prompt = f"""
You're a helpful assistant that's supposed to help fill in json objects
according to this schema:

`{data_model.model_json_schema_str}`

Please only respond with valid json.
"""

# 3. Set up an Anthropic API client to establish communication with the Claude model.
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
anthropic_client = Anthropic(api_key=ANTHROPIC_API_KEY)


# 4. Define the Editor Agent.
@app.post("/object_classification")
async def classify_objects(
    frame_data: FrameData,
    lr: Annotated[LabelRowV2, Depends(dep_label_row)],
    crops: Annotated[
        list[InstanceCrop],
        Depends(dep_object_crops(filter_ontology_objects=[generic_ont_obj])),
    ],
):
    """Classify generic objects using Claude."""
    changes = False
    # Iterate through each object crop.
    for crop in crops:
        # 5. Call Claude with the image crop.
        message = anthropic_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=1024,
            system=system_prompt,
            messages=[
                {
                    "role": "user",
                    "content": [crop.b64_encoding(output_format="anthropic")],
                }
            ],
        )
        # 6. Parse Claude's response and update the labels.
        try:
            # Parse Claude's response into an updated object instance.
            instance = data_model(message.content[0].text)
            coordinates = crop.instance.get_annotation(frame=frame_data.frame).coordinates
            instance.set_for_frames(
                coordinates=coordinates,
                frames=frame_data.frame,
                confidence=0.5,
                manual_annotation=False,
            )
            # Update the label row by removing the original generic object
            # and adding the newly classified instance.
            lr.remove_object(crop.instance)
            lr.add_object_instance(instance)
            changes = True
        except Exception:
            import traceback

            traceback.print_exc()
            print(f"Response from model: {message.content[0].text}")

    # 7. Save the labels.
    if changes:
        lr.save()
Test the Agent
In your current terminal, run the following command to start the FastAPI server in development mode with auto-reload enabled:
fastapi dev agent.py --port 8080
Open your Project in the Encord platform and navigate to a frame you want to add a classification to. Copy the URL from your browser.
The URL should have roughly this format: "https://app.encord.com/label_editor/{project_hash}/{data_hash}/{frame}".
In another shell operating from the same working directory, source your virtual environment and test the agent:
source venv/bin/activate
encord-agents test local object_classification '<your_url>'
To see if the test is successful, refresh your browser to view the classifications generated by Claude. Once the test runs successfully, you are ready to deploy your agent. Visit the deployment documentation to learn more.
A human watches the video and enters a caption in the first text field.
The agent is then triggered and generates three additional caption variations for review.
Each video is first annotated by a human (ANNOTATE stage).
Next, a data agent automatically generates alternative captions (AGENT stage).
Finally, a human reviews all four captions (REVIEW stage) before the task is marked complete.
If no human caption is present when the agent is triggered, the task is sent back for annotation.
If the review stage results in rejection, the task is also returned for re-annotation. A sketch of this stage routing follows below.
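The routing in the last two bullets can be expressed as a task agent that returns the pathway each task should follow. Below is a minimal sketch, assuming encord-agents' task Runner and pathway names "annotate" and "review" (both hypothetical; use the pathway names defined in your Workflow). The caption generation itself is handled by the editor agent shown later in this section.

from encord.objects.ontology_labels_impl import LabelRowV2

from encord_agents.tasks import Runner

runner = Runner(project_hash="<your_project_hash>")


@runner.stage("AGENT")
def recaption_stage(lr: LabelRowV2) -> str:
    # By convention (the same one the editor agent below uses), the first
    # classification in the Ontology holds the human caption.
    caption, *_ = lr.ontology_structure.classifications
    if not lr.get_classification_instances(filter_ontology_classification=caption):
        # No human caption yet: send the task back for annotation.
        return "annotate"  # hypothetical pathway name
    # ... generate and store the three rephrasings here ...
    return "review"  # hypothetical pathway name


if __name__ == "__main__":
    runner.run()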
Create the Agent
This section provides the complete code for creating your editor agent, along with an explanation of its internal workings.
Agent Setup Steps
Set up imports and create a Pydantic model for the LLM's structured output.
Create a detailed system prompt for the LLM that explains exactly what kind of rephrasing we want.
Configure the LLM to use structured outputs based on our model.
Create a helper function to prompt the model with both text and image.
Initialize the FastAPI app with the required CORS middleware.
Define the agent to handle the recaptioning. This includes:
Retrieving the existing human-created caption, prioritizing captions from the current frame or falling back to frame zero.
Sending the first frame of the video along with the human caption to the LLM.
Processing the response from the LLM, which provides three alternative phrasings of the original caption.
Updating the label row with the new captions, replacing any existing ones.
# 1. Set up imports and create a Pydantic model for the LLM's structured output.
import os
from typing import Annotated

import numpy as np
from encord.exceptions import LabelRowError
from encord.objects.classification_instance import ClassificationInstance
from encord.objects.ontology_labels_impl import LabelRowV2
from fastapi import Depends
from langchain_openai import ChatOpenAI
from numpy.typing import NDArray
from pydantic import BaseModel

from encord_agents import FrameData
from encord_agents.fastapi.cors import get_encord_app
from encord_agents.fastapi.dependencies import Frame, dep_label_row, dep_single_frame


# The response model the agent must follow.
class AgentCaptionResponse(BaseModel):
    rephrase_1: str
    rephrase_2: str
    rephrase_3: str


# 2. Create a detailed system prompt for the LLM that explains exactly what
#    kind of rephrasing we want.
SYSTEM_PROMPT = """You are a helpful assistant that rephrases captions.

I will provide you with a video caption and an image of the scene of the video.

The captions follow this format:
"The droid picks up <cup_0> and puts it on the <table_0>."

The captions that you make should replace the tags, e.g., <cup_0>, with the actual object names.
The replacements should be consistent with the scene.

Here are three rephrases:
1. The droid picks up the blue mug and puts it on the left side of the table.
2. The droid picks up the cup and puts it to the left of the plate.
3. The droid is picking up the mug on the right side of the table and putting it down next to the plate.

You will rephrase the caption in three different ways, as above, the rephrases should be
1. Diverse in terms of adjectives, object relations, and object positions.
2. Sound in relation to the scene. You cannot talk about objects you cannot see.
3. Short and concise. Keep it within one sentence.
"""

# 3. Configure the LLM to use structured outputs based on our model.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.4, api_key=os.environ["OPENAI_API_KEY"])
llm_structured = llm.with_structured_output(AgentCaptionResponse)


# 4. Create a helper function to prompt the model with both text and image.
def prompt_gpt(caption: str, image: Frame) -> AgentCaptionResponse:
    prompt = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": f"Video caption: `{caption}`"},
                image.b64_encoding(output_format="openai"),
            ],
        },
    ]
    return llm_structured.invoke(prompt)


# 5. Initialize the FastAPI app with the required CORS middleware.
app = get_encord_app()


# 6. Define the agent to handle the recaptioning.
@app.post("/my_agent")
def my_agent(
    frame_data: FrameData,
    label_row: Annotated[LabelRowV2, Depends(dep_label_row)],
    frame_content: Annotated[NDArray[np.uint8], Depends(dep_single_frame)],
) -> None:
    # Get the relevant Ontology information. Recall that we expect
    # [human annotation, llm recaption 1, llm recaption 2, llm recaption 3]
    # in the Ontology.
    cap, *rs = label_row.ontology_structure.classifications

    # Retrieve the existing human-created caption, prioritizing captions from
    # the current frame and falling back to frame zero.
    instances = label_row.get_classification_instances(
        filter_ontology_classification=cap, filter_frames=[0, frame_data.frame]
    )
    if not instances:
        # Nothing to do if there are no human labels.
        return
    elif len(instances) > 1:

        def order_by_current_frame_else_frame_0(
            instance: ClassificationInstance,
        ) -> int:
            try:
                instance.get_annotation(frame_data.frame)
                return 2  # The best option
            except LabelRowError:
                pass
            try:
                instance.get_annotation(0)
                return 1
            except LabelRowError:
                return 0

        instance = sorted(instances, key=order_by_current_frame_else_frame_0)[-1]
    else:
        instance = instances[0]

    # Read the actual string caption.
    caption = instance.get_answer()

    # Send the first frame of the video along with the human caption to the LLM.
    frame = Frame(frame=0, content=frame_content)
    response = prompt_gpt(caption, frame)

    # Process the LLM's response, which contains three different rephrasings of
    # the original caption, and update the label row, replacing any existing ones.
    for r, t in zip(rs, [response.rephrase_1, response.rephrase_2, response.rephrase_3]):
        # Overwrite any existing re-captions.
        existing_instances = label_row.get_classification_instances(filter_ontology_classification=r)
        for existing_instance in existing_instances:
            label_row.remove_classification(existing_instance)

        # Create a new instance.
        ins = r.create_instance()
        ins.set_answer(t, attribute=r.attributes[0])
        ins.set_for_frames(0)
        label_row.add_classification_instance(ins)

    label_row.save()
Test the Agent
In your current terminal, run the following command to run the FastAPI server:
ENCORD_SSH_KEY_FILE=/path/to/your_private_key \
OPENAI_API_KEY=<your-api-key> \
fastapi dev main.py
Open your Project in the Encord platform, navigate to a video frame, and add your initial caption. Copy the URL from your browser.
In another shell operating from the same working directory, source your virtual environment and test the agent:
source venv/bin/activate
encord-agents test local my_agent '<your_url>'
Refresh your browser to view the three AI-generated caption variations. Once the test runs successfully, you are ready to deploy your agent. Visit the deployment documentation to learn more.