# Built-in Agents

This page lists the **built-in agents** that are available in the **Agents Catalog** and can be added to your projects with minimal configuration.

<Note>Each agent card in the catalog includes a **Learn more** button that opens this documentation page for detailed configuration information.</Note>

## OpenAI Image Description

* **What it does**: Generates a detailed text description of an image using OpenAI vision models.
* **Best for**: Scene descriptions, captions, and summarizing what is happening in an image.
* **Input data**: Image data units.
* **Output**: A single text classification containing the description.
* **Configuration**:

| Parameter         | Description                                                                                                                          | Example                                                                                                                                                     |
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Model**         | Choose which OpenAI model to use — a faster "mini" model for speed, or a larger model for higher quality at higher cost.             | Use `gpt-4o-mini` for high-volume batch processing, or `gpt-4o` when description accuracy is critical.                                                      |
| **Detail**        | Controls how much visual detail the model sees. Higher detail improves description quality but can increase latency and cost.        | Use `low` for thumbnail images or quick content moderation; use `high` for medical scans or fine-grained scene analysis.                                    |
| **Custom Prompt** | Optional extra instructions that steer the description.                                                                              | `"Focus on the number and position of people in the scene"` for crowd analysis, or `"Describe only the text visible in the image"` for OCR-style summaries. |
| **Temperature**   | Controls how "creative" the wording is. Lower values give more deterministic output; higher values produce more varied descriptions. | Use `0.2` for consistent, factual product descriptions, or `0.8` for diverse creative captions.                                                             |

* **Credentials**: Requires an OpenAI API key.
* **Ontology requirements**: A **classification** with a **TEXT** attribute where the image description is stored.
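
As a rough illustration of how these four parameters fit together, the sketch below models them as a small Python dataclass. The class name `ImageDescriptionConfig`, its field names, and the validation rules are assumptions made for this example; they are not the catalog's actual schema. The temperature limit reflects the 0 to 2 range accepted by OpenAI's API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical names: the fields mirror the table above, not a real Encord schema.
ALLOWED_DETAIL = {"low", "high"}  # the two values mentioned in the table


@dataclass(frozen=True)
class ImageDescriptionConfig:
    model: str = "gpt-4o-mini"
    detail: str = "low"
    custom_prompt: Optional[str] = None
    temperature: float = 0.2

    def __post_init__(self) -> None:
        if self.detail not in ALLOWED_DETAIL:
            raise ValueError(f"detail must be one of {sorted(ALLOWED_DETAIL)}")
        # OpenAI's API accepts temperatures from 0 to 2.
        if not 0.0 <= self.temperature <= 2.0:
            raise ValueError("temperature must be between 0 and 2")


# High-volume batch run: smaller model, low detail, near-deterministic wording.
batch = ImageDescriptionConfig()

# Accuracy-critical run: larger model, high detail, and a steering prompt.
careful = ImageDescriptionConfig(
    model="gpt-4o",
    detail="high",
    custom_prompt="Focus on the number and position of people in the scene",
)
```

Keeping the defaults on the cheap, deterministic side means a costlier setting is always an explicit choice.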

## Claude Image Description

* **What it does**: Generates a detailed text description of an image using Anthropic Claude vision models.
* **Best for**: Scene descriptions and captions when you prefer Anthropic's Claude models over OpenAI's.
* **Input data**: Image data units.
* **Output**: A single text classification containing the description.
* **Configuration**:

| Parameter         | Description                                                                                             | Example                                                                                                                              |
| ----------------- | ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
| **Model**         | Choose which Claude model to use. Larger models provide better quality at higher cost.                  | Use `claude-haiku-4-5` for fast, high-volume pipelines, or `claude-opus-4-6` when description quality is the priority.               |
| **Custom Prompt** | Optional extra instructions to emphasize what matters in the description.                               | `"Focus on visible defects or damage"` for quality control, or `"Describe the background environment only"` for scene context tasks. |
| **Temperature**   | Controls variation in the output. Lower values make descriptions more consistent across similar images. | Use `0.1` for repeatable, audit-friendly descriptions, or `0.7` for more expressive captions in creative workflows.                  |

* **Credentials**: Requires an Anthropic API key.
* **Ontology requirements**: A **classification** with a **TEXT** attribute where the image description is stored.

## Classify an Image

* **What it does**: Classifies images into one or more ontology categories using OpenAI vision models.
* **Best for**: Category or label assignment, such as object presence, scene type, or attributes.
* **Input data**: Image data units.
* **Output**: A classification answer using your ontology options (single- or multi-select).
* **Configuration**:

| Parameter         | Description                                                                                                                                                | Example                                                                                                                                        |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| **Model**         | Choose the OpenAI model used to make classification decisions. Smaller models are cheaper and faster; larger models handle more nuanced categories.        | Use `gpt-4o-mini` for straightforward labels like `indoor`/`outdoor`, or `gpt-4o` for fine-grained distinctions like similar product subtypes. |
| **Detail**        | Controls how much visual information the model receives. Higher detail helps with fine-grained distinctions.                                               | Use `low` for broad scene classification; use `high` when distinguishing visually similar product types or detecting small objects.            |
| **Custom Prompt** | Optional guidance that explains how the model should interpret your ontology, such as definitions of borderline classes.                                   | `"If both a person and a vehicle are present, select both labels"` or `"Classify as 'damaged' only if defects are clearly visible"`.           |
| **Temperature**   | Controls sampling randomness. Lower values make the model pick the most likely class consistently and are recommended for production classification.       | Use `0.0`–`0.2` for consistent, production-grade labeling, or `0.6` when stress-testing edge cases during evaluation.                          |

* **Credentials**: Requires an OpenAI API key.
* **Ontology requirements**: A **classification feature with options**, using **radio** (single-select) or **checklist** (multi-select).
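
The difference between the two feature types can be sketched as a check on the model's answer: a radio feature takes exactly one option, while a checklist takes any subset. The helper `validate_selection` below is hypothetical and only illustrates that rule; it does not describe how the agent itself validates answers.

```python
from typing import Iterable, List


def validate_selection(kind: str, options: Iterable[str], selected: List[str]) -> List[str]:
    """Check a model's answer against the ontology options (hypothetical helper)."""
    if kind not in ("radio", "checklist"):
        raise ValueError(f"unsupported feature kind: {kind}")
    allowed = set(options)
    unknown = [s for s in selected if s not in allowed]
    if unknown:
        raise ValueError(f"not in ontology: {unknown}")
    if kind == "radio" and len(selected) != 1:
        raise ValueError("radio features take exactly one option")
    # Drop duplicates while keeping the order the model returned.
    return list(dict.fromkeys(selected))


# Single-select scene type.
scene = validate_selection("radio", ["indoor", "outdoor"], ["indoor"])

# Multi-select object presence; the repeated label is collapsed.
present = validate_selection(
    "checklist", ["person", "vehicle"], ["person", "vehicle", "person"]
)
```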

## Ask a Question About an Image

* **What it does**: Answers a natural-language question about an image (visual question answering).
* **Best for**: Targeted questions such as "What brands are visible?" or "What are the people doing?"
* **Input data**: Image data units.
* **Output**: A single text classification containing the answer.
* **Configuration**:

| Parameter       | Description                                                                                                           | Example                                                                                                                                                                                |
| --------------- | --------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Model**       | Choose which OpenAI model should answer the question. Use larger models for more complex or nuanced questions.        | Use `gpt-4o-mini` for simple factual questions like `"Is the light on or off?"`, or `gpt-4o` for questions requiring reasoning like `"Is the safety equipment being used correctly?"`. |
| **Question**    | The natural-language question you want to ask about each image.                                                       | `"How many people are in this image?"`, `"What brand logos are visible?"`, or `"Is the safety helmet worn correctly?"`.                                                                |
| **Detail**      | Controls how much visual detail the model considers when answering. Higher detail helps with dense or complex scenes. | Use `low` for simple presence/absence questions; use `high` for questions about small text, crowded scenes, or fine-grained object attributes.                                         |
| **Temperature** | Controls how deterministic the answers are. Lower values reduce variability between similar images.                   | Use `0.0` for consistent yes/no or count-based answers, or `0.5` for more descriptive open-ended responses.                                                                            |

* **Credentials**: Requires an OpenAI API key.
* **Ontology requirements**: A **classification** with a **TEXT** attribute where the answer is stored.

## Transcribe Audio

* **What it does**: Transcribes entire audio files using OpenAI Whisper into a single transcript.
* **Best for**: Full-file audio transcription (calls, interviews, long recordings) where you want one combined transcript.
* **Input data**: Audio data units.
* **Output**: A single text classification containing the full transcript.
* **Configuration**:

| Parameter    | Description                                                                                         | Example                                                                                                                                       |
| ------------ | --------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| **Model**    | Select which Whisper model version to use. Larger models may improve accuracy on challenging audio. | Use `whisper-1` for clean, studio-quality recordings, or a larger variant for heavily accented speech or noisy environments.                  |
| **Language** | Optionally specify the spoken language to improve accuracy; leave blank to let Whisper auto-detect. | Set `en` for English-only call center recordings, or leave blank for multilingual interview datasets.                                         |
| **Prompt**   | Optional context to bias the transcript toward domain-specific terms or acronyms.                   | `"This is a medical consultation. Terms include ECG, systolic, and triage."` or `"Speaker discusses cloud services: AWS, Kubernetes, CI/CD."` |

* **Credentials**: Requires an OpenAI API key.
* **Ontology requirements**: A **classification** with a **TEXT** attribute where the transcript is stored.
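
The three parameters correspond to the `model`, `language`, and `prompt` fields of OpenAI's transcription endpoint. The sketch below shows one way to assemble them so that blank optional values are omitted, which leaves language detection to Whisper. The helper `transcription_params` is hypothetical, and how the agent builds its request internally is an assumption.

```python
from typing import Dict, Optional


def transcription_params(
    model: str = "whisper-1",
    language: Optional[str] = None,
    prompt: Optional[str] = None,
) -> Dict[str, str]:
    """Build keyword arguments for a transcription request (hypothetical helper)."""
    params = {"model": model}
    # A blank language is left out so that Whisper auto-detects it.
    if language and language.strip():
        params["language"] = language.strip()
    # A blank prompt is left out as well.
    if prompt and prompt.strip():
        params["prompt"] = prompt.strip()
    return params


# English-only call center recordings.
call_center = transcription_params(language="en")

# Multilingual audio with domain vocabulary: no language, prompt only.
medical = transcription_params(
    prompt="This is a medical consultation. Terms include ECG, systolic, and triage."
)

# With the official `openai` package, a direct call could pass these as:
#   client.audio.transcriptions.create(file=audio_file, **call_center)
```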

## Diarize and Transcribe Audio

* **What it does**: Transcribes audio and performs two-speaker diarization, writing separate transcripts per speaker.
* **Best for**: Two-speaker conversations (for example, agent/customer) where you want per-speaker transcripts.
* **Input data**: Audio data units.
* **Output**:
  * Time-based **AUDIO objects** for Speaker 1 and Speaker 2 segments.
  * **TEXT attributes** on those objects with each speaker's transcript.
* **Configuration**:
  * **Language**: Optionally specify the language spoken in the audio. This can improve diarization and transcription quality; leave blank to auto-detect.
* **Credentials**: Requires an OpenAI API key.
* **Ontology requirements**:
  * An **AUDIO object** for Speaker 1 segments, plus a **TEXT attribute** on that object.
  * An **AUDIO object** for Speaker 2 segments, plus a **TEXT attribute** on that object.
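
To make the output shape concrete, the sketch below represents diarized segments as time ranges tagged with a speaker and joins them into one transcript per speaker. The `Segment` class, its millisecond fields, and the choice to store text per segment are assumptions made for illustration; this page does not specify the agent's internal representation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    """One diarized span of audio (hypothetical structure)."""
    speaker: int   # 1 or 2
    start_ms: int  # assumed unit: milliseconds
    end_ms: int
    text: str


def transcripts_by_speaker(segments: List[Segment]) -> Dict[int, str]:
    """Join each speaker's segments, in time order, into one transcript."""
    parts: Dict[int, List[str]] = {1: [], 2: []}
    for seg in sorted(segments, key=lambda s: s.start_ms):
        parts[seg.speaker].append(seg.text)
    return {speaker: " ".join(texts) for speaker, texts in parts.items()}


segments = [
    Segment(2, 2100, 4000, "My order has not arrived."),
    Segment(1, 0, 2000, "Thanks for calling, how can I help?"),
    Segment(1, 4100, 6000, "Let me check that for you."),
]
transcripts = transcripts_by_speaker(segments)
```

Here `transcripts[1]` holds the agent's two turns in order and `transcripts[2]` holds the customer's single turn.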

## Recognize and Extract Text

* **What it does**: Detects and extracts text from images using Google Document AI, returning polygon regions and associated text.
* **Best for**: OCR on documents, receipts, forms, and other scanned images where you need both text and its region.
* **Input data**: Image data units.
* **Output**:
  * **POLYGON objects** for each detected text region.
  * A **TEXT attribute** on each polygon containing the extracted text.
* **Configuration**:

| Parameter              | Description                                                                                                                | Example                                                                                     |
| ---------------------- | -------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- |
| **Processor ID**       | Identifies which Document AI processor to use, determining the underlying OCR model and configuration in your GCP project. | `abc123def456` — the ID found on your processor's details page in the Google Cloud Console. |
| **Processor location** | Region where the processor is hosted. Should match the region where you created the processor.                             | `us` for processors created in the United States, or `eu` for processors created in Europe. |
| **Language hints**     | Optional list of likely languages in your documents to improve OCR accuracy.                                               | `["en"]` for English-only invoices, or `["en", "fr"]` for bilingual Canadian documents.     |

* **Credentials**: Requires a Google Document AI service account.
* **Ontology requirements**:
  * A **POLYGON object** type for OCR regions.
  * A **TEXT attribute** attached to that polygon object for the extracted text.
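
The **Processor ID** and **Processor location** combine with a Google Cloud project ID to form the resource name that Document AI expects, and the location also selects the regional API endpoint. The sketch below builds both strings. The function names and the example project ID `my-gcp-project` are placeholders for illustration; the project ID is not one of the parameters listed above.

```python
def processor_name(project_id: str, location: str, processor_id: str) -> str:
    """Return the full Document AI processor resource name."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


def api_endpoint(location: str) -> str:
    """Return the regional Document AI endpoint for a processor location."""
    return f"{location}-documentai.googleapis.com"


# Placeholder project ID; the processor ID matches the example in the table.
name = processor_name("my-gcp-project", "eu", "abc123def456")
endpoint = api_endpoint("eu")
```

A location that does not match the processor's region points the request at the wrong regional endpoint, which is why the two values must agree.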