OpenAI Image Description
- What it does: Generates a detailed text description of an image using OpenAI vision models.
- Best for: Scene descriptions, captions, and summarizing what is happening in an image.
- Input data: Image data units.
- Output: A single text classification containing the description.
- Configuration:
| Parameter | Description | Example |
|---|---|---|
| Model | Choose which OpenAI model to use — a faster “mini” model for speed, or a larger model for higher quality at higher cost. | Use gpt-4o-mini for high-volume batch processing, or gpt-4o when description accuracy is critical. |
| Detail | Controls how much visual detail the model sees. Higher detail improves description quality but can increase latency and cost. | Use low for thumbnail images or quick content moderation; use high for medical scans or fine-grained scene analysis. |
| Custom Prompt | Optional extra instructions that steer the description. | "Focus on the number and position of people in the scene" for crowd analysis, or "Describe only the text visible in the image" for OCR-style summaries. |
| Temperature | Controls how “creative” the wording is. Lower values give more deterministic output; higher values produce more varied descriptions. | Use 0.2 for consistent, factual product descriptions, or 0.8 for diverse creative captions. |
- Credentials: OpenAI API key.
- Ontology requirements: A classification with a TEXT attribute where the image description is stored.
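To see how the node's parameters line up with the underlying API, here is a minimal sketch of the OpenAI Chat Completions payload they map onto. The helper name and defaults are hypothetical, not the node's actual implementation; `detail` is the real OpenAI vision fidelity setting (`"low"` or `"high"`).

```python
def build_description_request(image_url, model="gpt-4o-mini", detail="low",
                              custom_prompt=None, temperature=0.2):
    """Hypothetical helper: assemble a Chat Completions payload for
    describing one image. Mirrors this node's parameters."""
    instruction = custom_prompt or "Describe this image in detail."
    return {
        "model": model,
        "temperature": temperature,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                # "detail" controls how much visual detail the model sees
                {"type": "image_url",
                 "image_url": {"url": image_url, "detail": detail}},
            ],
        }],
    }

payload = build_description_request(
    "https://example.com/street.jpg",
    custom_prompt="Focus on the number and position of people in the scene",
)
```

The returned dict is what you would pass to `client.chat.completions.create(**payload)`; the node stores the response text in your TEXT attribute.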
Claude Image Description
- What it does: Generates a detailed text description of an image using Anthropic Claude vision models.
- Best for: Scene descriptions and captions when you prefer Anthropic Claude models for vision tasks.
- Input data: Image data units.
- Output: A single text classification containing the description.
- Configuration:
| Parameter | Description | Example |
|---|---|---|
| Model | Choose which Claude model to use. Larger models provide better quality at higher cost. | Use claude-haiku-4-5 for fast, high-volume pipelines, or claude-opus-4-6 when description quality is the priority. |
| Custom Prompt | Optional extra instructions to emphasize what matters in the description. | "Focus on visible defects or damage" for quality control, or "Describe the background environment only" for scene context tasks. |
| Temperature | Controls variation in the output. Lower values make descriptions more consistent across similar images. | Use 0.1 for repeatable, audit-friendly descriptions, or 0.7 for more expressive captions in creative workflows. |
- Credentials: Anthropic API key.
- Ontology requirements: A classification with a TEXT attribute where the image description is stored.
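For orientation, here is a sketch of the Anthropic Messages API payload these parameters correspond to. The helper is hypothetical; the content-block structure (base64 image source plus text) is the real Messages API shape, and `max_tokens` is required by that API.

```python
def build_claude_request(image_b64, media_type="image/jpeg",
                         model="claude-haiku-4-5", custom_prompt=None,
                         temperature=0.1, max_tokens=512):
    """Hypothetical helper: assemble an Anthropic Messages payload for
    describing one base64-encoded image."""
    instruction = custom_prompt or "Describe this image in detail."
    return {
        "model": model,
        "max_tokens": max_tokens,       # required by the Messages API
        "temperature": temperature,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": media_type,
                            "data": image_b64}},
                {"type": "text", "text": instruction},
            ],
        }],
    }
```

You would pass the result to `client.messages.create(**payload)` and write the response text to your TEXT attribute.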
Classify an Image
- What it does: Classifies images into one or more ontology categories using OpenAI vision models.
- Best for: Category or label assignment, such as object presence, scene type, or attributes.
- Input data: Image data units.
- Output: A classification answer using your ontology options (single- or multi-select).
- Configuration:
| Parameter | Description | Example |
|---|---|---|
| Model | Choose the OpenAI model used to make classification decisions. Smaller models are cheaper and faster; larger models handle more nuanced categories. | Use gpt-4o-mini for straightforward labels like indoor/outdoor, or gpt-4o for fine-grained distinctions like similar product subtypes. |
| Detail | Controls how much visual information the model receives. Higher detail helps with fine-grained distinctions. | Use low for broad scene classification; use high when distinguishing visually similar product types or detecting small objects. |
| Custom Prompt | Optional guidance that explains how the model should interpret your ontology, such as definitions of borderline classes. | "If both a person and a vehicle are present, select both labels" or "Classify as 'damaged' only if defects are clearly visible". |
| Temperature | Controls how confidently the model sticks to the most likely class vs. exploring alternatives. Lower values are recommended for production classification. | Use 0.0–0.2 for consistent, production-grade labeling, or 0.6 when stress-testing edge cases during evaluation. |
- Credentials: OpenAI API key.
- Ontology requirements: A classification feature with options, using radio (single-select) or checklist (multi-select).
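A classification node like this typically turns your ontology options into an instruction the model must answer from. The sketch below is an assumed prompt-building step, not the node's actual prompt; it shows how radio vs. checklist ontologies and the Custom Prompt parameter could combine.

```python
def build_classification_prompt(options, multi_select=False, custom_prompt=None):
    """Hypothetical sketch: render ontology options as a constrained
    labeling instruction. multi_select mirrors radio vs. checklist."""
    mode = ("Select every label that applies"
            if multi_select else "Select exactly one label")
    lines = [f"{mode} from the following options:"]
    lines += [f"- {opt}" for opt in options]
    if custom_prompt:
        lines.append(custom_prompt)
    lines.append("Answer with the label text only.")
    return "\n".join(lines)

prompt = build_classification_prompt(
    ["indoor", "outdoor"],
    custom_prompt="Classify as 'outdoor' only if sky or terrain is visible",
)
```

The resulting prompt would be sent alongside the image, as in the description node above, and the model's reply mapped back onto your ontology options.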
Ask a Question About an Image
- What it does: Answers a natural-language question about an image (visual question answering).
- Best for: Targeted questions such as “What brands are visible?” or “What are the people doing?”
- Input data: Image data units.
- Output: A single text classification containing the answer.
- Configuration:
| Parameter | Description | Example |
|---|---|---|
| Model | Choose which OpenAI model should answer the question. Use larger models for more complex or nuanced questions. | Use gpt-4o-mini for simple factual questions like "Is the light on or off?", or gpt-4o for questions requiring reasoning like "Is the safety equipment being used correctly?". |
| Question | The natural-language question you want to ask about each image. | "How many people are in this image?", "What brand logos are visible?", or "Is the safety helmet worn correctly?". |
| Detail | Controls how much visual detail the model considers when answering. Higher detail helps with dense or complex scenes. | Use low for simple presence/absence questions; use high for questions about small text, crowded scenes, or fine-grained object attributes. |
| Temperature | Controls how deterministic the answers are. Lower values reduce variability between similar images. | Use 0.0 for consistent yes/no or count-based answers, or 0.5 for more descriptive open-ended responses. |
- Credentials: OpenAI API key.
- Ontology requirements: A classification with a TEXT attribute where the answer is stored.
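The request shape is the same vision payload as the description node, with the Question parameter as the text part. The sketch below is hypothetical; it also shows one common downstream step, pulling a count out of a free-text answer, which works best with Temperature at 0.0.

```python
import re

def build_vqa_request(image_url, question, model="gpt-4o-mini",
                      detail="low", temperature=0.0):
    """Hypothetical helper: visual question answering payload
    (same structure as the description payload, question as the text)."""
    return {
        "model": model,
        "temperature": temperature,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": image_url, "detail": detail}},
            ],
        }],
    }

def parse_count(answer):
    """Pull the first integer out of an answer like 'There are 3 people.'"""
    m = re.search(r"\d+", answer)
    return int(m.group()) if m else None
```

`parse_count("There are 3 people.")` returns `3`; answers with no number return `None`, which you can route to review.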
Transcribe Audio
- What it does: Transcribes entire audio files into a single transcript using OpenAI Whisper.
- Best for: Full-file audio transcription (calls, interviews, long recordings) where you want one combined transcript.
- Input data: Audio data units.
- Output: A single text classification containing the full transcript.
- Configuration:
| Parameter | Description | Example |
|---|---|---|
| Model | Select which Whisper model version to use. Where your deployment exposes more than one version, larger models may improve accuracy on challenging audio. | Use whisper-1, the standard hosted model, for clean recordings; try a larger version, if available, for heavily accented speech or noisy environments. |
| Language | Optionally specify the spoken language to improve accuracy; leave blank to let Whisper auto-detect. | Set en for English-only call center recordings, or leave blank for multilingual interview datasets. |
| Prompt | Optional context to bias the transcript toward domain-specific terms or acronyms. | "This is a medical consultation. Terms include ECG, systolic, and triage." or "Speaker discusses cloud services: AWS, Kubernetes, CI/CD." |
- Credentials: OpenAI API key.
- Ontology requirements: A classification with a TEXT attribute where the transcript is stored.
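These parameters map directly onto keyword arguments of the OpenAI transcription endpoint. The helper below is a hypothetical sketch showing one important detail: blank optional fields should be omitted entirely so Whisper auto-detects the language rather than receiving an empty string.

```python
def build_transcription_params(model="whisper-1", language=None, prompt=None):
    """Hypothetical helper: keyword arguments for
    client.audio.transcriptions.create(file=..., **params).
    Unset optional fields are omitted so Whisper auto-detects."""
    params = {"model": model}
    if language:
        params["language"] = language   # e.g. "en"; improves accuracy if known
    if prompt:
        params["prompt"] = prompt       # biases toward domain terms/acronyms
    return params
```

Example usage: `client.audio.transcriptions.create(file=open("call.mp3", "rb"), **build_transcription_params(language="en"))`.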
Diarize and Transcribe Audio
- What it does: Transcribes audio and performs two-speaker diarization, writing separate transcripts per speaker.
- Best for: Two-speaker conversations (for example, agent/customer) where you want per-speaker transcripts.
- Input data: Audio data units.
- Output:
- Time-based AUDIO objects for Speaker 1 and Speaker 2 segments.
- TEXT attributes on those objects with each speaker’s transcript.
- Configuration:
| Parameter | Description | Example |
|---|---|---|
| Language | Optionally specify the language spoken in the audio; this can improve diarization and transcription quality. Leave blank to auto-detect. | Set en for English-only agent/customer calls, or leave blank for mixed-language recordings. |
- Credentials: OpenAI API key.
- Ontology requirements:
- An AUDIO object for Speaker 1 segments, plus a TEXT attribute on that object.
- An AUDIO object for Speaker 2 segments, plus a TEXT attribute on that object.
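To illustrate the output shape, here is a sketch of how timestamped diarized segments collapse into the per-speaker transcripts this node writes. The segment tuple format is an assumption for illustration, not the node's internal representation.

```python
def transcripts_by_speaker(segments):
    """Hypothetical sketch: combine (speaker, start_s, end_s, text)
    segments into one transcript per speaker, ordered by start time.
    Mirrors the node's per-speaker TEXT attributes."""
    out = {}
    for speaker, _start, _end, text in sorted(segments, key=lambda s: s[1]):
        out.setdefault(speaker, []).append(text)
    return {spk: " ".join(parts) for spk, parts in out.items()}

segs = [
    ("Speaker 1", 0.0, 2.1, "Hi, thanks for calling."),
    ("Speaker 2", 2.3, 4.0, "Hello, I have a billing question."),
    ("Speaker 1", 4.2, 5.0, "Sure, go ahead."),
]
```

Each `(speaker, start_s, end_s)` triple corresponds to a time-based AUDIO object; the joined text is what lands in that speaker's TEXT attribute.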
Recognize and Extract Text
- What it does: Detects and extracts text from images using Google Document AI, returning polygon regions and associated text.
- Best for: OCR on documents, receipts, forms, and other scanned images where you need both text and its region.
- Input data: Image data units.
- Output:
- POLYGON objects for each detected text region.
- A TEXT attribute on each polygon containing the extracted text.
- Configuration:
| Parameter | Description | Example |
|---|---|---|
| Processor ID | Identifies which Document AI processor to use, determining the underlying OCR model and configuration in your GCP project. | abc123def456 — the ID found on your processor’s details page in the Google Cloud Console. |
| Processor location | Region where the processor is hosted. Must match the region where you created the processor. | us for processors created in the United States, or eu for processors created in Europe. |
| Language hints | Optional list of likely languages in your documents to improve OCR accuracy. | ["en"] for English-only invoices, or ["en", "fr"] for bilingual Canadian documents. |
- Credentials: Google Document AI service account.
- Ontology requirements:
- A POLYGON object type for OCR regions.
- A TEXT attribute attached to that polygon object for the extracted text.
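One detail of Document AI responses worth knowing: detected regions don't carry their text inline; they carry a `textAnchor` of index ranges into the single `document.text` string. The sketch below resolves such an anchor (the dict shapes follow the real response format; `startIndex` may be omitted when it is 0).

```python
def anchored_text(document_text, text_anchor):
    """Resolve a Document AI textAnchor into its substring of document.text.
    text_anchor: {"textSegments": [{"startIndex": int, "endIndex": int}, ...]}
    startIndex defaults to 0 when omitted, as in real responses."""
    pieces = []
    for seg in text_anchor.get("textSegments", []):
        start = int(seg.get("startIndex", 0))
        end = int(seg["endIndex"])
        pieces.append(document_text[start:end])
    return "".join(pieces)

# Illustrative data, not a real API response:
doc_text = "INVOICE\nTotal: $42.00\n"
header_anchor = {"textSegments": [{"endIndex": 7}]}
```

The resolved string is what you would store in the TEXT attribute on each POLYGON object, with the region itself built from the block's `boundingPoly` vertices.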