This page lists the built-in agents that are available in the Agents Catalog and can be added to your projects with minimal configuration.

OpenAI Image Description

  • What it does: Generates a detailed text description of an image using OpenAI vision models.
  • Best for: Scene descriptions, captions, and summarizing what is happening in an image.
  • Input data: Image data units.
  • Output: A single text classification containing the description.
  • Configuration:
    • Model: Choose which OpenAI model to use — a faster “mini” model for speed, or a larger model for higher quality at higher cost. Example: Use gpt-4o-mini for high-volume batch processing, or gpt-4o when description accuracy is critical.
    • Detail: Controls how much visual detail the model sees. Higher detail improves description quality but can increase latency and cost. Example: Use low for thumbnail images or quick content moderation; use high for medical scans or fine-grained scene analysis.
    • Custom Prompt: Optional extra instructions that steer the description. Example: "Focus on the number and position of people in the scene" for crowd analysis, or "Describe only the text visible in the image" for OCR-style summaries.
    • Temperature: Controls how “creative” the wording is. Lower values give more deterministic output; higher values produce more varied descriptions. Example: Use 0.2 for consistent, factual product descriptions, or 0.8 for diverse creative captions.
  • Credentials: OpenAI API key.
  • Ontology requirements: A classification with a TEXT attribute where the image description is stored.
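Under the hood, an agent like this presumably sends the image and prompt to the OpenAI Chat Completions vision endpoint. A minimal sketch of how the parameters above could map onto a request payload (the function name, defaults, and prompt text are illustrative assumptions, not the agent's actual implementation):

```python
def build_description_request(image_url, model="gpt-4o-mini", detail="low",
                              custom_prompt=None, temperature=0.2):
    """Assemble a Chat Completions payload from the agent's parameters.
    (Illustrative only -- the agent's real request is not documented here.)"""
    prompt = custom_prompt or "Describe this image in detail."
    return {
        "model": model,
        "temperature": temperature,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                # "detail" maps to the Detail parameter: "low", "high", or "auto"
                {"type": "image_url",
                 "image_url": {"url": image_url, "detail": detail}},
            ],
        }],
    }

# With the official client, the description would come back as:
#   from openai import OpenAI
#   resp = OpenAI().chat.completions.create(**build_description_request(url))
#   description = resp.choices[0].message.content
```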

Claude Image Description

  • What it does: Generates a detailed text description of an image using Anthropic Claude vision models.
  • Best for: Image descriptions where you want to use Claude models.
  • Input data: Image data units.
  • Output: A single text classification containing the description.
  • Configuration:
    • Model: Choose which Claude model to use. Larger models provide better quality at higher cost. Example: Use claude-haiku-4-5 for fast, high-volume pipelines, or claude-opus-4-6 when description quality is the priority.
    • Custom Prompt: Optional extra instructions to emphasize what matters in the description. Example: "Focus on visible defects or damage" for quality control, or "Describe the background environment only" for scene context tasks.
    • Temperature: Controls variation in the output. Lower values make descriptions more consistent across similar images. Example: Use 0.1 for repeatable, audit-friendly descriptions, or 0.7 for more expressive captions in creative workflows.
  • Credentials: Anthropic API key.
  • Ontology requirements: A classification with a TEXT attribute where the image description is stored.
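The Claude variant presumably maps the same parameters onto the Anthropic Messages API, which takes images as base64-encoded blocks. A hedged sketch (function name, defaults, and prompt text are assumptions):

```python
def build_claude_request(image_b64, media_type="image/jpeg",
                         model="claude-haiku-4-5", custom_prompt=None,
                         temperature=0.1):
    """Assemble a Messages API payload from the agent's parameters.
    (Illustrative only -- names and defaults are assumptions.)"""
    prompt = custom_prompt or "Describe this image in detail."
    return {
        "model": model,
        "max_tokens": 1024,           # the Messages API requires max_tokens
        "temperature": temperature,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": media_type,
                            "data": image_b64}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

# With the official client:
#   import anthropic
#   resp = anthropic.Anthropic().messages.create(**build_claude_request(b64))
#   description = resp.content[0].text
```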

Classify an Image

  • What it does: Classifies images into one or more ontology categories using OpenAI vision models.
  • Best for: Category or label assignment, such as object presence, scene type, or attributes.
  • Input data: Image data units.
  • Output: A classification answer using your ontology options (single- or multi-select).
  • Configuration:
    • Model: Choose the OpenAI model used to make classification decisions. Smaller models are cheaper and faster; larger models handle more nuanced categories. Example: Use gpt-4o-mini for straightforward labels like indoor/outdoor, or gpt-4o for fine-grained distinctions like similar product subtypes.
    • Detail: Controls how much visual information the model receives. Higher detail helps with fine-grained distinctions. Example: Use low for broad scene classification; use high when distinguishing visually similar product types or detecting small objects.
    • Custom Prompt: Optional guidance that explains how the model should interpret your ontology, such as definitions of borderline classes. Example: "If both a person and a vehicle are present, select both labels" or "Classify as 'damaged' only if defects are clearly visible".
    • Temperature: Controls how confidently the model sticks to the most likely class vs. exploring alternatives. Lower values are recommended for production classification. Example: Use 0.0–0.2 for consistent, production-grade labeling, or 0.6 when stress-testing edge cases during evaluation.
  • Credentials: OpenAI API key.
  • Ontology requirements: A classification feature with options, using radio (single-select) or checklist (multi-select).
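Classification agents of this kind typically turn the ontology options into an instruction for the model and then validate the answer against those options. A sketch of what that plumbing could look like (both helpers are illustrative; the agent's real prompt template is not documented here):

```python
def build_classification_prompt(options, multi_select=False, guidance=None):
    """Turn ontology options into an instruction the vision model can follow.
    (Illustrative -- not the agent's actual prompt template.)"""
    mode = ("Select every label that applies"
            if multi_select else "Select exactly one label")
    lines = [f"{mode} from these options:"]
    lines += [f"- {opt}" for opt in options]
    if guidance:                       # the Custom Prompt parameter
        lines.append(guidance)
    lines.append("Answer with the label text only, comma-separated if several.")
    return "\n".join(lines)

def parse_labels(raw_answer, options):
    """Keep only answers that exactly match an ontology option."""
    parts = [p.strip() for p in raw_answer.split(",")]
    return [p for p in parts if p in options]
```

Validating against the option list matters in production: it guarantees the stored answer is always a legal ontology value, even if the model rambles.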

Ask a Question About an Image

  • What it does: Answers a natural-language question about an image (visual question answering).
  • Best for: Targeted questions such as “What brands are visible?” or “What are the people doing?”
  • Input data: Image data units.
  • Output: A single text classification containing the answer.
  • Configuration:
    • Model: Choose which OpenAI model should answer the question. Use larger models for more complex or nuanced questions. Example: Use gpt-4o-mini for simple factual questions like "Is the light on or off?", or gpt-4o for questions requiring reasoning like "Is the safety equipment being used correctly?".
    • Question: The natural-language question you want to ask about each image. Example: "How many people are in this image?", "What brand logos are visible?", or "Is the safety helmet worn correctly?".
    • Detail: Controls how much visual detail the model considers when answering. Higher detail helps with dense or complex scenes. Example: Use low for simple presence/absence questions; use high for questions about small text, crowded scenes, or fine-grained object attributes.
    • Temperature: Controls how deterministic the answers are. Lower values reduce variability between similar images. Example: Use 0.0 for consistent yes/no or count-based answers, or 0.5 for more descriptive open-ended responses.
  • Credentials: OpenAI API key.
  • Ontology requirements: A classification with a TEXT attribute where the answer is stored.

Transcribe Audio

  • What it does: Transcribes entire audio files using OpenAI Whisper into a single transcript.
  • Best for: Full-file audio transcription (calls, interviews, long recordings) where you want one combined transcript.
  • Input data: Audio data units.
  • Output: A single text classification containing the full transcript.
  • Configuration:
    • Model: Select which Whisper model version to use. Larger models may improve accuracy on challenging audio. Example: Use whisper-1 for clean, studio-quality recordings, or a larger variant for heavily accented speech or noisy environments.
    • Language: Optionally specify the spoken language to improve accuracy; leave blank to let Whisper auto-detect. Example: Set en for English-only call center recordings, or leave blank for multilingual interview datasets.
    • Prompt: Optional context to bias the transcript toward domain-specific terms or acronyms. Example: "This is a medical consultation. Terms include ECG, systolic, and triage." or "Speaker discusses cloud services: AWS, Kubernetes, CI/CD."
  • Credentials: OpenAI API key.
  • Ontology requirements: A classification with a TEXT attribute where the transcript is stored.
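The three parameters above map naturally onto the OpenAI audio transcription endpoint. A hedged sketch of that mapping (the helper is illustrative; the agent's internals are not documented here):

```python
def build_transcription_params(audio_file, model="whisper-1",
                               language=None, prompt=None):
    """Map the agent's parameters onto audio.transcriptions.create() kwargs.
    (Illustrative only -- not the agent's actual code.)"""
    params = {"model": model, "file": audio_file}
    if language:
        params["language"] = language  # ISO-639-1 code such as "en"
    if prompt:
        params["prompt"] = prompt      # biases output toward domain terms
    return params

# With the official client:
#   from openai import OpenAI
#   with open("call.mp3", "rb") as f:
#       text = OpenAI().audio.transcriptions.create(
#           **build_transcription_params(f, language="en")).text
```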

Diarize and Transcribe Audio

  • What it does: Transcribes audio and performs two-speaker diarization, writing separate transcripts per speaker.
  • Best for: Two-speaker conversations (for example, agent/customer) where you want per-speaker transcripts.
  • Input data: Audio data units.
  • Output:
    • Time-based AUDIO objects for Speaker 1 and Speaker 2 segments.
    • TEXT attributes on those objects with each speaker’s transcript.
  • Configuration:
    • Language: Optionally specify the language spoken in the audio. This can improve diarization and transcription quality; leave blank to auto-detect.
  • Credentials: OpenAI API key.
  • Ontology requirements:
    • An AUDIO object for Speaker 1 segments, plus a TEXT attribute on that object.
    • An AUDIO object for Speaker 2 segments, plus a TEXT attribute on that object.

Recognize and Extract Text

  • What it does: Detects and extracts text from images using Google Document AI, returning polygon regions and associated text.
  • Best for: OCR on documents, receipts, forms, and other scanned images where you need both text and its region.
  • Input data: Image data units.
  • Output:
    • POLYGON objects for each detected text region.
    • A TEXT attribute on each polygon containing the extracted text.
  • Configuration:
    • Processor ID: Identifies which Document AI processor to use, determining the underlying OCR model and configuration in your GCP project. Example: abc123def456 — the ID found on your processor’s details page in the Google Cloud Console.
    • Processor location: Region where the processor is hosted. Should match the region where you created the processor. Example: us for processors created in the United States, or eu for processors created in Europe.
    • Language hints: Optional list of likely languages in your documents to improve OCR accuracy. Example: ["en"] for English-only invoices, or ["en", "fr"] for bilingual Canadian documents.
  • Credentials: Google Document AI service account.
  • Ontology requirements:
    • A POLYGON object type for OCR regions.
    • A TEXT attribute attached to that polygon object for the extracted text.
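Document AI combines the Processor ID and location into a fully qualified resource name before processing. A sketch of how the configuration values above fit together (the project and processor IDs below are placeholders):

```python
def processor_resource_name(project_id, location, processor_id):
    """Build the fully qualified processor name Document AI expects,
    e.g. projects/my-proj/locations/us/processors/abc123def456.
    (Illustrative helper; IDs are placeholders.)"""
    return (f"projects/{project_id}/locations/{location}"
            f"/processors/{processor_id}")

# With the official client library (google-cloud-documentai):
#   from google.cloud import documentai
#   client = documentai.DocumentProcessorServiceClient()
#   name = processor_resource_name("my-project", "us", "abc123def456")
#   request = documentai.ProcessRequest(
#       name=name,
#       raw_document=documentai.RawDocument(content=image_bytes,
#                                           mime_type="image/png"))
#   document = client.process_document(request=request).document
```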