OpenAI Image Description
- What it does: Generates a detailed text description of an image using OpenAI vision models.
- Best for: Scene descriptions, captions, and summarizing what is happening in an image.
- Input data: Image data units.
- Output: A single text classification containing the description.
- Configuration:
| Parameter | Description | Example |
|---|---|---|
| Model | Choose which OpenAI model to use — a faster “mini” model for speed, or a larger model for higher quality at higher cost. | Use gpt-4o-mini for high-volume batch processing, or gpt-4o when description accuracy is critical. |
| Detail | Controls how much visual detail the model sees. Higher detail improves description quality but can increase latency and cost. | Use low for thumbnail images or quick content moderation; use high for medical scans or fine-grained scene analysis. |
| Custom Prompt | Optional extra instructions that steer the description. | "Focus on the number and position of people in the scene" for crowd analysis, or "Describe only the text visible in the image" for OCR-style summaries. |
| Temperature | Controls how “creative” the wording is. Lower values give more deterministic output; higher values produce more varied descriptions. | Use 0.2 for consistent, factual product descriptions, or 0.8 for diverse creative captions. |
- Credentials: OpenAI API key.
- Ontology requirements: A classification with a TEXT attribute where the image description is stored.
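To see how the node's parameters line up with the underlying API, here is a minimal sketch of the OpenAI Chat Completions payload they map onto. The helper name and defaults are hypothetical, not the node's actual implementation; `detail` is the real OpenAI vision fidelity setting (`"low"` or `"high"`).

```python
def build_description_request(image_url, model="gpt-4o-mini", detail="low",
                              custom_prompt=None, temperature=0.2):
    """Hypothetical helper: assemble a Chat Completions payload for
    describing one image. Mirrors this node's parameters."""
    instruction = custom_prompt or "Describe this image in detail."
    return {
        "model": model,
        "temperature": temperature,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                # "detail" controls how much visual detail the model sees
                {"type": "image_url",
                 "image_url": {"url": image_url, "detail": detail}},
            ],
        }],
    }

payload = build_description_request(
    "https://example.com/street.jpg",
    custom_prompt="Focus on the number and position of people in the scene",
)
```

The returned dict is what you would pass to `client.chat.completions.create(**payload)`; the node stores the response text in your TEXT attribute.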
Claude Image Description
- What it does: Generates a detailed text description of an image using Anthropic Claude vision models.
- Best for: Scene descriptions and captions when you prefer Anthropic Claude models for vision tasks.
- Input data: Image data units.
- Output: A single text classification containing the description.
- Configuration:
| Parameter | Description | Example |
|---|---|---|
| Model | Choose which Claude model to use. Larger models provide better quality at higher cost. | Use claude-haiku-4-5 for fast, high-volume pipelines, or claude-opus-4-6 when description quality is the priority. |
| Custom Prompt | Optional extra instructions to emphasize what matters in the description. | "Focus on visible defects or damage" for quality control, or "Describe the background environment only" for scene context tasks. |
| Temperature | Controls variation in the output. Lower values make descriptions more consistent across similar images. | Use 0.1 for repeatable, audit-friendly descriptions, or 0.7 for more expressive captions in creative workflows. |
- Credentials: Anthropic API key.
- Ontology requirements: A classification with a TEXT attribute where the image description is stored.
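For orientation, here is a sketch of the Anthropic Messages API payload these parameters correspond to. The helper is hypothetical; the content-block structure (base64 image source plus text) is the real Messages API shape, and `max_tokens` is required by that API.

```python
def build_claude_request(image_b64, media_type="image/jpeg",
                         model="claude-haiku-4-5", custom_prompt=None,
                         temperature=0.1, max_tokens=512):
    """Hypothetical helper: assemble an Anthropic Messages payload for
    describing one base64-encoded image."""
    instruction = custom_prompt or "Describe this image in detail."
    return {
        "model": model,
        "max_tokens": max_tokens,       # required by the Messages API
        "temperature": temperature,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": media_type,
                            "data": image_b64}},
                {"type": "text", "text": instruction},
            ],
        }],
    }
```

You would pass the result to `client.messages.create(**payload)` and write the response text to your TEXT attribute.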
Classify an Image
- What it does: Classifies images into one or more ontology categories using OpenAI vision models.
- Best for: Category or label assignment, such as object presence, scene type, or attributes.
- Input data: Image data units.
- Output: A classification answer using your ontology options (single- or multi-select).
- Configuration:
| Parameter | Description | Example |
|---|---|---|
| Model | Choose the OpenAI model used to make classification decisions. Smaller models are cheaper and faster; larger models handle more nuanced categories. | Use gpt-4o-mini for straightforward labels like indoor/outdoor, or gpt-4o for fine-grained distinctions like similar product subtypes. |
| Detail | Controls how much visual information the model receives. Higher detail helps with fine-grained distinctions. | Use low for broad scene classification; use high when distinguishing visually similar product types or detecting small objects. |
| Custom Prompt | Optional guidance that explains how the model should interpret your ontology, such as definitions of borderline classes. | "If both a person and a vehicle are present, select both labels" or "Classify as 'damaged' only if defects are clearly visible". |
| Temperature | Controls how confidently the model sticks to the most likely class vs. exploring alternatives. Lower values are recommended for production classification. | Use 0.0–0.2 for consistent, production-grade labeling, or 0.6 when stress-testing edge cases during evaluation. |
- Credentials: OpenAI API key.
- Ontology requirements: A classification feature with options, using radio (single-select) or checklist (multi-select).
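A classification node like this typically turns your ontology options into an instruction the model must answer from. The sketch below is an assumed prompt-building step, not the node's actual prompt; it shows how radio vs. checklist ontologies and the Custom Prompt parameter could combine.

```python
def build_classification_prompt(options, multi_select=False, custom_prompt=None):
    """Hypothetical sketch: render ontology options as a constrained
    labeling instruction. multi_select mirrors radio vs. checklist."""
    mode = ("Select every label that applies"
            if multi_select else "Select exactly one label")
    lines = [f"{mode} from the following options:"]
    lines += [f"- {opt}" for opt in options]
    if custom_prompt:
        lines.append(custom_prompt)
    lines.append("Answer with the label text only.")
    return "\n".join(lines)

prompt = build_classification_prompt(
    ["indoor", "outdoor"],
    custom_prompt="Classify as 'outdoor' only if sky or terrain is visible",
)
```

The resulting prompt would be sent alongside the image, as in the description node above, and the model's reply mapped back onto your ontology options.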
Ask a Question About an Image
- What it does: Answers a natural-language question about an image (visual question answering).
- Best for: Targeted questions such as “What brands are visible?” or “What are the people doing?”
- Input data: Image data units.
- Output: A single text classification containing the answer.
- Configuration:
| Parameter | Description | Example |
|---|---|---|
| Model | Choose which OpenAI model should answer the question. Use larger models for more complex or nuanced questions. | Use gpt-4o-mini for simple factual questions like "Is the light on or off?", or gpt-4o for questions requiring reasoning like "Is the safety equipment being used correctly?". |
| Question | The natural-language question you want to ask about each image. | "How many people are in this image?", "What brand logos are visible?", or "Is the safety helmet worn correctly?". |
| Detail | Controls how much visual detail the model considers when answering. Higher detail helps with dense or complex scenes. | Use low for simple presence/absence questions; use high for questions about small text, crowded scenes, or fine-grained object attributes. |
| Temperature | Controls how deterministic the answers are. Lower values reduce variability between similar images. | Use 0.0 for consistent yes/no or count-based answers, or 0.5 for more descriptive open-ended responses. |
- Credentials: OpenAI API key.
- Ontology requirements: A classification with a TEXT attribute where the answer is stored.
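The request shape is the same vision payload as the description node, with the Question parameter as the text part. The sketch below is hypothetical; it also shows one common downstream step, pulling a count out of a free-text answer, which works best with Temperature at 0.0.

```python
import re

def build_vqa_request(image_url, question, model="gpt-4o-mini",
                      detail="low", temperature=0.0):
    """Hypothetical helper: visual question answering payload
    (same structure as the description payload, question as the text)."""
    return {
        "model": model,
        "temperature": temperature,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": image_url, "detail": detail}},
            ],
        }],
    }

def parse_count(answer):
    """Pull the first integer out of an answer like 'There are 3 people.'"""
    m = re.search(r"\d+", answer)
    return int(m.group()) if m else None
```

`parse_count("There are 3 people.")` returns `3`; answers with no number return `None`, which you can route to review.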
Transcribe Audio
- What it does: Transcribes entire audio files into a single transcript using OpenAI Whisper.
- Best for: Full-file audio transcription (calls, interviews, long recordings) where you want one combined transcript.
- Input data: Audio data units.
- Output: A single text classification containing the full transcript.
- Configuration:
| Parameter | Description | Example |
|---|---|---|
| Model | Select which Whisper model version to use. Where your deployment exposes more than one version, larger models may improve accuracy on challenging audio. | Use whisper-1, the standard hosted model, for clean recordings; try a larger version, if available, for heavily accented speech or noisy environments. |
| Language | Optionally specify the spoken language to improve accuracy; leave blank to let Whisper auto-detect. | Set en for English-only call center recordings, or leave blank for multilingual interview datasets. |
| Prompt | Optional context to bias the transcript toward domain-specific terms or acronyms. | "This is a medical consultation. Terms include ECG, systolic, and triage." or "Speaker discusses cloud services: AWS, Kubernetes, CI/CD." |
- Credentials: OpenAI API key.
- Ontology requirements: A classification with a TEXT attribute where the transcript is stored.
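These parameters map directly onto keyword arguments of the OpenAI transcription endpoint. The helper below is a hypothetical sketch showing one important detail: blank optional fields should be omitted entirely so Whisper auto-detects the language rather than receiving an empty string.

```python
def build_transcription_params(model="whisper-1", language=None, prompt=None):
    """Hypothetical helper: keyword arguments for
    client.audio.transcriptions.create(file=..., **params).
    Unset optional fields are omitted so Whisper auto-detects."""
    params = {"model": model}
    if language:
        params["language"] = language   # e.g. "en"; improves accuracy if known
    if prompt:
        params["prompt"] = prompt       # biases toward domain terms/acronyms
    return params
```

Example usage: `client.audio.transcriptions.create(file=open("call.mp3", "rb"), **build_transcription_params(language="en"))`.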
Diarize and Transcribe Audio
- What it does: Transcribes audio and performs two-speaker diarization, writing separate transcripts per speaker.
- Best for: Two-speaker conversations (for example, agent/customer) where you want per-speaker transcripts.
- Input data: Audio data units.
- Output:
- Time-based AUDIO objects for Speaker 1 and Speaker 2 segments.
- TEXT attributes on those objects with each speaker’s transcript.
- Configuration:
| Parameter | Description | Example |
|---|---|---|
| Language | Optionally specify the language spoken in the audio; this can improve diarization and transcription quality. Leave blank to auto-detect. | Set en for English-only agent/customer calls, or leave blank for mixed-language recordings. |
- Credentials: OpenAI API key.
- Ontology requirements:
- An AUDIO object for Speaker 1 segments, plus a TEXT attribute on that object.
- An AUDIO object for Speaker 2 segments, plus a TEXT attribute on that object.
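To illustrate the output shape, here is a sketch of how timestamped diarized segments collapse into the per-speaker transcripts this node writes. The segment tuple format is an assumption for illustration, not the node's internal representation.

```python
def transcripts_by_speaker(segments):
    """Hypothetical sketch: combine (speaker, start_s, end_s, text)
    segments into one transcript per speaker, ordered by start time.
    Mirrors the node's per-speaker TEXT attributes."""
    out = {}
    for speaker, _start, _end, text in sorted(segments, key=lambda s: s[1]):
        out.setdefault(speaker, []).append(text)
    return {spk: " ".join(parts) for spk, parts in out.items()}

segs = [
    ("Speaker 1", 0.0, 2.1, "Hi, thanks for calling."),
    ("Speaker 2", 2.3, 4.0, "Hello, I have a billing question."),
    ("Speaker 1", 4.2, 5.0, "Sure, go ahead."),
]
```

Each `(speaker, start_s, end_s)` triple corresponds to a time-based AUDIO object; the joined text is what lands in that speaker's TEXT attribute.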
Recognize and Extract Text
- What it does: Detects and extracts text from images using Google Document AI, returning polygon regions and associated text.
- Best for: OCR on documents, receipts, forms, and other scanned images where you need both text and its region.
- Input data: Image data units.
- Output:
- POLYGON objects for each detected text region.
- A TEXT attribute on each polygon containing the extracted text.
- Configuration:
| Parameter | Description | Example |
|---|---|---|
| Processor ID | Identifies which Document AI processor to use, determining the underlying OCR model and configuration in your GCP project. | abc123def456 — the ID found on your processor’s details page in the Google Cloud Console. |
| Processor location | Region where the processor is hosted. Must match the region where you created the processor. | us for processors created in the United States, or eu for processors created in Europe. |
| Language hints | Optional list of likely languages in your documents to improve OCR accuracy. | ["en"] for English-only invoices, or ["en", "fr"] for bilingual Canadian documents. |
- Credentials: Google Document AI service account.
- Ontology requirements:
- A POLYGON object type for OCR regions.
- A TEXT attribute attached to that polygon object for the extracted text.
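One detail of Document AI responses worth knowing: detected regions don't carry their text inline; they carry a `textAnchor` of index ranges into the single `document.text` string. The sketch below resolves such an anchor (the dict shapes follow the real response format; `startIndex` may be omitted when it is 0).

```python
def anchored_text(document_text, text_anchor):
    """Resolve a Document AI textAnchor into its substring of document.text.
    text_anchor: {"textSegments": [{"startIndex": int, "endIndex": int}, ...]}
    startIndex defaults to 0 when omitted, as in real responses."""
    pieces = []
    for seg in text_anchor.get("textSegments", []):
        start = int(seg.get("startIndex", 0))
        end = int(seg["endIndex"])
        pieces.append(document_text[start:end])
    return "".join(pieces)

# Illustrative data, not a real API response:
doc_text = "INVOICE\nTotal: $42.00\n"
header_anchor = {"textSegments": [{"endIndex": 7}]}
```

The resolved string is what you would store in the TEXT attribute on each POLYGON object, with the region itself built from the block's `boundingPoly` vertices.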