Multimodal agents — vision, image, and audio in one pipeline
One agent that takes images and audio in, and emits images, audio, or video out. Same SDK, same call surface.
A single agent reads images, generates new images, and produces speech, all through the same SDK. The example uses one model for vision input and a different one for image generation, chosen per call.
from agentfield import Agent, AIConfig
from agentfield.multimodal import image_from_url, image_from_file, audio_from_file, text

app = Agent(
    node_id="creative-agent",
    ai_config=AIConfig(model="openai/gpt-4o"),
)


@app.reasoner()
async def storyboard_from_brief(brief: str, reference_url: str) -> dict:
    # Vision input: auto-detects the image URL
    interpretation = await app.ai(
        text("Describe the visual style of this reference:"),
        image_from_url(reference_url),
    )

    # Generate three images matching the brief
    images = await app.ai_generate_image(
        f"{brief}. Style: {interpretation}",
        model="dall-e-3",
        size="1792x1024",
        num_images=3,
    )
    saved = images.save_all("./output", prefix="frame")

    # Voice-over from the brief
    audio = await app.ai_generate_audio(
        f"Storyboard concept: {brief}",
        model="tts-1-hd",
        voice="alloy",
    )
    audio.audio.save("./output/voiceover.wav")

    return {
        "interpretation": str(interpretation),
        "frames": list(saved.values()),
        "voiceover": "./output/voiceover.wav",
    }


# Audio input: transcribe and act on a recording
@app.reasoner()
async def summarize_meeting(recording_path: str) -> dict:
    return await app.ai(
        text("Transcribe this meeting and pull out the action items:"),
        audio_from_file(recording_path),
    )


app.run()

What this gives you
- Vision and audio inputs are auto-detected from positional args — no manual base64 dance.
- Image, audio, and video generation share one response shape with .save() helpers.
- A pluggable provider system means DALL-E, Flux, fal.ai, and OpenRouter all work through the same call.
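For a sense of what the "manual base64 dance" involves, here is a minimal standard-library sketch of building an image input by hand. It assumes an OpenAI-style content part carrying a base64 data URL. The names image_part_from_file and classify_arg are hypothetical and are not part of the agentfield SDK, and the modality guess is an illustrative heuristic, not the SDK's actual detection logic.

```python
import base64
import mimetypes
from pathlib import Path


def image_part_from_file(path: str) -> dict:
    """Hand-build an OpenAI-style image content part from a local file.

    Hypothetical helper: this is the kind of step the SDK's input
    helpers are described as doing for you.
    """
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image file: {path}")
    # Read the raw bytes and encode them as ASCII-safe base64 text.
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{encoded}"},
    }


def classify_arg(arg: str) -> str:
    """Guess the modality of a string argument from its file extension.

    Hypothetical heuristic for illustration only; a real implementation
    might inspect file contents or HTTP headers instead.
    """
    # Drop any query string so "ref.png?w=200" still resolves to .png.
    mime, _ = mimetypes.guess_type(arg.split("?", 1)[0])
    if mime and mime.startswith("image/"):
        return "image"
    if mime and mime.startswith("audio/"):
        return "audio"
    return "text"
```

With helpers such as image_from_url and audio_from_file, none of this encoding or MIME bookkeeping appears in the reasoner body, which is the point of the first bullet above.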
Next
Signed audit chain for any workflow
Every step of a multi-agent workflow becomes a signed Verifiable Credential. Pull the whole chain in topological order with one request.
Realtime voice sessions
A live voice agent in one decorator — control plane owns the provider, every turn lands in the workflow DAG.