Multimodal agents — vision, image, and audio in one pipeline
One agent that takes images and audio as input and produces images, audio, or video as output, all through the same SDK and call surface.
The example below reads a reference image, generates new images, produces speech, and transcribes a recording. It uses one model for vision input and a different one for image generation, chosen per call.
from agentfield import Agent, AIConfig
from agentfield.multimodal import image_from_url, image_from_file, audio_from_file, text

app = Agent(
    node_id="creative-agent",
    ai_config=AIConfig(model="openai/gpt-4o"),
)

@app.reasoner()
async def storyboard_from_brief(brief: str, reference_url: str) -> dict:
    # Vision input: auto-detects the image URL
    interpretation = await app.ai(
        text("Describe the visual style of this reference:"),
        image_from_url(reference_url),
    )

    # Generate three images matching the brief
    images = await app.ai_generate_image(
        f"{brief}. Style: {interpretation}",
        model="dall-e-3",
        size="1792x1024",
        num_images=3,
    )
    saved = images.save_all("./output", prefix="frame")

    # Voice-over from the brief
    audio = await app.ai_generate_audio(
        f"Storyboard concept: {brief}",
        model="tts-1-hd",
        voice="alloy",
    )
    audio.audio.save("./output/voiceover.wav")

    return {
        "interpretation": str(interpretation),
        "frames": list(saved.values()),
        "voiceover": "./output/voiceover.wav",
    }

# Audio input: transcribe and act on a recording
@app.reasoner()
async def summarize_meeting(recording_path: str) -> dict:
    result = await app.ai(
        text("Transcribe this meeting and pull out the action items:"),
        audio_from_file(recording_path),
    )
    # Wrap the response text so the return value matches the dict annotation
    return {"summary": str(result)}

app.run()

What this gives you
- Vision and audio inputs are auto-detected from positional args — no manual base64 dance.
- Image, audio, and video generation share one response shape with .save() helpers.
- Pluggable provider system means DALL-E, Flux, fal.ai, and OpenRouter all work through the same call.
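
To make the first point concrete, here is a minimal sketch of the work that input auto-detection saves you. It is not AgentField's implementation: `classify_input`, `to_data_url`, and the suffix sets are hypothetical names invented for illustration, and the sketch assumes detection by file extension only, using the Python standard library.

```python
import base64
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical suffix sets for this sketch; a real SDK may detect inputs differently.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".webp"}
AUDIO_SUFFIXES = {".wav", ".mp3", ".m4a", ".ogg", ".flac"}


def classify_input(arg: str) -> str:
    """Return 'image', 'audio', or 'text' for a URL, file path, or plain string."""
    # For URLs, look only at the path so query strings do not hide the extension.
    path = urlparse(arg).path if "://" in arg else arg
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return "image"
    if suffix in AUDIO_SUFFIXES:
        return "audio"
    return "text"


def to_data_url(data: bytes, mime: str) -> str:
    """The manual base64 step: inline raw bytes as a data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


if __name__ == "__main__":
    for arg in ("https://example.com/ref.jpg?size=large", "meeting.wav", "Describe this:"):
        print(arg, "->", classify_input(arg))
```

Without helpers like these built into the call surface, every caller would classify each input and encode file bytes by hand before sending a request.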