Multimodal agents — vision, image, and audio in one pipeline

One agent that takes images and audio in, and emits images, audio, or video out. Same SDK, same call surface.

A single agent that reads images, generates new images, and produces speech — all through the same SDK. Use one model for vision input and a different one for image generation, picked per-call.

from agentfield import Agent, AIConfig
from agentfield.multimodal import image_from_url, image_from_file, audio_from_file, text

app = Agent(
    node_id="creative-agent",
    ai_config=AIConfig(model="openai/gpt-4o"),
)

@app.reasoner()
async def storyboard_from_brief(brief: str, reference_url: str) -> dict:
    # Vision input — auto-detects the image URL
    interpretation = await app.ai(
        text("Describe the visual style of this reference:"),
        image_from_url(reference_url),
    )

    # Generate three images matching the brief
    images = await app.ai_generate_image(
        f"{brief}. Style: {interpretation}",
        model="dall-e-3",
        size="1792x1024",
        num_images=3,
    )
    saved = images.save_all("./output", prefix="frame")

    # Voice-over from the brief
    audio = await app.ai_generate_audio(
        f"Storyboard concept: {brief}",
        model="tts-1-hd",
        voice="alloy",
    )
    audio.audio.save("./output/voiceover.wav")

    return {
        "interpretation": str(interpretation),
        "frames": list(saved.values()),
        "voiceover": "./output/voiceover.wav",
    }

# Audio input — transcribe and act on a recording
@app.reasoner()
async def summarize_meeting(recording_path: str) -> dict:
    return await app.ai(
        text("Transcribe this meeting and pull out the action items:"),
        audio_from_file(recording_path),
    )

app.run()

What this gives you

Vision and audio inputs are auto-detected from positional args — no manual base64 dance.
Image, audio, and video generation share one response shape with .save() helpers.
Pluggable provider system means DALL-E, Flux, fal.ai, and OpenRouter all work through the same call.

Next