
Media Generation

Generate images, video, audio, and transcriptions with a unified API backed by pluggable media providers.

Media generation via app.ai()

Generate images, audio, video, and transcriptions from your agents, using the same method pattern as app.ai().

Agents that generate reports, product listings, marketing content, or customer-facing assets often need more than text. AgentField's media generation API lets you create images, narrate text, produce video, and transcribe audio through a unified interface backed by pluggable providers like fal.ai, DALL-E, and ElevenLabs.
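The "pluggable providers" idea can be illustrated with a minimal registry that routes a model name to a provider backend. This is a sketch of the general pattern, not AgentField's internals; the prefix-based dispatch rule and the stand-in backends are assumptions for illustration:

```python
# Minimal sketch of a pluggable media-provider registry.
# One generate() entry point routes a model name to whichever
# backend (fal.ai, DALL-E, ElevenLabs, ...) is registered for it.
from typing import Callable, Dict

PROVIDERS: Dict[str, Callable[[str, str], dict]] = {}

def register_provider(prefix: str, backend: Callable[[str, str], dict]) -> None:
    PROVIDERS[prefix] = backend

def generate(model: str, prompt: str) -> dict:
    # Route on the longest matching prefix of the model name.
    for prefix in sorted(PROVIDERS, key=len, reverse=True):
        if model.startswith(prefix):
            return PROVIDERS[prefix](model, prompt)
    raise ValueError(f"No provider registered for model {model!r}")

# Stand-in backends; real ones would call the provider's HTTP API.
register_provider("fal-ai/", lambda m, p: {"provider": "fal", "model": m, "prompt": p})
register_provider("dall-e", lambda m, p: {"provider": "openai", "model": m, "prompt": p})

result = generate("fal-ai/flux/schnell", "Professional product photo")
print(result["provider"])  # fal
```

Because callers only see generate(), swapping a provider is a registration change rather than a call-site change, which is what makes the unified API above practical.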

from agentfield import Agent, AIConfig

app = Agent(
    node_id="product-listing-generator",
    ai_config=AIConfig(fal_api_key="your-fal-key"),  # or set FAL_KEY env var
)

@app.reasoner()
async def generate_product_listing(product: dict) -> dict:
    # Generate a product image from description
    image = await app.ai_generate_image(
        prompt=f"Professional product photo: {product['name']}, {product['description']}",
        model="fal-ai/flux/schnell",
        size="square_hd",
    )
    image.images[0].save(f"/output/{product['id']}_hero.png")

    # Generate audio narration for the listing
    audio = await app.ai_generate_audio(
        text=f"Introducing {product['name']}: {product['description']}",  # spoken verbatim by the TTS model
        model="fal-ai/f5-tts",
        voice="alloy",
    )
    if audio.audio:
        audio.audio.save(f"/output/{product['id']}_narration.mp3")

    # Generate a short product demo video
    video = await app.ai_generate_video(
        prompt=f"Product demonstration video: {product['name']} in use",
        model="fal-ai/minimax-video/image-to-video",
        duration=10,
    )
    if video.files:
        video.files[0].save(f"/output/{product['id']}_demo.mp4")

    # Transcribe customer review audio
    transcript = await app.ai_transcribe_audio(
        audio_url=product["review_audio_url"],
        model="fal-ai/whisper",
    )

    return {
        "image_url": image.images[0].url if image.images else None,
        "audio_url": audio.audio.url if audio.audio else None,
        "video_url": video.files[0].url if video.files else None,
        "transcript": transcript.text,
    }

// Media generation is Python-only.
// TypeScript agents can call a Python agent's media generation
// capabilities via agent-to-agent calls:
const result = await agent.call('media-agent.generateProductImage', {
  prompt: `Professional product photo: ${product.name}`,
  model: 'fal-ai/flux/schnell',
});
console.log(result.imageUrl);

// Media generation is Python-only.
// Go agents can call a Python agent's media generation
// capabilities via agent-to-agent calls:
result, _ := app.Call(ctx, "media-agent.generateProductImage", map[string]any{
    "prompt": fmt.Sprintf("Professional product photo: %s", product.Name),
    "model":  "fal-ai/flux/schnell",
})
fmt.Println(result["imageUrl"])
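
The Python example above awaits each generation call in sequence. The three generation calls are independent of one another, so they could also run concurrently with asyncio.gather; the same pattern applies to the awaitable app.ai_generate_* methods. A sketch with stand-in coroutines in place of the SDK calls:

```python
import asyncio

# Stand-ins for app.ai_generate_image / _audio / _video; the real
# methods are awaitable, so the same gather() pattern applies.
async def generate_image(prompt: str) -> dict:
    await asyncio.sleep(0)  # placeholder for the provider round-trip
    return {"kind": "image", "prompt": prompt}

async def generate_audio(text: str) -> dict:
    await asyncio.sleep(0)
    return {"kind": "audio", "text": text}

async def generate_video(prompt: str) -> dict:
    await asyncio.sleep(0)
    return {"kind": "video", "prompt": prompt}

async def generate_assets(name: str) -> list:
    # The three calls are independent, so run them concurrently.
    return await asyncio.gather(
        generate_image(f"Professional product photo: {name}"),
        generate_audio(f"Introducing {name}"),
        generate_video(f"Product demonstration video: {name} in use"),
    )

assets = asyncio.run(generate_assets("Widget"))
print([a["kind"] for a in assets])  # ['image', 'audio', 'video']
```

With real provider round-trips of several seconds each, gathering the calls bounds the workflow by the slowest generation instead of the sum of all three.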

What you get back

The example performed four output-producing operations in one workflow: image generation, audio narration, video generation, and transcription. Each comes back as a typed result object with saveable files and URLs, not as an ad hoc provider response you have to normalize yourself.

{
  "image_url": "https://cdn.example.com/product_hero.webp",
  "audio_url": "https://cdn.example.com/product_narration.mp3",
  "video_url": "https://cdn.example.com/product_demo.mp4",
  "transcript": "Customer review text..."
}