VO to Images
VO to Images (Voiceover to Images) is a multi-step AI pipeline that transforms voiceover recordings into matching scene images. It takes an audio narration, transcribes it, identifies scenes and characters, generates tailored image prompts for each scene, and then bulk-generates all the images — automating what would otherwise be hours of manual work.
This tool is ideal for narrated content like explainer videos, documentary-style videos, storytelling content, educational walkthroughs, and any format where visuals need to match spoken narration.
Pipeline Overview
The VO to Images pipeline guides you through 8 sequential steps: Audio → Characters → Models → Segments → Storyboard → Images → Videos → Export. Each step must be completed (or at least reviewed) before moving to the next. You can go back to any previous step to make changes.
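The step ordering described above can be pictured as a small state machine. This is a minimal sketch for illustration only: `PipelineProgress` and its methods are hypothetical names, not part of Flokan, and the rule that a step must be marked reviewed before advancing is an interpretation of the paragraph above.

```python
from dataclasses import dataclass, field
from typing import Set

# The eight steps, in the order the pipeline presents them.
STEPS = ["Audio", "Characters", "Models", "Segments",
         "Storyboard", "Images", "Videos", "Export"]


@dataclass
class PipelineProgress:
    """Hypothetical tracker: which step is open and which were reviewed."""
    current: int = 0
    reviewed: Set[int] = field(default_factory=set)

    def mark_reviewed(self) -> None:
        """Record that the open step has been completed or reviewed."""
        self.reviewed.add(self.current)

    def advance(self) -> str:
        """Move to the next step; refuse if the open step is unreviewed."""
        if self.current not in self.reviewed:
            raise ValueError(f"Review '{STEPS[self.current]}' before moving on")
        if self.current < len(STEPS) - 1:
            self.current += 1
        return STEPS[self.current]

    def go_back(self, step_name: str) -> str:
        """Reopen an earlier step (or stay on the current one)."""
        index = STEPS.index(step_name)
        if index > self.current:
            raise ValueError("go_back only moves to an earlier step")
        self.current = index
        return STEPS[self.current]
```

In this sketch, going back keeps the earlier review marks, so you can return to the step you left without re-reviewing everything in between.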
Step 1: Audio
The first step is getting your voiceover audio into the system and converting it to text.
Audio Upload: Upload your voiceover as an audio file in a common format such as MP3, WAV, or M4A. Flokan's AI-powered transcription then converts the audio to text automatically.
SRT File Support: If you already have subtitles or a transcript, you can import an SRT file instead of (or in addition to) the audio. SRT files with word-level timing provide the most precise scene segmentation later in the pipeline.
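To show what timing information an SRT file carries, here is a minimal parsing sketch. It assumes the standard SubRip layout (an index line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, then one or more text lines). `Cue`, `to_seconds`, and `parse_srt` are illustrative names and do not describe how Flokan reads SRT files internally.

```python
import re
from dataclasses import dataclass
from typing import List

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,250".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


@dataclass
class Cue:
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio
    text: str


def to_seconds(hours: str, minutes: str, seconds: str, millis: str) -> float:
    """Convert the four captured timestamp fields to seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_srt(content: str) -> List[Cue]:
    """Parse SRT text into cues; blocks without a timing line are skipped."""
    cues: List[Cue] = []
    normalized = content.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.splitlines()
        for index, line in enumerate(lines):
            match = TIMING.search(line)
            if match is None:
                continue
            groups = match.groups()
            # Everything after the timing line is the cue's text.
            text = " ".join(part.strip() for part in lines[index + 1:])
            cues.append(Cue(to_seconds(*groups[:4]), to_seconds(*groups[4:]), text))
            break
    return cues
```

In a word-level SRT, each cue holds a single word, so the gaps between consecutive `end` and `start` values expose pauses at word granularity.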
Review & Edit: After transcription, review the text for accuracy. You can edit any part of the transcript — fix names, technical terms, or any words the transcription got wrong. Getting the transcript right here ensures better results in all subsequent steps.
Model Selection: Choose the AI image model tier to use for image generation throughout the project:
- Pro — Higher quality for production use (Gemini)
- Creative — More artistic and varied results (Ideogram)
Step 2: Characters
The AI analyzes your transcript and identifies characters, subjects, or recurring entities mentioned in the narration.
Automatic Detection: The system scans the transcript for character names, descriptions, and references. It presents you with a list of detected characters for review.
Step 3: Models
Create visual references for each detected character (or any character the AI missed):
- Provide a detailed visual description (appearance, clothing, age, etc.)
- Upload reference images so the AI knows what they should look like
- Set a character name for consistent reference throughout the pipeline
Character Library: Characters you create are saved and can be reused across multiple projects. If you're creating a series with recurring characters, you only need to define them once.
Step 4: Segments
The transcript is broken into logical scenes — each segment representing a distinct visual moment.
Scene Segmentation: The AI determines scene boundaries based on topic changes, pauses, and narrative structure. You can review and adjust the segment boundaries manually.
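Of the three signals mentioned above, pauses are the easiest to illustrate. The sketch below starts a new segment whenever the silence between two timed cues reaches a threshold. It is a simplified stand-in, not Flokan's segmentation logic: it ignores topic changes and narrative structure, and `split_on_pauses` and the 1.0-second default for `min_pause` are assumptions chosen for the example.

```python
from typing import List, Tuple

# (start_seconds, end_seconds, text) for one timed piece of narration.
TimedText = Tuple[float, float, str]


def split_on_pauses(cues: List[TimedText], min_pause: float = 1.0) -> List[List[TimedText]]:
    """Group cues into segments, starting a new one after each long pause."""
    segments: List[List[TimedText]] = []
    for cue in cues:
        if segments:
            previous_end = segments[-1][-1][1]
            gap = cue[0] - previous_end
            if gap < min_pause:
                # Short gap: the narration continues within the same scene.
                segments[-1].append(cue)
                continue
        # First cue, or a pause of at least min_pause: open a new segment.
        segments.append([cue])
    return segments


cues = [
    (0.0, 1.0, "The fox woke early."),
    (1.2, 2.0, "It stretched and yawned."),
    (3.5, 4.0, "Far away, a storm was forming."),
]
segments = split_on_pauses(cues)
```

Here the 0.2-second gap keeps the first two cues together, while the 1.5-second gap before the third cue starts a second segment. A pause-only split like this is a rough draft at best, which is one reason to review the boundaries manually.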
Step 5: Storyboard
This is where the AI creates specific image generation prompts for each scene in your voiceover.
AI-Powered Prompts: For each scene, the AI generates a detailed image prompt that:
- Describes the visual composition appropriate for that part of the narration
- Incorporates character descriptions from Steps 2–3
- Maintains visual continuity between scenes
- Includes style, lighting, and mood appropriate to the content
Manual Editing: Review and edit every generated prompt before moving to image generation. You have full control — rewrite prompts entirely, adjust details, add specific visual elements, or remove things you don't want.
Step 6: Images
With prompts finalized, the pipeline generates images for all scenes in bulk.
Bulk Generation: All scene images are generated in a single batch operation. For a project with many scenes, one batch can produce dozens of images without further input.
Real-Time Progress: A progress tracker shows the status of each scene's generation in real time: which scenes are complete, which are in progress, and which are queued.
Review & Regenerate: Examine each generated image alongside its scene's narration text. For any images that don't match, modify the prompt and regenerate just that scene — you don't need to redo the entire batch.
Step 7: Videos
Generate short video clips from your scene images using AI video models.
Motion Prompts: Each scene can have a motion prompt that describes how the image should animate — camera movements, subject motion, and visual effects.
Video Generation: Videos are generated in bulk with real-time progress tracking, similar to image generation. You can regenerate individual scene videos as needed.
Step 8: Export
Finalize your project and export its assets.
Export Options: Download your generated images, videos, and project data for use in your video editing software.
Project Management
VO to Images supports multiple concurrent projects:
- Project Dashboard — See all your projects with their current step and progress percentage
- Auto-Save — Progress is saved automatically at every step. Close the browser and come back later without losing anything.
- Resume Anytime — Pick up any project exactly where you left off
- Project Settings — Configure per-project settings like default model tier and generation preferences
Credits
VO to Images can consume a significant number of credits, especially with many scenes and higher model tiers. A 10-minute voiceover might have 15–30 scenes, each requiring one or more image generations. Monitor your credit balance in Settings → Personal Billing.
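The arithmetic behind a rough budget is scenes × generations per scene × credits per image. The sketch below applies it to the 15–30 scene range mentioned above. The per-image cost is a placeholder, not a Flokan price, and `estimate_image_credits` is a hypothetical helper; substitute the actual rate for your model tier.

```python
def estimate_image_credits(scenes: int,
                           generations_per_scene: float,
                           credits_per_image: float) -> float:
    """Rough image-credit estimate: scenes x generations x cost per image."""
    if scenes < 0 or generations_per_scene < 0 or credits_per_image < 0:
        raise ValueError("all inputs must be non-negative")
    return scenes * generations_per_scene * credits_per_image


# Placeholder rate for illustration only; look up the real cost per image.
PLACEHOLDER_CREDITS_PER_IMAGE = 1.0

# Best case: 15 scenes, each image accepted on the first generation.
low = estimate_image_credits(15, 1, PLACEHOLDER_CREDITS_PER_IMAGE)

# Heavier case: 30 scenes, each regenerated once on average (2 generations).
high = estimate_image_credits(30, 2, PLACEHOLDER_CREDITS_PER_IMAGE)
```

With the placeholder rate of 1.0, the two cases come to 15 and 60 credits, a fourfold spread driven entirely by scene count and regeneration. Video generation in Step 7 is not covered by this estimate.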
Permissions
This feature requires the appropriate role permissions and must be enabled for your workspace. Contact your workspace admin if you can't access it.
REST API
VO to Images is also available via the Flokan public API. Because the pipeline is multi-step and long-running, the API is split into a small set of endpoints you call in sequence — there is no single "do it all" call that returns finished images. See the VO to Images API reference for endpoints, authentication, and a worked example.
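Long-running, multi-step APIs are commonly consumed with a poll-until-done loop between calls. The sketch below shows that general pattern only. It names no Flokan endpoints: `fetch_status` is a function you would write yourself using the API reference, and the status strings "completed" and "failed" are placeholders, not documented values.

```python
import time
from typing import Callable, Iterable


def wait_until_done(fetch_status: Callable[[], str],
                    done: Iterable[str] = ("completed",),
                    failed: Iterable[str] = ("failed",),
                    interval: float = 5.0,
                    timeout: float = 600.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch_status repeatedly until it reports a terminal state.

    fetch_status is supplied by the caller and should return the current
    status string of one pipeline step (placeholder values assumed here).
    """
    done_states = set(done)
    failed_states = set(failed)
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in done_states:
            return status
        if status in failed_states:
            raise RuntimeError(f"step ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        # Wait before asking again so the API is not polled in a tight loop.
        sleep(interval)
```

You would call a helper like this after starting each step and move on to the next endpoint only once it returns. The `sleep` parameter is injectable so the loop can be tested without real delays.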