VO to Images
VO to Images (Voiceover to Images) is a multi-step AI pipeline that transforms voiceover recordings into matching scene images. It takes an audio narration, transcribes it, identifies scenes and characters, generates tailored image prompts for each scene, and then bulk-generates all the images — automating what would otherwise be hours of manual work.
This tool is ideal for narrated content like explainer videos, documentary-style videos, storytelling content, educational walkthroughs, and any format where visuals need to match spoken narration.
Pipeline Overview
The VO to Images pipeline guides you through 8 sequential steps. Each step must be completed (or at least reviewed) before moving to the next. You can go back to any previous step to make changes.
Step 1: Upload & Transcribe
The first step is getting your voiceover audio into the system and converting it to text.
Audio Upload: Upload your voiceover audio file in a common format such as MP3, WAV, or M4A. Flokan then transcribes the audio to text automatically using AI.
SRT File Support: If you already have subtitles or a transcript, you can import an SRT file instead of (or in addition to) the audio. SRT files with word-level timing provide the most precise scene segmentation later in the pipeline.
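For readers unfamiliar with the SRT format, the sketch below shows the timing data such a file carries. It is a minimal illustration in Python, not Flokan's importer: the `Cue` class and `parse_srt` function are hypothetical names, and the parser handles only the basic layout of an index line, a timing line, and one or more text lines.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Cue:
    index: int
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    text: str


# SRT timestamps look like 00:01:02,500 (hours:minutes:seconds,milliseconds).
_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def _to_seconds(stamp: str) -> float:
    match = _TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(content: str) -> List[Cue]:
    """Parse basic SRT text into a list of cues (hypothetical helper)."""
    cues = []
    normalized = content.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        if len(lines) < 3:
            continue  # skip cues that lack an index, timing, or text line
        start, _, end = lines[1].partition("-->")
        text = " ".join(line.strip() for line in lines[2:])
        cues.append(Cue(int(lines[0].strip()), _to_seconds(start), _to_seconds(end), text))
    return cues


SAMPLE = """1
00:00:00,000 --> 00:00:02,500
Meet Ada, a young engineer.

2
00:00:04,000 --> 00:00:06,250
She builds her first machine.
"""

cues = parse_srt(SAMPLE)
for cue in cues:
    print(cue.index, cue.start, cue.end, cue.text)
```

The start and end times of each cue are what make timing-based scene segmentation possible later in the pipeline.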
Review & Edit: After transcription, review the text for accuracy. You can edit any part of the transcript: fix names, technical terms, or any words the transcription got wrong. Every later step works from this text, so errors left here carry into scene segmentation and prompt generation.
Step 2: Character Selection
The AI analyzes your transcript and identifies characters, subjects, or recurring entities mentioned in the narration.
Automatic Detection: The system scans the transcript for character names, descriptions, and references. It presents you with a list of detected characters for review.
Character Creation: For each detected character (or for any character the AI missed), you can:
- Provide a detailed visual description (appearance, clothing, age, etc.)
- Upload reference images so the AI knows what they should look like
- Set a character name for consistent reference throughout the pipeline
Character Library: Characters you create are saved and can be reused across multiple projects. If you're creating a series with recurring characters, you only need to define them once.
Step 3: Prompt Generation
This is where the AI creates specific image generation prompts for each scene in your voiceover.
Scene Segmentation: The transcript is broken into logical scenes — each segment representing a distinct visual moment. The AI determines scene boundaries based on topic changes, pauses, and narrative structure.
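To make the role of pauses concrete, here is a deliberately simplified sketch that splits timed transcript cues into scenes using silence alone. It is an assumption for illustration, not the pipeline's actual method: the function name `split_into_scenes` and the 1-second `max_pause` threshold are invented here, and topic changes and narrative structure are not modeled at all.

```python
def split_into_scenes(cues, max_pause=1.0):
    """Group (start, end, text) cues into scenes.

    A new scene starts whenever the silence between two consecutive cues
    is longer than max_pause seconds (hypothetical threshold).
    """
    scenes = []
    for start, end, text in cues:
        if scenes and start - scenes[-1]["end"] <= max_pause:
            # Short gap: extend the current scene.
            scenes[-1]["end"] = end
            scenes[-1]["text"] += " " + text
        else:
            # First cue or a long pause: open a new scene.
            scenes.append({"start": start, "end": end, "text": text})
    return scenes


cues = [
    (0.0, 2.5, "Meet Ada, a young engineer."),
    (2.8, 5.0, "She works late every night."),
    (7.5, 10.0, "One day, the machine wakes up."),
]

scenes = split_into_scenes(cues)
for number, scene in enumerate(scenes, start=1):
    print(number, scene["start"], scene["end"], scene["text"])
```

In this example the first two cues merge into one scene because only 0.3 seconds separate them, while the 2.5-second pause before the third cue opens a second scene.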
AI-Powered Prompts: For each scene, the AI generates a detailed image prompt that:
- Describes the visual composition appropriate for that part of the narration
- Incorporates character descriptions from Step 2
- Maintains visual continuity between scenes
- Includes style, lighting, and mood appropriate to the content
Manual Editing: Review and edit every generated prompt before moving to image generation. You have full control — rewrite prompts entirely, adjust details, add specific visual elements, or remove things you don't want.
Step 4: Image Generation
With prompts finalized, the pipeline generates images for all scenes in bulk.
Bulk Generation: All scene images are generated in a single batch operation. For a voiceover with many scenes, one batch can produce dozens of images without further input.
Model Selection: Choose which AI model tier to use for generation:
- Standard — Fast and economical for draft versions
- Pro — Higher quality for production use
- Creative — More artistic and varied results
- Premium-Pro — Highest quality with premium Airforce models
Real-Time Progress: A progress tracker shows the status of each scene's generation in real time. You can see which scenes are complete, which are in progress, and which are queued.
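The three states named above can be summarized with a simple tally. The sketch below is a hypothetical illustration of how such a summary could be computed from per-scene statuses. The status strings and the `summarize_progress` function are assumptions and do not describe Flokan's internal data model.

```python
from collections import Counter


def summarize_progress(statuses):
    """Count scenes per state and compute the share that is complete.

    statuses maps a scene number to "complete", "in_progress", or "queued"
    (hypothetical status names).
    """
    counts = Counter(statuses.values())
    total = len(statuses)
    percent = round(100 * counts["complete"] / total) if total else 0
    return {
        "complete": counts["complete"],
        "in_progress": counts["in_progress"],
        "queued": counts["queued"],
        "percent_complete": percent,
    }


statuses = {1: "complete", 2: "complete", 3: "in_progress", 4: "queued"}
summary = summarize_progress(statuses)
print(summary)
```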
Steps 5–8: Review & Refinement
The remaining steps focus on reviewing, refining, and finalizing your generated images:
Review: Examine each generated image alongside its scene's narration text. Flag any images that don't match the intended scene or need improvement.
Regenerate: For flagged images, modify the prompt and regenerate just that scene. You don't need to redo the entire batch — regenerate individual scenes as needed.
Refine: Make final adjustments to prompts and regenerate any remaining scenes that need work.
Approve: Mark all images as approved. Approved images are ready to use in your video production.
Project Management
VO to Images supports multiple concurrent projects:
- Project Dashboard — See all your projects with their current step and progress percentage
- Auto-Save — Progress is saved automatically at every step. Close the browser and come back later without losing anything.
- Resume Anytime — Pick up any project exactly where you left off
- Project Settings — Configure per-project settings like default model tier and generation preferences
Credits
VO to Images can consume a significant number of credits, especially with many scenes and higher model tiers. A 10-minute voiceover might have 15–30 scenes, each requiring one or more image generations. Monitor your credit balance in Settings → Personal Billing.
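A rough estimate before starting a batch can help avoid surprises. The sketch below multiplies scenes by generations per scene by cost per image. The credit costs used here are invented placeholders, since this page does not state the actual cost of any model tier, and `estimate_credits` is a hypothetical helper rather than a Flokan feature.

```python
import math


def estimate_credits(scene_count, credits_per_image, generations_per_scene=1.0):
    """Estimate total credits for a project, rounded up to a whole credit.

    generations_per_scene above 1.0 accounts for regenerating some scenes.
    """
    return math.ceil(scene_count * generations_per_scene * credits_per_image)


# Placeholder costs for illustration only; check your plan for real values.
low = estimate_credits(scene_count=15, credits_per_image=2)
high = estimate_credits(scene_count=30, credits_per_image=5, generations_per_scene=1.5)
print(f"Estimated range: {low} to {high} credits")
```

With these placeholder numbers, a 15-scene draft at 2 credits per image comes to 30 credits, while a 30-scene project at 5 credits per image with half the scenes regenerated once comes to 225 credits.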
Permissions
This feature requires the appropriate role permissions and must be enabled for your workspace. Contact your workspace admin if you can't access it.