In development

A document renders a video.

Explainer video is slow because every frame is hand-made. Video Factory treats it as a render target: author the story as beats in a JSON document, and the engine draws the diagrams, types the code, tracks the camera, speaks the narration, and burns the captions, deterministically, on your brand.

Get early access

JSON in, narrated MP4 out
Real output

Explainers it rendered.

Full videos the engine produced from a JSON document, with narration, captions, motion and sound. Published on the Fetching Westie channel.

  • Infrastructure: CloudFlare Deep Dive
  • Economics: The AI Job Paradox
  • AI: Open vs Closed AI Models
The model

The beat is the unit.

You do not lay out frames. You write a beat: the line it says, the scene it shows, and the moments motion should fire. The engine turns spoken words into a timeline and moves the picture to match.

  • Narration is the clock

    Each beat carries the line it narrates. Motion events anchor to the words, so picture and voice stay locked, frame for frame.

  • One canvas, a touring camera

    Not a deck of slides. Elements persist across beats and the camera travels, pushes in, and pulls back through one shared world.

  • Composition, not templates

    Beats are trees of primitives styled by brand tokens. The engine lays them out, so no two scenes are a filled-in stencil.
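
To make the anchoring idea concrete, here is a minimal Python sketch of how motion events tied to spoken words could be resolved to frame numbers. The beat field names (`say`, `motion`, `at_word`, `target`) and the word timings are hypothetical, chosen only for illustration; they are not the engine's actual schema.

```python
from dataclasses import dataclass

FPS = 30  # assumed frame rate for this sketch


@dataclass
class WordTiming:
    word: str
    start: float  # seconds from the start of the beat's narration


# Hypothetical beat: the line it says and the moments motion should fire.
beat = {
    "say": "Requests hit the edge before they reach origin",
    "motion": [
        {"verb": "draw", "target": "edge", "at_word": "edge"},
        {"verb": "packet", "target": "edge->origin", "at_word": "origin"},
    ],
}


def resolve_motion(beat, timings, fps=FPS):
    """Map each motion event to the frame where its anchor word starts."""
    starts = {}
    for t in timings:
        # If a word repeats, the first occurrence wins in this sketch.
        starts.setdefault(t.word.lower(), t.start)
    timeline = []
    for event in beat["motion"]:
        frame = round(starts[event["at_word"].lower()] * fps)
        timeline.append((frame, event["verb"], event["target"]))
    return sorted(timeline)


# Made-up word timings, standing in for what text-to-speech would report.
timings = [WordTiming(w, s) for w, s in [
    ("Requests", 0.0), ("hit", 0.4), ("the", 0.6), ("edge", 0.7),
    ("before", 1.1), ("they", 1.4), ("reach", 1.6), ("origin", 1.9),
]]

timeline = resolve_motion(beat, timings)
print(timeline)
```

Because the frame numbers are derived from the word timings, re-voicing the line at a different pace would move the motion with it.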

The vocabulary

One grammar. Every kind of scene.

Diagrams, sequences, code, terminals, charts, gauges and more, drawn from the same set of primitives and animated by the same motion verbs, all anchored to the narration.

  • Diagram: Nodes and edges, auto-laid.
  • Sequence: Lifelines and messages.
  • Code: Syntax-lit, types in.
  • Terminal: Prompts and output.
  • Chart: Line and area, scaled.
  • Bars: Values that fill.
  • Gauge: Dials that sweep.
  • Mesh: Generative node graph.
  • Quotable: Full-frame punchline.
Primitives
  • frame
  • text
  • shape
  • path
  • icon
  • mesh
  • arrow
  • gauge
  • chart
  • bartrack
  • image
  • cylinder
  • hexagon
  • stadium
  • donut
  • arc
Motion verbs
  • draw
  • packet
  • type
  • count
  • morph
  • fill
  • travel
  • camera
  • dissolve
  • reject
  • circle
  • underline
  • colorflip
  • rotate
  • stamp
  • highlight
Voice & captions

The narration runs the show.

Text-to-speech produces the voiceover and word-level timings. Captions burn in as word-by-word karaoke, and sound design is welded to the motion: a draw pops, a line of code clacks, a counter ticks. The final mix is levelled to a −16 LUFS loudness target.

  • Kokoro TTS: deterministic voice + word timings
  • Karaoke captions: synced to the spoken word
  • Welded SFX: draw→pop, type→clack, count→tick
  • −16 LUFS mix: music bed ducks under voice
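
As an illustration of how captions and sound could both hang off the same timings, the following Python sketch derives sound cues from motion verbs and karaoke caption states from word timings. The verb-to-sound pairs are the ones listed above; the function names, data shapes, and the bracket notation for the active word are assumptions made for this example.

```python
FPS = 30  # assumed frame rate for this sketch

# Pairings named on this page: draw→pop, type→clack, count→tick.
SFX_FOR_VERB = {"draw": "pop", "type": "clack", "count": "tick"}


def sfx_cues(motion):
    """Emit one sound cue per motion event whose verb has a paired sound."""
    return [(frame, SFX_FOR_VERB[verb]) for frame, verb in motion
            if verb in SFX_FOR_VERB]


def karaoke(words, fps=FPS):
    """For each word, return (start_frame, caption line with that word marked)."""
    texts = [w for w, _ in words]
    states = []
    for i, (_, start) in enumerate(words):
        line = " ".join(f"[{w}]" if j == i else w for j, w in enumerate(texts))
        states.append((round(start * fps), line))
    return states


# Hypothetical inputs: (frame, verb) motion events and (word, start_seconds).
motion = [(21, "draw"), (57, "type"), (90, "camera")]
words = [("Narration", 0.0), ("is", 0.5), ("the", 0.7), ("clock", 0.9)]

cues = sfx_cues(motion)
captions = karaoke(words)
print(cues)
for frame, line in captions:
    print(frame, line)
```

The camera move produces no cue here only because no sound is paired with it in the list above.
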
Render

Deterministic, brand-tokened, any aspect.

The same draw code runs on GPU and CPU. Colours, type, spacing and motion all come from brand tokens, so a rebrand is a token swap. The same document renders the same video, every time.

  • 3840×2160 (4K, 16:9): Long-form for YouTube. HEVC 10-bit, 30 fps.
  • 1080×1920 (9:16 shorts): Reels, Shorts, TikTok, composed portrait.
  • Deterministic (same in → same out): Render twice, byte-identical. Seeded, cacheable.
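
A rough Python sketch of the two ideas in this section: resolving brand tokens in a scene tree, and deriving a stable cache key so identical inputs can be recognised as the same render. The `$name` token syntax, the scene fields, and both function names are hypothetical; only the `frame` and `text` primitive names come from the list above.

```python
import hashlib
import json


def resolve_tokens(node, tokens):
    """Recursively replace "$name" strings with values from the token table."""
    if isinstance(node, dict):
        return {k: resolve_tokens(v, tokens) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve_tokens(v, tokens) for v in node]
    if isinstance(node, str) and node.startswith("$"):
        return tokens[node[1:]]
    return node


def render_key(document, tokens, seed=0):
    """Hash a canonical JSON form of the inputs to get a stable cache key."""
    payload = json.dumps(
        {"doc": document, "tokens": tokens, "seed": seed},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical scene tree and two hypothetical brands.
scene = {
    "type": "frame",
    "fill": "$surface",
    "children": [{"type": "text", "value": "Edge", "color": "$accent"}],
}
brand_a = {"surface": "#0b1020", "accent": "#5eead4"}
brand_b = {"surface": "#ffffff", "accent": "#d946ef"}

styled_a = resolve_tokens(scene, brand_a)
styled_b = resolve_tokens(scene, brand_b)
print(styled_a["children"][0]["color"], styled_b["children"][0]["color"])
print(render_key(scene, brand_a) == render_key(scene, brand_a))
```

The scene itself never changes between brands; only the token table does, which is what makes a rebrand a token swap. The key covers the inputs only, so byte-identical output still depends on the renderer itself being deterministic.
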
Authoring

Built by an agent, in a loop.

The engine exposes its whole surface as tools. An agent writes a beat, renders a frame, looks at it, and fixes it: the same write, render, look, fix loop a motion designer runs. Quality gates refuse flat motion and frozen cameras.

  • get_schema()
  • render_frame()
  • synthesize()
  • export()
  • lint()
  • motion_review()
  • camera_review()
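
The loop can be sketched in Python as follows. The functions below are local stand-ins that borrow three tool names from the list above; their signatures, return values, and the rules they check are assumptions made for this example, and the render-and-look step is reduced to the review stubs.

```python
def lint(beat):
    """Stand-in for a lint tool: report structural problems as strings."""
    return [] if beat.get("say") else ["beat has no narration line"]


def motion_review(beat):
    """Stand-in quality gate: refuse a beat with no motion at all."""
    return [] if beat.get("motion") else ["flat motion: no motion events"]


def camera_review(beat):
    """Stand-in quality gate: refuse a beat where the camera never moves."""
    verbs = {m["verb"] for m in beat.get("motion", [])}
    return [] if "camera" in verbs else ["frozen camera: no camera move"]


def fix(beat, issue):
    """Stand-in for the agent's edit step; returns a revised copy of the beat."""
    revised = {**beat, "motion": list(beat.get("motion", []))}
    if issue.startswith("flat motion"):
        revised["motion"].append({"verb": "draw", "target": "diagram"})
    elif issue.startswith("frozen camera"):
        revised["motion"].append({"verb": "camera", "target": "diagram"})
    return revised


def author(beat, max_rounds=5):
    """Write, review, fix: repeat until every gate passes or rounds run out."""
    for round_no in range(1, max_rounds + 1):
        issues = lint(beat) + motion_review(beat) + camera_review(beat)
        if not issues:
            return beat, round_no
        beat = fix(beat, issues[0])  # address one issue per round
    raise RuntimeError("quality gates still failing after max_rounds")


draft = {"say": "The beat is the unit.", "motion": []}
final, rounds = author(draft)
print(rounds, [m["verb"] for m in final["motion"]])
```

Fixing one issue per round keeps each revision small, so a failing gate can be traced back to a single change.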

Want to shape it?

Video Factory is in active development. If explainer video is a real cost for you, tell us the kind of thing you need to explain, and we will bring you into early access.