Building a vignette skill with Veo and personal notes
30-second video pieces from photos. The copy started hollow until I pulled quotes from my vault and Readwise—external language that carried weight.
I wanted to make 30-second video pieces from photos. Phone photo + theme → short ambient video with text overlays and a music bed. Something that felt personal, not stock footage with captions.
The piece that finally worked came from being stuck. A snowstorm had buried the roads to Vermont—I was supposed to be skiing, but instead I was in my New York apartment watching a plow crawl down the street. Same snow that should have meant powder days, just in the wrong place.
I took a photo from my window. That became the input.
The first attempts were flat. I’d describe a theme like “presence vs performance” and get back generic copy: “what kind of presence requires no proof?” It read like a meditation app.
Theme can’t be the subject
The initial drafts were abstract. Snow falling, cinnamon rolls cooling, “calculating the shot.” The feedback I gave myself: those are too literal, too definitive. Fold the arc and emotional presence of a story into something more specific and easier to experience.
The theme can’t be the subject. You have to put the theme inside a specific experience.
Second attempt was better—a family story about a cousin visiting from Houston, the afternoon that exists nowhere except in memory. But it raised a question: do you actually want faces of people you know in something like this?
B-roll, not portraits
The videos work better as B-roll. No faces—just objects and spaces:
- Coffee cup, steam, morning light
- Kitchen table after a meal
- Window with condensation
- Empty chair
Text captions go on top. The sequence of B-roll tells the story; the copy does the emotional work. You’re observing a scene, not inhabiting a character. The images provide grounding, the words provide meaning.
Three beats in 30 seconds
For something this short, you need exactly three emotional beats:
- Recognition (0:00-0:12): Familiar scene, viewer settles
- The Turn (0:12-0:25): Reframe—something noticed, something missing
- The Landing (0:25-0:35): Meaning arrives, weight
This is story structure compressed to minimum. Each beat gets one image, one movement direction, one text overlay.
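One way to make the constraint concrete is a small data structure per beat. This is a sketch with field names of my own choosing, not the skill's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Beat:
    name: str      # "Recognition", "The Turn", "The Landing"
    start: float   # seconds into the piece
    end: float
    image: str     # path to the source photo for this beat (hypothetical)
    movement: str  # exactly one camera movement direction
    overlay: str   # exactly one text overlay

# The three-beat skeleton from above, with placeholder assets
beats = [
    Beat("Recognition", 0.0, 12.0, "window.jpg", "slow drift downward", "familiar scene"),
    Beat("The Turn", 12.0, 25.0, "street.jpg", "slow push in", "something noticed"),
    Beat("The Landing", 25.0, 35.0, "chair.jpg", "hold, minimal drift", "meaning arrives"),
]

# Beats must tile the timeline without gaps
assert all(a.end == b.start for a, b in zip(beats, beats[1:]))
```

Forcing one image, one movement, and one overlay per beat keeps the structure honest: if a beat needs two of anything, it is really two beats.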
Found footage without saying it
The aesthetic should embody the theme. Found footage = presence without performance. The imperfection is the message—this wasn’t made for anyone to see.
Elements embedded in every video prompt:
- High ISO grain (digital sensor noise + fine film grain)
- Soft focus at edges
- Slight handheld drift (never smooth gimbal)
- Colors muted, lifted blacks
- One moving element per shot (steam, snow, light shift)
Never say “found footage” in prompts. The qualities are there; the label isn’t.
The video prompt formula
For Veo with image input:
[Scene from image]. [Camera behavior: minimal drift / handheld wobble /
slow push]. [Lighting]. [One element that moves]. High ISO grain,
soft focus edges, muted colors, lifted blacks. [Duration].
Example from a snowy street photo:
Snow-covered New York street seen from above through window. Plow truck pushing through heavy snow, buried cars on both sides. Gray winter light, overcast. Slow drift downward through the frame. Snow continues falling. Plow inches forward, exhaust visible in cold air. High ISO grain, soft focus at edges, muted colors, lifted blacks. Slight handheld drift. 9:16 vertical. 12 seconds.
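The template can be sketched as a small formatter. The function and constant names are my own; the point is that the found-footage qualities live in a fixed suffix, so the label never appears in the prompt:

```python
# Encodes the found-footage look without ever naming it
FOUND_FOOTAGE_SUFFIX = (
    "High ISO grain, soft focus at edges, muted colors, lifted blacks. "
    "Slight handheld drift."
)

def build_video_prompt(scene, camera, lighting, moving_element,
                       aspect="9:16 vertical", seconds=12):
    """Fill the [Scene][Camera][Lighting][Movement] template (hypothetical helper)."""
    return (
        f"{scene}. {camera}. {lighting}. {moving_element}. "
        f"{FOUND_FOOTAGE_SUFFIX} {aspect}. {seconds} seconds."
    )

prompt = build_video_prompt(
    scene="Snow-covered New York street seen from above through window",
    camera="Slow drift downward through the frame",
    lighting="Gray winter light, overcast",
    moving_element="Snow continues falling",
)
```

Every beat gets the same suffix, so the grain, focus, and drift stay consistent across shots even as scene and movement change.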
Music as single prompt
Initial approach was section-by-section music cues. Didn’t work with AI music tools—they need one comprehensive prompt:
18-second ambient piece. 72 BPM. D minor.
Opens immediately with low drone and muffled city-through-glass texture.
A single piano note, sustained. No build—just presence. Hold for 3 seconds.
At 0:03, the drone lifts. A pad enters with air in it—something like hope,
but uncertain. Three ascending notes that don't resolve. Brief momentum.
5 seconds only.
At 0:08, everything pulls back sharply. The ascending fragment inverts,
descends. Drone thins to almost nothing. Just the piano note again, lower,
and room tone. Sits in stillness for 10 seconds. Ends on held breath.
Feel: The quick cut from effort to escape to diagnosis.
Reference: Nils Frahm "Says" intro, Grouper "Holding".
The structure: duration, BPM, and key; opening texture; change points with timestamps; emotional qualities; a concrete “feel” line; one or two reference tracks for vibe.
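Those fields can be assembled into one comprehensive prompt with a small builder. A sketch, assuming this field breakdown (the function name and signature are mine):

```python
def build_music_prompt(duration_s, bpm, key, opening, changes, feel, references):
    """Assemble one comprehensive music prompt from structured fields."""
    lines = [f"{duration_s}-second ambient piece. {bpm} BPM. {key}.", opening]
    for timestamp, description in changes:  # change points with timestamps
        lines.append(f"At {timestamp}, {description}")
    lines.append(f"Feel: {feel}")
    lines.append(f"Reference: {', '.join(references)}")
    return "\n".join(lines)

prompt = build_music_prompt(
    duration_s=18, bpm=72, key="D minor",
    opening="Opens immediately with low drone and muffled city-through-glass texture.",
    changes=[
        ("0:03", "the drone lifts. A pad enters with air in it."),
        ("0:08", "everything pulls back sharply. Drone thins to almost nothing."),
    ],
    feel="The quick cut from effort to escape to diagnosis.",
    references=['Nils Frahm "Says" intro', 'Grouper "Holding"'],
)
```

Keeping the fields structured means a timing revision (say, moving a change point) is a one-line edit rather than rewriting prose.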
Vault exploration changed the copy
The generic variants were hollow. “Same snow” / “wrong mountain” / “still here”—clever but weightless.
I pushed to explore personal notes via Enzyme. The search pulled up writing about “chasing adventure first, hospitality later” and “starting from scratch every time I move to a new city.” Notes about ethnic identity and travel, about proving yourself in new places.
The connection clicked: being sad about missing a ski trip wasn’t just about missing powder. It was about the same restlessness I’d been writing about—escape as another form of ambition, the mountain as proving ground dressed up as freedom.
When external quotes work
This particular piece landed on Eugene Peterson quotes from “A Long Obedience in the Same Direction”:
- “We take the energies that make for aspiration”
- “and remove God from the picture, replacing him with our own crudely sketched self-portrait”
- “Ambition is aspiration gone crazy”
His language was too good to paraphrase. The three quotes built an arc: energy → substitution → diagnosis. Each beat got a longer phrase instead of the punchy 3-5 word overlays I’d planned.
This isn’t a general pattern—usually short copy works better. But when external language captures exactly what the piece is about, use it. The quotes came from Readwise highlights, searched via Enzyme. Same semantic search that found personal notes also found the reading that shaped my thinking.
The stack
- Video: Veo 3.1 via Python SDK, image input per beat
- Music: External generation (Suno/Udio) from comprehensive prompt
- Compositing: Manual assembly with text overlays
- Orchestration: Claude Code skill for interactive workflow
- Source material: Enzyme for vault + Readwise search
The generate_videos.py script runs all beats in parallel using ThreadPoolExecutor. Three videos, three workers, ~5 minutes total instead of 15 sequential.
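The parallel run might look something like this. `generate_beat` here is a stand-in for the actual Veo SDK call, not the script's real code:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def generate_beat(beat_prompt: str) -> str:
    # Stand-in for the Veo 3.1 call; the real version would submit the
    # prompt + image, poll the operation, and download the clip.
    return f"out/clip_{abs(hash(beat_prompt)) % 1000}.mp4"

def generate_all(prompts: list[str], workers: int = 3) -> list[str]:
    # One worker per beat: three ~5-minute generations overlap
    # instead of running back to back.
    results = [None] * len(prompts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(generate_beat, p): i for i, p in enumerate(prompts)}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()  # keep beat order
    return results
```

Threads are the right fit here because the work is I/O-bound (waiting on the API), so the GIL isn't a bottleneck.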
The workflow now
- Photo(s) + theme → generate 2-3 beat structure variants
- User picks/modifies → lock in structure
- Generate music prompt
- Generate videos via Veo (parallel)
- Output: videos + music prompt + copy timing doc
The vault integration is optional but changes the quality. When the copy comes from actual notes—things I’ve written or highlighted—it carries weight that invented phrases don’t.
Rough edges
The timing revision was manual. I asked for 3s/5s/10s beats instead of 12s/13s/10s. The videos were already generated at full length; I just updated the music prompt and copy timing docs. Actual trimming happens in edit.
Veo sometimes generates movement that doesn’t match the prompt. The “slow drift downward” might become a zoom. I accept 80% match and move on—regenerating costs time and the imperfection often works.
No automated compositing yet. Videos + music + text overlays assembled manually. Could script this, but the manual step is where I catch problems.
The piece
“Aspiration Gone Crazy”—18 seconds, three beats:
| Beat | Duration | Copy |
|---|---|---|
| 1 | 3s | “We take the energies that make for aspiration” |
| 2 | 5s | “and remove God from the picture, replacing him with our own crudely sketched self-portrait” |
| 3 | 10s | “Ambition is aspiration gone crazy” |
The final beat gets 10 seconds. The diagnosis needs space to land.