Building a vignette skill with Veo and personal notes
30-second video pieces from photos. The copy started hollow until I pulled quotes from my vault and Readwise—external language that carried weight.
I wanted to make 30-second video pieces from photos. Phone photo + theme → short ambient video with text overlays and a music bed. Something that felt personal, not stock footage with captions.
The piece that finally worked came from being stuck. A snowstorm had buried the roads to Vermont—I was supposed to be skiing, but instead I was in my New York apartment watching a plow crawl down the street. Same snow that should have meant powder days, just in the wrong place.
I took a photo from my window. That became the input.
The first attempts were flat. I’d describe a theme like “presence vs performance” and get back generic copy: “what kind of presence requires no proof?” It read like a meditation app.
Theme can’t be the subject
The initial drafts were abstract. Snow falling, cinnamon rolls cooling, “calculating the shot.” The feedback I gave myself: those are too literal, too definitive. Fold the arc and emotional presence of a story into something more specific and easier to experience.
The theme can’t be the subject. You have to put the theme inside a specific experience.
Second attempt was better—a family story about a cousin visiting from Houston, the afternoon that exists nowhere except in memory. But it raised a question: do you actually want faces of people you know in something like this?
B-roll, not portraits
The videos work better as B-roll. No faces—just objects and spaces:
- Coffee cup, steam, morning light
- Kitchen table after a meal
- Window with condensation
- Empty chair
Text captions go on top. The sequence of B-roll tells the story; the copy does the emotional work. You’re observing a scene, not inhabiting a character. The images provide grounding, the words provide meaning.
Three beats in 30 seconds
For something this short, you need exactly three emotional beats:
- Recognition (0:00-0:12): Familiar scene, viewer settles
- The Turn (0:12-0:25): Reframe—something noticed, something missing
- The Landing (0:25-0:35): Meaning arrives, weight
This is story structure compressed to minimum. Each beat gets one image, one movement direction, one text overlay.
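One way to make the constraint concrete is a small data structure per beat. This is a sketch with field names of my own choosing, not the skill's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Beat:
    name: str      # "Recognition", "The Turn", "The Landing"
    start: float   # seconds into the piece
    end: float
    image: str     # path to the source photo for this beat (hypothetical)
    movement: str  # exactly one camera movement direction
    overlay: str   # exactly one text overlay

# The three-beat skeleton from above, with placeholder assets
beats = [
    Beat("Recognition", 0.0, 12.0, "window.jpg", "slow drift downward", "familiar scene"),
    Beat("The Turn", 12.0, 25.0, "street.jpg", "slow push in", "something noticed"),
    Beat("The Landing", 25.0, 35.0, "chair.jpg", "hold, minimal drift", "meaning arrives"),
]

# Beats must tile the timeline without gaps
assert all(a.end == b.start for a, b in zip(beats, beats[1:]))
```

Forcing one image, one movement, and one overlay per beat keeps the structure honest: if a beat needs two of anything, it is really two beats.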
Found footage without saying it
The aesthetic should embody the theme. Found footage = presence without performance. The imperfection is the message—this wasn’t made for anyone to see.
Elements embedded in every video prompt:
- High ISO grain (digital sensor noise + fine film grain)
- Soft focus at edges
- Slight handheld drift (never smooth gimbal)
- Colors muted, lifted blacks
- One moving element per shot (steam, snow, light shift)
Never say “found footage” in prompts. The qualities are there; the label isn’t.
The video prompt formula
For Veo with image input:
[Scene from image]. [Camera behavior: minimal drift / handheld wobble /
slow push]. [Lighting]. [One element that moves]. High ISO grain,
soft focus edges, muted colors, lifted blacks. [Duration].
Example from a snowy street photo:
Snow-covered New York street seen from above through window. Plow truck pushing through heavy snow, buried cars on both sides. Gray winter light, overcast. Slow drift downward through the frame. Snow continues falling. Plow inches forward, exhaust visible in cold air. High ISO grain, soft focus at edges, muted colors, lifted blacks. Slight handheld drift. 9:16 vertical. 12 seconds.
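The template can be sketched as a small formatter. The function and constant names are my own; the point is that the found-footage qualities live in a fixed suffix, so the label never appears in the prompt:

```python
# Encodes the found-footage look without ever naming it
FOUND_FOOTAGE_SUFFIX = (
    "High ISO grain, soft focus at edges, muted colors, lifted blacks. "
    "Slight handheld drift."
)

def build_video_prompt(scene, camera, lighting, moving_element,
                       aspect="9:16 vertical", seconds=12):
    """Fill the [Scene][Camera][Lighting][Movement] template (hypothetical helper)."""
    return (
        f"{scene}. {camera}. {lighting}. {moving_element}. "
        f"{FOUND_FOOTAGE_SUFFIX} {aspect}. {seconds} seconds."
    )

prompt = build_video_prompt(
    scene="Snow-covered New York street seen from above through window",
    camera="Slow drift downward through the frame",
    lighting="Gray winter light, overcast",
    moving_element="Snow continues falling",
)
```

Every beat gets the same suffix, so the grain, focus, and drift stay consistent across shots even as scene and movement change.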
Music as single prompt
Initial approach was section-by-section music cues. Didn’t work with AI music tools—they need one comprehensive prompt:
18-second ambient piece. 72 BPM. D minor.
Opens immediately with low drone and muffled city-through-glass texture.
A single piano note, sustained. No build—just presence. Hold for 3 seconds.
At 0:03, the drone lifts. A pad enters with air in it—something like hope,
but uncertain. Three ascending notes that don't resolve. Brief momentum.
5 seconds only.
At 0:08, everything pulls back sharply. The ascending fragment inverts,
descends. Drone thins to almost nothing. Just the piano note again, lower,
and room tone. Sits in stillness for 10 seconds. Ends on held breath.
Feel: The quick cut from effort to escape to diagnosis.
Reference: Nils Frahm "Says" intro, Grouper "Holding".
The structure: duration, BPM, and key; opening texture; change points with timestamps; emotional qualities; a concrete “feel” line; one or two reference tracks for vibe.
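Those fields can be assembled into one comprehensive prompt with a small builder. A sketch, assuming this field breakdown (the function name and signature are mine):

```python
def build_music_prompt(duration_s, bpm, key, opening, changes, feel, references):
    """Assemble one comprehensive music prompt from structured fields."""
    lines = [f"{duration_s}-second ambient piece. {bpm} BPM. {key}.", opening]
    for timestamp, description in changes:  # change points with timestamps
        lines.append(f"At {timestamp}, {description}")
    lines.append(f"Feel: {feel}")
    lines.append(f"Reference: {', '.join(references)}")
    return "\n".join(lines)

prompt = build_music_prompt(
    duration_s=18, bpm=72, key="D minor",
    opening="Opens immediately with low drone and muffled city-through-glass texture.",
    changes=[
        ("0:03", "the drone lifts. A pad enters with air in it."),
        ("0:08", "everything pulls back sharply. Drone thins to almost nothing."),
    ],
    feel="The quick cut from effort to escape to diagnosis.",
    references=['Nils Frahm "Says" intro', 'Grouper "Holding"'],
)
```

Keeping the fields structured means a timing revision (say, moving a change point) is a one-line edit rather than rewriting prose.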
Vault exploration changed the copy
The generic variants were hollow. “Same snow” / “wrong mountain” / “still here”—clever but weightless.
I pushed to explore personal notes via Enzyme. The search pulled up writing about “chasing adventure first, hospitality later” and “starting from scratch every time I move to a new city.” Notes about ethnic identity and travel, about proving yourself in new places.
The connection clicked: being sad about missing a ski trip wasn’t just about missing powder. It was about the same restlessness I’d been writing about—escape as another form of ambition, the mountain as proving ground dressed up as freedom.
When external quotes work
This particular piece landed on Eugene Peterson quotes from “A Long Obedience in the Same Direction”:
- “We take the energies that make for aspiration”
- “and remove God from the picture, replacing him with our own crudely sketched self-portrait”
- “Ambition is aspiration gone crazy”
His language was too good to paraphrase. The three quotes built an arc: energy → substitution → diagnosis. Each beat got a longer phrase instead of the punchy 3-5 word overlays I’d planned.
This isn’t a general pattern—usually short copy works better. But when external language captures exactly what the piece is about, use it. The quotes came from Readwise highlights, searched via Enzyme. Same semantic search that found personal notes also found the reading that shaped my thinking.
The stack
- Video: Veo 3.1 via Python SDK, image input per beat
- Music: External generation (Suno/Udio) from comprehensive prompt
- Compositing: Manual assembly with text overlays
- Orchestration: Claude Code skill for interactive workflow
- Source material: Enzyme for vault + Readwise search
The generate_videos.py script runs all beats in parallel using ThreadPoolExecutor. Three videos, three workers, ~5 minutes total instead of 15 sequential.
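The parallel run might look something like this. `generate_beat` here is a stand-in for the actual Veo SDK call, not the script's real code:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def generate_beat(beat_prompt: str) -> str:
    # Stand-in for the Veo 3.1 call; the real version would submit the
    # prompt + image, poll the operation, and download the clip.
    return f"out/clip_{abs(hash(beat_prompt)) % 1000}.mp4"

def generate_all(prompts: list[str], workers: int = 3) -> list[str]:
    # One worker per beat: three ~5-minute generations overlap
    # instead of running back to back.
    results = [None] * len(prompts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(generate_beat, p): i for i, p in enumerate(prompts)}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()  # keep beat order
    return results
```

Threads are the right fit here because the work is I/O-bound (waiting on the API), so the GIL isn't a bottleneck.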
The workflow now
- Photo(s) + theme → generate 2-3 beat structure variants
- User picks/modifies → lock in structure
- Generate music prompt
- Generate videos via Veo (parallel)
- Output: videos + music prompt + copy timing doc
The vault integration is optional but changes the quality. When the copy comes from actual notes—things I’ve written or highlighted—it carries weight that invented phrases don’t.
Rough edges
The timing revision was manual. I asked for 3s/5s/10s beats instead of 12s/13s/10s. The videos were already generated at full length; I just updated the music prompt and copy timing docs. Actual trimming happens in edit.
Veo sometimes generates movement that doesn’t match the prompt. The “slow drift downward” might become a zoom. I accept 80% match and move on—regenerating costs time and the imperfection often works.
No automated compositing yet. Videos + music + text overlays assembled manually. Could script this, but the manual step is where I catch problems.
The piece
“Aspiration Gone Crazy”—18 seconds, three beats:
| Beat | Duration | Copy |
|---|---|---|
| 1 | 3s | “We take the energies that make for aspiration” |
| 2 | 5s | “and remove God from the picture, replacing him with our own crudely sketched self-portrait” |
| 3 | 10s | “Ambition is aspiration gone crazy” |
The final beat gets 10 seconds. The diagnosis needs space to land.