Soft Serve vs Smartphones: How Google's Veo 3 AI Sandwich Built a 10.5M-View Pixel 10 Spot
Google's Vanilla spot swaps phones for cones to pitch Pixel 10, hitting 10.5M views since Sept 23. Veo 3 guides pre-vis; directors deliver the final cut - their 'AI sandwich'.

Vanilla, Veo 3 and volleyball back-and-forth: Inside Google's AI ad integration strategy
Google is trading "vanilla" tools for Veo 3 - and asking consumers to do the same. To push the AI-powered Pixel 10, the brand launched Vanilla, a spot in which standard smartphones are replaced by vanilla soft serve cones. Cones on selfie sticks, cones on dashboards, cones at the Louvre - a simple visual joke with a point. Since its September 23 premiere, the ad has pulled 10.5 million views across TikTok, YouTube and Instagram.
The intent is clear: wake people up to the sameness of phones and make them curious about what's next. "We like to create ads that have insight but feel playful at the same time," said Jesse Juriga, senior director and ECD at Google Creative Lab.
The "AI sandwich": AI for pre-vis, humans for story
Google Creative Lab built Vanilla with Veo, a tool the team has used for nine months across video and social projects. For Vanilla and Just Ask Google, the team used Veo 3 to produce rich storyboards: sets, blocking, mood, and even how actors interact with the product. Then human directors Matias & Mathias shot the final spot. That's the AI sandwich: AI to explore and clarify, humans to craft and surprise.
The goal wasn't volume; it was fidelity. "You had creatives spitting out hundreds of ideas… but without Veo, there was less fidelity," Juriga said. The edit team then brought rhythm and humanity back into the cut - avoiding the trap where AI washes out connection, as seen in other brand misfires.
Veo 3's creative edge: clarity early, fewer surprises late
The team started with Veo 2 and finished on Veo 3 as the model evolved mid-production. The key advantage wasn't speed alone - it was precision. Veo forces decisions early: time of day, wardrobe, framing, tone. That makes pre-production sharper and on-set debates shorter.
Juriga is blunt: "I don't think we would have made this ad two years ago without Veo." It shortens the gap between a weird idea in your head and something the room can see.
Where AI stumbles (and how they fixed it)
Veo initially "misunderstood" the prompt - showing actors holding both cones and phones instead of cones in place of phones. The team solved it by generating training images with Imagen that showed people holding cones like phones, then nudged Veo with those references. The process was a volleyball back-and-forth: prompt, review, correct, repeat.
This is the real workflow shift: build a tiny visual dataset to steer the model, rather than rewriting prompts forever. AI is the assistant, not the finisher.
Team structure: same headcount, more output
No staff cuts. "It's the same amount of people, but they're being more prolific," Juriga said. Fewer than five creatives locked the concept. Production still scaled up with directors, crew and post - the craft layer stayed intact.
What this means for creatives
- Use AI for pre-visualization, not the final film. Treat Veo as your storyboard artist with a camera.
- Decide early. Lock time of day, wardrobe, lenses, and tone inside your AI boards to prevent on-set drift.
- Build micro-reference sets. If the model misreads the idea, create 10-20 reference frames (Imagen or similar) and steer it back.
- Keep the human spike. Hire directors to add texture, pacing and surprise. That's the difference between scroll-by and shareable.
- Measure ideas by clarity, not novelty. If the AI board makes everyone "get it," you're ready for production.
A practical prompt framework that works
- Scene intent: what the viewer should feel or realize in 5-7 words.
- Framing and movement: lens, angle, camera move, duration.
- Subject and action: who, doing what, with what prop.
- Environment specifics: time, lighting, weather, crowd density.
- Product behavior: how the product is used or misused (e.g., "cone replaces phone").
- Reference frames: 3-5 stills per beat to reduce ambiguity.
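One way to keep the six fields above consistent from beat to beat is to hold them in a small structure and render the prompt text from it. The Python sketch below is only an illustration: the class name ShotPrompt, its field names, and the file names are hypothetical, and it calls no Veo or Imagen API. It simply assembles text you could paste into whichever tool you use.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ShotPrompt:
    """One storyboard beat, mirroring the six framework fields (hypothetical names)."""
    scene_intent: str        # what the viewer should feel or realize, in 5-7 words
    framing: str             # lens, angle, camera move, duration
    subject_action: str      # who, doing what, with what prop
    environment: str         # time, lighting, weather, crowd density
    product_behavior: str    # how the product is used or misused
    reference_frames: List[str] = field(default_factory=list)  # 3-5 stills per beat

    def problems(self) -> List[str]:
        """Return human-readable warnings where the beat drifts from the framework."""
        issues = []
        n_words = len(self.scene_intent.split())
        if not 5 <= n_words <= 7:
            issues.append(f"scene intent has {n_words} words, expected 5-7")
        n_frames = len(self.reference_frames)
        if not 3 <= n_frames <= 5:
            issues.append(f"{n_frames} reference frames, expected 3-5")
        return issues

    def render(self) -> str:
        """Assemble the fields into one labeled block of prompt text."""
        lines = [
            f"Scene intent: {self.scene_intent}",
            f"Framing and movement: {self.framing}",
            f"Subject and action: {self.subject_action}",
            f"Environment: {self.environment}",
            f"Product behavior: {self.product_behavior}",
            "Reference frames: " + ", ".join(self.reference_frames),
        ]
        return "\n".join(lines)


# Example beat; every value here is made up for illustration.
beat = ShotPrompt(
    scene_intent="Every phone feels exactly the same",
    framing="35mm, eye level, slow push-in, 4 seconds",
    subject_action="tourist takes a selfie with a soft serve cone",
    environment="midday, bright sun, busy museum plaza",
    product_behavior="cone replaces phone",
    reference_frames=["cone_selfie_01.png", "cone_selfie_02.png", "cone_selfie_03.png"],
)

print(beat.render())
print(beat.problems())
```

Running problems() before generating anything is a cheap way to catch a beat that skips its reference stills or buries the intent in a long sentence.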
Why this campaign resonates
- One visual joke, many use-cases. The cone-as-phone gag scales across selfies, maps, museums.
- Clear stake in the ground. Anti-sameness positioning is easy to understand and hard to copy without looking late.
- Social-native thinking. The spot reads without sound and lands in three seconds.
Build your own AI sandwich
- Pre-vis in Veo (or similar) until your team can "see" the idea.
- Solve misreads with a small reference set made in Imagen or your image tool of choice.
- Hand off to a director who can push story and performance.
- Let editors restore timing, humanity and comedic beats.
As Juriga put it, Veo has been "jet fuel for the creative process." The people who win won't be the ones replacing crews - they'll be the ones shipping clearer ideas, faster, without losing the weirdness that makes work spread.