Digital twin consumers: a faster, cheaper way to test products before launch
A new method for market research lets large language models simulate consumer feedback with accuracy close to real panels. It's called semantic similarity rating (SSR), and it converts open-ended opinions into Likert scores that mirror human distributions without asking models to pick numbers directly.
Tested on 57 product surveys and 9,300 human responses from a major personal care brand, SSR hit roughly 90% of human test-retest reliability. Even more compelling: the score distributions were statistically almost indistinguishable from the human panel's. The authors put it plainly: this approach keeps the familiar metrics and preserves the qualitative "why."
Why this matters for product teams
- Shorten the loop from concept to shelf by running "digital focus groups" in hours, not weeks.
- Get both a score and the reasoning behind it for copy, concepts, features, and packaging variants.
- Iterate immediately: tweak claims, adjust price anchors, or reposition benefits and rerun the panel the same day.
- Cut costs for early-stage testing so you can explore more options before committing to expensive fieldwork.
The problem it fixes
Ask an LLM for a 1-5 rating and you'll get clumpy, unrealistic distributions. Ask for text first, then map that text to a score using embeddings and reference statements, and the distribution starts to look human. That's the leap.
This is timely. A 2024 analysis from Stanford GSB flagged that many online survey responses are now generated by chatbots, producing "nice," homogenized feedback that washes out signal. Instead of trying to filter contaminated panels, SSR lets you generate synthetic panels under controlled conditions, with a full audit trail.
How SSR works (plain English)
- Prompt the model for a short, candid opinion as if it were your target customer.
- Turn that text into an embedding (a numeric vector).
- Compare it to embeddings of five reference statements that each represent a Likert point (1-5).
- Assign the closest score. Keep both: the score and the verbatim rationale.
Because the model isn't guessing a number, it avoids the unnatural rating behavior that breaks classic LLM panels. The method rests on good embeddings and a clean reference set.
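Here is a minimal sketch of that loop in Python. The reference statements, the embedding model, and the function names are illustrative assumptions, not the authors' exact setup; any decent sentence-embedding model can stand in.

```python
# Minimal SSR sketch: embed a free-text opinion and assign the Likert point
# whose reference statement it sits closest to. Reference wording and the
# embedding model are illustrative choices, not the authors' exact setup.
from sentence_transformers import SentenceTransformer
import numpy as np

REFERENCES = {
    1: "I would definitely not buy this product.",
    2: "I probably would not buy this product.",
    3: "I might or might not buy this product.",
    4: "I would probably buy this product.",
    5: "I would definitely buy this product.",
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model can stand in
ref_vecs = model.encode(list(REFERENCES.values()), normalize_embeddings=True)

def ssr_score(opinion: str) -> tuple[int, np.ndarray]:
    """Return the closest Likert point plus the full similarity profile."""
    vec = model.encode([opinion], normalize_embeddings=True)[0]
    sims = ref_vecs @ vec  # cosine similarity, since all vectors are unit-norm
    return int(np.argmax(sims)) + 1, sims

score, sims = ssr_score(
    "Smells great, but I already have a brand I trust, so probably not."
)
print(score, sims.round(3))  # keep both the number and the verbatim rationale
```

Some teams prefer a similarity-weighted expected score instead of a hard argmax to retain more distributional information; that variant is a design choice, not something this sketch takes from the paper.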
Where it applies today (and where it doesn't)
- Validated domain: personal care products and consumer-style purchase intent.
- Likely near-term fit: FMCG, DTC, app feature prioritization, ad and packaging claims, early concept reads.
- Open questions: complex B2B decisions, luxury categories, or culturally specific products.
- Scope: works at the population level, not as a predictor of individual behavior or 1:1 personalization.
Your 90-day implementation plan
Phase 1: Prove it on a small slice (Weeks 1-3)
- Select 2-3 recent concept tests with available human results as ground truth.
- Define clear segments (e.g., age range, category usage, price sensitivity). Keep prompts short and specific.
- Draft five reference statements for each Likert point that match your brand voice and category.
- Run SSR to produce synthetic ratings plus verbatims for each concept.
- Measure: distribution match (skew/kurtosis), rank-order agreement, correlation with human means, and variance.
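Those Phase 1 checks are easy to script. The sketch below uses SciPy on made-up ratings; the concepts, sample sizes, and numbers are placeholders for your own ground-truth data.

```python
# Phase 1 checks on toy data: distribution match, rank-order agreement,
# correlation of means, and variance, per concept and across concepts.
import numpy as np
from scipy import stats

human = {
    "concept_a": np.array([5, 4, 4, 3, 5, 2, 4, 4]),
    "concept_b": np.array([3, 2, 3, 4, 2, 3, 3, 2]),
    "concept_c": np.array([4, 3, 4, 5, 3, 4, 2, 4]),
}
synthetic = {
    "concept_a": np.array([4, 5, 4, 4, 3, 5, 4, 4]),
    "concept_b": np.array([3, 3, 2, 3, 4, 2, 3, 3]),
    "concept_c": np.array([4, 4, 3, 4, 5, 3, 4, 3]),
}

for name in human:
    h, s = human[name], synthetic[name]
    ks = stats.ks_2samp(h, s)  # two-sample test of distribution match
    print(f"{name}: skew {stats.skew(h):+.2f}/{stats.skew(s):+.2f} "
          f"kurtosis {stats.kurtosis(h):+.2f}/{stats.kurtosis(s):+.2f} "
          f"KS p={ks.pvalue:.2f} var {h.var():.2f}/{s.var():.2f}")

# Rank-order agreement and correlation of concept-level means.
h_means = [human[k].mean() for k in human]
s_means = [synthetic[k].mean() for k in human]
print("Spearman rank corr:", stats.spearmanr(h_means, s_means).correlation)
print("Pearson corr of means:", np.corrcoef(h_means, s_means)[0, 1])
```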
Phase 2: Expand scope (Weeks 4-7)
- Add packaging, claim lines, and price points. Test sensitivity by changing one variable at a time.
- Introduce geographic or demographic segments and check whether deltas reflect past panel trends.
- Put guardrails on prompts to avoid bland "nice" language; enforce character limits and tone control (one prompt template is sketched after this list).
- Run a small human holdout (n=50-100) to confirm the pattern holds.
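To make the prompt guardrails concrete, one option is a simple template that hard-codes segment descriptors, usage context, a length cap, and a tone instruction. The field names and wording below are assumptions, not the paper's prompts.

```python
# Illustrative prompt template for Phase 2 guardrails: explicit segment
# descriptors, usage context, a hard length cap, and a tone instruction.
from dataclasses import dataclass

@dataclass
class Segment:
    age_range: str
    category_usage: str
    price_sensitivity: str

def build_prompt(segment: Segment, concept: str, max_chars: int = 280) -> str:
    return (
        f"You are a shopper aged {segment.age_range} who {segment.category_usage} "
        f"and is {segment.price_sensitivity} about price.\n"
        f"Concept: {concept}\n"
        f"In under {max_chars} characters, give a candid, specific opinion on whether "
        "you would buy this. Mention one concrete reason. Avoid generic praise."
    )

prompt = build_prompt(
    Segment("25-34", "buys shampoo every month", "moderately cautious"),
    "A sulfate-free shampoo bar with a refillable aluminum case at $14.",
)
print(prompt)
# The text your LLM returns then feeds ssr_score() above; reject or regenerate
# responses that exceed max_chars or read as boilerplate.
```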
Phase 3: Operationalize (Weeks 8-12)
- Template your prompts, reference statements, and evaluation metrics so anyone on the team can run a study.
- Stand up a simple dashboard: score distribution, top themes, segment deltas, and verbatim browser.
- Define decision thresholds (e.g., move forward if the SSR rank matches the human rank within the top 2 for two cycles; one encoding is sketched after this list).
- Document limits: population-level insights, not 1:1 targeting; maintain a periodic human benchmark.
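The decision threshold can be encoded as a small check. The rule below is one reading of "SSR rank matches human rank within top-2 for two cycles"; adjust it to your own governance.

```python
# One possible encoding of the go/no-go rule: a concept advances only if it
# lands in the top 2 of both the SSR ranking and the human ranking in each
# of the last two test cycles. The rule itself is an assumption, not a standard.
def top2_agreement(ssr_ranked: list[str], human_ranked: list[str]) -> set[str]:
    """Concepts in the top 2 of both rankings for a single cycle."""
    return set(ssr_ranked[:2]) & set(human_ranked[:2])

def advance(cycles: list[tuple[list[str], list[str]]]) -> set[str]:
    """Concepts meeting the top-2 agreement rule in the two most recent cycles."""
    agreed = [top2_agreement(ssr, human) for ssr, human in cycles[-2:]]
    return set.intersection(*agreed) if len(agreed) == 2 else set()

cycles = [
    (["bar_refill", "travel_kit", "scent_swap"], ["bar_refill", "scent_swap", "travel_kit"]),
    (["bar_refill", "scent_swap", "travel_kit"], ["scent_swap", "bar_refill", "travel_kit"]),
]
print(advance(cycles))  # {'bar_refill'} in this toy example
```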
Data and model quality checks
- Construct validity: confirm that embedding distances track changes in purchase intent, not style or verbosity (a quick probe is sketched after this list).
- Reference set hygiene: write clear, unambiguous statements for 1-5; avoid polarizing brand cues.
- Model choice: test at least two embedding models; pick the one with better agreement on holdout sets.
- Prompt control: enforce segment descriptors, usage context, and constraints on length and tone.
- Calibration: periodically align scores to fresh human panels to catch drift.
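For the construct-validity check, one quick probe is to correlate assigned scores with response length: if verbosity predicts the score, the embedding is picking up style rather than intent. The 0.3 cutoff below is an arbitrary illustration, and the verbatims and scores are toy values.

```python
# Construct-validity probe: does the SSR score track response length rather
# than intent? A strong correlation with verbosity is a red flag.
from scipy import stats

verbatims = [
    "Love the refill idea, I'd buy it next time I run out.",
    "Not for me, I stick to my usual brand.",
    "Interesting but fourteen dollars feels steep for a shampoo bar.",
    "Would try it if it were on sale.",
]
scores = [5, 2, 3, 4]  # SSR outputs for the same verbatims (toy values)

lengths = [len(v) for v in verbatims]
rho = stats.spearmanr(scores, lengths).correlation
print(f"score-vs-length Spearman rho = {rho:+.2f}")
if abs(rho) > 0.3:  # arbitrary illustrative threshold
    print("Warning: scores may be tracking verbosity; review the reference set.")
```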
Team and stack
- Product: owns concepts, claims, and decisions from the readout.
- Insights: curates reference statements, sets evaluation criteria.
- Data/ML: manages embeddings, similarity scoring, and dashboards.
- Legal/Compliance: documents synthetic nature of panels and usage constraints.
Procurement checklist for vendors
- Which embedding models are supported, and how are they validated for purchase intent?
- How do you prevent mode collapse (samey responses) across runs?
- Can we supply our own reference statements? How is performance affected?
- What metrics do you report by default (distribution fit, rank agreement, test-retest)?
- How are segments enforced in prompts and outputs?
- How frequently do you recalibrate against human data?
- What controls exist for sensitive categories and bias?
- What are the unit costs per 1,000 synthetic responses and expected turnaround time?
Cost and time outlook
- Traditional national concept test: weeks in field and tens of thousands of dollars, with limited iteration once fieldwork starts.
- SSR simulation: hours to days and a small fraction of that cost. Iterate as needed before committing to fieldwork.
Make it usable for day-to-day product work
- Use SSR as a pre-screen: pressure-test 20 ideas and send the top 3 to human panels (a sketch of this loop follows the list).
- Run "what if" sprints: change a claim, tweak a price, swap packaging color-measure shifts by segment.
- Feed qualitative rationales to your copy and design teams for tighter iteration.
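A pre-screen loop can be as simple as the sketch below: generate synthetic verbatims per concept, score them with SSR, and pass only the top-ranked concepts to a human panel. `generate_opinion()` is a stub for whatever LLM call you use, and `ssr_score()` and `build_prompt()` refer to the earlier sketches; all names here are illustrative.

```python
# Pre-screen sketch: score many concepts with synthetic respondents and hand
# only the top few to a human panel. generate_opinion() is a placeholder.
import numpy as np

def generate_opinion(concept: str, segment: str) -> str:
    # Replace with your LLM call; a template like build_prompt() above can
    # supply the prompt for the given segment and concept.
    raise NotImplementedError("wire up your LLM of choice here")

def prescreen(concepts: list[str], segment: str,
              n_respondents: int = 200, top_k: int = 3):
    means = {}
    for concept in concepts:
        scores = [ssr_score(generate_opinion(concept, segment))[0]
                  for _ in range(n_respondents)]
        means[concept] = float(np.mean(scores))
    ranked = sorted(means, key=means.get, reverse=True)
    return ranked[:top_k], means  # send ranked[:top_k] to the human panel
```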
Know the limits
- Works best on consumer-style purchase intent with clear benefits and tradeoffs.
- Treat results as directional unless you've validated against fresh human data in your category.
- Do not infer individual behavior or target individuals based on synthetic panels.
The takeaway: human-only focus groups aren't going away, but you can offload early-stage screening to synthetic panels that keep both the numbers and the "why." Move first, validate often, and let your team test more ideas than your competitors can afford.