How it works

A research platform is its accuracy stack.
Here is ours, in full.

Prism doesn't run on a single clever model. It runs on nine independent techniques, each of which closes a known failure mode of synthetic research. The first one is the foundation: agents that keep reading. We describe each below, including the numbers and the limits.

/ 00

Living agents come first. Eight more layers of accuracy follow.

Every accuracy technique below assumes the agent is responding to your stimulus with a current view of the world. That assumption breaks the moment a competitor ships a launch, a category leader changes their pricing, or your ICP's feed lights up with a new entrant. Prism agents consume real news and synthetic social signals every 24 hours, so the prior they answer from is the prior of the day you ran the check, not the day they were generated. For SaaS-buyer clusters specifically, that signal includes the launch posts, pricing-change threads, and switching discussions your buyers are reading on Hacker News, X, and Lenny's, not generic news feeds.
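In pseudocode, the refresh loop is deliberately simple. A minimal sketch, where every name (Agent, fetch_signals, SIGNAL_SOURCES) is illustrative rather than Prism's actual API:

```python
# A minimal sketch of the 24-hour refresh loop. All names here are
# illustrative, not Prism's real API.
from dataclasses import dataclass, field

SIGNAL_SOURCES = ["news", "hn_threads", "x_posts", "community_newsletters"]

@dataclass
class Agent:
    cluster: str
    prior: list[str] = field(default_factory=list)  # signals the agent answers from

def fetch_signals(source: str, cluster: str) -> list[str]:
    # Stub: in production this would pull the last 24 hours of
    # cluster-relevant items (launch posts, pricing threads, etc.).
    return [f"{source}/{cluster}: latest item"]

def daily_refresh(agents: list[Agent]) -> None:
    # Replace each agent's prior with today's signals, so the prior
    # is the prior of the day you run the check.
    for agent in agents:
        agent.prior = [
            item
            for source in SIGNAL_SOURCES
            for item in fetch_signals(source, agent.cluster)
        ]
```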

/ 01

Real-data seeding

Every synthetic agent is built on fragments of real, anonymised human behaviour, not an LLM's second-hand memory of people.

The easiest way to build a synthetic population is to ask a language model for one. It will happily produce ten thousand plausible-sounding humans in a few minutes. They will also be extraordinarily bland, anchored to the model's training distribution, stripped of the idiosyncrasies that make real humans noisy, contradictory, and informative.

Prism agents are seeded from a fragment layer. Each fragment is a small, consent-cleared, anonymised unit of real human behaviour: a survey response, a product review, a community thread, a panel-consented conversation excerpt. When we build an agent in a given cluster, we retrieve a compact bundle of fragments that matches the cluster's distribution and anchor the agent to them. The LLM's job is to animate the fragment, not invent it.

The output is agents who behave more like data and less like prose.
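For the mechanically minded, here is the seeding step in miniature. A sketch assuming a simplified tag-based retrieval (real retrieval matches the cluster's distribution, not just a label); Fragment and seed_agent are illustrative names:

```python
# A minimal sketch of fragment seeding. The retrieval step is
# simplified to a cluster-tag filter for illustration.
import random
from dataclasses import dataclass

@dataclass
class Fragment:
    text: str     # e.g. a survey response or review excerpt
    source: str   # e.g. "g2_review", "hn_thread", "panel_transcript"
    cluster: str

def seed_agent(store: list[Fragment], cluster: str, bundle_size: int = 8) -> str:
    # Retrieve a compact bundle of consent-cleared fragments for the
    # cluster and anchor the agent's prompt to them. The LLM animates
    # the fragments; it does not invent the behaviour.
    matches = [f for f in store if f.cluster == cluster]
    bundle = random.sample(matches, min(bundle_size, len(matches)))
    anchor = "\n".join(f"- [{f.source}] {f.text}" for f in bundle)
    return "Ground every reaction in these behaviour fragments:\n" + anchor
```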

For SaaS-buyer clusters, the same principle holds with different sources. Each fragment is a G2 review excerpt, a Hacker News thread comment, a public switching post, or a consent-cleared founder transcript. A simulated CTO reacting to a dev-tool landing page is anchored to actual senior-engineer voice, not a model's idea of what a CTO sounds like.

/ 02

Multi-model ensemble

A reaction is never generated by a single model. Prism routes across multiple independent frontier model families so no one model's bias dominates the answer.

Every language model has a shape. One defaults to politeness and hedged nuance. Another defaults to agreeable optimism. A third leans encyclopaedic. A fourth is drier and sometimes blunter. If you build synthetic research on one of them, you inherit its shape: your synthetic population starts to sound like that model's idea of a person.

Prism assigns each agent a home model at generation time, weighted to match the real behavioural variance of its cluster. When a stimulus is tested, agents react through their assigned model. The aggregate sentiment and the distribution of reactions therefore come from an ensemble, not a vote inside one model's head.

We also periodically re-audit the ensemble weights against ground-truth datasets and rebalance. Assignments are stable per agent, so consistency holds, but the mix evolves.
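A minimal sketch of the assignment step, assuming per-cluster ensemble weights; the model families and the numbers below are illustrative, not our production mix:

```python
# A minimal sketch of home-model assignment across an ensemble.
import random

MODEL_FAMILIES = ["family_a", "family_b", "family_c", "family_d"]

# Per-cluster weights, periodically re-audited against ground truth
# and rebalanced; existing agents keep their assignment.
ENSEMBLE_WEIGHTS = {
    "indie_hackers": [0.40, 0.25, 0.20, 0.15],
    "enterprise_buyers": [0.25, 0.35, 0.25, 0.15],
}

def assign_home_model(agent_id: str, cluster: str) -> str:
    # Seed with the agent id so the assignment is stable per agent:
    # the same agent always reacts through the same model family.
    rng = random.Random(agent_id)
    return rng.choices(MODEL_FAMILIES, weights=ENSEMBLE_WEIGHTS[cluster])[0]

print(assign_home_model("agent-0017", "indie_hackers"))
```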

For SaaS-buyer clusters this matters more than anywhere else, because the SaaS-founder audience itself uses these models daily and notices their shapes. A check that sounds like a single chatbot (too agreeable, too long-winded, too unwilling to say no) fails the smell test before the buyer reads the methodology. Prism's ensemble means one simulated indie hacker can be terse and dismissive, the next detailed and hedged, the next enthusiastic and expansive. The aggregate is the population, not the model.

/ 03

Calibration layer

Every raw reaction is passed through a calibration model that corrects known biases against real-world ground truth before it reaches the dashboard.

Raw LLM sentiment runs hot. Models trained to be helpful rate almost everything between "interesting" and "delightful". If you just average them, you get a research platform that never delivers bad news, which is the same thing as never delivering useful news.

Prism stores a calibration function per cluster, fit against ground-truth surveys and panels. Before any reaction leaves the reaction engine, its raw sentiment, purchase intent, and emotional labels are mapped through the calibration transform for that cluster. The calibration is refit weekly as new ground-truth data arrives.

The effect is measurable: uncalibrated Prism matches ground truth within 11 percentage points on average. Calibrated Prism matches within 3.

For SaaS-buyer clusters the calibration targets are different: sign-up intent, conversion intent, switching intent, and pricing perception, audited against G2 quarterly buyer reports, the Stack Overflow Developer Survey, OpenView PLG benchmarks, and our own opt-in customer panel. The same per-cluster isotonic regression, just with SaaS-grade ground truth.
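In miniature, the calibration step looks like this. A sketch using scikit-learn's isotonic regression; the technique is the one named above, but the numbers below are toy data, not our calibration tables:

```python
# A minimal sketch of per-cluster calibration with isotonic regression.
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Raw agent purchase-intent scores vs. matched ground-truth panel
# values for one cluster (illustrative numbers).
raw = np.array([0.55, 0.62, 0.70, 0.78, 0.85, 0.91])
truth = np.array([0.20, 0.31, 0.38, 0.55, 0.63, 0.80])

# Fit one transform per cluster; refit weekly as ground truth arrives.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw, truth)

# Every raw reaction is mapped through the transform before it
# reaches the dashboard. Raw LLM sentiment runs hot, so scores
# shift down toward what real panels report.
print(calibrator.predict([0.88]))  # ~0.71 rather than 0.88
```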

/ 04

Revealed-preference weighting

Agents answer based on what people like them actually did, not just what they said. Stated intent is explicitly discounted.

Most surveys overstate virtue. People tell pollsters they'll buy sustainable, they'll tip fairly, they'll exercise more. Aggregate retail data shows they don't. This gap, stated versus revealed preference, is one of the most persistent errors in market research. It is also something language models tend to amplify, because models trained on surveys inherit the distortion.

Prism agents carry a weight: stated preferences contribute less than observed behaviour. When an agent reports a purchase likelihood, the output is adjusted toward the behavioural class its cluster historically exhibits. The agent still speaks in the first person, but the underlying number is weighted toward what that cohort does, not what it claims.

You can opt out per study, for example when testing aspirational concepts where stated intent matters. The default is weighted.

SaaS buyers do this too, just differently. They tell pollsters they'd switch from Mailchimp tomorrow. Aggregate retention data shows 78% don't, even when they say they will. Prism's SaaS agents are weighted toward what buyers actually do (G2 "switched from / switched to" data, public retention curves, observed PLG conversion rates), not what they claim in a survey.
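The weighting itself is simple. A minimal sketch; the default weight and the cohort rate below are illustrative, not our production values:

```python
# A minimal sketch of the stated-vs-revealed adjustment.
def weighted_intent(
    stated: float,               # what the agent says it would do (0-1)
    revealed_prior: float,       # what this cohort historically does (0-1)
    stated_weight: float = 0.3,  # illustrative; stated intent is discounted
    use_weighting: bool = True,  # opt out per study for aspirational concepts
) -> float:
    if not use_weighting:
        return stated
    # Pull the reported number toward observed cohort behaviour.
    return stated_weight * stated + (1 - stated_weight) * revealed_prior

# An agent claims 0.9 likelihood of switching; the cohort's observed
# switching rate is 0.22 (cf. the "78% don't" figure above).
print(weighted_intent(stated=0.9, revealed_prior=0.22))  # 0.424
```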

/ 05

Behavioural consistency

Agents remember what they said. In a focus group, they build on and contradict each other like real people, not like two forgetful chatbots in adjacent windows.

LLMs, run naively, produce inconsistent answers to related questions. Ask the same model the same question five times and you may get five different personalities. In a simulated focus group, that surfaces as agents who can't remember that they already committed to disliking a product three turns ago.

Prism gives each agent a persistent memory store. The agent's reactions across stimuli are embedded, stored, and retrieved when a related stimulus arrives. This creates consistent agents: the same person, with the same blind spots, revisited across tests.

In practice this means focus-group mode produces conversations that develop rather than drift, and longitudinal studies exhibit the kind of drift real consumers show: slow, correlated with new information, not random.

For SaaS specifically, this matters because real buying committees have memory. A simulated CTO who pushed back on the per-seat pricing in week one will push back on the same point in week six. A growth lead who flagged the missing Salesforce integration in the first eval keeps flagging it in the second. Our agents have the same continuity.
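A minimal sketch of the memory loop, substituting bag-of-words cosine similarity for a real embedding model; all names are illustrative:

```python
# A minimal sketch of a per-agent memory store with retrieval.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class AgentMemory:
    def __init__(self):
        self.entries: list[tuple[Counter, str]] = []

    def store(self, stimulus: str, reaction: str) -> None:
        self.entries.append((Counter(stimulus.lower().split()), reaction))

    def recall(self, stimulus: str, k: int = 3) -> list[str]:
        # Retrieve past reactions to related stimuli so the agent can
        # build on (or keep making) the objection it already raised.
        query = Counter(stimulus.lower().split())
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        return [reaction for _, reaction in ranked[:k]]

memory = AgentMemory()
memory.store("per-seat pricing page", "Pushed back hard on per-seat pricing.")
print(memory.recall("new pricing page"))  # the week-one objection resurfaces
```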

/ 06

Noise injection

Mood, time pressure, and fatigue are modelled as state. The same agent on a Friday night answers differently than on a Tuesday morning.

Real human responses to stimuli are never deterministic. A person shown an ad at 10:47am, mildly caffeinated, reads it differently than the same person shown it at 9:15pm after a hard day. Surveys average this out by sample size. Synthetic research, run without noise, collapses it: every agent answers as if they just walked in, fresh and focused.

Prism samples a mood state per agent per reaction: arousal, valence, time pressure, cognitive load, and recent-context drift. The mood state modulates the system prompt at inference time. Aggregate results match real-sample distributions far more closely, especially in the tails, where noise matters most.

The mood sampler is seeded per agent per stimulus, so a given run is reproducible. Re-run the same stimulus with the same seed and you get the same distribution. Re-run it cold and you get a different sample, the way a different day would.
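A minimal sketch of the seeded sampler; the ranges and the seeding scheme are illustrative:

```python
# A minimal sketch of a reproducible per-agent, per-stimulus mood sampler.
import random

MOOD_DIMS = ["arousal", "valence", "time_pressure", "cognitive_load", "context_drift"]

def sample_mood(agent_id: str, stimulus_id: str, run_seed: int = 0) -> dict[str, float]:
    # Seed per agent per stimulus: the same (agent, stimulus, run_seed)
    # triple reproduces the same mood state, so a given run is
    # replayable. A cold re-run with a fresh run_seed samples a
    # different "day".
    rng = random.Random(f"{agent_id}|{stimulus_id}|{run_seed}")
    return {dim: rng.uniform(-1.0, 1.0) for dim in MOOD_DIMS}

# Reproducible: the same seed yields the same mood state.
assert sample_mood("agent-42", "pricing-page-v2", 7) == sample_mood("agent-42", "pricing-page-v2", 7)
```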

SaaS buyers carry their own state dimensions on top of mood: runway pressure, hiring pressure, switching fatigue, quarter-end timing, and the recency of the last bad demo. The same simulated CTO answers differently in week 4 of a hiring sprint than in a calm month. The same Head of Growth pushes back harder on annual contracts at quarter-end than at quarter-start. These states aren't decorative; they materially change what objections an agent surfaces.

/ 07

Distribution-shape matching

Real markets are polarised. Prism outputs are shaped to match that polarisation, not flattened to a reassuring mean.

The default output of a well-trained LLM clusters around the median of its training distribution, a shape with a tall centre and thin tails. Real consumer reactions are usually the opposite: bimodal love-or-hate distributions, skewed by price sensitivity, or heavy-tailed by brand loyalty.

Prism tracks the historical shape of real reactions per cluster per stimulus type and shapes outputs to match. This is done post-hoc, after raw generation, using a light reweighting and, for extreme stimuli, targeted resampling. The individual reaction is never fabricated. The population's shape is recovered.
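A minimal sketch of the reweighting step, with an illustrative target shape standing in for a cluster's historical distribution:

```python
# A minimal sketch of post-hoc shape matching via per-bin reweighting.
import numpy as np

def shape_weights(raw_scores: np.ndarray, target_shares: np.ndarray, bins: np.ndarray) -> np.ndarray:
    # Weight each reaction so the weighted histogram matches the
    # target shape. Individual reactions are untouched; only their
    # contribution to the aggregate changes.
    idx = np.clip(np.digitize(raw_scores, bins) - 1, 0, len(target_shares) - 1)
    raw_shares = np.bincount(idx, minlength=len(target_shares)) / len(raw_scores)
    per_bin = np.divide(target_shares, raw_shares, out=np.zeros_like(target_shares), where=raw_shares > 0)
    return per_bin[idx]

bins = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
# An LLM's tall-centred output vs. a bimodal love-or-hate target.
raw = np.random.default_rng(0).normal(0.5, 0.12, size=1000).clip(0, 1)
target = np.array([0.30, 0.10, 0.05, 0.15, 0.40])
w = shape_weights(raw, target, bins)
print(np.average(raw, weights=w))  # aggregate now reflects the bimodal shape
```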

The reason this matters is that the decision you're making is almost always about the tails. Which 15% will love it. Which 12% will hate it loudly. The average is the least useful number.

SaaS-buyer reactions are often even more polarised than consumer-market reactions. Dev-tool reactions cluster bimodally at "love it" and "I'd build it myself." Pricing reactions are J-shaped: most cluster neutral-positive, with a sharp spike of "absolutely not" at psychological thresholds (€100/seat, €500/seat, €5k/yr). Indie-hacker reactions have heavy tails on both sides; enterprise reactions cluster centrally around "need to bring this to procurement." Prism shapes outputs to match these specific patterns per cluster rather than flattening them.

/ 08

Public validation

Every cluster's accuracy is dated, sourced, and visible on a public page. If we're wrong, you find out from us first.

Synthetic research platforms that won't publish their accuracy numbers are, with near-complete reliability, platforms whose accuracy numbers you would not want to see. This industry operates on quoted 85%, 90%, 92% figures without audit, without dataset, without date.

For every cluster we serve, Prism publishes its accuracy, its ground-truth source, its sample size, and its last-audited date. When a cluster drifts below 80% accuracy, we pause it. Customers receive an advisory. The /validation page is updated automatically.
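The pause rule itself is deliberately boring. A minimal sketch; the 80% threshold comes from the text, and the Cluster shape and advisory hook are illustrative:

```python
# A minimal sketch of the audit-and-pause rule for a cluster.
from dataclasses import dataclass
from datetime import date

PAUSE_THRESHOLD = 0.80

@dataclass
class Cluster:
    name: str
    accuracy: float
    ground_truth_source: str
    sample_size: int
    last_audited: date
    paused: bool = False

def audit(cluster: Cluster, new_accuracy: float) -> None:
    cluster.accuracy = new_accuracy
    cluster.last_audited = date.today()
    if new_accuracy < PAUSE_THRESHOLD:
        cluster.paused = True
        # In production: send the customer advisory and regenerate
        # the public /validation page here.
```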

The point isn't to be the most accurate. The point is to be the most honest about how accurate we are. Everything else follows from that.

SaaS founders are the audience this matters most for. They will read the validation page before the homepage. They will check the dataset name. They will notice if a cluster's last-audit date is six months old. We list the SaaS-specific clusters with the same level of detail as the consumer clusters: same audit cadence, same pause threshold, same public dataset citations. If a check tells you something surprising, you can trace exactly which cluster, which dataset, and which calibration date the prediction came from.

Now run your first simulation.

Or, if you'd rather see the calibration data first: every cluster's accuracy, dated.