Research Insights

Synthetic Data vs. Real Respondents: When to Use Which (and When to Use Both)

Megan Daniels, CEO

Or why the binary debate misses the point entirely

The market research community has been having the wrong argument for the last two years. The argument goes like this: synthetic data or real respondents? Which is better? Should we move to synthetic? Can synthetic replace human insight?

The argument is wrong because it's binary. And the problem is not binary.

Synthetic data can do some things incredibly well. Real respondents can do other things that synthetic cannot do at any scale. The question is not which one is better. The question is which one is right for this problem. And the answer is almost always: both.

But first, let's be clear about what synthetic data cannot do. This matters because there's a marketing narrative in the industry that suggests it can do everything if you just get the technology right. It cannot.

Synthetic data cannot estimate incidence. If you ask "how many people in the UK experience chronic pain?", synthetic data cannot answer accurately, because the answer depends on the distribution of a real, observable phenomenon. Synthetic respondents are shaped by whatever they were trained on. They don't know the true incidence rate. They will estimate whatever the training data suggested. And the training data is always biased.

Real respondents can answer incidence questions because they live in the real world. They experience things. They know whether they're in the population or not. Synthetic respondents can only approximate what they think the population might report.

Synthetic data cannot model rare behaviors. If you're researching people who spend £1000+ per month on luxury skincare, synthetic data will struggle because the training corpus won't have enough examples of that behavior to learn from it properly. You need real respondents who actually do this. You need to talk to the actual audience.

Synthetic data cannot understand emotional nuance the way a human can. A person can read an open-end response about why someone loves a brand and understand the emotional truth underneath the words. Synthetic data can mimic language patterns. It can generate plausible-sounding emotional responses. But it's generating them from learned correlations, not from actual emotion. When you're trying to understand what makes someone feel something, that distinction matters.

Synthetic data cannot capture lived experience. If you're researching the experience of being a parent, or having a chronic illness, or navigating a disability, synthetic data can approximate it. But it's always an approximation. It's trained on descriptions of lived experience, not experience itself. The texture of the thing is missing. The small details that make it real are missing.

Synthetic data cannot mirror cultural context the way someone from that culture can. If you're researching cultural attitudes, values, or decision-making frameworks, you need people who live in that culture. Synthetic data trained on English-language text will not understand the subtleties of how people from different cultures make decisions or what matters to them.

Synthetic data cannot replace diversity in the field. If you need to research a diverse population, you need real diversity. Real people from different backgrounds, with different experiences, different values. Synthetic data will give you the appearance of diversity. The actual diversity of thought and experience is missing.

Synthetic data cannot handle surprise. One of the most valuable things about real respondent research is when someone tells you something that contradicts your assumptions. When a respondent gives an answer you didn't expect. That surprise is where new insights live. Synthetic data will never surprise you because it's constrained by what it was trained on. It cannot go beyond the patterns in the training data.

Synthetic data cannot be used alone for high-stakes decisions. If the research is going to inform a major business decision—a product pivot, a brand repositioning, a significant investment—you need the confidence that comes from real respondents. You need to know that the data comes from actual people making actual decisions. Synthetic data is too uncertain for that level of stakes.

Now, here's what synthetic data can do. For a specific set of applications, it's genuinely useful.

Synthetic data can triage early-stage concepts. You have ten concept variations. You want to narrow it down to three before running expensive real respondent research. Synthetic data can help you eliminate the ones that are obviously not working. It's fast and cheap. You're using it as a filter, not as proof.

Synthetic data can explore complex survey design. If you want to run a 150-attribute survey testing many combinations and interactions, running that with real respondents is expensive and slow. You can use synthetic data to understand how respondents navigate complex choice tasks, where there are trade-offs, what the decision logic looks like. Then you validate the insights that matter with real respondents.

Synthetic data can repair tracking gaps. You've run a brand tracker for years with a certain set of questions. Then you want to add new attributes. But you have no historical data for those attributes because they weren't asked before. You can use synthetic data trained on your respondent population to estimate what those attributes would have been in historical years. It's not a perfect historical reconstruction, but it's better than no baseline.

Synthetic data can map research data to activation environments. You've learned what messaging resonates with your audience. Now you want to activate that learning at scale. Synthetic data trained on your insights can help you generate variations of messaging that maintain the core insight while adapting to different contexts. It's acceleration of execution, not generation of new insight.

These are the places where synthetic shines: as a supplement, not a replacement. As acceleration, not as proof. As a tool for exploration, not for validation.

The question "synthetic or real respondents?" has been the wrong question all along. The right question is: which one earns its place in this specific research architecture?

Earlier this year we wrote about the end of the sample size debate. The same thinking applies here. The future isn't synthetic or real. It's both, each used where it fits. It's knowing which one gets you closer to the answer you need, and when to use each. It's designing research architecture instead of choosing between binary options.

A well-designed research program uses synthetic data where it's genuinely useful and real respondents where they're required. It uses the speed and cost of synthetic to accelerate exploration. And it uses the depth and truth of real respondents to validate what matters. It's not about choosing a side. It's about choosing the smartest tool for each problem.

The companies that are winning at this aren't asking which data is better. They're asking: what does this decision require? And then they're assembling whatever combination of real and synthetic gets them there with the right confidence level.

The future of research isn't synthetic or human. It's knowing when each one earns its place. And building a research stack smart enough to use both.