Current Events
July 16, 2025

What is a digital twin anyway???

Written by our Founder + CEO, Neil Dixit:

At Panoplai, we’ve spent years using digital twins and synthetic data to help global brands like HubSpot, Alphabet, Kraft Heinz, and many more.

We’ve learned a lot—about what works, what doesn’t, and why.

This post kicks off a new series where we’ll be sharing our real-world insights, scoring methodology, and breakthroughs. We’ll talk both about accuracy and reliability and about value. And we’ll share concrete lessons about hybrid approaches to research that combine cutting-edge and more traditional techniques.

Let's start with digital twins.

The concept of a digital twin has been around for decades. At its core, it is a digital representation of something that exists in the real world. Think of General Electric building digital twins of jet engines to simulate operating conditions and prevent failures before they happen. More recently, Patrick Collison, CEO of Stripe, spoke about an initiative at his Arc Institute aiming to create a “virtual cell”: using AI to model biological processes before physical experimentation.

Over time, the concept expanded into advertising — and now increasingly into product innovation and market discovery.  

Earlier this year, we wrote about some of the applications of digital twins in Quirks and in Barron’s, and last year in AdAge.  

So, what is a digital twin, anyway?

In our world of audience discovery and market research, let’s start with what it’s not. A digital twin isn’t a chatbot. It’s also not a simple construction based on entering three or four demographic or personality characteristics into an LLM and then letting it go wild. There are lots of experiments demonstrating that this approach isn’t solid enough to help make important business decisions.   

Let’s say you prompt a general-purpose LLM to simulate a 35-year-old mom from California, who shops at Walmart, and who thinks convenience is important, and then ask her to evaluate a new product idea. How will she respond? Based on underlying training data from places like Instagram and Reddit, responses will likely be generic, predictable, and show little nuance or emotional resonance. They will lean heavily on behavioral stereotypes and lack textured, differentiated opinions and attitudes. Compare the outputs to those of real human beings; the difference will be night and day.

Now imagine even tougher audiences–like C-Suite investment bankers or medical specialists–who rarely share professional opinions on social media. The outputs will be even worse.

At Panoplai, by contrast, a digital twin is a dynamic, AI-powered representation of a real audience segment—built on heaps of actual first-party data, and designed to let you ask both deterministic (what do we already know?) and inferential (what’s a reliable educated guess based on what we already know?) questions.

Our digital twins are based on vetted, curated data, pulling from millions of data points, with access to real language, real emotion and sentiment, and resulting in accurate quantitative outputs and qualitative conversations grounded in what real people have actually said and done. That means you can use them to explore how someone feels, and not just what they do (or what an LLM thinks they do).

Now, you might be asking yourself, “how is a digital twin different from synthetic data?” The truth is, over the past four months, I’ve seen countless definitions, often using the two terms interchangeably. It can be very confusing, especially since digital twins sometimes draw on synthetic data to simulate audiences.

So let’s explore the category of synthetic data next.

What is synthetic data and why would anyone want to use it?

Synthetic data is just artificially-generated information at scale. Good synthetic data matches–as closely as possible–the statistical properties of real data. It fills in the gaps and it’s useful when you need to understand a scenario you didn’t capture in your original dataset—like testing a new product idea, a shifting cultural trend, or a message no one has seen yet. It’s also great for “virtual recontact”—going back to virtual extensions of the same audience and asking follow-up questions instantly, without refielding an entire study. It can therefore be used to augment lots of diverse datasets. And in certain scenarios (if you have the right combination of underlying data to inform your model) it can be used to create new datasets.
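To make the idea of “matching the statistical properties of real data” concrete, here is a minimal, hypothetical sketch in Python. It is not Panoplai’s method; it simply fits a basic parametric model (a lognormal distribution, a common shape for spending data) to a “real” sample and draws new synthetic records from it. The variable names and the choice of distribution are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for real first-party data: e.g., purchase amounts from a
# survey panel (hypothetical; generated here so the sketch is runnable).
real = rng.lognormal(mean=3.0, sigma=0.5, size=5000)

# Fit a simple parametric model to the real data's statistical
# properties: log-amounts of a lognormal are normally distributed,
# so estimate their mean and standard deviation.
log_real = np.log(real)
mu, sigma = log_real.mean(), log_real.std()

# Sample fresh synthetic records from the fitted model. These are new,
# artificial data points that mirror the real data's distribution.
synthetic = rng.lognormal(mean=mu, sigma=sigma, size=5000)

print("real mean:", real.mean(), "synthetic mean:", synthetic.mean())
```

In practice, good synthetic data generation is far more sophisticated (preserving correlations across many variables, not just one marginal distribution), but the principle is the same: learn the structure of real data, then sample from it.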

Also important to note–synthetic data should never be a replacement for first-party data collection. There will always be a need to hear directly from people. Opinions change, and emotions are complex–they evolve. No synthetic dataset–even the most intricately constructed one–could ever predict the ever-shifting political state of the US (or the world), or how people feel about it in real time. If someone tells you that synthetic data is going to completely replace data collection, my advice is: “run!”

We believe in using synthetic data and digital twins where they make sense, and in the context of the right business, marketing, and audience challenges. The benefits, in terms of increased speed, reduced costs, and enhanced audience understanding, can be huge. But in the end, people’s opinions do change. And it’s rarely the case that LLM training data or commercially available third party datasets include all of the audience data that matters most to your organization. Synthetic data helps you move faster and smarter, but it should work with real data, not against it.

At Panoplai, all of our synthetic data is generated based on patterns found in real, high-quality source data.

What is the Panoplai approach?

Our foundation is usually first-party audience opinion data. Why? Because it can be scaled and made representative enough to create AI segments and personas anchored to the real world. We then work with our clients to strengthen this foundation with other forms of data: behavioral, social, transactional, purchase, financial, qualitative, and more. We also round it out with different sources, including reports, competitive analysis, and other contextual information.

We look not just at categorical data, but also at unstructured responses and examples of people’s actual language. It all gets fed into our models, allowing them to produce more nuanced, accurate answers grounded in truth.

We’ve spent years building with synthetic data and digital twins. From HubSpot to Diageo, McCann to Alphabet, and major brands across CPG, appliances, consulting, and more, we’ve tested these tools (and are always testing more) in the real world. In addition to striving for accuracy, we’ve also learned how to deliver value: by accelerating innovation, shaping marketing strategy, testing concepts, and enhancing customer understanding.

Here’s how we do it:

- Our approach is transparent rather than black-boxed. Our clients know how the system works and what's powering their outputs.

- We deliver rigorous traditional research as well–usually in the form of surveys with an extreme focus on understanding people through language, sentiment, and emotion. 

- We’re built on a secure foundation of first-party data (often in combination with other forms of data as well). 

- We help clients create their own private data repositories—including not just survey data, but also social, transactional, sales, NPS, and more, in the form of PDFs, competitive decks, qual transcripts, summaries—anything and everything. We bring to life their knowledge and ways of understanding customers.

- We build digital twins and synthetic datasets. Digital twins let you have conversations with accurate simulations of real groups of people and gain instant feedback. Synthetic archetypes help you generate and then understand data at scale.

- We’re agnostic at the core. Our models are LLM- and sample-agnostic, choosing–sometimes in real time–the best tool for the business challenge at hand. We work with a multitude of LLMs and sources of respondents. This approach allows us to generate the most relevant, contextual, and nuanced outputs while removing whole categories of risk.

How can you be sure it’s useful?

When it comes to synthetic data and digital twins, accuracy is, of course, non-negotiable. But arguably just as important are big questions like, “Is this approach useful? Will it deliver real value to clients?” In the end, helping clients make better, more data-driven decisions is what matters most.

We evaluate usefulness and value through a blend of methods developed over many client engagements. We look at both quantitative and qualitative accuracy, relevance, insightfulness, clarity, generalizability, ease of use, quality of ideas and concepts, and time to value. We have a complete synthetic data and digital twin scoring method, and it works. It’s become our internal report card and the way we hold ourselves accountable.

We’ll be sharing more of these evaluations publicly, including experiments that didn’t pass the test, because we believe the only way this space matures is through transparency.

What’s next?

We’re kicking off this series to share what we’ve learned: the good, the bad, what worked, and what didn’t. Leveraging AI to simulate real people is real: it is powerful, and it does return value. But the risks are large and it’s easy to make mistakes.

Expect real case studies, a white paper, and new blog posts. If you're building with or curious about synthetic data, digital twins, or AI-powered insights—join the conversation!
