At a moment when our industry is scandalized by a panel provider having been indicted for selling fraudulent survey respondents, in comes another provider that is doing nearly the same thing with absolutely full disclosure. The catch? They are using AI to do it, and dressing it up with phrases like “contextually relevant” and “statistically valid” and “machine learning.”
Here are excerpts from their pitch:
Let’s face it – surveying hard-to-reach consumer segments within tight project timelines can drive up costs and push deadlines, creating challenges that even well-designed studies struggle to overcome. . . . Synthetic responses offer a promising solution. . . . Synthetic responses augment your collected dataset to strategically fill critical gaps, particularly when reaching hard-to-reach audiences like C-suite executives, high-net-worth individuals, and other notoriously difficult-to-access demographics that often leave quotas incomplete. For example, if your study requires responses from 30 high-net-worth respondents over 40 years old in a particular geographic region, but you’ve only secured 20, [our offering] provides the option to generate the remaining 10 responses based on behavioral patterns of similar participants.
What’s the difference between synthetic responses and standard weighting techniques? While weighting simply adjusts the results of an existing study to make them more representative for analysis, synthetic responses create new, artificial responses that mimic real respondent data, using machine learning models. The engine takes all survey responses and detailed profile information into account, employing advanced imputation techniques, to create responses that fill quota gaps where real responses are lacking. This approach effectively addresses research challenges like low response rates or insufficient sample sizes for hard-to-reach segments.
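The contrast the pitch draws can be made concrete with a toy example (the numbers below are mine, invented purely for illustration). Weighting never adds rows; it simply re-scales the answers of the respondents you actually have:

```python
# Toy illustration of weighting (all numbers invented for this example).
# Suppose under-40s are over-represented: 6 of 8 respondents, but only
# half of the target population.
young = [5, 6, 5, 7, 6, 5]   # scores from 6 respondents under 40
older = [8, 9]               # scores from 2 respondents 40 and over

n = len(young) + len(older)
w_young = (0.5 * n) / len(young)   # each under-40 answer counts as 2/3
w_older = (0.5 * n) / len(older)   # each 40+ answer counts as 2

unweighted_mean = (sum(young) + sum(older)) / n
weighted_mean = (w_young * sum(young) + w_older * sum(older)) / n

print(f"unweighted mean: {unweighted_mean:.3f}")  # 6.375, skewed young
print(f"weighted mean:   {weighted_mean:.3f}")    # 7.083, rebalanced
```

Synthetic responses, by the vendor’s own description, do something different: they would add fabricated 40+ rows until the counts themselves come out right — which is exactly the move the rest of this post takes issue with.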
To be fair, the pitch has the kernel of something interesting: Maybe we can use AI to create synthetic survey respondents that closely approximate how real respondents would answer. (So far, rigorous studies suggest that we cannot.) Maybe we can bring together AI models and current statistical techniques for imputing missing data in order to augment the partial data we have in hand. (As far as I know, there is no credible evidence that we can.)
But yikes. They are “solving” the problem of small sample sizes by simply inflating datasets with synthetic respondents. They act as if “statistical validity” involves nothing more than getting larger samples that mimic smaller samples.
Of course there are other ways you could inflate your sample sizes and fill quotas. You could do it by copying rows of existing data, counting each of your respondents more than once. That would be fraud. You could hire people to pretend they are respondents and fill out your survey, like the panel provider now under indictment. That, too, would be fraud.
Now, instead, you can use AI to create additional rows of data, generating them (rather than copying them) with models trained on the rows you already have. Voilà. Now you have a big sample size, all your quotas are full, and your tests of “statistical significance” will pass muster.
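To see why this worries me, here is a deliberately crude sketch (my own, not the vendor’s actual model) of what happens to the reported precision when you pad a sample with rows recombined from the rows you already have:

```python
# Toy illustration: padding a sample with rows "derived from the other
# rows" shrinks the standard error without adding any new information.
# The data and the recombination rule are invented for this example.
import math
import statistics

def std_err(xs):
    """Standard error of the mean: sample stdev / sqrt(n)."""
    return statistics.stdev(xs) / math.sqrt(len(xs))

# 20 real respondents rating something on a 1-10 scale
real = [7, 8, 6, 9, 7, 5, 8, 7, 6, 9,
        8, 7, 6, 8, 7, 9, 5, 6, 8, 7]

# 10 "synthetic respondents": here, crudely, each is just the average of
# two real neighbors -- a stand-in for any model that recombines rows
synthetic = [(real[i] + real[i + 1]) / 2 for i in range(10)]
padded = real + synthetic

print(f"real    n={len(real)},  SE={std_err(real):.3f}")
print(f"padded  n={len(padded)},  SE={std_err(padded):.3f}")
```

The standard error drops (roughly 0.274 to 0.197 on these numbers) purely because n grew from 20 to 30. Confidence intervals narrow and “significance” improves, yet not a single new person was surveyed.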
Hmm. I’m not going to call this fraud, because the provider of this tool is totally upfront about what they are doing. But until there is credible evidence that synthetic data works, and in which contexts it works, Versta Research recommends staying far away from this kind of “research” for business decisions.