Spotting an AI Cheater in Research: Investigating the Limits of Intuition in Remote Interviews
by Emma Spero

The ResearchOps Review is supported by User Interviews, now part of UserTesting. User Interviews makes it fast, easy, and affordable to recruit participants so you can scale research without sacrificing quality.
Have you ever seen a screener response that looked flawlessly written, one in which every answer had perfect grammar and structure? Or perhaps you’ve been in a research interview and wondered whether the participant was simply an unusually focused person or an ardent fan of the product.
About two years into my role as a ResearchOps administrator at the software company Red Hat, I noticed that skepticism of research participants had gradually become the default among researchers during recruitment and virtual interviews. Researchers started flagging up to 20 percent of participants in their studies as “fraudulent,” suspecting that they were using AI tools to answer questions. Encountering occasional fraudulent participants wasn’t new; we offered monetary incentives and had optimized our recruitment screeners to filter out both bots (built and managed by professional fraudsters) and less sophisticated individuals who feigned expertise in order to qualify. Previously, researchers had been skeptical of incomplete, slow, or unpolished answers; now they were also skeptical of people who seemed too polished. Too well informed.
In a User Interviews blog post titled “The Next Decade of Fraud in User Research: A Guide to Staying Ahead” [1], Michele Ronsen, founder of the user research consultancy Curiosity Tank, framed this new wave of fraud as “faster, smarter, and oftentimes indistinguishable from legitimate participation.” Also from User Interviews, the “State of Research Operations 2025” [2] report flagged junk data from AI bots and “participants using AI to circumvent open response screeners and even using AI during live sessions” as top concerns for ResearchOps teams heading into 2025. Because an increasing number of participants were being disqualified, I wondered how accurately research and ResearchOps colleagues could discern AI from humans. If we incorrectly flag a person as a “cheater,” we could damage relationships with customers and company partners who genuinely want to help improve the product. If we miss too many cheaters, we risk polluting our data with synthetic responses that we believed came from humans.
To better understand the challenge, I read up on what talent acquisition leaders were doing to combat cheating in job interviews—a similar context to research interviews. What detection tools, cheating “tells,” and methods did they use to screen out or scare away cheaters? While we can learn a lot from their frameworks, I wanted to investigate this threat specifically within a user research context. I designed and executed a pilot study in which both participants and researchers were anonymized to test our own team’s defenses against real-time AI tools in interviews. The results uncovered flaws in our own detection instincts and identified concrete operational shifts we could make to protect our data.
The Cheating Tools
In addition to the most popular AI tools, like ChatGPT and Claude, which can generate on-the-spot answers, a growing, well-funded industry of products has emerged in the past two years to help people answer questions during video calls. These tools listen to both the interviewer and interviewee, watch the screen, and generate context-specific answers within seconds of a question being asked. I didn’t find any cheating tools specifically designed with research participants in mind, but it’s clear that these meeting tools would work for remote research sessions, too.
Cluely is openly marketed as a tool to “Cheat on everything.” Its makers claim 95 percent transcription accuracy, a 300-millisecond response time (about as long as it takes you to blink), and an ability to follow the user’s eyes in order to stay undetectable. Immediately after discovering this tool, I asked my manager to do a quick test run with me. She downloaded the tool and quickly answered my questions about rock climbing (a topic she knew nothing about) using Cluely-generated responses. After witnessing the natural responses and real-time speed of the tool, I wanted to explore detectability next. Researchers were confident that they could spot cheaters, but how accurate were they? If participants were able to fake answers so easily, cheaply, and covertly, even about topics they knew nothing about, could we continue to rely on intuition to disqualify participants? To find out, I ran a small pilot study to see if we could spot the cheaters, with the aim of using it as a jumping-off point for future investigations and for building solutions into our operations.
The Pilot Study
The study design involved four members of my team acting as interview participants. Each participant completed two short interviews on fun, non-work-related topics that they were either very familiar or completely unfamiliar with. Topics spanned TV shows like Friends, movies like Pirates of the Caribbean, hobbies like Brazilian jiu-jitsu, and interests like anime. In the “AI-assisted” condition, participants were allowed to use Cluely. In the “Human” condition, they answered from genuine knowledge. A single moderator conducted all the interviews without knowing which condition any participant was in, in many ways mirroring the party game “Mafia,” in which unwitting players try to sniff out the “killers” based on body language. Then, eleven observers from across the user experience design team, including researchers, managers, and people who do research (PWDRs), analyzed the recordings, identified which condition they thought each interview belonged to, and rated their confidence on a five-point scale. I also asked them to note why they made their judgments.
The results at a glance were as follows:
Overall accuracy. Observers classified responses correctly in sixty-four out of eighty-two judgments (about 78 percent), indicating that researchers were right more often than they were wrong.
Genuine human responses. When evaluating authentic human interactions (research in which no AI assistance was utilized), observers successfully verified authenticity in four out of every five assessments. Again, that’s a good result.
AI-assisted responses. When evaluating sessions in which participants used real-time covert AI software, detection accuracy dropped to three out of four assessments.
The misidentification risk (false positives). The biggest data risk emerged when observers incorrectly flagged an authentic human participant as an AI user. Despite being entirely wrong, observers reported a confidence metric of 4.5 out of 5—they felt “very confident” to “extremely confident” in their incorrect judgment.
It was encouraging to see that researchers, overall, could identify when AI was being used in these interviews. However, the findings revealed a weakness when it came to using intuition: When observers were wrong about AI usage, they weren’t tentative; they were most confident. The moments when observers were sure they had caught a cheater were, in this study, the moments they were most likely to be wrong.
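For teams that want to run a similar exercise, the tallies above (overall accuracy, accuracy per condition, and the average confidence behind false positives) could be computed along the following lines. This is a minimal sketch, not the analysis used in the pilot: the `Judgment` record, the `summarize` function, and the sample rows are all hypothetical and were made up for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Judgment:
    """One observer's call on one interview (hypothetical record layout)."""
    actual: str       # true condition: "human" or "ai"
    guessed: str      # observer's guess: "human" or "ai"
    confidence: int   # self-rated confidence, 1 (low) to 5 (high)


def summarize(judgments):
    """Return accuracy overall and per condition, plus false-positive confidence."""
    def accuracy(rows):
        return sum(j.actual == j.guessed for j in rows) / len(rows) if rows else None

    human = [j for j in judgments if j.actual == "human"]
    ai = [j for j in judgments if j.actual == "ai"]
    # False positive: a genuine human flagged as an AI user.
    false_positives = [j for j in human if j.guessed == "ai"]

    return {
        "overall_accuracy": accuracy(judgments),
        "human_accuracy": accuracy(human),
        "ai_accuracy": accuracy(ai),
        "false_positive_confidence": (
            mean(j.confidence for j in false_positives) if false_positives else None
        ),
    }


# Made-up rows for illustration only; these are NOT the pilot's data.
sample = [
    Judgment("human", "human", 3),
    Judgment("human", "human", 4),
    Judgment("human", "ai", 5),
    Judgment("human", "human", 2),
    Judgment("ai", "ai", 4),
    Judgment("ai", "ai", 3),
    Judgment("ai", "human", 2),
    Judgment("ai", "ai", 4),
]

summary = summarize(sample)
print(summary)
```

Tracking confidence separately for wrong calls is the design choice that matters here: an overall accuracy figure alone would hide the pattern described above.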
The Confidence Paradox
Researchers are pattern-matchers, and consistent interviewing experience can help them develop a sense or intuition of what authentic responses do—and don’t—look and sound like. But if their intuition hasn’t kept pace with the capabilities of the latest technology, and if more participants are using AI to cheat and researchers are on high alert, that same intuition can be misleading.
The business risk isn’t only false negatives, though that’s a risk worth addressing. False positives are costly too: a researcher operating on gut feel might disqualify a genuine participant, losing the data the team paid for and damaging the participant’s relationship with the panel in the process. And because research operations rarely include a way to confirm afterwards whether a participant and the insights they shared were authentic, the researcher may walk away believing they made the right judgment and apply the same flawed pattern to additional sessions.
Luckily, intuition isn’t our only tool. If we can identify the true tells of AI assistance and build that know-how into how research operates, we can confidently flag the cheaters and protect our data. So how do you tell if someone is using AI to cheat?
The Tells that Told On Cheaters
The qualitative results of the study were especially interesting because they surfaced which cues the observers were using and which were predictive in remote interview contexts. The observers who took part in the pilot study said things like, “He repeated the questions aloud a lot, which felt like stalling for time,” and “she seemed to get flustered when the software was lagging.” And they were right. These markers showed up consistently in correct AI identifications:
A fixed off-screen gaze. Especially one that returned to the same spot for every answer. Unlike casual upward glances while someone thinks, these were precise, repeated glances to the same location that almost always indicated the participant was reading a generated response.
Filler language used as a stalling pattern. Participants inserted phrases like “I guess” or “that’s a great question” at the start of every answer, with a consistent pause length afterwards. This is often correlated with someone buying processing time to review the AI-generated answers. As another stalling tactic, participants using AI also often echo questions back to an interviewer.
Hyper-specificity. AI-generated responses tended to be encyclopedic. They used complete sentences, contained unusual detail, and lacked the natural mess of how people actually talk about their jobs. Real experts ramble, contradict themselves, and sometimes even forget the name of the system they use every day, while synthetic experts do not.
Typing after each question. Many models have done away with the need for participants to type. However, if participants use older models or their software doesn’t work as planned (a few glitches did occur when we were using Cluely), then clear typing sounds, especially during the silence after a question, are a strong indicator.
On the other hand, the markers that correlated most reliably with authentic responses were:
Personal analysis. Participants frequently anchored their answers in subjective framing with phrases like “I feel like…” or “in my experience…,” and included mentions of messy realities or opinions they don’t believe are held by the majority. For example, a participant may admit that while the industry standard is to use a specific automated tool, they find it cumbersome and rely on simple spreadsheets instead.
Natural forgetting or slipups. In authentic responses, it was common to hear mid-sentence contradictions, rambling, or participants admitting to forgetting a fact.
Eye movement that drifts. When people think, their eyes often wander. An authentic participant’s gaze will tend to drift to retrieve a memory or formulate a complex thought. Crucially, these movements never lock into a rigid, repetitive loop. Unlike the fixed movements of someone scanning an AI-generated paragraph, authentic eye movement is less structured.
Hand and body gestures that synchronize naturally with speech. Authentic participants used their hands, leaned into the camera, or nodded to emphasize a point at the exact time the corresponding word left their mouths. Because they weren’t dividing their cognitive energy between listening, reading a hidden screen, and speaking, their body language flowed with their voice.
The Cues That Fooled Us
It could be argued that the observers in the pilot knew they were looking for cheaters and were, therefore, especially alert to suspicious behaviors. It’s true. But, as mentioned earlier, the prevalence of AI cheating means that researchers are now often in that state of vigilance or skepticism by default. Comments like “he responded too quickly to be using AI in real time,” or “she seemed to lean in to check her screen for responses” describe tells that observers thought were dead giveaways, but they turned out to be unhelpful or, worse, misleading.
The two most prominent mistaken tells involved the speed of responses:
Long pauses fooled almost everyone. People take a long time to think, especially when asked about something they haven’t articulated before. A real expert who pauses to find the right word is indistinguishable, on the surface, from a fraudulent participant waiting for a language model. Pauses aren’t an accurate indication of cheating.
Quick, confident responses aren’t evidence of authenticity. AI tools can feed answers to participants in close to real time—again, in the blink of an eye—and a coached participant can read them fluently. With these tools advancing so quickly, you can no longer rely on quick delivery as a sign of authenticity.
Ultimately, these false indicators prove that the technology has evolved to the point that we must now re-evaluate our detection methods.
What This Means for ResearchOps
Most ResearchOps teams I know have built participant governance around a familiar set of risks, such as fraudulent identities, repeat participants, response bias, and the gaming of screeners to ensure acceptance into a study. AI-assisted participation overlaps with all of those, but it’s not a perfect fit for existing controls. Besides, the pace of AI tool improvement means that ResearchOps professionals, and anyone else recruiting participants or interviewing users, are constantly playing catch-up.
To safeguard against the current generation of AI tools, here are four shifts to build into your operations:
Replace gut-feel disqualifications with multi-cue thresholds. The confidence paradox means that an observer, however experienced or confident, should not disqualify a participant based on intuition alone. To support my team, I put together a moderator authenticity checklist of AI-cheating indicators (the fixed gaze, audible typing, and encyclopedic answers), authenticity markers (the personal analysis and natural forgetting), and an explicit reminder that long pauses and fast responses are not reliable tells.
Encourage storytelling. Prompt researchers to include questions in their discussion guides that encourage participants to share personal stories, feelings, and opinions rather than just facts. Questions about failure, frustration, and abandonment can be especially telling. Real experts likely have a long list of things that didn’t work, workarounds they’re embarrassed about, and features they’ve given up on. AI tends to generate diplomatic, balanced responses that praise concepts and identify areas for improvement.
Include AI provisions in your participant agreement. State plainly in the participant agreement that the use of AI assistance disqualifies the response and forfeits the incentive. Encourage participants to use their own voice and emphasize that perfection isn’t required. While this won’t stop the most determined cheaters, it signals to participants that you’re on alert for this type of deception.
Consider deploying specialized interview fraud detection software. Options include platforms like Sherlock AI and Polygraf AI, which use multimodal machine learning and real-time speech transcription to flag unnatural response latency, mechanical eye-movement patterns, or perfect, AI-scripted linguistic structures. For qualitative AI-moderated interviews, Conveo or Strella can add an extra layer of defense. These tools use adaptive AI moderators to push back on vague or diplomatic answers with follow-up questions intended to throw off participants using real-time AI cheating assistants.
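The multi-cue threshold in the first shift could be encoded in a simple rule so that moderators apply it consistently. The sketch below is one possible reading, not the checklist my team uses: the cue names, the `review_decision` function, and the threshold of two AI cues are all assumptions, and the outcome is a flag for a second review rather than an automatic disqualification.

```python
# Cue names are hypothetical labels for the markers described in this article.
AI_CUES = {
    "fixed_offscreen_gaze",
    "patterned_filler",
    "encyclopedic_answers",
    "audible_typing",
}
AUTHENTIC_CUES = {
    "personal_analysis",
    "natural_forgetting",
    "drifting_gaze",
    "synced_gestures",
}
# Cues the pilot found unreliable; they are recorded but never counted.
IGNORED_CUES = {"long_pauses", "fast_responses"}


def review_decision(observed, min_ai_cues=2):
    """Return "flag_for_review" or "keep" for one session.

    min_ai_cues is an assumed threshold; tune it against your own sessions.
    """
    cues = set(observed) - IGNORED_CUES
    unknown = cues - AI_CUES - AUTHENTIC_CUES
    if unknown:
        raise ValueError(f"Unrecognized cues: {sorted(unknown)}")

    ai_count = len(cues & AI_CUES)
    authentic_count = len(cues & AUTHENTIC_CUES)

    # Require several independent AI cues that also outweigh authenticity markers.
    if ai_count >= min_ai_cues and ai_count > authentic_count:
        return "flag_for_review"
    return "keep"


print(review_decision(["long_pauses", "fixed_offscreen_gaze"]))
print(review_decision(["fixed_offscreen_gaze", "audible_typing", "fast_responses"]))
```

In this sketch, a single cue plus a long pause keeps the participant, while two distinct AI cues with no authenticity markers sends the session to a second reviewer.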
Future Studies Are Essential
While this pilot study relied on a localized pool of eleven observers from the same team and focused on the tech industry, I see an opportunity for future research in this area that could help everyone better manage AI-enabled participant fraud. Expanding research to include a larger, more diverse pool of observers across the broader research community would reveal how detection accuracy varies across professional backgrounds and industries, and scaling the research to a larger dataset would provide the statistical power needed to test whether these qualitative patterns hold.
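To get a feel for how much uncertainty sits behind a pilot of this size, one option is to put a confidence interval around the overall accuracy figure of sixty-four correct judgments out of eighty-two. The sketch below uses the Wilson score interval, which is my choice for illustration and not part of the pilot's analysis. It also assumes the judgments are independent, which they are not (the same observers rated the same recordings), so the true uncertainty is probably wider than the interval it prints.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95 percent)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# 64 correct out of 82 judgments, as reported in the pilot results.
low, high = wilson_interval(64, 82)
print(f"{low:.2f} to {high:.2f}")
```

A larger and more varied set of judgments would narrow this range and make comparisons between conditions more trustworthy.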
Additionally, shifting the research design from lighthearted, non-work topics to highly technical subjects would help determine if AI detection rates change when the cognitive load and complexity of the discussion increase. Investigating a broader variety of covert assistants beyond Cluely would also allow researchers to distinguish universal AI “tells” from product-specific cues.
Finally, introducing screenshare tasks, even with tools designed to remain undetectable, would add an extra layer of operational complexity that might discourage participants from attempting fraud. And if you do implement fraud-detection software to prevent this kind of cheating, test those tools to ensure automated systems don’t inherit the same biases found in human observers.
AI Cheating Versus Intentional Synthetic Data
When I first set out to explore AI cheating in user research, I found that conversations tended to cover two related problems and that people had strong opinions on how AI can and should be incorporated in user research. The first problem was participant AI fraud, the second was synthetic users: AI-generated profiles deliberately used in place of real participants.
In their article “Synthetic Users: If, When, and How to Use AI-Generated ‘Research’” [3], Kate Moran and Maria Rosala of the Nielsen Norman Group concluded that synthetic participants “cannot replace the depth and empathy gained from studying and speaking with real people. They often provide shallow or overly favorable feedback.” Their article was published in 2024, and many researchers would still come to that conclusion today. However, just as general-use AI tools have greatly improved, so has the quality of synthetic data. Now, synthetic participants can be a quick, cost-effective way to simulate a wide range of demographic profiles that are traditionally harder to reach.
In my opinion, synthetic data will be increasingly used as an intentional tool in user research, particularly for rapid ideation and hypothesis generation. But if we use it to supplement data from human participants, it’s even more critical that we can trust the humans we recruit to provide genuine opinions—long pauses, rambling, and musings included—rather than the generic AI responses that tools like Cluely generate.
The polished, perfectly phrased responses that once made researchers marvel at finding the “ideal” participant have become the very baseline of our suspicion. This constantly shifting landscape forces us to re-evaluate what authentic human participation actually looks like in a remote research environment. As ResearchOps professionals, our challenge is to stay ahead of the tools designed to deceive us, be aware of our occasionally flawed intuitions, and continue to seek out genuine responses. Ultimately, the very technology creating this trust crisis will likely be the tool we use to solve it.
1. Ronsen, Michele. “The Next Decade of Fraud in User Research: A Guide to Staying Ahead.” User Interviews Blog, July 8, 2025. https://www.userinterviews.com/blog/user-research-fraud-prevention.
2. “The State of Research Operations 2025.” User Interviews, December 3, 2025. https://www.userinterviews.com/state-of-research-operations.
3. Moran, Kate, and Maria Rosala. “Synthetic Users: If, When, and How to Use AI-Generated ‘Research.’” Nielsen Norman Group, June 21, 2024. https://www.nngroup.com/articles/synthetic-users/.




