Using Machine Learning to Identify and Remove Scammers in Online Studies

Blog | Health | Jul 1, 2021

Machine learning algorithms can alleviate some of the unique challenges of recruiting participants online by helping evaluators better identify participants who do not meet study criteria. Program evaluators who recruit participants online and do not interact with them face-to-face may find it difficult to verify that participants meet enrollment criteria (such as age, gender, and race/ethnicity) because of the anonymity of online interactions. Despite this difficulty, evaluation researchers must correctly identify individuals in a study sample who pass the study screener but do not qualify for the study (or who have enrolled more than once). Including these individuals jeopardizes the accuracy of evaluation results and wastes both staff time and the financial resources spent on incentives.

To solve this problem, researchers can turn to machine learning algorithms, which can identify invalid participants in close to real time. For a Child Trends evaluation of an app-based intervention, we explored how machine learning could identify invalid participants in our sample more efficiently.

Machine learning algorithms can identify potential scammers and duplicates in close to real time by automatically detecting patterns in survey responses that are likely to indicate a scammer or duplicate entry. In our case, a supervised machine learning algorithm learned from records that manual review had already identified as scammers or duplicates, and then used those patterns to flag future potential scammers and duplicates automatically.
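
To make the idea of supervised learning concrete, here is a minimal sketch in Python. It is not the model used in our evaluation. It assumes a simple logistic regression fit by gradient descent, and the two yes/no features (`shares_ip_with_prior_record`, `finished_very_fast`) and all of the records are hypothetical, invented only for illustration.

```python
import math


def sigmoid(z):
    # Logistic function, written to avoid overflow for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    # Estimated probability that record x is a scammer or duplicate.
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict_proba(weights, bias, x) - y
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


# Hypothetical hand-labeled records (made up for this sketch).
# Features: [shares_ip_with_prior_record, finished_very_fast]
# Label: 1 = manual review flagged the record, 0 = judged valid.
train_rows = [[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0], [0, 0], [1, 1]]
train_labels = [1, 1, 1, 0, 0, 0, 0, 1]

weights, bias = train_logistic(train_rows, train_labels)

# Score a new enrollee as soon as the survey is submitted.
new_record = [1, 1]
flag_for_review = predict_proba(weights, bias, new_record) >= 0.5
```

In practice, a research team would likely use an established library and many more features and records. The point of the sketch is only that the manually reviewed records supply the labels from which the model learns.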

There are two primary advantages of using automated detection versus traditional manual detection: efficiency and generalizability.

  • Efficiency. One major advantage of algorithmic duplicate/scammer detection is that it can happen in seconds without much manual labor. Rather than waiting for a manual review to compare each enrollee with all past data, an algorithmic approach can be applied within seconds to block scammers as they fill out the survey. Additionally, once trained and optimized, a machine learning algorithm can be easily applied to each new survey participant at no additional cost. This scalability and capacity for real-time detection are unique advantages of machine learning algorithmic detection.
  • Generalizability. Another advantage of a machine learning-based approach is that the code can be reused with minor adaptations to adjust to specific research contexts and can even be reused in different projects. For example, in developing a scammer/duplicate detection algorithm for a recent survey on teen pregnancy prevention, we knew from manual inspection that similarity in IP addresses was particularly relevant for detecting duplicates, as was similarity in participant ages. We easily tuned the algorithm to use this knowledge to improve its performance. Researchers could make similar tweaks in other research contexts, enabling them to reuse code across projects while considering the unique conditions of each project.
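
The IP address and age similarity mentioned above could be computed along the following lines. This is a rough sketch under stated assumptions rather than our production code: the record fields (`id`, `ip`, `age`), the /24 subnet comparison, and the one-year age threshold are all hypothetical choices, and the addresses come from ranges reserved for documentation.

```python
import ipaddress


def same_subnet(ip_a, ip_b, prefix_len=24):
    """True when two IPv4 addresses fall in the same /prefix_len network."""
    net = ipaddress.ip_network(f"{ip_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(ip_b) in net


def pair_features(new, past):
    """Similarity features between a new enrollee and one past record."""
    return {
        "same_ip": new["ip"] == past["ip"],
        "same_subnet": same_subnet(new["ip"], past["ip"]),
        "age_gap": abs(new["age"] - past["age"]),
    }


def likely_duplicates(new, past_records, max_age_gap=1):
    """Return ids of past records that resemble the new enrollee.

    The rule (same subnet AND similar age) is an assumed placeholder;
    the features could instead be passed to a trained classifier.
    """
    flagged = []
    for past in past_records:
        feats = pair_features(new, past)
        if feats["same_subnet"] and feats["age_gap"] <= max_age_gap:
            flagged.append(past["id"])
    return flagged


# Hypothetical past enrollees.
past_records = [
    {"id": 1, "ip": "192.0.2.10", "age": 16},
    {"id": 2, "ip": "192.0.2.200", "age": 19},
    {"id": 3, "ip": "198.51.100.5", "age": 17},
]

new_enrollee = {"id": 4, "ip": "192.0.2.77", "age": 17}
matches = likely_duplicates(new_enrollee, past_records)
```

Adapting this to another project would mostly mean swapping in the features and thresholds that manual inspection suggests matter in that context.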

There are some key limitations to using machine learning to detect scammers and duplicates. Supervised learning approaches, like the one described above, need a data set on which to train the original algorithm. In other words, some manual work is still required to create this initial data set (although substantially less than in a scenario in which machine learning is not used). Further, it can be complicated to evaluate the accuracy of automated detection because the manual coding from which the machine learning algorithm learns is likely imperfect and may not have captured all scammers and duplicates.
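
One common way to compare automated flags with manual coding is to compute precision and recall, sketched below. The labels are hypothetical, and this is only one possible evaluation approach. Because the manual labels are themselves imperfect, an apparent false positive may be a scammer that manual review missed, so disagreements are worth re-inspecting rather than treating as algorithm errors outright.

```python
def precision_recall(manual_labels, predicted_labels):
    """Compare automated flags against manual coding (1 = flagged, 0 = valid)."""
    pairs = list(zip(manual_labels, predicted_labels))
    tp = sum(1 for m, p in pairs if m == 1 and p == 1)  # both flagged
    fp = sum(1 for m, p in pairs if m == 0 and p == 1)  # only algorithm flagged
    fn = sum(1 for m, p in pairs if m == 1 and p == 0)  # only manual review flagged
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def disagreements(manual_labels, predicted_labels):
    """Indexes where the two sources differ; candidates for a second look."""
    return [
        i
        for i, (m, p) in enumerate(zip(manual_labels, predicted_labels))
        if m != p
    ]


# Hypothetical labels for six records.
manual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]

precision, recall = precision_recall(manual, predicted)
to_review = disagreements(manual, predicted)
```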

Despite these limitations, the efficiency and generalizability of machine learning algorithms often make them worth the initial investment of time and resources. Researchers can avoid months, and potentially years, of painstaking manual work that might otherwise be needed to ensure a valid, accurate sample.

The authors would like to thank Jen Manlove of Child Trends and Mila Garrido of Healthy Teen Network for their thoughtful and substantive contributions to this blog post.