Using Machine Learning to Identify and Remove Scammers in Online Studies
Machine learning algorithms can alleviate some of the unique challenges of recruiting participants online by helping evaluators better identify participants who do not meet study criteria. Program evaluators who recruit participants online and do not interact with them face-to-face may find it difficult to verify that participants meet enrollment criteria (such as age, gender, and race/ethnicity) because of the anonymity of online interactions. Despite this difficulty, evaluation researchers must correctly identify individuals who pass the study screener but do not qualify for the study, or who have enrolled more than once. Including these individuals jeopardizes the accuracy of evaluation results and wastes both staff time and the financial resources spent on incentives.
Researchers can turn to machine learning algorithms, which can identify invalid participants in close to real time, to address this problem. For a Child Trends evaluation of an app-based intervention, we explored using machine learning to identify invalid participants in our sample more efficiently.
Machine learning algorithms can identify potential scammers and duplicates in close to real time by automatically detecting patterns in survey responses that are likely to indicate a scammer or duplicate entry. In our case, a supervised machine learning algorithm learned from records that had already been identified as scammers or duplicates through manual detection, so that it could automatically flag potential scammers and duplicates among future records.
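To make the idea of supervised learning from manually labeled records concrete, here is a minimal sketch in Python. It is not the algorithm used in the evaluation, which this article does not specify. It assumes a simple logistic regression, two hypothetical features (minutes taken to complete the survey and whether the IP address was seen before), and a handful of made-up training records.

```python
import math

def predict_proba(weights, bias, features):
    """Return the modeled probability that a record is a scammer/duplicate."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, epochs=3000, learning_rate=0.05):
    """Fit a logistic regression by batch gradient descent."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(rows, labels):
            error = predict_proba(weights, bias, features) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

# Made-up illustration data, NOT study data.
# Each row: [minutes to complete the survey, 1 if the IP was seen before else 0]
TRAIN_ROWS = [
    [1.0, 1], [1.5, 1], [2.0, 1], [1.2, 1],
    [8.0, 0], [9.5, 0], [7.0, 0], [6.5, 0],
]
# 1 = flagged as scammer/duplicate during manual review, 0 = valid
TRAIN_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = train_logistic(TRAIN_ROWS, TRAIN_LABELS)

def flag_record(features, threshold=0.5):
    """Flag a new record when its modeled probability reaches the threshold."""
    return predict_proba(weights, bias, features) >= threshold
```

Once trained, `flag_record` can be called on each incoming record, for example `flag_record([1.3, 1])`. The 0.5 threshold is an arbitrary placeholder; in practice it would be chosen by weighing the cost of wrongly blocking a valid participant against the cost of admitting an invalid one.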
There are two primary advantages of using automated detection versus traditional manual detection: efficiency and generalizability.
- Efficiency. One major advantage of algorithmic duplicate/scammer detection is that it can happen in seconds without much manual labor. Rather than waiting for a manual review to compare each enrollee with all past data, an algorithmic approach can be applied as participants fill out the survey, blocking scammers almost immediately. Additionally, once trained and optimized, a machine learning algorithm can be applied to each new survey participant at little additional cost. This scalability and capacity for near-real-time detection are key advantages of machine learning-based detection.
- Generalizability. Another advantage of a machine learning-based approach is that the code can be adapted to a specific research context with minor changes and reused across projects. For example, in developing a scammer/duplicate detection algorithm for a recent survey on teen pregnancy prevention, we knew from manual inspection that similarity in IP addresses was particularly relevant for detecting duplicates, as was similarity in participant ages. We tuned the algorithm to use this knowledge to improve its performance. Researchers could make similar tweaks in other research contexts, enabling them to reuse code across projects while accounting for the unique conditions of each project.
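The IP address and age similarity signals mentioned above could be turned into features along the following lines. This is a hedged sketch, not the code from the evaluation: the record layout (`ip` and `age` keys), the function names, and the choice to measure IP similarity as the number of shared leading bits are all assumptions made for illustration. The addresses come from ranges reserved for documentation.

```python
import ipaddress

def shared_prefix_bits(ip_a, ip_b):
    """Count the leading bits two IPv4 addresses have in common (0 to 32)."""
    a = int(ipaddress.IPv4Address(ip_a))
    b = int(ipaddress.IPv4Address(ip_b))
    return 32 - (a ^ b).bit_length()

def similarity_features(new_record, past_record):
    """Compare a new enrollee with one earlier enrollee."""
    return {
        "ip_prefix_bits": shared_prefix_bits(new_record["ip"], past_record["ip"]),
        "age_gap": abs(new_record["age"] - past_record["age"]),
    }

def closest_past_record(new_record, past_records):
    """Return features for the earlier record most similar to the new one.

    Similarity is ranked by shared IP bits first, then by smaller age gap.
    Returns None when there are no earlier records to compare against.
    """
    if not past_records:
        return None
    candidates = [similarity_features(new_record, p) for p in past_records]
    return max(candidates, key=lambda f: (f["ip_prefix_bits"], -f["age_gap"]))

# Hypothetical records for illustration only.
past = [
    {"ip": "198.51.100.7", "age": 19},
    {"ip": "192.0.2.200", "age": 17},
]
new = {"ip": "192.0.2.10", "age": 16}
closest = closest_past_record(new, past)
```

Features like these can be computed as each survey response arrives and passed to a trained classifier. Adapting the approach to another project would then mostly mean swapping in the similarity measures that matter in that context.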