Education Researchers Should Supplement Statistical Techniques with Data Science to Better Serve All Students

Publication Date: January 26, 2021


Education researchers often use data science techniques to predict educational outcomes, but many are just starting to use these techniques to evaluate the efficacy of interventions and expand knowledge about which interventions work, and for whom. Below, we use a case study to illustrate how data science techniques can help researchers accurately predict student outcomes associated with interventions and identify complex patterns in their data. Specifically, we compare how well a traditional statistical method (structural equation modeling) and a data science technique (Random Forest models) perform in the context of a school climate intervention evaluation. We found that the Random Forest model performed as well as the traditional statistical method at predicting outcomes, and that Random Forest was able to independently identify complex relationships and interactions in the data.

Study Design and Methods Definitions

Safe school certification (SSC) is a three-year technical assistance program aimed at building schools’ capacity to improve school climate. Twenty-six public and charter schools from the Washington, DC area that serve 7th- through 10th-grade students were randomized into two conditions: a treatment group that received technical assistance to support SSC and a control group of schools that did not receive technical assistance but remained free to implement the SSC framework on their own. Schools were then followed through the three years of the intervention and one additional post-intervention year. Student perceptions of school climate were measured in each of the four years of the study using the U.S. Department of Education School Climate Surveys (EDSCLS). EDSCLS measures school climate across three domains: engagement, safety, and environment.

Random Forest is a machine learning method popular in data science research that uses random subsets of data and variables to develop many decision trees, which are then combined to make predictions.
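For readers who want to see the mechanics, the definition above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the model used in the study: the function names (grow_tree, fit_forest, predict_forest), the settings (25 trees, depth 3, one candidate variable per split), and the synthetic data are all assumptions made for the example.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean."""
    center = mean(values)
    return sum((v - center) ** 2 for v in values)


def best_split(rows, targets, features):
    """Find the (feature, threshold) pair with the lowest squared error."""
    best, best_err = None, float("inf")
    for f in features:
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, targets) if row[f] <= t]
            right = [y for row, y in zip(rows, targets) if row[f] > t]
            if left and right:
                err = sse(left) + sse(right)
                if err < best_err:
                    best, best_err = (f, t), err
    return best


def grow_tree(rows, targets, n_features, depth, rng):
    """Grow one regression tree, sampling candidate variables at each split."""
    if depth == 0 or len(set(targets)) == 1:
        return mean(targets)  # leaf: predict the average outcome
    features = rng.sample(range(len(rows[0])), n_features)
    split = best_split(rows, targets, features)
    if split is None:
        return mean(targets)
    f, t = split
    left = [(row, y) for row, y in zip(rows, targets) if row[f] <= t]
    right = [(row, y) for row, y in zip(rows, targets) if row[f] > t]
    return (
        f,
        t,
        grow_tree([r for r, _ in left], [y for _, y in left], n_features, depth - 1, rng),
        grow_tree([r for r, _ in right], [y for _, y in right], n_features, depth - 1, rng),
    )


def predict_tree(tree, row):
    """Follow the splits down to a leaf value."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def fit_forest(rows, targets, n_trees=25, n_features=1, depth=3, seed=0):
    """Fit many trees, each on a bootstrap sample (random subset of the data)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(
            grow_tree([rows[i] for i in idx], [targets[i] for i in idx], n_features, depth, rng)
        )
    return forest


def predict_forest(forest, row):
    """Combine the trees by averaging their predictions."""
    return mean(predict_tree(tree, row) for tree in forest)


# Synthetic data (hypothetical): the outcome depends only on the first variable.
rows = [[i / 20, (i * 7 % 20) / 20] for i in range(20)]
targets = [0.0 if row[0] <= 0.5 else 2.0 for row in rows]

forest = fit_forest(rows, targets)
low = predict_forest(forest, [0.1, 0.5])
high = predict_forest(forest, [0.9, 0.5])
print(low, high)
```

In applied work, researchers would typically rely on an established implementation, such as RandomForestRegressor in scikit-learn, rather than hand-written trees.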

Structural equation modeling is a popular analytic technique in the social sciences used to address measurement error and examine complex relationships between constructs.

Data science techniques often outperform traditional statistical methods in generating accurate predictions on the kinds of large and complex data sets now available to education researchers. These techniques have been used in important applied settings, such as early warning systems and school dropout prediction.

To test whether Random Forest predicts students’ perceptions of school climate as accurately as a traditional structural equation model, we first generated the models using part of the data set and then compared the mean squared errors of the two models’ predictions on the rest of the data.[1] This comparison shows how well each model describes patterns in data that were not used to develop it. The Random Forest and structural equation models performed equally well in predicting all three domains of school climate (engagement, safety, and environment). In this case, the Random Forest model did not improve predictive accuracy when compared with the structural equation model, possibly in part because the overall effect sizes for all measured variables were very small.
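The hold-out comparison described above can be sketched as follows. This is an illustration of the general procedure only, under stated assumptions: the two models are deliberately simple stand-ins (a least-squares line and a nearest-neighbor average), not the structural equation and Random Forest models from the study, and the data, the 30 percent test fraction, and all function names are hypothetical.

```python
import random
from statistics import mean


def mse(predictions, targets):
    """Mean squared error between predictions and observed outcomes."""
    return mean((p - y) ** 2 for p, y in zip(predictions, targets))


def fit_line(xs, ys):
    """Stand-in parametric model: least-squares line with one predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return lambda x: y_bar + slope * (x - x_bar)


def fit_knn(xs, ys, k=5):
    """Stand-in flexible model: average outcome of the k nearest training cases."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda pair: abs(pair[0] - x))[:k]
        return mean(y for _, y in nearest)

    return predict


def holdout_mse(fit, xs, ys, test_fraction=0.3, seed=0):
    """Fit on one part of the data and score predictions on the rest."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)  # same seed gives both models the same split
    n_test = int(len(idx) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    model = fit([xs[i] for i in train], [ys[i] for i in train])
    return mse([model(xs[i]) for i in test], [ys[i] for i in test])


# Synthetic data (hypothetical): a linear trend plus bounded noise.
noise = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [1 + 2 * x + noise.uniform(-0.5, 0.5) for x in xs]

line_mse = holdout_mse(fit_line, xs, ys)
knn_mse = holdout_mse(fit_knn, xs, ys)
print(f"line: {line_mse:.3f}  nearest neighbors: {knn_mse:.3f}")
```

Because both models are scored on the same held-out cases, a lower value points to better out-of-sample prediction, and similar values suggest the two approaches predict about equally well.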