Education Researchers Should Supplement Statistical Techniques with Data Science to Better Serve All Students

Blog | Healthy Schools | Jan 26, 2021

Educational researchers often use data science techniques to predict educational outcomes, but many are just starting to use these techniques to evaluate the efficacy of interventions and expand knowledge about which interventions work, and for whom. Below, we use a case study to illustrate how data science techniques can help researchers accurately predict student outcomes associated with interventions and identify complex patterns in their data. Specifically, we compare how well a traditional statistical method—structural equation modeling—and a data science technique—Random Forest models—perform in the context of a school climate intervention evaluation. We found that the Random Forest model performed as well as the traditional statistical method at predicting outcomes, and that Random Forest was able to independently identify complex relationships and interactions in the data.

Study Design and Methods Definitions

Safe school certification (SSC) is a three-year technical assistance program aimed at building schools’ capacity to improve school climate. Twenty-six public and charter schools from the Washington, DC area that serve 7th- through 10th-grade students were randomized into two conditions: a treatment group that received technical assistance to support SSC and a control group of schools that did not receive technical assistance but remained free to implement the SSC framework on their own. Schools were then followed through the three years of the intervention and through an additional post-intervention year. Student perceptions of school climate were measured in each of the four years of the study using the U.S. Department of Education School Climate Surveys (EDSCLS). EDSCLS measures school climate across three domains: engagement, safety, and environment.

Random Forest is a machine learning method popular in data science research that uses random subsets of data and variables to develop many decision trees, which are then combined to make predictions.

Structural equation modeling is a popular analytic technique in the social sciences used to address measurement error and examine complex relationships between constructs.
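To make the Random Forest definition above concrete, the following is a minimal sketch in Python using scikit-learn. It is not the code used in the study: the data are synthetic, and the variable names and settings (50 trees, two candidate variables per split) are assumptions chosen only for illustration. It shows that each tree is grown on a random resample of the rows, that each split considers a random subset of the variables, and that the forest's prediction is the average of its trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data for illustration only: 200 rows, 4 predictors, noisy outcome.
X = rng.normal(size=(200, 4))
y = X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(
    n_estimators=50,   # number of decision trees in the forest
    bootstrap=True,    # each tree is grown on a random resample of the rows
    max_features=2,    # each split considers a random subset of 2 variables
    random_state=0,
)
forest.fit(X, y)

# Combine the trees by hand: average the predictions of the individual trees.
tree_predictions = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = tree_predictions.mean(axis=0)

# The hand-computed average should agree with the forest's own prediction.
forest_matches_average = bool(np.allclose(manual_average, forest.predict(X)))
print(forest_matches_average)
```

For a regression forest in scikit-learn, the combined prediction is the mean of the individual trees' predictions, which is what the final comparison checks.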

Data science techniques often outperform traditional statistical methods in generating accurate predictions on the kinds of large and complex data sets now available to education researchers. These techniques have been used in important applied settings such as early warning systems and school dropout prediction.

To test whether Random Forest predicts students’ perceptions of school climate as accurately as a traditional structural equation model, we first generated the models using part of the data set and then compared the mean squared errors of the two models’ predictions on the rest of the data.[1] This provides a metric of how well the model describes the patterns in data that were not used in developing the model. The Random Forest and structural equation models performed equally well in predicting all three domains of school climate (engagement, safety, and environment). In this case, the Random Forest model did not improve predictive accuracy when compared with the structural equation model, possibly in part because the overall effect sizes for all measured variables were very small.
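The comparison described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the study's analysis: the data are synthetic, and an ordinary linear regression stands in for the structural equation model because scikit-learn does not include structural equation modeling. The 70/30 split follows the footnote, and the small simulated effect size is an assumption meant to echo the small effects reported here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in data (not the study's data): 1,000 students, 5 predictors,
# and an outcome with one deliberately small effect plus noise.
n_students = 1000
X = rng.normal(size=(n_students, 5))
y = 0.1 * X[:, 0] + rng.normal(size=n_students)

# Fit on 70% of the data and hold out 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

models = {
    # Assumption: a linear regression stands in for the structural equation model.
    "linear_baseline": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=42),
}

# Mean squared error on the held-out data measures out-of-sample accuracy.
test_mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_mse[name] = mean_squared_error(y_test, model.predict(X_test))

for name, mse in test_mse.items():
    print(f"{name}: held-out MSE = {mse:.3f}")
```

Because both models are scored on rows they never saw during fitting, the two error values can be compared directly; a lower value indicates better out-of-sample prediction.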

In addition to generating accurate predictions, data science methods can identify complex relationships without requiring a researcher to specify them a priori. Whereas standard structural equation models require assumptions about which factors contribute and assume linear associations (unless otherwise specified), the real world is much messier. Educational outcomes, such as school climate or test scores, do not always follow the simplified patterns that researchers must sometimes impose on their data. Data science techniques let the data speak by bringing complex relationships to light, such as interactions and nonlinear relationships between factors predicting outcomes. Because of this and other strengths, researchers are increasingly using Random Forest and related models.

In our case study, the Random Forest model identified interactions between treatment group and charter school membership, finding that students in charter schools benefited less from being in the treatment group than students in non-charter schools. Additionally, Random Forest and related methods do not assume a linear relationship between predictors and outcomes, but find the relationships that best fit the patterns in the data. In our study, the Random Forest model identified nonlinear relationships between time and student perceptions of school climate. In the control group, student perceptions of school climate increased in the first three years of the study and then decreased. We would not have known about this interaction or nonlinear relationship had we relied solely on traditional statistical approaches. By using the Random Forest model, we were able to offer a more nuanced interpretation of the intervention’s efficacy.
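As a rough illustration of how a Random Forest can surface patterns like these without being told to look for them, the sketch below simulates data with a built-in treatment-by-charter interaction and a rise-then-fall time trend, then reads both back out of the fitted model's predictions. Everything here is hypothetical: the data are simulated, the effect sizes are invented for the example, and the helper names (`predicted_outcome`, `treatment_effect`) are ours rather than anything from the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
n_students = 4000

# Simulated predictors (hypothetical): treatment status, charter status, study year.
treatment = rng.integers(0, 2, size=n_students)
charter = rng.integers(0, 2, size=n_students)
year = rng.integers(1, 5, size=n_students)  # years 1 through 4

# Invented data-generating process: climate rises through year 3, then falls,
# and treatment helps less in charter schools than in non-charter schools.
time_trend = np.array([0.0, 0.2, 0.4, 0.1])[year - 1]
treatment_benefit = np.where(charter == 1, 0.1, 0.5) * treatment
outcome = time_trend + treatment_benefit + rng.normal(scale=0.3, size=n_students)

X = np.column_stack([treatment, charter, year])
forest = RandomForestRegressor(
    n_estimators=200, min_samples_leaf=20, random_state=7
).fit(X, outcome)

years = [1, 2, 3, 4]

def predicted_outcome(treatment_value, charter_value, year_value):
    """Model prediction for one hypothetical student profile."""
    row = np.array([[treatment_value, charter_value, year_value]])
    return float(forest.predict(row)[0])

def treatment_effect(charter_value):
    """Average predicted treated-minus-control difference across years."""
    return float(np.mean([
        predicted_outcome(1, charter_value, yr) - predicted_outcome(0, charter_value, yr)
        for yr in years
    ]))

effect_non_charter = treatment_effect(0)
effect_charter = treatment_effect(1)

# Predicted control-group trajectory by year, averaged over school type.
control_trajectory = [
    float(np.mean([predicted_outcome(0, c, yr) for c in (0, 1)])) for yr in years
]

print("Treatment effect, non-charter:", round(effect_non_charter, 2))
print("Treatment effect, charter:", round(effect_charter, 2))
print("Control trajectory by year:", [round(v, 2) for v in control_trajectory])
```

The model was never given an interaction term or a curved time trend, yet its predictions differ by school type and bend over time because the trees split the data wherever doing so reduces prediction error.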

Given the success of data science techniques in education research, education researchers should expand their tool kit to include data science methods such as Random Forest. Researchers should add data science techniques to traditional statistical methods to improve accuracy in their predictions and uncover unexpected insights. Data science techniques hold particular promise in the context of educational equity: Because these techniques are adept at uncovering complex relationships, they are especially well suited to identifying how interventions work differently for different demographic groups. Researchers can apply these techniques to answer more nuanced research questions around the efficacy of interventions intended to serve the needs of Black, Hispanic, and American Indian/Alaska Native students; lesbian, gay, bisexual, transgender, and queer/questioning (LGBTQ+) students; and students with disabilities.

[1] As is commonly done in data science, both models were fit on a training subset (70% of the total data), after which the results were tested on the held-out test set (30%) to more accurately assess how well the models would handle out-of-sample data.