Data cleaning: How a knack for solving logic puzzles can improve children’s lives
One of the things that makes Child Trends a great place to work is that sometimes I get to “clean” data—a process that involves identifying and resolving inconsistencies and inaccuracies within and across records of data, prior to analyzing them. I bet a few of you are thinking, “That is so cool!” (Please send us your application now.) Others may be wondering how that effort relates to Child Trends’ mission of improving the lives and prospects of youth and families. I would argue that it’s crucial.
The data that need cleaning may be child records from state departments of education, health, or child welfare. Other times, we obtain the data through primary data collection, such as surveys we conduct as part of a study. Problems frequently arise from human error: typos, important fields left blank, a piece of information entered twice, or multiple pieces of information that are logically inconsistent (such as dates that occur in an impossible order). Graduate students typically receive little training in how to clean data; they often use cleaned, canned data for their problem sets. Yet the process is critical to ensuring that analytical results will be as accurate and meaningful as possible.
Data cleaning is not a manual process of eyeballing the data, row by row, ticking off numbers that look incorrect. Not only would such a process be boring, it would be inefficient. Instead, analysts review any existing documentation along with their tabulations of data to identify outliers, impossible values, and patterns of missing data. They think creatively about analyses that could identify inconsistencies and the best way to resolve those inconsistencies—think about logic puzzles you may have seen in grade school, or Sudoku puzzles.
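As a rough illustration of what such tabulations can look like in practice, here is a minimal Python sketch. Everything in it is hypothetical: the records, the field names (child_id, age, entry, exit), and the plausible age range of 0 to 21 are invented for the example and do not describe any real dataset or Child Trends' actual procedures.

```python
from collections import Counter
from datetime import date

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"child_id": "A1", "age": 7, "entry": date(2014, 3, 1), "exit": date(2015, 6, 30)},
    {"child_id": "A2", "age": None, "entry": date(2014, 9, 15), "exit": None},
    {"child_id": "A3", "age": 140, "entry": date(2015, 1, 10), "exit": date(2014, 12, 1)},
    {"child_id": "A1", "age": 7, "entry": date(2014, 3, 1), "exit": date(2015, 6, 30)},
]


def missing_counts(rows):
    """Count blank (None) values per field to reveal patterns of missing data."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                counts[field] += 1
    return dict(counts)


def impossible_ages(rows, low=0, high=21):
    """Return IDs whose age falls outside an assumed plausible range."""
    return [
        row["child_id"]
        for row in rows
        if row["age"] is not None and not low <= row["age"] <= high
    ]


def dates_out_of_order(rows):
    """Return IDs where the exit date is earlier than the entry date."""
    return [
        row["child_id"]
        for row in rows
        if row["entry"] is not None
        and row["exit"] is not None
        and row["exit"] < row["entry"]
    ]


def repeated_ids(rows):
    """Return IDs that appear more than once (possible double entry)."""
    counts = Counter(row["child_id"] for row in rows)
    return sorted(cid for cid, n in counts.items() if n > 1)


print("Missing values per field:", missing_counts(records))
print("Impossible ages:", impossible_ages(records))
print("Dates out of order:", dates_out_of_order(records))
print("Repeated IDs:", repeated_ids(records))
```

Each check only flags records for review. Whether a flagged value is truly an error (a blank exit date, for instance, may simply mean the child has not exited) is a judgment that still depends on documentation and on the data providers.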
For example, I may compare multiple data elements—perhaps using data from different agencies—to figure out whether a child with a missing foster care placement record was actually discharged from foster care or not. Analysts review each other’s work, learning ways to increase efficiency and improve their coding skills along the way. We document any decisions about data corrections. Perhaps most critically, we collaborate with data providers for a valuable outside perspective. They understand the policies and practices in their jurisdictions, including variations across jurisdictions, and they often know which database fields practitioners use regularly and which they ignore, as well as how to interpret the data fields correctly. Insights from data providers aid us in the data cleaning process, and facilitate our correct interpretation of our final analyses.
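The cross-source comparison described above might be sketched as follows. This is an assumption-laden toy example, not our actual method: the two sources (placements and other_agency_events), their field names, and the decision labels are all hypothetical. The point is only to show one way of resolving a missing value from a second source while keeping a record of each decision.

```python
from datetime import date

# Hypothetical placement file keyed by child ID; None marks a missing discharge date.
placements = {
    "B1": {"placed": date(2015, 2, 1), "discharged": date(2015, 11, 20)},
    "B2": {"placed": date(2015, 4, 12), "discharged": None},
    "B3": {"placed": date(2015, 7, 3), "discharged": None},
}

# Hypothetical event records from a second agency.
other_agency_events = [
    {"child_id": "B2", "event": "discharge", "date": date(2016, 1, 8)},
    {"child_id": "B3", "event": "school_enrollment", "date": date(2015, 9, 1)},
]


def reconcile_discharges(placement_rows, events):
    """Label each child's record and document how the label was reached."""
    # Index discharge events from the second source by child ID.
    discharge_by_child = {
        e["child_id"]: e["date"] for e in events if e["event"] == "discharge"
    }
    decisions = {}
    for child_id, record in placement_rows.items():
        second_source_date = discharge_by_child.get(child_id)
        if record["discharged"] is not None:
            decisions[child_id] = "complete"
        elif second_source_date is not None and second_source_date >= record["placed"]:
            # Only accept a discharge that is not earlier than the placement.
            decisions[child_id] = "discharge found in second source"
        else:
            decisions[child_id] = "unresolved: ask data provider"
    return decisions


decisions = reconcile_discharges(placements, other_agency_events)
for child_id, decision in decisions.items():
    print(child_id, "->", decision)
```

Records that remain unresolved are left for follow-up with the data provider rather than filled in by guesswork.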
Internal reviews are important, and so is opening our work to the scrutiny of outside researchers. In January 2016, the International Committee of Medical Journal Editors issued an editorial stating that “[s]haring data will increase confidence and trust in the conclusions drawn from clinical trials. It will enable the independent confirmation of results, an essential tenet of the scientific process.” Sharing data publicly may also help address the growing skepticism about the validity of social science research claims.
So how does data cleaning make Child Trends a great place to work? Data cleaning might seem like a small technical exercise, but our attention to detail is mission-driven: the more accurate our research, the more effectively it can guide policymakers, practitioners, and other stakeholders in their efforts to promote child and family well-being—that’s what I care about. When children’s lives are at stake, there are no small details.