Big Data and Social Science Research

Weilin Li

Blog | Mar 12, 2014

In March 2012, the Obama Administration announced the "Big Data Research and Development Initiative" to improve our ability to extract knowledge and insights from large and complex collections of digital data. At its launch, the Big Data Initiative featured more than $200 million in new commitments from six federal departments and agencies aiming to make the most of big data and the tools needed to analyze it.

Two months later, the United Nations published the white paper "Big Data for Development: Opportunities and Challenges," which highlighted the opportunities and challenges of using big data in the field of international development. It states that "it is important to recognize that Big Data and real-time analytics are no modern panacea for age-old development challenges. That said, the diffusion of data science to the realm of international development nevertheless constitutes a genuine opportunity to bring powerful new tools to the fight against poverty, hunger, and disease." Both of these announcements indicate the potential influence of big data techniques on social science research.

Traditional statistical analyses were based on the assumptions that (1) all data are collected for centralized processing, usually on one or a few computers, (2) data collection and analysis take a considerable amount of time, usually measured in weeks to years, before results are ready to inform policies, and (3) datasets are well organized in formatted rows and columns. These assumptions no longer hold.

Researchers and public health officials recognize that big data techniques can offer a more efficient analytical process at relatively low cost.

First, the volume of data to be analyzed is exploding far beyond the gigabytes or even terabytes that traditional data storage and management systems can hold. Decoding the human genome involves analyzing 3 billion base pairs. The first sequencing effort, completed in 2003, took more than a decade; the same task can now be achieved in one week.

Second, many policies and practices require a guaranteed response within strict time constraints. When H1N1, originally known as swine flu, made headlines in 2009, the Centers for Disease Control and Prevention relied on traditional reporting and analytic procedures to identify infected areas, a process that took two weeks to complete. But controlling such a rapidly spreading disease required a much quicker response. At the same time, Google was receiving more than 3 billion search queries every day, including queries such as "medicine for cough and fever." By analyzing these queries, Google was able to identify geographic areas affected by the flu virus based on what people searched for on the Internet.

Third, data come in various formats that are often difficult to link or integrate for analysis and decision making. For example, on average more than 500 million tweets are posted on Twitter every day. It would be enormously challenging to organize and manage that many tweets in well-formatted tables. But when processed using big data techniques, these tweets can be used by public health experts, marketers, and even financial analysts to spot trends. Research has indicated that daily variations in public mood expressed in tweets are statistically significantly correlated with daily changes in Dow Jones Industrial Average closing values.

Many leading tech companies have used big data for commercial purposes and have pioneered advances in the field. Google created the seminal MapReduce framework to manage huge volumes of web data for its own search and ranking services using commodity servers. Yahoo! then backed the development of Hadoop, an open-source implementation of MapReduce that is now widely used for processing large-scale data sets. Amazon uses big data techniques to recommend books to its users, while LinkedIn uses them to suggest other people with whom we should connect based on data in our profiles and our existing connections.
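For readers who have not seen the MapReduce idea in practice, here is a minimal single-machine sketch of the pattern in Python. It is only an illustration of the map, group, and reduce steps, not Google's or Hadoop's actual implementation, and the function names and the sample posts are hypothetical. A real framework would spread the same steps across many commodity servers.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of text documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input: three short posts, invented for illustration only.
posts = ["flu cough fever", "cough medicine", "fever and cough"]
counts = word_count(posts)
print(counts["cough"], counts["fever"], counts["flu"])
```

Because each document is mapped independently and each word is reduced independently, both steps can be run in parallel on separate machines, which is what makes the approach practical for very large data sets.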

While those of us in social science research can learn much from the tech companies using big data, we also need to recognize the differences. In the social science sphere, we care about the generalizability of our analyses and need to keep in mind the underlying population. For example, research has shown that American adults under 30 who live in urban or suburban areas are more likely to use Twitter, which should make us cautious when drawing inferences from Twitter data regarding the overall population.
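One common way to account for a skewed sample is to reweight it so that each group counts in proportion to its share of the underlying population (often called post-stratification). The sketch below is a simplified illustration under stated assumptions: the group labels, the sample values, and the population shares are all hypothetical numbers invented for the example, and it assumes every population group appears at least once in the sample.

```python
def raw_mean(sample):
    """Unweighted mean of the values in a list of (group, value) pairs."""
    return sum(value for _, value in sample) / len(sample)


def poststratified_mean(sample, population_shares):
    """Weight each group's sample mean by that group's share of the population.

    sample: list of (group, value) pairs.
    population_shares: dict mapping group -> share of the population (sums to 1).
    """
    totals, sizes = {}, {}
    for group, value in sample:
        totals[group] = totals.get(group, 0.0) + value
        sizes[group] = sizes.get(group, 0) + 1
    return sum(
        population_shares[group] * totals[group] / sizes[group] for group in totals
    )


# Hypothetical sample of 10 users: 8 are under 30, 2 are 30 or older.
# A value of 1.0 means the user expressed some opinion, 0.0 means they did not.
sample = (
    [("under_30", 1.0)] * 6
    + [("under_30", 0.0)] * 2
    + [("30_plus", 1.0)] * 1
    + [("30_plus", 0.0)] * 1
)

# Hypothetical population shares, made up for this example.
population_shares = {"under_30": 0.2, "30_plus": 0.8}

unweighted = raw_mean(sample)
weighted = poststratified_mean(sample, population_shares)
print(round(unweighted, 2), round(weighted, 2))
```

In this made-up example the unweighted estimate is pulled toward the over-represented younger group. Reweighting corrects only for the groups we measure; it cannot fix differences between users and non-users within a group.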

We also have to be cautious about measurement error. For example, in a social network study, a person's list of Facebook friends might not reflect that person's actual social network. Finally, big data techniques are often used in combination with data mining approaches to identify clusters or associations. However, social science research also seeks to establish causal relationships in order to understand whether certain interventions or policies could affect people's lives.

Big data is a new and exciting frontier that offers great promise to the work we do in the child and youth field. Let us know how you are using big data!