The term “big data” conjures the specter of a new positivism, the latest in a long series of ideological tropes that have sought to supplant the qualitative and descriptive sciences with numbers and statistics.
But what do scientists think of big data? Last year, in a widely circulated blog post titled “The Big Data Brain Drain: Why Science is in Trouble,” physicist Jake VanderPlas argued that the real reason big data is dangerous is that it moves scientists from the academy to corporations.
But where scientific research is concerned, this recently accelerated shift to data-centric science has a dark side, which boils down to this: the skills required to be a successful scientific researcher are increasingly indistinguishable from the skills required to be successful in industry. While academia, with typical inertia, gradually shifts to accommodate this, the rest of the world has already begun to embrace and reward these skills to a much greater degree. The unfortunate result is that some of the most promising upcoming researchers are finding no place for themselves in the academic community, while the for-profit world of industry stands by with deep pockets and open arms. [all emphasis in the original]
His argument proceeds in four steps: first, that new data is indeed being produced, and in stupendously large quantities. Second, that processing this data (whether in biology or physics) requires a certain kind of scientist, one skilled at both statistics and software-building. Third, that because of this shift, “scientific software” to clean, process, and visualize data has become a key part of the research process. And finally, that because this scientific software needs to be built and maintained, and because the academy evaluates its scientists not for the software they build but for the papers they publish, the talented scientists who would have spent much of their time building software are now moving to corporate research jobs, where this work is better rewarded and appreciated. None of this, he argues, bodes well for science.
Clearly, to those familiar with the history of 20th century science, this argument has the ring of déjà vu. In The Scientific Life, for example, Steven Shapin argued that the fear that corporate research labs would cause a tear in the prevailing (Mertonian) norms of science, by attracting the best scientists away from the academy, loomed large over the scientific (and social scientific) landscape of the middle of the 20th century. And these fears were mostly unfounded–partly because they were based on a picture of science that never existed, and partly because, as Shapin finds, ideas about scientific virtue remained nearly intact in their move from the academy to the corporate research lab. Would things be any different today? One suspects not.
I would argue, though, that we shouldn’t let this historicizing get in the way of appreciating VanderPlas’ post. As I see it, it contains a striking number of observations about the nature of scientific work in a world of computing and data. I will speculate in this post that the anxiety around big data is less about a crisis in epistemology (the role of theory, ideas about statistical significance) and more about a crisis in the professional identities of practitioners who work with data.
First, consider this description of the new scientist in the world of big data:
In short, the new breed of scientist must be a broadly-trained expert in statistics, in computing, in algorithm-building, in software design, and (perhaps as an afterthought) in domain knowledge as well. [emphasis in the original].
The reason I find this description interesting is that it fits exactly with the description of what a computer scientist does. The computer scientists I observe, who design software to augment online learners, work exactly in this way: they need some domain knowledge, but mostly they need the ability to code, and they need to know statistics in order to create machine learning algorithms, as well as to validate their argument to other practitioners. What VanderPlas is saying, then, is that practitioners of the sciences are starting to look more and more like computer scientists.
The point here is not so much to argue that he is right or wrong, but that his blog post is an indication of changes that are afoot. His post, it seems to me (and with the caveat that this is a sample of one), can be read as an indicator of a struggle occurring within physics (or perhaps astronomy). There are physicists like VanderPlas (whose CV, strikingly, lists his interests as “astronomy” and then “machine learning”) who do a great deal of work building software, but who want these activities to be recognized as a legitimate part of physics. In effect, he is telling his colleagues: if you don’t grant me equal status as a physicist, I will take my skills elsewhere, where they will be more amply rewarded. The threat he uses is the one that seems to resonate most with scientists: the corporate research lab and how it will dilute the scientific ethos.
If some physicists are doing work that looks like computer science and are seeking recognition for it as physics, then a parallel phenomenon is happening on the other side. In the last few years, computer scientists have increasingly turned their attention to a variety of domains: for example, biology, romance, learning. And in each of these cases, their work looks very similar to the work that VanderPlas’ “new breed of scientist” does.
As Andrew Abbott reminds us, when a new community of experts takes on a certain task or a set of clients, it inevitably has to negotiate with an existing community of experts who serve that domain. Let me take the domain of “learning.” Over the past year, I’ve been studying the evolving ecosystem around MOOCs, and the interactions between software engineers, educators and education researchers who have organized around these new online learning platforms. As the “platform” becomes more prominent in the delivery of higher education, more and more computer scientists are taking on “learning” as a topic of research. These computer scientists come mainly from the sub-fields of Human-Computer Interaction (or HCI) and Machine Learning. Both these communities have been very influential in inventing the techniques deployed to great effect on the World Wide Web: data mining, A/B testing, crowd-sourcing and business analytics. Now, they want to use these techniques to help online learners. The call for papers for the new Learning at Scale conference, for example, stipulates:
All papers must tackle topics “at scale.” For example, a paper that would not qualify for Learning at Scale would be one about a system that behaves no differently with one student than with thousands, or which does not improve after being exposed to data from previous use by many students. [emphasis in the original]
“Scale” here is being used to distinguish the new learning researchers from existing researchers who study learning computationally. I asked one of these existing learning scientists what he thought of the papers published at the recently concluded LAS 2014. He said he liked the papers but he worried that they were all about supporting learning, or supporting the scaling of learning, but not really about learning itself. Learning itself, he seemed to be implying, was about assessing students’ knowledge of certain topics and not about building new tools for them.
This is the kind of debate about what VanderPlas, in his post, calls “domain knowledge” (or an example of what Andrew Abbott calls a “jurisdiction contest” and what Thomas Gieryn calls “boundary work”). In VanderPlas’ case, it revolves around his insistence that what he, and others like him, do is still physics even if it involves a lot of software-building, and should be recognized and rewarded as such. In the case of the new learning researchers, it is a debate about whether building software for helping learners constitutes a study of “learning.” How these debates play out–in the media, in scientific journals and conferences, but also in workplaces and in front of funding agencies–is something that we in STS should observe keenly. It’s going to be a bumpy ride ahead.
Tom Boellstorff has convinced me not to capitalize the term “big data.”
Stuart Geiger’s provocative 4S 2013 talk, however, suggests that big data practitioners–the so-called “data scientists”–are much more like us ethnographers than we like to think. His abstract for his talk “Hadoop as Grounded Theory: Is an STS Approach to Big Data Possible?” is well worth citing in full:
In this paper, I challenge the monolithic critical narratives which have emerged in response to “big data,” particularly from STS scholars. I argue that in critiquing “big data” as if it was a stable entity capable of being discussed in the abstract, we are at risk of reifying the very phenomenon we seek to interrogate. There are instead many approaches to the study of large data sets, some quite deserving of critique, but others which deserve a different response from STS. Based on participant-observation with one data science team and case studies of other data science projects, I relate the many ways in which data science is practiced on the ground. There are a diverse array of approaches to the study of large data sets, some of which are implicitly based on the same kinds of iterative, inductive, non-positivist, relational, and theory building (versus theory testing) principles that guide ethnography, grounded theory, and other methodologies used in STS. Furthermore, I argue that many of the software packages most closely associated with the big data movement, like Hadoop, are built in a way that affords many “qualitative” ontological practices. These emergent practices in the fields around data science lead us towards a much different vision of “big data” than what has been imagined by proponents and critics alike. I conclude by introducing an STS manifesto to the study of large data sets, based on cases of successful collaborations between groups who are often improperly referred to as quantitative and qualitative researchers.
 Lee Vinsel makes this point in his comment on a Scientific American blog post that links to VanderPlas’ post. One should note that there are also dissonant voices: e.g. Philip Mirowski takes a very different approach to the commercialization of science in Science-Mart.