Although anthropologists have been working with large-scale data sets for quite some time, the term “big data” is currently being used to refer to large, complex sets of data combined from different sources and media that are difficult to wrangle using standard coding schemes or desktop database software. Last year saw a rise in STS approaches that try to grapple with questions of scale in research, and the trend toward data accumulation seems to be continuing unabated. According to IBM, we generate 2.5 quintillion bytes of data each day, and 90% of the data in the world was created during the last two years.
Big data are often drawn and aggregated from a wide variety of sources, both personal and public, and include everything from social media participation to surveillance footage to consumer buying patterns. Big data sets exhibit complex relationships and yield information to entities who may mine highly personal information in a variety of unpredictable and even potentially violative ways.
The rise of such data sets raises many questions for anthropologists and other researchers interested both in using such data and in investigating the techno-cultural implications and ethics of how such data are collected, disseminated, and used by unknown others for public and private purposes. Researchers have called this phenomenon the “politics of the algorithm,” and have called for ways to collect and share big data sets as well as to discuss the implications of their existence.
I asked David Hakken to respond to this issue by answering questions about the direction that big data and associated research frameworks are headed. David is currently directing a Social Informatics (SI) Group in the School of Informatics and Computing (SoIC) at Indiana University Bloomington. Explicitly oriented to the field of Science, Technology, and Society studies, David and his group are developing a notion of social robustness, which calls for developers and designers to take responsibility for the creation and implications of techno-cultural objects, devices, software, and systems. The CASTAC Blog is interested in providing a forum to exchange ideas on the subject of Big Data, in an era in which it seems impossible to return to data innocence.
Patricia: How do you define “big data”?
David: I would add three, essentially epistemological, points to your discussion above. The first is to make explicit how “Big Data” are intimately associated with computing; indeed, the notion that they are a separate species of data is connected to the idea that they are generated more or less “automatically,” as traces normally a part of mediation by computing. Such data are “big” in the sense that they are generated at a much higher rate than are those large-scale, purpose-collected sets that you refer to initially.
The second point is the existence of a parallel phenomenon, “Data Science,” which is a term used in computing circles to refer to a preferred response to “Big Data.” Just as we have had large data sets before Big Data, so we have had formal procedures for dealing with any data. The new claim is that Big Data has such unique properties that it demands its own new Data Science. Also part of the claim is that new procedures, interestingly often referred to as “data mining,” will be the ones characteristic of Data Science. (What are interesting to me are the rank empiricist implications of “data mining.”) Every computing school of which I know is in the process of figuring out how to deal with/“capitalize” on the Data Science opportunity.
The third point is the frequently made claim that the two together, Big Data and Data Science, provide unique opportunities to study human behavior. Such claims become more than annoying for me when it is asserted that the unique features of Big Data/Data Science are such that those pursuing them need not pay any attention to any previous attempt to understand human behavior, that they and they alone are capable of placing the study of human behavior on truly “scientific” footing, again because of their unique scale.
Patricia: Do you think that anthropologists and other researchers should use big data, for instance, using large-scale, global information mined from Twitter or Facebook? Do you view this as “covert research”?
David: We should have the same basic concern about these as we would any other sources of data: Were they gathered with the informed consent of those whose activities created the traces in the first place? Many of the social media sites, game hosts, etc., include permission to gather data as one of their terms of service, to which users agree when they access the site. This situation makes it hard to argue that collection of such data is “covert.” Of course, when such agreement has not been given, any gathered data in my view should not be used.
In the experience of my colleagues, the research problem is not so much the ethical one to which you refer as its opposite: that the commercial holders of the Big Data will not allow independent researchers access to it. This situation has led some colleagues to “creative” approaches to gathering big data that have caused some serious problems for my University’s Institutional Review Board.
In sum, I would say that there are ethical issues here that I don’t feel I understand well enough to take a firm position. I would in any particular case begin with whether it makes any sense to use these data to answer the research questions being asked.
Patricia: Who “owns” big data, and how can its owners be held accountable for its integrity and ethical use?
David: I would say that the working assumption of the researchers with whom I am familiar is that the owner is either the business whose software gathers the traces or the researcher who is able to get users to use their data gathering tool, rather than the users themselves. I take it as a fair point that such data are different from, say, the personal demographic or credit card data that are arguably owned by the individual with whom they are associated. The dangers of selling or similar commercial use of these latter data are legion and clear; of the former, less clear to me, mostly because I don’t know enough about them.
Patricia: What new insights are yielded by the ability to collect and manipulate multi-terabyte data sets?
David: This is where I am most skeptical. I can see how data on the moves typically made by players in a massive, multiplayer, online game (MMOG) like World of Warcraft™ would be of interest to an organization that wants to make money building games, and I can see how an argument could be made that analysis of such data could lead to better games and thus be arguably in the interest of the gamers. When it comes to broader implications, say about typical human behavior in general, however, what can be inferred is much more difficult to say. There remain serious sampling issues however big the data set, since the behaviors whose traces are gathered are in no sense that I can see likely to be randomly representative of the population at large. Equally important is a point made repeatedly by my colleague John Paolillo, that the traces gathered are very difficult to use directly in any meaningful sense; that they have to be substantially “cleaned,” and that the principles of such cleaning are difficult to articulate. Paolillo works on Open Source games, where issues of ownership are less salient than they would be in the proprietary games and other software of more general interest.
Equally important: These behavioral traces are generated by activities executed in response to particular stimulations designed into the software. Such stimuli are most likely not typical of those to which humans respond; this is the essence of a technology. How they can be used to make inferences about human behavior in general is beyond my ken.
Let me illustrate in terms of some of my current research on MMOGs. Via game play ethnography, my co-authors (Shad Gross, Nic True) and I arrived at a tripartite basic typology of game moves: those essentially compelled by the physics engine which rendered the game space/time, those responsive to the specific features designed into the game by its developers, and those likely to be based on some analogy with “real life” imported by the player into the game. As the first two are clearly not “normal,” while the third is, we argue that games could be ranked in terms of the ratio between the third and the first two, such ratio constituting an initial indicator of the extent of familiarity with “real life” that could conceivably be inferred from game behavior. Perhaps more important, the kinds of traces to be gathered from play could be changed to help make measures like this easier to develop.
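As a rough illustration of the ranking idea described above, the ratio could be computed from hand-coded game moves along the following lines. This is only a sketch under stated assumptions: the label names, the function names, and the sample codings are invented for illustration and do not come from the research itself.

```python
from collections import Counter

# Hypothetical labels for the three move types in the typology.
PHYSICS, DESIGN, ANALOGY = "physics", "design", "analogy"


def real_life_ratio(moves):
    """Ratio of analogy-based moves to physics- plus design-driven moves."""
    counts = Counter(moves)
    compelled = counts[PHYSICS] + counts[DESIGN]
    if compelled == 0:
        # No engine- or design-driven moves were coded for this game.
        return float("inf") if counts[ANALOGY] else 0.0
    return counts[ANALOGY] / compelled


def rank_games(coded_sessions):
    """Return game names sorted from highest to lowest ratio."""
    return sorted(
        coded_sessions,
        key=lambda game: real_life_ratio(coded_sessions[game]),
        reverse=True,
    )


# Invented codings, purely for illustration (not real observations).
sessions = {
    "game_a": [PHYSICS, DESIGN, ANALOGY, ANALOGY],
    "game_b": [PHYSICS, PHYSICS, DESIGN, ANALOGY],
}

for game in rank_games(sessions):
    print(game, round(real_life_ratio(sessions[game]), 2))
```

A higher ratio would mark a game whose play draws more heavily on “real life” analogies. Whether such a number supports any inference beyond the game is the open question raised in the interview.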
Patricia: What are the epistemological ramifications of big data? Does its existence change what we mean by “knowledge” about behavior and experience in the social sciences?
David: I have already had a stab at the first question. To be explicit about the second: I don’t think so. There are no fundamental knowledge alterations regarding those computer mediations of common human activity, and we don’t know what kind of knowledge is contained in manipulations of data traces generated in response to abnormal, technology-mediated stimuli.
Patricia: boyd and Crawford (2011) argue that asymmetrical access to data creates a new digital divide. What happens when researchers employed for Facebook or Google obtain access to data that is not available to researchers worldwide?
David: I find their argument technically correct, but, as above, I’m not sure how important its implications are. I am reminded of a to-remain-unnamed NSF program officer who once pointed out to a panel on which I served that NSF was unlikely to be asked to fund the really cutting edge research, as this was likely to be done as a closely guarded, corporate secret.
Patricia: What new skills will researchers need to collect, parse, and analyze big data?
David: This is interesting. When TAing the PhD data analysis course way back in the 1970s, I argued that to take random strolls through data sets in hopes of stumbling on a statistically significant correlation was bad practice, yet this is, in my understanding, the approach in “data mining.” We argue in our game research that ethnography can be used to identify the kinds of questions worth asking and thus give a focus, even foster hypothesis testing, as an alternative to such rampant empiricism. Only when such questions are taken seriously will it be possible to articulate what new skills of data analysis are likely to be needed.
Patricia: How can researchers ensure data integrity across such mind-bogglingly large and diverse sets of information?
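The worry about “random strolls” through data can be made concrete with a small simulation. The sketch below is an illustration added here, not a method from the interview. It generates columns of pure random noise, tests every pair for correlation, and counts how many pairs clear a conventional significance threshold anyway. The variable counts and the seed are arbitrary assumptions, and 0.361 is the approximate tabled critical value of Pearson's r for 30 observations at the two-tailed 5% level.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def dredge(n_vars=40, n_obs=30, r_crit=0.361, seed=0):
    """Count variable pairs that look 'significant' in pure noise.

    r_crit is the approximate two-tailed 5% critical value for n_obs=30;
    all parameter values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    pairs = list(itertools.combinations(range(n_vars), 2))
    hits = [
        (i, j) for i, j in pairs
        if abs(pearson(columns[i], columns[j])) > r_crit
    ]
    return len(hits), len(pairs)


hits, total = dredge()
print(f"{hits} of {total} pairs look 'significant' in pure noise")
```

Since roughly one pair in twenty would be expected to pass by chance alone, an undirected search through enough variables will nearly always turn up “findings.” This is one reason to let prior questions, ethnographically grounded or otherwise, decide which comparisons are worth making.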
David: Difficult question if dealing with proprietary software; as with election software, “trust me” is not enough. This is why I have where possible encouraged study of Open Source Projects, like that of Giacomo Poderi in Trento, Italy. Here, at least, the goals of designers and researchers should be aligned.
Patricia: To some extent, anthropologists and other qualitative researchers have always struggled to have their findings respected among colleagues who work with quantitative samples of large-scale data sets. Qualitative approaches seem especially under fire in an era of Big Data. As we move forward, what is/will be the role and importance of qualitative studies in these areas?
David: As I suggested above, in my experience, much of the Data Science research is epistemologically blind. Ethnography can be used to give it some sight. By and large, however, my Data Science colleagues have not found it necessary to respond positively to my offers of collaboration, nor do I think it likely that either their research communities or funders like the NSF, a big pusher for Data Science, will push them toward collaboration with us any time soon.
Patricia: What does the future hold for dealing with “big data,” and where do we go from here?
David: I think we keep asking our questions and turn to Big Data when we can find reason to think that they can help us answer them. I see no reason to jump on the BD/DS bandwagon any time soon.
On behalf of The CASTAC Blog, please join me in thanking David Hakken for contributing his insights into a challenging new area of social science research!
Patricia G. Lange
The CASTAC Blog
The CASTAC community joined together in 2012 to launch this blog and begin dialogue on contemporary issues and research approaches. Even though the blog is just getting off the ground, certain powerful themes are already emerging across different projects and areas of study. Key themes for the coming year include dealing with large data sets, connecting individual choices to larger economic forces, and translating the meaning of actions from different realms of experience.
Perhaps the most visible trend on our minds right now involves dealing with scale. How can anthropologists, ethnographers, and other STS scholars address large data sets and approaches in research and pedagogy, while also retaining an appropriate relationship to the theories and methods that have made our disciplines strong? As we look ahead to 2013, it would seem that a big question for the CASTAC community involves finding creative and ethical ways to deal with phenomena that range from the overwhelmingly large to the microscopic, in order to provide insight and serve our constituents in research and teaching.
Discussing large-scale forays into education and research
In the past two weeks in her posts on MOOCs in the Machine, Jordan Kraemer, our dedicated Web Producer, has been reflecting on how higher education is grappling with MOOCs, or “massive open online classes,” which open up opportunities to those who have been shut out of traditional elite institutions. At the same time, serious questions emerged about the ramifications of trade-offs between saving money and providing high-quality education. Kraemer points out that much of the debate ties into larger arguments about why it is that people have been shut out of education and how concentration of wealth and the neoliberalization of the university are challenging the old equation of supporting open-ended research that ultimately strengthens and supports teaching. She proposes new forms of graduate education in which recent graduates are supported by their universities with teaching jobs, to complete teaching experience, transfer teaching loads from full-time faculty, and support graduate students as they transition into full-time positions.
Part of the issue with MOOCs has to do with questions of scale, and how or whether individual lectures and course preparation can be generalized to large-scale audiences in ways that provide solid instruction without compromising quality. Higher education depends upon staying current with research, and so far, we do not have enough evidence to support the idea that MOOCs will work or will address all of the concerns emerging from the neoliberalization of the academy. Those of us interested in online interaction and pedagogy will be watching this space closely in the coming year.
Questions of scale also came into play with Daniel Miller’s discussion of doing Eight Comparative Ethnographies. Miller argues that doing several ethnographies at the same time will enable comparative questions that are not possible when investigating one site alone. He provides an example from social network sites. He asks, to what extent are particular behaviors the product of a type of site, a single site, or the intersection of cultures in which a site is embedded? Is the behavior so because it is happening on Facebook or because the participants are Brazilian? A comparative study enables a level of analysis that is more inclusive than that derived from a single study. Expanding scale without compromising the traditions and benefits of ethnographic work remains a challenge for these and other large-scale projects in the future, which have the potential to provide crucial insights.
Making small-scale choices visible
As one set of researchers brings up issues with regard to enormously large-scale education and research, other STS participants on The CASTAC Blog are dealing with the opposite issue, which involves grappling with how the dynamics of extremely personal and individualistic acts, such as the donation of sex cells, interact with large-scale economic and cultural forces. In her post on The Medical Market for Eggs and Sperm, Rene Almeling, the winner of the 2012 Forsythe Prize, provides an inside look into how human beings’ donations of sex cells are connected to much larger economic forces that play out differently for women and men. Women are urged to regard egg donation as a feminine act of gift-giving; men are encouraged to see donation as a job. Almeling ties our understanding of what might be an individual act into economic forces, as well as gendered, cultural expectations about families and reproduction. Gendered framings of donation not only impact the individuals who provide genetic material, but also strongly influence the structure of the market for sex cells.
Another key issue on our minds has to do with dealing with personal responsibility and showing how individual choices impact much larger social and economic forces in finance, computing, and going green.
In his post, On Building Social Robustness, David Hakken raises the question of how individuals contributed to large-scale economic and social crises, such as the recent disasters in the world of finance. His project is informed by work that is trying to deal with the first “5,000 years” in the history of debt. He proposes developing a notion of social robustness, parallel to the technical notion of robustness in computer science.
His work makes intriguing use of ideas from the people whom we study, applying them as an inspiration for making social change. When Hakken asks about the extent to which computing professionals are ethically responsible for the financial crisis, he is proposing a way of asking how a large-scale disaster can be traced to more individual, micro-units of action. By investigating these connections, his project informs a conversation that is increasingly picking up steam in the area of the anthropology of value.
Hakken’s reflections are especially haunting as he warns of the difficulties of building a career in anthropology and STS. As he is moving towards retirement, his perspective is especially valued in our community. As an antidote to more provincial institutional perspectives, he urges a more consolidated and community approach that involves supporting each other in doing the important work that the CASTAC community has the potential to achieve.
Questions of scale and responsibility are once again intertwined in David J. Hess’s post on Opening Political Opportunities for a Green Transition. Hess points out that a non-partisan political issue has become partisan despite the fact that the planet has now reached a carbon dioxide level that it has not seen for at least 800,000 years! But because change is imperceptibly slow to the human eye, politics is allowed to complicate change. Hess has worked to investigate what he calls the “problem behind the problem,” which involves the lack of political will to address environmental sustainability and social fairness, which considerably worsens the environmental problem itself. He provides real solutions through an ambitious three-part series of books that propose “alternative pathways” or social movements centered on reform in part through the efforts of the private sector.
Notably, personal experiences in anthropology inform Hess’s work. Although he is in a sociology department and in an energy and environment institute, he points out that an anthropological sensibility continues to inform his thinking. While the discourse on these issues has traditionally revolved around a two-party system, Hess’s more anthropological approach makes visible other ideologies such as localism and developmentalism that may pave a more direct path to “good green jobs” and a more sensitive and responsible green policy. Again interacting with questions of scale, Hess’s notions of responsibility are grounded in understanding the “broad contours” of the “tectonic shifts” of ideology and policy that are underway in working toward a green transition in the United States and around the world. Without real action, however, his prognosis remains pessimistic.
Translating phenomena across different realms of experience
A theme that also emerged from our nascent blog’s initial posts had to do with understanding the ramifications of processing one realm of experience by using metaphors and concepts from another. In her post on the Anthropological Investigations of MIME-NET, Lucy Suchman explores the darker side of entertainment and its relationship to military applications. She investigates how information and communication technologies have “intensified rather than dissipated” what theorists have described as the “fog of war.”
The problem is partly one of translation. How is it possible to maintain what military strategists call “situational awareness,” which has to do with maintaining a constant and accurate mental image of relevant tactical information? Suchman is studying activities such as The Flatworld Project, which brings together practitioners from the Hollywood film industry, gaming, and other models of immersive computing to understand these dynamics. Such a project also involves analyzing how such approaches “extend human capacities for action at a distance,” and present ethical challenges to researchers as they grapple with military realms and connecting seemingly disparate but interrelated areas such as war and healthcare.
Lisa Messeri’s post, Anthropology and Outer Space, offers an absolutely fascinating look into human conceptualization of place. She asks, why should earthlings be concerned about what is happening on Mars? Her work focuses on how “scientists transform planets from objects into places.” Significant milestones in space exploration such as the passing of Venus between the Earth and the Sun (not scheduled to do so again until 2117) and the landing of the Mars rover, Curiosity, provide rich areas to mine for understanding cultural notions of place and human exploration. Curiosity has its own Twitter account (!) and tweets freely about its experience of “springtime” in its southern hemisphere. Messeri argues that this kind of language “bridges” our worlds in that Curiosity somehow seems to experience something that is familiar to humans—springtime. Scientists are now studying things that are so far away that telescopes cannot take an image of them. Somehow, these “invisible” objects become familiar and complex. Planets begin to seem like places because of the way in which language “makes the strange familiar,” and bridges the experience between events on an exoplanet and life on Earth.
Astronomers become place makers, and observing these processes shows how spaces become “social” even though, as Messeri argues, “humans will never visit such planetary places.” Messeri shows how such conceptualizations can lead to the spread of erroneous scientific rumors that get reported on national news organizations. Her work not only shows how knowledge production is compromised by the use of such metaphors but also provides an intriguing look at how humans process invisible objects through the cultural production of imagined place.
Tune in next week!
Given that questions of scale were on our minds in 2012, it is especially fitting that we launch 2013 with a discussion about Big Data, and the challenges and opportunities that emerge when entities collect and combine huge data sets that are far too large to handle through ordinary coding schemes or desktop databases. Social scientists, technologists, and other researchers must grapple with numerous issues including legibility, data integrity, ethics, and usability. I am particularly pleased that David Hakken agreed to be interviewed by The CASTAC Blog to discuss his views. Next week, he provides fascinating insights into what the future holds for dealing with Big Data!
Before signing off, I would like to thank everyone for their participation in The CASTAC Blog, especially those who wrote posts, left comments, read articles, and tweeted our posts to the world. I very much appreciated everyone’s participation. The richness of the posts makes it too difficult to adequately cover all the content of the past year in one commentary, but rest assured that everyone’s post is contributing to the conversation and is valued by the CASTAC community.
In an effort to include more voices and keep a continuing flow of content, The CASTAC Blog is now seeking a core group of “frequent” contributors to keep pace with new developments in this space in 2013. Notice that I use the term “frequent” sparingly: even a few posts throughout the year make you a frequent contributor. Please consider sharing your thoughts and views with the CASTAC community. If you would like to join in, please email me at: firstname.lastname@example.org.
I look forward to an interesting and productive year ahead!
Patricia G. Lange
The CASTAC Blog