NB: This blogpost was originally published on the New Political Communication Unit blog.
Are academic researchers being left behind by their commercial counterparts? Is the rigour and transparency of social science being sacrificed in the name of revolution? ‘The Big Data Revolution’ workshop hosted by Innocentive offered an excellent chance to weigh up these questions. The workshop brought together a mix of commercial, academic and government researchers including representatives from Amazon, City University, Royal Bank of Scotland, and the US Department of Homeland Security.
Defining Big Data and Outlining its Problems
Defining big data is no easy feat. The term is often employed to signify low-density, high-volume information. Popular discourse commonly refers to large collections of social data. However, the parameters of ‘big’ are inherently subjective, and the content type and richness of data vary immensely. Whilst graduate researchers may grapple with a gigabyte of 140-character tweets, global corporations may amass petabytes of potentially rich data through store cards, web transactions, and other forms of personal information collection. It is imperative to try to encapsulate this disparity within a standardised definition. boyd and Crawford (2012: 663) draw on two separate strands: (i) the data collection process itself is formed by maximising computational power to gather large datasets, and (ii) these datasets are analysed in order to make an assortment of claims.
The opportunities offered by big data are intuitively quite simple. Datasets collected from social platforms, or as a by-product of organic human behaviour, offer potentially large amounts of increasingly rich data. Unlike survey-based studies, such data is not prone to researcher bias, and the content collected may even rival some qualitative research for depth. As such, an idealistic vision has emerged overstating the benefits of big data, especially within marketing circles, and the size of one’s dataset is often viewed as more important than the methodological rigour of its accumulation and analysis. boyd and Crawford claim this is cultivating a harmful and pervasive myth that large collections of data offer access to a higher form of knowledge, surpassing the insights previously possible through other forms of data collection (2012: 663).
Big data projects using social media data encounter a number of dangers. Whilst data voluntarily produced outside of researcher influence may offer a more accurate and authentic representation of the individual, the motivation or meaning behind the content of the messages is almost impossible to gauge. In order to understand the behavioural context of an individual’s communication, qualitative techniques, such as in-depth interviews, must be employed. Research should embrace and integrate these techniques rather than cast them aside in favour of the allure of large datasets (Anstead and O’Loughlin 2012). Furthermore, the majority of tools that collect social data do so without users’ prior consent to their content being used for research purposes. Given the ongoing debates surrounding online privacy, ethics must begin to be taken more seriously.
Given the numbers involved with big data projects, researchers tend to rely on one of two approaches: flawed computational analysis of large datasets, or event-specific analysis. Automated analysis is especially problematic. How do these tools accumulate data? Who regulates these systems and their methods (boyd & Crawford 2012)? All too often companies rely on problematic technologies that may offer flawed geo-location tagging or sentiment analysis with very poor inbuilt text analytics. The alternative approach, projects based on events, compromises the measurement validity of the research by specifying smaller samples. In both instances the influx of information runs the risk of breeding apophenia: perceiving patterns where no significant relationship exists, such as a correlation between the winning team of the FA Cup and the political party elected in Britain. How we ensure that the analysis of big data is both valid and reliable is a pressing concern. Innocentive do offer a unique alternative: by combining big data with crowdsourcing, research can avoid computational flaws while ensuring a large pool of willing individuals to structure and analyse the data, removing the need to limit analysis to specific cases.
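The apophenia risk is easy to demonstrate numerically. The sketch below (my own illustration, not from the workshop) generates a few hundred series of pure random noise, roughly the shape of an annual "FA Cup winner vs. election result" comparison, and searches for the strongest pairwise correlation. With enough unrelated variables, a striking correlation almost always appears by chance alone.

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(1)

# 200 completely unrelated random series, 10 observations each.
# No pair has any genuine relationship.
series = [[random.gauss(0, 1) for _ in range(10)] for _ in range(200)]

# Search all ~20,000 pairs for the strongest apparent correlation.
best = max(
    abs(pearson(series[i], series[j]))
    for i in range(len(series))
    for j in range(i + 1, len(series))
)
print(f"strongest pairwise correlation among pure noise: {best:.2f}")
```

Running this reliably reports a "strong" correlation well above 0.8, despite every series being independent noise: exactly the trap awaiting anyone who mines a large dataset for patterns without a prior hypothesis.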
The divide between industry and academia
The current ecosystem around big data creates a new kind of digital divide: the Big Data rich and the Big Data poor. (boyd & Crawford, 2012: 674)
I am not rejecting the value of big data as a resource for research. My point is that methodological rigour must always take precedence over the attraction of data size and stunning visualisations. Ensuring valid and reliable methodological practice is paramount within academic research. As pressure for the political use of big data grows, it is imperative that academic research continues to engage with and offer improvements to commercial technologies, especially given the pervasive ‘black box’ nature of a number of services. However, this is not feasible at present given the fundamental problem of access to these technologies for academic research. Research can only be as good as the data on which it is based. The difficulties of access to data, especially data collected from social media platforms, have created a new digital divide: those with money, be they in the commercial sector or a wealthy academic institution, have better access to data; those without are left with poor data, which compromises methodological validity.
There are some promising developments in bridging the gap between the demand for large datasets and methodological diligence. Demos have recently announced the establishment of the Centre for the Analysis of Social Media. Led by Jamie Bartlett, and in collaboration with the Text Analytics Group at the University of Sussex, the unit aims to draw methodologically valid and reliable inferences from social data for policy and social research.
Thank you to Innocentive for the opportunity to take part in the workshop.
Anstead, N., & O’Loughlin, B. (2012). Semantic polling: the ethics of online public opinion. Media policy brief, 5. The London School of Economics and Political Science, London, UK. Retrieved 05/11/2012, from http://eprints.lse.ac.uk/46944/
boyd, d. m., & Crawford, K. (2012). Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon. Information, Communication & Society, 15(5), 662-679.