Ethics in Big Data and Computational Social Science Research

by Janet Salmons, Research Community Manager for SAGE Methodspace
Dr. Salmons is the author of Doing Qualitative Research Online, which focuses on ethical research and writing, and What Kind of Researcher Are You? which focuses on researcher integrity. With the code MSPACEQ323 you receive a 20% discount when you order the books from SAGE. Valid through September 30.


Clearly, there is no one-size-fits-all set of guidelines or practices given the complexities of technology-infused scholarly endeavors.

What was readily accepted as a standard approach can no longer be applied without consideration of emerging factors. Nowhere is that point more obvious than in any kind of research that involves technology.

Critical data studies is in its infancy, but it faces a substantial challenge: as the practice of data science surges ahead, we lack a strong and rigorous sense of ethical parameters for scientific research…. There are growing discontinuities between the research practices of data science and established tools of research ethics regulation. … The ethical frameworks for Big Data research are highly contested and in flux, and the potential harms of data science research are unpredictable. (Metcalf & Crawford, 2016)

The “discontinuities” have grown even more complex since 2016! While Big Data and computational social science research typically do not involve direct engagement with human subjects, as Metcalf and Crawford observe, they raise ethical dilemmas of their own. This collection of articles offers multiple perspectives on the use of Big Data and on ethical protocols for computational research methods.


Bak, M. A. (2022). Computing fairness: ethics of modeling and simulation in public health. SIMULATION, 98(2), 103–111. https://doi.org/10.1177/0037549720932656

Abstract. The field of public health increasingly utilizes computational models. In this context, computer scientists are confronted with moral dilemmas like those around modeling the distribution of scarce resources. However, there is a lack of research on the ethical implications of computer modeling and simulation (M&S). In this paper I aim to show that taking a transdisciplinary ethical perspective is useful when analyzing these implications. The practice of modeling geospatial distribution of automated external defibrillators for sudden cardiac arrest treatment is used as a case study. It is shown that there exists no consensus on what theory of justice should underlie choices in computer M&S of public health resources, and that professionals struggle with building equity considerations into their models. The example highlights new ethical consequences arising at the nexus of public health and M&S. Computer models and simulations are not morally neutral, but have the effect of making those involved in their creation more responsible for making just choices. For some moral dilemmas, such as those related to distributive justice, there may be no correct solution that can be readily modeled. Promoting professional responsibility through a code of ethics will not help prescribe a right course of action in these situations. I suggest therefore that procedural justice and deliberation with a range of stakeholders is needed to take ethical considerations into account “by design” when developing computer models and simulations for public health policy. Future research should reflect on the content and practical procedures for these deliberations.

Bradfield, O. M. (2022). Waving away waivers: an obligation to contribute to ‘herd knowledge’ for data linkage research? Research Ethics, 18(2), 151–162. https://doi.org/10.1177/17470161211058311

Abstract. In today’s online data-driven world, people constantly shed data and deposit digital footprints. When individuals access health services, governments and health providers collect and store large volumes of health information about people that can later be retrieved, linked and analysed for research purposes. This can lead to new discoveries in medicine and healthcare. In addition, when securely stored and de-identified, the privacy risks are minimal and manageable. In many jurisdictions, ethics committees routinely waive the requirement for researchers to obtain consent from data subjects before using and linking these datasets in an effort to balance respect for individuals with research efficiency. In this paper, I explore the ethical justification for using routinely collected health data for research without consent. I conclude that, not only is this morally justified but also that data subjects have a moral obligation to contribute their data to such research, which would obviate the need for ethics committees to consider consent waivers. In justifying this argument, I look to the duty of easy rescue, distributive justice and draw analogies with vaccination ethics.

Custers, B. (2016). Click here to consent forever: Expiry dates for informed consent. Big Data & Society. https://doi.org/10.1177/2053951715624935

Abstract. The legal basis for processing personal data and some other types of Big Data is often the informed consent of the data subject involved. Many data controllers, such as social network sites, offer terms and conditions, privacy policies or similar documents to which a user can consent when registering as a user. There are many issues with such informed consent: people get too many consent requests to read everything, policy documents are often very long and difficult to understand and users feel they do not have a real choice anyway. Furthermore, in the context of Big Data refusing consent may not prevent predicting missing data. Finally, consent is usually asked for when registering, but rarely is consent renewed. As a result, consenting once often implies consent forever. At the same time, given the rapid changes in Big Data and data analysis, consent may easily get outdated (when earlier consent no longer reflects a user’s preferences). This paper suggests expiry dates for consent, not to settle questions, but to put them on the table as a start for further discussion on this topic. Although such expiry dates may not solve all the issues of informed consent, they may be a useful tool in some situations.

Ferretti, A., Ienca, M., Velarde, M. R., Hurst, S., & Vayena, E. (2022). The Challenges of Big Data for Research Ethics Committees: A Qualitative Swiss Study. Journal of Empirical Research on Human Research Ethics, 17(1–2), 129–143. https://doi.org/10.1177/15562646211053538

Abstract. Big data trends in health research challenge the oversight mechanism of the Research Ethics Committees (RECs). The traditional standards of research quality and the mandate of RECs illuminate deficits in facing the computational complexity, methodological novelty, and limited auditability of these approaches. To better understand the challenges facing RECs, we explored the perspectives and attitudes of the members of the seven Swiss Cantonal RECs via semi-structured qualitative interviews. Our interviews reveal limited experience among REC members with the review of big data research, insufficient expertise in data science, and uncertainty about how to mitigate big data research risks. Nonetheless, RECs could strengthen their oversight by training in data science and big data ethics, complementing their role with external experts and ad hoc boards, and introducing precise shared practices.

Giglietto, F., & Rossi, L. (2012). Ethics and Interdisciplinarity in Computational Social Science. Methodological Innovations Online, 7(1), 25–36. https://doi.org/10.4256/mio.2012.003

Abstract. During the last few years a growing amount of content produced by Internet users has become publicly available online. These data come from a variety of places, including popular social web services like Facebook and Twitter, consumer services like Amazon or weblogs. The research opportunities opened up by this socio-technological innovation are, as shown by the growing literature on the topic, huge. At the same time new challenges for social scientists arise. In this paper we will focus on two of the main challenges posed to the growth of the so-called computational social science: interdisciplinarity and ethics. While the searchability and persistence of this information make it ideal for sociological research, a quantitative approach is still challenging because of the size and complexity of the data. Collecting, storing and analyzing these data often require technical skills beyond the traditional curricula of social scientists. These projects require, in fact, collaboration with computer scientists. Nevertheless, developing a common interdisciplinary project is often challenging because of the different backgrounds of the researchers. At the same time the availability of this content poses a challenge concerning privacy and research ethics. Due to the amount of data and the fact that the real identity of the author is often hidden behind a nickname, it is often impossible to ask the subjects involved to consent to the use of their data. On the other hand, especially in the first wave of web 2.0, this information has been - intentionally or not - publicly shared by the users. While a technique of dis-embedding the identity of the user from the content analyzed is often the solution used to bypass this issue, an even more important privacy-related challenge for computational social science is emerging. Due to the wide adoption of social network sites such as Facebook or Google+, where users may decide to share their content with their group of friends only, the amount of public data will change and decrease in the future. We will discuss this issue by enumerating a number of possible future scenarios.

Obar, J. A. (2020). Sunlight alone is not a disinfectant: Consent and the futility of opening Big Data black boxes (without assistance). Big Data & Society. https://doi.org/10.1177/2053951720935615

Abstract. In our attempts to achieve privacy and reputation deliverables, advocating for service providers and other data managers to open Big Data black boxes and be more transparent about consent processes, algorithmic details, and data practice is easy. Moving from this call to meaningful forms of transparency, where the Big Data details are available, useful, and manageable, is more difficult. Most challenging is moving from that difficult task of meaningful transparency to the seemingly impossible scenario of achieving, consistently and ubiquitously, meaningful forms of consent, where individuals are aware of data practices and implications, understand these realities, and agree to them as well. This commentary unpacks these concerns in the online consent context. It emphasizes that a self-governance fallacy pervades current approaches to achieving digital forms of privacy, exemplified by the assertion that transparency and information access alone are enough to help individuals achieve privacy and reputation protections.

Rafiq, F., Awan, M. J., Yasin, A., Nobanee, H., Zain, A. M., & Bahaj, S. A. (2022). Privacy Prevention of Big Data Applications: A Systematic Literature Review. SAGE Open, 12(2). https://doi.org/10.1177/21582440221096445

Abstract. This paper focuses on privacy and security concerns in Big Data. This paper also covers the encryption techniques by taking existing methods such as differential privacy, k-anonymity, T-closeness, and L-diversity. Several privacy-preserving techniques have been created to safeguard privacy at various phases of a large data life cycle. The purpose of this work is to offer a comprehensive analysis of the privacy preservation techniques in Big Data, as well as to explain the problems for existing systems. The advanced repository search option was utilized to search for the following keywords: (“Cyber security” OR “Cybercrime”) AND ((“privacy prevention”) OR (“Big Data applications”)). During Internet research, many search engines and digital libraries were utilized to obtain information. The obtained findings were carefully gathered, out of which 103 papers from 2,099 were found to gain the best information sources to address the provided study subjects. Hence a systematic review of 32 papers from the 103 found in major databases (IEEExplore, SAGE, Science Direct, Springer, and MDPI) was carried out, showing that the majority of them focus on the privacy prediction of Big Data applications with a contents-based approach and the hybrid, which address the major security challenge and violation of Big Data. We end with a few recommendations for improving the efficiency of Big Data projects and provide secure possible techniques and proposed solutions and a model that minimizes privacy violations, showing four different types of data protection violations and the involvement of different entities in reducing their impacts.

Shilton, K., Moss, E., Gilbert, S. A., Bietz, M. J., Fiesler, C., Metcalf, J., Vitak, J., & Zimmer, M. (2021). Excavating awareness and power in data science: A manifesto for trustworthy pervasive data research. Big Data & Society, 8(2). https://doi.org/10.1177/20539517211040759

Abstract. Frequent public uproar over forms of data science that rely on information about people demonstrates the challenges of defining and demonstrating trustworthy digital data research practices. This paper reviews problems of trustworthiness in what we term pervasive data research: scholarship that relies on the rich information generated about people through digital interaction. We highlight the entwined problems of participant unawareness of such research and the relationship of pervasive data research to corporate datafication and surveillance. We suggest a way forward by drawing from the history of a different methodological approach in which researchers have struggled with trustworthy practice: ethnography. To grapple with the colonial legacy of their methods, ethnographers have developed analytic lenses and researcher practices that foreground relations of awareness and power. These lenses are inspiring but also challenging for pervasive data research, given the flattening of contexts inherent in digital data collection. We propose ways that pervasive data researchers can incorporate reflection on awareness and power within their research to support the development of trustworthy data science.

Stewart, R. (2021). Big data and Belmont: On the ethics and research implications of consumer-based datasets. Big Data & Society. https://doi.org/10.1177/20539517211048183

Abstract. Consumer-based datasets are the products of data brokerage firms that agglomerate millions of personal records on the adult US population. This big data commodity is purchased by both companies and individual clients for purposes such as marketing, risk prevention, and identity searches. The sheer magnitude and population coverage of available consumer-based datasets and the opacity of the business practices that create these datasets pose emergent ethical challenges within the computational social sciences that have begun to incorporate consumer-based datasets into empirical research. To directly engage with the core ethical debates around the use of consumer-based datasets within social science research, I first consider two case study applications of consumer-based dataset-based scholarship. I then focus on three primary ethical dilemmas within consumer-based datasets regarding human subject research, participant privacy, and informed consent in conversation with the principles of the seminal Belmont Report.

Williams, M. L., Burnap, P., & Sloan, L. (2017). Towards an Ethical Framework for Publishing Twitter Data in Social Research: Taking into Account Users’ Views, Online Context and Algorithmic Estimation. Sociology, 51(6), 1149–1168. https://doi.org/10.1177/0038038517708140

Abstract. New and emerging forms of data, including posts harvested from social media sites such as Twitter, have become part of the sociologist’s data diet. In particular, some researchers see an advantage in the perceived ‘public’ nature of Twitter posts, representing them in publications without seeking informed consent. While such practice may not be at odds with Twitter’s terms of service, we argue there is a need to interpret these through the lens of social science research methods that imply a more reflexive ethical approach than provided in ‘legal’ accounts of the permissible use of these data in research publications. To challenge some existing practice in Twitter-based research, this article brings to the fore: (1) views of Twitter users through analysis of online survey data; (2) the effect of context collapse and online disinhibition on the behaviours of users; and (3) the publication of identifiable sensitive classifications derived from algorithms.



More Methodspace posts about Research Ethics