Reclaiming the Data Commons: A Public Data Trust for Training Data
Democratization of AI means not only that people can freely use AI, but also that people can collectively decide how AI is to be used. In particular, collective decision-making power is required to redress the negative externalities from the development of increasingly advanced AI systems, including degradation of the digital commons and unemployment from automation. The rapid pace of AI development and deployment currently leaves little room for this power. Monopolized in the hands of private corporations, the development of the most capable foundation models has proceeded largely without public input. There is currently no implemented mechanism for ensuring that the economic value generated by such models is redistributed to account for their negative externalities. The citizens that have generated the data necessary to train models do not have input on how their data are to be used. In this work, we propose that a public data trust assert control over training data for foundation models. In particular, this trust should scrape the internet as a digital commons, to license to commercial model developers for a percentage cut of revenues from deployment. First, we argue in detail for the existence of such a trust. We also discuss feasibility and potential risks. Second, we detail a number of ways for a data trust to incentivize model developers to use training data only from the trust. We propose a mix of verification mechanisms, potential regulatory action, and positive incentives. We conclude by highlighting other potential benefits of our proposed data trust and connecting our work to ongoing efforts in data and compute governance.
EleutherAI: Going Beyond "Open Science" to "Science in the Open"
Jason Phang, Herbie Bradley, Leo Gao, Louis Castricato, and Stella Biderman. "EleutherAI: Going Beyond 'Open Science' to 'Science in the Open.'" Broadening Research Collaborations Workshop in ML @ NeurIPS, 2022. Oral Presentation.
Over the past two years, EleutherAI has established itself as a radically novel initiative aimed at both promoting open-source research and conducting research in a transparent, openly accessible and collaborative manner. EleutherAI's approach to research goes beyond transparency: by doing research entirely in public, anyone in the world can observe and contribute at every stage. Our work has been received positively and has resulted in several high-impact projects in Natural Language Processing and other fields. In this paper, we describe our experience doing public-facing machine learning research, the benefits we believe this approach brings, and the pitfalls we have encountered.
You reap what you sow: On the Challenges of Bias Evaluation Under Multilingual Settings
Zeerak Talat, Aurélie Névéol, et al. (incl. Stella Biderman). "You reap what you sow: On the Challenges of Bias Evaluation Under Multilingual Settings." In Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models, 2022.
Evaluating bias, fairness, and social impact in monolingual language models is a difficult task. This challenge is further compounded when language modeling occurs in a multilingual context. Considering the implication of evaluation biases for large multilingual language models, we situate the discussion of bias evaluation within a wider context of social scientific research with computational work. We highlight three dimensions of developing multilingual bias evaluation frameworks: (1) increasing transparency through documentation, (2) expanding targets of bias beyond gender, and (3) addressing cultural differences that exist between languages. We further discuss the power dynamics and consequences of training large language models and recommend that researchers remain cognizant of the ramifications of developing such technologies.
Data Governance in the Age of Large-Scale Data-Driven Language Technology
Jernite, Nguyen, et al. (incl. Stella Biderman). "Data Governance in the Age of Large-Scale Data-Driven Language Technology." In the Proceedings of ACM Conference on Fairness, Accountability, and Transparency. 2022.
The recent emergence and adoption of Machine Learning technology, and specifically of Large Language Models, has drawn attention to the need for systematic and transparent management of language data. This work proposes an approach to global language data governance that attempts to organize data management amongst stakeholders, values, and rights. Our proposal is informed by prior work on distributed governance that accounts for human values and grounded by an international research collaboration that brings together researchers and practitioners from 60 countries. The framework we present is a multi-party international governance structure focused on language data, and incorporating technical and organizational tools needed to support its work.
The Hard Problem of Aligning AI to Human Values
Connor Leahy and Stella Biderman. "The Hard Problem of Aligning AI to Human Values." The State of AI Ethics Report 4, pp. 180-183. 2021.
We discuss how common framings of AI ethics conversations underestimate the difficulty of the task at hand: if a model becomes dangerous by the mere exposure to unethical content, it is unacceptably dangerous and broken at its core. While gating such models (as OpenAI does with GPT3) behind an API with rudimentary automatic filters plus less rudimentary human moderation is a useful temporary patch, it does not address the underlying problem. These models are fundamentally not doing what we as humans want them to do, which is to act in useful, aligned ways, not just regurgitate an accurate distribution of the text they have been trained on. We need AI that is, like humans, capable of reading all kinds of content, understanding it, and then deciding to act in an ethical manner. Indeed, learning more about unethical ideologies should enhance one's ability to act ethically and fight such toxic beliefs.