Paper Collections

Reading lists organized by the research areas referenced in the Data Counterfactuals memo.

Active Learning 2 papers

Active learning asks: given a budget for labeling, which points should you label next? This is navigation within the grid—picking which rows become available. Settles' survey is the classic reference; recent work connects to coresets and deep learning.
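
As a minimal illustration of the query loop (the synthetic data and logistic model are assumptions for illustration, not taken from the papers below), uncertainty sampling repeatedly labels the pool point the current model is least sure about:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Seed set with both classes; everything else goes into the unlabeled pool.
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(len(X)) if i not in labeled]

for _ in range(20):                                    # labeling budget: 20 queries
    clf = LogisticRegression().fit(X[labeled], y[labeled])
    probs = clf.predict_proba(X[pool])[:, 1]
    query = pool[int(np.argmin(np.abs(probs - 0.5)))]  # closest to the decision boundary
    labeled.append(query)                              # "spend" one label on it
    pool.remove(query)
```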

  • Active Learning for Convolutional Neural Networks: A Core-Set Approach
    Ozan Sener, Silvio Savarese
    International Conference on Learning Representations (ICLR) (2018)
  • Active Learning Literature Survey
    Burr Settles
    University of Wisconsin-Madison, Computer Sciences Technical Report 1648 (2009)

Data Augmentation & Curriculum Learning 2 papers

Augmentation creates synthetic variations of training data—effectively generating new rows in the grid. Curriculum learning orders training examples from easy to hard. Mixup is a simple, influential technique; the survey covers the broader landscape.
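
A minimal sketch of the interpolation rule from the mixup paper (the function name and batch layout are my own): blend random pairs of inputs and one-hot labels with a Beta-distributed weight.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)            # mixing weight lambda ~ Beta(alpha, alpha)
    perm = rng.permutation(len(x))          # random partner for every example
    x_mix = lam * x + (1 - lam) * x[perm]   # x_mix = lam * x_i + (1 - lam) * x_j
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix                     # train on the mixed batch as usual
```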

  • mixup: Beyond Empirical Risk Minimization
    Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz
    ICLR 2018 (2018)
  • Curriculum Learning
    Yoshua Bengio, Jerome Louradour, Ronan Collobert, Jason Weston
    ICML 2009 (2009)

Data Leverage & Collective Action 20 papers

This collection gathers foundational and recent work on **data leverage**—the strategic use of data withholding, contribution, or manipulation as a form of collective action.

Core concepts:
  • **Data Strikes**: Coordinated refusal to contribute data to platforms
  • **Data Poisoning**: Intentionally corrupting training data to degrade model performance
  • **Conscious Data Contribution**: Strategically directing data to preferred systems
  • **Data Valuation**: Methods for quantifying individual data contributions (Shapley values, etc.)

Why this matters: As AI systems become more dependent on user-generated content and behavioral data, data creators gain potential leverage over technology companies. This research explores when and how such leverage can be effectively exercised.
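
To make the "data strike" idea concrete, here is an illustrative simulation (synthetic data and a logistic model, not drawn from any listed paper): withhold a growing fraction of training rows and watch test accuracy respond.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X @ rng.normal(size=10) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for strike_frac in [0.0, 0.25, 0.5, 0.75, 0.9]:
    keep = rng.random(len(X_tr)) >= strike_frac          # rows whose owners did not strike
    acc = LogisticRegression().fit(X_tr[keep], y_tr[keep]).score(X_te, y_te)
    print(f"withheld {strike_frac:.0%} of rows -> test accuracy {acc:.3f}")
```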

  • Algorithmic Collective Action with Two Collectives
    Aditya Karan, Nicholas Vincent, Karrie Karahalios, Hari Sundaram
    ACM FAccT (2025)
  • The Economics of AI Training Data: A Research Agenda
    Hamidah Oderinwale, Anna Kazlauskas
    arXiv preprint (2025)
  • Collective Bargaining in the Information Economy Can Address AI-Driven Power Concentration
    Nicholas Vincent, Matthew Prewitt, Hanlin Li
    NeurIPS Position Papers (2025)
  • Push and Pull: A Framework for Measuring Attentional Agency on Digital Platforms
    Zachary Wojtowicz, Shrey Jain, Nicholas Vincent
    ACM FAccT (2025)
  • Poisoning Web-Scale Training Datasets is Practical
    Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, Florian Tramèr
    2024
  • Large language models reduce public knowledge sharing on online Q&A platforms
    R. Maria del Rio-Chanona, Nadzeya Laurentsyeva, Johannes Wachs
    PNAS Nexus (2024)
  • Algorithmic Collective Action in Machine Learning
    Moritz Hardt, Eric Mazumdar, Celestine Mendler-Dünner, Tijana Zrnic
    International Conference on Machine Learning (ICML) (2023)
  • The Dimensions of Data Labor: A Road Map for Researchers, Activists, and Policymakers to Empower Data Producers
    Hanlin Li, Nicholas Vincent, Stevie Chancellor, Brent Hecht
    Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (2023)
  • Behavioral Use Licensing for Responsible AI
    Danish Contractor, Daniel McDuff, Julia Katherine Haines, Jenny Lee, Christopher Hines, Brent Hecht, Nicholas Vincent, Hanlin Li
    ACM FAccT (2022)
  • Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses
    Micah Goldblum, Dimitris Tsipras, Chulin Xie, Xinyun Chen, Avi Schwarzschild, Dawn Song, Aleksander Madry, Bo Li, Tom Goldstein
    IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)
  • Addressing Documentation Debt in Machine Learning Research: A Retrospective Datasheet for BookCorpus
    Jack Bandy, Nicholas Vincent
    NeurIPS Datasets and Benchmarks (2021)
  • Machine Unlearning
    Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, Nicolas Papernot
    IEEE Symposium on Security and Privacy (S&P) (2021)
  • Extracting Training Data from Large Language Models
    Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, Nicolas Papernot
    Proceedings of USENIX Security Symposium (2021)
  • Can "Conscious Data Contribution" Help Users to Exert "Data Leverage" Against Technology Companies?
    Nicholas Vincent, Brent Hecht
    Proceedings of the ACM on Human-Computer Interaction (2021)
  • Data Leverage: A Framework for Empowering the Public in its Relationship with Technology Companies
    Nicholas Vincent, Hanlin Li, Nicole Tilly, Stevie Chancellor, Brent Hecht
    Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (2021)
  • Data Shapley: Equitable Valuation of Data for Machine Learning
    Amirata Ghorbani, James Zou
    International Conference on Machine Learning (2019)
  • BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
    Tianyu Gu, Brendan Dolan-Gavitt, Siddharth Garg
    IEEE Access (2019)
  • How Do People Change Their Technology Use in Protest?: Understanding Protest Users
    Hanlin Li, Nicholas Vincent, Janice Tsai, Jofish Kaye, Brent Hecht
    ACM CSCW (2019)
  • "Data Strikes": Evaluating the Effectiveness of a New Form of Collective Action Against Technology Companies
    Nicholas Vincent, Brent Hecht, Shilad Sen
    The World Wide Web Conference (WWW) (2019)
  • Examining Wikipedia With a Broader Lens: Quantifying the Value of Wikipedia's Relationships with Other Large-Scale Online Communities
    Nicholas Vincent, Isaac Johnson, Brent Hecht
    ACM CHI (2018)

Data Poisoning & Adversarial Training 4 papers

Poisoning asks: what if someone deliberately corrupts training data? This expands the grid dramatically—every possible perturbation creates new rows. BadNets showed how small triggers in images cause targeted misclassifications. Recent work looks at poisoning web-scale datasets and deceptive model behavior.
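
A hedged sketch of a BadNets-style backdoor (array shapes, patch size, and poisoning rate are illustrative assumptions): stamp a small trigger onto a fraction of training images and relabel them to an attacker-chosen class.

```python
import numpy as np

def poison(images, labels, target_class=0, rate=0.05, rng=None):
    """images: float array of shape (N, H, W); labels: int array of shape (N,)."""
    rng = rng or np.random.default_rng(0)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(rate * len(images)), replace=False)
    images[idx, -3:, -3:] = 1.0        # bright 3x3 corner patch acts as the trigger
    labels[idx] = target_class         # any test image carrying the patch maps here
    return images, labels
```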

  • Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations
    NIST
    NIST AI 100-2e2025 (2025)
  • Poisoning Web-Scale Training Datasets is Practical
    Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, Florian Tramèr
    2024
  • BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
    Tianyu Gu, Brendan Dolan-Gavitt, Siddharth Garg
    IEEE Access (2019)
  • Poisoning Attacks against Support Vector Machines
    Battista Biggio, Blaine Nelson, Pavel Laskov
    Proceedings of the 29th International Conference on Machine Learning (ICML) (2012)

Data Scaling Laws 3 papers

Scaling laws describe how performance changes as you add more data (or parameters, or compute). In grid terms, they're regressions over average performance across rows of different sizes. Kaplan et al. established the modern framework; Chinchilla refined it for compute-optimal training.
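
As a worked example of what such a regression looks like (the loss values below are made up for illustration), a data-only power law L(D) = (D_c / D)^α can be fit by ordinary least squares in log-log space:

```python
import numpy as np

D = np.array([1e6, 1e7, 1e8, 1e9])     # dataset sizes (e.g., tokens)
L = np.array([4.2, 3.6, 3.1, 2.7])     # measured validation losses (illustrative)

slope, intercept = np.polyfit(np.log(D), np.log(L), deg=1)
alpha = -slope                          # log L = alpha*log(D_c) - alpha*log(D)
D_c = np.exp(intercept / alpha)
print(f"L(D) ~ (D_c / D)^alpha with alpha = {alpha:.3f}, D_c = {D_c:.3g}")
```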

  • AI models collapse when trained on recursively generated data
    Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, Yarin Gal
    Nature (2024)
  • Training Compute-Optimal Large Language Models
    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre
    NeurIPS 2022 (2022)
  • Scaling Laws for Neural Language Models
    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei
    arXiv preprint (2020)

Data Selection & Coresets 4 papers

Coresets are small subsets that approximate training on the full dataset. In grid terms, you're looking for a much smaller row that lands in roughly the same performance region. CRAIG and DeepCore are practical methods for finding these subsets efficiently.
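
A hedged sketch of the greedy k-center selection that underlies the core-set approach (function and variable names are mine): repeatedly add the point farthest from everything selected so far.

```python
import numpy as np

def greedy_k_center(X, k, rng=None):
    rng = rng or np.random.default_rng(0)
    selected = [int(rng.integers(len(X)))]               # arbitrary first center
    dists = np.linalg.norm(X - X[selected[0]], axis=1)   # distance to nearest selected point
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))                      # farthest remaining point
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return selected                                      # row indices of the coreset
```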

  • Data-centric Artificial Intelligence: A Survey
    Daochen Zha, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, Zhimeng Jiang, Shaochen Zhong, Xia Hu
    ACM Computing Surveys (2025)
  • Position Paper: Data-Centric AI in the Age of Large Language Models
    Xinyi Xu, Zhaoxuan Wu, Rui Qiao, Arun Verma, Yao Shu, Jingtan Wang, Xinyuan Niu, Zhenfeng He, Jiangwei Chen, Zijian Zhou, Gregory Kang Ruey Lau, Hieu Dao, Lucas Agussurja, Rachael Hwee Ling Sim, Xiaoqiang Lin, Wenyang Hu, Zhongxiang Dai, Pang Wei Koh, Bryan Kian Hsiang Low
    EMNLP Findings (2024)
  • Coresets for Data-efficient Training of Machine Learning Models
    Baharan Mirzasoleiman, Jeff Bilmes, Jure Leskovec
    International Conference on Machine Learning (ICML) (2020)
  • Active Learning for Convolutional Neural Networks: A Core-Set Approach
    Ozan Sener, Silvio Savarese
    International Conference on Learning Representations (ICLR) (2018)

Data Valuation & Shapley 5 papers

Data Shapley and related methods assign a "value" to each training point by averaging its marginal contribution across many possible training sets. This is essentially a principled way to aggregate over the grid. These techniques matter for data markets, debugging, and understanding what data actually helps.
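
A hedged Monte Carlo sketch of that averaging, in the spirit of Data Shapley's permutation-based estimator (`train_and_score` is an assumed callable that trains on the given row indices and returns test performance):

```python
import numpy as np

def shapley_values(n, train_and_score, num_perms=50, rng=None):
    rng = rng or np.random.default_rng(0)
    values = np.zeros(n)
    for _ in range(num_perms):
        perm = rng.permutation(n)
        prev = train_and_score([])                   # performance of the empty set (e.g., chance)
        for j, i in enumerate(perm):
            score = train_and_score(perm[: j + 1].tolist())
            values[i] += score - prev                # marginal contribution of point i
            prev = score
    return values / num_perms
```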

  • Shapley value-based data valuation for machine learning data markets
    Discover Applied Sciences (2025)
  • CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning
    Huaiguang Cai
    arXiv preprint (2024)
  • Datamodels: Predicting Predictions from Training Data
    Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, Aleksander Madry
    International Conference on Machine Learning (ICML) (2022)
  • Data Shapley: Equitable Valuation of Data for Machine Learning
    Amirata Ghorbani, James Zou
    International Conference on Machine Learning (2019)
  • Towards Efficient Data Valuation Based on the Shapley Value
    Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve Gurel, Bo Li, Ce Zhang, Dawn Song, Costas J. Spanos
    International Conference on Artificial Intelligence and Statistics (2019)

Experimental Design & Causal Inference 2 papers

The grid is fundamentally about counterfactuals, which connects to causal inference. These are foundational references: Pearl for structural causal models, and Imbens & Rubin for the potential-outcomes framework and practical methods. Useful background if you want to think carefully about what "what if" means.
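
As a toy illustration of the potential-outcomes reading of "what if" (the numbers are invented): under randomized treatment assignment, the difference in observed group means estimates the average treatment effect E[Y(1) - Y(0)].

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
y0 = rng.normal(0.0, 1.0, n)            # potential outcome without treatment
y1 = y0 + 2.0                           # potential outcome with treatment (true effect = 2)
t = rng.integers(0, 2, n)               # randomized assignment
y_obs = np.where(t == 1, y1, y0)        # only one potential outcome is ever observed

ate_hat = y_obs[t == 1].mean() - y_obs[t == 0].mean()
print(f"estimated ATE = {ate_hat:.2f}")  # close to 2 because assignment is randomized
```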

  • Causal Inference in Statistics, Social, and Biomedical Sciences
    Guido W. Imbens, Donald B. Rubin
    Cambridge University Press (2015)
  • Causality: Models, Reasoning, and Inference
    Judea Pearl
    Cambridge University Press (2009)

Fairness via Data Interventions 2 papers

Many fairness problems trace back to training data—who's included, how they're labeled, what's missing. Datasheets for Datasets is a practical starting point; Gender Shades demonstrates concrete harms. This collection mixes technical interventions with documentation practices.
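
A minimal sketch of the kind of disaggregated audit Gender Shades performs (the array names are assumptions): compute error rates per demographic group and examine the gap.

```python
import numpy as np

def error_rate_by_group(y_true, y_pred, group):
    rates = {}
    for g in np.unique(group):
        mask = group == g
        rates[g] = float(np.mean(y_true[mask] != y_pred[mask]))  # per-group error rate
    return rates   # large gaps between groups are the disparity to investigate
```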

  • Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification
    Joy Buolamwini, Timnit Gebru
    Proceedings of the Conference on Fairness, Accountability and Transparency (FAT*) (2018)
  • Datasheets for Datasets
    Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, Kate Crawford
    arXiv:1803.09010 (2018)

Influence Functions & Data Attribution 7 papers

Influence functions let you ask: "which training examples are responsible for this prediction?" In grid terms, they approximate what would happen if you removed or upweighted specific rows. The original Koh & Liang paper is the standard starting point; more recent work extends these ideas to LLMs.
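
A hedged sketch of the Koh & Liang approximation for a small, smooth model: the influence of a training point z on the loss at z_test is -grad L(z_test)ᵀ H⁻¹ grad L(z), where H is the Hessian of the average training loss. The gradient and Hessian callables below are assumptions; at scale you would use an iterative solver rather than an explicit inverse.

```python
import numpy as np

def influence(grad_loss, hessian_avg, z_train, z_test, params):
    g_test = grad_loss(z_test, params)           # gradient of loss at the test point
    g_train = grad_loss(z_train, params)         # gradient of loss at the training point
    h_inv = np.linalg.inv(hessian_avg(params))   # explicit inverse: fine for toy models only
    return -g_test @ h_inv @ g_train             # predicted effect of upweighting z
```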

  • Distributional Training Data Attribution: What do Influence Functions Sample?
    Bruno Mlodozeniec, Isaac Reid, Sam Power, David Krueger, Murat Erdogdu, Richard E. Turner, Roger Grosse
    arXiv preprint (2025)
  • Revisiting Data Attribution for Influence Functions
    Hongbo Zhu, Angelo Cangelosi
    arXiv preprint (2025)
  • A Versatile Influence Function for Data Attribution with Non-Decomposable Loss
    Junwei Deng, Weijing Tang, Jiaqi W. Ma
    arXiv preprint (2024)
  • Influence Functions for Scalable Data Attribution in Diffusion Models
    Bruno Mlodozeniec, Runa Eschenhagen, Juhan Bae, Alexander Immer, David Krueger, Richard Turner
    arXiv preprint (2024)
  • Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration
    Kangxi Wu, Liang Pang, Huawei Shen, Xueqi Cheng
    EMNLP (2024)
  • Estimating Training Data Influence by Tracing Gradient Descent
    Garima Pruthi, Frederick Liu, Mukund Sundararajan, Satyen Kale
    Advances in Neural Information Processing Systems (NeurIPS) (2020)
  • Understanding Black-box Predictions via Influence Functions
    Pang Wei Koh, Percy Liang
    Proceedings of the 34th International Conference on Machine Learning (ICML) (2017)

Machine Unlearning 3 papers

Unlearning asks: can you efficiently update a model as if a given data point had never been in the training set? This is moving from one row to another without retraining from scratch. SISA is a practical sharding approach; the survey covers the growing field.
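
A hedged sketch of the sharding idea behind SISA (the real method also slices and checkpoints within shards; this keeps just one model per shard): to unlearn a point, retrain only the shard that contained it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_sisa(X, y, num_shards=5):
    shard_idx = [np.arange(i, len(X), num_shards) for i in range(num_shards)]  # round-robin shards
    models = [LogisticRegression().fit(X[idx], y[idx]) for idx in shard_idx]
    return models, shard_idx          # predictions would aggregate (e.g., vote) across shard models

def unlearn(models, shard_idx, X, y, point):
    s = next(i for i, idx in enumerate(shard_idx) if point in idx)           # shard holding the point
    shard_idx[s] = shard_idx[s][shard_idx[s] != point]                       # drop the forgotten row
    models[s] = LogisticRegression().fit(X[shard_idx[s]], y[shard_idx[s]])   # retrain only that shard
    return models, shard_idx
```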

  • Rethinking machine unlearning for large language models
    Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Yuguang Yao, Chris Yuhao Liu, Xiaojun Xu, Hang Li, Kush R. Varshney, Mohit Bansal, Sanmi Koyejo, Yang Liu
    Nature Machine Intelligence (2025)
  • LLM Unlearning via Loss Adjustment with Only Forget Data
    Yaxuan Wang, Jiaheng Wei, Chris Yuhao Liu, Jinlong Pang, Quan Liu, Ankit Parag Shah, Yujia Bao, Yang Liu, Wei Wei
    ICLR 2025 (2024)
  • Machine Unlearning
    Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, Nicolas Papernot
    IEEE Symposium on Security and Privacy (S&P) (2021)

Model Collapse & Synthetic Data 2 papers

Model collapse occurs when generative models are trained on data produced by previous model generations. The original content distribution's tails disappear, leading to mode collapse and loss of diversity. This is increasingly important as AI-generated content proliferates on the web. Research explores both the phenomenon and potential mitigations like data accumulation and verification.
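
An illustrative toy of the mechanism (not from either paper): each "generation" fits a Gaussian to samples drawn from the previous generation's fit, and on average the fitted spread shrinks because finite samples under-represent the tails.

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs, n_gens, n_samples = 200, 50, 25
final_sigmas = []
for _ in range(n_runs):
    mu, sigma = 0.0, 1.0                                  # generation 0: the "real" distribution
    for _ in range(n_gens):
        samples = rng.normal(mu, sigma, size=n_samples)   # train only on the previous model's output
        mu, sigma = samples.mean(), samples.std()         # the new "model" is the fitted Gaussian
    final_sigmas.append(sigma)
print(f"mean sigma after {n_gens} generations: {np.mean(final_sigmas):.2f} (started at 1.0)")
```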

  • Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data
    Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo
    arXiv preprint (2024)
  • AI models collapse when trained on recursively generated data
    Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, Yarin Gal
    Nature (2024)

Privacy, Memorization & Unlearning 2 papers

Models can memorize training data—sometimes enough to extract it verbatim. Differential privacy limits how much any single point can affect the model (constraining movement in the grid). The Carlini et al. extraction paper is a good entry point for understanding what's at stake.
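
A hedged sketch of the per-example clipping plus Gaussian noise step used in DP-SGD style training (shapes and parameter names are assumptions): bounding and noising each example's gradient is what limits any single row's effect on the update.

```python
import numpy as np

def privatize_gradients(per_example_grads, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """per_example_grads: array of shape (batch_size, num_params)."""
    rng = rng or np.random.default_rng(0)
    norms = np.maximum(np.linalg.norm(per_example_grads, axis=1, keepdims=True), 1e-12)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / norms)   # cap each row's influence
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped.shape[1])
    return (clipped.sum(axis=0) + noise) / len(clipped)                # noisy averaged gradient
```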

  • Extracting Training Data from Large Language Models
    Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, Nicolas Papernot
    Proceedings of USENIX Security Symposium (2021)
  • The Algorithmic Foundations of Differential Privacy
    Cynthia Dwork, Aaron Roth
    Foundations and Trends in Theoretical Computer Science (2014)

User-Generated Content & AI Training Data 9 papers

This collection examines the role of **user-generated content** in training and powering AI systems.

Key themes:
  • **UGC in Search**: How Wikipedia and other UGC improves search engine results
  • **Training Data Value**: Quantifying the contribution of user content to AI models
  • **Platform Dependencies**: How AI systems rely on crowdsourced knowledge
  • **Content Creator Rights**: Implications for people who create the data AI learns from

Related collections: see also [Data Leverage & Collective Action](./data-leverage) for research on how content creators can exercise power over AI systems.

  • The Economics of AI Training Data: A Research Agenda
    Hamidah Oderinwale, Anna Kazlauskas
    arXiv preprint (2025)
  • The Rise of AI-Generated Content in Wikipedia
    Creston Brooks, Samuel Eggert, Denis Peskoff
    arXiv preprint arXiv:2410.08044 (2024)
  • Large language models reduce public knowledge sharing on online Q&A platforms
    R. Maria del Rio-Chanona, Nadzeya Laurentsyeva, Johannes Wachs
    PNAS Nexus (2024)
  • The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
    Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, Sara Hooker
    Nature Machine Intelligence (2024)
  • Wikipedia's value in the age of generative AI
    Selena Deckelmann
    Wikimedia Foundation (2023)
  • Extracting Training Data from Large Language Models
    Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, Nicolas Papernot
    Proceedings of USENIX Security Symposium (2021)
  • A Deeper Investigation of the Importance of Wikipedia Links to Search Engine Results
    Nicholas Vincent, Brent Hecht
    Proceedings of the ACM on Human-Computer Interaction (2021)
  • Measuring the Importance of User-Generated Content to Search Engines
    Nicholas Vincent, Isaac Johnson, Patrick Sheehan, Brent Hecht
    Proceedings of the International AAAI Conference on Web and Social Media (2019)
  • The Substantial Interdependence of Wikipedia and Google: A Case Study on the Relationship Between Peer Production Communities and Information Technologies
    Connor McMahon, Isaac Johnson, Brent Hecht
    Proceedings of the International AAAI Conference on Web and Social Media (2017)