This syllabus is intentionally loose. The goal is to make the site usable as open course material without turning it into a full course platform. Each week pairs two readings with one simple site activity: read the relevant memo, use the grid or graph for a few minutes, and write down one counterfactual question that the formalism makes easier to ask.
Week 1. Local data counterfactuals first
- Koh and Liang (2017), Understanding Black-box Predictions via Influence Functions
- Ilyas et al. (2022), Datamodels: Understanding Predictions with Data and Data with Predictions
- Site activity: open the grid in Explore and LOO modes; compare one selected cell to its nearest row neighbor.
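To make the Explore-vs-LOO comparison concrete, treat each cell as a (training row, evaluation point) pair and compute the leave-one-out effect by retraining without that row. A minimal sketch with a 1-nearest-neighbor model; the data, model, and 0/1 loss are illustrative stand-ins, not the site's actual internals:

```python
import numpy as np

# Toy training rows (x, label) and one evaluation point.
X = np.array([0.0, 1.0, 2.0, 10.0])
y = np.array([0, 0, 1, 1])
x_eval, y_eval = 1.4, 1  # the evaluation column we inspect

def predict_1nn(X, y, x):
    """1-nearest-neighbor prediction: the simplest retrainable model."""
    return y[np.argmin(np.abs(X - x))]

def loss(X, y):
    """0/1 loss of the 1-NN model on the single evaluation point."""
    return float(predict_1nn(X, y, x_eval) != y_eval)

# LOO influence of row i = loss after deleting row i, minus full loss.
full_loss = loss(X, y)
loo_influence = [loss(np.delete(X, i), np.delete(y, i)) - full_loss
                 for i in range(len(X))]
print(loo_influence)  # row 1 is harmful: deleting it lowers the loss
```

Comparing a cell to its nearest row neighbor then amounts to comparing two entries of `loo_influence` for the same evaluation column.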
Week 2. From influence to contribution accounting
- Jia et al. (2019), Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms
- Ghorbani and Zou (2019), Data Shapley: Equitable Valuation of Data for Machine Learning
- Site activity: compare Shapley, Banzhaf, and Beta Shapley on the same evaluation column.
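Shapley and Banzhaf are both semivalues: weighted averages of a row's marginal contributions, differing only in the weight placed on coalition size. A sketch with exact subset enumeration on a made-up three-row utility table (Beta Shapley, from Kwon and Zou 2022, plugs a Beta-distribution weight into the same template; its exact weight formula is not reproduced here):

```python
from itertools import combinations
from math import comb

n = 3  # toy dataset: three rows

def utility(S):
    """Made-up utility of training on subset S: rows 0 and 1 are
    redundant with each other; row 2 adds a small unique gain."""
    table = {frozenset(): 0.0,
             frozenset({0}): 0.6, frozenset({1}): 0.6, frozenset({2}): 0.1,
             frozenset({0, 1}): 0.6, frozenset({0, 2}): 0.7,
             frozenset({1, 2}): 0.7, frozenset({0, 1, 2}): 0.7}
    return table[frozenset(S)]

def semivalue(i, size_weight):
    """Weighted average of i's marginal contributions over all subsets."""
    others = [j for j in range(n) if j != i]
    total = 0.0
    for s in range(n):
        for S in combinations(others, s):
            total += size_weight(s) * (utility(set(S) | {i}) - utility(S))
    return total

# Shapley and Banzhaf differ only in the size weight:
shapley = [semivalue(i, lambda s: 1 / (n * comb(n - 1, s))) for i in range(n)]
banzhaf = [semivalue(i, lambda s: 1 / 2 ** (n - 1)) for i in range(n)]
print(shapley, banzhaf)
```

Running both on the same utility (i.e. the same evaluation column) is exactly the comparison the activity asks for: same marginals, different weighting.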
Week 3. More data, less data, and the shape of scaling
- Kaplan et al. (2020), Scaling Laws for Neural Language Models
- Paul, Ganguli, and Dziugaite (2021), Deep Learning on a Data Diet: Finding Important Examples Early in Training
- Site activity: use Scaling mode and describe what is lost when many rows collapse into one curve.
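A scaling curve keeps only the pair (dataset size, loss); which rows were added or dropped is invisible. A sketch of the collapse, fitting a power law on synthetic points (the constants a=5, b=0.5 are made up):

```python
import numpy as np

# Synthetic "loss vs dataset size" points on an exact power law
# L(N) = a * N**(-b), with made-up constants a=5, b=0.5.
N = np.array([1e3, 1e4, 1e5, 1e6])
L = 5.0 * N ** -0.5

# On log-log axes a power law is a line: log L = log a - b * log N,
# so ordinary least squares recovers the exponent.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
b_hat, a_hat = -slope, np.exp(intercept)
print(b_hat, a_hat)
```

Every per-row counterfactual that Weeks 1 and 2 make visible has been averaged into the single exponent `b_hat`; that loss of resolution is what the activity asks you to describe.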
Week 4. Diagnosing datasets and synthesizing stand-ins
- Swayamdipta et al. (2020), Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
- Wang et al. (2018), Dataset Distillation
- Site activity: write one row move that removes data and one row move that replaces data.
Week 5. Privacy as a neighboring-world guarantee
- Dwork (2006), Differential Privacy
- Shokri et al. (2017), Membership Inference Attacks Against Machine Learning Models
- Site activity: use DP mode and inspect how one adjacent-row gap becomes a toy noise scale.
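The standard route from an adjacent-row gap to a noise scale is the Laplace mechanism: the worst-case change in the released statistic across neighboring datasets is the sensitivity, and the noise scale is sensitivity divided by epsilon. A sketch with a clipped mean (the site's toy may set this up differently):

```python
import numpy as np

rng = np.random.default_rng(0)

# Rows are values clipped to [0, 1]; the released statistic is the mean.
rows = np.array([0.2, 0.9, 0.4, 0.7])
n = len(rows)

# Replacing one clipped row moves the mean by at most 1/n: this
# worst-case adjacent-row gap is the query's sensitivity.
sensitivity = 1.0 / n

# Laplace mechanism: noise scale = sensitivity / epsilon.
epsilon = 0.5
scale = sensitivity / epsilon
noisy_mean = rows.mean() + rng.laplace(0.0, scale)
print(sensitivity, scale, noisy_mean)
```

Note the counterfactual flavor: the noise is calibrated not to the data you have, but to the gap between your dataset and its neighboring worlds.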
Week 6. Deletion, forgetting, and unlearning
- Ginart et al. (2019), Making AI Forget You: Data Deletion in Machine Learning
- Bourtoule et al. (2021), Machine Unlearning
- Site activity: use Unlearning mode and explain why the retrain reference matters.
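The retrain reference matters because "unlearned" is a claim about a counterfactual model: the one that never saw the row. A sketch of the easy case, where a least-squares fit kept as sufficient statistics admits an exact deletion downdate that provably matches retraining from scratch (deep networks do not enjoy this property; that gap is what Bourtoule et al. engineer around):

```python
import numpy as np

# Toy 1-D least squares kept as sufficient statistics.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

sxx, sxy = np.sum(X * X), np.sum(X * y)
w_full = sxy / sxx  # least-squares slope on all rows

# "Unlearn" row 2 by subtracting its contribution from the statistics.
i = 2
sxx -= X[i] * X[i]
sxy -= X[i] * y[i]
w_unlearned = sxy / sxx

# Retrain-from-scratch reference: fit on the data with row 2 removed.
Xr, yr = np.delete(X, i), np.delete(y, i)
w_retrained = np.sum(Xr * yr) / np.sum(Xr * Xr)

# Exact unlearning means these agree (up to floating-point error).
print(w_unlearned, w_retrained)
```

Without computing `w_retrained`, there is nothing to check the downdate against; the reference defines what successful deletion means.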
Week 7. Adversarial data interventions
- Biggio, Nelson, and Laskov (2012), Poisoning Attacks against Support Vector Machines
- Gu, Dolan-Gavitt, and Garg (2019), BadNets: Evaluating Backdooring Attacks on Deep Neural Networks
- Site activity: use Poison mode and compare the reference grid to the operator view.
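Poisoning is a row move made by an adversary: add or alter a row so that a targeted cell flips. A minimal sketch in which one mislabeled row flips a 1-NN prediction (clean model as the reference grid, poisoned model as what the operator sees; all numbers illustrative):

```python
import numpy as np

def predict_1nn(X, y, x):
    """1-nearest-neighbor prediction."""
    return y[np.argmin(np.abs(X - x))]

# Clean training rows and a targeted evaluation point.
X = np.array([0.0, 1.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])
x_target = 2.2

clean_pred = predict_1nn(X, y, x_target)  # reference grid's answer

# Poisoning as a row move: one mislabeled row placed next to the target.
X_poison = np.append(X, 2.1)
y_poison = np.append(y, 0)
poisoned_pred = predict_1nn(X_poison, y_poison, x_target)
print(clean_pred, poisoned_pred)
```

The reference-versus-operator comparison in the activity is exactly `clean_pred` versus `poisoned_pred`: same evaluation column, one adversarial row apart.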
Week 8. Provenance, corpus construction, and dependency
- Vincent et al. (2019), Measuring the Importance of User-Generated Content to Search Engines
- Dodge et al. (2021), Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
- Site activity: write one example where provenance changes which row is legally available.
Week 9. Data leverage and collective action
- Vincent et al. (2021), Data Leverage: A Framework for Empowering the Public in its Relationship with Technology Companies
- Hardt et al. (2023), Algorithmic Collective Action in Machine Learning
- Site activity: use Group LOO mode and write down the coalition move in grid notation.
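Group LOO is not just LOO summed over members: redundant rows can each look worthless alone while the coalition is decisive, which is why the coalition needs its own notation. A sketch with two redundant rows (toy data, 1-NN model as in earlier weeks):

```python
import numpy as np

def predict_1nn(X, y, x):
    return y[np.argmin(np.abs(X - x))]

def loss(X, y, x_eval, y_eval):
    return float(predict_1nn(X, y, x_eval) != y_eval)

# Two redundant rows (x=1.0 and x=1.1) both sit near the eval point.
X = np.array([1.0, 1.1, 5.0])
y = np.array([0, 0, 1])
x_eval, y_eval = 1.2, 1

base = loss(X, y, x_eval, y_eval)

def group_effect(rows):
    """Loss change from deleting the coalition `rows` all at once."""
    keep = [i for i in range(len(X)) if i not in rows]
    return loss(X[keep], y[keep], x_eval, y_eval) - base

# Individually each redundant row looks worthless; together they matter.
solo = group_effect({0}) + group_effect({1})
coalition = group_effect({0, 1})
print(solo, coalition)
```

Here the summed individual effects are zero, but removing the pair changes the loss: the leverage Vincent et al. and Hardt et al. study lives in that gap.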
Week 10. Why any of this might matter economically
- Arrieta-Ibarra et al. (2018), Should We Treat Data as Labor? Moving Beyond “Free”
- Jones and Tonetti (2020), Nonrivalry and the Economics of Data
- Site activity: split one object’s value into training-side, evaluation-side, and trust-side value.
Week 11. Evaluation data and secure holdouts
- Data Counterfactuals memo, When the Column Changes
- Srivastava et al. (2023), Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
- Site activity: use Eval value mode and compare a column move to a row move on the same selected cell.
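A row move perturbs the training side of a cell; a column move perturbs the evaluation side while the trained model stays fixed. A sketch of both on the same cell, with the same toy 1-NN setup as Week 1 (all numbers illustrative):

```python
import numpy as np

def predict_1nn(X, y, x):
    return y[np.argmin(np.abs(X - x))]

def cell_loss(X, y, x_eval, y_eval):
    return float(predict_1nn(X, y, x_eval) != y_eval)

X = np.array([0.0, 1.0, 2.0])
y = np.array([0, 0, 1])
x_eval, y_eval = 1.6, 1

base = cell_loss(X, y, x_eval, y_eval)

# Row move: delete the training row the cell depends on (x=2.0).
row_move = cell_loss(X[:2], y[:2], x_eval, y_eval) - base

# Column move: swap the evaluation point, same training data.
col_move = cell_loss(X, y, 0.4, 0) - base
print(base, row_move, col_move)
```

The asymmetry is the point: the row move requires retraining (here, re-selecting neighbors), while the column move only re-queries the fixed model, which is why holdout columns need separate protection.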
Week 12. Trust institutions and data rights
- Mitchell et al. (2019), Model Cards for Model Reporting
- Gebru et al. (2021), Datasheets for Datasets
- Site activity: define a rights state for A, B, C, and D: trainable, evaluable, both, reserved, or unavailable.