This syllabus is intentionally loose. The goal is to make the site usable as open course material without turning it into a full course platform. Each week pairs two readings with one simple site activity: read the relevant memo, use the grid or graph for a few minutes, and write down one counterfactual question that the formalism makes easier to ask.
Week 1. Local data counterfactuals first
- Koh and Liang (2017), Understanding Black-box Predictions via Influence Functions
- Ilyas et al. (2022), Datamodels: Understanding Predictions with Data and Data with Predictions
- Site activity: open the grid in Explore and LOO modes; compare one selected cell to its nearest row neighbor.
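To make the Explore-vs-LOO comparison concrete, treat each cell as a (training row, evaluation point) pair and compute the leave-one-out effect by retraining without that row. A minimal sketch with a 1-nearest-neighbor model; the data, model, and 0/1 loss are illustrative stand-ins, not the site's actual internals:

```python
import numpy as np

# Toy training rows (x, label) and one evaluation point.
X = np.array([0.0, 1.0, 2.0, 10.0])
y = np.array([0, 0, 1, 1])
x_eval, y_eval = 1.4, 1  # the evaluation column we inspect

def predict_1nn(X, y, x):
    """1-nearest-neighbor prediction: the simplest retrainable model."""
    return y[np.argmin(np.abs(X - x))]

def loss(X, y):
    """0/1 loss of the 1-NN model on the single evaluation point."""
    return float(predict_1nn(X, y, x_eval) != y_eval)

# LOO influence of row i = loss after deleting row i, minus full loss.
full_loss = loss(X, y)
loo_influence = [loss(np.delete(X, i), np.delete(y, i)) - full_loss
                 for i in range(len(X))]
print(loo_influence)  # row 1 is harmful: deleting it lowers the loss
```

Comparing a cell to its nearest row neighbor then amounts to comparing two entries of `loo_influence` for the same evaluation column.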
Week 2. From influence to contribution accounting
- Jia et al. (2019), Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms
- Ghorbani and Zou (2019), Data Shapley: Equitable Valuation of Data for Machine Learning
- Site activity: compare Shapley, Banzhaf, and Beta Shapley on the same evaluation column.
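Shapley and Banzhaf are both semivalues: weighted averages of a row's marginal contributions, differing only in the weight placed on coalition size. A sketch with exact subset enumeration on a made-up three-row utility table (Beta Shapley, from Kwon and Zou 2022, plugs a Beta-distribution weight into the same template; its exact weight formula is not reproduced here):

```python
from itertools import combinations
from math import comb

n = 3  # toy dataset: three rows

def utility(S):
    """Made-up utility of training on subset S: rows 0 and 1 are
    redundant with each other; row 2 adds a small unique gain."""
    table = {frozenset(): 0.0,
             frozenset({0}): 0.6, frozenset({1}): 0.6, frozenset({2}): 0.1,
             frozenset({0, 1}): 0.6, frozenset({0, 2}): 0.7,
             frozenset({1, 2}): 0.7, frozenset({0, 1, 2}): 0.7}
    return table[frozenset(S)]

def semivalue(i, size_weight):
    """Weighted average of i's marginal contributions over all subsets."""
    others = [j for j in range(n) if j != i]
    total = 0.0
    for s in range(n):
        for S in combinations(others, s):
            total += size_weight(s) * (utility(set(S) | {i}) - utility(S))
    return total

# Shapley and Banzhaf differ only in the size weight:
shapley = [semivalue(i, lambda s: 1 / (n * comb(n - 1, s))) for i in range(n)]
banzhaf = [semivalue(i, lambda s: 1 / 2 ** (n - 1)) for i in range(n)]
print(shapley, banzhaf)
```

Running both on the same utility (i.e. the same evaluation column) is exactly the comparison the activity asks for: same marginals, different weighting.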
Week 3. More data, less data, and the shape of scaling
- Kaplan et al. (2020), Scaling Laws for Neural Language Models
- Paul, Ganguli, and Dziugaite (2021), Deep Learning on a Data Diet: Finding Important Examples Early in Training
- Site activity: use Scaling mode and describe what is lost when many rows collapse into one curve.
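A scaling curve keeps only the pair (dataset size, loss); which rows were added or dropped is invisible. A sketch of the collapse, fitting a power law on synthetic points (the constants a=5, b=0.5 are made up):

```python
import numpy as np

# Synthetic "loss vs dataset size" points on an exact power law
# L(N) = a * N**(-b), with made-up constants a=5, b=0.5.
N = np.array([1e3, 1e4, 1e5, 1e6])
L = 5.0 * N ** -0.5

# On log-log axes a power law is a line: log L = log a - b * log N,
# so ordinary least squares recovers the exponent.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
b_hat, a_hat = -slope, np.exp(intercept)
print(b_hat, a_hat)
```

Every per-row counterfactual that Weeks 1 and 2 make visible has been averaged into the single exponent `b_hat`; that loss of resolution is what the activity asks you to describe.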
Week 4. Diagnosing datasets and synthesizing stand-ins
- Swayamdipta et al. (2020), Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
- Wang et al. (2018), Dataset Distillation
- Site activity: write one row move that removes data and one row move that replaces data.
Week 5. Privacy as a neighboring-world guarantee
- Dwork (2006), Differential Privacy
- Shokri et al. (2017), Membership Inference Attacks Against Machine Learning Models
- Site activity: use DP mode and inspect how one adjacent-row gap becomes a toy noise scale.
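The standard route from an adjacent-row gap to a noise scale is the Laplace mechanism: the worst-case change in the released statistic across neighboring datasets is the sensitivity, and the noise scale is sensitivity divided by epsilon. A sketch with a clipped mean (the site's toy may set this up differently):

```python
import numpy as np

rng = np.random.default_rng(0)

# Rows are values clipped to [0, 1]; the released statistic is the mean.
rows = np.array([0.2, 0.9, 0.4, 0.7])
n = len(rows)

# Replacing one clipped row moves the mean by at most 1/n: this
# worst-case adjacent-row gap is the query's sensitivity.
sensitivity = 1.0 / n

# Laplace mechanism: noise scale = sensitivity / epsilon.
epsilon = 0.5
scale = sensitivity / epsilon
noisy_mean = rows.mean() + rng.laplace(0.0, scale)
print(sensitivity, scale, noisy_mean)
```

Note the counterfactual flavor: the noise is calibrated not to the data you have, but to the gap between your dataset and its neighboring worlds.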
Week 6. Deletion, forgetting, and unlearning
- Ginart et al. (2019), Making AI Forget You: Data Deletion in Machine Learning
- Bourtoule et al. (2021), Machine Unlearning
- Site activity: use Unlearning mode and explain why the retrain reference matters.
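The retrain reference matters because "unlearned" is a claim about a counterfactual model: the one that never saw the row. A sketch of the easy case, where a least-squares fit kept as sufficient statistics admits an exact deletion downdate that provably matches retraining from scratch (deep networks do not enjoy this property; that gap is what Bourtoule et al. engineer around):

```python
import numpy as np

# Toy 1-D least squares kept as sufficient statistics.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

sxx, sxy = np.sum(X * X), np.sum(X * y)
w_full = sxy / sxx  # least-squares slope on all rows

# "Unlearn" row 2 by subtracting its contribution from the statistics.
i = 2
sxx -= X[i] * X[i]
sxy -= X[i] * y[i]
w_unlearned = sxy / sxx

# Retrain-from-scratch reference: fit on the data with row 2 removed.
Xr, yr = np.delete(X, i), np.delete(y, i)
w_retrained = np.sum(Xr * yr) / np.sum(Xr * Xr)

# Exact unlearning means these agree (up to floating-point error).
print(w_unlearned, w_retrained)
```

Without computing `w_retrained`, there is nothing to check the downdate against; the reference defines what successful deletion means.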
Week 7. Adversarial data interventions
- Biggio, Nelson, and Laskov (2012), Poisoning Attacks against Support Vector Machines
- Gu, Dolan-Gavitt, and Garg (2019), BadNets: Evaluating Backdooring Attacks on Deep Neural Networks
- Site activity: use Poison mode and compare the reference grid to the operator view.
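Poisoning is a row move made by an adversary: add or alter a row so that a targeted cell flips. A minimal sketch in which one mislabeled row flips a 1-NN prediction (clean model as the reference grid, poisoned model as what the operator sees; all numbers illustrative):

```python
import numpy as np

def predict_1nn(X, y, x):
    """1-nearest-neighbor prediction."""
    return y[np.argmin(np.abs(X - x))]

# Clean training rows and a targeted evaluation point.
X = np.array([0.0, 1.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])
x_target = 2.2

clean_pred = predict_1nn(X, y, x_target)  # reference grid's answer

# Poisoning as a row move: one mislabeled row placed next to the target.
X_poison = np.append(X, 2.1)
y_poison = np.append(y, 0)
poisoned_pred = predict_1nn(X_poison, y_poison, x_target)
print(clean_pred, poisoned_pred)
```

The reference-versus-operator comparison in the activity is exactly `clean_pred` versus `poisoned_pred`: same evaluation column, one adversarial row apart.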
Week 8. Provenance, corpus construction, and dependency
- Vincent et al. (2019), Measuring the Importance of User-Generated Content to Search Engines
- Dodge et al. (2021), Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
- Site activity: write one example where provenance changes which row is legally available.
Week 9. Data leverage and collective action
- Vincent et al. (2021), Data Leverage: A Framework for Empowering the Public in its Relationship with Technology Companies
- Hardt et al. (2023), Algorithmic Collective Action in Machine Learning
- Site activity: use Group LOO mode and write down the coalition move in grid notation.
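Group LOO is not just LOO summed over members: redundant rows can each look worthless alone while the coalition is decisive, which is why the coalition needs its own notation. A sketch with two redundant rows (toy data, 1-NN model as in earlier weeks):

```python
import numpy as np

def predict_1nn(X, y, x):
    return y[np.argmin(np.abs(X - x))]

def loss(X, y, x_eval, y_eval):
    return float(predict_1nn(X, y, x_eval) != y_eval)

# Two redundant rows (x=1.0 and x=1.1) both sit near the eval point.
X = np.array([1.0, 1.1, 5.0])
y = np.array([0, 0, 1])
x_eval, y_eval = 1.2, 1

base = loss(X, y, x_eval, y_eval)

def group_effect(rows):
    """Loss change from deleting the coalition `rows` all at once."""
    keep = [i for i in range(len(X)) if i not in rows]
    return loss(X[keep], y[keep], x_eval, y_eval) - base

# Individually each redundant row looks worthless; together they matter.
solo = group_effect({0}) + group_effect({1})
coalition = group_effect({0, 1})
print(solo, coalition)
```

Here the summed individual effects are zero, but removing the pair changes the loss: the leverage Vincent et al. and Hardt et al. study lives in that gap.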
Week 10. Why any of this might matter economically
- Arrieta-Ibarra et al. (2018), Should We Treat Data as Labor? Moving Beyond “Free”
- Jones and Tonetti (2020), Nonrivalry and the Economics of Data
- Site activity: split one object’s value into training-side, evaluation-side, and trust-side value.
Week 11. Evaluation data and secure holdouts
- Data Counterfactuals memo, When the Column Changes
- Srivastava et al. (2023), Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
- Site activity: use Eval value mode and compare a column move to a row move on the same selected cell.
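A row move perturbs the training side of a cell; a column move perturbs the evaluation side while the trained model stays fixed. A sketch of both on the same cell, with the same toy 1-NN setup as Week 1 (all numbers illustrative):

```python
import numpy as np

def predict_1nn(X, y, x):
    return y[np.argmin(np.abs(X - x))]

def cell_loss(X, y, x_eval, y_eval):
    return float(predict_1nn(X, y, x_eval) != y_eval)

X = np.array([0.0, 1.0, 2.0])
y = np.array([0, 0, 1])
x_eval, y_eval = 1.6, 1

base = cell_loss(X, y, x_eval, y_eval)

# Row move: delete the training row the cell depends on (x=2.0).
row_move = cell_loss(X[:2], y[:2], x_eval, y_eval) - base

# Column move: swap the evaluation point, same training data.
col_move = cell_loss(X, y, 0.4, 0) - base
print(base, row_move, col_move)
```

The asymmetry is the point: the row move requires retraining (here, re-selecting neighbors), while the column move only re-queries the fixed model, which is why holdout columns need separate protection.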
Week 12. Trust institutions and data rights
- Mitchell et al. (2019), Model Cards for Model Reporting
- Gebru et al. (2021), Datasheets for Datasets
- Site activity: define a rights state for A, B, C, and D: trainable, evaluable, both, reserved, or unavailable.