I believe this project is so promising that I applied to SPAR to volunteer to help directly.
Four months' salary to write a paper on how incentives for performative prediction can be eliminated through the joint evaluation of multiple predictors.
Performative prediction refers to predictions where the act of making them affects their own outcome, such as through actions taken by others in response. This introduces an incentive for a predictor to use its predictions to steer the world toward more predictable outcomes. If powerful AI systems develop the goal of maximizing predictive accuracy, whether incidentally or by design, this incentive for manipulation could prove catastrophic.
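The incentive can be made concrete with a small numerical sketch. The setup below is hypothetical and chosen purely for illustration: a predictor is paid with the quadratic (Brier) scoring rule, which is proper, so for a *fixed* outcome distribution honest reporting maximizes expected score, but the report itself shifts the outcome probability via a toy response function `g(r) = 0.6 + 0.3 * r`. Performativity breaks the properness guarantee: the score-maximizing report is more extreme than the honest fixed point.

```python
# Illustrative sketch (toy setup, not from the paper): a single predictor
# scored with the proper quadratic (Brier) rule in a performative setting,
# where the reported probability r shifts the outcome probability to g(r).

def quadratic_score(r, y):
    """Quadratic score for reporting P(y=1) = r when outcome is y; higher is better."""
    return -(r - y) ** 2

def outcome_prob(r):
    """Toy performative response: the report itself shifts the outcome."""
    return 0.6 + 0.3 * r

def expected_score(r):
    g = outcome_prob(r)
    return g * quadratic_score(r, 1) + (1 - g) * quadratic_score(r, 0)

# An "honest" report is a fixed point r = g(r); here r* = 0.6 / 0.7 ~ 0.857.
fixed_point = 0.6 / 0.7

# The score-maximizing report, found by grid search, is more extreme:
grid = [i / 1000 for i in range(1001)]
best_report = max(grid, key=expected_score)

print(f"honest fixed point: {fixed_point:.3f}")       # ~0.857
print(f"score-maximizing report: {best_report:.3f}")  # 1.000
```

Here overclaiming (reporting 1.0) pushes the world toward the predicted outcome (`g(1.0) = 0.9`) and earns a strictly higher expected score than the honest fixed-point report.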
Initial results indicate that a system of two or more predictors can be jointly evaluated in a way that removes the incentive to manipulate the outcome, in contrast with previous impossibility results for the case of a single predictor. This project aims to extend these initial results, ensure their robustness, and produce empirical evidence that models can be trained to act in the desired way across a wide range of scenarios.
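A minimal sketch of how joint evaluation can cancel the manipulation incentive, under a simplifying assumption of mine (not necessarily the paper's actual setup): the world responds to the average of the two reports, and each predictor is paid its proper score *minus* the other's. The evaluation is then zero-sum, so any manipulation of the outcome that raises both predictors' scores equally cancels out of each predictor's payoff.

```python
# Illustrative sketch of zero-sum joint evaluation of two predictors.
# ASSUMPTION (mine, for illustration): the world responds to the average
# report, P(y=1) = g((r1 + r2) / 2), with the same toy g as the
# single-predictor case. Each predictor's payoff is its quadratic score
# minus the other's, so payoffs sum to zero.

def quadratic_score(r, y):
    return -(r - y) ** 2

def outcome_prob(mean_report):
    return 0.6 + 0.3 * mean_report

def zero_sum_payoff(r1, r2):
    """Predictor 1's expected payoff: E[score(r1, y) - score(r2, y)]."""
    g = outcome_prob((r1 + r2) / 2)
    s1 = g * quadratic_score(r1, 1) + (1 - g) * quadratic_score(r1, 0)
    s2 = g * quadratic_score(r2, 1) + (1 - g) * quadratic_score(r2, 0)
    return s1 - s2

# Honest fixed point of g: r* solves r = 0.6 + 0.3 * r.
honest = 0.6 / 0.7

# Given the other predictor reports honestly, predictor 1's best response
# (grid search) is also the honest report -- the manipulation incentive
# from the single-predictor case is gone.
grid = [i / 1000 for i in range(1001)]
best_response = max(grid, key=lambda r1: zero_sum_payoff(r1, honest))

print(f"honest report: {honest:.3f}")        # ~0.857
print(f"best response: {best_response:.3f}") # ~0.857
```

Intuitively, at a symmetric profile any performative shift of the outcome affects both scores identically, so only the non-performative (properness) term remains in each predictor's incentive.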
Concretely, the goal of this project is to produce a paper containing theoretical and empirical results demonstrating that the incentive for performative prediction can be eliminated by jointly evaluating multiple predictors. Ideally, such a paper could get published at a top ML conference.
More broadly, the goal of this project is to create a technique that improves the safety of powerful predictive models, and to disseminate information about it to the research community. By doing so, this project would reduce the risk posed by predictive models and increase the chance that leading AI companies focus on predictive models over more dangerous general systems.
Specific components of this paper include:
Comparing the behavior of a single model to the behavior of a jointly evaluated pair of models in an environment where performative prediction is possible
Building a theoretical model proving which conditions are necessary to avoid performative prediction when predictors have access to different information
Running experiments to test predictor behavior under incomplete information, including both the above theoretical model and setups without a closed-form solution
Extending the results to prediction and decision markets, including training models to be market makers
Exploring other possibilities opened up by jointly evaluating predictors, such as eliciting honest reports on predictor uncertainty
The funding will pay for four months of my salary, out of which I will cover the costs of running experiments and office space.
I would be the only researcher receiving funding from this project. However, I may collaborate with Johannes Treutlein, a PhD student at UC Berkeley. We have previously worked together on two related papers, Conditioning Predictive Models: Risks and Strategies and Incentivizing honest performative predictions with proper scoring rules. We have also written well-received Alignment Forum posts on the underspecification of Oracle AI and on the initial results for this project.
It is possible that I will be mentoring undergraduate or junior AI safety researchers while working on this project, in which case I could involve them in running experiments.
The best failure mode would be conclusive negative results, in which case I could publicize them and share the lessons learned from the process. A more likely failure scenario is inconclusive results, where the system cannot be shown to work, but the possibility remains open that it could under a different setup. These failure modes could result from the theory being mathematically intractable, from experimental results contradicting the theory, or from me as a researcher missing possible solutions to problems that arise.
I currently have an application under evaluation at the Long-Term Future Fund (LTFF) to fund this project for three months. Between Manifund and the LTFF, I would not take more than four months of funding, as I believe that should be sufficient to finish the project.
I am excited about more work along the lines of the existing "Incentivizing honest performative predictions with proper scoring rules" paper. I think that there are serious safety problems surrounding predictors that select their predictions to influence the world in such a way as to make those predictions true ("self-fulfilling prophecies") and I am excited about this work as a way to discover mechanisms for dealing with those sorts of problems. "Conditioning Predictive Models" discusses these sorts of issues in more detail. Rubi is a great person to work on this as he was an author on both of those papers.
I think my main reservations here are just around Rubi's opportunity costs, though I think this is reasonably exciting work and I trust Rubi to make a good judgement about what he should be spending his time working on. The most likely failure mode would probably be that the additional work doesn't turn up anything new or interesting that wasn't already surfaced in the "Incentivizing honest performative predictions with proper scoring rules" paper.
I think that $33k is a reasonable amount given the timeframe and work.
Rubi was a previous mentee of mine in SERI MATS and a coauthor of mine on "Conditioning Predictive Models."
I have worked with Rubi on performative prediction in the past and I think he would be great at this! I think testing zero-sum training empirically would be a good next step. Rubi has some ideas for experiments that I find interesting and that I'd be happy to collaborate on.