Superforecaster predictions of long-term impacts

David Rhys Bernard

$2,999 raised

$2,000 valuation

Project description

I will pay 30 forecasters with excellent track records on Good Judgment Open, Metaculus and Manifold (superforecasters) to make forecasts of long-run treatment effects from randomised controlled trials. This will allow me to provide new evidence on two critical questions in the forecasting space: (1) how does forecast accuracy decay with time horizon? and (2) when are (super)forecasters better than domain experts?

Why forecasts of long-term treatment effects from randomised controlled trials (RCTs)? Firstly, most research on forecasting is about ‘state’ forecasts: predictions of what the world will look like in the future. More relevant for those seeking to improve the world are ‘impact’ (or causal) forecasts: predictions of the difference between what would happen if we take action X and what would happen if we did not. The treatment effects of RCTs are causal impacts, and by collecting forecasts of them I contribute to this understudied area of forecasting.

Secondly, using RCTs allows us to resolve long-run forecasts more quickly. I will collect forecasts for the 5-10 year results from 7 different RCTs. These RCTs are already underway and the long-run results will be available to me in spring 2023 so I will be able to resolve the long-run forecasts soon. However, the only information that is available about the RCTs is a short-run set of results, typically observed 2 years after each RCT started. As such, if the long-run results are from year 10, the long-run forecast of these results approximates an 8-year forecast but resolves much more quickly. It is not possible for the forecasters to know anything about what happened in each RCT between years 2 and 10, so the forecast is a real long-run forecast.

Why care about question (1) how does forecast accuracy decay with time horizon? Firstly, it’s important to know how much we can trust long-range forecasts in a variety of domains when we’re making policies and decisions with long-run impacts. Secondly, a common objection to longtermism is that the effects of our actions on the long-term future are essentially impossible to predict. On this view, despite the huge potential value in the future, extreme uncertainty around long-term impacts means that the expected value of our options is mostly determined by their short-run impacts. However, there is limited empirical evidence on this question, and my study will generate relevant and important information for this crucial consideration.

Why care about question (2) when are (super)forecasters better than domain experts? Tetlock’s research shows that in geopolitics, teams of superforecasters are better than other prediction mechanisms, such as domain experts. However, there are very few studies explicitly comparing experts to forecasters without domain expertise, across any domain. In other important areas, such as economics, we might expect the greater domain knowledge of experts to better compensate for their lack of experience in forecasting. In general, we need more research in a variety of domains to understand how much we should trust domain experts versus forecasters. I already have many forecasts from academic economists and development practitioners with domain expertise, so I just need forecasts from superforecasters to be able to make this comparison.

For existing research on question (1) see: Charles Dillon, Data on forecasting accuracy across different time horizons and levels of forecaster experience; Niplav, Range and Forecasting Accuracy; Javier Prieto, How accurate are Open Phil’s predictions?; Luke Muehlhauser, How Feasible Is Long-range Forecasting? For existing research on question (2) see: Gavin Leech & Misha Yagudin, Comparing top forecasters and domain experts.

What is your track record on similar projects?

I am already running surveys on the Social Science Prediction Platform (https://socialscienceprediction.org/predict/) and receiving forecasts from academics, practitioners and laypeople (the latter recruited via Prolific). The surveys have been well received, with one respondent, a professor of economics at Stanford, saying it was “cool research” and a “really interesting idea”. The superforecasters will be able to take these same surveys, so no additional work will be required to design and create new surveys.

This project is part of my PhD in which I have a similar research project on how to use supervised machine learning to estimate long-term treatment effects in cases where we don’t have data on long-term outcomes. For this other project, I won the best paper prize at the Global Priorities Institute’s Early Career Conference Program and presented at EA Global 2019. This demonstrates my ability to make useful empirical and methodological contributions to forecasting and global priorities research.

Paper - https://globalprioritiesinstitute.org/david-rhys-bernard-estimating-long-term-treatment-effects-without-long-term-outcome-data/
EAG presentation - https://www.youtube.com/watch?v=mOufR9vFO_U
Presentation transcript - https://www.effectivealtruism.org/articles/david-rhys-bernard-estimating-long-term-effects-without-long-term-data/

How will you spend your funding?

I will pay 30 superforecasters $50 per hour to make forecasts on this project. I expect completing 7 surveys to take around 2 hours, so the total cost will be $3,000 (30 forecasters, for 2 hours at $50 per hour). I am not asking for any money for support of living costs or personal expenses as part of this funding. I will recruit forecasters by reaching out to Good Judgment, Metaculus, the Forecasting Research Institute, and personal connections.

If you have a strong track record on Good Judgment Open, Metaculus or Manifold and are interested in making forecasts for this project, please get in touch: david.rhys.bernard@gmail.com

holds 0.0333%

David Rhys Bernard

almost 2 years ago

1. How much money have you spent so far? Have you gotten more funding from other sources? Do you need more funding?

I spent $1,100 out of the $3,000 I received on paying experienced forecasters. I haven't received any additional funding since I received the money from Manifund (although I already had funding from the EA Long-Term Future Fund for the other parts of the research project). I do not need more funding.

2. How is the project going? (a few paragraphs)

I completed the research project and wrote up my results here (https://davidrhysbernard.files.wordpress.com/2023/08/forecasting_drb_230825.pdf). This is included as a chapter of my PhD dissertation which I will defend in two weeks!

As I was already running the study with academics and lay-people, the parts of the paper that should be counted for credit in the impact evaluation are the ones which involve analysis of expert forecasters. In particular, these are sections 4.1.1 and 4.1.2, including Table 1 Panel B, Table 2, Figure 3 and Figure 4. You may have to read the intro or sections 2.2 and 3 to get sufficient context on the study to understand the results.

In section 4.1.1 (Forecaster type) I show that my sample of forecasters perform better at forecasting short and long-run treatment effects than academics (recruited from the Social Science Prediction Platform) and lay-people (recruited from Prolific). In particular, I show that regardless of the accuracy metric used, both academics and forecasters are statistically significantly better than laypeople. If I use my preferred log score accuracy metric (which relies on the full distribution given), forecasters are better than academics. However, if I use a negative absolute error metric (which only relies on the central point of the distribution), there is no significant difference between forecasters and academics. This suggests that the forecasters are better at forecasting a range of likely treatment effects, but no better at specifying the most likely effect within that range.
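To illustrate why the two metrics can disagree, here is a minimal sketch. It is not the code used in the paper: it assumes each forecast is summarised as a normal distribution with a mean and standard deviation, and the function names and all numbers are made up for illustration.

```python
import math


def log_score(mean, sd, outcome):
    """Log of the normal density at the realised outcome (higher is better).

    Assumption: the forecast is a normal distribution N(mean, sd^2).
    """
    z = (outcome - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2 * math.pi))


def neg_abs_error(mean, outcome):
    """Negative absolute error of the central point only (higher is better)."""
    return -abs(mean - outcome)


# Hypothetical realised treatment effect and two hypothetical forecasts
# that share a central point but differ in how wide a range they give.
outcome = 0.10
narrow = {"mean": 0.30, "sd": 0.05}
wide = {"mean": 0.30, "sd": 0.20}

for name, f in [("narrow", narrow), ("wide", wide)]:
    print(
        name,
        "log score:", round(log_score(f["mean"], f["sd"], outcome), 3),
        "neg. abs. error:", round(neg_abs_error(f["mean"], outcome), 3),
    )
```

Both hypothetical forecasts have the same negative absolute error because their central points match, but the wider one gets a higher log score because it places more probability near the realised effect. This is the pattern a group would show if it were better at specifying a range than at picking the central point.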

In section 4.1.2 (Calibration) I show that although all groups are poorly calibrated and overconfident, the forecaster group is better calibrated than the other two. I show this with a calibration curve in Figure 3 and a comparison of log scores and stated confidence levels in Figure 4. Better calibration seems to be a key part of forecasters having higher accuracy in this context.
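For readers unfamiliar with calibration curves, the following is a rough sketch of the underlying calculation, not the paper's implementation. It assumes each forecast is recorded as a stated confidence level for an interval plus whether the realised effect fell inside that interval; the function name and the data are hypothetical.

```python
from collections import defaultdict


def calibration_table(forecasts):
    """Map each stated confidence level to the observed hit rate.

    forecasts: iterable of (stated_confidence, hit) pairs, where hit is True
    if the realised outcome fell inside the forecaster's interval.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for confidence, hit in forecasts:
        totals[confidence] += 1
        hits[confidence] += int(hit)
    return {c: hits[c] / totals[c] for c in sorted(totals)}


# Made-up data: four 50% intervals and four 90% intervals.
example = [
    (0.5, True), (0.5, False), (0.5, False), (0.5, False),
    (0.9, True), (0.9, True), (0.9, False), (0.9, False),
]

for stated, observed in calibration_table(example).items():
    print(f"stated {stated:.0%} -> observed {observed:.0%}")
```

A perfectly calibrated group would have observed hit rates equal to the stated levels. In this made-up example the observed rates fall below the stated ones, which is what overconfidence looks like on a calibration curve.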

3. How well has your project gone compared to where you expected it to be at this point? (Score from 1-10, 10 = Better than expected)

I'd give the project a 6/10. I failed to reach my target of 30 forecasters and was significantly overconfident about how likely I was to reach that number. I underestimated how busy and in demand most forecasters are and overestimated how interested they would be in my project. I heard that $50 per hour was within the range of expected compensation for at least one forecaster, but it turned out this was not sufficient for many others. In the future, I'd plan to pay superforecasters at least $100 per hour of their time and give them a longer period over which they can make forecasts.

Despite this limited sample size, I still ended up being well-powered enough to find meaningful differences between academics and forecasters. Collecting impact forecasts (forecasts of causal treatment effects from randomised controlled trials) from experienced forecasters is already a novel contribution, since almost all previous forecasting research has been on state forecasts. Showing that forecasters outperform academics in this new context, and that this outperformance depends on the accuracy metric used, are also both useful contributions.

4. Are there any remaining ways you need help, besides more funding?

I've completed the project so I do not need any more help immediately. I'm presenting the results at a Forecasting in the Social Sciences Workshop at UC Berkeley in October. Depending on the feedback I get there, I will decide whether or not to proceed with publication. As I have now left academia (and started at Rethink Priorities), publishing this is not a top priority for me, but if someone were interested in further improving the data analysis and writing, and in submitting the paper as a co-author, I'd definitely be open to the possibility and keen to chat.

5. Any other thoughts or feedback?

The Manifund process was very smooth and easy. I want to express my gratitude to all the people who bought shares in this project, and the Manifund and ACX team for setting this up.

Brian T. Edwards

over 2 years ago

David, this is a great project! Glad to see it is getting lots of bids. Assuming my project gets funded too I would love to interview you. https://manifund.org/projects/crystal-ballin-podcast

holds 0.0333%

David Rhys Bernard

over 2 years ago

Hi Domenic. If I recall correctly, one of them said that amount was the lower bound of what they'd expect, but I didn't systematically ask the people I spoke to.

holds 0%

Domenic Denicola

over 2 years ago

"I've spoken to a few superforecasters already and they said they would be happy to participate if I could compensate them appropriately"

Did they specifically confirm that $50/hour is appropriate compensation?

holds 0.0333%

David Rhys Bernard

over 2 years ago

Hi Austin, thanks for the questions!

Yep, I am running the experiment with academics and domain experts at the moment. I started with them for two reasons. Firstly, academics are currently seen as the experts on these sorts of topics and they are the ones who provide policy advice, so their priors are the ones that matter more in an action-relevant sense. Of course, whether this should be the case is up for debate and I hope to provide some evidence here. Secondly and more practically, academic economists tend to care more about what other academic economists think than about what uncredentialed superforecasters think, so to improve the paper's chances in economics journals, I made academics my initial focus.

I've spoken to a few superforecasters already and they said they would be happy to participate if I could compensate them appropriately, so if I use them + their network, I'm 75% sure I'd be able to get 30 superforecasters conditional on receiving funding, 10% if not. From chatting to the folks at the Social Science Prediction Platform, they think 15-20 domain expert forecasters tend to be sufficient for getting a reasonable forecast of an average treatment effect. The additional analysis comparing different types of forecasters requires a larger sample, so I would worry about being underpowered if I had many fewer than 30.

Austin Chen

over 2 years ago

Hey David, thanks for this proposal -- I loved the in-depth explainer, and the fact that the experiment setup allows us to learn about the results of long-term predictions but on a very short timeframe.

Some questions:

Am I correct in understanding that you're already running this exact experiment, just with non-superforecasters instead of superforecasters? If so, what was the reasoning for starting with them over superforecasters in the first place?
How easily do you expect to be able to recruit 30 superforecasters to participate? If you end up running this experiment with fewer (either due to funding or recruiting constraints), how valid would the results be?