I will pay 30 forecasters with excellent track records on Good Judgement Open, Metaculus and Manifold (superforecasters) to make forecasts of long-run treatment effects from randomised controlled trials. This will allow me to provide new evidence on two critical questions in forecasting: (1) how does forecast accuracy decay with time horizon, and (2) when are (super)forecasters better than domain experts?
Why forecasts of long-term treatment effects from randomised controlled trials (RCTs)? Firstly, most research on forecasting concerns ‘state’ forecasts: what the world will look like in the future. More relevant for those seeking to improve the world are ‘impact’ (or causal) forecasts: the difference between what would happen if we took action X and what would happen if we did not. The treatment effects of RCTs are causal impacts, so by collecting forecasts of them I contribute to this understudied area of forecasting.
Secondly, using RCTs allows long-run forecasts to be resolved more quickly. I will collect forecasts of the 5-10 year results from 7 different RCTs. These RCTs are already underway and the long-run results will be available to me in spring 2023, so I will be able to resolve the long-run forecasts soon. However, the only information available about each RCT is a set of short-run results, typically observed 2 years after the trial started. As such, if the long-run results are measured at year 10, a forecast of those results approximates an 8-year forecast but resolves much sooner. Forecasters cannot know anything about what happened in each RCT between years 2 and 10, so the forecast is a genuine long-run forecast.
Why care about question (1), how forecast accuracy decays with time horizon? Firstly, it is important to know how much we can trust long-range forecasts in a variety of domains when we make policies and decisions with long-run impacts. Secondly, a common objection to longtermism is that the effects of our actions on the long-term future are essentially impossible to predict. Thus, despite the huge potential value in the future, extreme uncertainty around long-term impacts means that the expected value of our options is mostly determined by their short-run impacts. However, there is limited empirical evidence on this question, and my study will generate directly relevant evidence for this crucial consideration.
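To make the question (1) analysis concrete, here is a minimal sketch of how decay could be estimated once the long-run results arrive. The numbers and column names below are purely illustrative (they are not from the actual surveys): score each forecast against the realised treatment effect and relate the error to the forecast horizon.

```python
# Illustrative only: estimate how forecast error grows with forecast horizon.
# The data below are made up; real data would come from the survey platform.
import pandas as pd
import statsmodels.formula.api as smf

forecasts = pd.DataFrame({
    "forecast": [0.10, 0.25, 0.05, 0.30, 0.15, 0.20],   # forecast treatment effect (standardised)
    "realised": [0.12, 0.10, 0.02, 0.18, 0.20, 0.05],   # realised long-run treatment effect
    "horizon_years": [3, 8, 5, 8, 3, 5],                 # years between short-run results and long-run outcome
})

# Absolute error as a simple accuracy measure.
forecasts["abs_error"] = (forecasts["forecast"] - forecasts["realised"]).abs()

# The slope from a linear fit of error on horizon is a first-pass estimate of
# decay: the average increase in error per additional year of horizon.
fit = smf.ols("abs_error ~ horizon_years", data=forecasts).fit()
print(fit.params)
```

In the real analysis the errors would need to be normalised so they are comparable across outcomes, and standard errors clustered by forecaster and by RCT, but this is the basic shape of the estimate.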
Why care about question (2), when are (super)forecasters better than domain experts? Tetlock’s research shows that in geopolitics, teams of superforecasters beat other prediction mechanisms, including domain experts. However, there are very few studies in any domain that explicitly compare domain experts with forecasters who lack domain expertise. In other important areas, such as economics, we might expect experts’ greater domain knowledge to do more to compensate for their lack of forecasting experience. In general, we need more research across a variety of domains to understand how much we should trust domain experts versus forecasters. I already have many forecasts from academic economists and development practitioners with domain expertise, so I only need forecasts from superforecasters to make this comparison.
For existing research on question (1) see: Charles Dillon, Data on forecasting accuracy across different time horizons and levels of forecaster experience; Niplav, Range and Forecasting Accuracy; Javier Prieto, How accurate are Open Phil’s predictions?; Luke Muehlhauser, How Feasible Is Long-Range Forecasting? For existing research on question (2) see: Gavin Leech & Misha Yagudin, Comparing top forecasters and domain experts.
I am already running surveys on the Social Science Prediction Platform (https://socialscienceprediction.org/predict/) and receiving forecasts from academics, practitioners and laypeople (the latter recruited via Prolific). The surveys have been well received, with one respondent, a professor of economics at Stanford, calling it “cool research” and a “really interesting idea”. The superforecasters will be able to take these same surveys, so no additional work will be required to design and create new surveys.
This project is part of my PhD, which also includes a related project on using supervised machine learning to estimate long-term treatment effects when we don’t have data on long-term outcomes. For that project, I won the best paper prize at the Global Priorities Institute’s Early Career Conference Program and presented at EA Global 2019. This demonstrates my ability to make useful empirical and methodological contributions to forecasting and global priorities research.
Paper - https://globalprioritiesinstitute.org/david-rhys-bernard-estimating-long-term-treatment-effects-without-long-term-outcome-data/
EAG presentation - https://www.youtube.com/watch?v=mOufR9vFO_U
Presentation transcript - https://www.effectivealtruism.org/articles/david-rhys-bernard-estimating-long-term-effects-without-long-term-data/
I will pay 30 superforecasters $50 per hour to make forecasts for this project. I expect completing the 7 surveys to take around 2 hours, so the total cost will be $3,000 (30 forecasters × 2 hours × $50 per hour). I am not asking for any money to cover living costs or personal expenses as part of this funding. I will recruit forecasters by reaching out to Good Judgement, Metaculus, the Forecasting Research Institute, and personal connections.
If you have a strong track record on Good Judgement Open, Metaculus or Manifold and are interested in making forecasts for this project, please get in touch: david.rhys.bernard@gmail.com
David, this is a great project! Glad to see it is getting lots of bids. Assuming my project gets funded too, I would love to interview you. https://manifund.org/projects/crystal-ballin-podcast
Hi Domenic. If I recall correctly, one of them said that amount was the lower bound of what they'd expect, but I didn't systematically ask the people I spoke to.
I've spoken to a few superforecasters already and they said they would be happy to participate if I could compensate them appropriately
Did they specifically confirm that $50/hour is appropriate compensation?
Hi Austin, thanks for the questions!
Yep, I am running the experiment with academics and domain experts at the moment. I started with them for two reasons. Firstly, academics are currently seen as the experts on these sorts of topics and they are the ones who provide policy advice, so their priors are the ones that matter more in an action-relevant sense. Of course, whether this should be the case is up for debate, and I hope to provide some evidence here. Secondly, and more practically, academic economists tend to care more about what other academic economists think than about what uncredentialed superforecasters think, so to improve the paper's chances in economics journals I made academics my initial focus.
I've spoken to a few superforecasters already and they said they would be happy to participate if I could compensate them appropriately. If I use them and their network, I'm 75% sure I'd be able to get 30 superforecasters conditional on receiving funding; without them, more like 10%. From chatting with the folks at the Social Science Prediction Platform, my understanding is that 15-20 domain expert forecasters tend to be sufficient for a reasonable forecast of an average treatment effect. The additional analysis comparing different types of forecasters requires a larger sample, though, so I would worry about being underpowered with many fewer than 30.
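For intuition, here is a rough sketch of the power concern under a simplified design. It treats the comparison as a two-sample test of mean forecast errors; the group sizes, 80% power and 5% significance level are conventional assumptions rather than the final analysis plan.

```python
# Rough illustration of the sample-size worry: smallest standardised
# difference in mean forecast error detectable at 80% power, alpha = 0.05,
# for different numbers of superforecasters compared with ~20 domain experts.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n_super in (30, 20, 15):
    detectable = analysis.solve_power(
        effect_size=None,      # solve for the minimum detectable effect size
        nobs1=n_super,         # superforecaster sample size (assumption)
        ratio=20 / n_super,    # roughly 20 domain experts in the comparison group
        alpha=0.05,
        power=0.8,
    )
    print(f"n_super={n_super}: minimum detectable Cohen's d ~ {detectable:.2f}")
```

Even in this simplified setup only fairly large differences between the groups are detectable, which is why I would be reluctant to go far below 30.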
Hey David, thanks for this proposal -- I loved the in-depth explainer, and the fact that the experiment setup allows us to learn about how long-term predictions turn out, but on a very short timeframe.
Some questions:
Am I correct in understanding that you're already running this exact experiment, just with non-superforecasters instead of superforecasters? If so, what was the reasoning for starting with them over superforecasters in the first place?
How easily do you expect to be able to recruit 30 superforecasters to participate? If you end up running this experiment with fewer (due to either funding or recruiting constraints), how valid would the results be?