The underlying statistical theory:
I want to test whether communication such as deliberation (e.g. via the comment section) can improve aggregate forecasting accuracy.
Current evidence shows that communication can reliably improve *individual* accuracy, but it suggests that the impact of communication between contributors on aggregate "crowd" accuracy (the quantity that matters for platforms such as Manifold Markets or Metaculus) is unreliable.
However, I believe the methods in this prior research generated social dynamics that were primarily driven by numeric anchoring effects, not information exchange.
I will obtain RCT evidence on the effects of deliberation using a platform and sample population that should generate "real" deliberation i.e. dynamics driven by information exchange and not just numeric anchoring.
To develop this research I will seek input from existing platforms to identify the practical questions most relevant to forecast aggregators.
I have successfully collected estimation and forecasting data using my own custom-built platforms by recruiting participants from web sources such as Amazon Mechanical Turk. I have been conducting this work since 2014. I have been building web projects since about 1998.
My research has advanced basic theory about the effect of communication on the 'wisdom of the crowd.' The practical relevance of my prior work is demonstrated by inclusion in popular press such as the Harvard Business Review "Guide to Critical Thinking."
Not every experiment I tried has been successful: showing that communication is unreliable has proven straightforward, but showing how it can be reliably beneficial has proven difficult.
I am optimistic about this project because it takes advantage of recent theoretical advances to test for the presence of meaningful deliberation, looking for signatures that should appear only when social influence is driven primarily by numeric anchoring. The analysis is designed such that even 'null' results will be informative.
If funded, I will seek guidance on design and recruitment.
I will use the funding to provide prize money for a forecasting competition using a platform that enables randomized controlled experiments. This amount is consistent with the recommendation of an experienced practitioner.
Your support will allow me to supplement my ongoing laboratory research with "real world" data that will make the results more directly applicable to practice.
I currently have institutional funding to pursue this project with participants from Amazon Mechanical Turk (MTurk). This funding will allow me to develop the platform and collect data sufficient for demonstrating basic theoretical principles of group behavior.
However, my current plan is limited to a few basic scenarios and will not test the types of features likely to appear in a web platform (e.g. a comments section), which can introduce large variations in dynamics depending on the design. Moreover, if the MTurk experiment fails, we won't know if it's because deliberation is inherently unreliable or because crowd workers just aren't engaged in the task.
Your support will allow me to (1) examine questions specifically relevant to online crowdsourced forecasting with (2) a population representing users of online forecasting platforms.
Great comments and questions by Austin! I'll take them in turn.
On the obviousness of comments being helpful: I agree, it's so intuitively compelling that communication between forecasters should improve accuracy!
However, previous controlled experiments using chatrooms have shown that communication can reliably improve individual accuracy, but that it cannot reliably improve aggregate accuracy!
It turns out (according to current evidence) that the effect of communication is driven by statistical effects that emerge from the initial estimate distribution. In other words, it's hit-or-miss whether it actually helps. In a related example (not comments, but communication generally) the platform Estimize.com stopped allowing people to see the community estimates before providing their own independent estimate, after finding it reduced accuracy.
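To make the "statistical effects" point concrete, here is a minimal simulation sketch (my own illustration, not taken from the proposal or from any specific study). It models pure numeric anchoring: each forecaster repeatedly moves partway toward the group mean without exchanging any information. Whether this helps or hurts the crowd's median estimate depends entirely on where the truth sits relative to the initial estimate distribution. All parameter values and the mixture-distribution setup are assumptions chosen for illustration.

```python
import random
import statistics

def anchor_rounds(estimates, alpha=0.5, rounds=3):
    # Pure numeric anchoring: each round, every forecaster moves a
    # fraction (1 - alpha) of the way toward the current group mean.
    # No information is exchanged; the group mean is left unchanged,
    # but the spread of estimates shrinks.
    est = list(estimates)
    for _ in range(rounds):
        m = statistics.mean(est)
        est = [alpha * e + (1 - alpha) * m for e in est]
    return est

def crowd_error(estimates, truth):
    # Aggregate with the median, a common "wisdom of the crowd" statistic.
    return abs(statistics.median(estimates) - truth)

random.seed(1)

# Right-skewed initial estimates: most forecasters guess low,
# a small minority guesses very high (illustrative mixture).
initial = ([random.gauss(80, 10) for _ in range(180)]
           + [random.gauss(200, 20) for _ in range(20)])
final = anchor_rounds(initial)

# If the truth sits near the bulk of forecasters, anchoring drags the
# median toward the (skewed) mean, and crowd accuracy gets worse...
print(crowd_error(initial, truth=80), "->", crowd_error(final, truth=80))
# ...but if the truth happens to sit near the mean, the very same
# dynamics improve crowd accuracy.
print(crowd_error(initial, truth=92), "->", crowd_error(final, truth=92))
```

The same convergence dynamics produce opposite outcomes depending only on the shape of the initial distribution, which is the "hit-or-miss" behavior described above: no information exchange is needed to move the aggregate.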
However, as noted, I still believe comments can be helpful, which is why I want to prove it. My explanation for previous research is that these studies failed to capture 'real' deliberation.
What specific forecasting platform would you use? Ideally, I would like to work with an existing platform and I'm currently in discussions with one possible partner to see if that could work for these purposes. The risk of using an existing platform is that I can't quite get the experimental control needed, or that they decide that running experiments is not suited to their mission. As an alternative, I would use a custom-built platform that I am currently in the final stages of developing for use in a laboratory context (i.e. with participants recruited from a platform like Amazon Mechanical Turk).
How many participants do I expect to attract? This is honestly a bit difficult to predict. Based on my previous work in this area, I would aim to collect approximately 4,000 estimates. Depending on the design, which will be developed collaboratively with my partners, this could be either 4,000 people answering one question (unlikely) or 100 people answering 40 questions (much more likely).
These numbers seem feasible: a $10k prize pool on Metaculus for forecasting the Ukraine conflict attracted 500-3k estimates per question across 95 questions. That topic is unusually popular, however: a $20k prize pool on Metaculus for forecasting Our World in Data questions attracted far fewer participants, on the order of 30-50 for 30 questions.
The lesson here is that topic matters a lot. One advantage of my project is that we don't care what questions we forecast, so we are free to identify questions and topics that are likely to attract contributors.
How would I recruit these participants? Even if I don't work with an existing platform to run the forecasts, I am still very optimistic that I can work with an existing community to recruit participants.
In the unlikely event I am not able to find any community partner to help with recruiting, I can use more general methods that I have used in the past, which involve actively promoting/advertising the opportunity in online fora. For example, my dissertation involved a financial forecasting study (unpublished) that attracted approximately 1,000 participants by sharing the opportunity in online discussion groups.
What practical recommendations will emerge? Well, that depends on the results. If we find that even these communities are driven by statistical effects rather than information sharing, we might want to recommend removing comments sections. However, comments are potentially about more than just accuracy, since they also have the ability to drive participation by creating a more engaging community. Therefore, effect sizes will be very important here, which is another important reason to study this in an ecologically valid context. That is: even if the laboratory results hold up, the effects may be too small to warrant a change in practice.
Hey Joshua! I've always believed that the comments on Manifold were super helpful in helping forecasters improve their accuracy -- it seemed so obvious so as to not even need testing in an RCT, haha. It's cool to see the amount of rigor you're committing to this idea, though!
Some questions for you:
Based on the different possible outcomes of your experiment, what different recommendations would your project generate for prediction platforms? Eg if you find that comments actually reduced forecasting accuracy somehow, would the conclusion be that Manifold should turn off comments?
What specific forecasting platform would you use (is it one that you'd build/have already built?)
How many participants do you expect to attract with the $10k prize pool? How would you recruit these participants?