I recently ran a Turing Test with GPT-4 at turingtest.live. We collected around 6,000 games from ~2,000 participants. There's a preprint of the results from the first 2,000 games here (https://arxiv.org/abs/2310.20216). The full dataset is under review, and one prompt achieves a 49.7% success rate after 855 games.
While the TT has important drawbacks as a test of intelligence, I think it's important as a test of deception per se. Can alert, adversarial users distinguish an LLM from a human in a 5-minute, text-only conversation? Which prompts and models work best? Which interrogation strategies work best? I think these are important and interesting questions to answer from a safety and sociological perspective. Plus, lots of people reported finding the game very fun and interesting to play!
Games cost around $0.30 to run with GPT-4. We don't have dedicated funding for the project and have been drawing on a limited general experiment funding pot. The site gained popularity and we went through $500 in December, so we decided to shut it down temporarily. Ideally, I'd like to revive it in 2024, but I would need some dedicated funding to do this. If you'd like to test out the interface, you can do it here: turingtest.live/ai_game (please don't share this link widely though!)
As well as getting a better estimate of the success rate of existing models and allowing more people to play the game, there are a variety of additional questions we'd like to ask.
1. Prompts: We've tried around 60 prompts and there's a lot of variance. I'd be keen to generate more and see how well they do. A priori, it seems very likely there are better prompts than the ones we've tried.
2. Temperature: We've varied temperature a bit, but not very systematically. It would be useful to try the same prompt at a variety of temperatures.
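For instance, a systematic sweep could cross each prompt with a fixed temperature grid and allocate an equal number of games per cell. A minimal sketch of that design (the prompt names, temperature values, and per-cell count below are illustrative, not from the project):

```python
from itertools import product

def make_schedule(prompts, temperatures, games_per_cell):
    """Cross prompts with temperatures, repeating each
    (prompt, temperature) cell games_per_cell times."""
    return [cell
            for cell in product(prompts, temperatures)
            for _ in range(games_per_cell)]

# 3 prompts x 4 temperatures x 50 games/cell = 600 games
schedule = make_schedule(["prompt_a", "prompt_b", "prompt_c"],
                         [0.2, 0.5, 0.8, 1.1], 50)
print(len(schedule))  # 600
```

In practice you'd shuffle the schedule before serving games, so early and late players don't see systematically different conditions.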
3. Auxiliary infrastructure: Models often fail due to a lack of real-time information. We could address this through browsing/tool use. They also often make silly errors, which we might be able to address through double-checking and/or chain-of-thought (CoT) scratchpads.
4. User-generated prompts: It would be lovely to let users generate and test their own prompts, but you probably need at least 30-50 games to reliably test a prompt. We would need a good ratio of games played to prompts created, a decent userbase, and some funding to do this well.
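As a rough check on why 30-50 games is a floor, a 95% confidence interval on a success rate is still quite wide at those sample sizes. A stdlib-only sketch using the Wilson score interval (the 20-wins-in-40-games example is illustrative):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# 20 wins in 40 games: the interval spans roughly 35-65%
lo, hi = wilson_interval(20, 40)
print(f"{lo:.2f}-{hi:.2f}")  # 0.35-0.65
```

So at 40 games a prompt's estimated success rate carries a margin of error of about ±15 percentage points; distinguishing a good prompt from a mediocre one needs many more games than that.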
5. Other models: I'm planning to include a couple more API model endpoints (e.g. Claude), which should be relatively easy to do. A lot of the feedback on Twitter came from e/acc folks who want to see open-source (OS)/non-RLHF models tested, and that seems right to me too. We could probably run some 7B models for < $2/hr and bigger ones for something like $5-10/hr (though I haven't tested this). Some fiddling with the infrastructure would be needed for this. We might also experiment with only running the game for 1-2 hrs/day, to minimise server uptime and maximise concurrent human users.
Essentially, my goal would be to make some of these improvements, run several thousand more games, and publish the results.
I am a PhD student in cognitive science at UCSD. I've implemented the first version of this site and written a paper on the results. I'm pretty familiar with the literature on the Turing Test and I've implemented a range of similar experiments over the last 4 years of my PhD.
I'll also be working with my advisor, Ben Bergen, a professor in the department who has a proven track record of successful cognitive science research across his career (https://pages.ucsd.edu/~bkbergen/).
Website: https://camrobjones.com
Twitter: @camrobjones
Github: camrobjones
Linkedin: https://linkedin.com/in/camrobjones
~$5,000. At $0.30/game this would buy us ~16,000 games. Some additions like browsing and double-checking might increase the cost per game. Most likely we would use a decent part of this to run servers for OS models (e.g. $5/hr × 2 hrs/day × 7 days/week × 8 weeks = $560).
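A back-of-the-envelope check on how that splits (all figures taken from the estimates above):

```python
budget = 5000
cost_per_game = 0.30

# Dedicated server time for open-source models:
# $5/hr x 2 hrs/day x 7 days/week x 8 weeks
server_cost = 5 * 2 * 7 * 8

remaining = budget - server_cost
games = remaining / cost_per_game
print(server_cost, int(games))  # 560 14800
```

So even after reserving ~$560 for OS model servers, the budget still covers roughly 14,800 GPT-4 games at current per-game cost.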
Site: turingtest.live
Demo: turingtest.live/ai_game (please don't share widely).
Preprint: https://arxiv.org/abs/2310.20216
Running ~5000 games in < 3 months: 95%
Building out auxiliary infrastructure: 90%
Building out OS model infrastructure: 85%
Running ~10000 games in < 3 months: 80%
Finding a prompt/setup that reliably "passes" (by "passes" I mean a significantly > 50% success rate*; I don't know whether this counts as "success", but it would be an interesting outcome): 40%.
* We discuss this a lot more in the preprint. This seems like the least-worst benchmark to me.
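To make the benchmark concrete, here's a quick two-sided test of H0: success rate = 50% (normal approximation, stdlib only), applied to the 49.7%-over-855-games figure from the summary above (i.e. 425/855). This is an illustrative check, not the analysis from the preprint:

```python
import math

def z_test_vs_half(successes, n):
    """Two-sided normal-approximation test of H0: p = 0.5."""
    p_hat = successes / n
    se = math.sqrt(0.25 / n)  # standard error under H0
    z = (p_hat - 0.5) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

z, p = z_test_vs_half(425, 855)  # ~49.7% over 855 games
print(round(z, 2), round(p, 2))
```

The result is nowhere near significant in either direction: the best current prompt is statistically indistinguishable from chance, which is exactly why "significantly > 50%" sets a meaningful bar for "passing".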