This is a linkpost for https://arxiv.org/abs/2503.23674

We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants had 5 minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time -- not significantly more or less often than the humans they were being compared to -- while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively). The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test. The results have implications for debates about what kind of intelligence is exhibited by Large Language Models (LLMs), and the social and economic impacts these systems are likely to have.

Comments



I don't think we should say AI has passed the Turing test until it has passed the test under conditions similar to this: 

But I do really like that these researchers have put the test online for people to try!

https://turingtest.live/

I've had one conversation as the interrogator, and I was able to easily pick out the human in 2 questions. My opener was:

"Hi, how many words are there in this sentence?"

The AI said '8'; I said 'are you sure?', and it reiterated its incorrect answer after claiming to have recounted.

The human said '9'; I said 'are you sure?', and they said 'yes?', indicating confusion and annoyance at being challenged on such an obvious question.

Maybe I was paired with one of the worse LLMs... but unless it's using hidden chain of thought under the hood (which it doesn't sound like it is), I don't think even GPT-4.5 can accurately perform counting tasks without writing out its full working.
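(For what it's worth, here's a minimal sketch in Python, just to confirm the arithmetic of the opener quoted above: splitting on whitespace gives nine words, matching the human's answer.)

```python
# Word count of the opener quoted above, splitting on whitespace.
opener = "Hi, how many words are there in this sentence?"
print(len(opener.split()))  # 9 -- the human's answer; the AI insisted on 8
```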

My current job involves trying to get LLMs to automate business tasks, and my impression is that current state-of-the-art models are still a fair way from something truly indistinguishable from an average human, even when confronted with relatively simple questions! (Not saying they won't quickly close the gap though, maybe they will!)

I'd be worried about getting sucked into semantics here. I think it's reasonable to say that it passes the original Turing test, described by Turing in 1950:

I believe that in about fifty years’ time it will be possible to programme computers, with a storage capacity of about 10⁹, to make them play the imitation game so well that an average interrogator will not have more than 70 percent chance of making the right identification after five minutes of questioning. … I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted.

I think given the restrictions of an "average interrogator" and "five minutes of questioning", this prediction has been achieved, albeit a quarter of a century later than he predicted.  This obviously doesn't prove that the AI can think or substitute for complex business tasks (it can't), but it does have implications for things like AI-spambots.  
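As a rough sanity check on the numbers (a sketch using the 73% figure from the abstract above; this isn't the paper's own analysis):

```python
# Turing's 1950 criterion: an average interrogator should have no more than a
# 70% chance of making the right identification after five minutes.
# The abstract reports GPT-4.5 was judged to be the human 73% of the time,
# so interrogators picked the real human only ~27% of the time.
judged_human_rate = 0.73                      # GPT-4.5 win rate from the abstract
right_identification = 1 - judged_human_rate  # ~0.27
print(right_identification < 0.70)            # True: under Turing's 70% bar
```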

Thanks for sharing the original definition! I didn't realise Turing had defined the parameters so precisely, and that they weren't actually that strict!

I probably need to stop saying that AI hasn't passed the Turing test yet then. I guess it has! You're right that this ends up being an argument over semantics, but it seems fair to let Alan Turing define what the term 'Turing Test' should mean.

But I do think that the stricter form of the Turing test defined in that metaculus forecast is still a really useful metric for deciding when AGI has been achieved, whereas this much weaker Turing test probably isn't.

(Also, for what it's worth, the business tasks I have in mind here aren't really 'complex', they are the kind of tasks that an average human could quite easily do well on within a 5-minute window, possibly as part of a Turing-test style setup, but LLMs struggle with)

I probably need to stop saying that AI hasn't passed the Turing test yet then. I guess it has!


By that definition, ELIZA would have passed the Turing test in 1966.

Show me a 1966 study showing that 70% of a representative sample of the general population mistook ELIZA for a human after 5 minutes of conversation.

But I do really like that these researchers have put the test online for people to try!

https://turingtest.live/

 

Thanks for sharing, it's an interesting experience.

As you mention, for now it's really easy to tell humans and AIs apart, but I found it surprisingly hard to convince people I was human.
