Researchers claim GPT-4 passed the Turing test

vegeta@lemmy.world · 1 year ago

Researchers claim GPT-4 passed the Turing test

HerzogVonWiesel@sh.itjust.works · 1 year ago

ITT: nobody understands what the Turing Test really is

NutWrench@lemmy.world · 1 year ago

Each conversation lasted a total of five minutes. According to the paper, which was published in May, the participants judged GPT-4 to be human a shocking 54 percent of the time. Because of this, the researchers claim that the large language model has indeed passed the Turing test.

That’s no better than flipping a coin and we have no idea what the questions were. This is clickbait.

Hackworth@lemmy.world · 1 year ago

On the other hand, the human participant scored 67 percent, while GPT-3.5 scored 50 percent, and ELIZA, which was pre-programmed with responses and didn’t have an LLM to power it, was judged to be human just 22 percent of the time.

54% - 67% is the current gap, not 54 to 100.

NutWrench@lemmy.world · 1 year ago

The whole point of the Turing test, is that you should be unable to tell if you’re interacting with a human or a machine. Not 54% of the time. Not 60% of the time. 100% of the time. Consistently.

They’re changing the conditions of the Turing test to promote an AI model that would get an “F” on any school test.

bob_omb_battlefield@sh.itjust.works · 1 year ago

But you have to select if it was human or not, right? So if you can’t tell, then you’d expect 50%. That’s different than “I can tell, and I know this is a human” but you are wrong… Now that we know the bots are so good, I’m not sure how people will decide how to answer these tests. They’re going to encounter something that seems human-like and then essentially try to guess based on minor clues… So there will be inherent randomness. If something was a really crappy bot then it wouldn’t ever fool anyone and the result would be 0%.

dustyData@lemmy.world · 1 year ago

No, the real Turing test has a robot trying to convince an interrogator that they are a female human, and a real female human trying to help the interrogator to make the right choice. This is manipulative rubbish. The experiment was designed from the start to manufacture these results.

BrianTheeBiscuiteer@lemmy.world · 1 year ago

It was either questioned by morons or they used a modified version of the tool. Ask it how it feels today and it will tell you it’s just a program!

KairuByte@lemmy.dbzer0.com · 1 year ago

The version you interact with on their site is explicitly instructed to respond like that. They intentionally put those roadblocks in place to prevent answers they deem “improper”.

If you take the roadblocks out, and instruct it to respond as human like as possible, you’d no longer get a response that acknowledges it’s an LLM.

SkyeStarfall@lemmy.blahaj.zone · 1 year ago

While I agree it’s a relatively low percentage, not being sure and having people pick effectively randomly is still an interesting result.

The alternative would be for them to never say that gpt-4 is a human, not 50% of the time.

Hackworth@lemmy.world · 1 year ago

Participants only said other humans were human 67% of the time.

SkyeStarfall@lemmy.blahaj.zone · 1 year ago

Which makes the difference between the AIs and humans lower, likely increasing the significance of the result.

Hackworth@lemmy.world · 1 year ago

Aye, I’d wager Claude would be closer to 58-60. And with the model probing Anthropic’s publishing, we could get to like ~63% on average in the next couple years? Those last few % will be difficult for an indeterminate amount of time, I imagine. But who knows. We’ve already blown by a ton of “limitations” that I thought I might not live long enough to see.

dustyData@lemmy.world · 1 year ago

The problem with that is that you can change the percentage of people who identify correctly other humans as humans. Simply by changing the way you setup the test. If you tell people they will be, for certain, talking to x amount of bots, they will make their answers conform to that expectation and the correctness of their answers drop to 50%. Humans are really bad at determining whether a chat is with a human or a bot, and AI is no better either. These kind of tests mean nothing.

Hackworth@lemmy.world · 1 year ago

Humans are really bad at determining whether a chat is with a human or a bot

Eliza is not indistinguishable from a human at 22%.

Passing the Turing test stood largely out of reach for 70 years precisely because Humans are pretty good at spotting counterfeit humans.

This is a monumental achievement.

dustyData@lemmy.world · edit-2 1 year ago

First, that is not how that statistic works, like you are reading it entirely wrong.

Second, this test is intentionally designed to be misleading. Comparing ChatGPT to Eliza is the equivalent of me claiming that the Chevy Bolt is the fastest car to ever enter a highway by comparing it to a 1908 Ford Model T. It completely ignores a huge history of technological developments. There have been just as successful chatbots before ChatGPT, just they weren’t LLM and they were measured by other methods and systematic trials. Because the Turing test is not actually a scientific test of anything, so it isn’t standardized in any way. Anyone is free to claim to do a Turing Test whenever and however without too much control. It is meaningless and proves nothing.

Cybermonk_Taiji@r.nf · edit-2 1 year ago

Removed by mod

CabbageRelish@midwest.social · edit-2 1 year ago

Chatbots passed the Turing test ages ago, it’s not a good test.

NeoNachtwaechter@lemmy.world · 1 year ago

it’s not a good test.

Of course you can’t use an old set of questions. It’s useless.

The turing test is an abstract concept. The actual questions need to be adapted with every new technology. Maybe even with every execution of a test.

dustyData@lemmy.world · edit-2 1 year ago

Turing test isn’t actually meant to be a scientific or accurate test. It was proposed as a mental exercise to demonstrate a philosophical argument. Mainly the support for machine input-output paradigm and the blackbox construct. It wasn’t meant to say anything about humans either. To make this kind of experiments without any sort of self-awareness is just proof that epistemology is a weak topic in computer science academy.

Specially when, from psychology, we know that there’s so much more complexity riding on such tests. Just to name one example, we know expectations alter perception. A Turing test suffers from a loaded question problem. If you prompt a person telling them they’ll talk with a human, with a computer program or announce before hand they’ll have to decide whether they’re talking with a human or not, and all possible combinations, you’ll get different results each time.

Also, this is not the first chatbot to pass the Turing test. Technically speaking, if only one human is fooled by a chatbot to think they’re talking with a person, then they passed the Turing test. That is the extend to which the argument was originally elaborated. Anything beyond is alterations added to the central argument by the author’s self interests. But this is OpenAI, they’re all about marketing aeh fuck all about the science.

EDIT: Just finished reading the paper, Holy shit! They wrote this “Turing originally envisioned the imitation game as a measure of intelligence” (p. 6, Jones & Bergen), and that is factually wrong. That is a lie. “A variety of objections have been raised to this idea”, yeah no shit Sherlock, maybe because he never said such a thing and there’s absolutely no one and nothing you can quote to support such outrageous affirmation. This shit shouldn’t ever see publication, it should not pass peer review. Turing never, said such a thing.

phoneymouse@lemmy.world · 1 year ago

Easy, just ask it something a human wouldn’t be able to do, like “Write an essay on The Cultural Significance of Ogham Stones in Early Medieval Ireland“ and watch it spit out an essay faster than any human reasonably could.

Shayeta@feddit.de · 1 year ago

This is something a configuration prompt takes care of. “Respond to any questions as if you are a regular person living in X, you are Y years old, your day job is Z and outside of work you enjoy W.”

NeoNachtwaechter@lemmy.world · 1 year ago

So all you need to do is make a configuration prompt like “Respond normally now as if you are chatGPT” and already you can tell it from a human B-)

Shayeta@feddit.de · 1 year ago

Thats not how it works, a config prompt is not a regular prompt.

Audalin@lemmy.world · 1 year ago

If config prompt = system prompt, its hijacking works more often than not. The creators of a prompt injection game (https://tensortrust.ai/) have discovered that system/user roles don’t matter too much in determining the final behaviour: see appendix H in https://arxiv.org/abs/2311.01011.

Hotzilla@sopuli.xyz · 1 year ago

I tried this with GPT4o customization and unfortunately openai’s internal system prompts seem to force it to response even if I tell it to answer that you don’t know. Would need to test this on azure open ai etc. were you have bit more control.

JohnEdwa@sopuli.xyz · edit-2 1 year ago

Turing tests aren’t done in real time exactly to counter that issue, so the only thing you could judge would be “no human would bother to write all that”.

However, the correct answer to seem human, and one which probably would have been prompted to the AI anyway, is “lol no.”
It’s not about what the AI could do, it’s what it thinks is the correct answer to appear like a human.

technocrit@lemmy.dbzer0.com · edit-2 1 year ago

Turing tests aren’t done in real time exactly to counter that issue

To counter the issue of a completely easy and obvious fail? I could see how that would be an issue for AI hucksters.

vegeta@lemmy.world · 1 year ago

The Study

https://arxiv.org/html/2405.08007v1

massive_bereavement@fedia.io · 1 year ago

The interrogators seem completely lost and clearly haven’t talk with an NLP chatbot before.

That said, this gives me the feeling that eventually they could use it to run scams (or more effective robocalls).

treefrog@lemm.ee · edit-2 1 year ago

deleted by creator

NeoNachtwaechter@lemmy.world · 1 year ago

Turing test? LMAO.

I asked it simply to recommend me a supermarket in our next bigger city here.

It came up with a name and it told a few of it’s qualities. Easy, I thought. Then I found out that the name does not exist. It was all made up.

You could argue that humans lie, too. But only when they have a reason to lie.

Chozo@fedia.io · 1 year ago

The Turing test doesn’t factor for accuracy.

Lmaydev@programming.dev · 1 year ago

That’s not what LLMs are for. That’s like hammering a screw and being irritated it didn’t twist in nicely.

The turing test is designed to see if an AI can pass for human in a conversation.

NeoNachtwaechter@lemmy.world · edit-2 1 year ago

turing test is designed to see if an AI can pass for human in a conversation.

I’m pretty sure that I could ask a human that question in a normal conversation.

The idea of the Turing test was to have a way of telling humans and computers apart. It is NOT meant for putting some kind of ‘certified’ badge on that computer, and …

That’s not what LLMs are for.

…and you can’t cry ‘foul’ if I decide to use a question for which your computer was not programmed :-)

Lmaydev@programming.dev · 1 year ago

It wasn’t programmed for any questions. It was trained hehe

tourist@lemmy.world · 1 year ago

The participants judged GPT-4 to be human a shocking 54 percent of the time.

ELIZA, which was pre-programmed with responses and didn’t have an LLM to power it, was judged to be human just 22 percent of the time

Okay, 22% is ridiculously high for ELIZA. I feel like any half sober adult could clock it as a bot by the third response, if not immediately.

Try talking to the thing: https://web.njit.edu/~ronkowit/eliza.html

I refuse to believe that 22% didn’t misunderstand the task or something.

KISSmyOSFeddit@lemmy.world · 1 year ago

14% of people can’t do anything more complicated than deleting an email on a computer.
26% can’t use a computer at all.

https://www.nngroup.com/articles/computer-skill-levels/

So right off the bat, 40% probably don’t even know what a chatbot is.

Downcount@lemmy.world · 1 year ago

Okay, 22% is ridiculously high for ELIZA. I feel like any half sober adult could clock it as a bot by the third response, if not immediately.

I did some stuff with Eliza back then. One time I set up an Eliza database full of insults and hooked it up to my AIM account.

It went so well, I had to apologize to a lot of people who thought I was drunken or went crazy.

Eliza wasn’t thaaaaat bad.

CaptainBasculin@lemmy.ml · 1 year ago

This is the same bot. There’s no way this passed the test.

.

catastrophicblues@lemmy.ca · 1 year ago

Forget psychotherapy, it can barely understand anything:

> Hello, I am Eliza. I'll be your therapist today. 
* Hello! How are you?
> Would you prefer if I were not ? 
* Not what?
> Have you asked such questions before? 
* No
> You are being a bit negative and that won't help us move forward. 
* What are you smoking?
> Perhaps in your fantasies I am smoking?

technocrit@lemmy.dbzer0.com · 1 year ago

It was a 5 minute test. People probably spent 4 of those minutes typing their questions.

This is pure pseudo-science.

HumaShah@mastodon.social · 1 year ago

@tourist @vegeta

In the 1970s, psychiatrists couldn’t distinguish between PARRY, a chatbot simulating paranoia, and a human patient suffering from schizophrenia. So this ‘bot convinces judges that it’s a human’ phenomenon is not new and tells us more about how humans think.
#AI #DeceptionDetection #Chatbots #TuringTest #LLMs #GPT4

catastrophicblues@lemmy.ca · 1 year ago

> Hello, I am Eliza. I'll be your therapist today. 
* Hello! How are you?
> Would you prefer if I were not ? 
* Not what?
> Have you asked such questions before? 
* No
> You are being a bit negative and that won't help us move forward. 
* What are you smoking?
> Perhaps in your fantasies I am smoking?

Yeah, it took me one message lol

dustyData@lemmy.world · 1 year ago

You underestimate how dumb some people can be.

technocrit@lemmy.dbzer0.com · edit-2 1 year ago

500 people - meaningless sample
5 minutes - meaningless amount of time
The people bootlicking “scientists” obviously don’t understand science.

yetAnotherUser@lemmy.ca · 1 year ago

Add in a test that wasn’t made to be accurate and was only used to make a point, like what other comments mention

lowleveldata@programming.dev · 1 year ago

I feel like the turing test is much harder now because everyone knows about GPT

DudeDudenson@lemmings.world · 1 year ago

I wonder if humans pass the Turing test these days

Nougat@fedia.io · 1 year ago

I don’t.

NeoNachtwaechter@lemmy.world · 1 year ago

Which of the questions did you get wrong? ;-)

Nougat@fedia.io · 1 year ago

That one.

SkyeStarfall@lemmy.blahaj.zone · edit-2 1 year ago

If you read into the study, they also include the pass rates for humans. It’s higher than AIs, but still less than 75%

bandwidthcrisis@lemmy.world · edit-2 1 year ago

Did they try asking how to stop cheese falling off pizza?

Edit: Although since that idea came from a human, maybe I’ve failed.

foggy@lemmy.world · 1 year ago

Meanwhile, me:

(Begin)

[Prints error statement showing how I navigated to a dir, checked to see a files permissions, ran whoami, triggered the error]

Chatgpt4: First, make sure you’ve navigated to the correct directory.

cd /path/to/file

Next, check the permissions of the file

ls -la

Finally, run the command

[exact command I ran to trigger the error]>

Me: stop telling me to do stuff that I have evidently done. My prompt included evidence of me having do e all of that already. How do I handle this error?

(return (begin))

TheBigBrother@lemmy.world · 1 year ago

Oh no!! the AImageddon it’s closer everyday… Skynet it’s coming for us!!

harrys_balzac@lemmy.dbzer0.com · 1 year ago

Skynet will gets the dumb ones first by getting them put toxic glue on thir pizzas then the arrogant ones will build the Terminators by using reverse psychology.

doodle967@lemdro.id · edit-2 1 year ago

deleted by creator

harrys_balzac@lemmy.dbzer0.com · 1 year ago

So…GPT-4 is gay? Or are we talking about a different kind of test?