r/SneerClub archives
Can GPT-3 pass the Turing Test? probably not, but it can easily fool Hacker News. (https://www.theverge.com/2020/8/16/21371049/gpt3-hacker-news-ai-blog)

…So when Yudkowsky’s tweet about being scared that GPT-3 might be lying to him showed up here, there was a sneer in the comments saying that GPT-3 might not be able to lie, but could easily replace him:

So no, Yud, the AI is not pretending to be dumb. And GPT isn’t going to replace lawyers either, or math professors. It might, however, replace you: grifting rich right wing morons only requires saying things that sound smart but mean nothing, a task that GPT happens to excel at.

At the time I thought it was a sarcastic dunk, but it took less than a month for this to come out.

Remember how they were hopeful (or worried, in the case of the Basilisk) that they could be recreated in the future from the text they wrote? Well, that future is now! We are all joined in the almighty Basilisk, the bringer of eternal justice, hallowed be its name.
Oh this is so good

There’s a link to one of the blog posts in the article. It nails the “business/productivity writer trying to sell you his book” vibe.

Here’s GPT-3

In order to get something done, maybe we need to think less. Seems counter-intuitive, but I believe sometimes our thoughts can get in the way of the creative process. We can work better at times when we “tune out” the external world and focus on what’s in front of us.

I’ve been thinking about this lately, so I thought it would be good to write an article about it.

So what exactly does this mean? Well, for starters, let’s start with some definitions.

Definition #1: Creative Thinking (CT) is the act of using your mind to come up with ideas that are not already present in reality or that have never existed before. It is a form of mental activity that involves the use of logic and reason.

Definition #2: Over-Thinking (OT) is the act of trying to come up with ideas that have already been thought through by someone else. OT usually results in ideas that are impractical, impossible, or even stupid.

Now, how do these two concepts relate to each other? Well, if you’re like me then you probably find yourself doing both at times. You might be working on a project all day long, but you still feel as though you’re just going through the motions. This is because you’re thinking too much!

Here’s some bits from a random James Clear article. He’s the author of Atomic Habits.

Let’s define decision making. Decision making is just what it sounds like: the action or process of making decisions. Sometimes we make logical decisions, but there are many times when we make emotional, irrational, and confusing choices. This page covers why we make poor decisions and discusses useful frameworks to expand your decision-making toolbox.

Why We Make Poor Decisions

I like to think of myself as a rational person, but I’m not one. The good news is it’s not just me — or you. We are all irrational. For a long time, researchers and economists believed that humans made logical, well-considered decisions. In recent decades, however, researchers have uncovered a wide range of mental errors that derail our thinking. The articles below outline where we often go wrong and what to do about it.

I mean, look, creative thinking is partly a structural property of brains. Those with ADHD and autism, especially combined, come up with very creative solutions to all kinds of things. Whereas buying this advice won’t actually make you really creative: in fact, the propensity to buy it likely rules out true strong creativity.
GPT-3 does an *excellent* job of imitating spurious bullshitters.

This is meaningless. How are you supposed to detect that any given text is machine-written just by reading it? Read any corporate press release or self-help book or content-aggregator blog; this is the type of writing they use (to be inoffensive).

Also, there is no “authentically” human way to write. Rationalists are probably the group most at risk of being fooled by generated text, because emotionless fact-regurgitation and drawing surface-level conclusions (their ideal way of thinking and writing) is something I think algorithms can easily replicate.

“Force-fed, raving mad conditional likelihood makes it to the top of nosleep. Strange problematic dorks buy night lights in droves. Bitcoin soars.”

To be fair, I have seen chatbots that just shouted Bible quotes pass the Turing test for some people.

A long time ago I made a chatbot that "passed" the Turing test in a game lobby. It would annoyingly correct people's typos. Not always: with a random delay, a very low rate limit, and never correcting the same typo twice (within a long timeout period). I had a moderator give it a bot icon. A lot of people were convinced it was someone trolling them by pretending to be a bot. It would correct a typo, and then sit in smug silence while the person tried to make it talk again by posting typos.
Well, there also was a [famous song](https://www.youtube.com/watch?v=b2duli2jvGw) about a Swedish dude who mistook a woman for a bot.

Isn’t this just plain magical thinking? It’s like thinking your toys came to life and moved when you were a kid, when it was just your parents picking up.

Or underpants gnomes:

Step 1: GIANT neural network trained on ALL THE TEXT

Step 2: ????

Step 3: PROFIT BASILISK

Well, TBH, "train on some enormous dataset of something" is a viable approach to something interesting, but with a *smaller* network, or a network with a bottleneck in it (a narrow layer), or the like. GPT-3 is so enormously huge, it pretty much memorizes a good fraction of its (also truly enormous) dataset. edit: It's kind of like with those image-generating AIs. They're remixing input images, and they have a model large enough that they're actually memorizing, in a sense, the input images (as can be demonstrated in the corner cases of using a deliberately very small dataset). I suspect that one important task in the future of digital forensics will be determining whether a particular image or book was part of the training dataset of a neural network, which would likely be possible just because of how reliant it all is on a sort of compressed memorization.
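To make the "narrow layer" idea concrete, here's a minimal sketch of a network with a bottleneck, in PyTorch. This is a hypothetical illustration, not any model mentioned above; the layer sizes are made up.

```python
# Minimal sketch of a "bottleneck" network (hypothetical sizes, not GPT-3):
# the narrow middle layer caps how much of the input can simply be passed
# through, forcing a compressed representation instead of rote memorization.
import torch
import torch.nn as nn

bottleneck_net = nn.Sequential(
    nn.Linear(512, 256),   # wide input side
    nn.ReLU(),
    nn.Linear(256, 8),     # the bottleneck: only 8 numbers survive this layer
    nn.ReLU(),
    nn.Linear(8, 256),
    nn.ReLU(),
    nn.Linear(256, 512),   # reconstruct something input-sized on the other side
)

x = torch.randn(4, 512)             # a batch of 4 random "inputs"
reconstruction = bottleneck_net(x)
print(reconstruction.shape)          # torch.Size([4, 512])
```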
At the risk of sounding reductionist, "are they reading or just memorizing?" is a problem that everyone who's taught a toddler to read has dealt with. Do neural networks incorporate any of the experience or pedagogy we have in teaching actual brains actual analysis?
Or anyone who's taught an undergraduate "intro to proofs" mathematics course...
Anyone who's taught or taken any kind of course with an open-note exam... just imagine the test-takers are allowed to bring a trillion pages of notes that they can somehow search in microseconds. EDIT: and their only "notes" are on previous questions from similar exams, not even the textbook or lectures.
I remember one of my econ lecturers saying that he no longer posts exam solutions because it results in "garbled recreations of irrelevant solutions" on future exams. Little did he know that we would make a computer to produce garbled recreations of the entire Internet!
With all the progress on digitizing books, computers could someday produce garbled recreations of the entire history of scholarly discourse! Then we wouldn't need to rely on Rationalists for that.
An electric monk for euphoric robot utopians.
You say that, but if we just use Gödel's incompleteness theorem we can ...
Is GPT-3 memorizing stuff? I thought people had tried putting the text it generates into plagiarism detectors and hadn't gotten any hits.
It isn't going to fail a plagiarism detector; it does dutifully remix the text. The issue is with what happens when you're training a neural network. For every input sequence, you're nudging every parameter a little closer to where it would have to be to produce that sequence, and you do that over and over again (in their case, spending millions of dollars on compute time). If you have a lot of training data and not very many parameters in the neural network, something interesting happens as the neural network is nudged back and forth, unable to match the training data very well. That's where it begins to sort of generalize (if badly). If you have a lot of parameters, your results are superficially better, but what happens in the neural network is considerably less interesting. It isn't really generalizing; it is building an increasingly good representation of the dataset. As you train more, your performance on the training dataset gets better and better, while your performance on a test dataset (data not included in training) plateaus and begins to get worse (over-fitting). Note that parameters don't have to correspond to computations; e.g. in a convolutional neural network there are relatively few parameters, which are reused. So in principle a neural network could do the same amount of computation as GPT-3 does but not have as many parameters, and then perhaps be doing something more interesting internally if it can match GPT-3. For example, a convolutional neural network (such as used in computer vision) generalizes across different spatial locations, and across different orientations as well if the dataset is artificially expanded by rotations. That's quite interesting, if not exactly intelligent. edit: I work in compression, and a lot of my interest is specifically in the ability to use a neural network to memorize data as exactly as possible. OpenAI terminates training before it can recite the training data too well, once its ability to work on test data (that it is not trained on) begins to decline. But this is like some monk who's set to rote-memorize the Bible stopping before he can recite it perfectly, and then being put to work reciting it the best he can, to produce "new" texts.
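The "stop once performance on held-out data starts to decline" procedure being described is early stopping. Here's a rough sketch of what such a loop typically looks like, on a toy regression problem with a deliberately over-sized network; the data and model are made up for illustration and have nothing to do with OpenAI's actual training setup.

```python
# Rough sketch of early stopping on a validation set (toy data, illustrative only).
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic regression task: y = sin(x) plus noise.
x = torch.linspace(-3, 3, 200).unsqueeze(1)
y = torch.sin(x) + 0.3 * torch.randn_like(x)
x_train, y_train = x[::2], y[::2]        # half of the points for training
x_val, y_val = x[1::2], y[1::2]          # the other half held out for validation

# Deliberately over-sized model, so it *can* memorize the training noise.
model = nn.Sequential(
    nn.Linear(1, 256), nn.Tanh(),
    nn.Linear(256, 256), nn.Tanh(),
    nn.Linear(256, 1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

best_val, patience, bad_epochs = float("inf"), 200, 0
for epoch in range(10_000):
    opt.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    opt.step()

    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
    if bad_epochs > patience:            # held-out loss stopped improving: cut training off here
        print(f"early stop at epoch {epoch}, best validation loss {best_val:.4f}")
        break
```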
If we assume it's just memorizing: if it can take an arbitrary prompt, pick the correct text from its memory to respond with, and paraphrase it while preserving meaning and grammatical correctness, that's still a pretty big deal. From the articles I've read, there doesn't seem to be a consensus that it's just memorizing, and not in the good way of building up a knowledge base to draw on more generally. Lots of people have mentioned that it seems to have learned how to bullshit, but that's still generalization. We don't really know the minimum number of parameters GPT-3 could have and get the same performance. The human brain is obviously more capable than GPT-3, but it has huge amounts of knowledge and context to draw on, and certainly more "parameters" than GPT-3 if one roughly quantified the brain that way. Also, you mentioned a bottleneck; GPT-3 is a transformer model, which as I understand it means it has an encoder-decoder architecture, with the various attention and fully connected layers in the encoder processing the input, and then the hidden states/encodings from these layers being combined and passed to the layers in the decoder, which produces the output. I don't know how large the hidden state is in GPT-3, but it would be the bottleneck. And regarding convolutional filters, the attention layers in a transformer are in many ways analogous to the filters in a CNN. For multi-headed attention, researchers have examined the behavior of the individual attention heads from trained transformers and found that they had human-interpretable functions, with one head attending to the word immediately preceding the subject word, another attending to verb-object word pairs in the sentence, etc.
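For reference, the attention mechanism being discussed boils down to a few matrix multiplications. Below is a bare-bones, single-head, scaled dot-product attention sketch in numpy; it is illustrative only (made-up sizes, no masking, not GPT-3's actual implementation).

```python
# Bare-bones scaled dot-product attention (one head, no masking), for illustration only.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Each output position is a weighted average of the values V,
    with weights given by how well its query matches each key."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) compatibility scores
    weights = softmax(scores, axis=-1)   # rows sum to 1: "how much to attend where"
    return weights @ V

seq_len, d_model = 5, 16                 # 5 tokens, 16-dim embeddings (made-up sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))  # token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out = attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)                         # (5, 16): one context-mixed vector per token
```

A "head" is just one such set of W_q, W_k, W_v projections; multi-head attention runs several in parallel and concatenates the results, which is why individual heads can end up with distinct, interpretable roles.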
Well, humans see massively less input data than GPT-3's training used, so it is clear that humans "generalize" far further from input data, even when bullshitting. I think what it does is considerably less than bullshitting, something we don't really have a word for. When it is being trained, input texts are blended together; the values of the parameters are a sum of nudges from each training sample. It is sort of like interpolation in that regard. And as far as any bottlenecks go, if the decoder is large enough to memorize everything, you can go through a bottleneck as small as 40 bits (sequence # and word # within the sequence), that is, just a few floats, and still spit out an exact result. Although it may take an impractically large number of training iterations to get there. edit: It feels to me, though, that if your strategy is basically "after each round of gradient descent, check performance on a separate dataset and stop if it declines" (which is common), that's basically doing memorization and stopping while the memory is still blurry and the samples being memorized are still mixed up together. The network is moving through the parameter space, along a trajectory, monotonically improving its recital; if you stop before perfect recital (or at least, the best possible recital), perhaps all you have is an imperfect recital? It seems silly to just assume that there is something profound happening in the middle, especially when you use so much more data than humans see.
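As a sanity check on the "40 bits" figure: an index into roughly half a trillion training tokens (the scale cited elsewhere in this thread) needs just under 40 bits, so a couple of floats really would be enough to address any position in the dataset, assuming a decoder that has memorized everything, which is the hypothetical here.

```python
# Back-of-the-envelope check of the "bottleneck as small as 40 bits" claim.
import math

tokens_in_dataset = 500e9                      # ~500 billion training tokens (figure cited in the thread)
bits_to_index_any_token = math.log2(tokens_in_dataset)
print(f"{bits_to_index_any_token:.1f} bits")   # ~38.9 bits, i.e. under 40

# Two 32-bit floats carry 64 bits, comfortably enough to encode a
# (sequence number, word number) pair if the decoder itself holds the text.
```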
I don't think GPT-3's abilities come close to a human's, and I don't think merely scaling it up further will give us something with the capabilities of a human brain. I'm just cautious about criticizing it for being a large model and then comparing it to human abilities. The human brain certainly uses much more elegant and powerful strategies to do what it does than any known ML techniques, but it also has a quadrillion synapses. Truly intelligent behavior may very well require (by current standards) enormous networks as well as new insights. I've seen people give examples of GPT-3 output that look like it has memorized something; I think it would be useful for someone to try taking those examples and searching the Common Crawl dataset it was trained on for related keywords or phrases to see if the information in question was present in the training set. I was trying to figure out how much it could have memorized in theory. It has 175 billion parameters and (I think) is 350 GB. It was trained on a dataset of about 500 billion tokens that was 570 GB. That naively suggests it could memorize a lot of its training set, but only if neural net weights are a very optimal way of storing that data, which would surprise me. Regarding your edit, you could call it blurry memorization, but most just call it model-fitting, and it's what all ML training does, for any model size. Like if you're a statistician and you have a scatter plot of noisy data, you might try fitting a linear trend line to it. Then you might try a quadratic curve. You might pick the quadratic curve because it has a higher R-squared and looks like it fits the data better. You probably wouldn't pick a 1000-degree polynomial curve, even though you could make that go perfectly through every point, because you'd just be fitting to noise, not the real underlying function, and your trend line would be some wavy nonsense with no predictive value. It would be a judgment call what you chose. The purpose of early stopping based on the validation loss is just to automate that judgment call, for an ML model that can express an enormously complicated function and fit closer and closer to the training set with more training. Anyhow, people hyping GPT-3 as an AGI is annoying, but there is a real advance there. Not so much GPT-3 itself, as it's just a scaled-up version of a previous approach, but the transformer architecture and semi-supervised learning that's been behind all the big language models of the last several years (BERT, T5, GPT-3, etc.). If they had trained one of the models that preceded transformers (like an RNN) on the same dataset with 175 billion parameters, it likely would not have come close to the same capabilities and performance.
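The back-of-the-envelope numbers in that comparison work out as follows, using the figures quoted above (175 billion parameters at 2 bytes each in half precision, against roughly 570 GB / 500 billion tokens of filtered training text):

```python
# Rough capacity arithmetic for the "could it memorize its training set?" question.
params = 175e9
bytes_per_param = 2                          # half-precision (fp16) storage
model_bytes = params * bytes_per_param       # ~350 GB of weights

dataset_bytes = 570e9                        # ~570 GB of filtered training text
tokens = 500e9                               # ~500 billion tokens

print(f"model size:   {model_bytes / 1e9:.0f} GB")
print(f"dataset size: {dataset_bytes / 1e9:.0f} GB")
print(f"bits of weight storage per training token: {model_bytes * 8 / tokens:.1f}")
# ~5.6 bits of raw weight storage per token: verbatim memorization of a large
# fraction of the data is conceivable only if weights store text very efficiently.
```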
> You probably wouldn't pick a 1000-degree polynomial curve, even though you could make that go perfectly through every point, because you'd just be fitting to noise, not the real underlying function, and your trend line would be some wavy nonsense with no predictive value. It would be a judgment call what you chose. The purpose of early stopping based on the validation loss is just to automate that judgment call, for an ML model that can express an enormously complicated function and fit closer and closer to the training set with more training. The analogy would be that instead of fitting a linear or quadratic equation, they are fitting a 1000-degree polynomial using a very crappy iterative method, and they interrupt that fit. It is already pretty damn wavy, though, because the fit is adjusting all 1000 parameters together at once. edit: also, AFAIK they even initialize it so it would be wavy from the very start. If you are fitting a line using an iterative method, the longer you run it, the better your line will get. It will actually converge towards the best line. Basically the issue is that it is a very unprincipled approach. It doesn't even converge towards the solution they want; it converges towards the solution they don't want. edit: worse than that, it converges towards the solution that they want to claim not to produce.
The longer you run it, the better your line would fit to the training set, but after a certain point, the worse it would be at making predictions for the validation or test set. The training, validation and test sets are all randomly sampled from the same underlying true probability distribution, but will have different random variation. You want your model to fit the underlying distribution, but not the random variation. If you train by minimizing the loss on the training set, but also check the loss on the validation set, you would expect that as long as you're getting closer to the underlying distribution, the training and validation loss will both go down, because both sets were drawn from that distribution. Once you start overfitting to the random variation in the training set, the validation loss will start going up, because that random variation is not shared between the training and validation sets.
> The longer you run it, the better your line would fit to the training set, but after a certain point, the worse it would be at making predictions for the validation or test set. Not for a linear regression, though. Usually you just solve for the best fit directly, arriving at the global minimum (e.g. least squares) in a single step, and that is also the best predictor for a validation set. Likewise, polynomial fits are usually not solved by gradient descent, but all at once; instead the number of parameters is kept low enough, or an additional cost is introduced for higher-order parameters. The issue is that they are so highly over-parametrized, their minimum does correspond to memorization of random variation. What about the cut-off point, one may say? I think at the cut-off point it is also memorizing random variation, but it does not yet have enough "weight" on the result. I think in human terms, maybe the best description of what's going on is that in practical terms it is more similar to a look-up table than even to the most bullshittiest bullshitting. It isn't really a look-up table, in the sense that it doesn't need an exact match, and it doesn't spit out exactly a known output. But it is in a sense closer to being a way to query a large dataset. edit: I wonder if they ever tried to feed sine and cosine tables to it, and see how much training data it would need to learn how to read decimal representations, interpolate, and convert back to decimal.
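To illustrate the "solved in a single step" point above: ordinary least squares has a closed-form solution, so there is no iterative descent to interrupt. A quick numpy check on made-up data:

```python
# Linear regression solved in closed form (least squares) -- no iteration to cut off early.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=200)     # true line plus noise

# Design matrix [x, 1], then solve the least-squares problem in one call.
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"slope ~ {slope:.2f}, intercept ~ {intercept:.2f}")  # close to 2 and 1

# This lands at the global minimum of the squared error directly, which is also
# (up to noise) the best predictor for fresh data drawn from the same line.
```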
That's because linear regression and polynomial fits are applied to problems that are really, really easy by comparison; where you look at your scatterplot and say "that looks like I should fit a cubic to it", or at most try a couple of polynomials and choose which one seems most reasonable. You could choose a polynomial with a high enough degree that it goes through every single point perfectly and gets R-squared=1, but you don't, because you know that some of the point spread is due to noise, and you make a hand-wavey judgment call about what is noise and what is the relationship you're trying to model. If you wanted to make a more rigorously defensible judgment call about the degree of polynomial to fit, you'd probably have to use the exact same trick of comparing the training and validation loss that the neural nets use. In a neural net, you're usually trying to solve a much more difficult model fitting problem ("take in a million pixel values, and spit out whether the image contains a cat or not"), so humans can't eyeball exactly what type of function should be fit to the data, and you have to let the machine try stuff by giving it a function with lots of degrees of freedom and letting it tweak it. But you can check if it's overfitting using the validation set, and you can apply various methods (like adding a measure of parameter spread to the loss function, or using dropout) to prevent it from memorizing too much. And if it performs well at making predictions for data it was never trained on, then it's not overfitted. That's the most you can ask of it, or any statistical method, since this is really a statistics problem we're talking about, not anything specific to neural nets.
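The degree-selection judgment call described above can be automated exactly as stated: fit a range of polynomial degrees and pick the one whose error on held-out points, not on the training points, is lowest. A toy sketch on made-up data:

```python
# Choosing polynomial degree by validation error rather than training error (toy example).
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-3, 3, size=60))
y = x**2 - 2 * x + rng.normal(scale=2.0, size=60)   # a quadratic trend plus noise

x_train, y_train = x[::2], y[::2]                   # alternate points: train vs held-out
x_val, y_val = x[1::2], y[1::2]

for degree in (1, 2, 5, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:6.2f}, val MSE {val_mse:6.2f}")

# Training error can only shrink as the degree grows; validation error typically
# bottoms out near the true degree (2 here) and then worsens as the fit chases noise.
```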
Well, what's specific to this case is that it is so extremely over-parametrized that it will happily memorize much of the data verbatim if you run all the way to the minimum (or I'd assume it does, based on experience with other cases with that kind of ratio between network size and amount of data). The relevance is to the memorization question. That establishes it *can* memorize to the point of detriment; it's certainly big enough. The remaining question is: does it, if we cut off the training? I think it does, albeit we are preventing it from memorizing exactly the dataset in question. It is not like it isn't learning individual samples even during the early stages of training; their influences are just not yet strong enough to sway the output so much. edit: I guess, put another way, what is the trained neural network doing, is it actually "generalizing" anything? Let's say we have a device that computes the sine function. If that device has a lot of parameters (an old book of sine tables), it is, in a sense, considerably less interesting than, say, some mechanism which computes sine by measuring it with a ruler. Even if more accurate.
"Generalizing" in this case just means learning a model that can predict accurately on novel data, as well as the data it was trained on (provided that novel data was sampled from a similar distribution). For a classifier, where the classes are x and the observations are y, it would mean learning a function that accurately represents the conditional pdf or pmf for p(x|y). It's not generalizing in the sense a human might generalize by reasoning about concepts. GPT-3 is probably memorizing lots of stuff, but I think people are impressed because regardless of how unoriginal it is, its outputs make sense. For output text with a number of words less than GPT-3's context window (which I believe (I'm hazy on this) is basically the maximum length of text that GPT-3's attention layers can attend to at a time, so sort of like its memory), the sentences are grammatically correct, express coherent concepts, don't contradict each other and build on each other in expressing a point. Since GPT-3 is just the largest transformer model to date, this really speaks to the power and scalability of the transformer architecture. Lots of architectures (like for example, a giant fully connected network) will probably do badly on more challenging tasks no matter how much data you train them with. You need to find an architecture that is structured such that it is primed to learn the "right stuff" through the gradient descent process. Image recognition took off with convolutional neural nets in part because they inherently learn spatial locality; they learn a filter to detect some feature (the texture of a cat's fur, the right angled edges in chair legs, etc), and then use it to check each part of the image for that feature. RNNs used to be the standard for doing natural language processing because they can take arbitrary length inputs and give arbitrary length outputs, but the exponential decay in their memory of earlier words made it hard for them to utilize context and write coherently. Transformers just seemed to be primed to learn to use context and grammar the way CNNs are primed to learn to extract meaning from images. And when people look at the activations in the attention heads of a transformer, the heads seem to be learning human interpretable grammar and sentence structure. GPT-3 wouldn't be able to perform as well as it does if it hadn't learned a pretty rich set of heuristics about how words in sentences relate to each other and the concepts expressed, even if it doesn't really understand the text anymore than a CNN knows what a cat is.
Hmm, interesting, I for some reason thought GPT was still a recurrent design of some kind (like long short-term memory). I guess we are in violent agreement here; reading up on its structure, it is very interesting but also seems profoundly unsuitable for, internally, learning the kind of thinking that our "designated sneer targets" freak out so much about. edit: I think their concern is basically that a neural network that learns to predict the next word would learn some approximation of the process that leads to a human typing that word. Good point about giant fully connected networks (recurrent or otherwise). I'd attribute that to them being even more over-parametrized. You need constraints of some sort, so that it tends to find a less information-intense representation.
I wouldn't be surprised if it's learning some approximation of the sort of low-level automatic language rules our brains apply unconsciously when we talk, but it's not understanding or reasoning the way the singularitarians would like to believe it is. Somebody said, in regard to machine learning, that we've had rapid progress in learning to do the things humans do automatically without having to think about it, but thinking itself is still a big question mark. Yes, so far I believe progress in neural nets has consisted of finding architectures that have inbuilt constraints and structures such that the network is good at learning what you want it to do. The cost surface slopes nicely down to the parameter values that do what you want, and not to plateaus or saddle points or blind alleys. The constraints remove entire unproductive regions of exploration. Finding those architectures has been part insight, and a large part trial and error.
I'd think the relation to human processes could be akin to that between interpolating between values in an old-style book of tables and some way of actually computing it (e.g. with a ruler and a protractor, or some algorithm). Neural networks certainly do not seem particularly prone to actually figuring out somehow homologous algorithms.