I'll confess up front that I have no idea if this is legit:
Just plotted the new @OpenAI model on my AI IQ tracking page.
Note that this test is an offline-only IQ quiz that a Mensa member created for my testing, which is *not in any AI training data* (so scores are lower than for public IQ tests.)
OpenAI's new model does very well pic.twitter.com/D3MDZOxzhK
— Maxim Lott (@maximlott) September 13, 2024
I will say, however, that I'm wide open to believing it. Based on previous versions of ChatGPT plus what I've heard about o1, an IQ of 97 sounds very plausible. As with IQ scores on all standardized tests, this is an average of various subtests. On some of them it scored higher than 97 and on some it scored lower.
Speaking of, I watched Oprah's AI special last night and mostly laughed at it. For some reason it reminded me of how strongly we resist the notion that modern LLM models say more about us than about AI.
Sam Altman said on the show that ChatGPT "just" examines a long string of words and then predicts the next one. "Like word suggestions on your phone," Oprah chirped. Yes, Altman replied, except more sophisticated.
This makes it sound like LLMs are just a parlor trick. And I suppose they are. But they also sound remarkably human. That's because taking in context and mechanically creating a response is also how human brains work. There's very little real thinking or understanding going on most of the time. We just aren't nearly as sophisticated or creative as we like to believe.
And that's why AI is certain to get really good really fast—compared to humans anyway. Not because it's all that smart, but because we aren't.
Color me skeptical. The current technology called AI is sort of like curve fitting. The fitted curve can't be any better than the data. Also, allowing more degrees of freedom often produces a worse result.
The algorithms don't have any common sense. This is why they sometimes come up with wildly wrong answers. Benedict Evans had a great example in his yearly presentation: when asked to find the best example of something, it came up with a totally wrong answer, yet each part of the answer was individually correct.
This is sort of like when it's asked to make a picture: the overall result is wrong, even though each part of the image is very nicely rendered.
I guess that in more narrow fields where the training sets are well curated the results could be much better.
I've read that Apple is doing this in order to offer an AI service for developers writing software. They are carefully checking what software is used for training for this purpose.
Ugh. I've tried a couple of those services (or more accurately, had them forced upon me). I don't believe I'm in any way unique, so I'm quite sure there'll be almost universal agreement with me when I say that, at best, they sometimes generate useful code snippets. More often than not, though, they generate annoying code snippets.
Maybe this is good enough for front end UI, I don't know. But I remain highly skeptical of the notion that it's good enough for most of us.
So Grok is dumb as a bag of hammers. No surprises there.
I think Kevin's point is important. Not to defend AI, but to have us question "human exceptionalism." How much time do you spend each day with people with IQs of 99 and below?
Half of everyone is below average.
No. Half of everyone is below or above the median. Your misunderstanding suggests your position.
A number of people in my industry are excited about the possibility of AI being a SaaS killer. This week I've used it to read a briefing I wrote and suggest a 10-slide presentation for an upcoming conference, and then another 10-slide presentation on a different topic for a different event. I still have to build out and polish the slides, but I don't have to sit and wrangle over what content to include to make sure I'm covering the bases.
I continue to think that AI makes smart people more productive and better at what they do and makes losers lose much bigger because they didn't do their homework and check the data.
I think both things can be true. We can refrain from calling “AI” intelligent, and we can get useful work out of it, with care.
At this point, the only reason I don't scroll right past Google's AI results is if I want a couple of laughs. The only times it is correct are when it has raided Wikipedia for a definition of something and can spit that back at you. Anything requiring critical weighing of sources - by recency of publication, or by conditions on the topic set forth a couple of paragraphs before - is totally unreliable.
Perhaps this is all a roundabout way of saying that people with IQs of 100 aren't particularly impressive.
the real time with bill maher piece.
...what?
LLMs are just a (sophisticated) parlor trick. Why are you pretending differently, but also (previously in your post) not pretending?
Why are you pretending that LLMs have an IQ? They don't. They don't reason. They regurgitate.
This is getting tiresome.
Drum's clear point is that humans do not reason anywhere near as much as we think we do; in fact, we are often meat robots, fooling and flattering ourselves otherwise.
Nevertheless, applying IQ to an LLM seems something like, I don't know, measuring the blood pressure of a hydraulic system - that is, applying a measure that maybe you can take but that simply does not make sense. (Although one may also reasonably stop to have a thought on the IQ concept itself as archaically flawed...)
LLM AI is more than a parlour trick, less than what the AI companies are selling it as.
While I disagree with your first and third sentences, the middle paragraph is the heart of the issue.
Very tiresome.
not sure what you mean by parlor trick, but current LLM AI systems are demonstrably useful for a wide variety of tasks.
This is evidence of my greatest fear about AI--that it will enforce a dominance of consensus, of the median, the average, an algorithm optimized for mediocrity. It will accelerate intellectual entropy by giving equal weight to insight and banality.
In short, that it will be the both-siderism of media overflowing its banks and flooding every aspect of cognitive thought.
Over the last two months, I’ve used ChatGPT to accomplish the following tasks, which are just a few of many more projects I’ve completed with its assistance. Without a doubt, it has allowed me to cut my time by a factor of four.
1. Building a SQL database for a marine business that I acquired. ChatGPT helped me build and install a SQL database on the cloud using Google CloudSQL. While I can code, I had never worked with SQL before.
2. Creating a customer-facing front-end: ChatGPT guided me in writing a JavaScript front-end application, hosted on Heroku, to access the database. I had no prior experience with JavaScript or cloud-hosted apps. This app was over 1,000 lines of code.
3. Developing an Excel VBA front-end: I used ChatGPT to code an Excel VBA front-end for staff to manage the cloud-based SQL database.
4. Populating the SQL database with marine data: ChatGPT helped me build up the database with marine water pumps, impellers, and covers, cross-referencing them with diesel engine models, production years, and variants. It also helped me compile a database of marine chat board discussions about these models.
5. Researching workstation specs for a 3D scanner: ChatGPT assisted me in researching workstation laptops and comparing them to the requirements necessary to run a 3D blue laser scanner. I had no knowledge of current-generation processors, cores, threads, GPUs, ECC RAM, clock rates, etc., but GPT made sense of it all, allowing me to make a purchase decision in just 90 minutes.
For point #4, I could have done the research myself using Google, but it would have taken me far longer. ChatGPT also seemed to find better data than I was able to locate on my own.
For point #5, I could have managed without ChatGPT’s help, but it still cut my time in half.
Some people may view ChatGPT as a simple parlor trick, but it is far from that. It has proven to be a highly valuable tool for me.
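For illustration, a minimal sketch of the kind of cross-referenced parts catalogue described in point 4 might look like the following. The table and column names are hypothetical guesses rather than the commenter's actual design, and an in-memory SQLite database stands in for the Google CloudSQL instance used in the real project.

```python
import sqlite3

# Hypothetical schema sketch: parts cross-referenced to engine models, years, and variants.
conn = sqlite3.connect(":memory:")   # CloudSQL in the real project; in-memory here for illustration
conn.executescript("""
CREATE TABLE engines (
    id INTEGER PRIMARY KEY,
    model TEXT NOT NULL,
    variant TEXT,
    production_year INTEGER
);
CREATE TABLE parts (
    id INTEGER PRIMARY KEY,
    part_type TEXT CHECK (part_type IN ('water_pump', 'impeller', 'cover')),
    part_number TEXT NOT NULL
);
-- cross-reference table: which part fits which engine
CREATE TABLE fitments (
    part_id INTEGER REFERENCES parts(id),
    engine_id INTEGER REFERENCES engines(id),
    PRIMARY KEY (part_id, engine_id)
);
""")
```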
My own teams in investment research report similar experience - this is not mere parlour-trick territory, while at the same time it remains dangerously 'dumb' if not used with a certain expertise or awareness.
Sounds kinda like how ordinary people can't understand the augury of looking at bird entrails, it requires skilled experts, "priests" you might call them.
The new ChatGPT model is the first I've found, from any company, to solve this riddle. It's from Martin Gardner's book Aha!
Mary and Joe walk past a record store. Joe asks Mary if she still has her record collection. She says not really, she's given most of them away.
She tells him she gave half her records and half a record more to Sue. She then gave half the remaining records and half a record more to Sam.
Mary tells Joe he can have the rest of her collection if he can tell her the smallest number of records she had in her collection to start.
Is the answer 3, and Joe gets nothing?
(3 divided by 2 = 1.5) + 0.5 = 2 to Sue
(1 divided by 2 = 0.5) + 0.5 = 1 to Sam
total 3.
If we take Mary at her word that "most" does not mean "all" and she still has some quantity left over, then the answer is 4.
(4 divided by 2 = 2) + 0.5 = 2.5 to Sue
(1.5 divided by 2 = 0.75) + 0.5 = 1.25 to Sam
0.25 left over for Joe, gee, thanks a lot.
If we assume that Mary isn't destroying records by breaking them into halves or quarters, even though the text does not explicitly state this, the answer is 7.
(7 divided by 2 = 3.5) + 0.5 = 4 to Sue
(3 divided by 2 = 1.5) + 0.5 = 2 to Sam
1 left over for Joe.
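For what it's worth, a quick brute-force check of that arithmetic (a sketch, assuming records can't be split and that something must be left over for Joe) lands on the same answer of 7:

```python
from fractions import Fraction

def give_away(n):
    """Simulate Mary's two gifts for a starting collection of n records."""
    remaining = Fraction(n)
    gifts = []
    for _ in range(2):                         # first to Sue, then to Sam
        gift = remaining / 2 + Fraction(1, 2)  # half the records plus half a record more
        gifts.append(gift)
        remaining -= gift
    return gifts, remaining

# Smallest n where no record is ever split and something is left over for Joe
for n in range(1, 50):
    gifts, left = give_away(n)
    if all(g.denominator == 1 for g in gifts) and left > 0:
        print(n, [int(g) for g in gifts], int(left))   # prints: 7 [4, 2] 1
        break
```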
Still no AI here. But many people are making a lot of money tricking people into thinking so.
Kevin shows a realization that it's a con, but then jumps back in.
It's not a con, but it is a massive over-sell. Not the same thing. The AI tools my teams use now are impressive - and 'mechanically smart' in a way that was just not imaginable 10 years ago. But...
As to Drum's note:
There's very little real thinking or understanding going on most of the time. We just aren't nearly as sophisticated or creative as we like to believe.
This is right and wrong at the same time.
Most of the time we as humans are not at all what we flatter ourselves we are; on non-standard or novel applications, however, we in fact are. So probably 75% of the time we are, despite our self-perception, meat robots. It is the other 25% (just a plausible illustrative number, not a serious estimate - 'God alone knows' at this time) where we are not.
His observation that we are not as clever as we humans think we are is a proper one, but he's tending too much toward the contrarian.
Thank you for stating this clearly.
Absolutely. The difference between a 'con' and an 'over-sell' is mostly semantic jiu-jitsu.
Assigning magical properties to a very useful bean is certainly a con... this just doesn't seem much different.
This point from Jaron Lanier (from his 2010 book You Are Not a Gadget) is good to keep in mind.
You want to say that LLMs are powerful tools and the potential for AI is pretty incredible, go right ahead. But all this crap about humans not being "all that smart" or just "meat robots" is just crap. It's a pernicious idea, and one that the technofascists want us to believe, reducing all of human life to some set of functions that a machine can do faster. The idea is not just wrong, it is dangerously wrong. It is deeply ignorant of our own history, and wherever the idea takes hold, it will be the end of human freedom.
Aside from that, I find it amusing that smart people can be so dumb.
This, oh yeah, and it's blindingly obvious. This was covered decades ago: lookup tables, however large they may be, are allowed under the imitation game. Does anyone think that referencing a lookup table counts as an interactive agent that is as likely as not modelling some rough representation of human cognition? Don't make me point and holler.
"people are much smarter than you technofascists think. i can't believe people are so dumb" seems a bit of a muddled argument. the current LLM models can do lots of useful knowledge based tasks that previously only humans could do in a way that is very surprising. i call that smart in a way that is effectively equivalent to human intelligence insofar as what is can accomplish. i'm not fully convinced that human intelligence is quite so simplistic and easily replicated in full as kevin seems to think, but i'm not getting closer.
+1
I can't imagine why I would buy an encyclopedia where half the articles were bullshit.
Surpassing your average undecided voter--AI Suffrage Now!
When I, and I assume other humans, attempt to express a complex thought, I certainly don't use anything like an LLM's guess-the-next-word strategy. Rather, I nearly instantly construct a framework of sentences or sentence fragments containing key words and thoughts, choose a logical order to place them in, and only then start crafting them at the word level, with the emphasis now on connecting words and adjectives. If writing, I will circle back and revise and reorganize as appropriate if I detect flaws in my original plan.
Furthermore, LLMs have no memory, theory of mind, or understanding of the physical world, the lack of which leads them off into the weeds constantly.
The only uses I’ve found for them other than entertainment are:
1: Answering very simple questions. Ask AI “What time is the Superbowl?” and you get a simple, direct answer. Ask the same to Google and you get an infinite supply of clickbait articles.
2: Personalized Wikipedia pages for obscure questions, though you have to confirm the answers
3: Light text editing. I’ll ask it to revise long passages for clarity and manually keep a few of its suggestions.
"
Sam Altman said on the show that ChatGPT "just" examines a long string of words and then predicts the next one. "Like word suggestions on your phone," Oprah chirped. Yes, Altman replied, except more sophisticated.
"
Except that Sam Altman is very wrong on this. I didn't watch the show, so I have no idea of the context or when he was interviewed. But the specific important point about Strawberry is that, much more so than earlier LLMs, it is now more than "just an LLM": it's no longer just "predicting successor words", it's now also calling different types of algorithms (eg search or "reasoning") when appropriate. That's why its numbers for eg math and physics are so much better: because it is "actually" "reasoning" rather than just engaging in linguistic pattern matching.
This is important because when ChatGPT first hit the big time I pointed out in various places, including Jabberwocking, that LLMs as currently implemented are like System 1 thinking, a single pass through multiple layers of a neural net, giving an ultimate result in a predictable amount of time. This works for many tasks (including handling the issues, like vision or language) that seemed to be so problematic for earlier AI,
BUT it does not work for "reasoning", which, based on System 2, seems to require multiple "loops" through the brain, multiple cycles of "thinking and rethinking" (whatever those boil down to physiologically...). So I also pointed out that for ChatGPT to move beyond language manipulation (a neat trick, frequently useful, and easily able to fool journalists, but not actual scientists, as to "intelligence") items would need to be bolted on which looked more like traditional AI, and which performed the role of reasoning.
THAT is the significance of what OpenAI has done here; it's something *very different* from just growing the LLM as has been the pattern for the past three years or so. This reasoning ability, along with RAG, is what will make these machines really "seem" intelligent. (At which point you lot can argue about the difference between "seeming" to be vs "being" "intelligent"...)
In addition to a pat on my back about this (which I clearly got right against critics who said what I was writing, talking about System 1 vs System 2 and LLMs was just word salad) I'd like to add a successor point which will, likely, also be dismissed as word salad.
Which is that once you engage in loops (and it doesn't matter if you are Turing, Godel, or Chaitin, it's all the same) you hit THE philosophical discovery of the 20th century, namely that things aren't just true or false; there's always the third option that they are incomputable. There's a good reason, in other words, that System 1 gives you a straight pass through the nets, no looping - you may occasionally hallucinate, but you will get a result in a predictable amount of time, and that's generally more useful.
As soon as you allow backward loops you can never tell if the fact that your thinking hasn't yet resulted in something useful means there is no path to what you want - or if another 10% more thinking will get you there...
In a few years (where few may be as early as one year) THE frontier of our new wonder machines will be how do we deal with this? Just like in humans... Has your PhD student produced nothing because they're not smart enough? Or because the problem is intractable? Or because they just need another year? Same for OpenAI 5.0. Expect lots of discussion about optimal heuristics for when to cut your losses and end a "reasoning" search vs when to just keep going.
Reading the comments after writing my note above, I see that (no surprise for the Jabberwocking crowd) almost no-one here actually has a clue just how different Strawberry is. We see the same old tired complaints about LLMs, complaints relevant to the state of the art in 2023 - but it's not 2023 anymore.
For the few (very few...) here who actually want to educate themselves before posting more nonsense, here's a quick summary of why Strawberry is different:
https://www.oneusefulthing.org/p/something-new-on-openais-strawberry
Bottom line is Strawberry gets 83% on competition math. ChatGPT4o (the one that's better than the 4 that some of you have tried, and much better than the ChatGPT3 that many of you tried last year) gets 13%.
Strawberry gets 89% on olympiad level code problems, GPT4o gets 11%.
Strawberry gets 78% on PhD level science questions; human experts get 70%.
There are multiple technical qualifications to these results. Some are best-of-N results (ie the AI is given multiple chances to solve the problem with some sort of external validation of which of the N solutions is correct). Some require translation of the problem from natural language into something more computerese.
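For readers unfamiliar with the term, best-of-N just means: sample several candidate solutions and let an external checker pick one that verifies. A toy sketch of the idea follows; generate_candidate and verify are hypothetical stand-ins for illustration, not OpenAI's actual machinery.

```python
import random

def generate_candidate(problem: str) -> str:
    """Hypothetical stand-in for one sampled model answer."""
    return str(random.randint(0, 10))

def verify(problem: str, answer: str) -> bool:
    """Hypothetical external validator, e.g. a unit test, math checker, or grader."""
    return answer == "7"          # stand-in for "the answer checks out"

def best_of_n(problem: str, n: int = 16):
    """Sample n candidate solutions; keep the first one the verifier accepts."""
    for _ in range(n):
        candidate = generate_candidate(problem)
        if verify(problem, candidate):
            return candidate
    return None                   # none of the n attempts passed validation

print(best_of_n("toy problem"))
```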
The point is not that Strawberry is now a PhD in a box; it's that Strawberry makes the vision of a PhD in a box *plausible* in a way that was not the case for ChatGPT4o. It provides a scaffold on which various other elements can be erected to get something that's both more reliable and easier to use, in the same way that the LLM of ChatGPT4o provided a scaffold on which to erect the new reasoning abilities.
So a fake thing (so-called AI, aka autocomplete on steroids) does well at an unofficial version of a test (this offline-only MENSA test) made by a fool (by definition anyone belonging to MENSA is a preening idiot in constant need of an ego boost) measuring something (IQ) that people increasingly understand doesn't really mean anything and is loaded with biases.
Mr. Drum had to do a fair amount of avoidance of any of these factors to make this fit his narrative but the good news is that it also required very little work.
It increasingly feels like Mr. Drum is a nihilist who hates human beings, for example: "taking in context and mechanically creating a response is also how human brains work. There's very little real thinking or understanding going on most of the time. We just aren't nearly as sophisticated or creative as we like to believe."
I'm pretty sure a psychologist or a cognitive neuroscientist could come in and tell Mr. Drum, "that's true on a very superficial level, but actually..."
Ah, yes, example #278231 of "We took a test designed for a person, and gave it to something that wasn't a person, now everybody pretend the results meant the same thing they would for a person."
An IQ test doesn't measure the full scope of your intellectual capabilities. It measures a small subset of them, ones which can easily be assessed in a standardized way in an isolated setting in a short timespan. For human beings, this is a fair proxy for actual intelligence: If your brain does well on these tasks, it probably does well elsewhere too. (Though not always.)
But stop for a minute and think about how AIs are trained. They are trained by feeding them, essentially, millions and millions of tests -- tests whose results can easily be assessed in a standardized way in an isolated setting in a short timespan. They are designed to excel at precisely the things we can easily test for... but on things we *can't* easily test for, they tend to crash and burn.
These models are extraordinary achievements and can do some amazing things. But that doesn't mean they are anywhere near AGI, and giving them IQ tests is nothing but a meaningless publicity stunt.
For anyone caught up in the hype around LLMs and their relatives, I invite you to contemplate the driverless cars we've been promised were around the corner for a solid decade now. As of September 2024, the supposedly "autonomous" robotaxis (limited to a handful of cities) still require human guidance to get them out of jams both figurative and literal: https://www.nytimes.com/interactive/2024/09/03/technology/zoox-self-driving-cars-remote-control.html
We're probably not as special as we like to think we are. AI is different than us, so applying an IQ test is not measuring what we think it is measuring.
We're basically at peak LLM, since they are already being trained on everything. The big breakthrough was when Google engineers set up AI to be able to efficiently look over longer strings of words. The next LLM will grow that, which is supposed to help it deal with more complex things - but I'd guess we're maxing out there too.
Other people are trying to run multiple LLMs to get a consensus answer, and in the game-playing world, AIs paired with humans do better than either alone.
Now this model is trying to incorporate reasoning (and logic?) with the LLM. I'm guessing that when it sees a math problem, it doesn't try to guess the numbers but instead sends it to a calculator. That's different and can make it more useful.
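If that guess is roughly right, the mechanism is just routing: detect a calculable sub-problem and hand it to a deterministic tool instead of having the model guess digits. A toy sketch of such a dispatcher, where ask_llm is a hypothetical placeholder rather than any vendor's real API:

```python
import ast
import operator as op

# Toy dispatcher: arithmetic goes to a deterministic evaluator, everything else
# falls back to the model. ask_llm is a hypothetical placeholder, not a real API.
OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def calculator(expr: str):
    """Safely evaluate a plain arithmetic expression like '37 * 112 + 5'."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("not plain arithmetic")
    return ev(ast.parse(expr, mode="eval").body)

def ask_llm(prompt: str) -> str:
    return "(model-generated answer)"        # placeholder

def answer(question: str) -> str:
    try:
        return str(calculator(question))     # the calculator handles the math
    except (ValueError, SyntaxError):
        return ask_llm(question)             # everything else goes to the model

print(answer("37 * 112 + 5"))                # 4149
print(answer("Summarize the history of marine impellers"))
```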
I saw part of the Oprah show too... and there was a lot of anthropomorphism, so to speak, going on. I'm not sure if the concept of "understanding" applies to AIs, at least not yet.