A reader brought to my attention the 2024 Stanford AI Index Report, and it's basically catnip for someone like me. I shall regale you with various charts from the report over the next few days, but let's start with the good and the ugly (we'll get to the bad later). Here's how basic AI performance has progressed over the past decade:
Impressive. Practically everything is now performing at human level except for high-level math, and it's been on a rocket recently, going from nothing to 90% in only two years. But here's the ugly:
Even the good AI models make up shit 20% of the time, and the bad ones might as well be Donald Trump. This is by far the biggest short-term Achilles heel of AI. Nobody can rely on it for anything serious until this gets cleaned up.
I suspect there isn't a great way to make much progress on "the hallucination problem" other than RLHF or laborious generation of supplemental data.
What's missing is effectively the context of being human. We aren't "trained" solely on written outputs, and anyway are "agentic" with a self-modifying "reward function".
Humans don't generally write that context down (bedridden Frenchmen being the first exceptions that come to mind), because that would be as tedious to do as it would be useless to other humans.
Ah, the one(s) I read merely to take someone down a peg who was putting on airs. Presumably. I don't have near the stamina to read something like that these days, and wouldn't if you paid me handsomely to do so.
I wonder if AI could somehow be shamed; maybe it would start learning when it's embarrassed.
Even sociopaths can fake shame, so I'll bet it's possible.
What I continue to doubt is that AI can ever develop a sense of humor.
Anecdata:
I use ChatGPT at work. Ask something like "Write me a Python program to load file X, find all the instances of Y and Z, and write out the data between Y and Z into a separate file" and it does a great job.
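For anyone curious, the task being described really is only a few lines of Python, which is probably part of why ChatGPT handles it so well. Here's a minimal sketch; the file name and the Y/Z markers are placeholders, since the comment doesn't specify them:

```python
import re

# Placeholders -- the comment doesn't name the actual file or markers.
INPUT_PATH = "input.txt"
OUTPUT_PATH = "extracted.txt"
START, END = "Y", "Z"

with open(INPUT_PATH) as f:
    text = f.read()

# Non-greedy match grabs the data between each START...END pair.
pattern = re.escape(START) + r"(.*?)" + re.escape(END)
chunks = re.findall(pattern, text, re.DOTALL)

with open(OUTPUT_PATH, "w") as f:
    for chunk in chunks:
        f.write(chunk + "\n")
```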
Last night I asked it "in what episodes of Show X did Y appear as a guest star?" and got a very wrong answer. I told it it was wrong. It apologized and then gave me some more wrong answers. I doubt if it really felt shame.
That's probably it – it doesn't value anything. It just performs its instructions. It doesn't care if it's wrong or right, it just does backflips. That's fine for a tool. But a tool that constantly makes errors...
I spent a number of years working in "big data" -- remember that? The hype wasn't as big as AI's, but it was arguably a kind of prelude -- and one of the things we routinely tried to remind people of -- at least until it became clear it was hopeless and we gave up -- is that the large majority of the time and energy was spent on curation. That is, we couldn't do Amazing Things with our Big Data until we cleaned it up first, and that took a LOT of time AND domain expertise. In short, GIGO.
I don't offhand see any automatable way around this. So count me skeptical that this is a "short-term" problem for AI.
Not to mention that, at root, all this stuff is just applied linear algebra and probability/statistics.
It's not linear algebra, but graph mathematics...
But computers are great at brute forcing complex things.
Sigh. Tell me you don't know about back-propagation without telling me you don't know about back-propagation.
Tell me you don't know about graph mathematics or brute-force calculations by calling them 'linear algebra'.
It isn't graph mathematics.
At the simplest, it is taking the partial derivative of the scoring function with respect to the weight of each link, and then adjusting each weight proportionately to increase the predicted value of the scoring function.
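In symbols, for anyone who wants it spelled out (my notation, since the comment names none of it: $S$ is the scoring function, $w_{ij}$ the weight on the link from node $i$ to node $j$, and $\eta$ a small step size):

$$ w_{ij} \;\leftarrow\; w_{ij} + \eta \, \frac{\partial S}{\partial w_{ij}} $$

Apply that update to every weight, with the partial derivatives computed by the chain rule working backward from the output layer, and that's back-propagation in one line.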
Exactly right. I've been in the field for a while now and have come to the conclusion that Big Data works better with machine-generated data (e.g. system logs, web transactions, automated process data, etc.) than it does with human-entered data (e.g. sales notes, customer status, etc.). For the latter you need a lot of curation, and that just doesn't scale. Data puddles provide more reliable insight than data lakes.
I'm impressed by the many comments, like this one, insisting on human intelligence not so much as superior and ineffable, but as essential to AI. Prophets of AI like Kevin repeatedly tend to think of it as a contest, with the solitary genius pitted against the rising star. But AI is not acting on its own.
Someone has to program it, with an eye not necessarily to general intelligence so much as specific tasks, such as working with images, math, or text to a specific purpose, from therapy to information summaries. Someone has to return to tweak and correct those programs repeatedly to get better results or to adapt to other tasks. Someone has to select what source and training data to feed it, to obtain those data (legally or, judging from several suits over proprietary products, not), and to format them in a way the computer can use. That alone can be an endless task.
Most of all, someone has not necessarily to come up with better answers, but rather to direct the question. History suggests that is precisely what goes into the solutions of the past we take as genius. Otherwise, AI really will just be serving up what we already know or outright errors.
Still more, though, my repetition of "someone" simply reinforces the myth of AI or us as the solitary genius. In reality that someone is a team or an entire corporation of engineers and consultants.
In the end, we're getting a combination of better Web search engines with (and here's last year's breakthrough) enough natural language skills to serve up the results in a way we can call more or less AI.
Now, you can say that none of this matters. A team can still design a machine that "decides" to take over from its creator. But the point is not whether we will recreate 2001. It's whether we put everyone out of work, and that's very different. I still see no reason we're not just talking about terrible or (we can always hope) wonderful machines for carrying out tasks that benefit from remembering more than we can. That happened before, in the industrial revolution and indeed in the everyday use of browsers, and there's no reason it can't happen again. People may be employed more than ever, just not on the same tasks.
And I can add that, even if fewer of us had jobs, it's a political and not a technological decision whether that leads, as in Keynes's dream, to easier lives or, in Kevin's, to starvation. It's a matter of where the money goes: who gains, not which "species" gains.
For now, we're left with human-like voices delivering useful results but also way, way too many mistakes and platitudes. And doing so on jobs with far fewer components, in tasks, media, and information types, than the ones most of us would recognize as ours.
> And I can add that, even if fewer of us had jobs, it's a political and not a technological decision whether that leads, as in Keynes's dream, to easier lives or, in Kevin's, to starvation.
I find little comfort in this. Without the AI, starvation is not one of the choices. With AI, it's up to politics (manipulated by AI). The last election shows how much faith one can put in the electorate when they are exposed to even the current state of the art in lying.
Not sure about this particular analysis, but (some of) the articles in The Atlantic pointed out problems and limitations with AI. One of the things pointed out was that when doing "tests", the AI generated a lot of answers (hundreds?) to get one that was as good as or better than humans'. I'm pretty sure this is different from it refining its answer over and over again. The other thing was that there's no effective way to filter the results. It's not part of the AI, just code tacked on later.
The latest attempts are to generate reasoning/logic-type AIs. That is, use the LLM to take requests and break them down into smaller sections/questions. If you were to ask it a math problem, it would recognize it as a math problem and feed it into a calculator to get results. The competition-level math scores, though, still come from LLMs. One of The Atlantic (or Wired?) articles pointed out that they have been getting, say, the first digit of the answer right for a while, but it's been taking longer to get the second, then the third, then the fourth, etc. If the results are graded on whether the full answer is correct, it looks like the results are just starting to rocket up. If you look at answers being partially correct, then it looks like slow, steady progress as more digits fall into place. Basically, the "competition level math" results are a bit of a mirage.
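A toy version of that routing idea, just to make it concrete (my own sketch, not any real product's design; the `ask_llm` stub is hypothetical and stands in for whatever model API you'd actually call):

```python
import ast
import operator

# Deterministic "calculator" tool: evaluates plain arithmetic safely.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.Pow: operator.pow}

def calculator(expr: str) -> float:
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("not plain arithmetic")
    return walk(ast.parse(expr, mode="eval"))

def ask_llm(prompt: str) -> str:
    # Hypothetical stub -- swap in a real model call here.
    return f"(model's answer to {prompt!r})"

def answer(request: str) -> str:
    # Try the deterministic tool first; fall back to the model.
    try:
        return str(calculator(request))
    except (ValueError, SyntaxError):
        return ask_llm(request)

print(answer("3 * (4 + 5)"))                # 27, from the calculator
print(answer("What episodes featured Y?"))  # falls through to the stub
```

The point of the pattern is that anything the calculator can handle never touches the probabilistic model at all, which is why tool use helps with arithmetic but not with, say, guest-star trivia.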
That's... someone miscommunicating how the large language models work.
They have hundreds of possible answers, but they only select one to give to you.
Just like a person could think up lots of things to say (especially if they count rephrasings) but only get to provide one to the test.
In fairness, people hallucinate, are often confused, and believe things that aren't true, at least sometimes. AI still has a very long way to go.
"Nobody can rely on it for anything serious until this gets cleaned up."
As you say, the worst ones might as well be Trump...and we put him in charge of the country. AI, with its faults, should be sufficient to replace me.
The report you're relying on is bad, and I would highly recommend that you actually dig into their methodology here. What they're doing is using a specific benchmark, with publicly-available questions and solutions, for each task. For example, the MATH benchmark, which is the "mathematical reasoning" line in the chart, is 12500 questions and solutions, which you can find here:
https://github.com/hendrycks/math/?tab=readme-ov-file
For this and all the other benchmarks, _obviously_ OpenAI is training GPT on these datasets. That means that these benchmarks are worse than useless at telling us whether GPT is getting better at the abstract task. From what I understand, they're still pretty bad - tweaking constants or phrasing of a question is frequently enough to make the model no longer reproduce the answer it was trained on.
(So-called private benchmarks aren't much better; assessing GPT without leaking the underlying questions fundamentally requires an on-premise private deployment of GPT, which OpenAI will never consent to.)
Weirdly, those same challenges (tweaking constants or phrasing) work on humans too; they get them to screw up.
I use AI like I would use a fast inexperienced intern.
It can go through lots of documents and highlight issues to consider on a deal. But you still need to read the docs and see if it missed something.
It can go through 1,000 emails and find the right one based on a vague description without good keywords to narrow down the search. But you still need to read the email and make sure it reported the right information.
It is great at finding obscure information on the Internet but you still need to check the sources it provides.
It cannot replace human analysis.
I think the biggest source of hallucinations is that no AI really "understands" the data that they are ingesting, so while they can do a rudimentary semantic analysis of text, any unusual phrasing, etc., can throw them completely off track.
Case in point: I'm going to Ghana and need to take anti-malaria medication. I had some leftover pills from previous trips, of two different kinds, and I wanted to verify that what I thought was atovaquone-proguanil was indeed that. So I did a search on the pill shapes and markings, and confirmed what I thought: One pill was generic atovaquone-proguanil, and the other was brand-name Malarone, which should be the same thing. Except that Google's AI told me that the Malarone pills were 62.5mg/25mg, which I didn't think could be right. I did some more searching and determined that the pills were actually 250mg/100mg, as I had expected. The AI couldn't parse its source of information well enough to know that the 62.5mg/25mg pills were a different pediatric-dose pill.
That actually gives me hope. That is the same type of mistake a harried human looking something up might make. Unfortunately, "hallucination" has taken over the zeitgeist, and we now call every mistake an LLM makes by that name. By blurring the distinctions, we lose sight of the true state of AI and what remedies are available to make it more accurate.
Some AI systems can understand things, in the sense that they're given parameters and can hold those parameters in their predictions.
But LLMs specifically, that part can't understand anything; that much is true.
Once we marry these two things together, so the AI can use the former to check the latter...
Have fun in Ghana! If you haven't been there before, make sure you have some Jollof rice. And if you want to get under the skin of your hosts, tell them that Nigerian Jollof rice is far superior.
It's a thing in West Africa.
Strong overlap between the people who are really excited about the state of AI and the people who are really excited about standardized test results. Teaching to the test is not a revolution in education or technology and the results, while easy to chart and measure, are not that valuable in the long run.
Hallucinations are bad indeed, but the excessive power usage to produce this stuff is also a growing concern. Is LLM (aka "AI") actually worth it? Does it look as if its problems can be dealt with? What kind of actual value is it adding to human life?
AI will entertain you if you can afford it. Then it will steal from you. It will Hack into systems we rely on today at the behest of organized crime and corrupt governments. It will provide tools to wage war and surveillance on unnecessary people. All in all, a bad idea. I don’t know why so many are enthusiastic about it.
Too far off in the future?
👍👍👍👍👍
This is nonsense. It can not do that at all.
I busted a student recently for turning in an AI generated paper. 1. The writing style and organization were obviously canned and not written by a college freshman and 2. several citations in the bibliography were hallucinated. It took the name of a well-known scholar and just made up the title of a book for them. I think going forward that's going to be in my "why you don't want to cheat with AI" talk. It makes shit up. You won't know where the shit is, but I sure will.
My high-schooler reported that several kids used AI to write papers about short stories from a recent anthology they were assigned. The AI hallucinated everything in the papers, because there was no information about the stories in the training data. Sure, AI can write a paper about Moby Dick, but it can't write an essay about a short story from last month's New Yorker.
You can, however, feed it the short story and it can do it without hallucinating.
If you keep the scope of tasks within the data you've provided.
It’s ugly already. Or is this just hysteria?
https://www.harvardmagazine.com/2023/02/right-now-ai-hacking
From a human’s perspective, the AI-only hacking competition didn’t look like much. A half-dozen brightly colored server racks running sophisticated AI systems were arranged in a semi-circle on a stage in one of the hotel’s ballrooms; flickering LED lights on each machine were the only indicators that an all-out robot war was raging on DARPA’s network. But for Bruce Schneier, a computer-security expert and adjunct lecturer in public policy at the Harvard Kennedy School, what transpired that day was a sobering glimpse of a not-too-distant future when AIs can find and exploit vulnerabilities with superhuman speed, scope, scale, and sophistication. These future AI hackers won’t be limited to computers. They will hack financial, political, and social systems in unimaginable ways—and people might not even notice until it’s too late.
The good news is that while future AIs will be able to "find and exploit vulnerabilities with superhuman speed, scope, scale, and sophistication", other AIs will be able to "find and eliminate vulnerabilities with superhuman speed, scope, scale, and sophistication". Hackerbots vs. securitybots.
I just hope Wikipedia and the like have an effective policy here, would hate to see AI injecting hallucinations into history (as obscure references initially).
As long as there are humans checking these.
There is a pretty good chance that if AI researchers really start eliminating the BS results, the GOP Congress will pass legislation severely restricting AI use. Where would they be if AI started producing 100% correct answers *and* people actually trusted it? The entire GOP program would go down in flames. Much better to have AI that gives wrong answers.
Yeah, here's ChatGPT's conclusion about the link between vaccines and autism:
"The overwhelming body of scientific evidence supports the safety of vaccines and refutes the claim that they cause autism. Vaccines protect against serious diseases and save lives, making them one of the most significant public health achievements in history."
AI is clearly a nefarious liberal plot to brainwash patriotic Americans.
"Practically everything is now performing at human level ...."
Is that surprising given the vast database of things to choose from and optimize at mind boggling speeds?
From the list of takeaways in the summary to which Kevin links:
In addition to the power consumption implications in those figures, I think someone was listening too closely to Dr. Meredith: https://www.youtube.com/watch?v=_ecNAhlWq64
The full report is at: https://aiindex.stanford.edu/wp-content/uploads/2024/05/HAI_AI-Index-Report-2024.pdf
Coming in a bit late here, but it is interesting that the earlier AI trajectories seem to have an asymptote very near (and clearly not far above) the human baseline. Originally, I would have expected AI to continue past the human level without much slowdown. My question is, why does this asymptote even exist? Is it somehow embedded in the training sets?
The two most recent forms of AI seem to be approaching human performance very quickly, with less sign of a slowdown, at least so far. We should soon know whether they suddenly slow down, or rather, continue well past human potential. I believe this is very important to understanding the future of AI.
It becomes difficult to train beyond human levels, and for many literary tasks, there is no 'beyond' currently defined.
Hard to go past zero.
I've been watching AI from the view of academic libraries, and here are the things I have found:
People love to use the AI tools that hallucinate citations, and frequently use those hallucinated citations when they absolutely should not.
The AI tools that are designed for scholarly article searches are full of predatory journal content. Predatory journals are also now publishing more articles written by AI. You've basically got trash research generation and trash research elevation going gangbusters thanks to AI. I'm curious to see if retracted articles are being filtered out - haven't seen research on that yet.
The integrated AI in library databases isn't very useful. The toughest thing for students to do is to take their topic and turn it into all of the relevant keywords authors use, but the AI isn't generating alternative keywords. Instead it is mostly offering options to narrow a search (a much easier task) and summaries that aren't much of an improvement over the already-existing abstracts.
It would be amazing if databases could take a description of a topic, understand all of the context, and then deliver lists of articles that deal with all the relevant aspects of the topic, and also provide information about the quality of the publishing journal. That seems a long way off, though. If it does get developed, I expect it to be so expensive that few schools can afford it.
******BREAKING NEWS and VERY OT*********
Jimmy Carter has died
I didn't like him as president but in retrospect he was handed a very bad economy with high inflation. He brokered the Camp David accords between Menachem Begin and Anwar Sadat. He was a populist.
But he was a man of contradictions. He gave off the persona of a poor, uneducated peanut farmer yet he studied nuclear physics and became a millionaire. I loved his philanthropic nature AFTER he left the oval office.
I will say this about Carter that I cannot say about most politicians: he was true to his word.
Love him or hate him he did very good deeds in his life.
How much flak does Nixon get these days for taking us off the gold standard?
You don't have to agree with all that he did to acknowledge that he was one of the most fundamentally decent men to serve as President in my lifetime, maybe in all of US history.
And he put a solar panel on the roof of the White House, a small but significant gesture toward taking action on climate change. When I think of how Reagan undid that, I get angry.