
Yes, AI is already better than undergrads and it’s undetectable

This is getting to be sort of old news, but a few weeks ago some professors from the University of Reading published the results of a comparison they did between students and GPT-4 on a set of psychology exams. Others have done similar things, but these guys went whole hog: the professors had AI produce answers to actual exams and then slipped those answers in with everyone else's. The grad students marking the test papers had no idea this was going on.

Long story short, the AI did well and the markers didn't notice. Here's the average grade distribution for the five exams:

Roughly speaking, the AI exams averaged a B+ while the actual students averaged a B-. Out of 63 phony exams, the markers flagged a problem with only four—and only two of those mentioned a suspicion of AI.

That's about it. The researchers concluded that this was bad, but not much else. However, I was personally fascinated with another chart in the study. Check out the letter grades given to students in British universities:

There are a full two grades above A+! You can get an A++ or an A*. On the other end, you can get an F-. And in between D and F there's a grade of E. I've always wondered what happened to the E. I guess we Americans dropped it along with all the U's in labour and colour and so forth.

28 thoughts on “Yes, AI is already better than undergrads and it’s undetectable”

    1. MikeTheMathGuy

      It depends on how hard you make the tests. In the courses I teach, I can design a test in which a talented but not outstanding student can get a 90%, and also one in which that same student would do well to get a 65%. The British system skews much more toward extremely hard tests.

      Early in my teaching career, I was assigned to teach a course that was both theoretical and a bit off the beaten track. I told the students up front that the exams were going to be very hard, and that 60% or above would be an A. At the end of the course, my grade distribution -- proportion of A's, B's, etc. -- was very much the same as in any other course I taught where the conversion from percentage to letter grade was much more traditional. In the university-administered teaching evaluations that students filled out, I was roasted for making the tests completely "unfair" -- even though, to be clear, the average grade in the class was probably higher than the department average. I never made that mistake again.

      1. sonofthereturnofaptidude

        This is why, in my last years of teaching, I switched to a grading system based on demonstrating mastery of specific standards. The whole idea of a grade that is an average is inimical to demonstrating that you've mastered all the material in the course. By requiring that all students demonstrate at least competency in every requirement, I raised the bar, I think, at least for the least able students. I did give them plenty of opportunities to demonstrate mastery.

        Deep in my heart of hearts I believe that numerical grades -- especially averages -- are misleading. Numbers imply objectivity and a 1-100 scale implies accuracy, neither of which numerical grades demonstrably possess.

    2. kahner

      it actually makes a lot more sense than the way we do it, if you're going to use letter grades. the letter grades are distributed evenly across the 0-100% range. i've always thought it was ridiculous that in the US we use letter grades but everything below 65% is just an F.

  1. kenalovell

    Getting two thirds of the exam wrong is a "marginal fail". There is a crisis in higher education in the US, Australia and apparently Great Britain, but it has nothing to do with DEI or liberal bias. It's a direct result of fee-paying students being regarded as customers who deserve to get what they paid for.

    1. MikeTheMathGuy

      See my response to pjcamp1905. Generally speaking, in the British system of higher education, exams are very difficult and graded strictly. Unless you look at the distributions of grades -- what percentage of students got A's, B's, etc. -- it's impossible to know whether standards are being maintained or if there is rampant grade inflation.

    2. ScentOfViolets

      Well, it depends on what weight class you're in; ever heard of the Putnam math exam? Twelve questions, scores ranging from 0 to 120 -- how hard could it be?

      Summary of statistics: 4296 students took the exam, with an average score of 11.2 and a median score of 2, the highest these statistics have been for the past 10 years. The cutoffs were: Fellow: 101-120, N1: 92-99, N2: 85-91, H: 61-83, I: 60, II: 48.5-59, Top 500: 31-48. Most of these cutoffs were unusually high.

      IOW the average score is like getting full credit for slightly more than one question, and the median score means the person sitting for the exam was awarded either two points out of a possible 10 on one question or one point on two separate questions. So while 28% is usually an 'F' with extreme prejudice, here it means you're in the top ten percent (a quick arithmetic check is below).

      TL;DR: getting 2/3 of an exam wrong is not necessarily indicative of poor performance.
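
      For what it's worth, here is a back-of-the-envelope check of those figures -- a minimal sketch in Python that uses only the numbers quoted in the summary above:

      total_points = 12 * 10               # 12 questions, 10 points each
      mean_score, median_score = 11.2, 2   # reported statistics
      print(mean_score / total_points)     # ~0.09: the average is barely one full question
      print(median_score / total_points)   # ~0.017: the median is 2 points out of 120
      print(0.28 * total_points)           # 33.6 points, which clears the reported Top 500 cutoff of 31
      print(500 / 4296)                    # ~0.116: the Top 500 is roughly the top tenth of entrants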

  2. Dana Decker

    When I leave feedback for an eBay purchase and the item and shipping were good, I usually award the seller an A+.

    Now I'm going to have to seriously consider A++ or A* if I want to keep up with the times.

  3. iamr4man

    As coroner I must aver
    I’ve thoroughly examined her
    And she’s not only merely dead
    She’s really most sincerely dead

  4. shapeofsociety

    On essays and papers that require citations, AI will often make up sources that don't exist. Those are pretty easy to detect. Hallucinations can be a giveaway too.

    There are ways to design assignments to make it less likely that AI will get a passing grade.

  5. cld

    I'll bet the E vanished because it seems too easy for an unscrupulous personality to just add another line under an F on a paper he shows his parents and claim he didn't really fail.

  6. Chondrite23

    It is remarkable that AI as currently implemented produces coherent answers. However, given the way it works, it is not too surprising that it scores well. If you train the AI on lots of psych textbooks and papers, then it should regurgitate reasonable answers.

  7. rick_jones

    I ask an AI to do the likes of “Draw a picture of a flea on a mouse on a cat on an alligator” or “Draw three erasers”.

    It fails on count or relationship at least half the time.

  8. Camasonian

    This is a complete BS comparison.

    Essentially what AI does is computer-driven googling.

    A fairer comparison would be to put AI against a group of students who are free to google answers themselves, rather than comparing AI to students who are recalling from memory.

    Or else unplug the AI computer from the web and let the unplugged AI computer compete against students.

    1. emh1969

      Actually this was an online test and students had access to all their materials and could have used "AI" as well. As I'll point out below, this is one of the many issues with the study.

  9. cmayo

    Psychology exams are largely regurgitation and writing synthesis - it is not surprising that an LLM* would do about as well as good students would, because regurgitating synthesized information is the ONLY thing they do well. And in the world of exams, undergrad psych exams are kind of an ideal place for LLMs to do LLM things, as they're not going to just make shit up as often when it's in the realm of (medical) facts and concepts as opposed to pizza recipes where it might tell you to put glue in the cheese.

    Nor is it surprising that the grad students grading the exams didn't flag many of them for it - that's probably not very high on their priority list when it comes to grading the exams.

    The real-world applications for this are limited and unconcerning.

    *LLMs are not AI by the definition of AI Kevin most commonly puts forth.

  10. emh1969

    So 4 points:

    1) This was an online test, so students had access to all their materials, the internet, and "AI" such as ChatGPT. While use of "AI" was technically prohibited, the authors speculate that its use was widespread amongst students. So this doesn't show that "AI" was better than undergrads, since we don't know which undergrads used "AI" and which didn't. Instead it may be that the researchers used "better AI" or better prompts.

    2) This raises a second point. Is "AI" easier to detect when it's widespread or when it's not? I suspect it's easier to spot when it's rare, but that would make an interesting experiment. Anyway, assuming my hunch is correct, it's possible the markers didn't detect "AI" because everything was "AI".

    3) As others have pointed out, this was mostly a regurgitation task which doesn't prove much.

    4) The exam consisted of two types of questions. Some required short answers, others long essays. Both had strict word counts. Unfortunately "AI" had a hard time with the word count, consistently producing too many or far too few words. So the researchers had to "trick" the "AI" by giving it different word count limits than the actual exam limits. Also, the researchers gave specific instructions to NOT include references, yet many of the "AI" generated answers included references.

    This last point, in my opinion, is damning. How intelligent can "AI" be if it can't even follow simple instructions?

  11. golack

    In defense of the TA's, accusing a student of cheating means a lot of work for them. If it's clear cut, then it's not much of a problem. To say someone is using AI, they'd have to note changes from the student's previous work, e.g. different wording. Not possible in this situation. To say it is plagiarism, you'd have to have read and remembered the original source--or use a program that does that. The focus there would be on other students in the class, class resources (e.g. the textbook, if one was used), and some primary literature. Paraphrasing blog posts about a topic, as long as they were reasonable, would not be caught.

    TA's would be grateful for proper sentence structure (with or without the Oxford comma). As for any odd turn of phrase, the classes tend to have people with varied backgrounds, including English as a second language, so that would not be that unusual--unless the graders cued in on the words AI likes to use. In a typical university setting, the TA's would be the ones working with students week in/week out, and would get familiar with their writing styles. Clearly that can't happen in this case.

  12. azumbrunn

    It would have been nice (even necessary for a full investigation) to have the same set graded by people who were told that there were AI tests mixed in: How good would they be at picking out the fakes? As it is, this is just a stunt.

    BTW here is an anecdote from the very early days of my life in the US: I had just arrived from Switzerland and was attending one of those team meetings that managers love (they have all their sheep together in one room--it raises the mood of most managers). A young woman gave a presentation on a biochemistry topic (not my expertise, but close to it). I was impressed with her presentation; it was smooth and sleek and competent. I learned later that the woman was a lab technician with zero understanding of the science (and no ambition to understand it). It is fairly easy to fake the slang of competence (even easier in the "soft" sciences like psychology); all it takes is some chutzpah, and Americans are particularly good at it (probably because they are exceptionally competitive-spirited). And that is precisely what ChatGPT is actually trained on: faking the slang of understanding.

  13. azumbrunn

    To put it another way: Kevin seems to suggest that AI "understands" the topics it is opining on. In fact it has zero cognition.

    1. jeffreycmcmahon

      Correct. If "AI" can do well on an exam, it means there's something wrong with the exam. In the real world, just copying and pasting a semblance of words into something that seems like an answer will only get you a job at the NYTimes or the Atlantic.

      1. cmayo

        Well, there are some areas where regurgitation exams make sense and are necessary. Medical fields, including psychology, are among them.

  14. Narsham

    How does it do on beginning Calculus exams? Can it "show its work" properly? Can it even arrive at the correct answer to mathematical questions?

    How does it do on multiple choice or other segments of an exam where the answers are correct or incorrect and not scored by a human being?

    How does it do in an upper-division course graded by a professor teaching the class who reads fifteen or sixteen essays and not hundreds?

    How does it do on this particular test when the people conducting it aren't making multiple changes to ensure it doesn't provide a 4000-word answer to a "maximum 2000 words" question?

  15. buckyor

    On my first econ midterm at a fairly well-regarded private college, I was despondent when I got it back with a 68 -- until Professor Cohn put the curve on the blackboard, which identified everything above a 62 as an A.
