The New York Times has sued OpenAI for copyright infringement. They should lose.

The New York Times is suing Microsoft and OpenAI for training their AI software on Times stories:

In a complaint filed Wednesday, the Times said the technology companies exploited its content without permission to create their AI products, including OpenAI’s humanlike chatbot ChatGPT and Microsoft’s Copilot. The tools were trained on millions of pieces of Times content, the suit said, and draw on that material to serve up answers to users’ prompts.

....Tech companies building generative-AI tools have generally argued that content available on the open internet can be used to train their technologies under a legal provision called “fair use,” which allows for copyrighted material to be used without permission in certain circumstances. In its suit, the Times said the fair use argument shouldn’t apply because the AI tools can serve up, almost verbatim, large chunks of text from Times news articles.

I've expressed my opinion before that mere training is indeed fair use. If things were otherwise, everybody would be liable who had ever learned something from a Times article and then used it to publicly form an opinion or analysis.

But routinely serving up large verbatim chunks of stories is a different thing. This got me curious, so I asked GPT 4 about the Times' coverage of the James Comey letter regarding Hillary Clinton at the tail end of the 2016 presidential campaign:

In October 2016, the New York Times and other sources reported on a letter from then-FBI Director James Comey regarding the reopening of the investigation into Hillary Clinton's use of a private email server.

....The letter by Comey was criticized for its vagueness, as it acknowledged that the FBI did not know the contents of the emails in question or whether they were relevant to the investigation.

Etc.

That seemed fine, so I asked what the Times had actually said:

I'm unable to access the full content of the New York Times article from October 2016 regarding James Comey's letter about the Hillary Clinton email investigation. For detailed information and to read the article, I recommend visiting the New York Times website or accessing their archives directly.

This is just one query and might not represent typical behavior. But if it does, it appears that GPT 4 reads stuff from the Times and then summarizes it briefly the same way any human would do. Unless there's a lot more to it, including the ability to essentially act as a substitute for reading the Times, I still don't see anything wrong here. Treating this as copyright infringement would set an enormously dangerous precedent.
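
For anyone who wants to poke at this themselves, here is a minimal sketch of the same two probes run through the OpenAI Python client rather than the chat interface. The prompt wording is just illustrative, and you'd need your own API key:

    # Minimal sketch of the two probes above, via the OpenAI Python client
    # (openai >= 1.0). Requires OPENAI_API_KEY in the environment; the prompts
    # are illustrative, not the exact queries described in the post.
    from openai import OpenAI

    client = OpenAI()

    def ask(prompt):
        """Send one user prompt to GPT-4 and return the reply text."""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # Probe 1: ask for a summary of the coverage.
    print(ask("What did the New York Times report in October 2016 about "
              "James Comey's letter on the Hillary Clinton email investigation?"))

    # Probe 2: ask for the Times' actual wording.
    print(ask("What did the Times article itself say? Quote the text."))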

POSTSCRIPT: The lawsuit is here. It includes some examples of large-scale copying, mostly from Bing and mostly produced by laboriously asking for single sentences or paragraphs at a time. I'm skeptical that this is truly serious infringement since it's so artificial, but I can at least imagine a judge enjoining Microsoft from reproducing so much content. Overall, though, I remain unconvinced that training, summarization, and brief excerpts should be barred.

25 thoughts on “The New York Times has sued OpenAI for copyright infringement. They should lose.”

  1. cld

    It seems that it could only be about forming an opinion, or about referencing within the context of free speech, if the AI were conscious. It's more like you're using my stuff to build your product - as if you took my bricks to make your own building, and when I object you respond that it's a different building.

    1. lawnorder

      Trying to analogize physical property and intellectual property can get slippery. In the intellectual property cases, someone may copy your "bricks" but they don't actually deprive you of them. In any case, linguistic "bricks" are called words, and you can't copyright a word unless you invented it.

      1. cld

        But in this analogy the material isn't as simple as a single word. I'd get sued if I used someone else's article as a chapter in a book without crediting it or paying for it. By putting it out there like that I'm denying the author the fair use of his own invention.

        1. J. Frank Parnell

          Copyrights only protect the exact wording. You cannot use someone else’s work word for word as a chapter in your book without permission, but you are free to express all the thoughts and ideas of that person as long as you use your own words.

  2. Doctor Jay

    Well, if someone is ever going to sue ChatGPT etc. for this kind of thing, now is the time to do so. Let's find out how this works out.

    I mean, I have an opinion, but yeah, let's find out what the courts say, and let's find out as soon as we can, and adapt to whatever the new situation turns out to be.

    So I don't mind that they sue, even though I doubt they have much chance of success.

    So, is the NYT threatened by the existence of ChatGPT? I'm not sure it is, since ChatGPT depends heavily on its corpus - the body of written text it trains on. It could omit a few sources and do fine, but it cannot survive being excluded from all sources - such as all news agencies.

    So it's kind of a weird parasite of the corpus. But it might be a symbiont.

    News organizations kind of thought of (think of?) search engines as parasites, but they are in fact symbionts. They change the ecosystem, for sure. But maybe it has its upside.

  3. Larry Jones

    Since we know (from reading this blog) that pretty soon artificially intelligent computers and machines will put everybody out of work and out of business, and since we can be pretty certain that The Times is in it to earn money, how is this lawsuit anything other than a perfectly logical response upon learning that an early iteration of AI was trained on millions of pieces of Times content? AI development is notoriously unregulated, so why should the Times delay until the facts on the ground make it unlikely or impossible for them ever to get a handle on what is, for them, an existential crisis?

  4. shaldengeki

    Uh, I read the lawsuit as well, and while it's true that they _asked_ for single paragraphs at a time, in the exhibits you clearly see LLMs output multiple paragraphs in response. It doesn't seem laborious to me at all - in fact, it seems much easier than what you usually have to do to get sensible output out of an LLM.

    Seems weird to leave this in a postscript, and even then misleadingly phrased.

  5. Pingback: Training an AI is much harder than you think – Kevin Drum

  6. jte21

    If things were otherwise, everybody would be liable who had ever learned something from a Times article and then used it to publicly form an opinion or analysis.

    Sure, but if I'm a pundit or analyst (at least an honest one) and I use a NYT article in supporting or informing my opinion in a public forum, I should be citing/linking them and the reporter involved. Plus, I would also be paying the Times via a subscription to read that article. AI bots are perhaps paying for access to the Times database, but then using it to develop a profit-making product w/o attribution or permission.

    1. lawnorder

      If a NYT story reports on political events in Washington, DC, you can comment on those events without giving the Times credit, especially if the Times' report is not exclusive.

    2. Jasper_in_Boston

      AI bots are perhaps paying for access to the Times database, but then using it to develop a profit-making product w/o attribution or permission

      Why is this a problem? If I subscribe to the NY Times to access their recipes, and use said recipes at my highly profitable restaurant, I've wronged the Times?

  7. pjcamp1905

    Well, first of all, fair use does not cover indiscriminate use of entire works. It also requires the derivative work to be transformative but GPTx is just a big mixmaster. And finally, it requires the derivative work to not draw on an entire work but to use limited quotations. So I see an argument that this sort of scraping is not fair use at all. Just because something is on the web does not mean that nobody owns it anymore.

    Second, nothing that GPT4 does can even charitably be described as summarizing. That requires some thought, and all GPT4 does is stochastic pattern matching. It throws words and phrases together that have been in close proximity in the past.
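
    A toy sketch of that kind of proximity-based generation, for comparison - a bigram chain that just reuses word pairs it has seen before, which is vastly cruder than anything GPT-4 actually does:

        # Toy bigram Markov chain: pick each next word at random from the words
        # that followed the current word in the input text. Illustrative only.
        import random
        from collections import defaultdict

        def build_followers(text):
            """Map each word to the list of words observed right after it."""
            words = text.split()
            followers = defaultdict(list)
            for current, nxt in zip(words, words[1:]):
                followers[current].append(nxt)
            return followers

        def generate(followers, start, length=20):
            """Random-walk the chain from a starting word."""
            word, out = start, [start]
            for _ in range(length):
                options = followers.get(word)
                if not options:
                    break
                word = random.choice(options)
                out.append(word)
            return " ".join(out)

        followers = build_followers("the FBI did not know the contents of the emails in question")
        print(generate(followers, "the"))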

  8. DarkBrandon

    So OpenAI's products will think that Saddam had large stocks of WMDs and that keeping a private email server for government business is the most flagitious of all possible crimes.

    The Times was effectively a trade journal of email server provisioning and security protocols for 5 years. Does anyone else see a problem with sourcing its archives, after they treated every rumor about the Clintons as truth for three decades, and believed every word from Ahmed Chalabi's mouth?

  9. Kit

    I’m guessing that these sundry AIs will be able to show they used the underlying data correctly but that terms and conditions will be updated to prevent further such use. The written word is the new oil. The fight will be to determine who profits at whose expense. As the little guy, I’ll certainly be worse off, getting a more expensive and less capable product.

    I do think the Times has been hostile towards ChatGPT from the beginning, plowing endless effort into getting the AI to spit out something with a whiff of scandal rather than helping its readers understand what appears to be a dawning revolution.

    Also, there’s far too much self-assured speculation on how this technology actually works, all too often nearly aligning with a commenter’s priors. I wish I had confidence in the ability of the world’s courts to reach a surer understanding, but it seems obvious that we’ll quickly find ourselves with two dozen major rulings, each conflicting with the others.

  10. Jasper_in_Boston

    I can understand the NYTimes trying to get a piece of the action. Why not try? But it doesn't seem like OpenAI is remotely a competitor of (or a substitute for) any newspaper. In other words, OpenAI's success or lack thereof shouldn't remotely harm the NYTimes or any other publisher. How would it?

    I think the case is conceptually very shaky. OpenAI's use of newspaper articles to enrich functionality is no different from, say, a consulting firm that uses newspapers (among other sources of information) to bone up on a topic they need to be familiar with in order to serve clients. The NY Times is in the business of selling information. How can they object to people using the information they freely sell?

  11. realrobmac

    "If things were otherwise, everybody would be liable who had ever learned something from a Times article and then used it to publicly form an opinion or analysis."

    That's an awfully bold statement. So chat bots should be treated by the courts as if they were human beings and not profit-making property completely owned by giant corporations?

  12. ScentOfViolets

    Turn it around and ask instead how much it would cost to generate totally synthetic data to train these models on. What, you say - that would cost trillions of dollars? I think you have your answer in re copyright infringement right there.

    1. lawnorder

      What would it cost to recreate the whole of human knowledge from the techniques of generating and controlling fire onward? We all, including the "AIs", rely on the enormous body of knowledge generated by others over the millennia. That does not make us all copyright infringers.

  13. Chondrite23

    I’m not sure about this either way. ChatGPT is not a sentient being that is participating in our society like the rest of us. It is a machine that is taking content from the NYT and other sources and producing a profitable product.

    I don’t know about the legal issues, at a gut level it feels like the journalists who collect the information should receive just compensation.

    On a parallel track, note that Apple is in negotiations with publishers to pay them for access to their works in order to train their AI.
