
Do AIs use too much of your personal data?

Large language models like ChatGPT are trained on vast amounts of information hoovered up from every corner of the internet:

"In the absence of meaningful privacy regulations, that means that people can scrape really widely all over the internet, take anything that is ‘publicly available’ — that top layer of the internet for lack of a better term — and just use it in their product," said Ben Winters, who leads the Electronic Privacy Information Center’s AI and Human Rights Project and co-authored its report on generative AI harms.

…For creators — writers, musicians, and actors, for instance — copyrights and image rights are a major issue, and it’s pretty obvious why. Generative AI models have both been trained on their work and could put them out of work in the future.

That’s why comedian Sarah Silverman is suing OpenAI and Meta as part of a class action lawsuit. She alleges that the two companies trained off of her written work by using datasets that contained text from her book, The Bedwetter. There are also lawsuits over image rights and the use of open source computer code.

I dunno. Google scrapes every bit as much data to "train" its search engine, which is a source of enormous profit. The Wayback Machine is nonprofit, but it literally duplicates huge amounts of copyrighted material and makes it publicly available. Wikipedia relies on an army of volunteers, not automated scraping, but the result is similar: an enormous website that's the product of gathering information from all over the internet. Google Books explicitly scans copyrighted works and the Supreme Court ruled that it was "transformative" and therefore A-OK.

Maybe all this stuff should be illegal. But that would sure make the internet a lot less useful for everyone. And training an AI is at least as transformative as creating an index of books. It's hard to see these lawsuits going anywhere.

33 thoughts on “Do AIs use too much of your personal data?”

  1. cedichou

    Google may use the same data to turn a profit, but it provides a service: it gives a pointer to the data. It profits from the data by showing you where the data is. It steers traffic to your website.
    Would Kevin be able to run jabberwocking without Google returning its domain name for a query on Kevin Drum?

    ChatGPT doesn't return anything useful to me if it scrapes my data. It uses my work to generate a new text/answer without attribution, pointer, or acknowledgement. That's quite different, isn't it?

    1. Bobber

      "Would Kevin be able to run jabberwocking without Google returning its domain name for a query on Kevin Drum?"

      Most of his readers likely just followed him when he left Mother Jones. I know I never googled him to find jabberwocking. But Google is useful for getting him new followers.

      Bing Chat(GPT) does provide references. It's important to look at those references because it often "misinterprets" them.

      1. Batchman

        FWIW, I started following Kevin because Mark Evanier frequently posted links to Kevin's articles on his newsfromme.com blog.

  2. Doctor Jay

    Google Search does not reproduce the work that has been processed. It might quote a bit, but that's attributed and fair use.

    The Wayback Machine is attributed, and not for profit. I'm pretty sure you can ask to be excluded, just as you can have your website excluded from Google Search.

    I presume there's a way to ask Google Books to remove your work in particular, but again, it's attributed.

    The way ChatGPT works, nothing is attributed. Its output can be presented as something it just came up with. You can't exclude your work from its processing either. It does seem worse. I can't say whether it's legal or not. It is entirely possible to exclude specific works from the corpus it trains on, though, and I think this is a very defensible position for a copyright holder to take.

  3. Five Parrots in a Shoe

    The purpose of the internet, going all the way back to the ARPANET days, is to share information. Trying to keep data secret and secure, and even just trying to enforce copyright, is super difficult simply because it goes against what the internet is really for.

    If ChatGPT scrapes my data it may or may not return anything useful to me, but it may well be useful for someone else out there. Which is OK with me. It's not like I ever intended to profit from my comments on this blog.

  4. ScentOfViolets

    The intertubes would be a much nicer place if everyone just paid up, i.e., if it ran on a subscription rather than an ad-based model.

  5. D_Ohrk_E1

    I'd like to know if GPT and other LLMs examine Creative Commons licenses while scraping and skip all but CC0 -- “No Rights Reserved” / public domain.

    I'd also like to know if these LLMs (especially Google's) abide by robots.txt and X-Robots-Tag rules.
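
    To make the question concrete, here is a minimal sketch of what honoring robots.txt would look like, using only Python's standard library. The "HypotheticalAIBot" user-agent string is made up for illustration, and this checks only robots.txt, not the X-Robots-Tag response header.

    # Sketch: ask a site's robots.txt whether a given crawler may fetch a URL.
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    def may_fetch(url: str, user_agent: str = "HypotheticalAIBot") -> bool:
        """Return True if the site's robots.txt permits this agent to fetch url."""
        parts = urlparse(url)
        rp = RobotFileParser()
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()  # download and parse the site's robots.txt
        return rp.can_fetch(user_agent, url)

    print(may_fetch("https://www.jabberwocking.com/some-post/"))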

  6. cooner

    Yeah, what folks said above. Google Search points you to the original source (or at least, it should/used to). Wayback Machine and Wikipedia give attribution or cite sources. It's arguable whether Google Books is fair from a copyright standpoint, but it's still presenting the original material as published with creator attribution.

    Not only is AI hoovering up text and images to mash together with absolutely no credit or compensation to the people who created the elements it's using*, but it's being created with the intent of replacing and taking away future work from the very writers, artists, and creators it's scraped all that information from.

    *Someone else can have the philosophical argument about whether all art is just copying and remixing prior art the creator has seen or experienced. I feel like that's a bad-faith argument when the entire commercial goal of generative AI seems to be replacing creative workers who've spent years honing their craft and style with faster, cheaper, less creative shadows.

    1. jlredford

      Not for the first time, I wish this comment system had a means of upvoting good comments like this, and downvoting bad takes like Kevin's.

  7. painedumonde

    If it's hoovering without paying or cracking into private areas, what's the big deal? You put your information, content, and self out there - it's like walking down the street reciting your social security number. But the kerfuffle does point out the shockingly apparent normality of people and providers just leaving the tabs open for everybody.

    1. Narsham

      Just because information is publicly available doesn't mean it should be legal to do anything you want with that information.

      Someone trying to empty your bank account by using information you made publicly available is still committing a crime.

      And current privacy laws are based, at least in part, upon human limitations on such things as observing people and compiling data, limitations which no longer operate thanks to computers. That doesn't even factor in arguments claiming that if a computer program collects private information on you and stores it in a database no human being ever looks at, your privacy rights do not enter into the equation, and will not until a human looks at that information, regardless of what other programs access it.

      1. painedumonde

        You've understood my point. It is a crime, no doubt. But I would propose that the information exchanged between two entities (most of the time an individual and some corporation) would be expected to be protected, and at the very least remain private, especially from the layperson's viewpoint. And since it is apparently not protected to a great degree (I could be mistaken here, but I assume a copyright or trademark designation or even personal identification markings are not enough), there is likely some negligence here, from both sides of the exchange. But all that aside, how many stories and judgments about data theft, spyware, malware, and ransomware must we read before agencies (private, governmental, individual, and corporate) finally decide to close those tabs?

        And before anybody says, but it's not that easy... then maybe the system is the problem.

  8. Yikes

    I don't know if it is a problem in every instance, but it bears some examination, and probably should have gotten it sooner.

    Silverman has a point, but take professional publications. When I write an analysis of a newly enacted law, I may well "put it out there" but I put it out there fully credited, as an example of my skill and knowledge.

    I have seen what ChatGPT pulled in response to some legal questions, and depending on the situation what it pulled was good, and perhaps was written by someone who was planning on taking professional credit for it.

    Google would have directed someone to the credited version.

  9. lawnorder

    I don't see the problem with ChatGPT learning from copyrighted works. Humans do the same thing. Any successful author has read the works of any number of other authors who write in the same area, and learned from them. ChatGPT and similar programs can quite legitimately do the same, as long as the work they produce doesn't infringe on that copyright.

    1. aldoushickman

      "Any successful author has read the works of any number of other authors who write in the same area, and learned from them."

      And likely did it by buying copies of the works of those authors, or otherwise reading them pursuant to some sort of license or fair use. It's unclear if the LLMs did the same.

      (And it's further unclear whether it's legal for people to build a machine to do something at large scale just because humans are allowed to do it at human scale. For example, I can walk into a store because the shopowners have indicated that it's fine for people to do that. Nonetheless, I can't drive a bus into the same store. Similarly, I can access a website. But if I write a bot that accesses the website gajillions of times in quick succession, I've done something qualitatively different and possibly illegal.)

      "as long as the work they produce doesn't infringe on that copyright."

      Well, that's the rub, isn't it? If they produce something that's an unlicensed derivative work (or even potentially if the LLM people didn't have a license to access the work in the first place), that's potentially contrary to copyright law. And the very thing that this litigation is testing in court.

      1. lawnorder

        I know of no reason to believe, or even suspect, that AIs are obtaining illegal access to written works. I would presume that most of the content they access is publicly available without charge, and that if payment is required the programmers are making the required payments. Unless AIs are being written with the capability to crack paywalls, it couldn't be any other way.

        Everything is derivative in some way. This only becomes an issue if the derived work is extremely similar to the work it is derived from. For example, you can't just paraphrase someone else's work. However, if the derived work tells a different "story" than the work it is derived from, copyright infringement does not exist.

        1. aldoushickman

          Again, all you are saying is a variation of "if it doesn't violate copyrights, then there isn't a copyright violation, so what's the problem?" The issue is that copyrightable material--even if posted publicly--can be used in ways that violate copyright, so maybe there is a problem here.

          The question isn't whether or not the AI "crack[ed] paywalls," it's whether or not the use of copyrightable material violated the license terms (implicit or otherwise) of the work. I could post something publicly--say, an essay on a publicly-available website, or a video on a video platform--and you could read or watch it. But you couldn't necessarily copy that material and incorporate it into, say, a training manual for your employees if you didn't have a license to do so or if what you were doing didn't fall into a fair use category. So there could be an issue on the front end if an AI company is using other people's copyrighted works in databases to create their AI product.

          Similarly, there could be an issue on the back end. If you were to consume a lot of content that I posted, and then you created something very similar to that content, the question of whether or not that's a derivative work isn't resolved by simply saying "[e]verything is derivative in some way." There's a whole body of case law developing the contours of what is and isn't a derivative work, and AI systems that spit out works based on other people's works may or may not, categorically or in some instances, be impermissible derivative works.

          Think about it this way: let's say I create a piece of software that you upload a photo to, and then it recreates that photo as a mosaic of little images randomly selected by color from google image searches. Pretty neat, huh? It's all publicly available source material, and the final image doesn't necessarily look like any of the constituent images. You upload an image, get your mosaic, like it a bunch, and use it as the cover of a book you wrote, which sells a whole lot of copies. Has the copyright act been violated? Quite possibly. And the questions of by whom and when are not entirely obvious, either. The situation here with the LLMs is pretty similar.
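
          A minimal sketch of the mosaic software in this hypothetical, in Python with Pillow. Here the tiles are generated locally rather than pulled from image searches, so the example is self-contained; the filenames are made up.

          # Toy color mosaic: replace each block of a photo with the tile
          # whose average color is closest. Requires Pillow (pip install pillow).
          import random
          from PIL import Image, ImageStat

          TILE = 16  # tile edge length in pixels

          def make_tiles(n=256):
              # stand-in for scraped images: n randomly colored tiles
              return [Image.new("RGB", (TILE, TILE),
                                tuple(random.randrange(256) for _ in range(3)))
                      for _ in range(n)]

          def mosaic(photo, tiles):
              colors = [ImageStat.Stat(t).mean for t in tiles]
              out = Image.new("RGB", photo.size)
              for x in range(0, photo.width - TILE + 1, TILE):
                  for y in range(0, photo.height - TILE + 1, TILE):
                      block = photo.crop((x, y, x + TILE, y + TILE))
                      target = ImageStat.Stat(block).mean
                      # nearest tile by squared RGB distance
                      best = min(range(len(tiles)),
                                 key=lambda i: sum((colors[i][c] - target[c]) ** 2
                                                   for c in range(3)))
                      out.paste(tiles[best], (x, y))
              return out

          mosaic(Image.open("photo.jpg").convert("RGB"), make_tiles()).save("mosaic.png")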

          1. lawnorder

            The point is that LLMs are "learning" from the material they "read" (I use quotation marks because we are necessarily analogizing from human intellectual activity; the LLMs are not actually reading or learning). Humans learn from the material they read. That learning CANNOT be a copyright violation.

            So far, it appears that ChatGPT's "creations" are different enough from any of the input material that they do not even come close to violating copyright.

  10. cld

    If AIs are trying to learn human psychology from fiction or news sources or TV shows, they're only going to come up with the most extreme and abnormal examples.

    1. Salamander

      Exactly. Which will end up dumping more extreme and abnormal stuff into the info pool that trains future AIs, etc.

      For that matter, if AIs were trained in human discourse based on the "Comments" sections of, say, the major online news sources, they'd need to call it "HateGPT" or something. Many take full advantage of "On the Internet, nobody knows you're a dog."

      1. cld

        Who writes Florida?

        This from the last five minutes of web surfing,

        15 cases of leprosy this year in Florida,

        https://www.rawstory.com/florida-leprosy/

        Florida to pave streets with radioactive waste products,

        https://twitter.com/GeorgeTakei/status/1684630879782723585

        DeSantis Puppet Board Moves to Defund $8 Million Police Budget at Disney — The Same Cops Who Prevented Mass Shooting From Happening There,

        https://www.mediaite.com/opinion/desantis-puppet-board-moves-to-defund-8-million-police-budget-at-disney-the-same-cops-who-prevented-mass-shooting-from-happening-there/

        1. iamr4man

          So ironic to see DeSantis “Defund the police”. I guess it’s just another example of “every accusation is an admission”.

      2. Altoid

        I think it's potentially much worse, in terms of what these things will be able to spit out. Output is going to tend more and more toward unusable garbage.

        For an LLM it's all in the sources, isn't it? Well, in a very short while no one will be able to tell what's human-produced and what's AI-produced-- no markers are required, right?-- which means LLMs will be using their own output as source material. And AI output sure seems likely to increase exponentially in volume, and as it exponentiates it will form a greater and greater proportion of LLM source material.

        It will be like making a new xerox of the xerox you just made. You get readable copies the first few times, but see what they look like after 10 or 20 or 100 or 1000 iterations. Or think of a game of telephone. AI-produced texts will become their own self-referential dialects, increasingly distant from actual spoken languages. And potentially not really intelligible to humans.

        You might try to prevent this by cutting off all source materials in, say, 2022, but actual language in human use changes over time, so AI-produced text would pretty quickly begin to sound quaint-- imagine reading today from an LLM trained only on 18th-century English texts. It would be self-defeating.

        Admittedly I haven't thought very much about this, but it's surely something the gurus of LLM have thought through. Right?
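
        For what it's worth, a toy numeric version of the xerox effect is easy to write down. Under the cartoon assumption that "training" just means fitting a Gaussian to samples drawn from the previous generation's model, the fitted spread tends to shrink and the mean drifts as the generations pass: the tails, i.e. the unusual material, disappear. This is a sketch of the dynamic, not a claim about any real LLM.

        # Each generation fits a Gaussian to samples of the previous
        # generation's output instead of the original "human" data.
        import random
        import statistics

        random.seed(0)
        mu, sigma = 0.0, 1.0  # generation 0: fit to human-written data

        for gen in range(1, 201):
            # train on only 50 samples of the previous model's output
            synthetic = [random.gauss(mu, sigma) for _ in range(50)]
            mu, sigma = statistics.fmean(synthetic), statistics.stdev(synthetic)
            if gen % 50 == 0:
                print(f"generation {gen:3d}: mean={mu:+.3f}  stdev={sigma:.3f}")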

        1. cld

          Excellent point, and actually a ray of hope.

          After the novelty wears off people will need to see new, original content, and, so far, that can only come from real people.

  11. azumbrunn

    I like the irony: the tech industry is one of the main forces responsible for the insane tightening of intellectual property laws and their enforcement. Now those same laws get in the way of developing the newest fad of theirs. Congratulations!

  12. Jimm

    AI is not a special case of this, the "problem" is larger and already existed. When you exchange with a public vendor in a farmer's or other market, this vendor can remember you and know you may be a prospective customer again. Additionally, if the transaction is publicly viewable (by anyone other than the exchanging parties), you can't sue someone for this observation, and if the transaction is truly private between the two parties, both parties own the information about that transaction sans a contract that asserts otherwise.

    People need to understand the nature of exchanges, and the absence of adequate privacy laws in most countries. The state gets to overclassify almost everything they do, while expecting to be the 3rd party in every free exchange/transaction whether public or private.

    Demand both the freedom of information to the citizens (and electorate), and their right to privacy, while still respecting liberty and free exchange, and this balance will tip to the common folk, and against the "elite".

    1. Jimm

      The worst of these "elites" (and brats) will assert that your security will be threatened by asserting your independence, freedom, and common responsibility, as they always have, raising the spectre of terrorists and enemies and requiring their so-called "protection". Be smarter and wiser: danger is always out there, and many who propose to fix it are not sincere, looking to enrich themselves (with wealth, attention, or power). Not in every case, though, so measure each accordingly and (more importantly) evidentially.
