People defending AI scraping practices say: "Well, humans do the same: they read, then remember what they've read." That analogy is flawed:

1. We pay for access
2. We don't swallow terabytes per second
3. We don't constantly retrain our parameters on terabytes of the same data
4. We don't resell those patterns
5. We are not commercial machines

nytimes.com/2023/12/27/busines…

in reply to iA

I disagree:

  1. We don't always pay for access, and a world where access is always based on pay is abhorrent.
  2. We do, collectively, swallow terabytes per second. Would AI be ok if it were slower to train? How is this distinction relevant?
  3. We do that by remembering, thinking, conceptualising, and so on.
  4. Sure we do. For example, people incorporate the things they learn (both style and facts) into their own fictional works or essays.
in reply to modulux

You are comparing the collective human mental capacity to one cold, data-pirating commercial computer program? A single human can process about 40 bits per second, and has to adhere to the legal boundaries of the data available to them. Commercially owned AI repeatedly trains its parameters on exabytes of stolen data, scanning it to tune its statistical word prediction against a mountain of stolen material. And yes, it matters how much is stolen.
in reply to iA

Where's the 40 b/s datum from? A human recognising a voice, a tune, a face or a scent takes in more than 40 bits and does it in under a second. You might be conflating the firing rate of individual neurons (about 40 Hz?) with the capacity of the system as a whole.
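A rough back-of-the-envelope, with illustrative numbers I'm inventing purely for scale, of what a single glance involves:

```python
import math

# Illustrative numbers only; chosen to show scale, not measured.
pixels = 100 * 100             # a small 100x100 patch of the visual field
bits_per_pixel = 8             # 8-bit grayscale
raw_bits = pixels * bits_per_pixel
print(f"raw visual input for one glance: ~{raw_bits:,} bits in <1 s")

known_faces = 5_000            # rough guess at faces an adult can recognise
decision_bits = math.log2(known_faces)
print(f"naming one of {known_faces:,} faces: ~{decision_bits:.1f} bits of output")

print(f"raw input vs a 40 b/s budget: ~{raw_bits / 40:,.0f}x over")
```

Even on these toy numbers, a single glance overruns a 40 b/s channel by three orders of magnitude.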

The amount of data and the rate at which it is used aren't especially relevant unless they become a bandwidth issue, and that's not what the complaint is about.

The rhetorical ploy of calling it "stolen" both prejudges the result (there's an EU text-and-data-mining exception to copyright law, for example) and relies on intuitions that may hold for non-replicable goods but definitely don't hold for information.

It's a dismal world where the knowledge of all is hostage to the profit of some.

I've consistently opposed information monopolies, from the days of PGP, Napster and Gnutella, through the attempt to jail the DeCSS author for performing illegal maths, the encroachment of software patentability in the EU, and Oracle deciding it got to own APIs. In the end, you want to forbid people from running statistics on text corpora. This is such an unreasonable extension of the already too-strong intellectual property field that it must be opposed.

in reply to modulux

The IP issue is real, but you're not helping by freeing trillion-dollar enterprises (building AI monopolies) from the IP shackles they'd happily employ on others. Copy ethics vs copyright: mastodon.online/@ia/1116606181… 40 b/s is for reading (the conscious mind); perception (unconscious processing) runs at 10-100 million b/s. Even unconscious processing is still orders of magnitude below what these systems ingest. Source: my 90s university lectures, but it's still easily findable today.
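To put the orders-of-magnitude claim in concrete terms, a quick sketch; the AI ingest rate is a placeholder I'm assuming for illustration, not a measured figure for any real system:

```python
import math

# Human figures as above; the AI rate is a hypothetical placeholder.
human_reading_bps = 40           # conscious reading
human_perception_bps = 100e6     # upper end of the 10-100 million b/s range
ai_ingest_bps = 8e12             # assumed: 1 terabyte of training text per second

print(f"AI vs conscious reading:      ~10^{math.log10(ai_ingest_bps / human_reading_bps):.0f}x")
print(f"AI vs unconscious perception: ~10^{math.log10(ai_ingest_bps / human_perception_bps):.0f}x")
```

Even granting the whole unconscious bandwidth, the machine is still five orders of magnitude ahead on these assumptions.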


Copyright/IP adversaries and OpenAI/Microsoft are strange bedfellows. The issue with copyright and IP is that, over time, it has pretty much perverted its original idea (protecting the original author from being ripped off). IP law now allows money extraction by big commercial entities. As a small commercial entity, copyright is not your first priority. Copy ethics matter. They matter actively and passively: to you both as the one copying and the one being copied. ia.net/topics/copycats-and-oth…

in reply to iA

Thanks for the explanation on the 40 b/s, though I'm not sure the relevant datum is the conscious text-reading rate rather than the broader perception rate. After all, websites contain a lot more than pure text, and are presented as more than linear, disembodied symbols.

Regarding the IP issue, it concerns me that people jump so quickly to maximalist "stealing" rhetoric. Ultimately, I'm not especially worried about OpenAI and its friends: they have money and can pay licences. What worries me is that this will raise barriers to entry, leaving the field to organisations that have already trained their models, or to people with infinite money to spend.

I think AI is far more interesting in its potential to solve problems when trained and run locally for specific user needs. A practical example: multimodal AI for describing images (which relies on large tagged image datasets) makes things accessible to me which otherwise could never be. There is no reasonable world in which I could get a sighted person to answer my every random question about images, nor one in which I'd want to bother a person with that. But if we start from the premise that running statistics on text is stealing, that sort of modelling becomes prohibitive. I certainly won't be able to do it, and it will have to live in the cloud, on someone else's computer. So when I want to take a picture of something in my room and get a description, it has to leave my control. And so on.
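A minimal sketch of the local workflow I mean, assuming the Hugging Face transformers library and a small BLIP captioning checkpoint; the model name and file name are illustrative, not recommendations:

```python
# Captioning a photo entirely on-device: no cloud round-trip,
# the image never leaves my machine. Assumes the transformers
# library; checkpoint and file name are illustrative.
from transformers import pipeline

captioner = pipeline("image-to-text",
                     model="Salesforce/blip-image-captioning-base")

result = captioner("photo_of_my_room.jpg")
print(result[0]["generated_text"])
```

That is exactly the kind of thing that becomes impossible for individuals if every training corpus must be licensed at corporate rates.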

We shouldn't underestimate the damage that restrictions on copying and general-purpose computing can do.

in reply to iA

When Sarah Silverman sues, the contrarian in us thinks "Yeah, you're not that important for AI." Her case becomes clearer when the NYT sues with actual proof of stolen and resold information. It clarifies that, pars pro toto, Sarah Silverman (and the millions she represents) has as much of a case as the NYT. Of course tech optimists, AI ideologues, and those who are just capitalists when it suits them will viciously disagree. A tough case to dismiss, though, if you don't squint.
in reply to iA

The common argument is that LLMs don't store the data in a database that they use directly in their responses. This is disingenuous. They store stolen data in a database that they use to train and retrain their weights and parameters. The pirated content is… on a different hard drive, used at a different point in the process.
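A toy illustration of that point, nothing like a real LLM in scale or method: the text is consumed and discarded, but its statistics survive in the parameters.

```python
from collections import defaultdict

# Toy "training": fold a corpus into next-word counts, then discard it.
# A real LLM runs gradient descent over billions of parameters instead,
# but the structure is the same: the text shapes the weights even though
# no verbatim copy sits inside the finished model.
weights = defaultdict(lambda: defaultdict(int))

corpus = "the times reported the story and the times sold the story"
tokens = corpus.split()
for prev, nxt in zip(tokens, tokens[1:]):
    weights[prev][nxt] += 1      # parameter update driven by the text

del corpus, tokens               # the "database" is gone...

# ...yet the model still reproduces its statistics:
print(max(weights["the"], key=weights["the"].get))   # -> 'times'
```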

However you may judge this right now… legally, economically, technically, rhetorically, and philosophically, this will be a very interesting court case to follow.

in reply to iA

Legally, this is about copy*right*. Can you scrape data you got access to one way or another (is a personal license okay, or does it need a commercial account?) to train a commercial app that produces a comparable mashup of what you scraped and that you sell at global scale? Is it legally sound to do so when news organizations couldn't have anticipated your new technology? There is also a question of copy*ethics*. Does this really scale, morally? Can this behavioral pattern become common practice?
in reply to iA

General IT copy ethics should look at it from different scholarly ethical perspectives. For instance, deontological: does using a new technology to resell someone else's work through automated alteration at global scale define a pattern that can become common practice? Eudemonistic: does it further individual happiness? Virtue ethics: is it beautiful, truthful, good? Analytic: how does one talk about AI, human creativity, and copying in a coherent way?