People defending AI scraping practices say: "Well, humans do the same. They read, then remember what they've read." That analogy is flawed:
1. We pay for access
2. We don't swallow terabytes per second
3. We don't constantly retrain our parameters on terabytes of the same data
4. We don't resell those patterns
5. We are not commercial machines
nytimes.com/2023/12/27/busines…
New York Times Sues OpenAI and Microsoft Over Use of Copyrighted Work
Millions of articles from The New York Times were used to train chatbots that now compete with it, the lawsuit said. — Michael M. Grynbaum (The New York Times)
modulux, in reply to iA:
I disagree:
iA, in reply to modulux:
[post content not captured]

modulux, in reply to iA:
in reply to iA • • •Where's the 40 b/s datum from? A human recognising a voice, a tune, a face or a scent needs more than 40 bits and is done in less than one second. You might be conflating the speed of processing of individual neurons (about 40hz?) with the capacity of the system as a whole.
The amount of data and the rate at which it is used isn't especially relevant, unless it became a bandwidth issue, but that's not what the complaint is about.
The rhetorical ploy of calling it stolen is both prejudging the result (there's an EU datamining exception to copyright law, for example) and relying on intuitions that might hold for non-replicable goods, but definitely don't hold for information.
It's a dismal world where the knowledge of all is held hostage to the profit of some.
I've consistently opposed information monopolies, from the days of PGP, Napster and Gnutella, through the attempt to jail the DeCSS author for performing illegal maths, the encroachment of software patentability in the EU, and Oracle deciding it got to own APIs. In the end, you want to forbid people from performing statistics on text corpora. This is such an unreasonable extension of the already too-strong intellectual-property field that it must be opposed.
iA, in reply to modulux:
The IP issue is real, but you're not helping by freeing trillion-dollar enterprises (building AI monopolies) from the IP shackles they'd happily impose on others. Copy ethics vs copyright: mastodon.online/@ia/1116606181…

40 b/s is for reading (the conscious mind); 10–100 million b/s is for perception (unconscious processing). Even unconscious processing is smaller by orders of magnitude. Source: my 90s uni lectures, but it's still easily findable today.
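To put those rates side by side, here is a back-of-envelope sketch. The two human figures are the ones quoted above; the 1 TB/s machine figure is a hypothetical round number standing in for "terabytes per second", not a measured value for any actual training pipeline.

```python
# Orders-of-magnitude comparison of the rates discussed above.
# Human figures are taken from the post; the machine figure is a
# hypothetical round number (1 TB/s), not a measurement.

conscious_reading_bps = 40        # ~40 b/s, conscious reading
perception_bps = 100e6            # upper bound of 10-100 million b/s, perception
machine_ingest_bps = 8e12         # hypothetical: 1 TB/s = 8e12 bits/s

# Ratio vs conscious reading: 2e11, roughly eleven orders of magnitude.
print(machine_ingest_bps / conscious_reading_bps)

# Ratio vs the most generous perception figure: 8e4,
# still nearly five orders of magnitude.
print(machine_ingest_bps / perception_bps)
```

Even granting the most generous human figure, the gap stays at several orders of magnitude, which is the point of the comparison.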
modulux, in reply to iA:
Thanks for the explanation of the 40 b/s, though I'm not sure the relevant datum is the conscious reading rate rather than the broader perception rate. After all, websites contain a lot more than pure text, and are presented as more than linear, disembodied symbols.
Regarding the IP issue, it concerns me that people jump so quickly to maximalist "stealing" rhetoric. Ultimately, I'm not especially worried about OpenAI and its friends: they have money and can pay licences. What worries me is that this will raise barriers to entry, leaving the field to organisations that have already built their models or to people with effectively infinite money to spend. I think AI is far more interesting in its potential to solve specific users' problems when trained and run locally.

A practical example: multimodal AI for describing images (which relies on large tagged image datasets) makes things accessible to me which otherwise never could be. There is no reasonable world in which I could get a sighted person to answer my every random question about images, nor one in which I'd want to bother a person with that. But if we start from the premise that running statistics on text is stealing, that sort of modelling becomes prohibitive. I certainly won't be able to do it, and it will have to live in the cloud, on someone else's computer. So when I want to take a picture of something in my room and get a description, it would have to leave my control. And so on.
We shouldn't underestimate the damage that restrictions on copying and general-purpose computing can do.
iA, in reply to iA:
The common argument is that LLMs don't store the data in a database that they use directly in their responses. This is disingenuous. They store stolen data in a database that they use to train and retrain their weights and parameters. The pirated content is… on a different hard drive, used at a different point in the process.
However you may judge it right now, this will be a very interesting court case to follow: legally, economically, technically, rhetorically and philosophically.
iA, in reply to iA:
Copycats & Other Monsters
Oliver Reichenstein (Information Architects Inc.)