How can we stop corporations from using Lemmy as a training dataset for AI?

Vegan T-34@lemmygrad.ml · 1 year ago

How can we stop corporations from using Lemmy as a training dataset for AI?

asudox@programming.dev · 1 year ago

You can’t stop them. Publicly available data can and will be a training source for LLMs.

redrum@lemmy.ml · 1 year ago

Instances could add this snippet to their robots.txt (source: Eff.org, businessinsider.com and nytimes.com/robots.txt):

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
User-agent: meta-externalagent
Disallow: /

Note: this only tells the crawlers of OpenAI, Google and Meta not to crawl the site to train an LLM; the nytimes robots.txt has a long list of other crawlers.
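
For anyone who wants to sanity-check rules like these before deploying them, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The `is_allowed` helper, the `example.org` URL and the `SomeOtherBot` name are made up for illustration. It only shows how a well-behaved parser would read the snippet; robots.txt is advisory, so it cannot show whether a given crawler actually obeys it.

```python
from urllib.robotparser import RobotFileParser

# The rules from the comment above, copied verbatim.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
User-agent: meta-externalagent
Disallow: /
"""


def is_allowed(user_agent: str, url: str = "https://example.org/post/1") -> bool:
    """Return True if the rules above let `user_agent` fetch `url`.

    The URL is a placeholder (assumption); any path on the site is
    covered by "Disallow: /".
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Google-Extended", "meta-externalagent", "SomeOtherBot"):
        print(agent, "allowed" if is_allowed(agent) else "blocked")
```

Any agent not named in the file (the hypothetical `SomeOtherBot` here) is still allowed, which is why longer lists such as the nytimes one exist.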

CaptainBasculin@lemmy.ml · 1 year ago

With the way federation works, not much. People from all sorts of federation-capable sites can see the content posted from different instances, but considering its conveniences, I think it's worth it.

mspencer712@programming.dev · 1 year ago

Broadly, this is about preventing plagiarism. We don’t want someone to scrape all our knowledge, remove the human connection and the references back to experts and people, and serve the information itself, uncredited.

But if a human can read something, so can a bot. I think ultimately we need legislation.