One of my sites was close to being DoS’d by OpenAI’s crawler, along with a couple of other crawlers. Blocking them made the site much faster.
I’ll admit the software’s design of offering search suggestions as HTML links didn’t exactly help (it’s FOSS software used by hundreds of sites, so this issue likely applies to similar sites), but the sheer rate of requests turned this from pointless queries into a negligent security threat.
LLM scraping is a parasite on the internet, in the actual ecological sense of the word: it places a burden on other unwitting organisms (computer systems, in this case), making it harder for the host to survive or carry out its own necessary processes, solely for the parasite’s own benefit, while giving nothing to the host in return.

I know there’s an ongoing debate (both in the courts and on social media) about whether AI should have to pay royalties on its training data under copyright law, but I think these companies should at the very least be paying for the infrastructure they use while collecting the data, even free data, given that it costs the organisation hosting that data real money and resources to be scraped, and orders of magnitude more money and resources than serving the same data to individual people.
The case can certainly be made that copying is not theft, but copying is by no means free either, especially at the scale LLM scrapers operate.
While AI crawlers are a problem, I’m also kind of astonished that so many projects don’t use tools like rate limiters or IP blocklists. These are pretty simple to set up, cause little to no additional load, and don’t cause collateral damage for legitimate users who just happen to use a different browser.
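For a sense of how small the footprint is, here’s a minimal sketch of per-IP rate limiting with the iptables hashlimit module (the port and the limits are illustrative placeholders, tune them to your actual traffic):

    # Drop source IPs opening more than ~20 new HTTPS connections/second
    # (sustained), with a burst allowance of 100.
    iptables -A INPUT -p tcp --dport 443 --syn \
      -m hashlimit --hashlimit-name per_ip --hashlimit-mode srcip \
      --hashlimit-above 20/second --hashlimit-burst 100 \
      -j DROP

The tracking table lives in the kernel, so none of this adds load on the application itself.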
IP-based blocking is complicated once you are big enough or when providing service to users is critical.
For example, if you are providing a critical service such as health care, you cannot block a user from accessing health care info unless you have hard proof that they are causing a problem and you have done your best not to block them.
Say you have a household of 5 people with 20 devices on the LAN; one device can be infected and running some bot, but you do not want to block all 5 people and 20 devices.
Another example: double NAT, where you could have literally hundreds or even thousands of people behind one IP.
IP-based blocking is complicated once you are big enough
It’s literally as simple as importing an ipset into iptables and refreshing it from time to time. There are even predefined tools for that.
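If anyone wants a starting point, a sketch (the blocklist URL is a made-up placeholder; use whatever list you actually trust, or feed it from your own logs):

    # Create a set of bad networks and drop anything matching it
    ipset create crawlers hash:net -exist
    iptables -I INPUT -m set --match-set crawlers src -j DROP

    # Refresh the set from time to time, e.g. via cron
    curl -s https://example.com/bad-crawler-ranges.txt | while read -r net; do
      ipset add crawlers "$net" -exist
    done

The set lookup is a hash lookup in the kernel, so it adds basically no per-request cost.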
The article posted yesterday mentioned that a lot of these requests are made only once per IP address; the botnet is absolutely huge.
They’d better not attack too hard, because the whole internet runs on FOSS infrastructure, and it might stop working, lol.
Sad there’s no mention of running an Onion Service. That has built-in PoW for DoS protection, so you don’t have to be an asshole and block all of Brazil or China or Edge users.
Just use Tor, silly sysadmins
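To be fair to people who haven’t tried it: since Tor 0.4.8 the PoW defense is, as far as I remember, just a torrc toggle on the onion-service side. The option name below is from memory, so double-check man tor before copying (the paths and ports are placeholders):

    # Onion service with the proof-of-work DoS defense enabled
    cat >> /etc/tor/torrc <<'EOF'
    HiddenServiceDir /var/lib/tor/my_service/
    HiddenServicePort 80 127.0.0.1:8080
    HiddenServicePoWDefensesEnabled 1
    EOF

When the service is under load, clients get asked to solve progressively harder puzzles, so a flood of bot requests gets expensive while normal visitors barely notice.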