One of the many pressing issues with Large Language Models (LLMs) is that they are trained on content that isn't theirs to consume.
Since most of what they consume sits on the open web, it's difficult for authors to withhold consent without also depriving legitimate agents (AKA humans, or "meat bags") of information.
Some well-meaning but naive developers have implored authors to add robots.txt rules intended to block LLM-associated crawlers.
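For reference, such rules look something like this (GPTBot and CCBot are two widely documented crawler user agents; the list of agents worth blocking changes constantly, and this is only an illustrative subset):

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

The catch, of course, is that robots.txt is purely advisory: it only works against crawlers that choose to honor it, which is exactly the problem the tools below try to address.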
Related from my bookmarks:
- iocaine, which I've been following for quite a while since @algernon does a great job of developing in the open:
  > This is deliberately malicious software, intended to cause harm. Do not deploy if you aren't fully comfortable with what you are doing. LLM scrapers are relentless and brutal, they will place additional burden on your server, even if you only serve static content.
- tobi of gotosocial about related fediverse stats poisoning
- Quixotic:
  > Quixotic is a program that will feed fake content to bots and robots.txt-ignoring LLM scrapers.
- Nepenthes:
  > Nepenthes […] is a tarpit intended to catch web crawlers. Specifically, it's targeting crawlers that scrape data for LLMs - but really, like the plants it is named after, it'll eat just about anything that finds its way inside.
- Anubis:
  > Anubis - Weigh the soul of incoming HTTP requests using proof-of-work to stop AI crawlers.
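To make the last idea concrete: a proof-of-work gate forces each client to burn CPU before it gets content, which is cheap for one human visitor but expensive for a scraper hammering thousands of pages. Here is a minimal sketch of that idea in Python (this is not Anubis's actual protocol, just the underlying hashcash-style mechanism; the function names and difficulty parameter are my own):

```python
import hashlib
import itertools

def find_nonce(challenge: str, difficulty: int = 4) -> int:
    """Brute-force a nonce whose SHA-256 digest starts with
    `difficulty` hex zeros. The client does this work."""
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce

def verify(challenge: str, nonce: int, difficulty: int = 4) -> bool:
    """Cheap server-side check: one hash, no search."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

The asymmetry is the point: `find_nonce` costs the client thousands of hash attempts on average, while `verify` costs the server exactly one.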