Hm. Does anyone happen to know the IP addresses Software Heritage uses to clone/crawl? Or their user agent?
Back in 2022, I requested my forge to be archived, and earlier today I requested they stop doing that (and remove all content archived from there).
I do not have much faith in them respecting the removal request, so I'll go the extra mile and set up some funky stuff so they'll clone some AI poisoning repos in the future.
It's pretty simple, really. You give it a Forgejo instance URL, it scans for public repos, and does a shallow clone of each to get a list of files. Then it creates a new repo with the same structure and filenames, but where all content is garbage (currently a randomly sized sample of an English word list).
It will then simply serve those garbage repos.
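A rough Python sketch of the garbage-generation step, assuming the shallow clone (e.g. `git clone --depth 1`) already sits on disk. The function names, the word-count bounds, and the choice to skip `.git` are all made up for illustration; this is not the actual tool, and repo discovery and serving are left out.

```python
import os
import random
from pathlib import Path


def garbage_text(words, rng, min_words=5, max_words=200):
    """Return a randomly sized run of words drawn from the word list."""
    count = rng.randint(min_words, max_words)
    # choices() samples with replacement, so short word lists still work.
    return " ".join(rng.choices(words, k=count)) + "\n"


def build_garbage_tree(src_root, dst_root, words, rng=None):
    """Mirror the file layout of src_root under dst_root, with garbage content.

    Returns the sorted list of relative paths that were written.
    """
    rng = rng or random.Random()
    src_root, dst_root = Path(src_root), Path(dst_root)
    written = []
    for dirpath, dirnames, filenames in os.walk(src_root):
        # Skip git metadata; only the working tree layout is mirrored.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            rel = (Path(dirpath) / name).relative_to(src_root)
            target = dst_root / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            # Same path and filename as the original, but none of its content.
            target.write_text(garbage_text(words, rng), encoding="utf-8")
            written.append(rel.as_posix())
    return sorted(written)
```

Running `git init`, `git add` and `git commit` in the output directory afterwards would turn it into a repo that can be served in place of the real one.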
With a few lines of nginx config and/or iptables rules, I can redirect requests coming from bad actors, and serve them garbage, while everyone else gets the real deal.
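The routing decision itself boils down to something like the Python sketch below; in practice it would live in nginx or iptables rather than in Python. The networks and the user agent marker are placeholders (reserved documentation ranges and an invented bot name), not the real Software Heritage values, which are still unknown here.

```python
import ipaddress

# Placeholders only: reserved documentation ranges, not real crawler addresses.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]
# Hypothetical user agent substring, matched case-insensitively.
BLOCKED_AGENT_MARKERS = ["examplebot"]


def should_serve_garbage(remote_addr, user_agent,
                         networks=BLOCKED_NETWORKS,
                         markers=BLOCKED_AGENT_MARKERS):
    """Return True if the request should be sent to the garbage repos."""
    addr = ipaddress.ip_address(remote_addr)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed lists are safe to check in one pass.
    if any(addr in net for net in networks):
        return True
    agent = (user_agent or "").lower()
    return any(marker in agent for marker in markers)
```

Matching on source address is the sturdier of the two checks, since a user agent is trivial to change.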