Hm. Does anyone happen to know the IP addresses Software Heritage uses to clone/crawl? Or their user agent?

Back in 2022, I requested that my forge be archived, and earlier today I requested that they stop doing that (and remove all content archived from there).

I do not have much faith in them respecting the removal request, so I'll go the extra mile and set up some funky stuff so they'll clone some AI poisoning repos in the future.

I now have a little thingy that mimics a Forgejo instance, one that looks a lot like mine. Except all repo content served is garbage.

I plan to clean that up over the weekend, and make the repo public.

It's pretty simple, really. You give it a Forgejo instance URL, it scans for public repos, and does a shallow clone of each to get a list of files. Then it creates a new repo with the same structure and filenames, but where all content is garbage (currently a randomly sized sample of an English word list).
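
A rough sketch of the generation step in Python, for illustration only. The function names, the word-count bounds, and the idea of passing in a ready-made list of relative file paths are all assumptions, not how the actual tool works; scanning the instance, cloning, and committing the result into a git repo are left out.

```python
import random
from pathlib import Path


def garbage_text(words, rng, min_words=50, max_words=500):
    """Return a randomly sized sample of words, joined by spaces.

    The bounds are arbitrary placeholders, not values from the real tool.
    """
    count = rng.randint(min_words, max_words)
    return " ".join(rng.choice(words) for _ in range(count)) + "\n"


def build_garbage_tree(file_paths, dest, words, seed=None):
    """Recreate each relative path under dest, filled with garbage text.

    file_paths is assumed to be the file list taken from a shallow clone
    (e.g. the output of `git ls-files`); only names are reused, no content.
    """
    rng = random.Random(seed)
    dest = Path(dest)
    written = []
    for rel in file_paths:
        target = dest / rel
        # Mirror the original directory structure.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(garbage_text(words, rng), encoding="utf-8")
        written.append(target)
    return written
```

Calling `build_garbage_tree(["README.md", "src/main.rs"], "out", words)` would leave an `out/` directory with the same layout as the original, ready to be committed and served.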

It will then simply serve those garbage repos.

With a few lines of nginx config and/or iptables rules, I can redirect requests coming from bad actors, and serve them garbage, while everyone else gets the real deal.
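
The same decision, sketched in Python instead of nginx or iptables, just to show the logic. The networks below are reserved documentation ranges and the user agent substring is made up; they are placeholders, since the real addresses and user agent are exactly what I'm asking about.

```python
import ipaddress

# Placeholder values (documentation ranges, invented agent name).
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]
BLOCKED_AGENT_SUBSTRINGS = ["examplecrawler"]


def is_bad_actor(remote_addr, user_agent):
    """True if the client address or user agent matches a blocked entry."""
    addr = ipaddress.ip_address(remote_addr)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True
    agent = (user_agent or "").lower()
    return any(fragment in agent for fragment in BLOCKED_AGENT_SUBSTRINGS)


def pick_upstream(remote_addr, user_agent):
    """Choose which backend should answer: the garbage one or the real forge."""
    return "garbage" if is_bad_actor(remote_addr, user_agent) else "real"
```

In a real setup this check would live in the reverse proxy or firewall, so the garbage server never has to know who it is talking to.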

I also need a name for this thing, a tool that builds HTTP-cloneable git repos with garbage in them for AI poisoning purposes (more context in previous toots). It's currently under the working title of "garbage", but I'd like something more creative.

The Fediverse helped me out with naming before. Hopefully, you all will help me out this time, too!

(Boosts appreciated)

@algernon It's kind of like a scraper tarpit, so maybe something along those lines?
