hcayless.bsky.social • 7 days ago

We’ve had some good results using Anubis (https://anubis.techaro.lol) to filter out this crap.

Comments

Basically, you have to do something to filter them, because they’re essentially a distributed denial of service attack at this point. Some folks have had success with Cloudflare Turnstile too.

hcayless.bsky.social•7 days ago

They’re clearly designed to hoover up content as fast as possible from big corporate operations which can stand up to the beating. When they get turned on your average academic content site, they turn it into a smoking crater.

ctschroeder.bsky.social•7 days ago

Someone wrote a blog post about this a few months ago that I forgot to save in Zotero. Does anyone remember it and where it is? I want to use it or something similar in class this fall.

hcayless.bsky.social•7 days ago

Maybe https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/? Or https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html (which the above references).

ctschroeder.bsky.social•7 days ago

Those were not it, but I really appreciate your posting them, bc I didn't know about them, and they're right on target. The blog post I was thinking about was GLAM-specific: about the crawlers as a de facto attack on open access GLAM resources.

hcayless.bsky.social•7 days ago

There's also https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects/, about the impact on Wikimedia.

ewg118.bsky.social•7 days ago

I am not sure I read it, but I did fill in a survey that was sent around about the effect of bots on GLAM resources.

ewg118.bsky.social•7 days ago

I think my approach will be something like fail2ban: a script or service reads the log file, reduces the IP addresses into clusters of IP ranges, and then emails a report and/or bans them at the firewall or in iptables.

ewg118.bsky.social•7 days ago

Clearly the crawlers' approach gets around typical fail2ban filters because 1) the user agents are randomized to look like different browsers and OSes and 2) there are only a handful of requests per IP. But you have a thousand IPs working in symphony.

ewg118.bsky.social•7 days ago

It does not appear that fail2ban is flexible enough to automatically ban IP ranges as they emerge as a problem.
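
The log-clustering idea from the posts above could be sketched roughly as follows. This is a minimal illustration, not a tested tool: it assumes access logs in Common/Combined Log Format with the client IP as the first field, handles IPv4 only, and uses a hypothetical function name (`cluster_by_prefix`) and arbitrary thresholds that would need tuning per site.

```python
import ipaddress
import re
from collections import defaultdict

# Assumption: the client IP is the first whitespace-delimited field of each line.
LOG_IP = re.compile(r"^(\S+) ")


def cluster_by_prefix(log_lines, prefix_len=24, min_requests=100, min_ips=20):
    """Group IPv4 clients into /prefix_len networks and flag the busy ones.

    A network is flagged only when it has many requests in total AND many
    distinct client IPs, which is the "few requests per IP, many IPs" pattern.
    """
    requests = defaultdict(int)   # network -> total request count
    clients = defaultdict(set)    # network -> distinct client addresses
    for line in log_lines:
        match = LOG_IP.match(line)
        if not match:
            continue
        try:
            addr = ipaddress.ip_address(match.group(1))
        except ValueError:
            continue  # first field was not an IP address
        if addr.version != 4:
            continue  # IPv6 is out of scope for this sketch
        # strict=False masks the host bits, e.g. 203.0.113.7/24 -> 203.0.113.0/24
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        requests[net] += 1
        clients[net].add(addr)
    flagged = [
        (net, requests[net], len(clients[net]))
        for net in requests
        if requests[net] >= min_requests and len(clients[net]) >= min_ips
    ]
    # Busiest networks first
    return sorted(flagged, key=lambda row: row[1], reverse=True)
```

Each returned tuple is (network, request count, distinct IP count). The networks are in CIDR form, so they could go into an emailed report or be handed to whatever firewall tooling is in use.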

ewg118.bsky.social•7 days ago

I'm not sure if this will work for our cases, since the APIs are vital internally and externally.

hcayless.bsky.social•7 days ago

That does complicate things, but you can create allow lists of IPs / user agents so that they don’t get challenged.
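
As a generic illustration of the allow-list idea (this is not Anubis's or Turnstile's actual configuration format), the check in front of a challenge might look something like the sketch below. The networks, user-agent tokens, and the `should_challenge` name are all hypothetical placeholders.

```python
import ipaddress

# Hypothetical allow lists; replace with real internal ranges and API clients.
ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
ALLOWED_AGENT_TOKENS = ["example-harvester/"]


def should_challenge(ip, user_agent):
    """Return False for allow-listed clients, True for everyone else."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in ALLOWED_NETWORKS):
        return False
    agent = user_agent.lower()
    if any(token in agent for token in ALLOWED_AGENT_TOKENS):
        return False
    return True
```

Since the crawlers discussed in this thread randomize their user agents, a user-agent allow list is easy to spoof once the token is known, so the IP-based list is probably the safer of the two where API clients have stable addresses.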
