Basically, you have to do something to filter them, because they’re essentially a distributed denial of service attack at this point. Some folks have had success with Cloudflare Turnstile too.
They’re clearly designed to hoover up content as fast as possible from big corporate operations that can stand up to the beating. When they get turned on your average academic content site, they turn it into a smoking crater.
Someone wrote a blog post about this a few months ago that I forgot to save in Zotero. Does anyone remember it and where it is? I want to use it or something similar in class this fall.
Those were not it, but I really appreciate your posting them, bc I didn't know about them, and they're right on target. The blog post I was thinking about was GLAM specific - about the crawlers as a de facto attack on open access GLAM resources.
I think my approach will be something like fail2ban: a script or service reads the log file, reduces the IP addresses into clusters of IP ranges, and then emails a report and/or bans them at the firewall or with iptables.
Clearly the crawlers' approach gets around typical fail2ban filters because 1) the user agents are randomized to look like different browsers and OSes and 2) there are only a handful of requests per IP. But you have a thousand IPs working in concert.
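For what it's worth, here is a rough Python sketch of the log-clustering idea, not a finished tool. It assumes the client IP is the first whitespace-separated field on each log line (as in the common Apache/nginx access log format), handles IPv4 only, and groups addresses into /24 networks. The function name `cluster_ips`, the /24 prefix, and the `min_ips` threshold are all made up for illustration and would need tuning against real logs.

```python
import ipaddress
from collections import defaultdict


def cluster_ips(log_lines, prefix=24, min_ips=3):
    """Group client IPs into /prefix networks and report the busy ones.

    Assumes the client IP is the first field of each log line.
    Returns (network, distinct_ips, total_requests) tuples, sorted by
    total requests, for networks with at least `min_ips` distinct IPs.
    """
    ips_by_net = defaultdict(set)
    hits_by_net = defaultdict(int)

    for line in log_lines:
        fields = line.split()
        if not fields:
            continue
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # first field is not an IP address; skip the line
        if addr.version != 4:
            continue  # this sketch only clusters IPv4 addresses
        # strict=False masks off the host bits, e.g. 203.0.113.5 -> 203.0.113.0/24
        net = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
        ips_by_net[net].add(addr)
        hits_by_net[net] += 1

    report = [
        (str(net), len(ips), hits_by_net[net])
        for net, ips in ips_by_net.items()
        if len(ips) >= min_ips
    ]
    # Busiest networks first
    report.sort(key=lambda row: row[2], reverse=True)
    return report
```

Counting distinct IPs per network, rather than requests per IP, is the point: each address stays under a per-IP rate limit, but the range as a whole stands out. The resulting list could be emailed as a report or reviewed by hand before anything is banned at the firewall.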