Bot scrapers overloading servers?

I am wondering if providers (and website owners) have noticed any problems with scraping bots overloading their servers with too many requests?
Is it just me, or is it getting worse month after month?
Cloudflare seems to be an effective way to stop that (along with all its downsides). Here's my article about dealing with recognized bots gone crazy (like Google), as well as unknown bots and AI-training content scrapers:
https://io.bikegremlin.com/31865/website-attacked-by-ghosts/
Comments
None of them really caused any of my sites to overload.
But I can confirm that they generate more and more traffic.
I have the feeling that even Cloudflare struggles a bit with filtering them properly without captchas.
At least it feels that way - I've had to pass a lot more Cloudflare captchas for a while now.
Cloudflare Turnstile works a lot better than captchas for registration fields and similar (unless that is what you are referring to when mentioning Cloudflare captchas).
Caching pages for non-logged-in users can make a huge difference (without that, the forum I mentioned had big CPU usage spikes). But that is an extra thing to figure out and keep track of.
BikeGremlin guides & resources
Yes, I have it too. You can either ban the IP range or let it be.
Mine's Redis-cached, so the load isn't too high.
I bench YABS 24/7/365 unless it's a leap year.
Yup - caching does make a huge difference (sometimes I miss the good old static HTML stuff).
BikeGremlin guides & resources
Cloudflare Turnstile is the captcha I'm talking about.
I noticed a lot of new bots that seem to have endless resources to scrape everything, mostly from China (Huawei, Tencent, etc.). They generate lots of traffic with nothing in return, so it's really necessary to deal with them.
their own AI search bot
I bench YABS 24/7/365 unless it's a leap year.
I am seeing two kinds of bots:
What kind of load are you seeing?
For any static resources I don't really care, but for anything that's dynamically generated, these bots are a pain.
Yeah, the problem is becoming more and more real.
The worst example is a friend's forum (his problem urged me to edit my original article and start this thread):
800 different IP addresses in a 5-minute time frame, bots/crawlers, browsing pages,
putting a huge load on the VPS's CPU - to the point of crashing the site from time to time.
Not a classic DDoS attack - not constant, but frequent, and "very eager to read" what's on the forum.
Yes, definitely!
Caching helps a lot (even without any extreme bot traffic) - it can noticeably reduce server load.
And when increased load does happen, it can make the difference between the site/forum working fine and returning a 500 error.
EDIT - tin foil hat section:
If I were power hungry, unscrupulous, and in charge of Cloudflare (or any other big Net corporation), I would make sure there are many bots doing this stuff all over the Net. Fortunately, I'm sticking to bikes and pigeons.
BikeGremlin guides & resources
Ban 'em, ban 'em all! (Seriously)
It wisnae me! A big boy done it and ran away.
NVMe2G for life! until death (the end is nigh)
Do bots count as ad traffic?
I bench YABS 24/7/365 unless it's a leap year.
Fuck the AI scraping.
Haven't bought a single service in VirMach Great Ryzen 2022 - 2023 Flash Sale.
A rare photo of the headquarters of the company that pays for such traffic (and fails to filter it or punish any such deliberate abuse):
https://lowendspirit.com/uploads/editor/gi/ippw0lcmqowk.png
BikeGremlin guides & resources
Mythic Beasts did an interesting write-up about it - https://www.mythic-beasts.com/blog/2025/04/01/abusive-ai-web-crawlers-get-off-my-lawn/
My plan here is to fingerprint the request headers (to hopefully tell browsers and bots apart) and delay or potentially block bots based on that.
Yeah, I've been dealing with this a lot. I ended up setting up the lua-nginx DDoS protection scripts, and this helped greatly. Using Cloudflare or a similar provider would help more. However, I'm not made of money, and I'm very disillusioned with CF since I suspect them of covertly running some of the DDoS farms and botnets. Gave DDOS-GUARD a looksie years ago but decided to hoof it.
Won't forget the time a hosting friend went to re-negotiate a deal with Cloudflare and, after not having had an attack for months, suddenly got hit with a massive attack twice leading up to the negotiating table. The attacks suddenly stopped when they got told "nah". 10/10, I'm 100% on the conspiracy that they're in on it.
https://www.theregister.com/2025/06/22/ai_search_starves_publishers/
Yup - if you aren't selling anything, people have zero reason to open any of your pages, since Google gives them the regurgitated (copy/pasted) text from your site. "The Great Decoupling".
BikeGremlin guides & resources
Hmm, I received an email from Cloudflare congratulating me on passing 100k page views for one of my semi-abandoned portfolio sites, when in reality not even 10 people visit it monthly. I enabled Under Attack mode and whatever anti-bot features are available in the free tier, just as a precaution.
Why?
Ohhh, the Cloudflare imperial conundrum!!! Isn't it special?
Free Hosting at YetiNode | MicroNode | Cryptid Security | URL Shortener | LaunchVPS | ExtraVM | Host-C | In the Node, or Out of the Loop?
Almost exactly 2 years ago I started dealing with this BS.
One thing Cloudflare is really good at is identifying whether a scraper really belongs to a company (like validating that a GoogleBot user agent really comes from Google's IP address range, so anyone pretending to be GoogleBot gets blocked - which is quite effective against the ByteDance scraper); there's a sketch of that check below. You can be the internet tough guy and say you can make such a list on your own, until you realize there's an insane number of AI scrapers (and search engine scrapers in general) - it's a mundane task.
Still, spiders from ByteDance and Bing are particularly annoying - they make 6 gorillion requests per hour, all the f*cking time. It's as if they hired cheap $5/day labor to do the web scraping.
https://gitgud.io/fatchan/haproxy-protection is gold at doing this job, gatekeeping AI scrapers (with additional rules depending on your use case).
However, lately I've been experimenting with https://github.com/TecharoHQ/anubis with more aggressive rules, since it integrates more easily with traefik (for needy clients).
Fuck this 24/7 internet spew of trivia and celebrity bullshit.
Interesting project. I have been using the DNSProxy reverse proxy with no issues; the Cloudflare free plan is useless, but DNS resolves quickly!
Hmm, stop blocking my glorified AIM chatbot!!! She is trying to learn, dammit!!
Free Hosting at YetiNode | MicroNode | Cryptid Security | URL Shortener | LaunchVPS | ExtraVM | Host-C | In the Node, or Out of the Loop?
It's interesting to see @PureVoltage got on da news - congrats, you're famous!
https://blog.xkeeper.net/uncategorized/tcrf-has-been-getting-ddosed/
Fuck this 24/7 internet spew of trivia and celebrity bullshit.
I have also seen those request attacks coming from Chinese networks. The interesting thing with these is that they all use HTTP/1.1, while most browsers use HTTP/2.0 nowadays. At least for my Python-based web apps, I can pretty easily do some pre-filtering on the requests, and I simply limit the number of concurrent HTTP/1.1 requests more aggressively than HTTP/2.0 requests (so if there are more than, say, 4 concurrent HTTP requests being processed, any further HTTP/1.1 request gets a 429 response while HTTP/2.0 requests are still accepted). I can't do that yet for PHP, but I am planning to implement similar pre-filtering logic as a FastCGI proxy.
I simply block entire ASNs to deal with the bots, lol. More than two dozen in the list rn, and counting.
For some things I'm 10 years ahead of my time, for others just a month or so LOL
https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-crawling-your-site-in-2025/
BikeGremlin guides & resources