Bot scrapers overloading servers?

bikegremlin Moderator, OG, Content Writer

I am wondering if providers (and website owners) have noticed any problems with scraping bots overloading the server with too many requests?

Is it just me, or is it becoming worse month after month?

Cloudflare seems to be an effective way to stop that (along with all its downsides). Here is my article about dealing with recognized bots gone crazy (like Google), as well as unknown bots and AI-training content scrapers:

https://io.bikegremlin.com/31865/website-attacked-by-ghosts/

Thanked by (1)ariq01

Comments

  • Alyx Hosting Provider

    None of them really caused any of my sites to overload.
    But I can confirm that they generate more and more traffic.

    I have the feeling that even Cloudflare struggles a bit with filtering them properly without captchas.
    At least it feels that way; I've had to pass a lot more Cloudflare captchas for a while now.

    Thanked by (1)bikegremlin
  • bikegremlin Moderator, OG, Content Writer

    @Alyx said:
    None of them really caused any of my sites to overload.
    But I can confirm that they generate more and more traffic.

    I have the feeling that even Cloudflare struggles a bit with filtering them properly without captchas.
    At least it feels that way; I've had to pass a lot more Cloudflare captchas for a while now.

    Cloudflare Turnstile works a lot better than classic captchas for registration fields and the like (unless that is what you are referring to when mentioning Cloudflare captchas).

    Caching pages for non-logged in users can make a huge difference (without that, the forum I mentioned had big CPU usage spikes). But that is an extra thing to figure out and keep track of.
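As a rough illustration of the idea (the cookie name, TTL, and cache structure below are made up for the sketch, not tied to any particular forum software), caching rendered pages for anonymous visitors only can be as simple as:

```python
import time

CACHE_TTL = 300  # seconds; tune to how stale public pages may get
_cache = {}      # path -> (expires_at, body)

def get_cached_page(path, cookies, render):
    """Serve logged-in users fresh pages; cache everything else."""
    if "session_id" in cookies:          # hypothetical login-cookie name
        return render(path)              # never cache personalized pages
    entry = _cache.get(path)
    now = time.time()
    if entry and entry[0] > now:
        return entry[1]                  # cache hit: no PHP/DB work at all
    body = render(path)
    _cache[path] = (now + CACHE_TTL, body)
    return body
```

The point is that bot traffic overwhelmingly hits public URLs as an anonymous client, so every cache hit is a render the CPU never pays for.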

  • cybertech OG, Benchmark King
    edited June 21

    Yes, I have it too. You can either ban the IP range or let it be.

    Mine's Redis-cached, so the load isn't too high.

    Thanked by (1)bikegremlin

    I bench YABS 24/7/365 unless it's a leap year.

  • bikegremlin Moderator, OG, Content Writer

    @cybertech said:
    Yes, I have it too. You can either ban the IP range or let it be.

    Mine's Redis-cached, so the load isn't too high.

    Yup - caching does make a huge difference (sometimes I miss the good old static HTML stuff :) ).

  • Alyx Hosting Provider

    @bikegremlin said:

    @Alyx said:
    None of them really caused any of my sites to overload.
    But I can confirm that they generate more and more traffic.

    I have the feeling that even Cloudflare struggles a bit with filtering them properly without captchas.
    At least it feels that way; I've had to pass a lot more Cloudflare captchas for a while now.

    Cloudflare Turnstile works a lot better than classic captchas for registration fields and the like (unless that is what you are referring to when mentioning Cloudflare captchas).

    Caching pages for non-logged in users can make a huge difference (without that, the forum I mentioned had big CPU usage spikes). But that is an extra thing to figure out and keep track of.

    Cloudflare Turnstile is the captcha I'm talking about πŸ˜…

  • I noticed a lot of new bots that seem to have endless resources to scrape everything, mostly from China (Huawei, Tencent, etc.). They generate lots of traffic with nothing in return, so it's really necessary to deal with them.

    Thanked by (3)skhron AlwaysSkint ariq01
  • cybertech OG, Benchmark King

    @someTom said:
    I noticed a lot of new bots that seem to have endless resources to scrape everything, mostly from China (Huawei, Tencent, etc.). They generate lots of traffic with nothing in return, so it's really necessary to deal with them.

    Their own AI search bots.

    Thanked by (1)bikegremlin

    I bench YABS 24/7/365 unless it's a leap year.

  • I am seeing two kinds of bots:

    • those that request a page every few seconds - so far they still seem pretty harmless (although annoying)
    • bots that hit your server at full speed over several concurrent connections (usually using Scrapy) - those are really evil

    what's the kind of load you are seeing?

    for any static resources I don't really care, but anything that's dynamically generated, these bots are a pain

    Thanked by (1)bikegremlin
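A crude way to catch the second kind of bot (the thresholds here are purely illustrative; a real setup would enforce this in the web server or firewall) is a sliding-window request counter per IP:

```python
import time
from collections import defaultdict, deque

WINDOW = 10.0   # seconds
MAX_HITS = 20   # more than this per window looks like a full-speed scraper

_hits = defaultdict(deque)  # ip -> timestamps of recent requests

def allow(ip, now=None):
    """Return False once an IP exceeds MAX_HITS in the last WINDOW seconds."""
    now = time.time() if now is None else now
    q = _hits[ip]
    while q and q[0] <= now - WINDOW:
        q.popleft()                      # drop hits outside the window
    if len(q) >= MAX_HITS:
        return False                     # Scrapy-style burst: reject (429)
    q.append(now)
    return True
```

The "page every few seconds" bots sail through this, while the concurrent-connection ones trip the limit almost immediately.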
  • Yeah, the problem is becoming more and more real.

  • bikegremlin Moderator, OG, Content Writer
    edited June 21

    @cmeerw said:
    I am seeing two kinds of bots:

    • those that request a page every few seconds - so far they still seem pretty harmless (although annoying)
    • bots that hit your server at full speed over several concurrent connections (usually using Scrapy) - those are really evil

    what's the kind of load you are seeing?

    Worst example is a friend's forum (his problem urged me to edit my original article and start this thread).
    800 different IP addresses in a 5-minute time frame, bots/crawlers, browsing pages.
    Putting a huge load on the VPS' CPU - to the point of crashing the site from time to time.
    Not a classic DDOS attack, not constant, but frequent and "very eager to read" what's on the forum. :)

    for any static resources I don't really care, but anything that's dynamically generated, these bots are a pain

    Yes, definitely!
    Caching helps a lot (even without any extreme bot traffic) - it can make a noticeable server load reduction.
    And, when increased load happens, it can make a difference between the site/forum working fine, and returning a 500 error.

    EDIT - tin foil hat section:
    If I were power hungry, unscrupulous, and in charge of Cloudflare (or any other big Net corporation), I would make sure there are many bots doing this stuff all over the Net. Fortunately, I'm sticking to bikes and pigeons. :)

    Thanked by (1)AlwaysSkint
  • AlwaysSkint OG, Senpai

    Ban 'em, ban 'em all! (Seriously)

    It wisnae me! A big boy done it and ran away.
    NVMe2G for life! until death (the end is nigh)

  • cybertech OG, Benchmark King

    @bikegremlin said:

    @cmeerw said:
    I am seeing two kinds of bots:

    • those that request a page every few seconds - so far they still seem pretty harmless (although annoying)
    • bots that hit your server at full speed over several concurrent connections (usually using Scrapy) - those are really evil

    what's the kind of load you are seeing?

    Worst example is a friend's forum (his problem urged me to edit my original article and start this thread).
    800 different IP addresses in a 5-minute time frame, bots/crawlers, browsing pages.
    Putting a huge load on the VPS' CPU - to the point of crashing the site from time to time.
    Not a classic DDOS attack, not constant, but frequent and "very eager to read" what's on the forum. :)

    for any static resources I don't really care, but anything that's dynamically generated, these bots are a pain

    Yes, definitely!
    Caching helps a lot (even without any extreme bot traffic) - it can make a noticeable server load reduction.
    And, when increased load happens, it can make a difference between the site/forum working fine, and returning a 500 error.

    EDIT - tin foil hat section:
    If I were power hungry, unscrupulous, and in charge of Cloudflare (or any other big Net corporation), I would make sure there are many bots doing this stuff all over the Net. Fortunately, I'm sticking to bikes and pigeons. :)

    Do bots count as ad traffic?

    I bench YABS 24/7/365 unless it's a leap year.

  • Jab Senpai

    Fuck the AI scraping.

    Thanked by (2)AlwaysSkint ariq01

    Haven't bought a single service in VirMach Great Ryzen 2022 - 2023 Flash Sale.
    https://lowendspirit.com/uploads/editor/gi/ippw0lcmqowk.png

  • bikegremlin Moderator, OG, Content Writer
    edited June 21

    @cybertech said:

    @bikegremlin said:

    @cmeerw said:
    I am seeing two kinds of bots:

    • those that request a page every few seconds - so far they still seem pretty harmless (although annoying)
    • bots that hit your server at full speed over several concurrent connections (usually using Scrapy) - those are really evil

    what's the kind of load you are seeing?

    Worst example is a friend's forum (his problem urged me to edit my original article and start this thread).
    800 different IP addresses in a 5-minute time frame, bots/crawlers, browsing pages.
    Putting a huge load on the VPS' CPU - to the point of crashing the site from time to time.
    Not a classic DDOS attack, not constant, but frequent and "very eager to read" what's on the forum. :)

    for any static resources I don't really care, but anything that's dynamically generated, these bots are a pain

    Yes, definitely!
    Caching helps a lot (even without any extreme bot traffic) - it can make a noticeable server load reduction.
    And, when increased load happens, it can make a difference between the site/forum working fine, and returning a 500 error.

    EDIT - tin foil hat section:
    If I were power hungry, unscrupulous, and in charge of Cloudflare (or any other big Net corporation), I would make sure there are many bots doing this stuff all over the Net. Fortunately, I'm sticking to bikes and pigeons. :)

    Do bots count as ad traffic?

    A rare photo of the headquarters of the company that pays for such traffic (and fails to filter it or punish any such deliberate abuse):

    :)

  • My plan here is to fingerprint the request headers (to hopefully tell browsers and bots apart) and delay or potentially block bots based on that.

    Thanked by (1)bikegremlin
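A naive version of that fingerprinting idea might look like this (the header set and threshold are invented for illustration; real fingerprinting, e.g. TLS/JA3 or header-order based, is considerably more involved):

```python
# Real browsers send a fairly stable set of headers; minimal HTTP clients
# like Scrapy or python-requests usually don't bother.
BROWSER_HINTS = {"accept-language", "accept-encoding", "sec-fetch-mode"}

def looks_like_browser(headers):
    """headers: dict of lowercased header names to values."""
    names = {h.lower() for h in headers}
    ua = headers.get("user-agent", "").lower()
    if not ua or "python" in ua or "scrapy" in ua:
        return False                     # honest bots identify themselves
    score = len(BROWSER_HINTS & names)   # how browser-like the header set is
    return score >= 2                    # most of the hints present
```

Requests that fail the check could then be delayed (tarpitted) rather than blocked outright, which wastes the scraper's time without risking false positives on legitimate users.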
  • ZizzyDizzyMC Hosting Provider

    @bikegremlin said:
    I am wondering if providers (and website owners) have noticed any problems with scraping bots overloading the server with too many requests?

    Is it just me, or is it becoming worse month after month?

    Cloudflare seems to be an effective way to stop that (along with all its downsides). Here is my article about dealing with recognized bots gone crazy (like Google), as well as unknown bots and AI-training content scrapers:

    https://io.bikegremlin.com/31865/website-attacked-by-ghosts/

    Yeah, I've been dealing with this a lot. I ended up setting up the lua-nginx DDoS-protection scripts, and that helped greatly. Using Cloudflare or a similar provider would help more. However, I'm not made of money, and I'm very disillusioned with CF, since I suspect them of covertly running some of the DDoS farms and botnets. I gave DDOS-GUARD a looksie years ago but decided to hoof it.
    I won't forget the time a hosting friend went to renegotiate a deal with Cloudflare and, after not having an attack for months, suddenly got hit by two massive attacks leading up to the negotiating table. The attacks suddenly stopped when they were told "nah". 10/10, I'm 100% on the conspiracy that they're in on it.

    Thanked by (2)bikegremlin Kolin
  • havoc OG, Content Writer, Senpai

    https://www.theregister.com/2025/06/22/ai_search_starves_publishers/

    Google AI Overviews and other AI search services appear to be starving the hand that fed them.

    Google's AI-generated summaries of web pages, officially released in May 2024, show up atop its search results pages so search users don't have to click through to the source website.

    Thanked by (1)bikegremlin
  • bikegremlin Moderator, OG, Content Writer
    edited June 22

    @havoc said:
    https://www.theregister.com/2025/06/22/ai_search_starves_publishers/

    Google AI Overviews and other AI search services appear to be starving the hand that fed them.

    Google's AI-generated summaries of web pages, officially released in May 2024, show up atop its search results pages so search users don't have to click through to the source website.

    Yup - if you aren't selling anything, people have zero reason to open any of your pages, since Google serves the regurgitated (copy/pasted) text from your site. "The Great Decoupling":

    • Top (purple) line is the number of times a site's page is shown in the search.
    • Bottom (blue) line is the number of times people have clicked to open a page on the site.
    • Blocking bots won't help with this particular problem - that's a tangential, but separate topic (more SEO and SEM related).
  • Hmm, I received an email from Cloudflare congratulating me on passing 100k page views for one of my semi-abandoned portfolio sites, when in reality not even 10 people visit it monthly. I enabled Under Attack mode and whatever anti-bot features are available in the free tier, just as a precaution.

    Thanked by (2)bikegremlin ariq01

    Why?

  • AuroraZero Hosting Provider, Retired

    @jmaxwell said:
    Hmm, I received an email from Cloudflare congratulating me on passing 100k page views for one of my semi-abandoned portfolio sites, when in reality not even 10 people visit it monthly. I enabled Under Attack mode and whatever anti-bot features are available in the free tier, just as a precaution.

    Ohhh, the Cloudflare imperial conundrum!!! Isn't it special?

    Thanked by (1)bikegremlin
  • Almost exactly 2 years ago I started dealing with this BS.

One thing Cloudflare is really good at is identifying whether a scraper really belongs to a company (like validating that a GoogleBot user agent really comes from Google's IP address range, so anyone pretending to be GoogleBot gets blocked; that's kinda effective against the ByteDance scraper). You can be the internet tough guy and say you can build such a list on your own, until you realize there's an insane amount of AI scrapers and search-engine scrapers in general; it's a mundane task.
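A sketch of that verification for a single IP, following Google's documented reverse-then-forward DNS procedure (resolver functions are injectable here just to make the logic testable; error handling is simplified):

```python
import socket

GOOGLE_DOMAINS = (".googlebot.com", ".google.com")

def is_real_googlebot(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Verify a claimed Googlebot IP: reverse DNS, domain check, forward confirm."""
    try:
        host = reverse(ip)[0]                 # PTR lookup for the client IP
        if not host.endswith(GOOGLE_DOMAINS):
            return False                      # hostname not Google's: impostor
        return ip in forward(host)[2]         # forward lookup must round-trip
    except OSError:                           # no PTR record / DNS failure
        return False
```

Doing this per crawler, per IP, with caching, is exactly the mundane bookkeeping that makes people hand the job to Cloudflare or haproxy-protection instead.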

Still, spiders from ByteDance and Bing are particularly annoying; they make 6 gorillion requests per hour, all the f*cking time. It's as if they hired cheap $5/day labor to do the web scraping.

https://gitgud.io/fatchan/haproxy-protection is gold at doing this job, gatekeeping AI scrapers (with additional rules depending on your use case).
Lately, however, I've been experimenting with https://github.com/TecharoHQ/anubis with more aggressive rules, since it integrates more easily with Traefik (for needy clients).

    Fuck this 24/7 internet spew of trivia and celebrity bullshit.

  • @Encoders said:
    Almost exactly 2 years ago I started dealing with this BS.

    One thing Cloudflare is really good at is identifying whether a scraper really belongs to a company (like validating that a GoogleBot user agent really comes from Google's IP address range, so anyone pretending to be GoogleBot gets blocked; that's kinda effective against the ByteDance scraper). You can be the internet tough guy and say you can build such a list on your own, until you realize there's an insane amount of AI scrapers and search-engine scrapers in general; it's a mundane task.

    Still, spiders from ByteDance and Bing are particularly annoying; they make 6 gorillion requests per hour, all the f*cking time. It's as if they hired cheap $5/day labor to do the web scraping.

    https://gitgud.io/fatchan/haproxy-protection is gold at doing this job, gatekeeping AI scrapers (with additional rules depending on your use case).
    Lately, however, I've been experimenting with https://github.com/TecharoHQ/anubis with more aggressive rules, since it integrates more easily with Traefik (for needy clients).

    Interesting project. I have been using the DNSProxy reverse proxy with no issues; the Cloudflare free plan is useless, but its DNS resolves quickly!

  • AuroraZero Hosting Provider, Retired

    Hmm, stop blocking my glorified AIM chatbot!!! She is trying to learn, dammit!!

  • It's interesting to see @PureVoltage got on da news; congrats, you're famous!

    https://blog.xkeeper.net/uncategorized/tcrf-has-been-getting-ddosed/

    The LLM scrapers largely come from cloud providers, especially low-quality ones that are rife with abuse. If you block their scraping attempts, they will simply start up a new VM on their provider of choice, do more scrapes until they’re caught, repeat ad infinitum, until you give up and black hole the entirety of PureVoltage/OVH/DigitalOcean/Amazon/etc. The particularly sophisticated ones spread out their requests across multiple IP addresses in the first place, too, making identifying them much harder.

    Fuck this 24/7 internet spew of trivia and celebrity bullshit.

  • @Encoders said:
    It's interesting to see @PureVoltage got on da news; congrats, you're famous!

    https://blog.xkeeper.net/uncategorized/tcrf-has-been-getting-ddosed/

    I have also seen those request attacks coming from Chinese networks - the interesting thing with these is that they are all using HTTP/1.1, while most browsers use HTTP/2.0 nowadays. At least for my Python-based web apps, I can pretty easily do some pre-filtering on the requests, and I am just limiting the number of concurrent HTTP/1.1 requests more aggressively than HTTP/2.0 requests (so if there are more than say 4 concurrent HTTP requests being processed, any other HTTP/1.1 request gets a 429 response while still accepting HTTP/2.0 requests). I can't do that yet for PHP, but am planning to implement similar pre-filtering logic as a FastCGI proxy.

    Thanked by (1)bikegremlin
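The pre-filtering described above can be sketched roughly like this (the counter, function names, and threshold are illustrative, not from any particular framework):

```python
LIMIT = 4                    # concurrent requests before shedding starts
_in_flight = {"count": 0}    # requests currently being processed

def admit(http_version):
    """Return 200 to process the request, or 429 to shed it."""
    if http_version.startswith("1.") and _in_flight["count"] >= LIMIT:
        return 429           # under load, HTTP/1.1 (likely a bot) gets shed
    _in_flight["count"] += 1
    return 200

def finish():
    """Call when a request completes."""
    _in_flight["count"] -= 1
```

The asymmetry is the whole trick: browsers on HTTP/2 never notice the limit, while HTTP/1.1 scrapers get 429s precisely when they matter.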
  • @cybertech said: Yes, I have it too. You can either ban the IP range or let it be.

    I simply block entire ASNs to deal with the bots, lol. More than two dozen on the list right now, and counting.
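A minimal sketch of that approach (the ASNs and prefixes below are placeholder examples, not a vetted blocklist; real deployments usually resolve an ASN to its announced prefixes from RIR/whois data and enforce the block at the firewall or CDN):

```python
import ipaddress

BLOCKED_PREFIXES = {
    "AS136907": ["101.44.248.0/22"],   # example prefix only
    "AS132203": ["43.128.0.0/10"],     # example prefix only
}

# Pre-parse once; per-request matching is then just containment checks.
_networks = [
    ipaddress.ip_network(prefix)
    for prefixes in BLOCKED_PREFIXES.values()
    for prefix in prefixes
]

def is_blocked(ip):
    """True if the client IP falls inside any blocked ASN's prefixes."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in _networks)
```

For more than a handful of prefixes, a radix/LPM structure (or just nftables sets) beats the linear scan, but the idea is the same.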

  • bikegremlin Moderator, OG, Content Writer

    For some things I'm 10 years ahead of my time, for others just a month or so LOL :)

    https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-crawling-your-site-in-2025/
