Strange bot swarm overwhelming a website - SOLVED

bikegremlin Moderator OG Content Writer

I saw a client's website using up 100% of its allotted CPU resources non-stop, and a huge amount of bandwidth.

It wasn't a DDoS, and it wasn't a real bot attack either.

Never experienced something like that (you live and learn).

I wrote it all down as a sort of pulp-fiction detective story - just for some fun while documenting it:

Website Attacked By Ghosts

Thought it would be cool to not put any spoilers here. :)

Relja of House Novović, the First of His Name, King of the Plains, the Breaker of Chains, WirMach Wolves pack member
BikeGremlin's web-hosting reviews

Comments

  • bikegremlin Moderator OG Content Writer
    edited September 2023

    According to this tweet (by SEO expert Lily Ray), this site was not the only one bothered by Google crawling and indexing query-string variants instead of the canonical URLs:

    https://twitter.com/lilyraynyc/status/1706444837610750119

    I believe that the .htaccess redirect I used for the Medisite would work for that one too.

    #BEGIN Redirect from msclkid to the canonical page
    # (RewriteEngine On is normally already set earlier in a WordPress .htaccess)
    # If the query string contains "msclkid=" (case-insensitive)...
    RewriteCond %{QUERY_STRING} "msclkid=" [NC]
    # ...301 redirect to the same path; the trailing "?" strips the query string
    RewriteRule (.*) /$1? [R=301,L]
    #END Redirect from msclkid to the canonical page
    

    Here, I added a more detailed explanation of how to see if your site is affected and fix it.
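
    For reference, a variant covering several common click-ID parameters might look like the sketch below. The gclid and fbclid parameters are my additions here (assumptions - check your own logs for what actually turns up), and like the original rule it strips the entire query string, so don't use it on URLs where other parameters matter.

    #BEGIN Redirect from common click-ID parameters to the canonical page
    RewriteEngine On
    # Match a query string that starts with, or contains, one of the click IDs
    RewriteCond %{QUERY_STRING} (^|&)(msclkid|gclid|fbclid)= [NC]
    # 301 redirect to the same path, with the trailing "?" dropping the query string
    RewriteRule (.*) /$1? [R=301,L]
    #END Redirect from common click-ID parameters to the canonical page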

    Relja of House Novović, the First of His Name, King of the Plains, the Breaker of Chains, WirMach Wolves pack member
    BikeGremlin's web-hosting reviews

  • Good find. Bandwidth aside, using 100% of the CPU isn't acceptable. I recall this isn't the first fuckup by Google's bot, is it?

    But then again, most of my websites are just flat static HTML. I probably won't see this kind of problem anytime soon.

    Thanked by (2): bikegremlin, vyas

    Fuck this 24/7 internet spew of trivia and celebrity bullshit.

  • bikegremlin Moderator OG Content Writer

    @Encoders said:
    Good find. Bandwidth aside, using 100% of the CPU isn't acceptable. I recall this isn't the first fuckup by Google's bot, is it?

    But then again, most of my websites are just flat static HTML. I probably won't see this kind of problem anytime soon.

    Depending on one's priorities, it could be argued that the massive duplicate indexing is the biggest problem.

    The affected site(s) had the same pages indexed several times - the same URL, with various query string combinations at its end.

    But yes, CPU load is no fun either.

    Relja of House Novović, the First of His Name, King of the Plains, the Breaker of Chains, WirMach Wolves pack member
    BikeGremlin's web-hosting reviews

  • I've seen similar issues with bingbot as well, where it amounted to a small DoS.

    Users can set a max crawl rate for these bots in their search console, but I'm not aware of any user that actually does until it becomes an issue.

    Search companies really need to set reasonable max crawl rates by default and actually stick to them, regardless of what is being indexed.

    Users shouldn't have to go to lengths just to keep search bots from eating up their website's resources.
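
    As an illustration of per-bot throttling (a sketch - not every crawler honors it): Bingbot respects a Crawl-delay line in robots.txt, while Googlebot ignores it, so Google's crawl rate has to be limited from Search Console instead.

    # robots.txt - ask Bingbot to wait ~10 seconds between requests
    User-agent: bingbot
    Crawl-delay: 10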

    SimpleSonic - We Make Fast... Easy!
    New High Performance Economy Shared Hosting Plans Available As Low As $1.46/mo

  • bikegremlin Moderator OG Content Writer

    @ResellerWiz said:
    I've seen similar issues with bingbot as well, where it amounted to a small DoS.

    Users can set a max crawl rate for these bots in their search console, but I'm not aware of any user that actually does until it becomes an issue.

    Search companies really need to set reasonable max crawl rates by default and actually stick to them, regardless of what is being indexed.

    Users shouldn't have to go to lengths just to keep search bots from eating up their website's resources.

    This was more than just a crawl rate issue.

    It was Google crawling the query string variants, and completely ignoring the canonical URL tags of each page (according to their own search result page - no greater proof than that).

    Relja GottaLoveSeo Novovic

    Thanked by (1): AlwaysSkint

    Relja of House Novović, the First of His Name, King of the Plains, the Breaker of Chains, WirMach Wolves pack member
    BikeGremlin's web-hosting reviews

  • @bikegremlin said:

    @ResellerWiz said:
    I've seen similar issues with bingbot as well, where it amounted to a small DoS.

    Users can set a max crawl rate for these bots in their search console, but I'm not aware of any user that actually does until it becomes an issue.

    Search companies really need to set reasonable max crawl rates by default and actually stick to them, regardless of what is being indexed.

    Users shouldn't have to go to lengths just to keep search bots from eating up their website's resources.

    This was more than just a crawl rate issue.

    It was Google crawling the query string variants, and completely ignoring the canonical URL tags of each page (according to their own search result page - no greater proof than that).

    Relja GottaLoveSeo Novovic

    Either way, not good.

    Thanked by (1): bikegremlin

    SimpleSonic - We Make Fast... Easy!
    New High Performance Economy Shared Hosting Plans Available As Low As $1.46/mo

  • bikegremlin Moderator OG Content Writer

    @ResellerWiz said:

    @bikegremlin said:

    @ResellerWiz said:
    I've seen similar issues with bingbot as well, where it amounted to a small DoS.

    Users can set a max crawl rate for these bots in their search console, but I'm not aware of any user that actually does until it becomes an issue.

    Search companies really need to set reasonable max crawl rates by default and actually stick to them, regardless of what is being indexed.

    Users shouldn't have to go to lengths just to keep search bots from eating up their website's resources.

    This was more than just a crawl rate issue.

    It was Google crawling the query string variants, and completely ignoring the canonical URL tags of each page (according to their own search result page - no greater proof than that).

    Relja GottaLoveSeo Novovic

    Either way, not good.

    Yup. I'd say it's worse than just a high crawl speed.

    Thanked by (1): SimpleSonic

    Relja of House Novović, the First of His Name, King of the Plains, the Breaker of Chains, WirMach Wolves pack member
    BikeGremlin's web-hosting reviews

  • This article truly talks to me. :relieved:

    Thanked by (2): FrankZ, bikegremlin
  • IMHO, the best way to address this issue is by preventing Googlebot from crawling faceted URLs. This can be achieved by using the disallow directive in the robots.txt file.

    If you choose to redirect the URLs instead, it's important to note that Googlebot will still crawl them.
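
    A minimal sketch of that approach, reusing the msclkid parameter from the earlier example (Google supports * wildcards in robots.txt patterns). Keep in mind robots.txt only blocks crawling - an already-indexed URL can stay indexed if it's linked from elsewhere:

    # robots.txt - keep compliant bots away from msclkid query-string variants
    User-agent: *
    Disallow: /*?*msclkid=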

    UpCloud free $25 through this aff link - Akamai, DigitalOcean and Vultr alternative, multiple location, IPv6.

  • bikegremlin Moderator OG Content Writer
    edited September 2023

    @Dazzle said:
    IMHO, the best way to address this issue is by preventing Googlebot from crawling faceted URLs. This can be achieved by using the disallow directive in the robots.txt file.

    If you choose to redirect the URLs instead, it's important to note that Googlebot will still crawl them.

    They will try to crawl them - and get 301 redirected.
    Thanks to the .htaccess redirects, there's no hammering of the server - WordPress won't even realize someone requested those pages.

    After a while, with 301 redirects, Google ditches the redirected pages in favour of the pages they 301 redirect to.
    Canonical tags, on the other hand, are apparently treated as a suggestion, not as a hard rule.

    Edit:
    My initial (gut) response was to block the crawling of those pages (using the firewall - lol, making matters worse).
    But that was a very bad idea.
    In my defense, the server was "redlining," so I wanted a quick fix to keep the site online, and then take time to think.
    Disallowing via robots.txt is a far less bad solution, but still not ideal IMO.

    301 .htaccess redirects are an elegant solution that should fix the problem properly in the long run - in the best and most efficient way.
    I could be wrong, but based on my experience so far, that's what I did and what I would recommend.

    We'll see how things end up in Google Search Console over the next month or two.
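
    For anyone checking their own site, here's a quick way to verify such a redirect from a shell (example.com and the msclkid value are placeholders) - the response should be a 301 with a Location header pointing at the clean URL:

    # Fetch headers only; expect "301" and "Location: https://example.com/some-page/"
    curl -I "https://example.com/some-page/?msclkid=abc123"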

    Relja of House Novović, the First of His Name, King of the Plains, the Breaker of Chains, WirMach Wolves pack member
    BikeGremlin's web-hosting reviews

  • @bikegremlin said:

    301 .htaccess redirects are an elegant solution that should fix the problem properly in the long run - in the best and most efficient way.
    I could be wrong, but based on my experience so far, that's what I did and what I would recommend.

    It can solve the problem, but not in the ideal way. The robots.txt file exists specifically for this purpose - controlling how bots behave on your site. If the faceted URLs are also visited by human visitors, then a 301 redirect is the ideal solution.

    Regarding GSC, the URLs will remain in the "crawled but not indexed" warning for quite a long time even if you 301 redirect them. A faster method is to use a 404 status code and request index removal from GSC (if the URL is indexed on the SERP). Otherwise, Google may reindex the faceted URLs if they are found elsewhere.
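
    A hypothetical .htaccess sketch of that 404 approach, reusing the msclkid example from above (Apache 2.4 syntax; the "-" means no substitution):

    RewriteEngine On
    RewriteCond %{QUERY_STRING} (^|&)msclkid= [NC]
    # Answer with a plain 404 instead of redirecting
    RewriteRule ^ - [R=404,L]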

    I don't want to start a debate, just to share my thoughts as a second opinion. I only read this topic today.

    Best regards,
    Re

    Thanked by (1): bikegremlin

    UpCloud free $25 through this aff link - Akamai, DigitalOcean and Vultr alternative, multiple location, IPv6.

  • bikegremlin Moderator OG Content Writer

    @Dazzle said:

    @bikegremlin said:

    301 .htaccess redirects are an elegant solution that should fix the problem properly in the long run - in the best and most efficient way.
    I could be wrong, but based on my experience so far, that's what I did and what I would recommend.

    It can solve the problem, but not in the ideal way. The robots.txt file exists specifically for this purpose - controlling how bots behave on your site. If the faceted URLs are also visited by human visitors, then a 301 redirect is the ideal solution.

    Regarding GSC, the URLs will remain in the "crawled but not indexed" warning for quite a long time even if you 301 redirect them. A faster method is to use a 404 status code and request index removal from GSC (if the URL is indexed on the SERP). Otherwise, Google may reindex the faceted URLs if they are found elsewhere.

    I don't want to start a debate, just to share my thoughts as a second opinion. I only read this topic today.

    Best regards,
    Re

    It's good to hear different opinions and points of view - especially when they disagree. It's difficult to learn otherwise.

    Here's my experience (hoping to get corrected if I'm wrong):
    301s are pretty good at getting Google to drop the redirected pages from the SERP and replace them with the pages you redirect to (if it's basically or literally the same page, not some strange "hack").

    The old version will get dropped from indexing and shown under non-indexed page links (apparently, Google never forgets, LOL), with the reason "Page with redirect."

    It doesn't negatively affect rankings (the dropped pages usually get swapped for the pages you 301 redirected to).

    I've used 301 redirects when I moved the cycling website in my native language from the "www." to the "bicikl." subdomain, and when I moved all the computer-related articles from the cycling website(s) to the "io." subdomain.

    I've also used 301s when I ditched Google AMP from my sites.

    It all seems to have worked fine with no measurable negative ranking effects.
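
    For illustration, such a whole-site (sub)domain move boils down to an .htaccess rule like this sketch (example.com stands in for my actual domain):

    RewriteEngine On
    # Send everything requested on the old hostname to the same path on the new one
    RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
    RewriteRule ^(.*)$ https://bicikl.example.com/$1 [R=301,L]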

    Having said that, this is the first time I've seen "random" URLs get indexed, and it will take a while to check and see if it went well.

    But I would expect a 301 redirect to be more efficient than just blocking the bots (even via the robots.txt file). My reasoning:

    • Block: "Don't go there, don't look."
    • 301 redirect: "That page is now here, see?"

    Especially when the indexed pages all had canonical tags pointing to the same URLs that the 301 redirects take the bots to.
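
    For reference, the canonical tag in question is a single line in each page's <head>, typically generated by an SEO plugin (example.com is a placeholder):

    <link rel="canonical" href="https://example.com/some-page/">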

    Relja AmateurSEO Novovic

    Thanked by (1): Dazzle

    Relja of House Novović, the First of His Name, King of the Plains, the Breaker of Chains, WirMach Wolves pack member
    BikeGremlin's web-hosting reviews

  • Great thread. I have recently been receiving warning emails from Google about pages on one of my sites not being indexed when they're actually 301 redirected to another domain. So far I've just ignored the warnings, but I think I need to rethink this now.

    Thanked by (2): skorous, vyas

    ProlimeHost Dedicated Servers
    Los Angeles, CA - Denver, CO - Singapore

  • I've had the search engines totally ignore directives for years! Google in particular loves adding query strings. Webmaster Tools be damned.
    When I'm a bit more "on the ball", I'll slowly read the above and try to glean some tips.
    Cheers.

    Thanked by (2): skorous, bikegremlin

    It wisnae me! A big boy done it and ran away.
    NVMe2G for life! until death (the end is nigh)

  • edited March 11

    Ironically, the two sites that had massive bandwidth usage recently don't seem to have the above issue with indexed parameters.
    A point of note: I found the following in robots.txt to be a totally wasted effort:

    User-agent: *
    Disallow: /*?

    Along with these, though hardly surprising from the scum/scourge of the web (IMHO):

    User-agent: AhrefsBot
    Disallow: /

    User-agent: YandexDirect
    Disallow: /

    User-agent: YandexBot
    Disallow: /
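
    Since robots.txt is only advisory, a server-level block is the fallback for bots that ignore it - a hypothetical .htaccess sketch using the bot names from the rules above:

    RewriteEngine On
    # Return 403 Forbidden to the listed user agents
    RewriteCond %{HTTP_USER_AGENT} (AhrefsBot|Yandex) [NC]
    RewriteRule ^ - [F]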

    Thanked by (1): bikegremlin

    It wisnae me! A big boy done it and ran away.
    NVMe2G for life! until death (the end is nigh)
