Contemplating starting a blog...the ultra-light static files on a CDN variety so that bot traffic doesn't matter
Folks are on social networks and "AI" - info sites (like sheldonbrown.com for bikes or anandtech for computers) are now just used for bot scraping and practically no one visits those anymore.
Even if you want to, Google is making it next-to-impossible for you to find other sites (blogs or info sites) - and other search engines have always been more crappy than not.
So, you can make a site as a personal diary, but it will remain practically personal - unless you invest in marketing and sell stuff, but that's a bit different from what I understand you plan to do.
Should not stop you, just expect visits to be a pleasant surprise, not a thing to expect nowadays (and the trend is getting worse IMO).
Comments
Have you noticed any user agents in common or IP ranges in common?
Slow Servers IPv6-native VPSs hosted on OpenBSD's VMM in Spokane, WA, USA. (I racked these.) (Now with IPv4!)
SporeStack Resold Vultr VPS/baremetal, DO, and a whitelabeled brand in Europe. KYC-free, simple to launch. (I didn't rack these.) Neither have low end pricing!
Google Cloud. AWS. Azure. Everywhere you'd expect, really.
If you publish anything remotely useful you will have thousands of visitors from all around the world, and each IP address will request exactly one page and never come back again.
At least that's the reality on my site, after I firewall-blocked all Chinese cloud providers who were even worse.
I am noticing a lot of requests that claim to be modern browsers (based on the user agent) but only support HTTP/1.1.
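A minimal sketch of that mismatch check, for anyone who wants to flag it in their own logs. The regex, the version cut-off, and the function name are all assumptions for illustration, and the heuristic only makes sense when the logged protocol is the one the client actually negotiated (a site served over plain HTTP, or logs taken behind a proxy that speaks HTTP/1.1 to the backend, will show HTTP/1.1 for everyone).

```python
import re

# Assumed pattern: pull the major version out of a Chrome or Firefox token.
MODERN_UA = re.compile(r"(Chrome|Firefox)/(\d+)")

# Assumed cut-off for "modern"; tune it against your own traffic.
MIN_MODERN_MAJOR = 100


def looks_spoofed(user_agent: str, protocol: str) -> bool:
    """Return True when a UA claims a recent browser but the request used HTTP/1.x."""
    match = MODERN_UA.search(user_agent)
    if not match:
        # Not claiming to be one of the browsers we check, so no verdict.
        return False
    claims_modern = int(match.group(2)) >= MIN_MODERN_MAJOR
    return claims_modern and protocol in ("HTTP/1.0", "HTTP/1.1")
```

Feed it the user agent and protocol fields from each access log line and treat a hit as a signal to combine with others, not as proof on its own.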
Thanks. I see. That sounds even more grim than previously thought.
Hmm... I think it may come down to just giving it a try, then. A simple MDX blog framework shouldn't be too hard to deploy.
What on earth is that? You label it an AI site but it looks like it was designed in the 90s. Is the info on it all AI generated but the look intentionally old school?
I believe sheldonbrown.com was an example of an info site that bots scrape but no one visits.
Some Autonomous Systems (AS) can be problematic. I encountered one whose requests were so frequent that my website occasionally failed to respond to legitimate traffic. Even after I blocked a /24 range, new requests simply emerged from their other subnets. Consequently, I decided to block all their IP addresses, and performance returned to normal.
A quick search revealed that this AS has been flagged on numerous GitHub 'bad IP' lists for over a decade. This suggests that perhaps I should proactively block known malicious networks, though I am still looking for a definitive, 'go-to' ASN blacklist.
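Blocking a whole AS rather than a single /24 boils down to matching each client address against every prefix the AS announces. Here is a rough sketch using Python's standard `ipaddress` module. The prefixes are placeholder documentation ranges, not a real blacklist, and `build_blocklist` and `is_blocked` are hypothetical helper names.

```python
import ipaddress

# Placeholder prefixes (documentation ranges); a real list would come from
# the routes the offending AS actually announces.
BLOCKED_PREFIXES = [
    "192.0.2.0/24",
    "198.51.100.0/25",
    "198.51.100.128/25",
    "2001:db8::/32",
]


def build_blocklist(prefixes):
    """Parse prefixes and merge adjacent or overlapping ones per IP version."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    # collapse_addresses() cannot mix IPv4 and IPv6, so handle them separately.
    v4 = ipaddress.collapse_addresses(n for n in nets if n.version == 4)
    v6 = ipaddress.collapse_addresses(n for n in nets if n.version == 6)
    return list(v4) + list(v6)


def is_blocked(ip, blocklist):
    """Return True if the address falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr.version == net.version and addr in net for net in blocklist)
```

A linear scan like this is fine for a few hundred prefixes. For larger lists, a firewall set (or a radix tree) is the usual choice, since the firewall drops the packets before they reach the web server at all.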
This.
It's a legendary site that was a role model and an inspiration for my own site(s).
Now people ask AI or ask on Reddit for info/answers that AI scraped from there.
It is becoming a real menace... They even bypass Cloudflare, and rate limiting based on headers is being defeated as well. Bandwidth costs have been passing the threshold and I get longer CPU spikes. These scrapers seem to be iterating so fast... I get tired of hunting them down sometimes.
Wow! This sounds like a huge pain.
I wonder how they track state of which page to go to next? I think there would have to be a mesh or some kind of centralized system to know what to crawl. Would be interesting to see how that operates.
These must come in pretty fast, one page at a time from all over?
PS: I remembered Sheldon Brown's website. It's a good one!
https://www.hedgehogsecurity.co.uk/blog/what-are-tarpits
Found this. Looks interesting. Not sure if anyone has seen how effective it is in practice recently.
It doesn't really solve the bandwidth problem, but it definitely seems plausible to serve junk back to the bots, or slow them down with super slow responses of 1 bit/s.
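For a feel of what a tarpit does, here is a rough `asyncio` sketch that ignores the request and drips a junk page back one byte at a time. It is an illustration of the general idea, not the approach from the linked article. The handler names, the junk payload, and the one-byte-per-second pace are assumptions, and it does not decide which clients deserve the tarpit.

```python
import asyncio


async def drip(writer, payload: bytes, delay: float) -> int:
    """Send payload one byte at a time, pausing `delay` seconds after each byte."""
    sent = 0
    for i in range(len(payload)):
        writer.write(payload[i:i + 1])
        await writer.drain()
        sent += 1
        await asyncio.sleep(delay)
    return sent


async def tarpit_handler(reader, writer, delay: float = 1.0):
    """Ignore whatever was requested and slowly feed the client junk."""
    header = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
    junk = b"<html>" + b"lorem ipsum " * 50 + b"</html>"
    try:
        await drip(writer, header + junk, delay)
    except ConnectionError:
        # The client gave up; nothing to clean up beyond closing the socket.
        pass
    finally:
        writer.close()


async def main(host: str = "127.0.0.1", port: int = 8081):
    # Assumed bind address and port, for local experiments only.
    server = await asyncio.start_server(tarpit_handler, host, port)
    async with server:
        await server.serve_forever()
```

Start it with `asyncio.run(main())`. Each stuck client costs one open socket and a sleeping coroutine on your side, so the trade only pays off if the scraper waits rather than timing out and retrying from another address.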
Just enable Cloudflare's "I'm Under Attack" mode for your entire domain and call it a day.
We've had quite a spike in AI/bot traffic over the past couple of years across all of our customer sites. We set various Cloudflare settings, including relatively permissive global rate limiting (due to the free tier), along with LSCache and OPcache, and a custom WordPress rate-limit plugin that is considerably less permissive than the Cloudflare one. These changes have more or less resolved the issues for our sites (especially for WooCommerce sites, where our rate limiting made a significant difference).
Slowloris used to be an attack vector for DoS. Reusing it as an answer to bots is actually quite a smart approach.
As an initial approach, rate limiting in nginx as a reverse proxy in front of the actual web server works quite well, but it does need some love and adjustment over time for new crawlers, agents, etc.
When the scrapers started hitting me hard, I just set up nginx rate and connection limits. The naughty ones get thrown a 444 in perpetuity.
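To make the "rate limit, then ban" idea concrete, here is a small per-IP token bucket in Python with a permanent ban once a client runs dry. It is a generic sketch and not how nginx implements its limits (nginx's `limit_req` is documented as a leaky bucket, and 444 is its non-standard "close the connection without a response" code). The class name, the parameters, and the ban-on-first-overrun policy are assumptions.

```python
import time


class TokenBucketLimiter:
    """Per-IP token bucket; clients that exhaust their bucket are banned for good."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate      # tokens refilled per second
        self.burst = burst    # maximum tokens a client can hold
        self.clock = clock    # injectable clock, handy for testing
        self.state = {}       # ip -> (tokens, time of last request)
        self.banned = set()   # grows without bound; prune it in real use

    def allow(self, ip: str) -> bool:
        """Return True to serve the request, False to drop it (the 444 case)."""
        if ip in self.banned:
            return False
        now = self.clock()
        tokens, last = self.state.get(ip, (float(self.burst), now))
        # Refill for the time elapsed since the last request, capped at burst.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.banned.add(ip)
            return False
        self.state[ip] = (tokens - 1, now)
        return True
```

With `rate=1.0` and `burst=2`, a client gets two quick requests and is then held to roughly one per second; the first request beyond that gets it banned. Given the one-page-per-IP pattern described earlier in the thread, per-IP limits like this only catch the scrapers that reuse addresses.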