[2022] ★ VirMach ★ RYZEN ★ NVMe ★★ The Epic Sales Offer Thread ★★

AlwaysSkint · July 2022

@Mumbly said: ..annoying for someone who migrates stuff in production..

That's why there's such a thing as planned maintenance.

yoursunny · July 2022

@Mumbly said:

@yoursunny said:

Migration with data is queued into a batch system and not immediate.
Once requested, the service is powered off and locked, until the migration completes, which may take up to 24 hours.
Tickets created within these 24 hours are auto-closed.
This would prevent server overloading.

This hobbyistic point of view may be damn annoying for someone who migrates stuff in production and wants to get it back online and work on it as soon as possible. Or at least within a scheduled time, so they can be around to put things back in order, not some random "up to 24 hours - so no sleep for you tonight" time.

≤$10/month services are not to be used for production.

AlwaysSkint · July 2022

@yoursunny said: ≤$10/month services are not to be used for production.

They make for good slave nameservers (3), personal VPN (1), development (2) and hobby sites (1).
[Crap; I've got 5 idlers! ]

Later: 3 idlers set to cancel.

Anyone for a Ryzen Special 2560, next due date 05/06/24 (mm/dd/yy)? Currently on NYCB013. Starting to seriously think about it, depending on an offer. It will also depend on whether my multi-IP VPS gets migrated properly.

Mumbly · July 2022

@yoursunny said:

@Mumbly said:

@yoursunny said:

Migration with data is queued into a batch system and not immediate.
Once requested, the service is powered off and locked, until the migration completes, which may take up to 24 hours.
Tickets created within these 24 hours are auto-closed.
This would prevent server overloading.

This hobbyistic point of view may be damn annoying for someone who migrates stuff in production and wants to get it back online and work on it as soon as possible. Or at least within a scheduled time, so they can be around to put things back in order, not some random "up to 24 hours - so no sleep for you tonight" time.

≤$10/month services are not to be used for production.

Since when? Ah, you newbs...

I use XenVZ.co.uk (openITC) £11.00 GBP/3 months vps in production since 2009 or so.
And Prometeus €11.25/yearly VPS in production since 2012 or so.
There's also no issue with my $18.15/yearly Ramnode vps for more than a decade.
I can't remember when I experienced Weservit NL €4.58 EUR/m outage last time, but I am pretty sure it wasn't in this decade.
There's also no issues with my tiny Securedragon VPS since 2013.

@AlwaysSkint said:

@Mumbly said: ..annoying for someone who migrates stuff in production..

That's why there's such a thing as planned maintenance.

That's exactly my point. You can plan maintenance and work on it only if you know the exact time when your shit will go down.

@yoursunny's hobbyistic suggestion with an "up to 24 hours" window makes it a bit harder to plan, as you can't be sure when exactly your crap will go down and reappear with a new IP on a new server. There are several solutions to prevent this migration downtime while you sleep, but generally speaking it's just unnecessarily annoying.
I understand why he suggested that, but what I am trying to say is that not everyone collects VPSes just to run a benchmark every now and then; some people actually use them. A 24-hour maintenance window is simply too long. You want to configure things and get them back online as soon as they are migrated to the new server, not tomorrow after you wake up.

Neoon · July 2022

@yoursunny said:
≤$10/month services are not to be used for production.

Nonsense, bullshit.

loyaltyforge_dan · July 2022

@VirMach said:
Bulk of migrations are complete to some level, with a lot of problems. FFM is still facing disk configuration issue, I couldn't get to it, but luckily only a small number of people are affected outside of FFME04 which rebooted into no disks, and potentially FFME05 which is displaying services as online but large disk issue.

I don't know if it's just me or if NYCB019 is in the same boat. I've been offline for a couple days and no amount of turning off VNC, unmounting CDROM, and attempting to boot seems to help.

skorous · July 2022

@Mumbly said:

This hobbyistic point of view may be damn annoying for someone who migrates stuff in production and wants to get it back online and work on it as soon as possible. Or at least within a scheduled time, so they can be around to put things back in order, not some random "up to 24 hours - so no sleep for you tonight" time.

If it's in production then don't migrate it. If it's not in production then who cares how long it takes. Or better, $50 express fee and you get it there in four hours.

kheng86 · July 2022

@VirMach said:
Bulk of migrations are complete to some level, with a lot of problems. FFM is still facing disk configuration issue, I couldn't get to it, but luckily only a small number of people are affected outside of FFME04 which rebooted into no disks, and potentially FFME05 which is displaying services as online but large disk issue.

FFM has ECC RAM onsite and this will repair FFME001 which is actually correcting the errors just fine for now, but to avoid any comorbidities.

Migrations had issue with reconfigurations, SolusVM can't handle that many and keeps crashing. We'll be going through today and also fixing incorrect IPv4 showing up on WHMCS but for the most part you should be able to reconfigure and get it to work. A small percentage of these will still have problems booting back up, and we're actively still going through those right now.

Any idea when you are going to fix FFME004, FFME005 & FFME006? All my VMs are down on those nodes (Offline, No bootable disk, IO errors etc.) Thank you.

VirMach · July 2022

@Papa said:
Same for my FFME003-004. One sees no disk, other not even trying to boot.

@kheng86 said:

@VirMach said:
Bulk of migrations are complete to some level, with a lot of problems. FFM is still facing disk configuration issue, I couldn't get to it, but luckily only a small number of people are affected outside of FFME04 which rebooted into no disks, and potentially FFME05 which is displaying services as online but large disk issue.

FFM has ECC RAM onsite and this will repair FFME001 which is actually correcting the errors just fine for now, but to avoid any comorbidities.

Migrations had issue with reconfigurations, SolusVM can't handle that many and keeps crashing. We'll be going through today and also fixing incorrect IPv4 showing up on WHMCS but for the most part you should be able to reconfigure and get it to work. A small percentage of these will still have problems booting back up, and we're actively still going through those right now.

Any idea when you are going to fix FFME004, FFME005 & FFME006? All my VMs are down on those nodes (Offline, No bootable disk, IO errors etc.) Thank you.

FFME006 should be fine I think, if yours is down it may be unrelated. FFME005, I don't remember, but FFME004 definitely had an issue. Worked on them an hour or so ago and updated the BIOS settings to match Tokyo which has similar disks and no problems. Hopefully it'll stick. Only updated it for those dropping disks. Double checking them in about half an hour.

VirMach · July 2022

@Jab said:
Out of curiosity what is the difference between NYCB014 and NYCM101? Status page says all nodes should be NYCB, the other one is NYCM, both ends in the same place. Did you fat-fingered B/M?

Those have the whacky names because I'm an idiot and wanted to call them "NYCM" as in "M" for "Migrate" or something, because we set up the naming scheme before we verified some details about the servers, but muscle memory changed half of them to "B" since the cabinet is the "B" cabinet.

The names will be changed later to the NYCB000 scheme.

VirMach · July 2022

Some observations about the network, and ARP packets.

We kind of spoke about this on OGF but I'm tired of going back there for now, so I'm providing a continuation of it here. These seem to happen whenever more than one subnet is on the same VLAN, and the SolusVM setting for ebtables isn't enabled, around 1,000 PPS per VM in these types of packets. Thing is, they appear even with just 2 x /24 IPv4 on the same VLAN, as in two servers. We were told by one of our DC partners that they did this no problem, so we bit after some hesitation and it definitely doesn't work out well. Doesn't seem like it's collisions, maybe related to DHCP within the VMs. Even if we don't mix the subnets, the problem seems to be there AFAIK. For smaller server clusters, we can enable ebtables and it'll handle it fine. Once we go big, it struggles to keep up it seems, and the only choice is to have that turned off which in turn creates those packets.

All the network engineers I've spoken with have no idea what I'm going on about and all say it should be fine with the switch models we're using.

Anyway, when we redid configuration on the LAX switch, we ended up strictly splitting everything up, and the problem went away. Of course this is also how we had it all previously set up hence why IP addresses couldn't move between servers. We're going to try speaking to a few other experts about it and then probably revert everything back to the original setup we're comfortable with, mimicking LAX, if we don't find a better solution.

This is strictly between KVM virtual networks. Originally we thought it was abuse-related, as it did coincide with abuse incidents when it got really bad, but even without abuse it exists on the setup I mentioned above. Now the interesting thing: when looking at arpwatch, it seems like it's a ton of packets flying between all the gateways and VMs. In LAX we originally had fewer gateways, and the situation reached a catastrophic level last week, hence the emergency and drastic change, but unfortunately it was so bad that I couldn't even have a look.

Disclaimer: I may be many things but I'm definitely not a network engineer.

Before I change LAX to also have the fine-tuned NIC driver configuration does anyone who has both LAX and somewhere else functional notice one performing better than the other? On my end, LAX is around 70% cleaner.
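
A minimal sketch of how the "around 1,000 PPS per VM" figure above could be checked from a capture. It assumes the ARP frames have already been exported as (timestamp, source MAC) pairs, for example from a tcpdump run on the bridge; the function names and the default threshold are illustrative only, not anything from SolusVM or arpwatch.

```python
from collections import Counter


def arp_pps_by_source(packets, window_seconds):
    """Average ARP packets per second for each source MAC.

    packets: iterable of (timestamp_seconds, source_mac) tuples, ARP frames only.
    window_seconds: length of the capture window the packets were taken from.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    counts = Counter(mac for _timestamp, mac in packets)
    return {mac: count / window_seconds for mac, count in counts.items()}


def noisy_sources(packets, window_seconds, threshold_pps=1000.0):
    """Source MACs whose average ARP rate meets or exceeds the threshold."""
    rates = arp_pps_by_source(packets, window_seconds)
    return sorted(mac for mac, pps in rates.items() if pps >= threshold_pps)
```

Running the same capture on a node with ebtables enabled and on one without it would show whether the per-VM rate really differs, which is the comparison described in the post.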

VirMach · July 2022

@fan said:

@VirMach said: Ryzen migrate button will be converted to Ryzen to Ryzen location change button by around the end of this week.

Good luck with that, hope this won't cause any more chaos and ruin my only one in Tokyo that isn't suffering from constant packet loss.

Good to hear Tokyo calmed down a bit, at least on a single server.

kheng86 · July 2022

@VirMach said:

@Papa said:
Same for my FFME003-004. One sees no disk, other not even trying to boot.

@kheng86 said:

@VirMach said:
Bulk of migrations are complete to some level, with a lot of problems. FFM is still facing disk configuration issue, I couldn't get to it, but luckily only a small number of people are affected outside of FFME04 which rebooted into no disks, and potentially FFME05 which is displaying services as online but large disk issue.

FFM has ECC RAM onsite and this will repair FFME001 which is actually correcting the errors just fine for now, but to avoid any comorbidities.

Migrations had issue with reconfigurations, SolusVM can't handle that many and keeps crashing. We'll be going through today and also fixing incorrect IPv4 showing up on WHMCS but for the most part you should be able to reconfigure and get it to work. A small percentage of these will still have problems booting back up, and we're actively still going through those right now.

Any idea when you are going to fix FFME004, FFME005 & FFME006? All my VMs are down on those nodes (Offline, No bootable disk, IO errors etc.) Thank you.

FFME006 should be fine I think, if yours is down it may be unrelated. FFME005, I don't remember, but FFME004 definitely had an issue. Worked on them an hour or so ago and updated the BIOS settings to match Tokyo which has similar disks and no problems. Hopefully it'll stick. Only updated it for those dropping disks. Double checking them in about half an hour.

FFME004 seems to be working fine now after your fix. Appreciate that!

FFME005: VMs are having IO/kernel issues or Boot failure issue

FFME006: VMs are "Offline", can't boot them up or reinstall the OS

sahjanivishal · July 2022

@VirMach said:

@Papa said:
Same for my FFME003-004. One sees no disk, other not even trying to boot.

@kheng86 said:

@VirMach said:
Bulk of migrations are complete to some level, with a lot of problems. FFM is still facing disk configuration issue, I couldn't get to it, but luckily only a small number of people are affected outside of FFME04 which rebooted into no disks, and potentially FFME05 which is displaying services as online but large disk issue.

FFM has ECC RAM onsite and this will repair FFME001 which is actually correcting the errors just fine for now, but to avoid any comorbidities.

Migrations had issue with reconfigurations, SolusVM can't handle that many and keeps crashing. We'll be going through today and also fixing incorrect IPv4 showing up on WHMCS but for the most part you should be able to reconfigure and get it to work. A small percentage of these will still have problems booting back up, and we're actively still going through those right now.

Any idea when you are going to fix FFME004, FFME005 & FFME006? All my VMs are down on those nodes (Offline, No bootable disk, IO errors etc.) Thank you.

FFME006 should be fine I think, if yours is down it may be unrelated. FFME005, I don't remember, but FFME004 definitely had an issue. Worked on them an hour or so ago and updated the BIOS settings to match Tokyo which has similar disks and no problems. Hopefully it'll stick. Only updated it for those dropping disks. Double checking them in about half an hour.

Can you please confirm once about FFME003 as well?

cybertech · July 2022

i have LAXA008 which has nice and comfy 0.3 steal, and 20k/s on nload. wonderful and have not experienced downtime on the hardware.

TYOC040 has been shitting on me for weeks with up to 30.0 steal and constant 500k/s nload. However it has never gone down.
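
For anyone who wants to compare steal across nodes the same way, here is a minimal sketch of how the number can be derived on Linux. It assumes the figures quoted above are the `st` percentage as shown by top, and it reads the aggregate `cpu` line of `/proc/stat`, where steal is the eighth counter. The function names are made up for illustration.

```python
import time


def cpu_times(stat_line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters."""
    fields = stat_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def steal_percent(before_line, after_line):
    """Steal time as a percentage of all CPU time between two samples."""
    before = cpu_times(before_line)
    after = cpu_times(after_line)
    deltas = [a - b for a, b in zip(after, before)]
    # Only the first eight counters are summed: user, nice, system, idle,
    # iowait, irq, softirq, steal. Guest time is already counted inside user/nice.
    total = sum(deltas[:8])
    if total <= 0 or len(deltas) < 8:
        return 0.0
    return 100.0 * deltas[7] / total


def sample_steal_percent(interval_seconds=1.0, path="/proc/stat"):
    """Take two samples from /proc/stat and return the steal percentage."""
    with open(path) as stat_file:
        before = stat_file.readline()
    time.sleep(interval_seconds)
    with open(path) as stat_file:
        after = stat_file.readline()
    return steal_percent(before, after)
```

Calling `sample_steal_percent(5)` on each VM at the same time of day gives numbers that are easier to compare than eyeballing top.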

tetech · July 2022

@VirMach said: Before I change LAX to also have the fine-tuned NIC driver configuration does anyone who has both LAX and somewhere else functional notice one performing better than the other? On my end, LAX is around 70% cleaner.

Network-wise, LAX currently fine;

DFW currently poor (to the extent I have shut down the VMs there);

PHX currently fine, not significantly different from LAX.

Mumbly · July 2022

@skorous said: If it's in production then don't migrate it.

Do you have a crystal ball which tells you why someone needs to migrate away from a certain node or location? I can think of various legit reasons like network issues, etc. Who are you to tell people what they need or don't need to do to fix some issue?

Papa · July 2022

@VirMach said:

FFME006 should be fine I think, if yours is down it may be unrelated. FFME005, I don't remember, but FFME004 definitely had an issue. Worked on them an hour or so ago and updated the BIOS settings to match Tokyo which has similar disks and no problems. Hopefully it'll stick. Only updated it for those dropping disks. Double checking them in about half an hour.

Well, I do see progress. At least at FFME003 my VM was booting into no disks. Now it's not booting at all, just like at FFME004.

VirMach · July 2022

@Papa said:

@VirMach said:

FFME006 should be fine I think, if yours is down it may be unrelated. FFME005, I don't remember, but FFME004 definitely had an issue. Worked on them an hour or so ago and updated the BIOS settings to match Tokyo which has similar disks and no problems. Hopefully it'll stick. Only updated it for those dropping disks. Double checking them in about half an hour.

Well, I do see progress. At least at FFME003 my VM was booting into no disks. Now it's not booting at all, just like at FFME004.

Finally got an important setting to stick on this one.

AlwaysSkint · July 2022

@VirMach
ATLZ007 has been in a flap all morning (started ~04:00 UTC+1), with Hetrix showing network going down for a minute each time, approx. 30 minutes apart. Anything funky going on? Abusers?

(Edit: sorry was wrong node being reported. D'oh! It's Atlanta, which has been fine lately, until now.)

Edit2: Seems to have stopped now - last/latest one:

Downtime: 2 min
Noticed at: 2022-07-06 10:10:37 (UTC+00:00)

Papa · July 2022

@VirMach said:

Finally got an important setting to stick on this one.

And it's alive now except network. Network reconfigure shows "Unknown error", but files at /etc/network are being rewritten.
networking.service shows warning that /etc/resolv.conf is not a symbolic link to /run/resolv.conf
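
A minimal sketch for checking that warning from inside the VM. It assumes the expected target is /run/resolv.conf, as the message says; the function name and the returned status strings are made up for illustration.

```python
import os


def resolv_conf_status(path="/etc/resolv.conf", expected_target="/run/resolv.conf"):
    """Report whether path is a symlink pointing at expected_target."""
    if not os.path.lexists(path):
        return "missing"
    if not os.path.islink(path):
        return "regular file"
    link = os.readlink(path)
    # A relative link is resolved against the directory that holds it.
    target = os.path.normpath(os.path.join(os.path.dirname(path), link))
    if target == expected_target:
        return "ok"
    return "symlink to " + target
```

If this returns "regular file", the reconfigure tool has most likely written a plain /etc/resolv.conf, which would match the warning from networking.service.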

skorous · July 2022

@Mumbly said:

@skorous said: If it's in production then don't migrate it.

Do you have a crystal ball which tells you why someone needs to migrate away from a certain node or location? I can think of various legit reasons like network issues, etc. Who are you to tell people what they need or don't need to do to fix some issue?

I'm not telling them they can't. I'm saying if you have a single production node then you've already made the choice that taking a potentially long outage is acceptable. Bitching because the provider "only" lets you migrate within a twenty four hour window is not taking responsibility for your own decisions.

Mumbly · July 2022

@skorous said:
I'm not telling them they can't. I'm saying if you have a single production node ...

Hey, stop a little bit here.
Single production node? Can you quote who said that? No need to make things up and then use them as a strawman to prove your own arguments.

Besides that, no one's bitching because the provider "only" lets you migrate within a twenty four hour window. Where did you get that from? There's no such provider. It's a hypothetical situation whose pros and cons we're discussing. If you can't accept that then that's on you.

skorous · July 2022

@Mumbly said: Single production node? Can you quote who said that?

You're right. You never said single production node. You said a 24h planned maintenance was too long. I inferred from that it must be an outage maintenance since otherwise who would care if it takes 24h.

@Mumbly said: Besides that, no one's bitching because the provider "only" lets you migrate within a twenty four hour window. Where did you get that from? There's no such provider. It's a hypothetical situation whose pros and cons we're discussing.

I think it was when you said:

@Mumbly said: This hobbyistic point of view may be damn annoying for someone who migrates stuff in production and wants to get it back online and work on it as soon as possible. Or at least within a scheduled time, so they can be around to put things back in order, not some random "up to 24 hours - so no sleep for you tonight" time.

AlwaysSkint · July 2022

Guys, find a room, please.

skorous · July 2022

@AlwaysSkint said: Guys, find a room, please.

No need. That'll be my last comment.

Mumbly · July 2022

@skorous said: You're right. You never said single production node. You said a 24h planned maintenance was too long. I inferred from that it must be an outage maintenance since otherwise who would care if it takes 24h.

I said also:

@Mumbly said: There are several solutions to prevent this migration downtime while you sleep, but generally speaking it's just unnecessarily annoying. (i.e. temporarily moving your stuff to another provider and moving it back after the maintenance, etc.)

And no, it's just a discussion of why a 24-hour window for PLANNED maintenance (migration) isn't the most user-friendly, as there's no way for you to know exactly when to be online to fix things and get them back up after the migration.

With securedragon and justhost.ru you simply push the migration button, your stuff gets migrated and you work on it. Pretty simple and efficient, with no need to worry that your stuff will reappear with a new IP on a new node at some unspecified time, let's say 20 hours later at 3:00am when you're in bed.
That's why I think the suggestion of a 24-hour migration window isn't the most efficient or user-friendly for those who actually host something on the VPS.

VirMach · July 2022

@Papa said:

@VirMach said:

Finally got an important setting to stick on this one.

And it's alive now except network. Network reconfigure shows "Unknown error", but files at /etc/network are being rewritten.
networking.service shows warning that /etc/resolv.conf is not a symbolic link to /run/resolv.conf

Unfortunately there's pretty much nothing we can do about that really. I really think SolusVM did some updates to that tool and broke it for older operating systems recently. They've been breaking a lot of things, like libvirtd incompatibility, the migration tool wasn't working for a while, the operating systems don't template properly and they haven't been syncing properly a lot of times. They've just been updating all their PHP versions and whatever else, racing forward without actually checking anything.

They broke the entire installer the other day.

@AlwaysSkint said:
@VirMach
ATLZ007 has been in a flap all morning (started ~04:00 UTC+1), with Hetrix showing network going down for a minute each time, approx. 30 minutes apart. Anything funky going on? Abusers?

(Edit: sorry was wrong node being reported. D'oh! It's Atlanta, which has been fine lately, until now.)

Edit2: Seems to have stopped now - last/latest one:

Downtime: 2 min
Noticed at: 2022-07-06 10:10:37 (UTC+00:00)

I'm seeing all the flapping. A lot's been flapping. People are getting situated which means testing network, and a lot of files are still flying around on our end. Unfortunately I can't focus on that right now and nearly impossible to get it all right as everything's happening right now. I did fix a server or two where the networking was completely unusable but flappy flaps are A-OK right now, it's the least of our problems.

VirMach · July 2022

@Mumbly said: And no, it's just a discussion of why a 24-hour window for PLANNED maintenance (migration) isn't the most user-friendly, as there's no way for you to know exactly when to be online to fix things and get them back up after the migration.

I agree with you on this. We didn't really have another choice for these. It's painful and bad, for us as well. There are probably 5% of people that have been stuck for 48 hours now and I don't like that but we're doing all we physically can.

Everything keeps breaking and I don't just mean on our end. We're using 10 year old PHP software to manage the virtual servers, so that alone really hurts.

@Mumbly said: With securedragon and justhost.ru you simply push the migration button, your stuff gets migrated and you work on it. Pretty simple and efficient, with no need to worry that your stuff will reappear with a new IP on a new node at some unspecified time, let's say 20 hours later at 3:00am when you're in bed.

That's why I think the suggestion of a 24-hour migration window isn't the most efficient or user-friendly for those who actually host something on the VPS.

The developer we hired delayed and then bailed on the project, but it was originally planned to work that way, and if you didn't do it yourself then you were going to be subjected to this mess.

Mumbly · July 2022

@VirMach that's a misunderstanding now.
I didn't comment on or criticize VirMach's migrations here, but discussed @yoursunny's suggestion (well, he made a few good suggestions, but this one I feel wasn't the best) about a 24-hour migration queue in the future.
