@VirMach said: NYCB042, NYCB048, and NYCB009 backups are finally completed.
NYCB027 still needs further repair, NYCB006 is fully functional (as are NYCB009, NYCB042, NYCB008 still.)
Reading about all those NYCB0 nodes all the time, something really puzzles me. I hope I won't jinx it, but how on earth does NYCB028, where I'm hosted, just work... all the time? Not a single outage or anything. Zero, zip, zilch, nada!
The NYCB028 machine must feel like a real champion!
NYC is cursed. Can't ever seem to get anything working there when it fails until it gets sent elsewhere. Of course it could just be the hands there.
I think at this point we've had about half of the servers come back one way or another. Outside of a few of them, it's definitely something that should have been fixable if we had gotten the right information back. For example, the server with the toasted CPU: if only we had gotten back any error codes, or anything at all. We asked for a power cycle, and if the CPU is in that state it won't power cycle, so you'd think that would be something that would get mentioned. Nope. I think out of a dozen times we've asked for error codes, we've gotten them back once or twice. One thing they never forget to do, though, is bill their exorbitant rates.
Then, any time it's a BIOS or BMC-related issue, it's nearly impossible to get them to flash it for us, outside of xTom and QN, which have processed those for us correctly.
I do feel bad for not being able to jump onto all of them sooner as I had planned, and basically making the situation worse, but it's still possible that if they had stayed we'd still be struggling to get them back up and be worse off. A lot of these CPUs, when they get popped out, get stuck to the heatsink due to the latch design, and I won't pretend they don't get bent all the time by the techs. It just seems like the more requests we have to do, the more things break on them. A CPU could be fine and then break after a thermal paste re-application request. I did find thermal paste underneath one of them, and it was one where we did ask for the CPU to get reseated. I'm not pointing any fingers, but I know when I sent them out I was careful not to do that. Others, they took off the shroud (the thingy that directs air to the right spots) and then didn't latch it back on, and so on.
But yeah now that you've mentioned NYCB028, it's definitely jinxed. I'll keep an extra eye on it.
Also, correction: NYCB048, not NYCB008.
Okay, here's the current plan: backups of whatever we have so far are going to get dropped off at QN LAX and plugged in.
I've started the conversation with HV to see if it's possible to get what needs to be done today for four servers to be racked there. IPs need to move over from one region to another, but they're already announced with them AFAIK.
Four servers to be racked:
NYCB009
NYCB006
NYCB048
NYCB042
NYCB027 needs more work and will stay here; backups on it are also not complete yet. I'm working on that now in combination with NYCB011. This is the current plan, but it doesn't mean it won't have to be changed again. Waiting to hear back from HV, then configuring the IPMI on all of these, confirming all BMC/BIOS are updated, giving them a final kiss goodbye, and driving them down there.
If any of the four break, I can drive down there and grab it again or we'll have backups at QN LAX to upload instead.
So does this mean there might be shared hosting online this weekend?
Calm down your horses, don't even think about it.
Dropping off NYCB042, NYCB048, NYCB006, and NYCB009 tonight to HV LAX. I wanted to do it earlier but I got caught up locating rails and getting the networking sorted out and LA traffic to downtown is nightmarish around this time.
They did say they would be able to get it all set up tonight, and I've confirmed with them that they have ethernet cables there, and I have power cables and the rails ready, with all the ports mapped out with the networking team and networking configured already here. All BIOS updated a day or two ago, same with BMC firmware. So they should slide in and work quickly.
Then, I haven't decided yet, but either on the way there or on the way back I'll drop off the external drives at QN LAX.
@rockinmusicgv said: So does this mean there might be shared hosting online this weekend?
Yes, if it remains stable. We have to redo IP addresses, and also do that for the other three nodes, so I don't know exactly how that'll go down, but I'd like to set it up so that people can at least take their backups/data for shared hosting.
NYCB009 will become LAX2Z017
NYCB006 will become LAX2Z016
NYCB048 will become LAX2Z015
NYCB042 will become LAX2Z014
The last two I'm not 100% sure on, they may be flipped.
Then NYCB027 will most likely be done tomorrow if we don't run into further issues, and I'll drive down another external and upload from there. For the other four above, we're going to try to avoid that, but we'll still have the externals around just in case. As I mentioned, NYCB006 didn't actually have any RAID controller issues, IIRC; it was an easy fix and again I didn't see much wrong with it. Perhaps this was one where the IPMI on it just didn't work. In any case, I didn't take any additional backups for this one on an external, so this is the only one where I'll have to drive down to grab it if something goes wrong. I would've taken one, but by the time I got to it I had run out of externals. I should have another available by tomorrow for NYCB027 though. I guess we'll see if this was a mistake or not in the coming days.
@Jab said: Bought a car that can transport servers or rented U-haul?
2 seater, one seat for me, one seat for 4 servers. I never said I'm doing it efficiently.
I was going to make four trips from the parking lot until they said they had a dolly (well, actually three trips; two of them are small enough to take together, I guess). It's a stick shift, which is the main reason for not wanting to do it during traffic; I'm getting old and my left foot goes numb from having to use the clutch pedal 200 times.
NYCB027 backups are being processed now. I noticed the power cable I used for it wasn't working and had to switch it, so it might have just been a bad power cable. The transfer is still going abnormally slowly, so at least one of the disks is being weird; we'll see if it completes. In any case, we'll be moving these people off the node and won't be re-using it.
So far it's stable for backups.
Did I miss something? Is one of the existing LAX nodes having a problem? I noticed one of my VPSes went down a few days ago and the node shows as locked.
@VirMach Did NYCB009SH get any hardware upgrade after transferring to LAX2Z017?
From what he's said in the past, there might be hardware swaps to fix the problems, but probably not any upgrades.
So LAX2Z017 will still be on the Intel platform instead of AMD Ryzen, correct?
@FrankZ I had about 5 minutes of downtime an hour back, and now IPv6 is working! It's a TYO storage node, so I'm guessing TYO v6 is fixed.
Thanks for the heads up.
Yes, the IPv6 on my Tokyo VM appears to be functioning normally now.
Thank you @VirMach 👍
Indeed, thanks @VirMach . That was pretty fast.
I haven't configured the IPv6 address on my TYO VM.
What happened to the Ryzen compatible Debian 11 template?
My VPS has Ryzen compatible Debian 11, and I went to the reinstall section, but the Debian 11 template is gone. Only Ryzen compatible Debian 10 is available.
NYCB009SH/LAX2Z017 is still down. Wondering what's happening. @VirMach
We're waiting on HV for them. I'll most likely have to drive down to the DC again today once they handle the networking portion, as they don't have anyone available over the weekend.
We requested that networking be configured beforehand, but it looks like it's not set up properly. The two out of the four servers I have access to right now are NYCB009 and NYCB006. Networking is set up and the gateway is pinging, and it's detecting the link, but it's not going anywhere beyond being able to communicate with the switch.
By the time we wrapped up racking the servers and troubleshooting a little bit at the datacenter, it got pretty late so I wasn't able to drop off the externals at QN but I'll be doing that today, especially if HV is unable to fix networking.
I'd love to take credit for this, but I didn't do anything. I actually didn't even get around to contacting xTom, so if anything it's even better: they monitor everything on their end and fix it without us having to go to them. I don't think I could say the same about any of our other DC partners; not that they're expected to do that, it's just nice.
@BNNY said: @VirMach Did NYCB009SH get any hardware upgrade after transferring to LAX2Z017?
I swapped out the RAM and CPU, just in case. The RAID controller is still the same one. I took out a lot of the extra hard drives and SSDs just in case one of those was acting up; those are used for backups, so I have them here for an additional copy in case the other two somehow get destroyed. We still have enough space for backups, just fewer of them, and moving forward we'll move the extra backups off to a remote location instead.
The CPU is actually a downgrade from a 5950X to a 3900X; however, we don't have a CPU bottleneck for shared hosting, and it'll benefit from running cooler in this case.
I also took out an extra NVMe SSD which we weren't using, again to try to free up PCIe lanes in case the motherboard is doing some weird splitting and causing the hardware RAID controller to run into problems. With all that said, even with those changes, the previous original problems still appear to be there, so once it's back up I'll have to continue working on that. I'm very confident now that our original diagnosis of it being a kernel issue is correct. We still have enough SATA SSDs on there that we could end up moving everyone off the hardware NVMe controller onto these for the time being, should it continue proving difficult to get it into a stable state.
@skorous said: Did I miss something? Is one of the existing LAX nodes having a problem? I noticed one of my VPSes went down a few days ago and the node shows as locked.
I'm assuming you're on LAXA032. I got this fixed and going again, but I'm leaving it in a locked state for a few days to make sure it's still stable. The type of issue it was facing means a lot of people may have re-installed, and if they spam re-installs it could destabilize the server quickly. That's the main reason we keep certain nodes locked for several days after maintenance.
I forgot that HV's advertised 24x7 means 16x5, and it's the weekend. No one is at the facility; let's see if they'll let me have facility access at least. And there's no networking team to correct any potential misconfiguration on their end. I've asked them to see if anything can be done upstream, since the issue could also lie there.
If I can get facility access, I'll attach a KVM to see if I can get the other two IPMIs working; that portion might have been my mistake in the BMC configuration, since the other two are working.
I'm also going to check the switch and make sure everything's good there, but again, these servers can communicate with the switch just fine, so it has to be a routing issue elsewhere.
NYCB027 backups had to be paused yesterday, as I had to disconnect it to get to the other servers I was taking to the datacenter. I'm going to resume that while I figure things out with HV, but since that's not looking good, the next plan would be to take the externals down to QN as originally planned so we can begin restoring what we can over the weekend, since we may have to wait until Monday for HV to get the four racked servers going.
**Update**: we had two more IP blocks at HV LAX2, so I'm using those instead and it's working. NYCB009 and NYCB006 are being worked on now, with connectivity. Looks like the other four /24s weren't done properly then; that's waiting on HV until at least tomorrow to fix.
Not sure how busy I'll be today, but I'm going to try to drive down there and fix IPMI on NYCB042 and NYCB048, and/or take the externals to QN. Let's see how it goes for NYCB006 and NYCB009 right now though.
Ahhhhh, right on. Thank you sir.
Is the control panel broken for anyone else? It's displaying
Error - Connection Unencrypted This system will not operate over an unencrypted connection. You will now be forwarded to a secure connection.
for all of my servers. If I try to load the solusvm URL directly, it infinitely redirects to itself.
I don't think this is the cause of the panel not working, but the code has if(window.location.protocol!="https") which will always be true, because window.location.protocol returns scheme:, like https:, including a colon.
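For reference, a minimal sketch of the kind of redirect guard being described and how the comparison would need to change. This is illustrative only, assuming a simple "force HTTPS" check, and is not SolusVM's actual source:

```js
// Sketch of the protocol check described above (assumed simplification, not
// SolusVM's real code). window.location.protocol includes the trailing colon,
// e.g. "https:" or "http:".

// Buggy form: "https:" never equals "https", so this is always true and the
// page keeps redirecting to itself even when it is already on HTTPS:
//   if (window.location.protocol != "https") { /* redirect */ }

// Corrected form: compare against "https:" including the colon.
if (window.location.protocol !== "https:") {
  // Only reached over plain HTTP; forward to the HTTPS version of the same URL.
  window.location.href =
    "https://" + window.location.host + window.location.pathname + window.location.search;
}
```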
@MallocVoidstar said: I don't think this is the cause of the panel not working, but the code has if(window.location.protocol!="https") which will always be true, because window.location.protocol returns scheme:, like https:, including a colon.
Same here, the control panel seems to be out of control.