SolusVM cancelled services bug

drmike · Apr 23, 2015

Seems like SolusVM has a multi-year bug involving cancelled accounts whose containers are still online.

Customer has VPS service, but cancels. Yet the container is still online eons after termination should have happened.

Are other folks seeing this, along with zombie containers? Someone pegged this issue as having existed for at least two years.

Licensecart · Apr 23, 2015

sure it's not a billing system module issue?

drmike · Apr 23, 2015

Licensecart said:
sure it's not a billing system module issue?

Nope, but there are WHMCS bugs that are similar also.

This Solus one, multiple providers have mentioned in private.

SkylarM · Apr 23, 2015

Solus has tons of issues, phantom containers, extra creations, not terminating properly, removing from Solus but the container still remains on the node, etc. Nothing new really.

Geek · Apr 26, 2015

@drmike - If I recall correctly you're not a large fan of OVZ (or at least, when it's done wrong). Keep that in mind if I used too many layman's terms by our definition.

Having seen your opinions of some other OpenVZ providers but still wanting to help, I'm going out on a limb here!

So, here's my sleep-deprived diatribe about ghost containers....

While I don't doubt that SolusVM has issues in a lot of places with VZ (not to mention /usr/local/solusvm/tmp/extras/ has some fun stuff in there that doesn't look entirely safe), I can make zombies happen pretty easily even when I'm not trying to. More of an OpenVZ thing, I believe. It usually happens when a container slightly exceeds its resource threshold (physpages, in the examples below) and a failcnt is recorded. Kir told me a hell of a long time ago that it's "how it is": by default, beancounters aren't reset until the node is rebooted (or stopped, or terminated, or even migrated, for that matter).

Here are a couple of phantom containers that I have in production right now. I replaced all my X5650s with dual E5s this past month, and these are two examples of containers which migrated fine (no kid sis here) ... but notice physpages is just a hair over its limit on both containers? Kernel memory is being held somehow and/or reclamation might be askew. Anyway, that's pretty much what I believe creates a phantom. Never been able to "fix" it. There's a hokey ub-reset doc somewhere on the OVZ wiki that may work, although I think it did diddly shit for me.

The part where SolusVM could fit into this... is that according to the DB, the container no longer exists. SolusVM may try to use that CTID again during, say, an automated provision, and perhaps is running into the /etc/vz/conf/CTID.conf.destroyed file and attempting to restore from that instead of starting a fresh config.

Code:

uid   resource      held    maxheld              barrier                limit  failcnt
2312: kmemsize       319  147173376            268435456            268435456        0
      lockedpages      0       1023  9223372036854775807  9223372036854775807        0
      privvmpages      0     411797  9223372036854775807  9223372036854775807        0
      shmpages         0     262335  9223372036854775807  9223372036854775807        0
      dummy            0          0  9223372036854775807  9223372036854775807        0
      numproc          0        150  9223372036854775807  9223372036854775807        0
      physpages       63     131100                    0               131072        0
      vmguarpages      0          0                    0  9223372036854775807        0
      oomguarpages     0     170449                    0  9223372036854775807        0
      numtcpsock       0         40  9223372036854775807  9223372036854775807        0
      numflock         0         41  9223372036854775807  9223372036854775807        0
      numpty           0          1  9223372036854775807  9223372036854775807        0
      numsiginfo       0         48  9223372036854775807  9223372036854775807        0
      tcpsndbuf        0     697600  9223372036854775807  9223372036854775807        0
      tcprcvbuf        0    4100920  9223372036854775807  9223372036854775807        0
      othersockbuf     0     603584  9223372036854775807  9223372036854775807        0
      dgramrcvbuf      0       8720  9223372036854775807  9223372036854775807        0
      numothersock     0         48  9223372036854775807  9223372036854775807        0
      dcachesize       0  134217728            134217728            134217728        0
      numfile          0       1995  9223372036854775807  9223372036854775807        0
      dummy            0          0  9223372036854775807  9223372036854775807        0
      dummy            0          0  9223372036854775807  9223372036854775807        0
      dummy            0          0  9223372036854775807  9223372036854775807        0
      numiptent        0      14504  9223372036854775807  9223372036854775807        0
2230: kmemsize       319  135000064            805306368            805306368        0
      lockedpages      0         12  9223372036854775807  9223372036854775807        0
      privvmpages      0     439581  9223372036854775807  9223372036854775807        0
      shmpages         0       2017  9223372036854775807  9223372036854775807        0
      dummy            0          0  9223372036854775807  9223372036854775807        0
      numproc          0        180  9223372036854775807  9223372036854775807        0
      physpages        1     393250                    0               393216        0
      vmguarpages      0          0                    0  9223372036854775807        0
      oomguarpages     0     278749                    0  9223372036854775807        0
      numtcpsock       0        120  9223372036854775807  9223372036854775807        0
      numflock         0        318  9223372036854775807  9223372036854775807        0
      numpty           0          2  9223372036854775807  9223372036854775807        0
      numsiginfo       0         87  9223372036854775807  9223372036854775807        0
      tcpsndbuf        0   16773712  9223372036854775807  9223372036854775807        0
      tcprcvbuf        0    7803408  9223372036854775807  9223372036854775807        0
      othersockbuf     0     724904  9223372036854775807  9223372036854775807        0
      dgramrcvbuf      0     210848  9223372036854775807  9223372036854775807        0
      numothersock     0        239  9223372036854775807  9223372036854775807        0
      dcachesize       0   97813079            402653184            402653184        0
      numfile          0       3555  9223372036854775807  9223372036854775807        0
      dummy            0          0  9223372036854775807  9223372036854775807        0
      dummy            0          0  9223372036854775807  9223372036854775807        0
      dummy            0          0  9223372036854775807  9223372036854775807        0
      numiptent        0      17650  9223372036854775807  9223372036854775807        0
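For anyone who wants to hunt for the same pattern on their own node, here is a rough Python sketch. It assumes the text comes from /proc/user_beancounters with the usual column order (held, maxheld, barrier, limit, failcnt), and it treats "numproc held is 0 but something else is still held" as the sign of a phantom. That rule is only my reading of the dump above, not an official OpenVZ definition, and the function names are made up for illustration.

```python
from collections import defaultdict

# Trimmed sample in /proc/user_beancounters layout, using rows from the dump above.
SAMPLE = """\
uid   resource      held    maxheld    barrier      limit  failcnt
2312: kmemsize       319  147173376  268435456  268435456        0
      numproc          0        150  9223372036854775807  9223372036854775807  0
      physpages       63     131100          0     131072        0
2230: kmemsize       319  135000064  805306368  805306368        0
      numproc          0        180  9223372036854775807  9223372036854775807  0
      physpages        1     393250          0     393216        0
"""


def parse_beancounters(text):
    """Return {ctid: {resource: (held, maxheld, barrier, limit, failcnt)}}."""
    containers = defaultdict(dict)
    ctid = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # A line such as "2312: kmemsize ..." starts a new container.
        if fields[0].endswith(":") and fields[0][:-1].isdigit():
            ctid = int(fields[0][:-1])
            fields = fields[1:]
        # Skip header lines and anything that is not "name + 5 numbers".
        if ctid is None or len(fields) != 6:
            continue
        if not all(f.isdigit() for f in fields[1:]):
            continue
        if fields[0] == "dummy":
            continue
        containers[ctid][fields[0]] = tuple(int(f) for f in fields[1:])
    return dict(containers)


def find_phantoms(containers):
    """Containers with no processes that still hold resources (assumed heuristic)."""
    phantoms = {}
    for ctid, resources in containers.items():
        numproc = resources.get("numproc")
        if numproc is None or numproc[0] != 0:
            continue
        leftover = {name: vals[0] for name, vals in resources.items() if vals[0] > 0}
        if leftover:
            phantoms[ctid] = leftover
    return phantoms


def over_limit(containers):
    """List (ctid, resource, maxheld, limit) where the peak exceeded the limit."""
    hits = []
    for ctid, resources in containers.items():
        for name, (_held, maxheld, _barrier, limit, _failcnt) in resources.items():
            if maxheld > limit:
                hits.append((ctid, name, maxheld, limit))
    return hits


if __name__ == "__main__":
    parsed = parse_beancounters(SAMPLE)
    print(find_phantoms(parsed))
    print(over_limit(parsed))
```

On a real node you would feed it the contents of /proc/user_beancounters instead of SAMPLE; container 0 (the node itself) always has processes, so the heuristic skips it.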

Now ... if you mean a client cancels, termination module runs, but absolutely nothing was even attempted according to vzctl.log, I'm afraid I don't know. Blame Ploop.

Or add VE_STOP_MODE="stop" to vz.conf and prevent the damn things from checkpointing in the first place ... 500 day uptime my ass...

Anyway, I'm pretty sure it's got something to do with checkpointing. When a CT is restored, for some reason all of the mount points get all screwed up inside the container. Sure, by default all of the containers are mounted with the node's memory segments, which creates unnecessary load on the HWN, but hey, at least people get to show off their uptime. Setting VE_STOP_MODE="stop" in vz.conf instead of using "suspend" fixes most of the issues with phantoms, I've found.

I hope I gave ya somethin' worthwhile.

MannDude · Apr 26, 2015

I love Solus in the sense that it's generally a pretty decent product, what other commercially available options exist in the same price range? Though it does have a lot of skull-numbing issues that make you wonder why in the hell they have not been properly addressed yet.

In an industry (virtual server hosting) where you own the most commonly used software with no direct competitors, your motivation to make it rock solid just doesn't seem to be there. I'm sure all the issues with Solus would be worked out if they had a smaller share of the VPS control panel base and were forced to actually fix shit to stay competitive.

For years now, you still cannot provision VMs automatically with 'secure' passwords. Special characters break that. I've seen orders that were automatically marked as fraud in WHMCS get provisioned on the node anyhow, commands issued via CLI on the node don't appear to be reflected in the SolusCP, etc.

But seriously, what is the explanation for the password issue? It reflects very poorly on the software, and there have been many times in the past I had to explain to customers that you cannot use 'secure' passwords while ordering a VPS as the VPS will not set up, and I give them new root details with a less secure password. (Randomly generated camel case password vs randomly generated camel case + special character PW.) Not all customers are aware or believe that it is an 'issue with the software used to provision and manage virtual servers' and instead think it's a flaw with the company they are ordering from.

Geek · Apr 26, 2015

This is a production CT that was checkpointed and restored for a live migration. Left 300k held in physpages, created a phantom on the source. Nice allocation on the destination though. Half of the node's RAM. Gee whiz. I honestly can't say whether this is merely a display bug that has no impact on overall performance, or if I'm just officially nuts, but I've looked at it from quite a few angles. Hell, now I turn off checkpointing as a courtesy whenever I do a private job, or it literally takes an hour and unacceptable I/O levels to restore 80+ containers. Rebooting them takes maybe 15-20 minutes to normalize the load. The result was fewer instances of these "phantom" containers, at least in the cases of the providers I've assisted.

Code:

root@hexaline [~]# uptime
 14:59:42 up 101 days,  3:11,  0 users,  load average: 0.06, 0.04, 0.01
root@hexaline [~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/simfs       60G   28G   33G  47% /
none             48G  4.0K   48G   1% /dev
none             48G     0   48G   0% /dev/shm
tmpfs            48G   24M   48G   1% /tmp
tmpfs            48G   24M   48G   1% /var/tmp

Munzy · Apr 26, 2015

I actually think it is a missing cronjob in whmcs....

Next Due Date:
01/07/2015

Status: Online

mitgib · Apr 26, 2015

Munzy said:
I actually think it is a missing cronjob in whmcs....

Next Due Date:
01/07/2015

Status: Online

I don't know whose problem it is, but I will point a finger at SolusVM since I use their WHMCS module, and I have run into this many times over. The VPS is terminated in WHMCS and from Solus but still is active on the node, so when the IP is back in the pool, and gets re-assigned, this new VPS will not start because of the IP conflict.

nuweb · Apr 26, 2015

mitgib said:
I don't know whose problem it is, but I will point a finger at SolusVM since I use their WHMCS module, and I have run into this many times over. The VPS is terminated in WHMCS and from Solus but still is active on the node, so when the IP is back in the pool, and gets re-assigned, this new VPS will not start because of the IP conflict.

That definitely sounds like a SolusVM issue to me, glad I've never experienced it.

lowesthost · Apr 26, 2015

MannDude said:
I love Solus in the sense that it's generally a pretty decent product, what other commercially available options exist in the same price range?

Virtualizor. Can't speak for the OpenVZ implementation because we don't use it, but we have not had many issues with Xen/KVM. If there is a bug, they listen and fix it quickly.

KuJoe · Apr 26, 2015

I've seen this enough times with SolusVM that I coded something specifically in Wyvern to alert me if this happens.

In my experience, the most common reason for this issue is WHMCS/SolusVM not being able to talk to the VPS node, and WHMCS/SolusVM don't retry (at least WHMCS gives a SUCCESS or FAIL message in the cron job e-mail now). I have seen this for OpenVZ, Xen, and KVM alike, so it's not specific to one virtualization type.

When I did see this for OpenVZ, there was a 50/50 chance that the VPS was in the "mounted" state though.
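A check along these lines can be sketched in a few lines of Python. To be clear, this is not Wyvern's code, just one way to compare what the node reports against what the panel or billing system believes is active. It assumes you capture the output of `vzlist -a -H -o ctid,status` on the node and can export the list of active CTIDs from your panel or billing database by whatever means you have. The sample data and function names below are made up for illustration.

```python
def parse_vzlist(output):
    """Parse `vzlist -a -H -o ctid,status` text into {ctid: status}."""
    containers = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].isdigit():
            containers[int(parts[0])] = parts[1]
    return containers


def find_orphans(node_containers, panel_ctids):
    """Containers present on the node that the panel no longer knows about."""
    return {
        ctid: status
        for ctid, status in node_containers.items()
        if ctid not in panel_ctids
    }


# Illustrative data only, not real output from anyone's node.
SAMPLE_VZLIST = """\
      2230 mounted
      2312 running
      8898 running
"""
PANEL_CTIDS = {8898}  # hypothetical export of CTIDs the panel considers active

if __name__ == "__main__":
    orphans = find_orphans(parse_vzlist(SAMPLE_VZLIST), PANEL_CTIDS)
    for ctid, status in sorted(orphans.items()):
        print(f"CT {ctid} is {status} on the node but missing from the panel")
```

Run from cron and mailed to yourself, something like this would at least surface the leftovers, including the ones stuck in the "mounted" state, before a recycled IP causes a conflict.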

drmike · Apr 26, 2015

Geek said:
@drmike - If I recall correctly you're not a large fan of OVZ (or at least, when it's done wrong). Keep that in mind if I used too many layman's terms by our definition.

I hope I gave ya somethin' worthwhile.

Very interesting and that's something for me to chew on. I am overdue to finally just set up a personal node for education and experiment abuse.

This post is getting printed out and thrown in my "future look at" file.

mitgib said:
I don't know whose problem it is, but I will point a finger at SolusVM since I use their WHMCS module, and I have run into this many times over. The VPS is terminated in WHMCS and from Solus but still is active on the node, so when the IP is back in the pool, and gets re-assigned, this new VPS will not start because of the IP conflict.

I think @mitgib is right on that. Heard this exact experience with other providers.

I'll do some leg work @Munzy and maybe track that down and all and see why the container is still lingering with that provider.

Seems like *A LOT* of providers have the phantom container issue and even those auditing seem to be missing containers that shouldn't be online.

Munzy · Apr 26, 2015

The worry is not for my containers, but the containers that are no longer secured.

In my case if I find these containers I generally add some stuff to monitor and auto upgrade them just to make sure they are secure.

Geek · Apr 27, 2015

mitgib said:
I don't know whose problem it is, but I will point a finger at SolusVM since I use their WHMCS module, and I have run into this many times over. The VPS is terminated in WHMCS and from Solus but still is active on the node, so when the IP is back in the pool, and gets re-assigned, this new VPS will not start because of the IP conflict.

One possible band-aid (at least for the I.P. conflict) is to amend your vz.conf so that:

SKIP_ARPDETECT="no"

ERROR_ON_ARPFAIL="yes"

These were in vz.conf as of vzctl 4.7. Unfortunately SolusVM now overwrites this with a jacked up vanilla vz.conf from about 6 years ago that, from a performance standpoint, should have died with UBC.... I added it back because, well, I'm human, I've transposed an I.P. or two in my day, and, while scaling up over the years, gained a nice collection of /26's, so the only way to prevent the pool from dropping from an accidental conflict was to re-enable this and a few more recent enhancements that Solus should have made prior to 2011.
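Since the file can get overwritten behind your back, a small drift check may be worth having. The Python sketch below assumes vz.conf is plain shell-style KEY="value" lines and compares it against the settings mentioned in this thread (the two ARP options here plus the VE_STOP_MODE change suggested earlier). The WANTED table and the function names are my own for illustration, so adjust them to whatever you actually want enforced.

```python
import shlex

# Settings discussed in this thread; edit to taste.
WANTED = {
    "SKIP_ARPDETECT": "no",
    "ERROR_ON_ARPFAIL": "yes",
    "VE_STOP_MODE": "stop",
}


def parse_vz_conf(text):
    """Parse shell-style KEY="value" lines into a dict, ignoring comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # shlex strips the quotes and drops any trailing "# comment".
        tokens = shlex.split(value, comments=True)
        settings[key.strip()] = tokens[0] if tokens else ""
    return settings


def config_drift(text, wanted=WANTED):
    """Return {key: (current, wanted)} for every setting that differs or is missing."""
    current = parse_vz_conf(text)
    return {
        key: (current.get(key), value)
        for key, value in wanted.items()
        if current.get(key) != value
    }


# Made-up example of a config that has been reset to older defaults.
SAMPLE_CONF = """\
# Global parameters
VIRTUOZZO=yes
SKIP_ARPDETECT="yes"   # reset by the installer
VE_STOP_MODE=suspend
"""

if __name__ == "__main__":
    for key, (have, want) in config_drift(SAMPLE_CONF).items():
        print(f"{key}: have {have!r}, want {want!r}")
```

Pointing it at the contents of /etc/vz/vz.conf after every panel update would tell you when the settings have been reverted, without having to diff the whole file by hand.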

Tonight I decided to QA OpenVZ on Wheezy from my personal machine, "nested" inside KVM. I won't call it a slab since it's not in production, but I was astonished by the awesome.

http://x3.jetfi.re/dev/vz.conf.txt - a proper, recent vz.conf from a manual build

http://x3.jetfi.re/dev/vz-solus.conf.txt - an old, ambiguous, non-vswap config with most of the goodies turned off or completely missing.

People are better off doing a reinstall of vzctl after SolusVM finishes installing. @Francisco's Wheezy HWN configurations are actually much more clear to me now.

Migrations took note of the directory structure and accounted for it properly during the sync, and unmounted as expected. Maybe next weekend I'll re-enable my WHMCS module; @Munzy has a good point, as I recall this happening as well. I'm not in a position to automate my builds/teardowns, I'm small enough to still want to get to know my clients a bit... but I may attach a QA node to my WHMCS QA and lend a hand.

Francisco · Apr 27, 2015

What's that? Fran's right again?

Aww you shouldn't have.

There's logic to all of my madness, I assure you. I have no personal life (that's a bad thing) and I spend a retarded amount of time doing research testing/etc of ideas. There's countless hosts that try to implement our features/ideas/jacked research and don't have a fucking clue what they're doing or why it's done.

Francisco

Geek · Apr 29, 2015

See when I hear "phantom container" I think of the same old bug that's been in OpenVZ for probably two years if not longer. These are my QA containers that I just moved from a CentOS node to a Debian 7.8 QA node. This image is from the CentOS side. Guess which containers in the screen below have already been migrated but are still mounted somewhere on the source node? Ding-ding-ding! If you said "every single fucking one of them", you win!

CT 8898 is the newest one, and the 10/1200s are about three years old, but I've migrated older containers before...that's why I think at least this part is an OpenVZ thing. They don't even have to be migrated, sometimes you can stop or reboot the CT and the beancounters won't reset properly. Another question to ask at Linux Plumbers this summer....
