# Heads up: OpenVZ updates will probably break your system



## sean (May 7, 2014)

Hi Guys,

Many of you may already know, but for those of you who do not: be very careful when updating your OpenVZ host nodes in the future.

We have just been stung by two major changes that crept into vzctl version 4.7 released 15-April-2014:


The default container layout has changed from simfs to ploop. This broke all new containers created by our system. We had to fix this by setting the following in vz.conf:

VE_LAYOUT=simfs

A new option, --netfilter, has been added to vzctl. This broke NAT/connection tracking and most other netfilter modules for containers. We fixed this by adding --netfilter full.

It's a little worrying that changes like this made it into openvz.org's RHEL/CentOS repository!!
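For reference, the two fixes in command form — a sketch only, assuming a stock CentOS host; 101 is a placeholder CTID and your paths may differ:

```shell
# Pin the old default so newly created containers keep using simfs (fix 1).
echo 'VE_LAYOUT=simfs' >> /etc/vz/vz.conf

# Restore the full netfilter stack inside an existing container (fix 2);
# takes effect after the container is restarted.
vzctl set 101 --netfilter full --save
vzctl restart 101
```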


----------



## zionvps (May 7, 2014)

It's probably better to only update when SolusVM and the other panels roll out updates with their own configured kernels


----------



## 5n1p (May 7, 2014)

For nat you can:


nano /etc/modprobe.d/openvz.conf
and then replace 


options nf_conntrack ip_conntrack_disable_ve0=1
with 


options nf_conntrack ip_conntrack_disable_ve0=0
Source: http://serverfault.com/questions/593263/iptables-nat-does-not-exist

At least it worked for me.
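If you'd rather script that edit than open nano, a one-liner sketch — tested here against a temp copy; the real file is /etc/modprobe.d/openvz.conf:

```shell
# Work on a temp copy so you can diff before touching the real file.
conf=$(mktemp)
printf 'options nf_conntrack ip_conntrack_disable_ve0=1\n' > "$conf"

# Flip the toggle from 1 to 0, re-enabling conntrack for the host (VE0).
sed -i 's/ip_conntrack_disable_ve0=1/ip_conntrack_disable_ve0=0/' "$conf"
cat "$conf"
```

A reboot (or module reload) is still needed for the change to take effect.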


----------



## SkylarM (May 7, 2014)

Sean,

We've deployed a few servers since the updates with Solus and haven't had issues. We've gone in and re-enabled conntrack though of course. The bulk of providers are likely fine as they aren't running anything custom.

Still find it rather amusing that OpenVZ now defaults to ploop, but doesn't install the ploop package on initial install.


----------



## blergh (May 7, 2014)

zionvps said:


> It's probably better to only update when SolusVM and the other panels roll out updates with their own configured kernels


I hope you are joking.


----------



## KuJoe (May 7, 2014)

The warning shouldn't be "be careful when updating your OpenVZ host nodes in the future", it should be "*ALWAYS READ THE EFFING CHANGELOG BEFORE YOU UPDATE YOUR SOFTWARE*".


----------



## Magiobiwan (May 7, 2014)

The fact is that OpenVZ neglected to throw warnings when attempting to use --iptables (they didn't even provide a deprecation period...). ALSO, they didn't even add the ploop package to the dependencies list on the latest vzctl RPMs so it doesn't get installed automatically. We ran into issues with it trying to default to ploop BUT not having the ploop libs installed. Thankfully that only affected rebuilds and new provisions. OpenVZ is just such a mess...


----------



## perennate (May 7, 2014)

Magiobiwan said:


> OpenVZ is just such a mess...


https://lists.openvz.org/mailman/listinfo/devel

go there to help


----------



## sean (May 7, 2014)

5n1p said:


> For nat you can:
> 
> 
> nano /etc/modprobe.d/openvz.conf
> ...


That's nothing to do with the issue I'm afraid.


----------



## sean (May 7, 2014)

KuJoe said:


> The warning shouldn't be "be careful when updating your OpenVZ host nodes in the future", it should be "*ALWAYS READ THE EFFING CHANGELOG BEFORE YOU UPDATE YOUR SOFTWARE*".


As we're running an enterprise distribution, changes like this should not be making it into their repositories for RHEL/CentOS. If we wanted to be subject to that crap we'd be on a rolling release distribution.


----------



## SkylarM (May 7, 2014)

sean said:


> As we're running an enterprise distribution, changes like this should not be making it in to their repositories for RHEL/CentOS. If we wanted to be subject to that crap we'd be on a rolling release distribution.


So because you're on an "enterprise distribution" that means you shouldn't read changelogs prior to updates?


----------



## KuJoe (May 7, 2014)

sean said:


> As we're running an enterprise distribution, changes like this should not be making it in to their repositories for RHEL/CentOS. If we wanted to be subject to that crap we'd be on a rolling release distribution.


The OpenVZ repos have nothing to do with RHEL/CentOS, they are independent of each other. Any company that does not read the changelog AND does not install the software in dev before production has bigger problems than can be addressed on a public forum.


----------



## Kruno (Jun 20, 2014)

Has anyone managed to convert existing Nodes to ploop? Any major issues down the road?


----------



## devonblzx (Jun 20, 2014)

We switched over nearly a year ago.  Ploop is the way to go now.   I wrote a script to help the migration from simfs to ploop.  The vzctl conversion process takes the VPS offline during the conversion which could be hours of downtime for large virtual servers.  My script only takes the server down for a couple of minutes while rsync syncs up files.

http://blog.byteonsite.com/?p=10


----------



## Magiobiwan (Jun 20, 2014)

It's worth noting that if you do want to switch to ploop, if you change the VE_LAYOUT option, as people rebuild their VPSes or as new ones are created, they'll switch over without affecting the ones using simfs. Likewise for going the other way around.
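For example — a sketch; vzctl accepts a per-container --layout override, and the CTIDs and template name here are placeholders:

```shell
# One-off ploop container on a node whose vz.conf default is simfs;
# existing simfs containers are untouched.
vzctl create 102 --layout ploop --ostemplate centos-6-x86_64

# And the reverse: force simfs on a node defaulting to ploop.
vzctl create 103 --layout simfs --ostemplate centos-6-x86_64
```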


----------



## Kruno (Jun 21, 2014)

devonblzx said:


> We switched over nearly a year ago.  Ploop is the way to go now.   I wrote a script to help the migration from simfs to ploop.  The vzctl conversion process takes the VPS offline during the conversion which could be hours of downtime for large virtual servers.  My script only takes the server down for a couple of minutes while rsync syncs up files.
> 
> http://blog.byteonsite.com/?p=10


Nice script, but it didn't work as expected. 

It took me a while, but I fixed it for everyone else who will need it. The script does a nice job:


#!/bin/sh
# Usage: ./convert VEID
rsync_options='-aHv'
partition='vz'
if [ ! -e /etc/vz/conf/$1.conf ]; then
    echo "Virtual server configuration file: /etc/vz/conf/$1.conf does not exist."
    exit 1
fi
if [ -d /$partition/private/$1/root.hdd ]; then
    echo "Server already has a ploop device"
    exit 1
fi
if [ ! -d /$partition/private/$1 ]; then
    echo "Server does not exist"
    exit 1
fi
# Get disk space (in GB) of the current VPS
disk=`vzctl exec $1 df -BG | grep simfs | awk {'print $2'} | head -n1`
if [ -z "$disk" ]; then
    echo "Could not retrieve disk space figure. Is the VPS running?"
    exit 1
fi
# Create and mount the new ploop filesystem under a temporary CTID (1000$1)
mkdir -p /$partition/private/1000$1/root.hdd
ploop init -s $disk /$partition/private/1000$1/root.hdd/root.hdd
cp /etc/vz/conf/$1.conf /etc/vz/conf/1000$1.conf
vzctl mount 1000$1
# Rsync files over while the source is still running (sync 1)
rsync $rsync_options /$partition/root/$1/. /$partition/root/1000$1/
# Stop the primary, mount it, and do the final sync
vzctl stop $1
vzctl mount $1
rsync $rsync_options /$partition/root/$1/. /$partition/root/1000$1/
vzctl umount $1
vzctl umount 1000$1
mv /$partition/private/$1 /$partition/private/$1.backup
mv /$partition/private/1000$1 /$partition/private/$1
vzctl start $1
# Cleanup
rm -f /etc/vz/conf/1000$1.conf
rmdir /$partition/root/1000$1
# Verification
verify=`vzlist -H -o status $1`
if [ "$verify" = "running" ]; then
    echo "Virtual server conversion successful. Verify manually, then run: rm -Rf /$partition/private/$1.backup to remove the backup."
else
    echo "Server conversion was not successful. Reverting..."
    mv -f /$partition/private/$1 /$partition/private/$1.fail
    mv /$partition/private/$1.backup /$partition/private/$1
    vzctl start $1
fi


Fixes:

1) Changed disk=`vzctl exec $1 df -BG | grep simfs | awk {'print $2'}` to disk=`vzctl exec $1 df -BG | grep simfs | awk {'print $2'} | head -n1`, because there may be multiple simfs mounts per VPS (for example, cPanel's virtfs).

2) Changed mkdir /$partition/private/1000$1/root.hdd to mkdir -p /$partition/private/1000$1/root.hdd. /$partition/private/1000$1 doesn't exist in the first place, so creating root.hdd would fail; the -p flag does the trick.

3) Changed ploop init -s $/$partition/private/1000$1/root.hdd/root.hdd to ploop init -s $disk /$partition/private/1000$1/root.hdd/root.hdd. I assume Devon had a typo there.

Either way, the updated script works fine now. Thanks Devon!


----------



## devonblzx (Jun 21, 2014)

Kruno said:


> Nice script but doesn't work as expected.
> 
> I took a while and fixed for everyone else who will need it. The script does a nice job
> 
> ...


I have PMed you back and updated the script with these changes, along with a few more I found.  The new ploop no longer sets a default filesystem, so I made sure ext4 is now the default.  I tested the script on the newest vzctl/ploop available and it works now.  Thanks for the input!


----------



## Kruno (Jun 25, 2014)

FLY, the script fails on a small number of containers for unexplained reasons. I talked to Devon and another WHT guy experienced with ploop, and none of us could explain it. It was only 1-2 per 100 containers in our case. vzctl convert worked for those, though. If you want to minimize the downtime you can set up a NEW container on the ploop node with the same resources, and then rsync the old simfs container over to the newly created ploop container. That way it works.

Make sure to run vzctl compact against all containers once they're converted to ploop, otherwise disk usage will be messed up. We got back more than 100GB of free disk space on certain nodes after this.

Note: every ploop container will reserve 5% of usable disk space for itself, so don't let df outputs surprise you


----------



## eddynetweb (Jun 26, 2014)

I've seen hosts in the past few days having issues upgrading. Kernel panics and such.


----------



## Kruno (Jun 26, 2014)

eddynetweb said:


> I've seen hosts in the past few days having issues upgrading. Kernel panics and such.


That was related to latest OVZ kernel, not ploop.


----------



## eddynetweb (Jun 26, 2014)

Kruno said:


> That was related to latest OVZ kernel, not ploop.


Oh, my mistake.


----------



## eddynetweb (Jun 26, 2014)

So has anybody been affected by this on a large scale?


----------



## Magiobiwan (Jun 26, 2014)

I remember reading multiple threads on LET about how SolusVM broke with it, but iirc they patched that quickly. We noticed a few minor hiccups with Feathur, but it was due to the switch to ploop for the default and the ploop libraries not being installed on the node. Turns out that doesn't work too well!


----------



## Kruno (Jun 28, 2014)

All in all, ploop is a major improvement compared to simfs.

There are still a few issues to be sorted out, though, the biggest one being that ploop containers don't shrink when files inside them are deleted. While df inside the container shows accurate results, /vz/private/CTID/root.hdd/root.hdd keeps its old size, leaving less free space on the node itself. 

The solution is vzctl compact. 



#!/bin/sh
vzlist -a | awk '{print $1}' | sed '/CTID/d' > ctid.txt

echo "Compact started on $(date)" >> compactlogs.txt
for i in `cat ctid.txt`
do
    vzctl compact $i
done
echo "Compact finished on $(date)" >> compactlogs.txt

Putting the above script in a daily cron job would be a good workaround until they release a proper fix for this issue. Other than that, everything has been working like a charm so far.

Any of you experienced that on your Nodes?


----------



## mtwiscool (Jul 2, 2014)

Kruno said:


> All in all, ploop is a major improvement compared to simfs.
> 
> There are still a few issues to be sorted out, though. The biggest one being ploop containers don't shrink down when files inside them get deleted. While df inside container shows accurate results, /vz/private/cid/root.hdd/root.hdd still uses too much space and makes less free space on the node itself.
> 
> ...


I have been using ploop for over 6 months and it works very well.

Zero crashes, and it's very fast and efficient.

I would recommend ploop to anyone who uses OpenVZ.


----------



## Kruno (Jul 4, 2014)

Update: Looks like they are not going to fix this.

BugZilla updates (status changed to WONTFIX):

"This is the way it works. Dynamic instant downsize would require much more resources, i.e. ploop would be way slower. If this is what you need, do vzctl compact from cron."

 

"I mean, vzctl compact is done online, you don't need to stop anything. Please explain what prevents you from doing it on a daily basis if you are so tied on diskspace."
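A cron fragment along the lines of what they suggest — a sketch only, assuming vzctl and vzlist are in root's PATH:

```shell
# /etc/cron.d/vzcompact - compact every container nightly at 03:30
30 3 * * * root for ct in $(vzlist -H -o ctid); do vzctl compact "$ct"; done
```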


----------



## devonblzx (Jul 4, 2014)

Kruno said:


> Update: Looks like they are not going to fix this.
> 
> BugZilla updates(status changed to WONTFIX):
> 
> ...


I have written a script to resolve this.  Run it from cron every 15 minutes.  The script will:


Check for a minimum level of free space on each 15-minute run and compact automatically if it's too low
Automatically run a compact every 24 hours on all servers
Email you if a compact fails to reclaim enough space

http://blog.byteonsite.com/?p=87
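The free-space threshold part of such a script might look like this — a sketch, not Devon's actual code; the 50GB floor and the mount point checked are assumptions:

```shell
#!/bin/sh
# Compact only when the node is actually tight on space.
MIN_FREE_GB=50   # assumed floor; tune per node

# Pure comparison, split out so it is easy to test in isolation.
below_threshold() {
    [ "$1" -lt "$MIN_FREE_GB" ]
}

# Free gigabytes on the checked mount (field 4 of df's second line, G stripped).
# A real node would check /vz rather than /.
free_gb=$(df -BG / | awk 'NR==2 {gsub(/G/,"",$4); print $4}')

if below_threshold "$free_gb"; then
    echo "only ${free_gb}G free - would run vzctl compact across all CTs"
else
    echo "${free_gb}G free - skipping compact"
fi
```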


----------



## Kruno (Jul 4, 2014)

Thanks Devon. 

I already wrote my own script for this, though. Hope it will come handy for other providers. 

Cheers


----------



## devonblzx (Aug 20, 2014)

@Kruno and others.  Recent circumstances have made me reconsider ploop images as a whole.  We have been using them for over a year, and not until the past month have we seen this problem, but it appears to be a major one.  I have had it with both HW RAID and SW RAID, where there is either a power failure (hard reboot) or a RAID degradation.  I have seen this on multiple servers now, and OpenVZ doesn't appear to be recognizing it as a major problem.  The data integrity of ploop images is in question in my mind: it appears that part of the image gets corrupted (quite easily) and then the whole ploop image won't mount, making it completely inaccessible.  From my experience, larger ploop images seem to be more at risk than smaller ones.

I should point out that I had no issues running the same setups for the 12 months prior; only in the last month have I seen these errors, so I believe this is an issue with a more recent update to either the kernel or the ploop libraries.

You can read about my bug report 3030 here:  https://bugzilla.openvz.org/show_bug.cgi?id=3030.  So far two other users have experienced the same issue.

I highly recommend people stay at simfs for now.  We are switching to a combination between private volumes (using a volume manager, eg LVM) and simfs for a similar goal of ploop but much more stable from the appearance of things as of recently.  I will share my experience once everything is completed on our end.  I know many users are stuck with SolusVM so it may be harder to work a custom solution.


----------



## Kruno (Aug 21, 2014)

@devonblzx I haven't experienced any of those yet. Granted, those are enterprise drives on high-quality HP RAID cards (e.g. P222) with BBU, and the failure rate is honestly very, very small. 

Did you have BBU on your RAID cards and did you have write-cache enabled on either RAID card or a separate SSD enabled?


----------



## devonblzx (Aug 21, 2014)

Kruno said:


> @devonblzx I didn't experience any of those yet. Granted those are enterprise drives on highly quality HP RAID cards(e.g. P222) with BBU and failure rate is honestly very very small.
> 
> Did you have BBU on your RAID cards and did you have write-cache enabled on either RAID card or a separate SSD enabled?


Didn't have writeback enabled on the one server using a P410i, but it did have a RAID degradation caused by power supply issues.  The other servers were software RAID and software SSD caching with hard reboots.  It seems to be occurring on all types of systems; the two other users involved were using HW RAID if I recall correctly.  I'm not sure if they had writeback enabled, but I would assume there's a good chance they did.  The major problem I'm seeing is not necessarily the corruption but the inability to repair: there is no way to fsck the filesystem to correct any corruption, so the whole disk images are lost.  The fact that this has happened on three nodes in the past ~3 weeks has made me decide to abandon ploop altogether.  It isn't worth the risk or trouble, and the benefits of ploop can be substituted by other more reliable means (as I suggested in my previous post).


----------



## Geek (Aug 21, 2014)

I'm absolutely backing off from Ploop as I was telling another provider not long ago.  Originally had positive results testing Ploop in dev/qa. Then I got my horror story...

http://www.johnedel.me/ext4-file-system-corruption-openvz-ploop/

Sorry for the lousy markup, I just recently migrated a bunch of WP posts to Ghost and I have some cleaning up to do. 

Jun 30 20:18:00 ovz-angelica kernel: [526746.951505] EXT4-fs error (device ploop60273p1): mb_free_blocks: double-free of inode 0's block 79217420(bit 17164 in group 2417)
Jun 30 20:18:00 ovz-angelica kernel: [526746.951563] EXT4-fs error (device ploop60273p1): mb_free_blocks: double-free of inode 0's block 79217421(bit 17165 in group 2417)

Jun 30 20:18:00 ovz-angelica kernel: [553166.612375] JBD: Spotted dirty metadata buffer (dev = ploop43717p1, blocknr = 0). There's a risk of filesystem corruption in case of system crash

The next day, on another of my nodes with identical specs (HW RAID 10 across 8x4TB WD RE4 + 3ware controller), the BBU failed, showed 0 capacity, and went into a hard reboot; when back online, the group of Ploop containers on that node reported read-only and required repair.  In both cases, writeback was enabled. The BBU situation was my own fault of course, I know better than to miss something like that, but it got me thinking that Ploop may end up being too much hassle for the client long-term. I found myself losing out on certain 

I also don't feel like having to vzctl compact to correct weirdness with resizing... automated or not, I've grown quite fond of the quick vertical scaling given to us by SimFS. I also noticed a while back that there are at least 5 new Ploop bugs reported each month, and twice that for broken live migrations as of late -- try to train yourselves to use the '-r no' switch if you haven't yet. 

I'm not sure why Kir and the guys threw us the equivalent of a *vz overhaul with little warning (I had about 2 days to test netfilter in dev). Most of it I wasn't worried too much about and was easy enough to tune up again, but it created a lot of confusion for the blissfully unaware who run yum updates without paying attention...

Good luck switching between SimFS and Ploop if you use SolusVM. Right now it's either one or the other. Turn off Ploop and disk usage stops reporting. Enable it and VE_LAYOUT=simfs is overwritten, spooling the container up as Ploop anyway; I believe a blank $CTID.conf was created then too (or one with no parameters set past physpages/swappages).

Virtualizor seems to have gotten around that problem though... except that I'm not their #1 fan these days.

Also Ploop containers and SolusVM Quick Backup = unusable as far as I can tell. That was the start of the problem I blogged about. Interestingly enough you can create a template from a Ploop container, mount it with --layout simfs and for some reason it actually works. I haven't figured out why yet, but it does.

Also, if you go to do a manual vzdump on a ploop container, the dump is stored in /var/tmp/vzdumptmp - greaaaat fun if /vz is its own partition and your root filesystem is only 10 or 15GB. It can be tweaked to change the destination, but that's a lot of work...

This was kind of jumbled together between tickets and trying to remember everything I wrote about in May... but I've rolled all but a few containers back to simfs, where everything's comfortable and functions how the clients want. 



devonblzx said:


> Didn't have writeback enabled on the one server using a P410i, but it did have a RAID degradation caused by power supply issues.  The other servers were software RAID and software SSD caching with hard reboots.  It seems to be occurring on all types of systems; the two other users involved were using HW RAID if I recall correctly.  I'm not sure if they had writeback enabled, but I would assume there's a good chance they did.  The major problem I'm seeing is not necessarily the corruption but the inability to repair: there is no way to fsck the filesystem to correct any corruption, so the whole disk images are lost.  The fact that this has happened on three nodes in the past ~3 weeks has made me decide to abandon ploop altogether.  It isn't worth the risk or trouble, and the benefits of ploop can be substituted by other more reliable means (as I suggested in my previous post).


----------



## Francisco (Aug 21, 2014)

Goddammit Solus, how are you this broken.

Francisco


----------



## KuJoe (Aug 21, 2014)

Francisco said:


> Goddammit Solus, how are you this broken.
> 
> 
> Francisco


I'm more surprised that you're surprised.

Also, did anybody know that solus is Latin for "alone", i.e. how you feel when you open a ticket with them for support sometimes? (I've heard this was remedied recently, but I stopped opening tickets when they told me that include and exclude meant the same thing.)


----------



## devonblzx (Aug 21, 2014)

For hosts that have the ability to customize, I recommend looking into LVM with thin provisioning.   Steps to take:


Set the default layout back to simfs in /etc/vz/vz.conf
Create a thin volume, aka sparse (a volume that grows in size as it is filled), for each container with the size you wish.
Make an ext4 filesystem on that volume
Mount that volume as /vz/private/VEID
Since vzctl create fails when /vz/private/VEID exists, you have to rework it a little bit.  Basically you can extract the OS template into /vz/private/VEID and copy the base configuration file.  Then set the disk quotas to unlimited so the container will only be limited by their filesystem that you created.
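The thin-volume, filesystem, and mount steps could look like this — a sketch only; vg0, thinpool, the 50G size, and VEID 101 are all placeholders, and your volume group and thin pool must already exist:

```shell
# Sparse (thin) volume that only consumes pool space as it fills.
lvcreate --thin vg0/thinpool --virtualsize 50G --name lv101

# Ext4 filesystem on the new volume.
mkfs.ext4 /dev/vg0/lv101

# Mount it as the container's private area before provisioning.
mkdir -p /vz/private/101
mount /dev/vg0/lv101 /vz/private/101
```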


tar -zxf /vz/template/cache/${OST}.tar.gz -C /vz/private/${VEID}/
cp -n /etc/vz/conf/ve-${CONFIG}.conf-sample /etc/vz/conf/${VEID}.conf
vzctl set ${VEID} --ostemplate $OST --diskspace 100T:100T --diskinodes 9223372036854775807:9223372036854775807 --save
Then create an action script named vps.premount to mount the private directory for each container before the container is started.  Read more about action scripts here: http://openvz.org/Man/vzctl.8#ACTION_SCRIPTS

vps.premount:


#!/bin/bash
mount /dev/vg0/lv${VEID} /vz/private/${VEID}
if ! mount | grep /vz/private/${VEID} >/dev/null; then
    echo "Unable to mount storage volume"
    exit 1
else
    exit 0
fi

You can easily grow the filesystems online with resize2fs; however, shrinking requires the container to be offline and the filesystem to be unmounted. That is the one con to this setup, but I think the price is worth it for better data integrity.
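Growing a container's filesystem online could then be as simple as — a sketch, with the same placeholder volume names as above:

```shell
# Add 10G to the thin volume, then grow ext4 online (no CT downtime).
lvextend --size +10G /dev/vg0/lv101
resize2fs /dev/vg0/lv101
```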

This would allow you to take full advantage of LVM per VE, such as snapshots and more.  Should any filesystem issues occur, you can also run fscks per VE rather than having the host system and the VE to worry about.  Hope this helps you guys!


----------



## Geek (Aug 21, 2014)

KuJoe said:


> I'm more surprised that you're surprised.
> 
> Also, did anybody know that solus is latin for alone, i.e. how you feel when you open a ticket with them for support sometimes (I've heard this was remedied recently but I stopped opening tickets when they told me that include and exclude meant the same thing).


I was not aware of the Latin meaning -- nor was I aware that it's safe to swear here without being reprimanded.

Also - a thousand pardons where I left out a couple things ... like, ya know, words...  


Getting by on cat naps this week.

What I was going to say was that I've noticed myself having to tell clients that "unfortunately this method of _______ has changed with the introduction of ___________"  -- and that's where I have to draw the line. As much as I appreciate and respect the enhancements I've (we've) seen over the past few years, I refuse to put a client at risk of silent corruption because ||'s rushed the introduction of Ploop. I came, I saw, I rolled back.

What's more, I get the feeling we'll be seeing more complaints of container corruption, more so from providers who are either young and clueless, or cheap and careless. Read about one just the other day... the provider picks out $4000.00 in procs, another $2000 for the drives, and a nice chunk of RAM; a few months later they did a blind conversion and corrupted 3/4 of their client base because they didn't spend $100 to rent a budget box to screw up on first. Most expensive footstool I've ever seen.

That's what I meant to add earlier -- the quick backups in SolusVM are still stored in /vz/private/panelbackup, making them inaccessible at the container level. The admin has to move /panelbackup manually into /vz/root/$CTID/.

And this stuff is peanuts compared to Devon's 800gb inaccessible container. https://bugzilla.openvz.org/show_bug.cgi?id=3030#c21

 

It's just not mature enough yet ... just like some of the service providers out there...


Urban Dictionary defines ploop as:



> "The sound that is made when you have let a gigantic piece of feces, poop, shit, or crap fall out of your butt and into the toilet. Mostly likely the poop has been held in for a long LONG time.
> 
> When can I use the bathroom? In a minute. Ploop! Uh oh! What? We gonna need a new toilet! EWWWW!"


 


...or maybe it's this guy.


----------



## devonblzx (Aug 21, 2014)

Update for my previous post above.  I had figured unlimited would work with simfs, but apparently it doesn't.  So instead of unlimited:unlimited, just use incredibly high limits like 100T:100T and 9223372036854775807:9223372036854775807 for inodes.  The limits will then be effectively non-existent under simfs and will fall back to the limits of the ext4 volume you created.


----------



## KuJoe (Aug 22, 2014)

invpsus said:


> I my opinion, OpenVZ is already outdated OS virtualization and for the better server performance is it better to upgrade to Xen or KVM.


False on both accounts.

1) OpenVZ has been proven to have much better performance than Xen and KVM.

2) Xen is two years older than OpenVZ.

I know you're only here to advertise and your posts are just fluff so that you can post your ads (as evident from your other posts with similar bot-like qualities), but please try to be accurate in your posts.


----------



## Geek (Aug 22, 2014)

> I my opinion, OpenVZ is already outdated OS virtualization and for the better server performance is it better to upgrade to Xen or KVM.


Wow. That's so ... 2008.  Containerization is far more sophisticated now, making the usual tired opinions that much harder to stomach. Especially when you consider that Google is almost completely containerized now (maybe 100% by now), and I believe that some of Facebook's infrastructure is as well.

I had someone at WHT earlier this summer trying to start a rumor in some of my threads that "OpenVZ was closing this summer". Don't know what their problem was, but just about any seasoned veteran can reach out to Kirill or someone over at ||'s to squash that crap... done and done, btw. 

People with that opinion should probably watch James Bottomley's talk at CloudOpen last year...

https://www.youtube.com/watch?v=p-x9wC94E38 - some of it's common knowledge for providers, but it's actually a pretty interesting lecture considering the KVM folks were in the audience the whole time. :lol:


----------



## KuJoe (Aug 22, 2014)

Geek said:


> Wow. That's so ... 2008.  Containerization is far more sophisticated now, making the usual tired opinions that much harder to stomach. Especially when you consider that Google is almost completely containerized now (maybe 100% by now), and I believe that some of Facebook's infrastructure is as well.
> 
> I had someone at WHT earlier this summer trying to start a rumor in some of my threads that "OpenVZ was closing this summer". Don't know what their problem was, but just about any seasoned veteran can reach out to Kirill or someone over at ||'s to squash that crap... done and done, btw.
> 
> ...


While a very passionate post, I feel it was wasted on a bot who's just here to increase post count and post copy+paste ads once they post enough nonsense. 

I'm going to watch the YouTube video now though so it wasn't a total waste.


----------



## Geek (Aug 22, 2014)

KuJoe said:


> While a very passionate post, I feel it was wasted on a bot who's just here to increase post count and post copy+paste ads once they post enough nonsense.
> 
> I'm going to watch the YouTube video now though so it wasn't a total waste.


Yeh, perhaps you're right, but I'd still prefer to contribute something worthwhile over fluff or copy/paste like you said.  Plus I just started posting here yesterday, and even though I see a lot of familiar faces, first impressions are important, and I'd hate for people to think that I have a hidden agenda when I really just enjoy contributing... if that makes sense. Hope you enjoyed the vid though.


----------



## Geek (Aug 27, 2014)

Just a little FYI: it seems that if you convert a container where second-level quotas are enabled, the /etc/mtab symbolic link to /proc/mounts gets broken and isn't correctly re-established, so whenever the container is rebooted it will show the former simfs layout unless you re-create the symlink.  Observe....

*[email protected] [~]# mount; echo; cat /proc/mounts; echo; ls -al /etc/mtab; echo;*
*/dev/simfs on / type reiserfs (rw,usrquota,grpquota)*
proc on /proc type proc (rw,relatime)
sysfs on /sys type sysfs (rw,relatime)
none on /dev type devtmpfs (rw,relatime,mode=755)
none on /dev/pts type devpts (rw,relatime,mode=600,ptmxmode=000)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /var/tmp type tmpfs (rw,nosuid,nodev,noexec,relatime)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)

*/dev/ploop63231p1 / ext4 rw,relatime,barrier=1,data=ordered,balloon_ino=12,jqfmt=vfsv0,usrjquota=aquota.user,grpjquota=aquota.group 0 0*
proc /proc proc rw,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
none /dev devtmpfs rw,relatime,mode=755 0 0
none /dev/pts devpts rw,relatime,mode=600,ptmxmode=000 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /var/tmp tmpfs rw,nosuid,nodev,noexec,relatime 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0

*-rw-r--r-- 1 root root 383 Aug 22 04:45 /etc/mtab*

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[email protected] [~]# rm -rf /etc/mtab
[email protected] [~]# ln -s /proc/mounts /etc/mtab

 

*[email protected] [~]# mount; echo; cat /proc/mounts; echo; ls -al /etc/mtab; echo;*
*/dev/ploop63231p1 on / type ext4 (rw,relatime,barrier=1,data=ordered,balloon_ino=12,jqfmt=vfsv0,usrjquota=aquota.user,grpjquota=aquota.group)*
proc on /proc type proc (rw,relatime)
sysfs on /sys type sysfs (rw,relatime)
none on /dev type devtmpfs (rw,relatime,mode=755)
none on /dev/pts type devpts (rw,relatime,mode=600,ptmxmode=000)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /var/tmp type tmpfs (rw,nosuid,nodev,noexec,relatime)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)

*/dev/ploop63231p1 / ext4 rw,relatime,barrier=1,data=ordered,balloon_ino=12,jqfmt=vfsv0,usrjquota=aquota.user,grpjquota=aquota.group 0 0*
proc /proc proc rw,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
none /dev devtmpfs rw,relatime,mode=755 0 0
none /dev/pts devpts rw,relatime,mode=600,ptmxmode=000 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /var/tmp tmpfs rw,nosuid,nodev,noexec,relatime 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0

*lrwxrwxrwx 1 root root 12 Aug 27 08:55 /etc/mtab -> /proc/mounts*


The conversion behaves normally if second-level quotas are temporarily disabled prior to converting to Ploop.  If anyone feels like dissecting the conversion to find out why, be my guest. Otherwise I guess we'll just have to remember.


----------



## Francisco (Sep 2, 2014)

It looks like there's a hotfix on its way out from OpenVZ for ploop:

https://twitter.com/_openvz_/status/506877041937874944

I wonder if this addresses the corruption issues?

Francisco


----------



## Francisco (Sep 2, 2014)

Nope, it looks to address a race condition in live migrations:

https://twitter.com/_openvz_/status/506880304263331840

Francisco


----------



## Francisco (Sep 13, 2014)

@Geek - Have you tried the latest ploop? It looks like they provided a way to zero out bad blocks, and they feel that the issue is due to writeback mode being set on ext4.

Francisco


----------



## devonblzx (Sep 13, 2014)

Francisco said:


> @Geek - Have you tried the latest ploop? It looks like they provided a way to zero out bad blocks, and they feel that the issue is due to writeback mode being set on ext4.
> ...


I am not entirely confident in that resolution, but I didn't have enough time to debug further, nor the expertise in ploop to state otherwise. We have already started moving servers away from ploop to a custom solution.

While writeback mode did seem dangerous with how ploop works, one of the other servers that experienced data loss without any hard drive issues was running in ordered mode, so I just don't trust it for the time being. Not to mention the compact issue: ploop images will continue to grow if you don't compact them regularly, and if the underlying filesystem runs out of space because of this, there will be data integrity issues inside the VPS. The problem is that deletions are only recorded inside the ploop image's filesystem, not passed down to the underlying filesystem. Therefore, from my understanding, if a VPS deletes and recreates a 10GB file 100 times, the ploop image could be as large as 1TB if you haven't compacted it.
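That worst case is easy to sanity-check with shell arithmetic (the 10GB file and 100 rewrites are the hypothetical numbers from above):


```shell
# Back-of-the-envelope: a container repeatedly deletes and recreates a
# 10GB file; without compaction each rewrite may land on fresh ploop
# clusters, so the image can keep growing.
FILE_GB=10
REWRITES=100
WORST_CASE_GB=$((FILE_GB * REWRITES))
echo "worst-case image size: ${WORST_CASE_GB} GB (~1TB)"
```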


----------



## Francisco (Sep 13, 2014)

devonblzx said:


> I am not entirely confident in that resolution but I didn't have enough time to debug further or expertise in ploop to state otherwise.  We already started moving servers away from ploop to a custom solution.
> 
> While writeback mode did seem dangerous with how ploop works, one of the other servers that experienced data loss without any hard drive issues was running in ordered mode.  So I just don't trust it for the time being, not to mention the compact issue.  Ploop images will continue to grow if you don't compact them regularly.  If the underlying filesystem runs out of space because of this, then there will be data integrity issues inside the VPS.   This problem is because when files are deleted, they are just marked in the ploop image not the underlying filesystem.   Therefore, from my understanding, if a VPS deletes and recreates a 10GB file 100 times the ploop image could be as large as 1TB if you haven't compacted it.


I don't think that's how it works.....

I'm pretty sure EXT4 would just re-allocate the same sectors for the last 10GB file.

I've looked into an LVM based solution but was unsure of it.

Francisco


----------



## devonblzx (Sep 13, 2014)

Francisco said:


> I don't think that's how it works.....
> 
> 
> I'm pretty sure EXT4 would just re-allocate the same sectors for the last 10GB file.
> ...


Ext4 isn't being made aware of the delete; that is the problem. Just from normal usage before we started doing compacts, we saw ploop images that were over 100GB on servers that were only using ~40GB.


----------



## Francisco (Sep 13, 2014)

What the hell.

Francisco


----------



## Kruno (Sep 13, 2014)

Francisco said:


> What the hell.
> 
> 
> Francisco


https://bugzilla.openvz.org/show_bug.cgi?id=3008


----------



## Geek (Sep 13, 2014)

Sorry I'm late.  Had a root canal this week & I delegate everything when Vicodin's involved... 



devonblzx said:


> Ext4 isn't being made aware of the delete; that is the problem. Just from normal usage before we started doing compacts, we saw ploop images that were over 100GB on servers that were only using ~40GB.


So this is the one where Kir said "that's just how it is, use cron if you have to" or something? Yet more resource-intensive rituals for people like us. I did experience data loss running in ordered mode, and then there was my little fiasco with the BBU failure in writeback mode, which damaged all the ploops. I'm certain this will spell trouble for some. Gonna be sticking with ol' trusty simfs for a while longer. Sorry I don't have more to contribute this time; my freaking jaw is throbbing... thanks for the info, Devon!


----------



## Geek (Oct 14, 2014)

vzctl compact / ploop-balloon discard isn't quite making the cut, it seems...

 

 

```
[[email protected] ~]# vzctl exec 1230 "df -h"; echo

Filesystem         Size  Used Avail Use% Mounted on
/dev/ploop18019p1   60G  9.7G   47G  18% /
none               768M  4.0K  768M   1% /dev
none               768M     0  768M   0% /dev/shm
tmpfs              768M     0  768M   0% /tmp

[[email protected] ~]# du -sh /vz/private/1230/root.hdd/*
4.0K    /vz/private/1230/root.hdd/DiskDescriptor.xml
0       /vz/private/1230/root.hdd/DiskDescriptor.xml.lck
16G     /vz/private/1230/root.hdd/root.hdd
4.0K    /vz/private/1230/root.hdd/root.hdd.mnt

[[email protected] ~]# ploop-balloon discard /vz/private/1230/root.hdd/DiskDescriptor.xml
Trying to find free extents bigger than 0 bytes
Waiting
Call FITRIM, for minlen=33554432
Call FITRIM, for minlen=16777216
Call FITRIM, for minlen=8388608
Call FITRIM, for minlen=4194304
Call FITRIM, for minlen=2097152
Call FITRIM, for minlen=1048576
0 clusters have been relocated
```

...so I guess that leaves me with 7.3GB of wasted space until they figure it out?

 

 

*Edit:* Meant to add that there's an entry in the OVZ Bugzilla about vzctl compact not doing the whole job... apparently you can use vzctl to shrink the fs and recover a bit of that, but even on the QA node I don't really want to screw around with it much yet.


----------



## devonblzx (Oct 14, 2014)

I saw similar results with my testing of ploop.  It wasn't a huge deal for me because we tend to have plenty of free space on our nodes.  I'm not sure why there 

I think the discard issue will always be present on ploop because of its design: a filesystem on top of a filesystem. The inner filesystem won't automatically pass deletions down to the outer one unless it is mounted with a discard option.

Ext4 supports TRIM via the discard mount option, which is made to pass deletions down to the next level (mainly for SSDs). It seems like they could mount the ploop device with the discard option to resolve this, if ploop is configured to allow it.

I haven't tested and I don't have a dev box to test with right now, but does this command work?


fstrim -v /vz/root/VEID 
I wouldn't run it on a production node, but that is the standard method of doing a manual trim. If that works, then mounting the ploop device with the discard option should work as well.
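For the regular-compaction workaround mentioned earlier in the thread, a minimal nightly cron job might look like this. It's a dry-run sketch: the commands are echoed rather than executed, and the CTID list falls back to sample IDs when vzlist isn't available.


```shell
# Dry-run: print a compact command for every container on the node.
# vzlist only exists on an OpenVZ host, so fall back to sample CTIDs here.
CTIDS=$(vzlist -H -o ctid 2>/dev/null || echo "101 102")
for CTID in $CTIDS; do
    echo vzctl compact "$CTID"    # drop "echo" in a real cron job
done
```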


----------



## Geek (Oct 14, 2014)

Nothing that changed the issue with CT 1230 in dev. I thought I saw something a while back on the *vz forum saying that mounting with discard wasn't an option.

 

I'm going to shrink container 2222 and see if I can grab any of that 18GB of wasted space. Just saw that the testing kernel was updated to include some of your bug reports. Might just boot into it after I try resizing that CT.

 

```
[[email protected] ~]# fstrim -v /vz/root/2222
/vz/root/2222: 294179635200 bytes were trimmed

[[email protected] ~]# du -sh /vz/private/2222/root.hdd/*
4.0K    /vz/private/2222/root.hdd/DiskDescriptor.xml
0       /vz/private/2222/root.hdd/DiskDescriptor.xml.lck
138G    /vz/private/2222/root.hdd/root.hdd

[[email protected] ~]# vzctl exec 2222 "df -h"
Filesystem         Size  Used Avail Use% Mounted on
/dev/ploop16657p1  394G  120G  254G  33% /
none               768M  4.0K  768M   1% /dev
none               154M  1.1M  153M   1% /run
none               5.0M     0  5.0M   0% /run/lock
none               768M     0  768M   0% /run/shm
```

 

Oh, you're welcome to have access to my dev box if you want a safe place to play.  It's going away next month after I get settled into my larger office ...anyhow, if we kill it, we kill it.  Keep ya posted


----------



## Geek (Oct 14, 2014)

Well, wouldja look at that...

Sooo.... now what?  Another thing to add to the already long list of concerns?  Like you, I have some beefy arrays where it would likely go unnoticed for some time, but still, that feeling of knowing you have wasted space on a production node?  I don't think I like that much....

```
[[email protected] ~]# vzctl set 2222 --diskspace 264246648 --save
Completing an on-going operation RELOC for device /dev/ploop16657
TRUNCATED: 16163 cluster-blocks (0 bytes)
dumpe2fs 1.41.12 (17-May-2010)
CT configuration saved to /etc/vz/conf/2222.conf

[[email protected] ~]# vzctl exec 2222 "df -h"
Filesystem         Size  Used Avail Use% Mounted on
/dev/ploop16657p1  246G  120G  106G  54% /
none               768M  4.0K  768M   1% /dev
none               154M  1.1M  153M   1% /run
none               5.0M     0  5.0M   0% /run/lock
none               768M     0  768M   0% /run/shm

[[email protected] ~]# du -sh /vz/private/2222/root.hdd/*
4.0K    /vz/private/2222/root.hdd/DiskDescriptor.xml
0       /vz/private/2222/root.hdd/DiskDescriptor.xml.lck
122G    /vz/private/2222/root.hdd/root.hdd
```


----------



## devonblzx (Oct 14, 2014)

So what'd you do with the last attempt?  Just reset the diskspace and it trimmed it on its own?


----------



## Geek (Oct 14, 2014)

devonblzx said:


> So what'd you do with the last attempt?  Just reset the diskspace and it trimmed it on its own?


Precisely.


----------



## devonblzx (Oct 14, 2014)

Geek said:


> Precisely.


Interesting, seems simple enough and a useful workaround for ploop admins for now. Do you have to set a different diskspace amount, or can you just reset the same value repeatedly? I don't use ploop anymore on my systems so I can't test, but thanks for sharing.


----------



## Geek (Oct 14, 2014)

I set the diskspace about 35% lower for the initial test, it trimmed, then I set it back. I hadn't thought to reset to the same value. I haven't done much with ploop this summer either. Frankly, using it takes away from the density and scalability that containers are known for.


----------



## devonblzx (Oct 14, 2014)

Geek said:


> Set the diskspace about 35% less as the initial test, it trimmed, then I set it back. I hadn't thought to reset to the same value. I haven't done much with Ploop this summer either.  Frankly, using it takes away from the density and scalability that containers are known for.


Well, that may be the right approach, as it may take a lower setting to make it attempt to truncate unused blocks. Using that logic, I wonder if this would work:


vzctl set VEID --diskspace 1M
By setting a ridiculously low value, my assumption is that OpenVZ would first attempt to truncate, then notice the setting is lower than the used disk space and fail to actually change it. That might trigger the truncation without altering anything. You'd also want to skip the --save option. Note that I haven't tested this; it's only a theory.
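Combining Geek's shrink-then-restore trick with the 1M theory, a dry-run sketch might look like this. The commands are echoed rather than executed, ORIG_KB reuses the 264246648 value from Geek's output above, and none of this is tested:


```shell
# Dry-run sketch of the resize-based trim; remove "echo" to run for real.
CTID=2222
ORIG_KB=264246648                         # current --diskspace value, in KB
SHRUNK_KB=$((ORIG_KB * 65 / 100))         # ~35% lower, as in Geek's test
echo vzctl set "$CTID" --diskspace "$SHRUNK_KB"         # should trigger truncation
echo vzctl set "$CTID" --diskspace "$ORIG_KB" --save    # restore the old limit
# or, per the theory above: an impossibly small value, without --save
echo vzctl set "$CTID" --diskspace 1M
```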


----------



## Geek (Oct 15, 2014)

devonblzx said:


> Well that may be the right type of approach as it may require a lower setting to attempt to truncate unused blocks.  Using that logic, I wonder if this approach would work:
> 
> 
> vzctl set VEID --diskspace 1M
> ...


Good point. I'll try to get it in before my UK people start their day.

And another thing: why do all of my simfs-converted ploop devices report 4GB less diskspace than when they were spun up on simfs? Is that more wasted space somewhere?

Also, dumpe2fs isn't going to fire up unless I --save... thoughts?


----------



## Geekion (Jan 6, 2015)

I will consider your methods before updating, thanks.


----------



## Francisco (Feb 25, 2015)

May the mods beat me for this necro, but I figured someone google-fu-ing this would like to know.

I had a case today where a user's VPS wouldn't boot, giving ploop-related corruption errors:

```
(02:20:41) lv-node28:~/442/root.hdd root: ploop mount root.hdd -r -m /mnt
Adding delta dev=/dev/ploop22693 img=root.hdd (ro)
Error in ploop_mount_fs (ploop.c:1231): Can't mount file system dev=/dev/ploop22693p1 target=/mnt data='(null)': No such device or address
```
...and...


```
(03:35:48) lv-node07:~ root: /usr/src/ploop_userspace/ploop_userspace 442.root.hdd
We process: 442.root.hdd
Ploop file size is: 2403336192
version: 2 disk type: 2 heads count: 16 cylinder count: 204800 sector count: 2048 size in tracks: 51200 size in sectors: 104857600 disk in use: 1953459801 first block offset: 2048 flags: 0
For storing 53687091200 bytes on disk we need 51200 ploop blocks
We have 1 BAT blocks
We have 262128 slots in 1 map
Number of non zero blocks in map: 2291
Please be careful because this disk used now! If you need consistent backup please stop VE
We found GPT table on this disk
We found ext4 signature at first partition block
Set device /dev/nbd0 as read only
Try to found partitions on ploop device
Error: Both the primary and backup GPT tables are corrupt.  Try making a fresh table, and using Parted's rescue feature to recover partitions.
First ploop partition was not detected properly, please call partx/partprobe manually
You could mount ploop filesystem with command: mount -r -o noload /dev/nbd0p1 /mnt
```
Lovely errors, right? The issue is that OVZ is somehow *corrupting* its own GPT partition tables.
Well, here's the fix:

1) *Backup the root.hdd, you will be making possibly destructive changes to it*

2) Mount the ploop image so it gets attached as a /dev/ploop* device:

ploop mount root.hdd
3) Install testdisk/photorec


```
apt-get install testdisk
```
4) run testdisk on the drive


```
testdisk /dev/ploopXXXXXX
```
It must be run on the *bare* ploop device, not the one ending in p1.
Now, just tell it to look for a GPT partition table and do a 'quick search'. Select the first hit (it should be at a 2048 boundary) and tell it to write it to the disk. Exit out, umount the root.hdd, and tell the VPS to boot. For me it booted without any sign of issues (with the customer's permission we randomly opened files in /etc/ to confirm they were intact, etc.).
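For copy-paste convenience, the recovery condenses to this dry-run script. The commands are echoed rather than run; the CTID and paths are the ones from this post, the backup destination is arbitrary, and /dev/ploopXXXXXX still needs to be filled in with whatever device the mount attaches.


```shell
# Dry-run of the GPT recovery steps; remove "echo" to execute on the host.
CT=442
echo cp /vz/private/$CT/root.hdd/root.hdd /root/root.hdd.bak   # 1) backup first!
echo ploop mount root.hdd                                      # 2) attach the image
echo apt-get install testdisk                                  # 3) yum on EL hosts
echo testdisk /dev/ploopXXXXXX                                 # 4) bare device, not p1
```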

*DON'T BOTHER WITH PLOOP, THE DD PORN ISNT WORTH IT.*

For us, we had ploop on a good chunk of LV after doing a vzctl update and not re-applying our standard vz.conf. For about 1-2 weeks we had people reinstalling/provisioning on ploop, and we've been moving them off as they ask. We've had others get random corruption issues for no reason as well.

_PRAISE THE TESTDISK GODS_

Why they chose to write their own fustercluck instead of just retrofitting qcow, I have no idea. Sure, the space trimming is a nice feature, but it obviously doesn't work very well.

Francisco


----------



## HalfEatenPie (Feb 25, 2015)

Francisco said:


> May the mods beat me for this necro but figured someone google-fooing would like to know.


Time for a soap party!!!



Kidding.  Thanks chief for your solution!


----------

