
Heads up: OpenVZ updates will probably break your system

Geek

Technolojesus
Verified Provider
While a very passionate post, I feel it was wasted on a bot who's just here to increase post count and post copy+paste ads once they post enough nonsense. :(

I'm going to watch the YouTube video now though so it wasn't a total waste. :D
Yeh, perhaps you're right, but I'd still prefer to contribute something worthwhile over fluff or copy/paste like you said.  Plus I just started posting here yesterday, and even though I see a lot of familiar faces, first impressions are important, and I'd hate for people to think that I have a hidden agenda when I really just enjoy contributing... if that makes sense. Hope you enjoyed the vid though. :D
 

Geek

Technolojesus
Verified Provider
Just a little FYI: it seems that if you convert a container where second-level quotas are enabled, the /etc/mtab symlink to /proc/mounts gets broken and isn't correctly re-established, so after a reboot the container keeps showing the former simfs layout until you re-create the symlink.  Observe....

root@ctdev19 [~]# mount; echo; cat /proc/mounts; echo; ls -al /etc/mtab; echo;
/dev/simfs on / type reiserfs (rw,usrquota,grpquota)
proc on /proc type proc (rw,relatime)
sysfs on /sys type sysfs (rw,relatime)
none on /dev type devtmpfs (rw,relatime,mode=755)
none on /dev/pts type devpts (rw,relatime,mode=600,ptmxmode=000)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /var/tmp type tmpfs (rw,nosuid,nodev,noexec,relatime)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)

/dev/ploop63231p1 / ext4 rw,relatime,barrier=1,data=ordered,balloon_ino=12,jqfmt=vfsv0,usrjquota=aquota.user,grpjquota=aquota.group 0 0
proc /proc proc rw,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
none /dev devtmpfs rw,relatime,mode=755 0 0
none /dev/pts devpts rw,relatime,mode=600,ptmxmode=000 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /var/tmp tmpfs rw,nosuid,nodev,noexec,relatime 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0

-rw-r--r-- 1 root root 383 Aug 22 04:45 /etc/mtab


----------------------------------------

root@ctdev19 [~]# rm -rf /etc/mtab
root@ctdev19 [~]# ln -s /proc/mounts /etc/mtab


 

root@ctdev19 [~]# mount; echo; cat /proc/mounts; echo; ls -al /etc/mtab; echo;
/dev/ploop63231p1 on / type ext4 (rw,relatime,barrier=1,data=ordered,balloon_ino=12,jqfmt=vfsv0,usrjquota=aquota.user,grpjquota=aquota.group)
proc on /proc type proc (rw,relatime)
sysfs on /sys type sysfs (rw,relatime)
none on /dev type devtmpfs (rw,relatime,mode=755)
none on /dev/pts type devpts (rw,relatime,mode=600,ptmxmode=000)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /var/tmp type tmpfs (rw,nosuid,nodev,noexec,relatime)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)

/dev/ploop63231p1 / ext4 rw,relatime,barrier=1,data=ordered,balloon_ino=12,jqfmt=vfsv0,usrjquota=aquota.user,grpjquota=aquota.group 0 0
proc /proc proc rw,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
none /dev devtmpfs rw,relatime,mode=755 0 0
none /dev/pts devpts rw,relatime,mode=600,ptmxmode=000 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /var/tmp tmpfs rw,nosuid,nodev,noexec,relatime 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0

lrwxrwxrwx 1 root root 12 Aug 27 08:55 /etc/mtab -> /proc/mounts



The behavior of the conversion is normal if second-level quotas are temporarily disabled prior to converting to Ploop.  If anyone feels like dissecting the conversion to find out why, be my guest. Otherwise I guess we'll just have to remember.
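
For anyone who hits this, a rough sketch of the order of operations I'd use (the --quotaugidlimit value of 1000 is only an example, so double-check on a test CT first):

# Sketch only: convert a simfs CT to ploop with second-level quotas disabled first
CTID=101                                          # example container ID
vzctl stop $CTID                                  # the CT has to be stopped to convert
vzctl set $CTID --quotaugidlimit 0 --save         # temporarily disable second-level quotas
vzctl convert $CTID --layout ploop                # simfs -> ploop
vzctl set $CTID --quotaugidlimit 1000 --save      # re-enable; 1000 is just an example value
vzctl start $CTID
# If /etc/mtab still shows the old simfs layout, re-create the symlink inside the CT:
vzctl exec $CTID "ln -sf /proc/mounts /etc/mtab"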
 

Francisco

Company Lube
Verified Provider
@Geek - Have you tried the latest ploop? It looks like they provided a way to zero out bad blocks, and they feel that the issue is due to writeback mode being set on EXT4.

Francisco
 

devonblzx

New Member
Verified Provider
@Geek - Have you tried the latest ploop? It looks like they provided a way to zero out bad blocks, and they feel that the issue is due to writeback mode being set on EXT4.

Francisco
I'm not entirely confident in that resolution, but I didn't have enough time to debug further, or the expertise in ploop, to state otherwise.  We've already started moving servers away from ploop to a custom solution.

While writeback mode did seem dangerous with how ploop works, one of the other servers that experienced data loss without any hard drive issues was running in ordered mode, so I just don't trust it for the time being, not to mention the compact issue.  Ploop images will continue to grow if you don't compact them regularly, and if the underlying filesystem runs out of space because of this, there will be data integrity issues inside the VPS.  The problem is that when files are deleted, they are only marked as free inside the ploop image; the space isn't released back to the underlying filesystem.  Therefore, from my understanding, if a VPS deletes and recreates a 10GB file 100 times, the ploop image could be as large as 1TB if you haven't compacted it.
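
For reference, a periodic compact across every container is simple enough to script; something like this sketch (vzctl 4.0+ provides the compact subcommand), run from cron weekly or so:

for CT in $(vzlist -H -o ctid); do   # list running containers, no header
    vzctl compact "$CT"              # release unused ploop blocks back to the host fs
done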
 

Francisco

Company Lube
Verified Provider
I'm not entirely confident in that resolution, but I didn't have enough time to debug further, or the expertise in ploop, to state otherwise.  We've already started moving servers away from ploop to a custom solution.

While writeback mode did seem dangerous with how ploop works, one of the other servers that experienced data loss without any hard drive issues was running in ordered mode, so I just don't trust it for the time being, not to mention the compact issue.  Ploop images will continue to grow if you don't compact them regularly, and if the underlying filesystem runs out of space because of this, there will be data integrity issues inside the VPS.  The problem is that when files are deleted, they are only marked as free inside the ploop image; the space isn't released back to the underlying filesystem.  Therefore, from my understanding, if a VPS deletes and recreates a 10GB file 100 times, the ploop image could be as large as 1TB if you haven't compacted it.
I don't think that's how it works.....

I'm pretty sure EXT4 would just re-allocate the same sectors for the last 10GB file.

I've looked into an LVM based solution but was unsure of it.

Francisco
 

devonblzx

New Member
Verified Provider
I don't think that's how it works.....


I'm pretty sure EXT4 would just re-allocate the same sectors for the last 10GB file.


I've looked into an LVM based solution but was unsure of it.


Francisco
Ext4 isn't being made aware of the delete; that is the problem.  Just from normal usage before we started doing compacts, we saw ploop images that were over 100GB on servers that were only using ~40GB.
 

Geek

Technolojesus
Verified Provider
Sorry I'm late.  Had a root canal this week & I delegate everything when Vicodin's involved... :p

Ext4 isn't being made aware of the delete; that is the problem.  Just from normal usage before we started doing compacts, we saw ploop images that were over 100GB on servers that were only using ~40GB.
So this is the one where Kir said "that's just how it is, use cron if you have to" or something?  Yet more resource-intensive rituals for people like us. I did experience data loss running in ordered mode, then there was my little fiasco with the BBU failure in writeback mode which damaged all the ploops.  I'm certain this will spell trouble for some.  Gonna be sticking with ol' trusty for a while longer.  Sorry I don't have more to contribute this time. Freaking jaw is throbbing... thanks for the info Devon!  :)
 

Geek

Technolojesus
Verified Provider
vzctl compact/ploop-balloon discard isn't quite making the cut, it seems...

[root@mulva ~]# vzctl exec 1230 "df -h"; echo
Filesystem         Size  Used Avail Use% Mounted on
/dev/ploop18019p1   60G  9.7G   47G  18% /
none               768M  4.0K  768M   1% /dev
none               768M     0  768M   0% /dev/shm
tmpfs              768M     0  768M   0% /tmp

[root@mulva ~]# du -sh /vz/private/1230/root.hdd/*
4.0K    /vz/private/1230/root.hdd/DiskDescriptor.xml
0       /vz/private/1230/root.hdd/DiskDescriptor.xml.lck
16G     /vz/private/1230/root.hdd/root.hdd
4.0K    /vz/private/1230/root.hdd/root.hdd.mnt

[root@mulva ~]# ploop-balloon discard /vz/private/1230/root.hdd/DiskDescriptor.xml
Trying to find free extents bigger than 0 bytes
Waiting
Call FITRIM, for minlen=33554432
Call FITRIM, for minlen=16777216
Call FITRIM, for minlen=8388608
Call FITRIM, for minlen=4194304
Call FITRIM, for minlen=2097152
Call FITRIM, for minlen=1048576
0 clusters have been relocated

...so I guess that leaves me with 7.3GB of wasted space until they figure it out?

 

 

Edit: Meant to add that there's an entry in OVZ Bugzilla about vzctl compact not doing the whole job... apparently you can use vzctl and shrink the fs to recover a bit of that, but even on the QA node I don't really want to screw around with it much yet.
 

devonblzx

New Member
Verified Provider
I saw similar results with my testing of ploop.  It wasn't a huge deal for me because we tend to have plenty of free space on our nodes.  I'm not sure why there 

I think the discard issue will always be present on ploop because of its design of a filesystem on top of a filesystem.  The first filesystem won't automatically pass deletions to the second filesystem unless it is mounted with a discard option.

Ext4 supports trim with the discard option, which is made to pass deletions to the next level (mainly for SSDs).  It seems like they could mount the ploop device with the discard option to resolve this if they have configured ploop to allow for that.

I haven't tested and I don't have a dev box to test with right now, but does this command work?


fstrim -v /vz/root/VEID
I wouldn't run it on a production node, but that is the standard method of doing a manual trim.  If that works, then mounting the ploop device with discard should also.
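
If anyone does want to try it on a dev node, looping it over the running containers would look roughly like this (untested sketch, same warning about production):

for CT in $(vzlist -H -o ctid); do   # running containers only, no header
    fstrim -v /vz/root/$CT           # ask ext4 inside the ploop mount to discard free blocks
done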
 

Geek

Technolojesus
Verified Provider
Nothing that changed the issue with CT 1230 in dev.  I thought I saw something a while back on the *vz forum saying that mounting with discard wasn't an option.

 

I'm going to shrink container 2222 and see if I can grab any of that 18GB of wasted space.  Just saw that the testing kernel was updated to include some of your bug reports.  Might just boot into it after I try resizing that CT.

 

[root@mulva ~]# fstrim -v /vz/root/2222
/vz/root/2222: 294179635200 bytes were trimmed

[root@mulva ~]# du -sh /vz/private/2222/root.hdd/*
4.0K    /vz/private/2222/root.hdd/DiskDescriptor.xml
0       /vz/private/2222/root.hdd/DiskDescriptor.xml.lck
138G    /vz/private/2222/root.hdd/root.hdd

[root@mulva ~]# vzctl exec 2222 "df -h"
Filesystem         Size  Used Avail Use% Mounted on
/dev/ploop16657p1  394G  120G  254G  33% /
none               768M  4.0K  768M   1% /dev
none               154M  1.1M  153M   1% /run
none               5.0M     0  5.0M   0% /run/lock
none               768M     0  768M   0% /run/shm

Oh, you're welcome to have access to my dev box if you want a safe place to play.  It's going away next month after I get settled into my larger office ...anyhow, if we kill it, we kill it.  Keep ya posted
 

Geek

Technolojesus
Verified Provider
Well, wouldja look at that...

Sooo.... now what?  Another thing to add to the already long list of concerns?  Like you, I have some beefy arrays where it would likely go unnoticed for some time, but still, that feeling of knowing you have wasted space on a production node?  I don't think I like that much....

[root@mulva ~]# vzctl set 2222 --diskspace 264246648 --save
Completing an on-going operation RELOC for device /dev/ploop16657
TRUNCATED: 16163 cluster-blocks (0 bytes)
dumpe2fs 1.41.12 (17-May-2010)
CT configuration saved to /etc/vz/conf/2222.conf

[root@mulva ~]# vzctl exec 2222 "df -h"
Filesystem         Size  Used Avail Use% Mounted on
/dev/ploop16657p1  246G  120G  106G  54% /
none               768M  4.0K  768M   1% /dev
none               154M  1.1M  153M   1% /run
none               5.0M     0  5.0M   0% /run/lock
none               768M     0  768M   0% /run/shm

[root@mulva ~]# du -sh /vz/private/2222/root.hdd/*
4.0K    /vz/private/2222/root.hdd/DiskDescriptor.xml
0       /vz/private/2222/root.hdd/DiskDescriptor.xml.lck
122G    /vz/private/2222/root.hdd/root.hdd
 

devonblzx

New Member
Verified Provider
So what'd you do with the last attempt?  Just reset the diskspace and it trimmed it on its own?
 

devonblzx

New Member
Verified Provider
Precisely.
Interesting, seems simple enough and a useful workaround for now for ploop admins.  Do you have to set a different diskspace amount or can you just reset the same value repeatedly?   I don't use ploop anymore on my systems so I can't test but thanks for sharing.
 

Geek

Technolojesus
Verified Provider
I set the diskspace about 35% lower as the initial test, it trimmed, and then I set it back. I hadn't thought to reset it to the same value. I haven't done much with Ploop this summer either.  Frankly, using it takes away from the density and scalability that containers are known for.
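
In script form that sequence would be roughly the sketch below; the vzlist diskspace.h field and its 1K-block units are my assumptions, so sanity-check the numbers before trying it anywhere that matters:

CTID=2222                                                 # the CT from the test above
ORIG=$(vzlist -H -o diskspace.h $CTID)                    # current hard limit (assumed 1K blocks)
vzctl set $CTID --diskspace $((ORIG * 65 / 100)) --save   # ~35% lower; triggers RELOC/TRUNCATE
vzctl set $CTID --diskspace $ORIG --save                  # then put the original limit back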
 

devonblzx

New Member
Verified Provider
I set the diskspace about 35% lower as the initial test, it trimmed, and then I set it back. I hadn't thought to reset it to the same value. I haven't done much with Ploop this summer either.  Frankly, using it takes away from the density and scalability that containers are known for.
Well that may be the right type of approach as it may require a lower setting to attempt to truncate unused blocks.  Using that logic, I wonder if this approach would work:


vzctl set VEID --diskspace 1M
So by setting a ridiculously low value, my assumption would be that OpenVZ would first attempt to truncate then check to see if the setting is lower than the used disk space, thus failing to actually change the setting.

This may truncate without altering anything.  You'd also want to skip the --save option.  Of course, I haven't tested this; it's only a theory.
 