Replicating Linux block devices within the same server?

Damian

New Member
Verified Provider
I'm attempting to effect short-delay syncing. I'd like either blocks or files to be synced within a few seconds up to one minute; something like rsnapshot on an hourly basis leaves too large a potential data delta.

Anyone have brainy ideas on how to replicate one block device to another within the same server? Kinda like what DRBD does but without the networking?

I'm also open to filesystem-level syncing. I've tried inotifywait from the inotify-tools package, but it seems to be overwhelmed by the number of directories in the /vz/private partition of a VM host node.
 

rds100

New Member
Verified Provider
What about mdadm RAID1? Or is the second device significantly slower on writes than the first one?
 

Damian

New Member
Verified Provider
The storage device is slower than, and a different size from, the device to be replicated. I'm looking to replicate an SSD array to a single mechanical drive.
 

rds100

New Member
Verified Provider
You can set the second device as write-mostly in the mdadm raid1, so all reads will come from the SSD, if it's alive. The larger HDD size doesn't matter, you will just partition it and use a partition of the right size for the mirror (the rest can stay unused).
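A minimal sketch of that write-mostly setup. The device names (/dev/sda1 for the SSD, /dev/sdb1 for the HDD partition) and the array name are assumptions; adjust for your layout, and size the HDD partition to match the SSD first.

```shell
# Create a RAID1 where the HDD member is flagged write-mostly, so reads
# are served from the SSD whenever it's available. --write-mostly applies
# to the devices listed after it.
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
    /dev/sda1 --write-mostly /dev/sdb1

# Verify: the write-mostly member shows a (W) flag in /proc/mdstat.
cat /proc/mdstat
```

This only changes read routing; writes still go to both members synchronously unless write-behind (below in the thread) is also enabled.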

The problem would be write speed: it will be limited by the mechanical HDD's ability to handle writes. It could be improved a little by connecting the HDD to a cheap RAID card with a big RAM buffer to do the write buffering. If such a card exists, of course.
 

rds100

New Member
Verified Provider
Where is the edit button ???

Anyway, you can also set the HDD as write-behind device in mdadm raid1, so the writes can be buffered / deferred. Haven't tried it / benchmarked it myself though. If you make some benchmarks, please share.
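For reference, write-behind requires a write-intent bitmap and only applies to members already marked write-mostly. A sketch (untested, device names assumed):

```shell
# --write-behind=N allows up to N writes to the write-mostly member to be
# outstanding before the array blocks; the internal bitmap tracks what
# still needs to reach the HDD.
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
    --bitmap=internal --write-behind=4096 \
    /dev/sda1 --write-mostly /dev/sdb1
```

With this, write completion is acknowledged once the SSD has the data, so the HDD lags behind rather than throttling the array.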
 

Damian

New Member
Verified Provider
I'm not sure that the RAID idea would be the best method... yes it would work, but as you pointed out, it would crap up the write speed to the entire array, kinda negating the point of using SSDs. I don't have enough experience with mdadm to know if I could mangle it to work otherwise. The write-behind idea might be viable.

What I've tried thus far:

inotifywait from the inotify-tools package - Does exactly what I wanted: sets up inotify ( http://en.wikipedia.org/wiki/Inotify ) watches and then makes a list of files that have changed in real time. This list can then be piped to rsync or whatever to move the files. Unfortunately, when applied at the scale of a VM host node, it's overwhelmed; I let it run for about 50 minutes setting up inotify watches on our busiest VM server, at which point I killed it. So that's not going to work.
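The pipeline I had in mind looks roughly like this; it works on small trees, it was the initial recursive watch setup on /vz/private that never finished. Paths and rsync destination are placeholders:

```shell
# Watch for completed writes, new files, deletes and moves, and mirror each
# changed path. Note: rsync of an already-deleted file fails, so deletions
# really need separate handling; this is a sketch, not a complete solution.
inotifywait -m -r -e close_write,create,delete,move \
    --format '%w%f' /vz/private |
while read -r f; do
    rsync -aR "$f" /vz/replica/ 2>/dev/null
done
```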

rsnapshot on an hourly basis - Meets all the required criteria except for having an hour interval. We all know what rsnapshot is and how it works, so I won't bother with that. This ended up not working out for us, as it caused massive disk i/o loading:



So that's not going to work either. This was, once again, on our heaviest-loaded server, and it's actually a really poor example: it's basically reading from /vz/private and attempting to snapshot to /vz/snapshots, which are both on the same partition of the same disk array. Granted, my expected use scenario of backing up an SSD array to a mechanical drive is completely different, but my mindset is that if something works for the worst-case scenario, then it'll be fine for anything better.

I'm going to try Idera's "Server Backup Free" tool from http://www.idera.com/productssolutions/freetools/serverbackupfree which claims to do 5-minute incremental backups to (amongst other things) another disk in the server. We'll see what happens....
 

acd

New Member
Out of curiosity, may I ask why this is desirable over flashcache and friends? If you're on the same server you don't have to worry about multiple readers and writers so coherency problems go away. The device is a 100% replica so you never want to be reading from the slow, seekbound disk. Your working set is, by definition, smaller than (or equal) what your cache's size must be...
 

Damian

New Member
Verified Provider
Out of curiosity, may I ask why this is desirable over flashcache and friends? If you're on the same server you don't have to worry about multiple readers and writers so coherency problems go away. The device is a 100% replica so you never want to be reading from the slow, seekbound disk. Your working set is, by definition, smaller than (or equal) what your cache's size must be...
Generally, a lack of support or documentation. I can't really find much on flashcache's failure modes and how to fix them, which is probably one of:

  1. It doesn't break, so no one has written about it breaking.
  2. No one uses it.
  3. It breaks, but no one writes anything about it (unlikely).
One of the benefits of inotify or rsnapshot is that they operate at the filesystem level, which I'm comfortable working with. Idera/R1Soft/whatever-it's-called has purchasable support if needed, plus there's plenty of documentation otherwise.

Another pivot is the application: 1U VM servers. 2U of rack space costs us twice as much regardless of whether we're using additional power, so 1U is extremely desirable to us. We're finding that with our current E3-12x0 builds, all of them run out of steam on disk I/O with a 4-spindle RAID 10 array way before any other resource (CPU, RAM, etc.) gets anywhere near full utilization.

After doing a lot of pricing, we determined that a RAID 1 (or RAID 0, not sure if we want to be that ballsy) SSD array ends up being cheaper at the price of about half the storage amount versus a mechanical drive array. While the SSDs are more expensive on a per-unit basis, I don't need to purchase more of them for better throughput, I don't need to purchase a $700 RAID controller for better throughput, I don't need to purchase a BBU/CV module for the RAID controller, etc. All of our servers are under 50% of their disk array utilized, and we have a storage VPS product with free internal transit for people who need to store a lot of stuff.

There's a sentiment amongst the community to use 2U servers so that you can have more (6, 8, or 12) spindles, but that seems extremely wasteful: I then need to buy more drives, buy a SAS expander or a 16-port RAID card, and pay for the additional power to spin the drives. And then I have storage space I can't sell as actual storage space, since clients generally abuse disk I/O on storage offerings, and I can tune an array and scheduler for random I/O (true VM usage) or contiguous read/write (storage), not both. Additionally, more spindles means more points of failure, whereas I can just implement two SSDs and a replication drive.

Granted, I'm conveying this concept of an SSD-and-replication-drive as backup; I'm really not thinking of it that way. "Backup" implies not being in the same chassis, for one. I'm thinking of it more as "due diligence": if the SSD array fails, we have a "hot spare" that we can copy all of our stuff from and get up and running in the time it takes to copy. With the durability of modern SSDs, this may even be completely irrelevant, but having this safety net, even if the safety net is made out of thin paper, makes me feel better about what we're offering.

I appreciate your input; I'm always interested in the thoughts and opinions of others. Don't hesitate to respond if you have anything further.
 

acd

New Member
OK, lack of support is a legitimate concern. Documentation on failure modes is pretty reasonable, assuming you can take word of mouth from mailing lists. dm-cache and bcache both made it into the mainline kernel, in 3.9 and 3.10 respectively, so worst case you can take it to LKML and documentation will become available. Also, since they're in-kernel now, they should be reasonably reliable; both of those caching solutions have been proven over a significant amount of time by many companies. That may not be reliable enough for you, but it counts for a fair bit in my book. Then again, I'm also not above debugging a kernel issue on my own if I can find a consistent failure case.
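For what it's worth, a bcache setup is only a few commands (sketch, assuming bcache-tools and placeholder device names: /dev/sdb1 as the HDD backing device, /dev/sda1 as the SSD cache):

```shell
# Format the backing (HDD) and cache (SSD) devices.
make-bcache -B /dev/sdb1
make-bcache -C /dev/sda1

# Find the cache set UUID, then attach it to the new bcache device.
bcache-super-show /dev/sda1            # look for the cset.uuid field
echo <cset-uuid> > /sys/block/bcache0/bcache/attach

# Switch from the default writethrough to writeback caching.
echo writeback > /sys/block/bcache0/bcache/cache_mode
```

You'd then put the filesystem on /dev/bcache0 rather than on the raw devices.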

If power is not an issue and the virtualization is OpenVZ or Xen PV, the first thing I would do is load those boxes to the brim with RAM. The Linux VFS (specifically the block cache, unfortunately also referred to as bcache) will kick the crap out of scattered reads if there is temporal reuse and the working set is small enough (i.e., smaller than the amount of excess RAM you have budgeted for read caching). Write queueing is not so lucky; fsync() and fdatasync() will stall until their writes go to disk (longer for the former). With that in mind, you may want to play with the /sys/block/<x>/queue/nr_requests parameter a little; increasing it allows Linux to better optimize read and write performance of small reads and writes. You can also play with filesystem options like setting ext3's journal to write-back (extra performance at the cost of integrity). IBM has a manual on performance-tuning Linux with a section on the disk subsystem (http://www.redbooks.ibm.com/redpapers/pdfs/redp4285.pdf, section 4.6); I apologize if you've already tried these things.
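Concretely, those two tunables look something like this. The device name, value, and filesystem are assumptions; pick numbers appropriate for your workload:

```shell
# Deepen the request queue so the elevator can merge and reorder more
# small reads/writes before they hit the disk (default is typically 128).
echo 512 > /sys/block/sda/queue/nr_requests

# Set ext3's default journal mode to write-back (takes effect on the next
# mount): faster, but only metadata is journaled, so data integrity after
# a crash is weaker.
tune2fs -o journal_data_writeback /dev/sda3
```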

I know what you mean about expensive freakin' RAID boards. PMC boards are stupidly expensive for the ones that support maxCache, and the $250 for the LSI CacheCade license does not endear them to me either. Even if you have one of those, write-back caching is still a bit of a data-integrity issue (e.g. if your SSD hasn't pushed data to a cell on power loss). Still, I'm surprised a used LSI SAS2108-based board with BBU and license would run you more than $700.

Regarding more spindles: (despite other problems) while it does increase the number of points of failure, in RAID 10 it shouldn't decrease your resilience to failures; each disk still has its span mirror. Additionally, you can probably use that extra disk space for backups from other machines when I/O wanes (I imagine your I/O utilization goes up and down).

OK, a couple of suggestions and a thought. If people are abusing I/O, you can use cgroups to limit their IOPS, container by container and/or for all containers as a whole; the blkio cgroup allows both proportional-to-peer priority and per-device maximum limits for both throughput and IOPS. Next, if you have the network capacity, why not actually use DRBD in async mode to replicate to another box (or even DRBD to the *same* box)? You have already said you have excess CPU and RAM capacity on the system, and a local (either site or machine) TCP/IP connection will cost very little. Finally, if your deployment is large enough, have you considered SAN boxes dedicated to handling I/O? Granted, I don't know what your actual I/O loading looks like, but I imagine not all machines are loaded evenly or at even times, so it might help average out the I/O load between them.
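The blkio throttling part is just a few sysfs writes (cgroup v1, as in kernels of that era). The group name, 8:0 major:minor for /dev/sda, the limits, and the PID are all placeholders:

```shell
# Create a cgroup for one container.
mkdir -p /sys/fs/cgroup/blkio/ct101

# Cap the group at 150 write IOPS and 50 MB/s of reads on device 8:0.
echo "8:0 150"      > /sys/fs/cgroup/blkio/ct101/blkio.throttle.write_iops_device
echo "8:0 52428800" > /sys/fs/cgroup/blkio/ct101/blkio.throttle.read_bps_device

# Put the container's init process (and thus its children) in the group.
echo 12345 > /sys/fs/cgroup/blkio/ct101/tasks
```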

I still believe one of the write-back cache solutions will be reliable enough for your purposes: two SSDs in RAID 1 backed by a pair of rotating disks in RAID 1.
 