
DreamCompute Comes Out of Beta And Undercuts Hourly Competitors By 40% On Price

graeme

Active Member
As you posted in the other thread, OVH has had problems with Ceph as well. Is Ceph immature, or are they using it wrong (maybe through lack of familiarity)?
 

fm7

Active Member
As you posted in the other thread, OVH has had problems with Ceph as well. Is Ceph immature, or are they using it wrong (maybe through lack of familiarity)?

DreamHost and Red Hat/Inktank are lead development contributors, and OVH was nominated for the OpenStack Superuser Awards, so I don't think "using it wrong / lack of familiarity" is the case. I'd guess Ceph was immature years ago (e.g. "Ceph Apocalypse - Storage unavailability Friday November 21st – 26th, 2014") and is still barely proven on large setups/at scale -- Australian VPS provider Binary Lane just dropped Ceph. Criticism of Ceph's performance is old news (e.g. "Killing the Storage Unicorn: Purpose-Built ScaleIO Spanks Multi-Purpose Ceph on Performance"), but data loss and extended outages I can't take lightly. That said, DigitalOcean is almost certainly using Ceph for its block storage offering, and a number of providers are using the Ceph-based Parallels Cloud Storage, supposedly without a glitch. :)
 

Mammoth's Binary Lane e-mail:


2016-May-6


Migrating to local storage

From today, most new Binary Lane cloud servers will be deployed onto direct-attached RAID-10 SSD ("local storage") instead of the Ceph network-distributed cluster ("cloud storage") that Binary Lane has used to date.


Existing cloud servers will be live-migrated to local storage starting June 1 on an ongoing basis as internal resources permit.


Local RAID-10 SSD storage will provide customers with increased server performance, while improvements to KVM will allow us to continue to utilise live-migrations for scheduled maintenance.


We have limited capacity for early adopters who would like to migrate a VPS to local storage prior to June. To do so, please contact [email protected].

Requests will be processed on a first-come, first-served basis and depending on demand may not necessarily be completed prior to June; however we will continue to prioritise such requests ahead of the service-wide migration.

What changes can I expect?

By far, the biggest impact the migration will have is on disk performance. In our testing, we have seen improvements of up to 500%.


The primary negative that may be encountered is when using Change Plan to increase disk size. With the current Cloud SSD solution, all SSDs are combined into a single, massive pool.


By comparison, when using local SSD each cloud server's disk size will be limited by the amount of storage available on the individual host node that the VPS is located on.

In the future we plan to work around this limitation by enabling Change Plan to automatically perform live migrations when the desired upgrade is not available on the current host, but currently customers will need to contact support to request a live-migration if Change Plan reports that sufficient resources are not available.

Why are we dropping cloud storage now?

Before Binary Lane launched in February 2014, early in our solution design process we reached two key decisions:
 

  • The service would be 100% SSD for fantastic disk performance
  • The service would utilise Ceph network-distributed storage, for increased reliability and better availability

In our testing of Ceph before launch, it was apparent that the disk performance of Ceph was significantly below that of a local RAID solution – somewhere around 20% to 30%.

From our investigation at the time, it was apparent that using SSDs with Ceph was a relatively unexplored use case (with typical usage being massive 1PB+ clusters of slow disks) and that the software was not yet optimized enough to bridge the gap.

However, the then-upcoming "Firefly" release was adding support for SSD caching and we felt confident that we could essentially "ride the wave" of new version releases to reach a solution as fast as local storage, while providing more functionality.
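
[Aside, not part of the Binary Lane e-mail: the Firefly-era SSD caching referred to here is Ceph's cache tiering, where a fast pool is layered in front of a slower backing pool. A minimal sketch of the setup commands, with placeholder pool names rbd-hdd and rbd-ssd:]

# Attach an SSD-backed pool as a writeback cache tier in front of an
# HDD-backed pool ("rbd-hdd" and "rbd-ssd" are placeholder pool names).
ceph osd tier add rbd-hdd rbd-ssd
ceph osd tier cache-mode rbd-ssd writeback
ceph osd tier set-overlay rbd-hdd rbd-ssd
ceph osd pool set rbd-ssd hit_set_type bloom   # track object hits for flush/evict decisions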

Instead, we saw a few different things happen that in combination have, at least for us, changed the preferred solution:

  • While Ceph has improved performance to some degree, it has not been the focus of the developers and is still far behind local storage.
  • Inktank (the company that developed Ceph) was purchased by Red Hat, and there appears to be a greater focus on enterprise functionality instead.
  • KVM (our virtualization platform) has implemented a variety of new features allowing for live "mirror" migrations, which allow a VPS and its local storage to be moved from one host to another by transparently copying the disk (and any modifications made during the copy) to the new host (see the sketch after this quote).

This has left us with a scenario where the majority of functionality that we wanted from Ceph is now available in KVM with local storage, and without paying the performance penalty that is still associated with Ceph to date.
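
[Aside, not part of the Binary Lane e-mail: the "mirror" migration described above corresponds to QEMU/libvirt live migration with storage copy. A minimal sketch with virsh, where the guest name and destination host are placeholders:]

# Live-migrate "guest01" together with its local disk image to another KVM host,
# streaming the disk (and writes made during the copy) to the destination.
virsh migrate --live --persistent --copy-storage-all \
    guest01 qemu+ssh://kvm-host2/system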
 

graeme

Active Member
Interesting, so what went wrong, then? I know more about Ceph after reading that, but not what has gone wrong in these cases.

Binary Lane seem to have been using Ceph even though they did not want distributed storage (or decided that a 20% to 30% penalty for distributed storage was not worth paying).
 

fm7

Active Member
Binary Lane seem to have been using Ceph even though they did not want distributed storage (or decided that a 20% to 30% penalty for distributed storage was not worth paying).

 They (and some customers) did want distributed storage:

  • "The service would utilise Ceph network-distributed storage, for increased reliability and better availability"

But Binary Lane also did want stellar performance. :)

  • "The service would be 100% SSD for fantastic disk performance"

However... BL's VPS dd throughput was not that great, even without forcing immediate disk writes.

Binary Lane (Brisbane, NextDC, July 6, 2014)


:~# dd if=/dev/zero of=/swap bs=64k count=64k
65536+0 records in
65536+0 records out
4294967296 bytes (4.3 GB) copied, 35.871 s, 120 MB/s
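
(Note: the test above has no conv=fdatasync, so dd reports throughput before the data has necessarily reached the disk; the page cache can only inflate the figure. A flushed version of the same test, with a placeholder output path, would look like this:)

# Same sequential write, but conv=fdatasync flushes the data to disk
# before dd reports throughput ("/tmp/ddtest" is a placeholder path).
dd if=/dev/zero of=/tmp/ddtest bs=64k count=64k conv=fdatasync
rm /tmp/ddtest

The Vultr results below were taken with that flag, which makes BL's 120 MB/s look even worse by comparison.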

Compare to direct competitor Vultr

Vultr


(Sydney, July 9, 2014)


:~# dd if=/dev/zero of=test bs=64k count=16k conv=fdatasync
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB) copied, 2.38899 s, 449 MB/s


(Sydney, Aug 19, 2014)


:~# dd if=/dev/zero of=test bs=64k count=16k conv=fdatasync
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB) copied, 2.47158 s, 434 MB/s



Yet they bravely kept backing Ceph. :)
 

From our investigation at the time, it was apparent that using SSDs with Ceph was a relatively unexplored use case (with typical usage being massive 1PB+ clusters of slow disks) and that the software was not yet optimized enough to bridge the gap.


However, the then-upcoming "Firefly" release was adding support for SSD caching and we felt confident that we could essentially "ride the wave" of new version releases to reach a solution as fast as local storage, while providing more functionality.

Instead, we saw a few different things happen that in combination have, at least for us, changed the preferred solution:


While Ceph has improved performance to some degree, it has not been the focus of the developers and is still far behind local storage



Customers' reactions:

I've used BL since they started, and this one change is a real shame.

What is even more disappointing is they have not notified us what will occur in case of total hardware failure.

Linode for example leave a spare server per rack in case a node fails, they simply get a tech on site and move the disks to the spare node and boot up and away they go.

BL really needs to explain what their process is in case of a hardware failure. If they follow the Linode model, I'll probably hang around, as it is a reasonable approach for a local storage VPS solution.


VM migration to a hot spare during HV failure was a manual process anyway (even with network block storage). So I'm not sure there is a huge RTO difference between them replacing a bum RAID controller and hitting the button to migrate VMs off a host.

Personally I think the benefits of SANs/EBS are a bit overblown. Risk of outage to any individual volume is higher, performance is worse. Auto-failover is rarely a reality. Meh.
 

fm7

Active Member
Decoupling Storage from Compute in Apache Hadoop with Ceph

Intel and QCT (Quanta Cloud Technology) partnered to create a block storage solution that delivers multi-tenancy, workload flexibility, and massive scalability using Ceph. Using a Hadoop workload, Ceph was optimized to provide the backend storage solution. To provide the performance needed to disaggregate Hadoop storage, a hybrid deployment of Intel® Solid State Drives with NVMe and HDDs was used, with performance optimized by Intel® Cache Acceleration Software. Veda will share the results of key benchmarks and the underlying architecture that is being deployed by a large Cloud Service Provider (CSP) to reduce operational complexity and save cost.

Live online Oct 17, 9:00 am (United States - Los Angeles), or on demand afterwards. 45 mins.


Presented by
Veda Shankar, Director, Emerging Technologies, Solution Business Development Group, QCT (Quanta Cloud Technology)


 https://www.brighttalk.com/webcast/14395/226755
 