GitLab: Time to Leave the Cloud (bare-metal vs shared-environment virtualization)


Active Member
What we found is that the cloud was not meant to provide the level of IOPS performance we needed to run an aggressive system like CephFS.


The problem with CephFS is that in order to work, it needs a really performant underlying infrastructure, because it has to read and write a lot of things really fast. If one of the hosts delays writing to the journal, the rest of the fleet waits on that one operation, and the whole file system is blocked. When this happens, all of the hosts halt and you have a locked file system: no one can read or write anything, and that basically takes everything down.
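To make the "one slow host blocks everyone" point concrete, here is a toy model in Python. It is only a sketch of the behavior described above, not Ceph's actual replication or journaling protocol. It assumes a write completes only once every host has acknowledged it, and the host counts, latencies, and stall probability are made-up illustration values.

```python
import random


def replicated_write_latency(host_latencies_ms):
    """A synchronous write finishes only when the slowest host has acknowledged it."""
    return max(host_latencies_ms)


def simulate(num_hosts, num_writes, stall_probability,
             normal_ms=2.0, stall_ms=500.0, seed=0):
    """Return the completion time of each write in a fleet of num_hosts.

    Every host independently stalls with probability stall_probability
    (hypothetical noisy-neighbor effect on shared cloud storage).
    """
    rng = random.Random(seed)
    results = []
    for _ in range(num_writes):
        latencies = [
            stall_ms if rng.random() < stall_probability else normal_ms
            for _ in range(num_hosts)
        ]
        results.append(replicated_write_latency(latencies))
    return results


if __name__ == "__main__":
    for hosts in (1, 10, 50):
        writes = simulate(num_hosts=hosts, num_writes=10_000, stall_probability=0.01)
        blocked = sum(1 for w in writes if w > 2.0) / len(writes)
        print(f"{hosts:>3} hosts: {blocked:.1%} of writes hit a stall")
```

Under this assumption, a small per-host chance of a stall turns into a much larger chance that any given write is blocked as the fleet grows, because each write waits on whichever host is slowest at that moment.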


Recap: What We Learned

  1. CephFS gives us more scalability and, ostensibly, better performance, but it did not work well in the cloud on shared resources, despite tweaking and tuning it to try to make it work.
  2. There is a threshold of performance on the cloud and if you need more, you will have to pay a lot more, be punished with latencies, or leave the cloud.
  3. Moving to dedicated hardware is more economical and reliable for the scale and performance of our application.
  4. Building an observable system by pulling and aggregating performance data into understandable dashboards helps us spot non-obvious trends and correlations, leading to addressing issues faster.
  5. Monitoring some things can be really application-specific, which is why we are building our own gitlab-monitor Prometheus exporter. We plan to ship this with GitLab CE soon.
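For readers who have not written an exporter before, the sketch below shows roughly what an application-specific metric looks like in the Prometheus text exposition format. It is not the gitlab-monitor implementation. The function name `render_gauge`, the metric name `git_push_seconds`, and the sample values are hypothetical and exist only for illustration.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge metric in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Note: label values are not escaped here, so this sketch assumes they
    contain no quotes, backslashes, or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical application-level measurement: how long a test push took.
    print(render_gauge(
        "git_push_seconds",
        "Hypothetical time taken by a test git push.",
        [({"host": "git01"}, 1.5), ({"host": "git02"}, 0.8)],
    ))
```

A real exporter would serve this text over HTTP for Prometheus to scrape, and would usually rely on an official client library rather than hand-built strings.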


The Irrational One
Retired Staff
That's actually interesting, especially since right now many people are moving to the cloud from bare metal. Most cite the reliability of cloud hardware over bare metal as the main driver, along with the idea that the cloud is much easier to scale than bare metal.


Active Member
It is easier to scale at a trivial level (adding resources is easy). It makes the easy problems easier.

I doubt it makes the hard problem of developing a scalable architecture easier.

I have personally found that the easier a service is to get started with (particularly things like Heroku that have a very quick setup), the more likely you are to run into things you cannot easily do.