Zabbix Series 1: Scalability

splitice

Just a little bit crazy...
Verified Provider
This is the first of what I hope will be a series of posts on the topic of Zabbix. For those not in the know, Zabbix is a free and open source piece of monitoring software with all the features of an enterprise solution. All this advice is based on our experiences over the past year and should be taken with a grain of salt. Your mileage may vary.

When should I consider the scalability of my monitoring system?
Ideally you should plan for growth from the start, but since this rarely happens, begin planning no later than the point where you reach 100-200 values per second. Performance depends on many factors, including the number of checks you perform on proxies (or agents) versus the number of simple checks. We use many simple checks, so we hit our first issues at 100 values per second.
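If you are unsure where you sit relative to that threshold, a rough estimate is the sum of item count divided by update interval across your items. The sketch below assumes a made-up inventory (the counts and intervals are hypothetical, not our real numbers), and the 100 values/s threshold is simply the low end of the range mentioned above.

```python
# Hypothetical inventory: (number of items, update interval in seconds).
# These figures are invented for illustration only.
ITEM_GROUPS = [
    (3000, 60),   # items polled every minute
    (1500, 30),   # items polled every 30 seconds
    (6000, 300),  # slow-moving items polled every 5 minutes
]


def values_per_second(groups):
    """Estimate new values per second as the sum of count / interval."""
    return sum(count / interval for count, interval in groups)


def needs_scaling_plan(groups, threshold=100.0):
    """True once the estimated rate reaches the planning threshold."""
    return values_per_second(groups) >= threshold


if __name__ == "__main__":
    rate = values_per_second(ITEM_GROUPS)
    print("estimated values/s:", rate)
    print("start planning:", needs_scaling_plan(ITEM_GROUPS))
```

This ignores bursts and check type, so treat the result as a ballpark figure rather than a measurement.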

What hardware should I be looking at?

The best thing you can do for this software is to ensure its database is stored on an SSD. This alone will increase your performance more than you would believe. We use a 60GB plan from DigitalOcean and have been very impressed with the IOPS.

At around 150-200 items a second you will likely reach the point where the housekeeper cannot delete records fast enough while competing with inserts for locks (lock contention). At this point you will need to introduce partitioning on your history tables. You can either partition and keep the existing housekeeper, or write your own housekeeper that works by dropping partitions. If you have items with varying history storage periods, you will most likely need to choose the first option.
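For the second option, a custom housekeeper could be as small as the sketch below. It is only an illustration and not what we run: it assumes the table is RANGE partitioned with one partition per day (DROP PARTITION is not available on HASH partitioned tables), and the pYYYYMMDD partition naming and both function names are hypothetical.

```python
from datetime import date, timedelta


def expired_partitions(existing, today, retention_days):
    """Return partition names whose day falls before the retention cutoff.

    Assumes hypothetical names of the form pYYYYMMDD, one partition per day.
    """
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in existing:
        day = date(int(name[1:5]), int(name[5:7]), int(name[7:9]))
        if day < cutoff:
            expired.append(name)
    return expired


def drop_statement(table, partitions):
    """Build a single ALTER TABLE ... DROP PARTITION statement."""
    return "ALTER TABLE `{}` DROP PARTITION {};".format(table, ", ".join(partitions))


if __name__ == "__main__":
    names = ["p20130801", "p20130901", "p20130915"]
    old = expired_partitions(names, date(2013, 10, 1), retention_days=30)
    if old:
        print(drop_statement("history_uint", old))
```

The script only prints the statement. Running it against the database, and creating new partitions ahead of time, is left out.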

An example of what we use is below:


CREATE TABLE `history_uint` (
`itemid` bigint(20) unsigned NOT NULL,
`clock` int(11) NOT NULL DEFAULT '0',
`value` bigint(20) unsigned NOT NULL DEFAULT '0',
`ns` int(11) NOT NULL DEFAULT '0',
KEY `history_uint_1` (`itemid`,`clock`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY HASH (clock DIV 86400)
PARTITIONS 40 */

We chose 40 partitions because most of our data is stored for 30-40 days. For retention periods shorter than the partition count, this keeps inserts in a different partition from the one the housekeeping process is purging. You should partition all the tables you use extensively, including trends and events as applicable.
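To see where a given row lands, the arithmetic can be reproduced outside the database. MySQL's plain HASH partitioning places a row in partition MOD(expr, number of partitions), so for the table above that is (clock DIV 86400) MOD 40. The sketch below is only that arithmetic; the function names are my own.

```python
SECONDS_PER_DAY = 86400
PARTITIONS = 40


def partition_for(clock, partitions=PARTITIONS):
    """Partition index for a Unix timestamp under HASH (clock DIV 86400)."""
    return (clock // SECONDS_PER_DAY) % partitions


def insert_purge_gap(now, retention_days, partitions=PARTITIONS):
    """Partitions between the one being purged and the one being inserted into.

    A result of 0 means inserts and purges hit the same partition.
    """
    insert_p = partition_for(now, partitions)
    purge_p = partition_for(now - retention_days * SECONDS_PER_DAY, partitions)
    return (insert_p - purge_p) % partitions


if __name__ == "__main__":
    now = 1380000000  # an arbitrary timestamp in September 2013
    for days in (30, 35, 39, 40):
        print(days, "days retention -> gap of", insert_purge_gap(now, days), "partitions")
```

Note that a retention of exactly 40 days gives a gap of 0, meaning the purge lands in the same partition as the inserts, so it may be worth keeping a few more partitions than your longest retention period.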

Do not create too many partitions. InnoDB does not reclaim unused table space, so (assuming that, like any sane person, you are using file per table) excess partitions will waste disk space.

Be aware that adding partitioning will take many hours on multi-GB tables. Factor this into your plan if applicable.

What software should I be looking at?

If possible, check out the Zabbix 2.1 (or soon 2.2) branch. It's currently in beta but the performance improvements are exceptional. In our experience 2.1 (Beta 2) is bug free, or at least the features we are using are.

For maximum performance we run Percona MySQL 5.5.

If you use the API heavily, ensure that you have an opcode cache such as APC set up.

So how far can this scale?

Who knows? This new setup has us sitting at a load average below 1.0. The old setup (i7, 4GB RAM, 2x500GB RAID 1 spinning rust bucket) was over 15 and well and truly overloaded.


Like this tutorial? Be sure to let me know. Any requests?
 

HalfEatenPie

The Irrational One
Retired Staff
So... after I saw that this was posted I got really excited.

You know why? Because previously (a year or two ago), when I was looking for an easily scalable monitoring system, I was trying to set up Zabbix. Unfortunately I couldn't get it to work (I spent a week or two on this) and in the end just went with a Centreon + Cacti setup. I would definitely love to get Zabbix working and hopefully use it in the future!

Basically, I'm really excited for this.  

Don't stop.

Ever.
 

Raymii

New Member
I've also tried Zabbix, but after a few weeks it was brought to its knees and was no longer monitoring reliably. Nagios Core, however, still is. I must say I have not tried this new version. But still, writing your own housekeeper for a monitoring system? I want that to be as stable and fast as hell out of the box.

In your opinion, what does zabbix offer over Nagios?
 

splitice

Just a little bit crazy...
Verified Provider
For one, you do not need to implement your own housekeeper. That is an option, and one that may be needed in the future. I am currently cleaning up 1,324,800 items per housekeeper cycle (hour) without issues on the discussed configuration.

Two, I would like to see stock Nagios scale to 200+ items a second. Anything using MySQL will fall foul of the same scalability concerns/issues.

This article is about scaling Zabbix. Like any software pushed to business levels (e.g. monitoring 100 VPS nodes), it needs additional configuration to ensure it meets your needs now and into the future.

Zabbix has far more features built in than Nagios (proxies, templates, customizable network and low level discovery, web scenario monitoring etc). It's an enterprise quality solution, whereas Cacti/Nagios is home quality. I don't mean to insult the developers of Cacti, but I have done a lot of contracting work on Cacti installations and now would not touch it with a 10ft pole.
 

Dylan

Active Member
Zabbix is definitely a good alternative to Nagios Core (the free version). The big advantage Nagios has is the massive number of addons that are already out there -- chances are if there's something you want to monitor or a piece of software you want to integrate with it's already been done.

If you have hundreds of nodes, though, you shouldn't be looking at Nagios Core. You should be looking at the commercial Nagios XI, which is way ahead of Zabbix and used by everyone from Amazon to Yahoo.
 

splitice

Just a little bit crazy...
Verified Provider
While Nagios does have quite an extensive set of scripts, so does Zabbix. Not quite the same number, but for relatively new software (compared to Nagios) it's quite acceptable. I've never had any difficulty creating tiny bash scripts to do my dirty work, so it's a moot point for me.

https://github.com/search?q=zabbix+scripts&type=Repositories&ref=searchresults

Nagios Core + Cacti is pretty poor from my experience (mainly with the Cacti side of things, that software is very poorly coded). Not to mention the double up.

Nagios XI is another kettle of very expensive fish. I have only cursory experience with it and cannot really compare it.
 

splitice

Just a little bit crazy...
Verified Provider
Scalability 1.1 - More information.

Some interesting tweaks discovered over the past week.

MariaDB

If you are running Zabbix in production, you likely have many excess processes to handle cascading failures. If you are anything like us, you probably have in excess of 200 Zabbix processes with only 30-40 being used at any time, and even fewer using their database connection (presuming Zabbix 2.1+ with Value Cache). This results in 200 odd idle connections to MySQL most of the time. This can be a decently large CPU drain (30-40%) and also increases the average query time. Percona benchmarked this:

[Figure: Percona benchmark graph, NOTPM vs. idle connections]

See the problem?... 

The solution I have found so far is to switch to MariaDB (previously we were using Percona Server 5.5). This has dropped our load average from 1.0-1.5 to 0.1-0.2 (aka basically idle). Amazing.

The reason for this improvement would be the way MariaDB handles sleeping threads (single thread and event polling) which results in much better scalability.

Zabbix 2.1 / Zabbix 2.2

The performance improvements gained in this release are groundbreaking. The value cache results in a server that rarely has to query the database for historical data. Of course this depends on how far back your triggers go with averages, sums, counts etc, as well as the allocated size of the cache. It should be out of RC soon :)
 

splitice

Just a little bit crazy...
Verified Provider
Scalability 1.2 - Zabbix 2.2 packages.

Just a quick update, as per a request, with a copy of the pre-release Debian (Wheezy) packages I am using. I would not advocate using them in production etc. The majority of the credit goes to Dotdeb. They are based on the Dotdeb Zabbix 2.0.9 package template; all I did was update the source and re-package.

Version: 2.2.0RC2
Links: 


I haven't tested upgrading 2.0 -> 2.2 with these packages but I have tested 2.0 -> 2.18 -> 2.2RC2 without issue so I assume it works fine.

No warranty implied, use at own risk etc.

Enjoy.
 