Zabbix Series 3: Trigger Anti Flap and Cascading failures

splitice

So have you created your first trigger yet? 

If I have made any mistakes in this post let me know.

Flapping

Presuming you have created your first triggers, you might have noticed that, unlike Nagios, Zabbix does not include automatic flap detection. That is, a trigger can switch on and off repeatedly, sending you an email each time (if configured) or just generally causing havoc.

There are a few ways to handle flapping, depending upon your alerting needs and the type of data you are receiving.

1. Take the Minimum/Maximum value

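This is a simple way to reduce flapping. It is not perfect, but it is fairly effective at reducing the number of emails generated when a value keeps alternating around the threshold. Using min() means the trigger only fires once every value in the period is above the limit, so a single spike is ignored.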


{X4B - DB:system.cpu.load.min(5m)} > 2
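To see why the minimum helps, here is a minimal Python sketch (not Zabbix code) that counts how often a trigger would go from OK to PROBLEM on the same data, once comparing the latest value and once comparing the minimum of the last few samples. The sample values, the 5-sample window standing in for 5m, and the function name are all made up for illustration.

```python
from collections import deque

def count_alerts(samples, threshold, window):
    """Count OK -> PROBLEM transitions for two trigger styles."""
    recent = deque(maxlen=window)  # sliding window of the newest samples
    last_state = min_state = 0
    last_alerts = min_alerts = 0
    for value in samples:
        recent.append(value)
        # Style A: compare only the latest value (like last(0) > threshold).
        new_last = 1 if value > threshold else 0
        # Style B: compare the window minimum (like min(5m) > threshold).
        new_min = 1 if min(recent) > threshold else 0
        # An alert is sent only when the state changes from 0 to 1.
        last_alerts += int(new_last > last_state)
        min_alerts += int(new_min > min_state)
        last_state, min_state = new_last, new_min
    return last_alerts, min_alerts

# Hypothetical CPU load samples hovering around a threshold of 2.
load = [1.8, 2.1, 1.9, 2.2, 1.9, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 1.5]
print(count_alerts(load, threshold=2, window=5))  # (3, 1)
```

The latest-value style alerts three times while the load wobbles around 2, whereas the minimum style alerts once, after the load has stayed above 2 for the whole window.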

2. Hysteresis

This way is a bit more complicated. Before we get started, you should already know this, but it is best to put it in writing: a trigger is in the "PROBLEM" state when its value is 1. Therefore, by using the {TRIGGER.VALUE} macro, which holds the trigger's current value, it is possible to define one condition for declaring a trigger as down and a different one for recovering from that condition.

Let's say we have a check for a disk reaching capacity. We might set the trigger limit at 1GB free, but if an action has been taken to recover from this situation, we expect at least 20GB to be freed before the disk counts as back to normal. It is not foreseeable that an application would write and delete 20GB of data repeatedly in a short period of time, so there will be no alert spam or flapping.

In pseudocode, the conditions for this scenario would be something like:


if (trigger_value == 0 && disk_remaining < 1GB) {
    return 1;
} else if (trigger_value == 1 && disk_remaining > 20GB) {
    return 0;
} else {
    return trigger_value;
}

Now how do we write this in Zabbix?


({TRIGGER.VALUE}=0 & {server:vfs.fs.size[/,free].last(0)} < 1G)
|
({TRIGGER.VALUE}=1 & {server:vfs.fs.size[/,free].last(0)} < 20G)

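This expression should be entered into Zabbix on a single line, without the newlines. Note that the second half reads "< 20G" because the expression describes when the trigger is a PROBLEM: once it is down, it stays down for as long as free space is below 20G.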
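If you want to convince yourself the expression behaves like the pseudocode, here is a small Python sketch that replays a series of free-space readings through the same two-branch logic. It is only a model of the expression, not how Zabbix evaluates triggers internally; the function names and the sample readings are invented for the example.

```python
GB = 1024 ** 3  # Zabbix treats the G suffix as 1024^3 bytes

def next_trigger_value(trigger_value, free_bytes, low=1 * GB, high=20 * GB):
    """Return 1 (PROBLEM) or 0 (OK), mirroring the two-branch expression."""
    # ({TRIGGER.VALUE}=0 & free < 1G): an OK trigger goes down below 1G.
    enters_problem = trigger_value == 0 and free_bytes < low
    # ({TRIGGER.VALUE}=1 & free < 20G): a down trigger stays down below 20G.
    stays_problem = trigger_value == 1 and free_bytes < high
    return 1 if (enters_problem or stays_problem) else 0

def replay(free_series_gb):
    """Feed readings (in GB) through the trigger and record each state."""
    state, history = 0, []
    for free_gb in free_series_gb:
        state = next_trigger_value(state, free_gb * GB)
        history.append(state)
    return history

# Hypothetical readings: the disk fills up, wobbles around 1G, then is cleaned.
print(replay([5, 0.8, 1.2, 0.9, 3, 19, 25, 15]))  # [0, 1, 1, 1, 1, 1, 0, 0]
```

The trigger goes down once at 0.8GB and does not recover at 1.2GB, 3GB or 19GB. It only recovers at 25GB, and it does not fire again at 15GB because an OK trigger needs less than 1GB free to go down.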

What about more complex expressions?

It's reasonable that you may want to factor in other conditions, such as whether a backup is running or disks of different sizes. All of this is possible but beyond the scope of this post.

3. Difference from normal

This is one method I am not currently making heavy use of (due to performance), but I plan to use it more extensively in the future. This method (it can be combined with #2) can lead to some pretty amazing results. The example I am providing is untested (the trigger expression I am currently using is too complicated to use as an example) and a bit contrived, but I hope it makes sense.


({server:mysql_tps.avg(43200)} - {server:mysql_tps.avg(60)}) < ({server:mysql_tps.avg(43200)}/-10)
|
({server:mysql_tps.avg(43200)} - {server:mysql_tps.avg(60)}) > ({server:mysql_tps.avg(43200)}/10)

This expression triggers if the 1-minute average of MySQL transactions per second differs from the 12-hour average by more than 10%, in either direction. The performance of this particular expression would probably be bad. If I have made any mistakes, someone should let me know :p
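The arithmetic is easier to check outside Zabbix. Here is a minimal Python sketch of the same comparison, assuming the two averages have already been computed and that the baseline is positive (which holds for a transactions-per-second counter). The function name and the numbers are made up.

```python
def deviates(baseline_avg, recent_avg):
    """True when recent_avg is more than 10% away from baseline_avg.

    Mirrors the trigger expression: the difference is compared against
    baseline/-10 on one side and baseline/10 on the other.
    Assumes baseline_avg is positive.
    """
    diff = baseline_avg - recent_avg
    return diff < baseline_avg / -10 or diff > baseline_avg / 10

# Hypothetical 12-hour baseline of 100 transactions per second.
print(deviates(100, 105))  # False: within 10% of the baseline
print(deviates(100, 120))  # True: recent rate is well above the baseline
print(deviates(100, 80))   # True: recent rate is well below the baseline
```

Because the band is a fraction of the baseline, the same expression works for a quiet server and a busy one without picking a fixed threshold for each.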

Cascading Failures

I won't spend too much time on this subject, but say you have twenty servers connected to a single router. Should that router go offline, of course those servers are going to appear offline too. Without dependencies configured, every one of them will generate an alert, possibly hiding the router offline trigger.

Avoid all this by correctly setting dependencies. In this case the servers are dependent on the router. Here is an example for ICMP ping checks.

[Image: 58twW.png]

In this case, as the triggers are time dependent, the events will be generated in sequence. But as soon as the 2m trigger fires, the 1m trigger is removed from the dashboard. The 1m trigger will remain down until the 2m trigger is resolved.
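The idea behind dependencies can be modelled in a few lines. This Python sketch is a simplified picture of the router scenario, not how Zabbix implements it: a trigger in PROBLEM is hidden whenever something it depends on is also in PROBLEM. The trigger names and the data structures are invented for the example.

```python
def visible_problems(states, depends_on):
    """Return the problem triggers not suppressed by a failing dependency."""
    visible = []
    for name, value in states.items():
        if value != 1:
            continue  # trigger is OK, nothing to show
        parents = depends_on.get(name, [])
        if any(states.get(parent) == 1 for parent in parents):
            continue  # suppressed: something it depends on is already down
        visible.append(name)
    return visible

# Hypothetical state: the router is down, so the servers behind it look down too.
states = {
    "router ping": 1,
    "server01 ping": 1,
    "server02 ping": 1,
    "server03 disk": 1,
}
depends_on = {
    "server01 ping": ["router ping"],
    "server02 ping": ["router ping"],
}
print(visible_problems(states, depends_on))  # ['router ping', 'server03 disk']
```

Only the router and the unrelated disk problem are reported, which is the point: the root cause is no longer buried under the servers that depend on it.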
 

HalfEatenPie

... it's official. Once this series is done I'll be installing Zabbix onto a VM and trying my hand at it again (from reading it so far, it seems pretty straightforward once laid out like this...).

Seriously man this is great stuff!  
 