...
Thursday 03.17.16
Our alerting system identified an issue with the replication of one of our databases. We use MySQL in a master-master-slave configuration, and the secondary master node had fallen too far behind the primary master node, so a full reseed was required to restart replication. Maintenance to repair this issue was scheduled for the weekend.
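As a point of reference (this check is not from the original post, but is the standard way to see lag on a MySQL replica of that era), the lag is visible in the Seconds_Behind_Master field of SHOW SLAVE STATUS on the secondary:

    -- Run on the lagging secondary master (MySQL 5.x era syntax):
    SHOW SLAVE STATUS\G
    -- Seconds_Behind_Master: a large or steadily growing value means the node
    -- cannot catch up from the binlogs alone; NULL means replication has
    -- stopped entirely. Either way, a full reseed may be the only remedy.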
Sunday 03.20.16, 10:30 PM PDT: Disaster Strikes
A system administrator was tasked with reseeding our secondary master. Any replication seed starts with a command to drop all tables from the schema. Unfortunately, they failed to sever the master-master link before executing the restore, and the DROP TABLE commands cascaded to our primary node and, ultimately, to the slaves. Within a matter of seconds, all of our data was deleted. Our engineering team was notified immediately that the application had failed when our alerting system triggered once again. We kicked off a restoration process from our last daily backup, which had run late Saturday evening.
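For illustration only (these exact statements are assumptions, not commands quoted from the incident), the missing precaution amounts to two statements run on the node being reseeded before anything destructive happens: stop applying events from the other master, and keep the restore session out of the binary log so its DROP TABLE statements cannot replicate back.

    -- On the secondary master, before the reseed (MySQL 5.x era):
    STOP SLAVE;                    -- stop applying events from the primary
    SET SESSION sql_log_bin = 0;   -- this session's statements are not written
                                   -- to the binlog, so the DROP TABLEs and the
                                   -- reload cannot cascade to the primary
    -- ... drop tables and load the seed data here ...
    SET SESSION sql_log_bin = 1;   -- re-enable logging once the seed is done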
Because we use replication, MySQL produces binary logs (binlogs) recording every data-modifying statement executed on the master. The combination of our Saturday backup and our binlog retention policy (roughly 2-3 days of data) gave us enough overlap to perform a full and complete restore of all customer data up to the point of disaster. This gave us some level of comfort.
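In outline, that kind of point-in-time recovery looks like the sketch below; the file names, positions, and timestamps are placeholders rather than values from the incident.

    # 1. Reload the Saturday logical backup.
    mysql < saturday_backup.sql
    # 2. Replay binlog events recorded after the backup, stopping just
    #    before the errant DROP TABLE statements reached the primary.
    mysqlbinlog --start-position=<position-recorded-in-backup> \
                --stop-datetime="<timestamp just before the drops>" \
                binlog.000042 binlog.000043 | mysql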
Monday morning to Monday evening, 03.21.16
Every company should have a backup and recovery plan.
We did. The last time we tested it, restoration took 10-12 hours: an inconvenient amount of downtime, but not catastrophic.
We watched the restore process and waited. However, the combination of a recent configuration change to use table-level compression, the sheer amount of data we had built up since our last backup/recovery test, and the fact that the MySQL restore process is single-threaded meant that our restore would take an estimated 4+ days to complete. We were caught off guard by how long this process would run, and it was clearly an unacceptable amount of downtime.
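To make the bottleneck concrete (the backup format and names below are assumptions, since the post does not name the tooling): a logical dump is replayed by a single client connection, one statement at a time, and InnoDB table-level compression adds per-page compression work to every insert on top of that.

    # A plain logical-dump restore runs on one connection, one statement at a
    # time, regardless of how many cores the server has:
    mysql our_database < saturday_backup.sql
    # Parallel tools such as mydumper/myloader can split the reload across
    # threads, but only if the backup was taken in their format to begin with.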
...
Wednesday 03.23.16, 09:45 PM PDT
Full restoration of the system was completed, for a total downtime of seventy-one (71) hours and fifteen (15) minutes.
Ironically, in the end we followed our restoration playbook exactly as it was written.
...
https://www.gliffy.com/blog/2016/03/25/turtle-and-the-hare/