# Hardware Monitoring System?



## HalfEatenPie (Mar 16, 2016)

So we have server and network monitoring and such. My concern is figuring out the server hardware monitoring portion.  Other monitoring suggestions are greatly welcome (because I don't know what else can be monitored off the top of my head), but one specific example would be the hard drives.


I want a way to keep an eye on hard drive health and uptime, and the ability to know if/when an impending hard drive failure might happen.  A similar setup added on to Observium or LibreNMS would be great, but I don't really know if any easy centralized solution is available.  If I recall, SpiceWorks does all this, but I was wondering if anyone had any other ideas.


Thanks!


----------



## willie (Mar 16, 2016)

The basic idea for disks is to monitor the SMART statistics and schedule replacement if (at least for HDDs) any of the failure-predicting parameters change.  There's an old Google HDD reliability paper that said even tiny blips in those reports meant the drive was much more likely to fail.  Servers usually have some other monitoring facilities for case temperature, ECC error events, CPU temperature, and other things like that.  What to alert on?  Not sure, maybe there's some advice out there.
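To sketch the idea: the Google paper's strongest predictors were reallocated/pending sector counts, so the simplest useful check is "alert the moment any of those raw values goes non-zero."  A rough Python sketch, assuming the column layout of `smartctl -A` text output (the predictor list is my reading of the paper, not a definitive set):

```python
# Flag a drive as soon as any failure-predicting SMART attribute
# (per the Google HDD study) moves off zero. The attribute names and
# column positions assume plain `smartctl -A` output.

FAILURE_PREDICTORS = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

def parse_smart_attributes(smartctl_output):
    """Map attribute name -> raw value from `smartctl -A` text."""
    attrs = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have 10+ columns.
        if len(fields) >= 10 and fields[0].isdigit():
            name, raw_value = fields[1], fields[9]
            attrs[name] = int(raw_value)
    return attrs

def replacement_flags(attrs):
    """Return the predictor attributes whose raw value is non-zero."""
    return {n: v for n, v in attrs.items()
            if n in FAILURE_PREDICTORS and v > 0}
```

Anything `replacement_flags` returns would be grounds to schedule a swap, even if the drive still "works."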


----------



## HalfEatenPie (Mar 17, 2016)

Yeah, disk monitoring would be parsing the data output by SMART.  I might look into that reliability report.


If I recall correctly, Spiceworks and PRTG should handle these things, but I guess there's no easy focused solution for this.  Bummer!  Maybe I'll look into making something myself down the road when I have time.


----------



## fm7 (Mar 19, 2016)

Online.net's HardwareWatch service (*)  uses IPMI to check the status of various sensors on their servers (almost 100% HP and Dell; mostly HW RAID).


PRTG has this feature.

Nagios has at least one plugin -- you may want to take a look at this white paper: https://www.thomas-krenn.com/de/wikiDE/images/7/7c/20100610-Hardware-Monitoring-with-the-new-Nagios-IPMI-Plugin.pdf


BTW, if you are using HW RAID controllers, S.M.A.R.T. data may not be available to your favorite program (e.g. smartmontools -- "Checking disks behind RAID controllers").


---



(*) From Online.net





> HARDWAREWATCH®
> 
> 
> Your Dedibox® is supervised automatically 24/24 by our teams. In case of a hardware problem we automatically launch an intervention to replace the defective part.
> ...


----------



## fm7 (Mar 19, 2016)

willie said:


> ...There's an old Google hdd reliability paper that said even tiny blips in thosee reports meant the drive was much more likely to fail.



There is a new Google *SSD* study that says "RBER (raw bit error rate), the standard metric for drive reliability, is not a good predictor of those failure modes that are the major concern in practice".



> Higher rate of problems with SSDs rather than HDDs
> 
> 
> 
> ...






> Some of the findings and conclusions might be surprising.
> 
> 
> Between 20–63% of drives experience at least one uncorrectable error during their first four years in the field, making uncorrectable errors the most common non-transparent error in these drives. Between 2–6 out of 1,000 drive days are affected by them.
> ...





*BTW*



> *Industrial Temperature and NAND Flash in SSD Products | EEWeb*
> 
> 
> Conclusion
> ...


----------



## ikoula (Mar 22, 2016)

Hello,


A very basic way to do that could be to set up a cron job that captures the output of the smartctl command and sends it to you by email?
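That idea sketched out in Python (the device list, recipient address, and sendmail path are placeholders; a crontab entry like `0 6 * * * /usr/local/bin/smart_report.py` would run it daily):

```python
# Grab smartctl output for each disk, wrap it in one message, and hand
# it to the local MTA. All paths and addresses below are placeholders.
import subprocess
from email.message import EmailMessage

DISKS = ["/dev/sda"]            # placeholder device list
RECIPIENT = "you@example.com"   # placeholder address

def collect(disks):
    """Run `smartctl -A` for each disk and return the raw reports."""
    return [subprocess.run(["smartctl", "-A", d],
                           capture_output=True, text=True).stdout
            for d in disks]

def build_report(smart_outputs):
    """Assemble one email from the per-disk smartctl reports."""
    msg = EmailMessage()
    msg["Subject"] = "Daily SMART report"
    msg["To"] = RECIPIENT
    msg.set_content("\n\n".join(smart_outputs))
    return msg

def send(msg):
    """Hand the message to the local MTA via sendmail."""
    subprocess.run(["/usr/sbin/sendmail", "-t"],
                   input=msg.as_string(), text=True)
```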


Greetings


----------



## willie (Mar 23, 2016)

ikoula said:


> A very basic way to do that could be to set up a cron job that captures the output of the smartctl command and sends it to you by email



Manually examining the output would get annoying after the second or third time, especially if you have multiple servers.  You need a program to monitor the relevant parameters and alert you if something goes wrong.  It shouldn't be complicated though.  It's mostly a matter of deciding which parameters to monitor.  Then it can be a standard Nagios check that runs a few times a day.
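Something like this is all a Nagios check needs to be: pick the parameters, give each a threshold, and map the worst result to the standard plugin exit codes (0=OK, 1=WARNING, 2=CRITICAL).  The attribute names and threshold values here are illustrative assumptions, not recommendations:

```python
# Minimal Nagios-style evaluation of raw SMART values against
# warn/crit thresholds. Thresholds below are made-up examples.

THRESHOLDS = {
    # attribute: (warn_at, crit_at), compared against the raw value
    "Reallocated_Sector_Ct": (1, 10),
    "Current_Pending_Sector": (1, 5),
    "Temperature_Celsius": (45, 55),
}

def evaluate(attrs):
    """Return (nagios_exit_code, status_message) for raw SMART values."""
    worst, problems = 0, []
    for name, (warn, crit) in THRESHOLDS.items():
        value = attrs.get(name, 0)
        if value >= crit:
            worst = max(worst, 2)
            problems.append(f"CRITICAL {name}={value}")
        elif value >= warn:
            worst = max(worst, 1)
            problems.append(f"WARNING {name}={value}")
    return worst, "; ".join(problems) or "OK - all monitored attributes nominal"

# A wrapper script would feed in values from smartctl, print the
# message, and sys.exit() with the code so Nagios picks up the state.
```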


FM7: yes, I did see something about that SSD study, which is interesting.  One-word takeaway: "backup".


----------



## fm7 (Mar 23, 2016)

willie said:


> One word takeaway: "backup".



Backup is a sort of miraculous snake oil recommended to cure any issue posted in WHT.


Unfortunately, backup saves only the past.  That's not a problem if you have zero visitors, zero transactions, zero e-mails.


----------



## HalfEatenPie (Mar 23, 2016)

ikoula said:


> Hello,
> 
> 
> A very basic way to do that could be to set up a cron job that captures the output of the smartctl command and sends it to you by email?
> ...



Yeah, that doesn't work, because the focus of this is to create meaning and definition behind the numbers you get.  Organizing the dataset and showing it in a meaningful way is more important than simply seeing the numbers every day via email.


Basically, what's probably going to be needed is an agent that parses the output from smartctl and works from there.  @willie mentioned Nagios, which is actually a pretty good idea, but I already have a central Observium installation set up with modified plugins, so (since I already have collectd integration working on my Observium) I'm thinking of using a collectd plugin that parses the output from SMART.  A modified version of this might work: https://gist.github.com/jinnko/6366979 It's a bit of a half-baked idea right now, but it could probably work.  Gotta look into it some more.
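For anyone unfamiliar with the collectd side of that: an exec-plugin script just emits `PUTVAL` lines on stdout, one per metric, which collectd then graphs or forwards.  A stripped-down sketch of the shape of it (hostname, disk name, and interval are assumptions):

```python
# Format SMART raw values in collectd's exec-plugin text protocol.
import time

def putval_line(host, disk, attribute, value, interval=300):
    """Format one SMART raw value as a collectd PUTVAL line."""
    # identifier is host/plugin-instance/type-instance
    identifier = f"{host}/smart-{disk}/gauge-{attribute}"
    return f'PUTVAL "{identifier}" interval={interval} {int(time.time())}:{value}'

# The real plugin would loop forever, shelling out to `smartctl -A`
# for each disk every `interval` seconds and printing one line per
# tracked attribute.
print(putval_line("db1.example.com", "sda", "Reallocated_Sector_Ct", 3))
```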



fm7 said:


> Backup is a sort of miraculous snake oil recommended to cure any issue posted in WHT.
> 
> 
> Unfortunately backup saves only the past.  Not that problem if you have zero visitors, zero transactions, zero e-mails.



Backup won't help with redundancy and automatic fail-over, so in regards to uptime, yeah, backups don't do squat.  However, backups do help in regards to risk management and mitigation of potential damages.  To some extent, anyways.  So yeah, I agree, backups don't help with uptime, but a backup is the cheapest solution out there for most people's projects, rather than setting up a fully redundant setup with database mirroring and such.


Haha, totally unrelated, but it doesn't help when everyone penny-pinches and then complains on WHT and other forums when a server's down.  Yeah, you don't pay for the server to be down, but at the same time, this is the real world: you gotta have backup solutions in place to help mitigate potential damages and downtime.  Want full redundancy and failover?  Pay for or organize a proper fail-over solution.


Oh, that reminds me, again totally unrelated and derailing this thread: anyone ever try Constellix?  It's made by the people who run DNSMadeEasy.  I'm using it right now and it's pretty rockin' and solid.  Constellix Sonar is pretty baller as well!  The majority of their nodes are with HostVirtual, and for those who don't know, VR is pretty top notch.


----------



## fm7 (Mar 24, 2016)

HalfEatenPie said:


> Backup won't help with redundancy and automatic fail-over, so in regards to uptime yeah backups don't do squat



Assuming you are using a reputable provider, backup is useful for restoring one or a set of files (edited, updated, removed, compromised, damaged). Nothing more. Of course, if one is using a fly-by-night provider there are other "creative" uses for backup ... but those users think rsync is backup.



HalfEatenPie said:


> However, backups do help in regards to risk management and mitigation of potential damages.



I couldn't disagree more 


I guess users aren't fond of applications that lose data they entered manually, and surely they don't like services that lose their unread e-mails or their payments, and so on ...



HalfEatenPie said:


> ... but a backup is the cheapest solution out there for most people's projects



Houston, we have a problem 


IMO there is no such thing as a "cheapest solution". A project has _requirements_, _one_ _optimal_ solution, and then "the lowest cost for the _optimal_ solution".


Unfortunately, many buyers of the so-called "cheapest solution" are buying neither the cheapest nor the solution. They are buying price.


----------



## fm7 (Mar 24, 2016)

"risk management and mitigation of potential damages"



> *3/21/16 8:00pm PST*
> 
> 
> We discovered an issue in one of our backup systems last Thursday night (03/17). Maintenance was scheduled to resolve the issue over the weekend. On working to resolve the issue, an administrator accidentally deleted the production database.
> ...


 


;-)


----------



## HalfEatenPie (Mar 24, 2016)

Well @fm7, I'd have to respectfully disagree.


I think first setting some definition would be necessary. 


Risk Management is basically the identification, assessment, and prioritization of risks.  Risk is a defined science and goes hand in hand with decision-making science.  I won't get into the science and background of risk (it's its own research topic), but the mathematical definition of risk is commonly accepted as Risk = (Hazard × Exposure × Vulnerability) / Capacity.  Capacity, which is also commonly expressed as mitigation, is defined as anything (any measure) used to minimize the value of risk.  Therefore, as you can see, Risk can never be 0, and Capacity has an inverse relationship with Risk.  Hazard is defined as the potential danger/damage possible.  Vulnerability is defined as the characteristics or circumstances that make something susceptible to damage.  Exposure is the likelihood the hazard (event) will actually happen.


To put this in an example, think about putting your money in a bank.  The vulnerability is the amount/value of money you have saved in that bank.  The hazard is the event of your bank getting looted.  Exposure is the probability of your bank getting looted (e.g. bad neighborhood, etc.).  Capacity would be the security measures put in place to minimize that risk.
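The formula runs fine with toy numbers; everything below is made up purely to show the shape of the relationships (doubling Capacity halves Risk, and Risk never reaches zero while the numerator is non-zero):

```python
# Toy calculation of Risk = (Hazard * Exposure * Vulnerability) / Capacity
# using the bank example; all values are illustrative, not real data.

def risk(hazard, exposure, vulnerability, capacity):
    return hazard * exposure * vulnerability / capacity

# $100k in the bank, 1% yearly chance of a looting, baseline security.
baseline = risk(hazard=1.0, exposure=0.01, vulnerability=100_000, capacity=2.0)
# Same bank after doubling the security measures (Capacity).
hardened = risk(hazard=1.0, exposure=0.01, vulnerability=100_000, capacity=4.0)
print(baseline, hardened)  # hardened comes out at half of baseline
```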


_*Note*: I've been observing this discussion in a macro sense, since that's what my research is in.  _


Now when I wrote that previous post and included backups in my discussion as a mitigation of risk, I was focusing on the risk regarding damages to a website.  I guess now, putting this into these words, I think your focus was the risk of downtime.  So I think it was simply a breakdown in figuring out what each of us was specifically focusing on.


Personally, I think of it this way.  As long as you're going with a "reliable" provider, for an average person running a basic website, uptime really isn't that big of an issue.  Small businesses that don't get most of their customers via online means, but rather simply maintain an online presence, probably won't care about their website being down for a day or two.  They probably won't even notice, actually.  Therefore, if during their cost-benefit analysis they believe saving some extra money instead of setting up a properly configured redundant solution is worth it for them, then technically it can be considered the cheaper alternative.  However, if they changed the weighting of their cost-benefit analysis to focus more on uptime, then they may believe setting up a proper redundant solution with automatic fail-over is worth the extra bucks.


For a higher-value website, that's when you'd probably look into redundancy and such.


Funny thing is, most people don't perform a proper cost-benefit analysis and risk assessment when they look into hosting.  They don't properly assess their situation and what they actually need.  All they look at is the price tag and how much "space" you get, and go "ooh, cheap!" and buy.  This is how problems like GVH happen.


Wow this thread took a big detour.  


*Edit:* and re-reading it definitely shows this post is my first draft and I haven't edited anything.  Regardless, I'm a lazy pooter and I'll keep this the jumbled mess that it is, since I'm sure you all understand the point.


----------



## fm7 (Mar 24, 2016)

HalfEatenPie said:


> Now when I wrote that previous post and included backups in my discussion as a mitigation of risk, I was focusing on the risk regarding damages to a website.  I guess now, putting this into these words, I think your focus was the risk of downtime.



Nope. My focus was the risk of data loss caused by user mistakes, hardware malfunction, or program errors.


Not a matter of uptime/availability but data integrity.



HalfEatenPie said:


> Therefore, if during their cost-benefit analysis they believe saving some extra money instead of setting up a properly configured redundant solution is worth it for them, then technically it can be considered the cheaper alternative.  However, if they changed the weighting of their cost-benefit analysis to focus more on uptime, then they may believe setting up a proper redundant solution with automatic fail-over is worth the extra bucks.





The "benefits" set the requirements. The cost depends on the requirements.


----------



## fm7 (Mar 24, 2016)

HalfEatenPie said:


> Funny thing is, most people don't perform a proper cost-benefit analysis and risk assessment when they look into hosting.  They don't properly assess their situation and what they actually need.  All they look at is the price tag and how much "space" you get, and go "ooh, cheap!" and buy.  This is how problems like GVH happen.
> 
> 
> Wow this thread took a big detour.



The buyer is just searching the lowest price for a "wish list".


I don't think it is a problem because there is no real problem to be solved.


Chances are the same happens in zillions of areas where you are not an expert and don't care to seek expert opinion, because you don't value the product or service.


----------



## fm7 (Mar 25, 2016)

Gliffy: fully redundant setup with database replication and such



> ...
> 
> 
> *Thursday 03.17.16*
> ...


----------

