# Backplane killing drives?



## devonblzx (Oct 16, 2014)

I purchased a server to use for storage, and it turned out to be a very bad purchase.  First, we had a RAID6 fail on us twice (up to 4 drives would randomly fail in the array).  The server originally arrived with a dented power supply and a bent hard drive cage, which meant we had to pry it open just to get a hard drive out, so I don't know if the server was damaged by them or by CC when they had it under their control.

I first thought maybe it was a power issue, so I replaced the power supply, but it happened again.  So I replaced the RAID card with an HBA card, since I had started using software RAID.

After the first failure, no drives showed errors in an extended SMART test.  There were a few reporting interrupts though.

After the second failure, two drives showed a read error, so I swapped out those drives and two other drives that were showing interrupts (which seem to be working fine in a different server now).

Now, two of the brand new drives are showing read failures.  So in two months' time I've replaced hard drives, the power supply, and the RAID card, and four hard drives have now shown read failures.

I confirmed that the read failures also show when running the test on a different system so it isn't just a bad connection with those drives.

So either I've had the worst luck with hard drives or *could something in the system be killing the drives?*  My first suspect would be the power supply but since that is new, what else could it be?  Anyone have any ideas?

I'd rather not lose all the money I had invested in this system because I'm not going to be like the last person and sell off a faulty system.


----------



## Virpus-Ken (Oct 16, 2014)

I have not experienced what you are describing, mostly because we have performance issues right off the bat on 3 different systems.  We experience over 50% reduction in disk performance when using the DL160's.  Doesn't matter if we are using the mobo or a h/w RAID controller.  Did this with SSD's so I'm thinking it is a compatibility issue.  

Also, what kind of drives are you using?  SSD? HDD?  Brand, model?


----------



## devonblzx (Oct 16, 2014)

Well, the drives that were in the system were Seagate SV35s, whose warranty conveniently ended right before they showed the read failures.

The drives I replaced them with were new WD Red 3TB.


----------



## Munzy (Oct 16, 2014)

The problem may be more the fact that you are using RAID 6. RAID 5/6 is very, very damaging to hard drives.


----------



## Virpus-Ken (Oct 16, 2014)

In the past, I found that when I used cheap RAID drives (Toshiba I think), they would randomly fall out of the array and error out (even though the drive was good).  When I replaced them with RE4's, this fixed the problem.

Maybe this is a similar issue.


----------



## devonblzx (Oct 16, 2014)

Virpus-Ken said:


> In the past, I found that when I used cheap RAID drives (Toshiba I think), they would randomly fall out of the array and error out (even though the drive was good).  When I replaced them with RE4's, this fixed the problem.
> 
> Maybe this is a similar issue.


That's probably more hardware RAID related (TLER) unless the drives actually failed.  I don't think that is the case here.  The WD Reds are made for arrays and two of the four failed within 2 weeks.  Could be a bad batch but considering the other situations, I'm leaning towards something else.



Munzy said:


> The problem may be more the fact that you are using RAID 6. RAID 5/6 is very, very damaging to hard drives.


Why do you say that?  AFAIK, the parity calculations are done in the CPU (or card) so less data should be written to the drives than a mirror and the reads should only be heavy during a rebuild, otherwise it should act closer to a striped array.  RAID6 (2P) and RAID7 (3P) are actually the primary RAID method for backup servers since it can allow for better failure handling than RAID10.  RAID10 could fail if you lose two of the drives in the same mirror (which has a higher chance during a rebuild).


----------



## Munzy (Oct 16, 2014)

From my understanding and general deployments, RAID 5/6 is very tough on hard drives. Yes, it has a better chance of surviving a drive failure than RAID10, but it is more complicated and thus harder on the drives.

This is also why most normal drives including RED drives aren't suggested to be used with RAID 5/6/7.

Only enterprise grade drives are suggested for RAID 5/6/7.


----------



## devonblzx (Oct 17, 2014)

Munzy said:


> This is also why most normal drives including RED drives aren't suggested to be used with RAID 5/6/7.


Do you have any source or external link on that suggestion?  I've never heard of that before. 

As I explained, the calculations are done by the card or cpu, not the drives so the complexity doesn't weigh on the disks.   During a write, after the calculation, only the striped block and the parity block are committed to each disk. Therefore it should be less stressful on the drives than a mirror (RAID1) because the striped+parity block are smaller than the actual write (which would be written as a whole in a mirror).  Reads act like a RAID0 across the non-parity blocks.  The only time it should be more stressful is during a rebuild in which a RAID5 or RAID6 has to read in the entire array to recalculate the parities.


----------



## serverian (Oct 17, 2014)

devonblzx said:


> I'm not going to be like the last person and sell off a faulty system.


The server was sold in December 2013 for well under what it would regularly cost (even lower than what my set price was) *with all drives in warranty.*

As mentioned, the server was in use as a backup server for 6 months for nightly backups in a RAID10 configuration and then left idle for some time, *which can be proven by the power-on hours count of the drives.*

One of the disks was showing some errors due to a power loss at the DC, and this was mentioned to you. *You said you were going to replace it within the warranty. None of the other drives reported anything on a SMART test.*

I've given you SSH access to the server for you to check anything you wanted to check before the sale happened.

*You did and confirmed the sale.*

It's October 2014, 10 months after the purchase, and you are claiming that I sold you a faulty system.

I sold this as a backup server since the drives are simply not suitable for high IO usage. You decided to make it a VPS node. You decided to run RAID6. You put the server into production without doing extended stress tests. Therefore your disks started to fail after you started to put some load on them.

- Disks were under warranty. Therefore, easily replaceable if they were faulty.

- I've never seen the chassis. I didn't know it was bent. However, if this was a problem, you could have told me after I got it unracked and moved to your space, and I'd have been more than glad to take it back.

- CPU, motherboard, memory, and RAID card are functioning, from what you said.

- You are suspecting a backplane issue. A backplane costs $30 on eBay: http://www.ebay.com/itm/HP-507304-001-ProLiant-DL180-G6-12-Bay-Hard-Drive-BackPlane-Board-/371145261944?pt=US_Server_Boards&hash=item5669fb4378

If you want, give me your Paypal email and I'll send you the $30 to replace what you think is faulty.

Just don't stoop so low as to blame me for your own mistakes and throw mud on my name.


----------



## devonblzx (Oct 17, 2014)

Clearly, the UDMA errors shown in SMART weren't the result of a power issue at the DC, because they started showing on 4 other drives as soon as the system had been running for more than 30 days.

I upgraded the system and ran tests, but didn't place it in service until this summer.  The warranties expired at the end of June and the drives didn't show read failures until August.

I wasn't told of the damage to the server when I purchased it, and I had already paid ~$120 for Choopa to unrack and ship it to me, so I decided to deal with it. But you really should have made me aware of the damage prior to selling it.

It wasn't a "VPS node".  It was a storage server with virtual servers on it.  It was used for backups, just like you claimed you used it for.  It wasn't handling databases and high IOPS.

I'm just stating facts about the server I bought from you. Any "mud slinging" is a result of the issues with the system I bought, paid a lot of money for, and have dumped countless man hours into.

The backplane was just a question, I've already had to replace the power supply which was damaged and the RAID card which appeared to be causing the UDMA errors.  Whoever was handling your server did not handle it with care so I highly suggest you check out the server for damage before selling it in the future.


----------



## willie (Oct 17, 2014)

Raid 5 and raid 6 put a lot of stress on drives because every update has to write stuff on all the drives, more seeking, etc.  I can understand the advice to stick with raid 1 or raid 10 on consumer drives.  Enterprise drives are built to handle much more mechanical load.


----------



## devonblzx (Oct 17, 2014)

willie said:


> Raid 5 and raid 6 put a lot of stress on drives because every update has to write stuff on all the drives, more seeking, etc.  I can understand the advice to stick with raid 1 or raid 10 on consumer drives.  Enterprise drives are built to handle much more mechanical load.


Thanks for the explanation.  I do see your point, especially with random writes versus a striped mirror like RAID10.  Sequential writes larger than the chunk size probably don't make a difference so a storage/backup server shouldn't be too much of a concern with this because sequential writes greater than the chunk size in any type of striped array are written to multiple, or all, drives.
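The two write patterns being discussed can be put into rough numbers. The sketch below is only a textbook model, assuming an in-place read-modify-write for small updates and full-stripe writes for large sequential ones; real controllers and software RAID implementations may choose other strategies and cache aggressively. The function names and the 12-drive example are hypothetical, chosen for illustration.

```python
from fractions import Fraction

# Number of parity chunks per stripe for each level in this simplified model.
PARITY_CHUNKS = {"raid10": 0, "raid5": 1, "raid6": 2}


def small_write_ios(level):
    """Disk I/Os needed to update a single chunk in place (random write)."""
    parity = PARITY_CHUNKS[level]
    if parity == 0:
        # Mirror: write the chunk to both copies, nothing to read first.
        return {"reads": 0, "writes": 2}
    # Parity RAID: read old data and old parity, then write new data and parity.
    return {"reads": 1 + parity, "writes": 1 + parity}


def full_stripe_chunks_written(level, drives, data_chunks):
    """Chunks physically written when data_chunks arrive as full stripes."""
    parity = PARITY_CHUNKS[level]
    if parity == 0:
        return Fraction(2 * data_chunks)  # every chunk is stored twice
    # Each stripe carries (drives - parity) data chunks plus the parity chunks.
    return Fraction(data_chunks * drives, drives - parity)


if __name__ == "__main__":
    for level in ("raid10", "raid5", "raid6"):
        print(level, "small write:", small_write_ios(level))
        print(level, "chunks written for 10 data chunks on 12 drives:",
              full_stripe_chunks_written(level, 12, 10))
```

In this model a small random write costs more I/O on RAID5/6 than on RAID10, while a large sequential write puts fewer chunks on disk with RAID6 than with a mirror, so both points made above hold for their own workload.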



serverian said:


> As mentioned the server was in use as a backup server for 6 months for nightly backups at RAID10 configuration
> 
> ....
> 
> You have decided to run RAID6.


The server was configured with a RAID50, not a RAID10.  So a parity type array with these drives was nothing new.


----------



## Munzy (Oct 17, 2014)

http://wdc.custhelp.com/app/answers/detail/a_id/996/~/support-for-wd-desktop-drives-in-a-raid-0-or-raid-1-configuration


You used a setup that isn't acceptable as per WD. It is your own fault for running said setup with RAID6. Red drives are not designed for those types of loads.


----------



## devonblzx (Oct 17, 2014)

Munzy said:


> http://wdc.custhelp.com/app/answers/detail/a_id/996/~/support-for-wd-desktop-drives-in-a-raid-0-or-raid-1-configuration
> 
> 
> You used a setup that isn't acceptable as per WD. It is your own fault for running said setup with RAID6. Red drives are not designed for those types of loads.


That's more about TLER than anything, if you read the details.  They suggest using RE because of TLER and RAFF and are upselling those products for RAID, but it all depends on the configuration. It has nothing to do with RAID6.  TLER is the timing fix for hardware RAID cards and RAFF is more extensive vibration protection, but I know from experience that Blacks run fine in RAID environments without TLER and RAFF, as long as you use a card that doesn't have the low timeout or you use software RAID.  Reds are also balanced with extra precision before they leave the factory to help with vibration, even though they lack RAFF.

I think this thread has gotten a little off topic from the original backplane question.  The same hard drives have been running with no additional failures in two other systems so I highly doubt the RAID is the contributing factor.


----------



## pcan (Oct 18, 2014)

A high disk failure rate could be caused by excessive temperatures, either now or in the past. Are all the fans working? Did you check the BIOS warning log to see if the server has overheated in the past?
By the way, last year I received a full box of WD Red 3 TB drives, and 30% of them failed within 2 months. I suspect mishandling during shipping. This may have happened to your replacement disks.


----------



## fileMEDIA (Oct 18, 2014)

Stay away from adding non-HP disks to HP servers. Other drives can cause issues like fans running at 100% or failures being detected on hard drives which do not have issues. The HP backplane and RAID controller also cannot read the temperature correctly from non-HP drives, so temperature control doesn't work correctly and can kill drives. If you use HP, as we do, choose only disks which are verified for your server.


----------



## dcdan (Oct 18, 2014)

Seagate drives do not survive for too long under load/vibration, even their enterprise drives are meh. With WD, you have to be careful as each model is designed to deal with specific type of vibration (4 / 8 / 12 drives in same chassis).

For server type load, hitachi drives work best for the price.


----------



## Munzy (Oct 18, 2014)

You clearly don't understand. You are trying to run a consumer grade hard drive in a large chassis with lots of other drives in a very tough RAID setup. You clearly show that you have little to no understanding of what you are doing. TLER is vital for your setup and is most likely the cause of your "backplane" issues. What is happening is you are having one failed read, this happens with all hard drives, and the drive is waiting that full 30 seconds to try and get the file before it sends the FAIL alert back to the controller. As such the hard drive to the controller seems like it is dead and throws it into the failed category. Thus your failing drives.
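The timeout mechanism described here can be sketched as a simple comparison between how long a drive keeps retrying a bad sector and how long the host is willing to wait. This is only a toy model: every number below is an illustrative assumption, not a measurement from this server or a documented value for any particular controller.

```python
def dropped_by_host(drive_recovery_s, host_timeout_s):
    """True when the drive is still retrying a bad sector after the host gives up."""
    return drive_recovery_s > host_timeout_s


# Illustrative numbers only (assumptions for the sake of the example).
TLER_LIMIT_S = 7        # commonly cited cap on recovery time with TLER enabled
NO_TLER_RETRY_S = 60    # hypothetical deep-recovery time on a drive without TLER
STRICT_HOST_S = 8       # hypothetical hardware RAID controller timeout
LENIENT_HOST_S = 120    # hypothetical host configured with a long timeout

if __name__ == "__main__":
    for label, recovery in (("TLER", TLER_LIMIT_S), ("no TLER", NO_TLER_RETRY_S)):
        for host, timeout in (("strict", STRICT_HOST_S), ("lenient", LENIENT_HOST_S)):
            dropped = dropped_by_host(recovery, timeout)
            print(f"{label:8} drive, {host:7} host: dropped={dropped}")
```

In this model only the drive without TLER behind the strict host gets dropped. A drive dropped this way can still be good, as in the Toshiba case mentioned earlier in the thread.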


----------



## markjcc (Oct 18, 2014)

dcdan said:


> Seagate drives do not survive for too long under load/vibration, even their enterprise drives are meh. With WD, you have to be careful as each model is designed to deal with specific type of vibration (4 / 8 / 12 drives in same chassis).
> 
> For server type load, hitachi drives work best for the price.


This. I've always had problems with Seagate-branded HDDs in my home NAS; for me they're bad luck...

Western Digital is fine for me and hasn't failed me yet. I've got 2x WD Greens right now configured in SW RAID 1 (mirror).

They're refurbs, but they haven't failed me yet.


----------



## devonblzx (Oct 18, 2014)

Munzy said:


> You clearly don't understand. You are trying to run a consumer grade hard drive in a large chassis with lots of other drives in a very tough RAID setup. You clearly show that you have little to no understanding of what you are doing. TLER is vital for your setup and is most likely the cause of your "backplane" issues. What is happening is you are having one failed read, this happens with all hard drives, and the drive is waiting that full 30 seconds to try and get the file before it sends the FAIL alert back to the controller. As such the hard drive to the controller seems like it is dead and throws it into the failed category. Thus your failing drives.


Did you miss the part where I said I have been running a software RAID and the drives are failing SMART tests?  Your reply only applies to hardware RAID, and the drives wouldn't fail SMART tests in that situation.  Saying I have no idea what I'm doing and just repeating generic information that doesn't apply to my situation does not validate your messages.  Western Digital Reds also include TLER, but I do not need that feature in a software RAID.  Lastly, I bought a server that was already running these drives in a RAID50, so a RAID6 isn't a big change.  A RAID6 is usually more reliable because you can lose any 2 drives, as opposed to a RAID50, which will fail if you lose 2 drives in the same RAID5 set, which can be common during a rebuild.
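The two-drive failure comparison can be checked by brute force. The sketch below enumerates every possible pair of failed drives in a 12-drive array and counts how many pairs lose the array. It assumes failures are equally likely on every drive and independent, which ignores rebuild stress and bad batches, and it assumes the RAID50 was built from two 6-drive RAID5 sets, which the thread does not state. Function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def redundancy_groups(layout, drives, raid5_sets=2):
    """Return (groups of drive indexes, failures each group can survive)."""
    if layout == "raid10":
        size, tolerance = 2, 1                     # mirror pairs
    elif layout == "raid50":
        size, tolerance = drives // raid5_sets, 1  # striped RAID5 sets
    elif layout == "raid6":
        size, tolerance = drives, 2                # one group, double parity
    else:
        raise ValueError(f"unknown layout: {layout}")
    groups = [set(range(start, start + size)) for start in range(0, drives, size)]
    return groups, tolerance


def loss_probability(layout, drives, failures=2, raid5_sets=2):
    """Fraction of equally likely failure combinations that lose the array."""
    groups, tolerance = redundancy_groups(layout, drives, raid5_sets)
    combos = list(combinations(range(drives), failures))
    lost = sum(
        1 for failed in combos
        if any(len(group & set(failed)) > tolerance for group in groups)
    )
    return Fraction(lost, len(combos))


if __name__ == "__main__":
    for layout in ("raid10", "raid50", "raid6"):
        print(layout, "array lost with 2 failed drives:", loss_probability(layout, 12))
```

Under these assumptions, two failed drives lose a 12-drive RAID10 in 1 of 11 cases and the two-set RAID50 in 5 of 11 cases, while RAID6 survives every pair.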

To others following the thread:

I have seen high failure rates with Seagate Barracudas in PCs, but I usually only use WD drives in servers.  Most commonly Black or RE depending on the need for TLER, but I started using Red in these servers because they only use ~4W and speed wasn't a primary concern.

Seeing as the UDMA errors seem to have stopped, I'm thinking the backplane is okay.  Heat is not an issue, all the drives are ~25-30 degrees in SMART readings, but maybe vibration is becoming an issue.

After monitoring the sensors, it seems the chassis fans of the DL180 G5 and G6 have a bug where they run at 100% constantly when a non-HP expansion card is used.  Unfortunately, there is no way to adjust the fan settings in these models.  My number one thought right now is that the hard drive dropping issue was resolved by replacing the RAID card and power supply, but the fans could now be causing excess vibration.  Since I'm moving to low power drives, I think the fans could be replaced and some inserts could be used to reduce vibration.  I'll update this if I discover anything else.


----------



## devonblzx (Oct 19, 2014)

dcdan said:


> For server type load, hitachi drives work best for the price.


I usually use WD RE or Black for servers (depending on if there is a TLER requirement).  I have read good things about Hitachi Enterprise drives but have never used them personally.  I know WD owns them now so I'm sure quality standards are similar between those and the REs.

The reason I had Seagate SV35s is because that is what the server came with.  I had expected this to be an easy server to set up and go with, since it was already set up as a 12-drive RAID50.

The reason I have used Reds is because of the lower power.  Reds save about 6W per drive compared to REs, so they are greener (which is how I like to orient my services when possible), produce less heat, and give a power savings of ~$100/year per system.  I haven't seen any other high capacity drives that match that type of power usage, and the benefits of their 3D Active Balance and 3 year warranty made the Reds pretty attractive.  I know they have the new Red Pros now that are recommended for larger drive systems, but they are 7200rpm and use an additional 4-5W per drive, so I see that drive as a remarketed SE more than anything.
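The power figures can be sanity-checked with a little arithmetic. The sketch below assumes a fully populated 12-bay chassis running around the clock (the current drive count is an assumption) and backs out the electricity price that ~$100/year would imply rather than assuming one. Function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def yearly_kwh_saved(watts_saved_per_drive, drives):
    """Energy saved per year, in kWh, for drives that run continuously."""
    return watts_saved_per_drive * drives * HOURS_PER_YEAR / 1000


def implied_price_per_kwh(yearly_savings_usd, watts_saved_per_drive, drives):
    """Electricity price (USD per kWh) at which the quoted savings would hold."""
    return yearly_savings_usd / yearly_kwh_saved(watts_saved_per_drive, drives)


if __name__ == "__main__":
    kwh = yearly_kwh_saved(6, 12)            # 6W per drive, 12 drives (assumed)
    price = implied_price_per_kwh(100, 6, 12)
    print(f"{kwh:.2f} kWh saved per year")
    print(f"${price:.3f} per kWh implied by ~$100/year")
```

With 12 drives this comes to about 631 kWh per year, so the ~$100 figure corresponds to roughly $0.16 per kWh before counting any reduced cooling load.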

I've had pretty good success with the Reds on the other systems. I had a couple of DOAs, but those were easily replaced.  I may have just gotten a bad batch with this latest order, or it could be something in the system causing them to fail faster, which is why I opened this thread.  I have talked to others using Reds in large SANs and they seem to have good success as well.  As long as the drive is good from the start, they seem to be pretty reliable.  I had talked to another admin who had a 16-drive RAID60 with Reds, so large capacity systems aren't out of the question.


----------



## fileMEDIA (Oct 19, 2014)

devonblzx said:


> After monitoring the sensors, it seems the chassis fans of the DL180 G5 and G6 have a bug where they run at 100% constantly when a non-HP expansion card is used.  Unfortunately, there is no way to adjust the fan settings in these models.  My number one thought right now is that the hard drive dropping issue was resolved by replacing the RAID card and power supply, but the fans could now be causing excess vibration.  Since I'm moving to low power drives, I think the fans could be replaced and some inserts could be used to reduce vibration.  I'll update this if I discover anything else.


As I wrote above, non-HP disks create many troubles in HP servers because the RAID controller and backplane cannot read data like the temperature from the disk. They only work properly with HP disks. Fans are managed by the system controller and cannot be reduced or controlled manually. There are several threads about this issue in the HP forum.

Which RAID controller do you use?


----------

