zachruss92 5 years ago

I really appreciate BackBlaze opening up this data. While I don't purchase HDDs often, I always refer to these reports when deciding.

They also open sourced their server chassis which is awesome!

It's nice to see a reduction in failure rates as a whole. It looks like the next few years will be an interesting time for the growth of HDD storage capacities.

  • atYevP 5 years ago

    Yev from Backblaze here -> Glad you're enjoying the stats!

    • ericd 5 years ago

      These reports were super helpful when we were choosing how to kit out our servers (we went all HGST as a result). It's pretty hard to find good reliability data otherwise. So, a really big thanks from us.

    • HankB99 5 years ago

      Please thank those responsible on my behalf (and all of the others who study the numbers before purchasing drives.)

      • atYevP 5 years ago

        I'll let Andy know - he might be around here somewhere :D

fpgaminer 5 years ago

Tangentially related. Whenever I get a new drive, I always do a "burn-in". A program writes data to the whole drive and then reads it back (reproducible random data).

Is there any real justification for doing this kind of test on a new drive?

Doing it takes quite a while, so I've been wondering lately if it's even worth it. I've never found anything with it.
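
For concreteness, a rough shell equivalent of what I mean (just a sketch, assuming openssl and GNU coreutils are available; /dev/sdX is a placeholder for the target drive, and this wipes it):

    $ # write a reproducible pseudorandom stream (an AES-CTR keystream over /dev/zero) to the whole drive
    $ openssl enc -aes-256-ctr -pass pass:burnin -nosalt < /dev/zero 2>/dev/null | sudo dd of=/dev/sdX bs=1M status=progress
    $ # regenerate the identical stream and compare; "cmp: EOF on /dev/sdX" at the very end is expected,
    $ # but any earlier difference means the drive returned bad data
    $ openssl enc -aes-256-ctr -pass pass:burnin -nosalt < /dev/zero 2>/dev/null | sudo cmp - /dev/sdX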

  • bluedino 5 years ago

    In one of their first blog posts about the pods they stated that they saw 'high infant mortality' in drives so they would burn-in each pod for a few days before writing any customer data to one.

    • fpgaminer 5 years ago

      Thank you. Found the post: https://www.backblaze.com/blog/alas-poor-stephen-is-dead/

      They have another good idea in there: recording the SMART data before and after the test. Their theory is that even if SMART attributes are still OK after the test, looking for a trend in the attributes may point to eventual failure later. Nice.
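
      A minimal way to do that before/after comparison with smartmontools (a sketch; /dev/sdX is a placeholder, and run whatever burn-in you prefer in between):

          $ sudo smartctl -A /dev/sdX > smart_before.txt    # SMART attributes before the burn-in
          $ # ... run the burn-in of your choice here ...
          $ sudo smartctl -A /dev/sdX > smart_after.txt
          $ diff smart_before.txt smart_after.txt           # look for rising Reallocated_Sector_Ct / Current_Pending_Sector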

  • StrangeDoctor 5 years ago

    I do this with SD cards, I've found fakes with regards to speed and capacity. But I use ZFS for drives and just trust it to do its thing.

    • cerberusss 5 years ago

      Yup, I do this with SD cards as well. To do this on the commandline, I just fill it up with big gigabyte-sized files. The loop uses "seq 1 16"; at least 15 files should fit on a 16 GB stick.

          $ cd /Volumes/your_new_SD_card
          $ dd if=/dev/urandom of=test.bin bs=10000000 count=100    # ~1 GB of random data
          $ for i in $(seq 1 16); do cp test.bin test$i.bin; done   # fill the card with copies
      
      For good measure, you could run md5/md5sum on the files.
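
      For example (every line should show the same hash; on macOS it's md5 rather than md5sum):

          $ md5sum test*.bin
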
  • conbandit 5 years ago

    If you've never found anything with it, why do you keep doing it (see: the definition of insanity)?

    • cataflam 5 years ago

      Not the parent, but probably because:

      1. It's well known that hard drives have a higher failure rate at the beginning of their lives than in the middle (see the bathtub curve [0]). So it's not absurd to test them hard early on, before writing any useful data, and RMA them early if they fail.

      2. The failure rate on drives is low enough that his methodology may be sound and yet he never personally sees a failure. That doesn't make it insane.

      [0] https://en.wikipedia.org/wiki/Bathtub_curve

      • philliphaydon 5 years ago

        I once bought a new 1TB drive when they were fairly new. Moved about 500GB of data to the new drive. Checked it. Seemed fine. Turned the computer off and went to bed.

        Next day the HDD didn’t turn on. Completely dead. :( I’ve never had a failure since but I backup everything now.

      • kbutler 5 years ago

        It would depend on the effort to do the methodology vs. the expected return (savings of finding a failed drive times the probability).

    • Sohcahtoa82 5 years ago

      This is a very strange invocation of the definition of insanity.

      Apparently, quality assurance should never be a thing.

    • e40 5 years ago

      I have. Last time I bought a bunch of drives for a RAID array, I used the WD utility to test each of them. 1 of the 6 failed the test.

      • hinkley 5 years ago

        The first time I learned about burn-in, it was because the drives for our RAID array showed up three days before we were supposed to start experimenting. Two of the ten were DOA. FedEx got us replacements in two days. One of those was dead.

        So then I started doing math on MTBF and lots of drives, and things looked bleak. Just a couple of years later MTBF had gone way up and continued to climb for a while after, but at that particular moment it was something like eight months between failures if you had any more drives than we did, and it just accelerated from there.

      • eps 5 years ago

        "Failed" as in "with an IO error" as in "got the wrong data back"?

        Because if it's the latter, it could've been caused by some other part of the IO stack and not necessarily the drive itself.

        • kiwijamo 5 years ago

          I believe the WD utility sends a command to the drive so it does a self-test. That would eliminate any upper-layer issues.

    • jonhohle 5 years ago

         insanity | inˈsanədē |
         noun
         the state of being seriously mentally ill; madness
      
      I'm not sure when the definition of obsessive-compulsive started being used to describe insanity (it's been going on for at least a decade), but I don't like it (and cringe every time I hear someone repeat it).

      • jrace 5 years ago

        Me too. One could argue that practicing anything is "doing the same thing over and over expecting different (getting better) results".

        I too have switched to HGST thanks to the Backblaze stats.

    • icelancer 5 years ago

      Why do you back up drives if you've never had a failure?

      • ridgeguy 5 years ago

        For the same reason you buy fire insurance even though your home has never burned down.

sigi45 5 years ago

I wonder if any of those companies talk to Backblaze about it. Like sending drives back for inspection :)

I'm also curious why those companies wouldn't talk to Backblaze directly. I read a blog post somewhere about how they bought specific drives online during a sale.

  • atYevP 5 years ago

    Yev from Backblaze here ->

    > I read somewhere a blog post on how they bought specific drives online at a sale.

    Yea, we used to buy drives wherever we could, but that was years ago. We're larger now so we go through more established channels.

    • hinkley 5 years ago

      Also no natural disasters taking out the manufacturers again, right?

      • atYevP 5 years ago

        Can't really count on those NOT happening - but we are more prepared now ;)

        • hinkley 5 years ago

          Yeah I mean that was the impetus for buying them off the shelves, wasn’t it? It was a crisis management technique not a business plan :)

          • atYevP 5 years ago

            Yes, that's right. It was one of those "better think quick" scenarios and we did what we had to in order to stay in business!

      • sigi45 5 years ago

        What a bad time :| If I remember correctly, it took over a year for prices to recover after the disaster.

b3lvedere 5 years ago

Thank you Backblaze! I love your reports.

What is your procedure/policy on which disks to use in the pods? Do you maybe try to control the risk by using different hard disk brands in a single storage pod? Or do you just not care, because there have never been 3 pods dead at the same time? :)

Do you still use 17 data plus 3 parity shards?

  • evil-olive 5 years ago

    Their Q3 2018 stats had a bit of info on the lifecycle of introducing new disks:

    https://www.backblaze.com/blog/2018-hard-drive-failure-rates...

    > In Q3 we added 79 HGST 12TB drives (model: HUH721212ALN604) to the farm. While 79 may seem like an unusual number of drives to add, it represents “stage 2” of our drive testing process. Stage 1 uses 20 drives, the number of hard drives in one Backblaze Vault tome. That is, there are 20 Storage Pods in a Backblaze Vault, and there is one “test” drive in each Storage Pod. This allows us to compare the performance, etc., of the test tome to the remaining 59 production tomes (which are running already-qualified drives). There are 60 tomes in each Backblaze Vault. In stage 2, we fill an entire Storage Pod with the test drives, adding 59 test drives to the one currently being tested in one of the 20 Storage Pods in a Backblaze Vault.

    • b3lvedere 5 years ago

      Thank you for the info! Much appreciated.

peterwwillis 5 years ago

I'm interested in failure rate per IOPS. If a drive fails infrequently, great, but if it's also the worst-performing drive, screw that. I'd rather buy drives that perform as well as possible with the lowest failure rate.

  • walrus01 5 years ago

    In my recent experience there is not a lot of speed difference anymore between different manufacturers' 6TB to 12TB hard drives, when comparing two competing products in the same RPM class (5400 or 7200) and areal density, assuming a similarly sized RAM cache on the drive and not something like a hybrid drive with 64GB of SSD cache.

    • peterwwillis 5 years ago

      I mean more like benchmarked performance. Two drives with the same specs may end up performing differently. I realize benchmarks are not entirely realistic and tuning can affect the outcome, but if there's a clearly outsized performance difference between two seemingly equivalent products, I want the one that performs better and fails the least. So, aggregate random read and write operations per second over failure rate, as a general spec. (This might also expose flaws in the stats, if one drive model is getting predominantly more of a certain operation which results in more failures.)

    • loeg 5 years ago

      To put it another way, everyone's pushing up against the same physical limits of spinning rust.

krob 5 years ago

HGST looks like the best, but they don't have the quantities of the Seagates, which makes me wonder if these numbers are skewed :/

  • linsomniac 5 years ago

    My experience, gathered over a couple of decades but now coming up on a decade old, was that IBM/Hitachi/HGST are definitely worth it. I used to run a small hosting business, which had hundreds of discs, and a consulting business whose clients had hundreds more.

    Seagates, in general, could always be expected to fail in a 3 year span. Often multiple times. IBM/Hitachi/HGST, especially the UltraStar enterprise line, would basically never fail. Over 20 years and hundreds of discs concurrently running (switching out as opportunity arose), we probably RMAed single digits of HGST drives.

    In comparison, while I had a few pockets of Seagate drives that didn't fail (I had 6 in a storage server at home that never gave me problems), we could generally expect 5% of Seagate drives that we had running to fail in a given year. Enterprise or regular didn't really seem to matter.

    For us, replacing a drive was fairly expensive, using a limited resource (our time). So we gravitated to HGST almost exclusively.

    But: We also did an extensive burn-in process before a drive went into production. Basically: "badblocks -svw" for a week. We noticed that we had some drives that would fall out of RAID arrays but, if we ran badblocks on them, would never report an error. My theory was that there were some marginal sectors that would bit-rot. Running a week of badblocks would exercise those and allow bad-block remapping to remove them from use.
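
    For anyone who wants to try the same thing, the invocation is roughly the one below (destructive; /dev/sdX is a placeholder for the drive under test, and the -b 4096 block size is just what I'd suggest for large modern drives):

        $ badblocks -svw -b 4096 /dev/sdX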

    Remember the IBM "Deathstar" 75GXP? We even had good luck with those. I had one of them start freaking out, and I was aware of the "Deathstar" name, so I went to replace it with another drive I had on hand. When I pulled it out to replace it I realized it was HOT. Not in the grand scheme of things, but definitely hot to the touch. I looked up the temp specs and it was clearly above that. The box it was in had 2 5-inch bays that didn't have the covers on them. I covered those bays, turned the machine back on after re-installing the Deathstar, and the drive continued operating for another 3-5 years with no problem. Made me wonder if improper cooling was the source of those reports.

    • HillaryBriss 5 years ago

      > improper cooling

      You've had far more experience with this stuff than I have. I wonder what your view is on all these NAS RAID boxes and HDD longevity and heat.

      I personally never saw so many HDD failures until I started running them inside a small NAS I bought some years ago. I nicknamed it my "drive killer" because I was replacing the things so often. Other HDDs I have in other (non-NAS) machines sure seemed to last much longer.

      As for the Deathstar story: I had a similar experience with a Seagate. Inside the NAS, one of the SMART error counts started to rise in a foreboding way, so I replaced it (with an HGST) before it had a chance to die. Then, out of curiosity, I tried the old Seagate in a different non-critical setting (non-NAS) and it's continued to work for two more years. IDK. Too much heat?

      • saltcured 5 years ago

        It's not impossible to have a drive appear unreliable in one machine and reliable in another (if you physically move it). There could be something electrically or mechanically wrong with the machine which causes a particular drive to malfunction more frequently.

        If you see this repeatedly with one machine, I would definitely consider that it is the machine which is bad rather than the drives. A power supply may be insufficient to support the peak requirements of the drives, for example. A poor mounting structure might resonate or transfer vibration or shock into one or more drive slots from the environment.

        • lathiat 5 years ago

          Vibration can be problematic in multi-drive chassis. NAS-rated drives are apparently designed specifically to deal better with vibration, and of course the enclosure itself matters too: some NASes isolate drives from each other far better than others.

          It might seem minor, but just watch this now rather old but infamous video of Brendan Gregg shouting at HDDs and causing significant drops in throughput: https://www.youtube.com/watch?v=tDacjrSCeq4

          • HillaryBriss 5 years ago

            fun video! it doesn't take much vibration. (or he's just an incredibly loud screamer.)

        • HillaryBriss 5 years ago

          thanks for that interesting note. i had been thinking only about heat as a cause of trouble in that enclosure (an old Netgear ReadyNas).

      • linsomniac 5 years ago

        As others have mentioned, vibration can be a problem in these environments. But heat is also something you want to deal with, and clean power is another thing to be concerned with.

        In the past I've used a lot of those Supermicro 5x3" drive carriers and they have really good airflow. This was for storage in servers I built for home use. I wrote up something in 2008 about one I built here: www.tummy.com/articles/ultimatestorage2008/

        Faster spinning disks generate more heat and vibrations than slower spinning ones. 15K drives are notorious for needing good environmentals.

        My current storage needs are pretty limited, which probably would make SSD attractive. Or we are getting there at least. My storage server right now has 6TB used, and around 1TB of that is backups. If I could get by with 4x 2TB SSDs, I might consider it, just for longevity sake. I'm using ZFS BTW.

        • flyinghamster 5 years ago

          I built up a homebrew NAS with three of those Supermicro drive bays, and, other than the need to remove/replace screws (and not lose them for empty bays), I've really liked them.

          However, I can't for the life of me find out whether or not those bays will work with 6 Gb/s SATA. I've been considering putting my Ryzen 7 board into the NAS case to consolidate two systems into one, but I'm not going to go to that effort if I have to get new hotswap bays.

        • nathanlv 5 years ago

          Yes. I used to see equipment -- hard drives, video cards, motherboards -- burn out at least once a year, and then I installed a power conditioner on my hardware. That was 4 years ago. Last year, for the first time in 17 years, I upgraded my motherboard by choice rather than replacing it after a failure.

      • XorNot 5 years ago

        AFAIK Google have said they never really find any correlation between SMART errors and actual drive failures.

        • HillaryBriss 5 years ago

          Interesting. Have not heard that before. My anecdotal experience in that NAS enclosure with Seagate drives was that when, IIRC, "Reallocated Sector Count" started to go up week after week, the drive would fail within a few months.

    • kalleboo 5 years ago

      The Deathstar 75GXP issues were [quote Wikipedia] "due to the magnetic coating soon beginning to loosen and sprinkle off from the platters, creating dust in the hard disk array and leading to crashes over large areas of the platters".

      Perhaps heat could have accelerated the process by loosening the material somehow, but it was definitely a design/manufacturing flaw.

  • e40 5 years ago

    Anyone have a good source of HGST drives? On Amazon, it used to be you could only get them from resellers, and some comments complained about getting used drives. I've basically stopped buying HDs from resellers on Amazon.

    EDIT:

    Example:

    THE ONLY WAY YOU CAN VERIFY THAT YOU DID INDEED RECEIVE THE FULL 5 YEAR HGST WARRANTY IS YOU MUST CALL HGST TECHNICAL SERVICE, GIVE THEM THE SERIAL NUMBER AND ASK IF THE DRIVE IS A NON-OEM, OEM OR REMANUFACTURED.

    I have ordered expecting the full 5 year warranty and, after verifying with technical service, have discovered: one with over 18,000 hours (from CrystalDiskInfo), several OEMs (no manufacturer warranty), a remanufactured one, and one with a full 5 year manufacturer warranty. And once in a while I get lucky and receive what was advertised: a full 5 year manufacturer's warranty.

    • superhuzza 5 years ago

      Agreed. I only buy storage for personal use, but I've learned to avoid Amazon entirely. There are just way too many reports of fake or repackaged HDDs, SSDs, SD cards etc.

      I can generally get the same or relatively comparable prices at local computer stores that I trust significantly more than Amazon.

      • penagwin 5 years ago

        Be careful, as there are reports of the same issue at Best Buy (and there will be with any store). With the 8TB easystore (and now the 10TB as well) being so cheap, there have been several reports of shucking the drives only to find a lower-tier drive, or just sand, or whatnot.

        This includes drives that were shrink-wrapped! I've heard Best Buy re-wraps returns and doesn't check them very well.

  • simcop2387 5 years ago

    They're not going to be skewed. HGST disks are usually more expensive, so that limits the quantity that they buy. The seagate disks are usually cheaper but have a slightly higher (except when it's a brand new line) failure rate. When filling out a single server I go with the HGST disks because the premium price and quality means fewer failures, but it's more cost effective to go for lots of seagate disks when you have more redundancy and can eat more failures.

    • level 5 years ago

      I built a server a few years ago, and I determined it was more cost effective for a drive to fail than to use HGST. It would be more inconvenient, but having a drive fail on a home server with only 6 drives didn't seem very likely anyway.

      • jacobolus 5 years ago

        If you have a 2%/year failure rate per drive, then that leaves you with a nearly 12%/year chance that at least 1 of 6 drives will fail each year. Or a 31% chance that at least 1 of 6 drives will fail within 3 years. Or a 52% chance that at least 1 of 6 drives will fail within 6 years.
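
        (The arithmetic, assuming failures are independent at 2% per drive-year:)

            P(at least 1 of 6 fails in a year) = 1 - 0.98^6  ≈ 0.114
            P(at least 1 fails within 3 years) = 1 - 0.98^18 ≈ 0.305
            P(at least 1 fails within 6 years) = 1 - 0.98^36 ≈ 0.517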

        • sangnoir 5 years ago

          The question then is, would it be cheaper to replace that one drive or get the more expensive disks with lower chances of failing?

          • Arn_Thor 5 years ago

            Isn't there a cost to time and convenience too? Buying a new disk takes time, as does rebuilding the RAID. And during that time you are vulnerable to another drive failure which could be disastrous if you only have one drive redundancy, especially during the very intensive rebuilding process.

      • simcop2387 5 years ago

        There's also the cost of the time it takes you to replace the drive, get it replaced under warranty, etc.; that's part of the total cost too. I decided that the (at the time) marginal cost of $20/disk was worth it to likely not have to deal with that.

      • cm2187 5 years ago

        Particularly if it fails within the warranty period.

    • zepearl 5 years ago

      I agree (about Seagate being more cost effective for e.g. a datacenter). As a private user I've generally had bad experiences with Seagate (e.g. years ago I bought eight 2TB drives for two RAID5 arrays; two didn't even spin up and one failed after ~1 year), so since then I've used only HGST and (please, no jinx) so far only one has failed, and it "warned me in advance" with weird noises and terrible performance for about a week.

      I now have a host at hetzner.de with three 4TB Seagate drives, and I ended up building a 3-disk RAID1 instead of a RAID5 because I'm too scared that when one drive fails, another will fail as well during the rebuild while the remaining drives are stressed (a RAID5 would implode in that situation).

    • VectorLock 5 years ago

      This makes me curious about where the cost/benefit point is between these brands of drives. Do they buy enough that the savings from the cheaper, higher-failure-rate drives outweigh the expense of buying the more expensive ones?

      • fludlight 5 years ago

        It's only partially a question of quantity.

        The larger factor is the amount of labor involved in dealing with a bad drive.

        Their (large) operation seems automated so I assume if they have a single drive failure in machine #1234 (which has 50+ drives) they have an automated way of switching off said drive. Then they leave it there until the entire rack is replaced many years later.

        Other operations have to send a human to manually replace every single failed drive as soon as it fails, which is very expensive as a % of the cost of the bare drive.

      • simcop2387 5 years ago

        They buy enough drives that there's another data point that matters for them: availability of the disks. If they need to expand by a few tens of petabytes, they may only be able to get enough disks from one source, regardless of reliability concerns.

        • VectorLock 5 years ago

          That's what surprised me when I learned that they were "shucking" drives from off-the-shelf external drives at some point.

  • jeremy7600 5 years ago

    They show the rates over multiple years, too. Quantities may be a factor in the lower numbers, but over time they've proven reliable as well.

linsomniac 5 years ago

Anything interesting in this one? I've stopped reading them because they all seem to be "Seagates fail kind of a lot but we use them because reasons. HGST doesn't fail a lot, but we also have statistically insignificant numbers of them, so <shrug>."

  • metalliqaz 5 years ago

    "because reasons" is an overly negative way to say "because they offer the best value"

    They can get large numbers of them cheap, and their system is good at detecting and replacing bad drives, so why not use them?

    • linsomniac 5 years ago

      In my head it wasn't negative, it was that there are a lot of reasons that I don't think needed to be gone into. Offering value is one, having systems that are designed to minimize the impact of failures is another. Others that came to mind when I wrote that include: Being able to get them at the quantities they need, having systems that reduce the COST of failures (drive replacement and RMA), cost of RMAing 10 drives is < 10x the cost of RMAing 1 drive, having drives available in the SIZES Backblaze wants. And there are others I could speculate about but don't have as concrete information on (vendor relationships, manufacturer relationships, marketshare, firmware quality/suitability, temperature).

      Maybe it's just my social/business groups, but "because reasons" doesn't have to have a negative meaning. I use it as more a statement of fact: There are reasons for this.

      • metalliqaz 5 years ago

        hmmm, I always thought "because reasons" was sarcastic, as in the person would list reasons but they are all bullshit. Of course, sarcasm can be impossible to discern correctly on Internet forums and social media. I admit I could be completely wrong about this.

        • nathanlv 5 years ago

          Sarcasm tends to depend on context, tone, and delivery. It is particularly difficult to interpret in written commentary such as blogs like this.

  • walrus01 5 years ago

    If I had to guess, the seagates are sufficiently cheap in ridiculous quantities, and Backblaze's multi-pod redundancy/parity and drive monitoring is good enough, that even if they have a 3% failure rate and the HGSTs have a 0.75% failure rate, that it's still an acceptable risk for them.

    If you're an individual person with a 12-drive ZFS RAIDZ2 with hotspare, in a file server in your garage or something, you might prefer to pay the premium to buy HGST, because what if you have a double drive failure during a three week period while you're on vacation out of the country?

    • loeg 5 years ago

      Especially if you're an individual person with a non-raid6 / hot spare system, you might pay a premium for a lower failure rate (standalone drive, RAID1, or RAID5, for example).

  • merlincorey 5 years ago

    > HGST doesn't fail a lot, but we also have statistically insignificant numbers of them, so <shrug>."

    They have over 20 thousand HGST drives now, according to this year's report, which seems significant to me.

    They continue to have lower failure rates than the Seagates.

    • mjevans 5 years ago

      It will be when the drives they just added have had more time in use. The best part of this is that them adding those drives signifies that they were on the market at good prices to add; competition is a good thing for consumers.

  • standardUser 5 years ago

    "Seagates fail kind of a lot..."

    That Seagate 10 TB was the singular standout in terms of reliability. Maybe it is worth checking out.

    • mark-r 5 years ago

      If there's one thing I've learned from seeing these reports over the years, it's that the specific drive model matters more than the brand.

      • wiredfool 5 years ago

        Yeah. There was that one Seagate 3TB a few years back. I think it had a 26% AFR before they yanked all of them.

  • kristofferR 5 years ago

    Your comment is frankly incredibly arrogant - "I can't be bothered to click and skim through the short post, so hurry up and summarize the contents for me."

    • zeroname 5 years ago

      Ten people give their opinions, as requested. One person thinks you're arrogant and tells you so.

      You can't argue with results.

      • yjftsjthsd-h 5 years ago

        What's the old line? If you have a problem with your Linux system, don't post asking how to fix it because you'll get nothing; post saying how Linux sucks because this thing doesn't work and you'll get a dozen replies in an hour.

        Sometimes, personality dynamics cause certain approaches to work better than one might prefer. :)

        • tinus_hn 5 years ago

          It’s the JWZ Lazyweb

  • Sahhaese 5 years ago

    With two categories A and B you don't need the same number in each category to compare whether they come from the same distribution.

    I would be surprised if the number of drives they have is actually "statistically insignificant"* but I haven't crunched the numbers.

    https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test

    * By this I guess you mean that the null hypothesis of "they have the same failure rate" is not rejected.
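
    To make that concrete with completely made-up illustrative numbers: say brand A has 1,000 drives with 20 failures in a year (2.0%) and brand B has 20,000 drives with 500 failures (2.5%). A 2x2 chi-squared test on failed/survived counts handles the unequal group sizes just fine:

        observed:  A: 20 failed / 980 ok        B: 500 failed / 19,500 ok
        pooled failure rate = 520 / 21,000 ≈ 2.48%
        expected:  A: 24.8 failed / 975.2 ok    B: 495.2 failed / 19,504.8 ok
        chi-squared = sum((obs - exp)^2 / exp) ≈ 0.99   (1 degree of freedom, p ≈ 0.32)

    So even with a 20:1 difference in fleet sizes the test is perfectly usable; with these particular made-up numbers it just doesn't reject the null hypothesis.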

  • jsgo 5 years ago

    Haven't read this one yet, but from what I saw in the previous ones, Seagates tended to be okay. It was Western Digital that, surprisingly, had the most issues (which is unfortunate, considering I finally figured out the cheaper method of using Best Buy for WD Red disks).

    Granted, I'm not a hardware guy so my takeaway could've been wrong, but that's what it looked like to me when reading them.

  • SiempreViernes 5 years ago

    No, the opportunity to do a better analysis hasn't gone away.

icelancer 5 years ago

My company uses Backblaze since I think it's a good product, but blogs like this really cemented my choice. I appreciate their attention to detail and publishing data openly.

  • atYevP 5 years ago

    Yev from Backblaze here -> That's awesome to hear! I'm glad you're with us! That's one of the nice side-benefits of this blog and one of the reasons we adopted an "open" policy with the Storage Pods. The first time we published that post it was because folks didn't believe we could store data so inexpensively - so it's nice to hear that we're building some trust along the way!

Rebelgecko 5 years ago

I wonder if they have stats on failure rate based on a drive's manufacturing or installation date? It looks like there's a bit of a bathtub curve, and it would be interesting to see whether that's attributable to individual drives having a tendency to fail quickly (if they're going to fail at all), or to drives being less likely to crap out once their model has been manufactured for a few years.

linux2647 5 years ago

Is there anything similar for SSDs?

akulbe 5 years ago

I realize my comment is a tangent. I'm hoping folks might be understanding and hopefully have some advice.

If you have terabytes to back up, are there still any backup services left that'll let you ship them a drive for a faster initial backup?

  • brianwski 5 years ago

    Disclaimer: I work at Backblaze so I'm biased. :-)

    > If you have terabytes to back up, are there still any backup services left that'll let you ship them a drive for a faster initial backup?

    Backblaze offers a "Backblaze Rapid Ingest Fireball" to allow you to ship us 60 TBytes of data on an appliance.

    https://www.backblaze.com/blog/introducing-backblazes-rapid-...

    If you only have 2 - 10 TBytes, I suggest you get a faster network connection, or carry your laptop to a location (like your workplace, or a library, or your neighbor's house) with a fast connection and just upload it. You might be surprised how easy and fast it is to upload a couple of TBytes nowadays. Using the Backblaze Personal Backup with 30 threads, I can upload about 1 TByte every 12 hours or so. So if you can leave your laptop at your workplace for 4 days you can upload 8 TBytes, then bring your laptop back home for the incrementals.
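
    (For reference, 1 TByte every 12 hours works out to roughly 185 Mbits/sec of sustained upload, so that pace assumes a connection in that general ballpark.)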

    • post_break 5 years ago

      It's been over a year; can you please ask someone at Backblaze to add a single line of code to the snapshots so you can leave a comment? Right now if you make a snapshot, the only info is the date and time: nothing about the data, what's in the snapshot, literally anything about it. Please!

      • brianwski 5 years ago

        Disclaimer: I work at Backblaze.

        > please add ability to comment on snapshots

        Actually, very very quietly as part of the 6.0 release (4 days ago), we now allow you to "name your snapshot".

        While this is not a comment, and you can't change the name LATER (so it's useless for old snapshots), at least going forward you can put up to about 1,000 characters of description in the snapshot name.

      • icelancer 5 years ago

        That is going to be way more than a single line of code, come on man.

    • chx 5 years ago

      I have cable Internet with 20 Mbit/s upload, which I believe is decent (unless you have fiber). That means uploading 8,000,000 megabits (= 1 terabyte) would take 400,000 seconds, about five days, and that's the theoretical maximum. At ten terabytes you are looking at two months...

      • brianwski 5 years ago

        > ten TBytes will take two months

        Backblaze "Best Practices" recommends you get fully backed up within 30 days, but honestly we won't be bothered and won't cut you off even if it takes 6 months to upload your whole backup. As long as you are aware of the exposure for the first two months (when only half your data is backed up), I still think this would be FINE for somebody with 10 TBytes.

        If you do something like this, the only thing you have to know is Backblaze backs up files in "size order". Small files first. So maybe if your digital movies are more replaceable than your pictures, you might be through 5 TBytes of photos in the first month and be "protected enough" to live with?

    • jl6 5 years ago

      Be wary of freeloading the upload bandwidth on a pipe that belongs to someone else.

      • brianwski 5 years ago

        I wouldn't advocate for breaking any laws or stealing anything that doesn't belong to you, but a whole lot of places have "totally unused bandwidth" and it won't hurt them at all financially for you to use 1/4 of their unused capacity.

        Funny story: I'm kind of unusual in that I run an open WiFi hotspot in my home, because I have plenty of unused bandwidth and guests in my house and even neighbors are welcome to borrow some of it anytime. But one of my neighbors one time downloaded illegal content on my WiFi (the exact name of the "True Blood" episode was included, with a date and time), and I got a "cease and desist" letter from the ISP. (sigh) I felt it was pretty rude of my neighbor. I mean, just because I leave my car door open in my driveway doesn't mean it isn't rude of you to use my car as a getaway car in a robbery, you know?

        • viraptor 5 years ago

          > and it won't hurt them at all financially for you to use 1/4 of their unused capacity.

          Unless they have data caps in their contract and you go over them. Or they monitor for anomalies and data leaks, and IT will come asking questions / reporting an incident. Unless you know everything about the setup and have approval, don't do it.

        • jrace 5 years ago

          That is why you must secure your network. Depending on where you live you could be criminally responsible. What if one of your neighbors is sharing child-pornography?

        • superhuzza 5 years ago

          tragedy_of_the_commons.torrent

    • akulbe 5 years ago

      Hi Brian,

      Thank you for reaching out. I love it when company folks will get into the conversation on HN.

      As far as connection, I've got 300/300 from Frontier. It's good, and I can help out my friends. But more questions below...

      I had been using CrashPlan for years. Converted to their business plan when they decided to ditch the Consumer stuff.

      My confidence in their viability/user experience has eroded. I have personally ditched their service. My concern is for some family members who I've also had on my plan. I think they have roughly ~3TB of photos backed up. I want to find them a new home.

      Color me skeptical that folks outside of the public cloud providers are going to be around in 10 years.

      Convince me why Backblaze is a good option to send folks to, and please understand, I'm not trying to be a jerk. I'm just wary after having "the CrashPlan experience." Know what I mean?

      Thanks.

      • brianwski 5 years ago

        Disclaimer: I work at Backblaze so you should always view my answers skeptically. :-)

        > Color me skeptical that folks outside of the public cloud providers are going to be around in 10 years.

        Backblaze is now 12 years old, and we're actually kind of unique in that we have never raised any significant VC funding and we're (slightly) profitable. We have run the business entirely as a business (not on unsustainable VC dollars), and we aren't planning on going anywhere. Backblaze is employee owned and run, the only voting board members are the original five founders. We have a couple of "board observers" from outside the company for "adult supervision and experienced advice", but they cannot control us, and they cannot even vote.

        Side Note: The Backblaze founders and a good portion of the staff all came from the same previous startup/company, and in that case the VCs forced us to sell it, which murdered it. The whole reason we self funded Backblaze and ran a sustainable (profitable) business was a reaction to how horrible that situation was.

        Backblaze currently has 803 PBytes of storage in our three datacenters, business is really going well, and we are growing quite healthily. We also understand (and talk about among ourselves) the large responsibility here, which is that, realistically, that amount of data cannot ever be moved. If Backblaze decided on a whim to shut down, we would seriously, SERIOUSLY hurt or even manage to destroy thousands of businesses which depend on us. So we are not going to do that.

      • rolleiflex 5 years ago

        Same here, another burned ex-CrashPlan customer. My setup was a few TBs of personal data, plus a few GBs each for family members on 7-8 different computers (I was on the family plan). Not surprisingly, it's a lot easier to find a replacement for my own data than for the family's, and they are at higher risk. Per-computer pricing is a terrible deal for that. I'd love a plan with a quota of X GB per month across N computers that I can distribute as I wish.

        Though what I probably want is just S3... Maybe I should just build it and sell that.

    • Arn_Thor 5 years ago

      Thanks for weighing in. I wish it were that fast. I'm a new customer (formerly CrashPlan) uploading about 7TB, mostly photos and videos. I've got a 500Mbps up/down line (in theory and in practice), but my daily uploading to Backblaze has never surpassed 30Mbps. I've tried low, medium, and max thread counts.

      In such a case, could the HDD itself be a bottleneck as it's trying to slice and dice a lot of files large and small? It's running at 70-80% active time.

      Probably doesn't help that I'm located abroad either..

      • brianwski 5 years ago

        > large files sliced into small chunks

        Yes. Backblaze makes a copy locally of all files larger than 30 MBytes, broken into 10 MByte chunks. They’re stored on your “Temporary Scratch Disk” (which you can specify). One hint would be to put your temporary scratch disk on a fast SSD.

        Personally I can get over 150 Mbits/sec upload, but I am on a PCIe SSD and have excellent latency to the datacenter. The worst case is a 5400 RPM drive powered only by USB, located in New Zealand. They would have trouble hitting even 20 Mbits/sec using the newest 6.0 client with the max of 30 threads.

        • Arn_Thor 5 years ago

          Because Backblaze doesn't allow program files, I'm forced to upload a backup image of my C drive, and since the scratch disk needs to be bigger than the biggest file, I don't have an SSD big enough. But not to worry, I'll be up to speed in a few weeks.

    • AnIdiotOnTheNet 5 years ago

      > You might be surprised how easy and fast it is to upload a couple of TBytes nowadays.

      You might be surprised how slow it is in many areas of the US, let alone the world.

      • brianwski 5 years ago

        > You might be surprised how slow it is in many areas of the US, let alone the world.

        I know there are digital deserts. (I just made that name up.)

        But one thing I'm curious about -> often when the only CONSUMER ISP in an area (like your DSL company) is slow, there are companies quietly operating in that area with gigabits of connectivity. If you really live out in a remote area of Colorado this may be 30 miles away from your current location, but I would LOVE to see a real "heat map" of high speed connections in the USA.

        I think an absolutely killer feature for a company like Kinkos would be a "rent a super fast internet connection for a couple of days" service, so you could drive someplace, download a movie or upload your backup, then come home.

        The Backblaze office in San Mateo, California originally only had a very sad DSL line available for consumers. We requested a 10 Gbit/sec symmetric fiber connection to our office, and it took a VERY ANNOYING 3 month wait, but eventually a commercial provider (AboveNet or whatever it is called now) brought it to us as long as we signed a 3 year contract. Totally without asking, AboveNet put in fiber where we can light up a single strand at up to 100 Gbits, and they put 40 fiber strands into our office!! This costs $1,500/month so it's not really viable for an individual home, but maybe for a Kinkos or a local ISP or shared among 10 - 100 houses this may be viable.

    • trumped 5 years ago

      But don't most ISPs have data caps nowadays? Comcast has 1TB.

      • brianwski 5 years ago

        > Comcast has 1TB/month data cap

        I have three suggestions, but I have only tried the second suggestion, so please do your own research before my bad advice costs you a lot of money. :-)

        1) Comcast (at least in many places) allows you to exceed your bandwidth cap for two months before clamping down on you. I think they are trying to prevent serious long term abuse, not a one time overage. So if you have 3 TBytes and can get it uploaded in one or two months, just do it, apologize, and it won't cost you anything. Backblaze only does "incrementals" after the initial upload.

        2) Personally I have Comcast and I pay them an extra $30 or so per month for "unlimited" (remove the cap). Now when I look at my usage, my family stays just under the 1 TByte limit ANYWAY, so this is wasted money, but I don't want to stress about it, and I run like 5 Nestcams CONSTANTLY streaming video, plus my family loves Netflix, so I just drop the $30 and relax. So you could call up Comcast and change over to "unlimited" if you can afford it.

        3) A modification of #2 that I have NOT TRIED is to raise it to "unlimited" for the duration of the initial backup. I don't know if you have to commit to a year of unlimited bandwidth, or six months, or if you can change at any time?

      • Semaphor 5 years ago

        No. Not sure how it is in the US though.

        • kiwijamo 5 years ago

          Unlimited fibre is the norm here in New Zealand.

  • ajford 5 years ago

    AWS has the Snowball program. They ship you what is essentially a network storage device with 10G Ethernet, you upload data over your local 10G network using their protocols, then ship it back and they ingest it into S3/Glacier.

    Haven't used it personally, but did a fair amount of digging ~3 years ago. It was a potential backup solution, but we ended up shipping a 4U server full of 4TB drives to a partner for off-site backup, since they needed routine access anyways.

    Check out the AWS page: https://aws.amazon.com/snowball/

mikece 5 years ago

While I've heard lots of great things about the Backblaze reports, I've noticed in comments on Newegg, Amazon, etc. that the SKUs mentioned in the report frequently aren't available anymore. I've never had problems with WD Red drives, though I don't purchase them all from the same vendor at the same time, to make sure I get drives from different lots in case of a lot defect.

  • wmf 5 years ago

    By the time you have accurate reliability data on any equipment it's obsolete. Maybe this will change with the slowdown of Moore's Law/Kryder's Law.

humantiy 5 years ago

Curious if anyone knows how they calculate drive days. For example, the first drive on their report (the HGST 4TB) has a count of 50 but total drive days of 23,069. If I multiply the 50 drives by 365 days I get 18,250, so I'm not sure where the extra ~4k days are coming from. Retired drives or something?

  • tzs 5 years ago

    I think it is over the time they have had the drive, not over the reporting interval. It's a measure of the age of the drives. Assuming their drives are operating 24/7, that means that those particular 50 drives have been in service an average of 461 days.

    I'd expect on next year's report, those particular drives will show up as 49 drives with around 42000 drive days, assuming they aren't replaced by then.

    • humantiy 5 years ago

      If that is the case, then wouldn't the Annualized Failure Rate be based on the year's total rather than on drive days, if drive days is total days in service? For example, drive count (50) / drive days (23,236) gives the AFR of 1.58%, which matches their numbers. The drive days figure is more than the total possible for that year.
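
      For what it's worth, my understanding from Backblaze's posts is that they annualize over drive days rather than over a calendar year, roughly:

          AFR = (drive failures / (drive days / 365)) * 100

      With 23,069 drive days that works out to about 1.58% per failure, so the published figure is consistent with that model having had a single failure in the period.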

samstave 5 years ago

BackBlaze spent ~12 million dollars on 12TB Seagate drives (at full retail)

  • metalliqaz 5 years ago

    Good thing they charge me $0.94 a month for my B2 storage. Gotta make that budget from somewhere!

    • brianwski 5 years ago

      Disclaimer: I work at Backblaze.

      > Good thing they charge me $0.94 a month for my B2 storage.

      We thank you for your business! :-) The absolute beauty of Backblaze B2 (or Amazon S3 or Azure) is that we can build a storage system at scale, and sell off all the little pieces of that. You win because you get a fair price on a sliver, and we win because we add up 100,000 customers like you and make about $1 million per year.

      The very best business is where the customer and provider are happy with the relationship.

      • leowoo91 5 years ago

          AWS nerd here. I recently googled you guys and found out you already have 1/4 the bandwidth cost. See you soon for my upcoming project.

        • metalliqaz 5 years ago

          I like my B2. I also like that Backblaze is one of those companies that does one thing and does it very well.

    • atYevP 5 years ago

      Every little bit helps ;)

  • samstave 5 years ago

    EDIT: It may have appeared that I was denigrating them for the spend...

    NOPE

    I was just curious when I saw the # of drives as the largest chunk in this writeup - to see how much that chunk cost them.

alinde 5 years ago

It would be interesting to also have metrics on failures per TB of storage.

  • theandrewbailey 5 years ago

    I'm not sure how that would be useful, since terabytes don't fail. When a drive fails, it's effectively a brick with no terabytes.

    • brianwski 5 years ago

      Disclaimer: I work at Backblaze.

      > When a drive fails, it's effectively a brick with no terabytes.

      Interesting factoid: that isn't always true. What you describe is actually the CLEANEST type of failure, the drive suddenly becomes a brick. We replace the drive and rebuild it from parity.

      A way more interesting failure is when disk blocks start going bad at an unacceptable rate. Backblaze splits your data across 20 different hard drives in 20 different machines in our datacenter. We call the sub-parts "shards"; a shard sits on one disk. Each shard has a SHA-1 checksum, so we know if it has been corrupted. If an individual shard is missing or corrupted, we know it needs to be rebuilt from parity.

      So when a drive is HALF-FAILED, we even have a procedure to pull the drive out, and then opportunistically copy whatever files we can recover onto a new drive, then put the new drive back into production. Any files we recover where they are in the correct filesystem location and their SHA1 says they have not been corrupted speeds up the rebuild.

      The reason the speed of rebuild is important is the whole concept of 11 or 12 "nines" of durability. We can't have more than 3 drives fail in any one group of 20 drives, and the faster the rebuild time, the less likely for 4 simultaneous failures. It plugs into the formulas in this blog post we did about durability: https://www.backblaze.com/blog/cloud-storage-durability/
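
      (That's not our literal tooling, of course, but the per-shard integrity idea is the same as checking ordinary files against a manifest of known-good hashes, e.g. with the standard sha1sum tool and hypothetical shard_* file names:)

          $ sha1sum shard_* > shard_manifest.sha1   # record known-good hashes when the shards are written
          $ sha1sum -c shard_manifest.sha1          # later: any shard that no longer matches needs a rebuild from parity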

      • zepearl 5 years ago

        >>So when a drive is HALF-FAILED, we even have a procedure to pull the drive out, and then opportunistically copy whatever files we can recover onto a new drive, then put the new drive back into production.

        Do I understand correctly that when a drive is half-failed, you don't just say "it will probably completely stop working in the near future" and discard/replace it, but keep using it?

        • brianwski 5 years ago

          If it was half failed we would DEFINITELY pull the drive out because it is already 1 drive down out of 20 for half the files. A lot of times the IT guys will make a judgement call that a drive is acting funny or slightly off so they just "fail it on purpose" which means yank it and replace with a new drive. We have done this just because a drive is "slow" (slow can mean the drive is having trouble writing data reliably on one attempt), or because some SMART stat looks wonky.

          To provide more color, if a 20 drive "tome" (as we call it) is 1 drive down, we don't even wake people up in the middle of the night, but Backblaze datacenter employees replace it when they arrive at the datacenter the next day at 8am. All drives having problems are replaced by 5pm when the employees go home. This is completely business as usual, about 5 - 10 drives fail every day.

          However, if 2 drives fail out of 20 (or 1.5 in our example above), pagers go off, people wake up and get out of bed at 3am and start driving towards the datacenter. Or we employ "remote hands" to swap the drives immediately; it depends on the capabilities of the night crew in the datacenter, which varies by datacenter. "Remote hands" is a contract service where semi-skilled technicians work for the datacenter and we can pay them $80/incident or thereabouts to do things you can only do "in person", like replace drives. All the pods (where data is stored) have baseboard management, which means as long as they are powered up and online we can log in remotely from home or office to figure out what is going on and fix a variety of problems. AUTOMATICALLY if 2 drives fail we stop sending any data into that "tome" of 20 drives. We have found that writing to drives causes more failures, so not writing to them is safer.

          If 3 drives fail, it is instantly a "Red Alert" at Backblaze and a whole lot of official procedures kick in. An "incident manager" is assigned and the whole company's number one concern is to drop EVERYTHING and never sleep again until the Red Alert is lowered to Yellow. We light up a "situation room" (in Slack - our internal chat tool) and information and status is relayed through that.

          SIDE NOTE: Backblaze has a relationship with an excellent company named "DriveSavers" who can recover SOME data off of failed drives. This is very expensive (thousands of dollars per drive) so we only do it to test the procedure and then in extreme situations. Three drives down is an extreme situation and extremely rare, so ALL OF THE THREE FAILED DRIVES would be immediately hand carried to DriveSavers even while we rebuild the customer data from parity. Notice Backblaze STILL has a complete copy of the customer data on 17 drives -> But if a 4th drive dies, the hope is we can recover at least one of the drives via DriveSavers thus saving the customer data. (We need at least 17 out of 20 drives in a "tome" to reconstruct the data.) In our experiments, DriveSavers seems to recover about half the drives, or in some situations half the data from a drive (imagine if 1 platter on a drive has a head crash and is destroyed, but the other platters are fine). We have made the decision that it is less expensive (for the same durability) to pay DriveSavers the thousands of dollars rarely instead of increasing parity to allow reconstructing data from 16 out of 20 drives instead of the current 17 out of 20 drives.

          • zepearl 5 years ago

            Thanks a lot - absolutely interesting/fascinating.

            You guys should think about writing a short eBook about e.g. general recommendations about setups/analysis/projections & stories about past failures/chain-of-events/etc - I might buy it :)

      • zepearl 5 years ago

        (thanks a lot - all extremely interesting)

        >> We can't have more than 3 drives fail in any one group of 20 drives...

        Wow, for me, subjectively, a low threshold - and I understand that each drive being hosted on a different machine also protects you from a machine/controller failure (that happened to me twice with the controller - both times it was very hard to diagnose and the experience in general was terrible).

        Do you have "backups" as well? Or is that in the hands of the customers/users?

        • brianwski 5 years ago

          > Do you have as well "backups"? Or is that in the hands of the customers/users?

          If you store data in Backblaze, there is no "backup" of that data. If Backblaze ever lost 4 drives simultaneously and could not recover the data, the customer would lose data. This is much like Amazon S3.

          In general, we recommend a 3-2-1 backup strategy where there are 3 copies of the data, at least 2 copies on your site, and 1 copy in the cloud. You can read about that philosophy in our blog post here: https://www.backblaze.com/blog/the-3-2-1-backup-strategy/

          • zepearl 5 years ago

            Thank you!

            To summarize I understand: A) the local working copy (locally replicated in your case), B) the local backup and C) the cloud/very remote backup. B & C cover each other if any datacenter is completely wiped out.

            • brianwski 5 years ago

              Correct.

              > if any datacenter is completely wiped out

              Correct. When all of our datacenters were in Sacramento, California, some customers told us they were concerned because they were ALSO in Sacramento, and a meteor could wipe out their computer, the local backup, and Backblaze's cloud backup, all in one meteor strike.

              While by default we put your data where it is convenient for Backblaze, we CAN work with customers (and have done so) to place their data in our Phoenix Arizona datacenter or one of our Sacramento datacenters if it is important. As we add our European region (coming soon) this will become a pull down menu for all customers. For now, we only work with larger customers to make sure the customer data lands in the correct location for them.

              • zepearl 5 years ago

                Thanks, once more, for the nice reply :)

                Interesting about the new European region, for sure, at least from the point of view of "locality". (I assume that from the point of view of "data ownership" the US will still consider itself the "owner" of the data, as the holding/legal entity (I don't know what kind of company it is, but your website mentions San Mateo, US) has its headquarters in the US.)

                • brianwski 5 years ago

                  > US consider itself “owner”

                  Well, I work most days in San Mateo, California, but 15% of the data we store for customers ALREADY comes from the EU, and more from other countries. Backblaze fully complies with all EU laws already, such as collecting VAT and passing that money through to EU countries.

                  Philosophically, we feel the data belongs to the customer, but we comply with all laws in that customer's country. For the Backblaze Personal Backup product this was easiest, since it is encrypted on the customer machine before being sent. For B2 (our object storage product, like Amazon S3) it got much more complicated because for the first time customers can configure it to be a publicly accessible web host, so Backblaze sometimes gets served with takedown notices due to illegal content hosting.

                  We ABSOLUTELY comply with standard procedures the same as Amazon S3 must. Backblaze is not some crazy safe harbor for criminals hosting stolen movies. With that said, if you encrypt the data before it leaves your computer and store it in a private bucket, Backblaze has no possible way to know your file contents and we do not want to know. And we would have no way of handing that over to the US government (or the EU) even if they demanded it.

              • boulos 5 years ago

                Disclosure: I work on Google Cloud.

                Thanks for mentioning this! I’ve always (begrudgingly) had to tell people that while I love Backblaze you have to understand the geographic risk. I always suspected you had a “yeah, we can put you here if we edit this config file” but the drop down will be much better for everyone. Looking forward to it!

                • atYevP 5 years ago

                  Yev from Backblaze here -> We look forward to you hosting GCS in B2 :P Seriously though - it IS something we've done if a customer had legitimate concerns but having it be more streamlined will be a much cleaner process!

    • alinde 5 years ago

      I was thinking that one failure of a 100TB disk has a very different impact than 10 failures of 1TB disks. It'd give some idea of how much data is lost due to failures, no?

      • atYevP 5 years ago

        Yev from Backblaze here -> Not sure if you'd get that metric from that data. We use Reed-Solomon erasure coding (https://www.backblaze.com/blog/reed-solomon/) to make sure that data is "rebuilt" should we lose drives (which happens all the time).

      • oliveshell 5 years ago

        I suppose, but there’s no such thing as a single HDD that stores 100TB. The biggest you can get currently are (I believe) 14TB helium-filled drives.

        • jsgo 5 years ago

          My guess is they meant 10TB as it would be a more "equal" comparison:

          1 10TB drive

          10 1TB drives

          • oliveshell 5 years ago

            That makes way more sense. Didn’t think before posting!

  • klodolph 5 years ago

    I’m not sure that it’s especially useful to measure that way, which is why they wouldn’t report it. The chance that a given GB of data is on a failed disk is equal to the disk failure rate, regardless of disk size (>1GB).

    For large deployments, the concern is between failure rates and the amount of time it takes to rebuild data from a failed disk.

    For small deployments, my main concerns are whether disk failure takes a machine or volume out, causing availability problems.

    I’m trying to figure where failures per GB would be how you would choose, what scenario we’re you thinking of?

chemmail 5 years ago

Reliability looks decent this year. All my Seagates are having tons of errors; I think I'll stick with the other team from now on, even though Seagate seems to be getting better.

Arn_Thor 5 years ago

Odd they don't have the 6TB HGST on the list. Well, maybe not odd, but annoying since I'm curious about that drive's performance especially

GNU_IS_UNIX 5 years ago

Do larger capacity drives have higher failure rates?

  • simcop2387 5 years ago

    It's usually not so much that larger drives are inherently less reliable, but that the larger drives are also the newest lines and so yields and manufacturing issues are more likely to crop up than in the older lines that have had issues sorted out.

  • SiempreViernes 5 years ago

    I don't think so; it's just that they didn't include their older and smaller drives. From a previous report I remember a Seagate (I think) 3TB disk that was abysmal.

  • platters 5 years ago

    Only if you yell at them.

mark-r 5 years ago

A perennial favorite, literally. Thank you.

  • atYevP 5 years ago

    You're welcome!

arcaster 5 years ago

TIL - Seagate drives are still basically shit :)

charliebrownau 5 years ago

Is anyone that has been in IT for 20-30 years really surprised by the number of Seagates that fail?