kimixa 12 days ago

I wish there was a hard requirement for ECC. As a developer working on GPU drivers, there's a huge number of reported issues that just... don't make sense? One-offs with slightly different symptoms, memory dumps of nonsense, just nowhere to start rooting out the cause. Even on "widely reported" issues that make it to reddit and similar.

Probably not surprising: there's a naturally antagonistic relationship between performance and reliability here, and it's clear which way many of those "enthusiast" forums lean.

I haven't got actual numbers, but I feel that most [0] of the issues I start looking at can simply never be reproduced, or don't even make sense from the backtrace or similar. I can't say it's 100% hardware issues, as many games are a little... loose... with reliability if it works "well enough", and they interact heavily with code and data we work on, so they might also be a source of "impossible" issues. But even on straightforward code paths, with no weird OS interaction, no allocation, nothing async etc., "impossible" states happen pretty regularly.

I would love there to be enough ECC-using gamers out there to statistically see if it makes a difference.

[0] Most in terms of number of different issues, not total reports of the same issue. That's dominated by one or two things, normally around the latest game or update doing something dumb :P

  • vlovich123 12 days ago

    Statistically, in my experience, plain old memory corruption bugs within your code or within the GPU itself are a more likely explanation than issues ECC would fix.

    Games have pretty terrible quality because games can remain playable in the face of bugs. GPUs have similar issues because numerical accuracy is an afterthought, but numerical accuracy problems can break code that assumes stability of results.

    Things have gotten better, particularly on the GPU vendor side of things, but still, memory safety issues and UB issues in C/C++ code far dominate as the root cause of any issue you're likely to see vs memory bit flips (unless people are running very overclocked and unstable machines).

    • kimixa 12 days ago

      I'm very much aware of the issues that mistakes around memory can cause, and we use many tools (including different languages) to avoid them or verify code. My point is that even taking that into account, there's a long tail of unexplained issues.

      And UB shouldn't be a scary thing 99% of the time, as you won't be hitting it anyway (or it's actually the result of another real bug, like not handling overflows etc.), though at some level you start relying on platform-specific behavior and start /defining/ it. And again, there are tools and options around that, or for highlighting areas where you might still be relying on it. They just might be toolchain and/or platform specific. It's never been the scary monster some online programming language fanatics make it sound like, but it is something to be aware of and manage.

      And

      > unless people are running very overclocked and unstable machines

      Yes. Have you seen the Gamer Hardware Enthusiast community?

      • Sweepi 11 days ago

        >> unless people are running very overclocked and unstable machines

        > Yes. Have you seen the Gamer Hardware Enthusiast community?

        Having seen it for the last 2 decades: a lot of them are interested in OC, but almost nobody is doing it. Maybe 5%, most likely <2%. Also, most modern motherboards do a "RAM training" session before boot. On unstable machines, this will result in a "test failed" once in a while, with the motherboard showing a "Boot failed, returned every clock setting to default" message, which most likely is not read by the user, who is either not at the PC the moment it happens or in "why is this taking so long? skip, skip, skip" mode.

        What does happen is that motherboards ship with OC-like settings out of the box/by default, which were stable most of the time and de facto accepted by Intel, but this is now hitting issues and diminishing returns:

        https://www.radgametools.com/oodleintel.htm https://www.igorslab.de/en/intel-spielt-mit-dem-namen-und-de...

        • fourfour3 11 days ago

          Something I also see a lot of is poor-quality hardware causing crashes - eg poor-quality power supplies, actually faulty memory, memory that is marginal at the speeds it's sold at (eg fine at the standard JEDEC speeds, but fails at the XMP speeds), but also poor cooling - eg not enough airflow in the chassis paired with a modern GPU.

          I had a PC that was having inexplicable game crashes, but only in stressful games - swapping the power supply from an existing ~10 year old one (which was still rated as enough capacity for my hardware) to a brand new ATX 3.0 unit resolved it.

        • gpderetta 11 days ago

          With XMP, many motherboards effectively overclock RAM out of the box.

      • jorvi 11 days ago

        > Yes. Have you seen the Gamer Hardware Enthusiast community?

        I'd venture 95% of people run stock, 4% do undervolting and 1% do overclocking.

        Which isn't that strange, because GPU vendors have become relatively adept at pushing the silicon near its max. It's why you can run an aggressive undervolt with a few percentage points lower clocks and be rewarded with a 20-30% wattage drop, with a commensurate drop in temperatures and fan noise, all whilst losing perhaps <5% performance.

        But the GPU vendors want to put big number on box so they'll have users suffer loud fans :-)

        • kimixa 11 days ago

          > I'd venture 95% of people run stock, 4% do undervolting and 1% do overclocking.

          As others are pointing out, there's an ongoing issue where it seems some products push it beyond the point of stability by default.

          I think I'm more pointing out the trend that even "informed" purchasers just look at the benchmark charts, as if no other information were required.

          Perhaps things like the mentioned instability making the news will change things going forward? But I'm not holding my breath. If there's still an advantage to pushing it beyond the limit, then manufacturers will do so.

      • worthless-trash 12 days ago

        > And UB shouldn't be a scary thing 99% of the time, as you won't be hitting it anyway (or its actually the result of another real bug,

        This is exactly how exploits are made. Hitting UB is usually the result of "another bug", because the UB isn't a bug by itself.

      • vlovich123 11 days ago

        > And UB shouldn't be a scary thing 99% of the time, as you won't be hitting it anyway (or its actually the result of another real bug, like not handling overflows etc.), though as at some level you start trying on platform specific behavior and start /defining/ them

        I’m talking about language-level UB - that’s not anything you can rely on with any toolchain/platform. And UB will also manifest as random unexplainable crashes in random spots, just like memory corruption will.

        As for tooling, unless you’re using something 100% memory safe with absolutely no calls out to unsafe code (which definitely isn’t a game/GPU scenario), you’re going to have this risk, and it almost doesn’t matter how much you test or use memory checkers, because the long tail of issues is going to be the result of a really difficult-to-reproduce sequence of events. Additionally, if you have any multi-threaded code, all your testing goes out the window: I’ve seen many concurrency bugs hide in plain sight in very public spots until someone figured out a repro (e.g. https://reviews.llvm.org/D114119, https://probablydance.com/2022/09/17/finding-the-second-bug-...). Race conditions are notoriously difficult to reproduce in controlled environments.

        Oh, and when I said you’re safe if you’re in a 100% memory safe language? I lied. There are compiler bugs that could be misgenerating your code, plus you’re likely running unsafe constructs somewhere in code that has access to your memory space (whether the OS kernel, whatever is making syscalls to the kernel on your behalf, something that eventually calls the C library, something that is needed for performance, etc etc etc), not to mention bugs in the compiler/JIT that either allow unsafe/unsound constructs or just straight up miscompile correct code. And finally, there are sources of HW issues unrelated to memory bit flips (your CPU is super complex and has bugs too, you know, as does your memory controller).

        > research has shown that the majority of one-off soft errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries,

        > Recent studies[5] show that single-event upsets due to cosmic radiation have been dropping dramatically with process geometry and previous concerns over increasing bit cell error rates are unfounded.

        So aside from the fact that DRAM susceptibility to cosmic rays has been decreasing (I’m not convinced by the Wikipedia explanation - an alternate explanation could be that critically important data has shrunk as a percentage of DRAM as overall capacity has increased), your argument would be that random cosmic radiation is going to randomly hit the DRAM cell containing your code / critical data. On the order of things that are likely, that’s the last thing.

        Oh, and all this relies on you correctly grouping related crashes, and I’ve generally seen that to be a significant challenge on any project I’ve participated in (e.g. Oculus for the longest time would group unrelated crash reports and fail to group the same ones correctly, although I helped the team try to make progress on that).

        Again, it’s not impossible that it’s a legit HW corruption issue. However, all the engineers I know also frequently blamed cosmic rays, but in the end it’s all just shorthand for “not worth wasting time trying to track down because it’s in the tail of issues you’ll never get to“.

    • asveikau 11 days ago

      Also statistically and in my experience, a low probability event multiplied by a large number of units means millions of people hit hardware failures. It's not at all surprising that someone working on graphics drivers ends up seeing a lot of it.

      • vlovich123 11 days ago

        Yes, and the issues that ECC can correct for are a small fraction of that overall HW failure rate. And SW issues will always still dominate.

        • asveikau 11 days ago

          There are certain niches where the hardware issues will actually dominate the queue of issues and support requests.

          One I can relate to is if you have something with low code churn and extremely logically simple that does many gigabytes of I/O per machine over a short period. The maintainer of that code is going to see tons of bad disks. That was me once.

  • NohatCoder 11 days ago

    I forget the source, but I recall some game dev telling how they had taken to running a basic stability test every time the game launched. Half the crash reports came from systems that failed the test, so those were summarily ignored.

    There is an error correction layer in DDR5; could that be spotted statistically as a difference between Ryzen 5000 and 7000 systems?

    • kevingadd 11 days ago

      We did this in Guild Wars, yeah. I'm sure lots of other games are doing it by now. There were a significant number of Problem Customer PCs that would just crash all the time because of stuff that was obviously a CPU or RAM defect, though I don't know if it was half our crash reports.

      • immibis 11 days ago

        What kind of stability test is quick enough to run every time the game starts and good enough to detect real problems?

        • kevingadd 11 days ago

          I believe it was actually an ambient check of things like floating point arithmetic that ran on an ongoing basis, and if it ever failed we popped up an error message.

          We already shipped with debug assertions enabled in release builds, so it wouldn't have been unprecedented to do things like validate result bytes after doing big memcpy operations or things like that. Not sure if we did, though.
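
          For illustration, here's a minimal sketch of what such an ambient check might look like (a guess at the idea on my part, not Guild Wars' actual code): redo a few operations with known answers and flag any mismatch, since a failure points at the CPU/RAM rather than at the game logic.

            #include <math.h>
            #include <string.h>

            /* Hypothetical ambient sanity check: returns 0 if the
               hardware produced an impossible result. */
            static int hardware_sanity_check(void) {
                volatile double a = 1e10, b = 3.0;  /* volatile defeats constant folding */
                if (a / b != 1e10 / 3.0) return 0;  /* FP unit disagrees with itself */
                volatile double two = 2.0;
                double r = sqrt(two) * sqrt(two);
                if (r < 1.9999 || r > 2.0001) return 0;
                static const char pattern[16] = "0123456789abcde";
                char buf[16];
                memcpy(buf, pattern, sizeof buf);   /* cheap memory round trip */
                return memcmp(buf, pattern, sizeof buf) == 0;
            }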

      • steve_rambo 11 days ago

        What was the difference between CPU and video card vendors (if you can talk about that at all)?

  • Onavo 12 days ago

    And then you have car companies like Tesla that removed ECC memory in newer iterations of their car computers to save money.

    • bayindirh 12 days ago

      At worst, you suddenly veer off the road and roll a couple of times, no biggie.

      It's not a rocket or something dangerous, right?

      Also, newer iterations of the car will be even cheaper due to cut corners, so you may save some money in the process, too.

      • 72deluxe 11 days ago

        I followed a Tesla yesterday. It was clearly on autopilot along this single lane road and it was veering all over the place like a drunk person, sometimes into the verge, sometimes over the double white line (the line you should not cross in the UK) and obviously making micro adjustments to steering with periodic random braking. I don't see how someone could relax as a passenger in that. Shockingly bad.

        • giantg2 11 days ago

          I wonder if the software quality is lower for self-driving in non-US roads. I assume there would be less focus on development and testing in other markets.

          • bayindirh 9 days ago

            I remember reading that Tesla can't recognize lighted or dynamic road signs (which are essentially a matrix of lights that changes according to road conditions) present in parts of Europe, because they don't exist in the US to begin with.

            The same signs were easily detected by a Ford Puma I drove recently on a highway.

            I assume training data for European roads is also of inferior quality, due to the number of miles traveled and the sheer variety of roads, road signs and norms around Europe.

        • pipe01 11 days ago

          The latest FSD version definitely doesn't do that; you can look up videos on YouTube.

          • 72deluxe 11 days ago

            Perhaps it really was a drunk driver then!

  • kevin_nisbet 11 days ago

    Early in my career I had to deal with similar issues, but with cellular networking equipment. I do think it's one of those things where, if you have a way to detect the bit flips, they happen way more than you think... but most people can't detect them and mostly get lucky enough that they never see an impact when they do happen.

    But because we had some equipment that would checksum itself, we were able to see the bit flips happen; probably every 1-3 weeks or so I'd find one.
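
    (For the curious, the usual trick is to periodically re-hash a region that should never change and alarm on any mismatch. A minimal sketch of the idea, with invented names, not that equipment's actual firmware:)

      #include <stddef.h>
      #include <stdint.h>

      /* FNV-1a hash over a region that should never change at runtime. */
      static uint32_t fnv1a(const uint8_t *p, size_t n) {
          uint32_t h = 2166136261u;
          while (n--) { h ^= *p++; h *= 16777619u; }
          return h;
      }

      static const uint8_t config_table[4096] = { 0 };  /* stand-in constant data */
      static uint32_t baseline;

      void checksum_init(void) {  /* hash once at boot */
          baseline = fnv1a(config_table, sizeof config_table);
      }

      int checksum_scan(void) {   /* rerun from a timer; 0 means a bit flipped */
          return fnv1a(config_table, sizeof config_table) == baseline;
      }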

    • immibis 11 days ago

      I literally just built a new workstation with 256GB of DDR5 ECC memory and had two detected bitflips within the first hour. It's likely that something's off with the timings or voltages, then, but... if I didn't have ECC, didn't have an AMD processor that generates MCEs Linux can decode, or didn't have a kernel set to log MCEs, would I know about them? Definitely not.
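
      (As a side note for anyone wanting to check their own machine: on Linux the EDAC subsystem exposes per-memory-controller error counters in sysfs. A quick sketch, assuming the standard EDAC sysfs layout, that prints whether anything has been caught:)

        #include <stdio.h>

        int main(void) {
            /* mc0 is the first memory controller; there may be more. */
            const char *paths[] = {
                "/sys/devices/system/edac/mc/mc0/ce_count", /* corrected   */
                "/sys/devices/system/edac/mc/mc0/ue_count", /* uncorrected */
            };
            for (int i = 0; i < 2; i++) {
                FILE *f = fopen(paths[i], "r");
                long n;
                if (f && fscanf(f, "%ld", &n) == 1)
                    printf("%s = %ld\n", paths[i], n);
                if (f) fclose(f);
            }
            return 0;
        }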

  • fourfour3 11 days ago

    I’ve found that crashing in games using NVIDIA’s DX12 driver is one of the most reliable indicators of CPU/RAM instability in gaming PCs, so this doesn’t surprise me!

    Lots of people’s gaming PCs (self-built and otherwise) are very crashy - there’s a surprising number of people who get regular BSODs and just accept it as “normal”.

    • formerly_proven 11 days ago

      Also see the recent "UE5 is GARBAGE it KEEPS CRASHING on my rock-solid 14900K" episode.

  • kevingadd 11 days ago

    I got ECC on my previous system (at great cost) but had to settle for non-ECC for my current workstation since the cost of Threadripper builds has gone up so much since the 3xxx era. It's really frustrating, even if DDR5 (supposedly) has much lower error rates. Every time a game crashes I'm going to suspect the RAM.

  • a_t48 12 days ago

    Back when I heavily used GCP's GPUs I'd actually get failed tests with ECC errors. It was kind of neat.

jrockway 12 days ago

Not having ECC is the biggest scam in computing. Ever hear of "bitrot"? That's memory errors that have been saved to disk. We have made millions of people lose their data so that servers can be artificially more expensive.

Intel was responsible for most of this. It is hard to be sad seeing how they've lost the market lead.

  • hi-v-rocknroll 11 days ago

    I would wager 99.99% of bitrot is silent corruption that goes unnoticed until it affects something particularly important. Without integrity checking and error correction along every path - through the processor, the storage hierarchy, the network, and at rest - there's no way to prove a system will remain reliable.

    • thfuran 11 days ago

      The network level is usually pretty solid in that regard, with checksumming at the Wi-Fi/Ethernet protocol level and at TCP, but that's the end of it for most systems.

      • immibis 11 days ago

        TCP checksums are very basic. Ethernet has a proper CRC, which at least has certain mathematical guarantees about what it can detect. The TCP checksum fails if the same bit is flipped from 0 to 1 and from 1 to 0 in the same bit position modulo 16 (i.e. the same position within two 16-bit words).

        • renonce 11 days ago

          Modulo 16, really? I thought it was modulo 2^16-1 (the size of the multiplicative group of GF(2^16)), which is much bigger than the typical packet size.

          • immibis 11 days ago

            The TCP checksum is a ones' complement sum of the data taken as 16-bit words (roughly, a sum of all the odd bytes plus a sum of all the even bytes). Adding a power of 2 to one word and subtracting the same power from another will cancel out. I'm not sure what guarantees CRC-32 gives you, but they're better than that.
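
            A quick demonstration of that weakness (a sketch of the RFC 1071 Internet checksum, not a full TCP implementation):

              #include <stddef.h>
              #include <stdint.h>
              #include <stdio.h>

              /* RFC 1071: ones' complement sum of 16-bit words. */
              static uint16_t inet_checksum(const uint16_t *w, size_t n) {
                  uint32_t sum = 0;
                  while (n--) sum += *w++;
                  while (sum >> 16)                 /* fold end-around carries */
                      sum = (sum & 0xFFFF) + (sum >> 16);
                  return (uint16_t)~sum;
              }

              int main(void) {
                  uint16_t data[4] = { 0x1234, 0x0F00, 0xABCD, 0x00FF };
                  printf("before: 0x%04x\n", inet_checksum(data, 4));
                  data[0] ^= 0x0040;  /* bit 6 of word 0 flips 0 -> 1 (+0x40) */
                  data[2] ^= 0x0040;  /* bit 6 of word 2 flips 1 -> 0 (-0x40) */
                  printf("after:  0x%04x (same checksum, corruption missed)\n",
                         inet_checksum(data, 4));
                  return 0;
              }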

  • cornholio 11 days ago

    I wonder if it's practical to do ECC on 64-bit words instead of bytes. A 13% price increase (or capacity drop for the same DRAM price) is substantial and might justify the penny pinching; 1.5% is negligible if it leads to a similar stability increase as standard ECC DRAM. If you are often getting more than one bit flip per 64-bit word, then that RAM is garbage anyway.

    • pclmulqdq 11 days ago

      ECC is done on 8-byte words. The basic form of RAM ECC is called "SECDED" ECC, and is an 8-bit Hamming code that can correct any one error silently and detect any two. It's not just a parity bit.

      The number of ECC bits needed for any error correction level on n bits is proportional to log(n).

      DDR actually operates in bursts of 4 or 8 64-bit words (+8 ECC bits each), so you can make a better ECC using all 32 or 64 check bits to cover the full burst rather than covering each 8 bytes separately.
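
      To make the SECDED idea concrete, here's a toy version over 4 data bits (real DRAM ECC plays the same game over 64 data bits with 8 check bits): a Hamming(7,4) code plus an overall parity bit, which corrects any single flip and detects any double flip. A sketch, not production code:

        #include <stdint.h>
        #include <stdio.h>

        static unsigned parity8(uint8_t x) {   /* 1 if an odd number of bits set */
            x ^= x >> 4; x ^= x >> 2; x ^= x >> 1;
            return x & 1;
        }

        /* Encode 4 data bits. Bits 1..7 hold Hamming(7,4), laid out as
           p1 p2 d1 p4 d2 d3 d4 at positions 1..7; bit 0 is overall parity. */
        static uint8_t secded_encode(uint8_t data) {
            unsigned d1 = data & 1, d2 = (data >> 1) & 1,
                     d3 = (data >> 2) & 1, d4 = (data >> 3) & 1;
            unsigned p1 = d1 ^ d2 ^ d4;   /* covers positions 3,5,7 */
            unsigned p2 = d1 ^ d3 ^ d4;   /* covers positions 3,6,7 */
            unsigned p4 = d2 ^ d3 ^ d4;   /* covers positions 5,6,7 */
            uint8_t cw = (uint8_t)((p1 << 1) | (p2 << 2) | (d1 << 3) |
                                   (p4 << 4) | (d2 << 5) | (d3 << 6) | (d4 << 7));
            return cw | parity8(cw);      /* overall parity into bit 0 */
        }

        /* 0 = clean, 1 = single error corrected in place, 2 = double error. */
        static int secded_check(uint8_t *cw) {
            unsigned syn = 0;
            for (unsigned pos = 1; pos <= 7; pos++)
                if ((*cw >> pos) & 1) syn ^= pos;  /* XOR of set-bit positions */
            unsigned parity_even = (parity8(*cw) == 0);
            if (syn == 0 && parity_even) return 0;
            if (!parity_even) {           /* odd number of flips: fix one bit */
                *cw ^= (uint8_t)(1u << syn);   /* syn==0 means bit 0 itself */
                return 1;
            }
            return 2;                     /* syndrome set, parity even: 2 flips */
        }

        int main(void) {
            uint8_t cw = secded_encode(0xB);
            cw ^= 1u << 5;                                 /* one flip  */
            printf("one flip:  %d\n", secded_check(&cw));  /* prints 1  */
            cw ^= (1u << 2) | (1u << 6);                   /* two flips */
            printf("two flips: %d\n", secded_check(&cw));  /* prints 2  */
            return 0;
        }

      The same construction scales up: 8 check bits cover 64 data bits (2^7 = 128 >= 64 + 7 + 1, plus one bit for double-error detection), which is where the classic 72-bit-wide ECC DIMM comes from.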

      • ajross 11 days ago

        In point of fact most L2+ caches are already ECC'd on whole cache lines, as I understand it. At the level of on-SoC integration this makes a little more sense [1], as it effectively "improves yield" by recovering marginal cache regions that wouldn't otherwise pass tests.

        [1] vs. the semi-religious flame war about DRAM ECC in which I won't engage. People get nuts over this, and IMHO the actual data is awfully inconclusive.

        • pclmulqdq 11 days ago

          Yes, most modern CPUs seem to do RAM ECC at the level of cache lines, too. That lets them do things like erasure coding to cover you if an entire RAM chip goes bad.

          SECDED ECC is the academic example.

    • baq 11 days ago

      this is a loaded question.

      anyone interested in the topic should absolutely start by reading up on https://en.wikipedia.org/wiki/Error_correction_code and only then start looking into the engineering side, starting with https://en.wikipedia.org/wiki/ECC_memory; notably, some benchmarks reported 25% performance hits (not sure if that's real or sales propaganda).

      • cornholio 11 days ago

        I clearly am not talking about in-band ECC (storing the error detection data inside the program-accessible RAM), which is what the 25% performance drop refers to.

  • ajross 11 days ago

    > Intel was responsible for most of this.

    Only in the sense of "Intel is responsible for most of computation" ... No one uses ECC pervasively anywhere, that's sort of the point of the article.

    • jrockway 11 days ago

      Intel has been pretty intent on making ECC server-only. I used their HEDT platforms for years and never had ECC. AMD is much nicer about this: if you want to use ECC on their HEDT platform, you can. It's not super supported, but it's also not a $5000 upgrade. (Though my understanding is that ECC is mandatory for the current generation of Threadripper? That's great!)

      • steve_rambo 11 days ago

        ECC is fully supported by consumer AMD processors (at least Ryzen 7000, and I think earlier ones too). You need to pick a matching motherboard; most boards from ASRock will do. And you need to find unbuffered ECC RAM, which is more difficult than the previous two and is why I had to give up on the whole idea.

        Related post:

        https://sunshowers.io/posts/am5-ryzen-7000-ecc-ram

  • jan_Sate 11 days ago

    Huh? I thought that "bitrot" was like content saved into the disk and the disk left unpowered for an extended period of time causing data loss. And I thought that the content stored on the disk has ECC on its own?

    • lukaslalinsky 11 days ago

      You get a corrupted bit in memory, you take and save that corrupted content to disk and no error correction on the disk level will help you.

      I made the mistake of running a big PostgreSQL database on non-ECC memory once, and I must say, it taught me some hard lessons.

transpute 12 days ago

PC Engines $150 APU2 (RIP) shipped with 4GB ECC RAM and AMD Embedded CPU. Since it was a headless device used mostly for 1GbE networking, the RAM was throttled and relatively impervious to Rowhammer.

QNAP has a $600 1U short-depth (11") 4x3.5" 2xM.2 2x10GbE 2x2.5GbE 4-32GB DDR4 SODIMM Arm NAS [1] that would benefit from OSS community attention. It's based on a Marvell/Armada CN9130 SoC which supports ECC, has mainline Linux support, and public-but-non-upstream code for u-boot [2]. With a local serial console and a bit of effort, the QNAP OS can be replaced by Arm Debian/Devuan with ZFS. A rare combo of low power, small size, fast network, ECC memory and upstream-friendly Linux. QNAP also sells a 10GbE router based on the same SoC.

Ryzen Pro (OEM) can support ECC [3].

[1] https://www.qnap.com/en-us/product/ts-435xeu

[2] https://solidrun.atlassian.net/wiki/spaces/developer/pages/3...

[3] https://www.tomshardware.com/pc-components/cpus/amd-confirms...

  • karma_pharmer 11 days ago

    > Marvell/Armada CN9130 SoC ... public-but-non-upstream code for uboot

    Dude, not even close. That repo is for an opaque blob:

    https://github.com/SolidRun/cn913x_build/blob/master/binarie...

    ... which is mashed-together bits of other blobs, including the notorious will-never-be-open-source highly-radioactive "snps blob":

    https://github.com/MarvellEmbeddedProcessors/mv-ddr-marvell/...

    and its vile EULA which you have agreed to:

    https://github.com/MarvellEmbeddedProcessors/mv-ddr-marvell/...

    That shit contains an ARC core. Never heard of ARC? That's okay, most people haven't. It's an obscure niche architecture used for almost nothing except for ... drum roll ... the Intel Management Engine (up until 2019). Gee I wonder why that's in there.

    Don't touch Marvell shit with a ten foot pole, it's blobs all the way down. They're just good at hiding it.

    • transpute 11 days ago

      Thanks for the analysis.

      > That repo is for an opaque blob

      Marvell patches have been going into upstream u-boot for a subset of CN9130 functionality. Some were rejected as unwanted SoC-unique features, but that code is still public and could be consolidated into a public git repo for evaluation. Support for core features was merged, e.g. this thread from 2020 to 2023: https://lore.kernel.org/u-boot/ddd355c2-344e-4fbd-ace9-29d10...

      > the notorious will-never-be-open-source highly-radioactive "snps blob" .. ARC core

      Thanks for highlighting snps+eula. While we all want blob-free hardware like the expensive Talos OpenPOWER, all modern CPUs from Intel (ME) and AMD (PSP, MS Pluton) have management cores and firmware blobs (Intel FSP, AMD AGESA). AMD has a public roadmap for open firmware with openSIL, but PSP code is not public. One mitigation is to keep the device offline.

      > Marvell.. good at hiding it

      While this obfuscation is undesirable, many Arm vendors don't even bother with a fig leaf of upstream support. Would you recommend any Arm SoCs that have ongoing upstream Linux and u-boot coverage/fixes? Ideally, the SoC would also have OSS code for TrustZone (TF-A) and support for ECC memory. Rockchip RK3588 looks promising: https://www.collabora.com/news-and-blog/blog/2024/02/21/almo...

      • karma_pharmer 11 days ago

        > While we all want blob-free hardware like the Talos Power9

        ... which is 100% blobless. Or the Cavium MIPS64 chips (100% blobless). Or if you want Arm64, the Rockchip RK3399 (100% blobless). You have plenty of choices.

        > One mitigation is to keep the device offline.

        Here's an even better mitigation: don't buy junk like this chip.

        Stop blobwashing junk hardware like this with deceptive misrepresentations.

        • transpute 11 days ago

          > Cavium MIPS64

          Does that include the Cavium Octeon Plus CN5020 in the Ubiquiti EdgeRouter Lite?

          https://openwrt.org/toh/hwdata/ubiquiti/ubiquiti_edgerouter_...

          https://www.insidegadgets.com/wp-content/uploads/2015/06/CN5...

          I looked into running OpenBSD on the EdgeRouter Lite and was told that network routing performance was poor without a binary blob that ships in Ubiquiti's router firmware. Cavium is owned by Marvell.

          > if you want Arm64, the Rockchip RK3399 (100% blobless). You have plenty of choices.

          The challenge is finding existing products that people can buy at retail. We're having this conversation in an HN thread about ECC. The only off-the-shelf Arm NAS (4xSATA, 2xM.2) that I've found with ECC + Debian is the QNAP above.

          I will look for an RK3399 SBC that supports ECC memory, 4xSATA and at least one NVME slot, which could be installed into a 1U NAS chassis.

          > Stop blobwashing junk hardware like this with deceptive misrepresentations.

          ECC is a high-priority requirement for an Arm NAS. I would be delighted to find a blobless Arm SBC with ECC support and enough I/O channels for use in a NAS. Otherwise, ECC trumps blobs for an offline NAS use case in a 1U chassis, which can be physically secured in a rack against tampering.

TheAmazingRace 12 days ago

So I have to say, ECC memory is definitely something we should not have gotten away from for consumer hardware. My current PC, which is rocking a Core i9 14900K (pray for me) and an ASUS W680M ACE SE motherboard, allowed me to install some 5600 MT/s DDR5 ECC memory, and it works flawlessly.

The only downside in my view is the cost. Unbuffered ECC and the cost of a workstation-class chipset really push this into luxury territory. Plus, I'm never too sure what Intel's future plans are for successor processors and chipsets, which is why I settled on W680. I don't really want to go full-blown Xeon.

  • Sweepi 12 days ago

    The more significant downside than cost is speed: your DDR5-5600 ECC most likely has a latency of 16 ns, while DDR5-7000 non-ECC with 12 ns (10 ns if you are only interested in the column strobe) is available for your platform, giving 25% more bandwidth while also featuring 25-40% lower latency.

    Don't be fooled (like me) by the DDR5-6000/6400/6800 ECC registered modules: all desktop motherboards only support unbuffered modules, and most don't even support DDR5-5600 ECC, only DDR5-4800/5200 ECC.

    • TheAmazingRace 11 days ago

      Yeah I realize I have pretty sloppy timings, courtesy of JEDEC. But despite this, I feel like my new ECC memory runs rings around the old overclocked DDR3 kit I ran in my previous Z97 build. It's all relative.

  • pixelpoet 12 days ago

    14900k might be one of the least reliable CPUs in recent times, pairing it with ECC seems almost ironic!

    • vardump 12 days ago

      The recent trend with Intel CPU (at least 13900K and 14900K) reliability is indeed rather worrisome.

      • walteweiss 12 days ago

        Do you mind sharing more details for those who weren’t paying attention? I’m still on 10-year-old generations with my home PCs and my laptop, and don’t feel the urge to upgrade anything yet. Hence, I’m not following the trends.

        The only thing I’m considering is an upgrade of my iPad Pro and probably my MacBook at some point (not sure if I need it at all). My home computer is doing just fine with a 4th-generation Intel CPU and I don’t see any need for an upgrade in the coming years, if not a decade.

        • zrm 12 days ago

          Intel stagnated for several years, allowing AMD and Apple (both using TSMC) to produce CPUs that are simultaneously faster and more power efficient than Intel's. There wasn't really anything Intel could do about the power efficiency in the short run so they went all in on performance and started selling desktop CPUs with a 6 GHz turbo that draw >250 watts. Apparently this is not great for reliability.

    • TheAmazingRace 12 days ago

      Perhaps… but maybe I see the ECC as an insurance policy on top of a dodgy CPU. Frankly, I set my CPU to Intel Baseline and then ran Cinebench for over an hour on the multi-core test with zero ill effects. Hopefully I’m lucky in the end.

  • pseudalopex 11 days ago

    > Unbuffered ECC and the cost of using a workstation class chipset really pushes this into luxury territory. Plus, I'm never too sure what Intel's future plans are for successor processors and chipsets, which is why I settled on W680.

    Some cheap AMD motherboards support ECC. But the future is unknown. Ryzen 8000 CPUs don't.

    • TheAmazingRace 11 days ago

      AMD has a sort of unofficial stance on ECC. The support isn't explicitly stated for desktop parts, although unofficially it typically works.

      I like having 100% assurances and guarantees that it will work, making the W680 based platform I'm on a no-brainer for this application.

summerlight 12 days ago

https://discourse.codinghorror.com/t/to-ecc-or-not-to-ecc/37...

Interestingly, Jeff Atwood has changed his mind on ECC memory.

  • pixelpoet 12 days ago

    Even more interestingly in the comment below, apparently you can just flick on ECC for RTX 4090 cards!

    Extra weird is Nvidia singling out ray tracing as a use case which shouldn't use ECC... I suppose it's no biggie if a single ray goes the wrong way down the BVH, out of trillions.

    • kragen 12 days ago

      yeah, makes sense to me. it'll usually just result in unnecessarily antialiasing that pixel. unless it corrupts the definition of an object, and then that object might disappear or become enormous. and probably most raytracing memory is objects, right? not rays being traced

  • sufehmi 11 days ago

    As soon as I started working with high-load servers in the 90s, it was already clear that ECC should be the default.

    Intel's marketing ploy on ECC is very destructive and has cost many parties a lot of wasted time, money & resources to handle problems caused by non-ECC memory; and Linus Torvalds is absolutely right in roasting them for this.

magicalhippo 12 days ago

Memory corruptions can have very different impacts. A sample of decoded music getting corrupted leads to a small glitch, maybe even inaudible. An instruction in executable code getting corrupted can lead to all sorts of havoc.

Since ECC is seemingly not becoming mandatory, I've been wishing CPUs would support "soft-ECC". That is, the OS could mark certain pages as needing "soft-ECC", and the CPU would then store (at least) three copies of that page in RAM. When reading such pages back from RAM the CPU would read all physical copies and compare. If the majority agrees it can use that; otherwise raise an error.

This could then be used for executable pages and important configuration data which occupies relatively few pages, and where integrity matters a lot more than speed.

There's probably some good reasons why this is non-trivial to implement, I've forgotten most of what I learned about the virtual memory implementation in CPUs. But a man can dream...
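
(To illustrate the majority-vote part of the dream in software, since no CPU exposes it: a minimal sketch of triple modular redundancy over a 64-bit value. A real implementation would live in the memory controller and page tables, not in user code.)

    #include <stdint.h>

    typedef struct { uint64_t copy[3]; } tmr64;

    static void tmr_write(tmr64 *t, uint64_t v) {
        t->copy[0] = t->copy[1] = t->copy[2] = v;
    }

    /* Bitwise majority vote: a result bit is 1 iff it is 1 in at least
       two of the three copies, so any single corrupted copy is outvoted. */
    static uint64_t tmr_read(const tmr64 *t, int *disagreed) {
        uint64_t a = t->copy[0], b = t->copy[1], c = t->copy[2];
        *disagreed = (a != b) || (b != c);  /* caller can log or rewrite */
        return (a & b) | (a & c) | (b & c);
    }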

  • HideousKojima 12 days ago

    The triple reading/writing to memory, along with the comparing, would probably be a significant performance hit. You could just use a bit of extra memory for parity bits etc. instead.

    • magicalhippo 12 days ago

      Sure it would be a significant hit. But only for relatively few pages, and I imagine most of them would either stay cached or are very cold.

      Of course you could trade implementation complexity for speed as always. My main point was to have effective ECC without any additional support from motherboard and memory modules.

      • menaerus 11 days ago

        The important detail needed to make your theory plausible, without a huge performance impact and/or jitter, is how the kernel memory management system is going to decide exactly which those "relatively few pages" are on a server that runs plenty of processes, many of them critical/heavy ones such as databases. One such example comes to mind - transparent huge pages. And that didn't turn out to be quite successful.

        I imagine soft-ECC would be more plausible if it could be applied/enabled:

        1. Per-process (e.g. whole application)

        2. Or fine-grained per-allocation within a process/application

        I think both could be implemented through a (system) memory allocator by taking advantage of page-alignment LSBs and/or (non-)addressable bits of the virtual memory. Those spare bits can be used to store the encoding of the ECC algorithm (Hamming, Reed-Solomon, or something more primitive but less robust).

        And then, depending on the ECC algo, one could substantially minimize the performance impact of encode/decode by using SIMD.

        • magicalhippo 11 days ago

          I imagined executable pages and pages explicitly marked by the application/kernel via VirtualProtect[1] or similar.

          Of course an application that marks GB of data this way could play havoc with other processes, so maybe some OS limits on how many non-executable pages a process can mark may be needed.

          And there certainly might be some further dragons I'm not considering.

          [1]: https://learn.microsoft.com/en-us/windows/win32/api/memoryap...

  • grog454 11 days ago

    How does the OS know which pages to mark?

    • magicalhippo 11 days ago

      As I mentioned in my other reply[1], I imagined executable pages and those marked explicitly through code.

      This allows the kernel and applications to protect important variables and data structures.

      [1]: https://news.ycombinator.com/item?id=40297035

      • smallpipe 11 days ago

        That relies on hardware support in the TLB. If you need TLB support for "please don't corrupt this page", you might as well get hardware with ECC in the caches and RAM.

        • magicalhippo 11 days ago

          Which relies on three components supporting ECC, not just one.

          But sure, ECC all around would be best.

          Btw, it was my understanding that CPU caches already use ECC?

Animats 12 days ago

ECC memory should have a price premium of only 9/8 - 1, or 12.5% (one extra chip for every eight). It costs more than that, because it's "enterprise".

  • thfuran 11 days ago

    Probably slightly more than just the increase in memory modules, since there's also the extra complexity of actually checking/reporting, but roughly.

  • wmf 11 days ago

    It's now 25% for DDR5.

    • Animats 11 days ago

      That's not bad. A few years ago, it was more like 100%, and hard to get.

      • wmf 11 days ago

        No, I mean a DDR5 ECC DIMM has 25% more chips (10 vs. 8). I don't know what the price difference is.

        • adrian_b 11 days ago

          That is no longer true.

          The DDR5 standard allows either 40-bit channels or 36-bit channels in ECC DIMMs. (A DDR5 UDIMM has 2 channels, while the desktop CPUs use 4 channels provided by 2 sockets.)

          The former choice corresponds to a 25% overhead (8 check bits per 32 data bits), the latter to a 12.5% overhead (4 check bits per 32).

          In the beginning there were only 80-bit ECC DDR5 UDIMMs, because the DRAM vendors preferred to make only x8 chips. More than a year ago, 72-bit ECC DDR5 UDIMMs also appeared, which use a mixture of x8 and x4 chips.

          Nowadays no ECC memory vendor can justify a 25% premium, because if they happen to use 25% more memory, that is because they believe it reduces their production costs (by making a single kind of chip). Only a premium slightly above 12.5% is justified.

          Unfortunately, due to limited offer, the ECC DDR5 UDIMMs can still be up to 50% more expensive than non-ECC modules.

        • P_I_Staker 10 days ago

          I don't know what it is either

  • ido 12 days ago

    True, as is explicitly mentioned in the article.

snvzz 11 days ago

ECC should be a requirement.

The FCC could just not allow computers to ship without it.

CPU makers like Intel and AMD could simply have their CPUs not work with non-ECC RAM.

Microsoft could e.g. require ECC RAM for Windows 12.

It is insanity that most computers shipping today do not use ECC and are thus unreliable.

With luck they'll crash, but most likely they will fail silently, while corrupting data.

  • 1oooqooq 11 days ago

    so true. anecdotally, I've gone from 4 blue screens a month down to zero after going ECC, on the few desktops where we require windows.

    everyone must still be quoting numbers from when we had 4mb of premium chips. now that all pcs have 8-128gb of the crummiest, cheapest silicon... i bet the failure rates are way more noticeable.

    sadly, i got laptops for my company that have a PRO amd cpu and sodimm sockets... only to find out ecc sodimm ram is sold by one manufacturer gouging the NAS market with insane prices.

  • adhoc32 11 days ago

    I've been there, and it was a pain. All my backups were corrupted due to a faulty RAM module. Initially, I blamed the hard drives, because they seemed to be failing right before my eyes: I was copying a large file; sometimes it copied okay, but occasionally it would become corrupted. Since then, I've been paying a premium for ECC.

    • TheCondor 11 days ago

      Same experience. We were doing all the things: regular backups, rotating them, verifying them. During a weekly verification test, a backup failed. We tested some older backups and they failed too! If the data matters, it's hard to express the stress and unease you feel in that moment.

      Memory is different from all the other resources in the system. We are conditioned as engineers to expect that drives fail more frequently than other components, and when memory fails it is indistinguishable from a drive failure. There are some system behaviors that matter too: we tend to think that page allocation is random, and on heavily loaded systems it appears to be, but on specialized systems it can be rather consistent, so the verification can fail in nearly the same place, repeatedly.

      Riddle me this: what is more likely? A memory failure, a drive failure, or a postgresql bug that results in a corrupted row? Badblocks checks out on the server's disks... If the data matters, it is extremely unpleasant going through that whole thing; it's crystal clear after the fact, but it's a bloody nightmare in the heat of it all.

Nerada 12 days ago

DDR5 comes with on-die ECC. My understanding is this only corrects errors occurring within the RAM itself, not errors that occur during transmission to and from the RAM.

My question is, how common are transmission errors over errors happening within RAM?

  • pclmulqdq 12 days ago

    On-die ECC is there so they can give you a memory array with a few faults. It's a yield enhancement, not an introduction of ECC as you think of it.

    Adding protocol-level ECC on top only helps, although it is somewhat inefficient.

    • thfuran 11 days ago

      Similar to SSDs, which are constantly switching to less and less reliable cells for density and now need fault correction built in to function at all.

  • gjjydfhgd 12 days ago

    Another problem with on-die ECC is the lack of reporting.

    You have no idea if you have tons of errors, or how many were corrected.

  • geerlingguy 12 days ago

    LPDDR4/4X has also had on-die ECC for a while (at least the chips I'm used to, like in the Raspberry Pi); with such small lithography it's basically required to get the RAM to work reliably.

Sweepi 12 days ago

I would love to put ECC in my desktop computers, however it's more expensive (ok), it is not officially supported on most desktop motherboards (and in reality does not work in "ECC mode" on the majority of them), and finally: ECC RAM available for purchase is painfully slow, in both bandwidth (:/) and latency (://)

sph 11 days ago

Please, I'd love someone to tell me how to find and buy computers that support ECC. I am looking to buy a NUC/mini-server, and they basically all sell with non-ECC RAM. Last time I asked on this forum, I was told that on Intel, only Xeon processors support ECC, while all modern (?) AMD CPUs support it. Elsewhere I read that what matters is whether the mobo supports it. I have no idea how to go about it.

So, let me ask again: if I were to buy a NUC, new or off eBay, how can I be 100% sure it works with ECC RAM without having to spend half an hour researching CPU, mobo and BIOS specs for each single product I come across?

If I had a budget in the thousands, I would go with a Xeon server that comes with ECC pre-installed. I don't and have modest needs. I only want to splurge on ECC RAM to replace the original sticks.

(No "you don't need ECC for a NUC" reply please. That is not the point of my question, yet it is a far too common response)

  • adrian_b 11 days ago

    I have never seen any true NUC-like computer that supports ECC SODIMMs. Intel has also used the NUC brand for a much larger computer that supported laptop Xeon CPUs, but that line has been abandoned.

    There have been some NUC-like computers from ASRock industrial, Supermicro and others, with either Tiger Lake or Elkhart Lake CPUs, where you could enable in BIOS the so-called in-band ECC.

    All these models are obsolete. Moreover, in-band ECC is an ugly and inefficient workaround. It can be used with soldered LPDDR memories, which do not have ECC variants, but it has worse performance than standard ECC. It is not cheaper, because it diminishes the memory capacity in the same ratio as any ECC and it requires a greater die area inside the CPU for its implementation (including a dedicated cache memory for the ECC bits).

    There are many mini-ITX motherboards that support ECC (but you must check the specifications carefully, even for AMD CPUs). For a smaller size than mini-ITX, there are 2 choices: either expensive industrial single-board computers, which usually have a 3.5" PCB form factor, or one of the so-called mobile workstation laptops from Dell, HP or Lenovo, e.g. a Dell Precision mobile workstation, which are also much more expensive than an equivalent NUC-like computer.

    So, if you want a low price and an up-to-date fast CPU, you cannot have ECC in form factors smaller than mini-ITX. If paying double or triple is not a problem, there are solutions.

    If you want a preassembled small computer with ECC and a mini-ITX motherboard, there are some from companies like ASRock Rack or Supermicro, but they are much more expensive than getting the best components and assembling them yourself.

  • user_7832 11 days ago

    Recent AMD processors need to be of the AMD PRO series to support ECC. Motherboard support is also required. On Intel's side, I think standard Xeon-type commercial boards very often support it. Unfortunately you'll likely need to ask around when buying to ensure support. If you can, getting a known-good mini-ATX mobo in a small case may be easier.

    • pseudalopex 11 days ago

      Don't all Ryzen 7000 CPUs support ECC?

      • user_7832 11 days ago

        At least in the 7x4x series, the PRO designation is required. Fortunately, AMD's website will mention if ECC is supported.

        • pseudalopex 11 days ago

          I didn't think about laptop CPUs. It looks like 7x4xHS and 7x4xU require the Pro designation but 7x4xH and all others do not.

  • petronio 11 days ago

    I was on the NUC search a while ago and I'm not sure you can be. Although AMD motherboards may not officially support ECC, I haven't heard of any that actually don't. Your best bet is probably to buy a recent barebones AMD NUC and buy the ECC RAM yourself. Sometimes they'll advertise ECC support as well.

  • P_I_Staker 10 days ago

    keep reading. books will set you free

eadmund 11 days ago

What’s the best price/performance for a home lab server running Linux with ECC these days? Bonus points if it is rackable.

Sadly, my go-to Linux hardware manufacturers either don’t offer ECC RAM, or only offer it as an option on their absolute top-end machines. Yes, yes, the extra two thousand dollars for a machine with a six-year lifespan probably is worth it on a monthly basis, but man it still hurts.

  • adql 11 days ago

    > What’s the best price/performance for a home lab server running Linux with ECC these days? Bonus points if it is rackable.

    An old used enterprise server. None of them will be great at power/performance in typical (i.e. mostly idle) home use though. Intel ones are usually far better here.

  • NorwegianDude 11 days ago

    I recently(ish) built a new home server using a cheap AM5 motherboard from ASUS that supports ECC. Good performance, and power usage is around 45 W idle with a couple of SSDs and a couple of HDDs spinning.

    Not the cheapest, but I wanted to keep power consumption low for noise and reduced heating while still having good performance if needed.

    I also considered a motherboard with IPMI on AM5 (ASRock Rack), but that was much more expensive.

    Worked out quite nicely.

  • Palomides 11 days ago

    put something together with used supermicro parts, maybe with a H11SSL-i or H12SSL motherboard and epyc cpu

    or whatever dell 730 or something fits your budget

BlueTemplar 12 days ago

> From talking to folks at a lot of large tech companies, it seems that most of them have had a climate control issue resulting in clouds or fog in their datacenters. You might call this a clever plan by Google to reproduce Seattle weather so they can poach MS employees. Alternately, it might be a plan to create literal cloud computing. Or maybe not.

oskarkk 11 days ago

> For example, at 20nm, a DRAM capacitor might hold something like 50 electrons, and that number will get smaller for next generation DRAM and things continue to shrink.

Nice. That got me curious: how many electrons are in today's DRAM capacitors? I tried searching but haven't found any recent info.

dvt 12 days ago

I tried building an old rig (maybe ~7 years ago or so) using ECC RAM (since I was running two Xeons). It was such a pain in the butt to get it to boot and to find sticks that were compatible with each other; I don't really want to go down that path again.

  • danparsonson 12 days ago

    I built my most recent desktop using ECC and it was a breeze, so maybe you were just unlucky?

  • _factor 11 days ago

    The cheap sticks work if you don’t mind buying a bunch and sending back the ones that don’t work together in your config.

    Binning is likely the problem.

  • Fnoord 11 days ago

    For me it was a one-shot with a Xeon, no issues whatsoever. Any decent price comparison site can filter on ECC memory.

  • bjoli 12 days ago

    I just chucked 128GB of DDR4 LRDIMMs into my old Xeon server. It worked flawlessly.

nextaccountic 12 days ago

Why not ECC CPUs and GPUs? They can be hit by cosmic rays too.

  • adrian_b 12 days ago

    Almost all CPUs (except perhaps some very cheap microcontrollers) already use ECC for their internal cache memories, because those are even more prone to errors than the external DRAM.

    Consumer GPUs, which do not have ECC, are notorious for frequent errors when they are used for general-purpose computation tasks instead of graphics or AI applications (where errors happen, but usually do not matter), so for any important computation it is recommended to repeat it and compare the results.
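
    (The repeat-and-compare idea in miniature; a sketch with made-up names, assuming a deterministic kernel so a bit-exact comparison is meaningful:)

      #include <stddef.h>
      #include <string.h>

      /* Run the same deterministic computation into two buffers and
         accept the result only if they agree bit for bit. */
      int run_checked(void (*kernel)(float *out, size_t n),
                      float *out_a, float *out_b, size_t n) {
          kernel(out_a, n);
          kernel(out_b, n);
          return memcmp(out_a, out_b, n * sizeof *out_a) == 0;  /* 0 => retry */
      }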

    • nextaccountic 11 days ago

      What about ECC for the logic circuits of CPUs? Like, some form of redundancy that enables detection and correction of logical errors caused by some kind of malfunctioning in circuits that do computation (perhaps caused by cosmic rays or something)

      • adrian_b 11 days ago

        The makers of server CPUs have always claimed that those have additional "RAS" (Reliability, availability and serviceability) features in comparison with consumer CPUs.

        Nevertheless, I have never seen such a vendor give a non-fuzzy description of those features. Moreover, after the sinking of Itanium, except for IBM, the server CPU makers use CPU cores that are also designed for consumer applications, so at most the server CPUs may have additional redundancy in some execution units (for validating the results) that is disabled in the consumer variants, to reduce the power consumption. However, if that were true, I do not see why the CPU vendors would not brag about it, to justify the premium price requested for server CPUs.

        For the applications that need very high reliability, it appears that in most cases it is preferable to use multiple CPUs operating in lock-step, which can be compared, ensuring that errors will be detected regardless in which part of the CPU they appear, instead of attempting to use a single CPU that is extremely fault-tolerant by using specially designed redundant execution units (because errors can appear anywhere, including on interconnections and buffers).

    • caf 12 days ago

      If the CPU exposes statistics from the internal cache ECC, I wonder if you could use that as a rough radiation sensor?

      • adrian_b 12 days ago

        Most CPUs do not document the details of the cache controllers, so typically you cannot know when correctable errors have happened.

        Only when non-correctable errors happen, which should be very seldom (e.g. less than one per year), does the CPU generate a machine-check exception, for which it can be identified that the cause was a (multiple-bit) error in one of the internal cache memories.

        Therefore you could sense only extremely high radiation levels, which would also require a customized OS kernel to recover from the exceptions without crashing. Such high radiation levels could be detected much more easily with some discrete reverse-biased diodes, which are also easier to place wherever the radiation exists, keeping the microcontroller or whatever device records the radiation levels either away from or shielded from the radiation.

        • caf 11 days ago

          Yes, it was correctable errors I was thinking of. Not too surprising that they don't expose those statistics, I guess - it probably doesn't even count them.

nottorp 11 days ago

I believe the first part could make for the start of a great 'if Google does it, it doesn't mean it's good for you' article...

forty 11 days ago

I read somewhere that DDR5 has some kind of internal ECC mechanism even for non-ECC sticks; is that right? Does it make ECC less relevant?

luckystarr 12 days ago

That's the reason I chose AMD for my own laptop. AFAIR Intel doesn't support ECC.

genpfault 12 days ago

(2015)

  • wmf 12 days ago

    There is a 2024 update in the middle saying that "Gabriele Svelto estimated that approximately 10% to 20% of all Firefox crashes were due to memory corruption."

ceving 12 days ago

The only problem with ECC is that it is only an improvement, not a solution. Every error correction scheme has a limit to the number of errors it can detect or correct. There is no such thing as absolute security.

  • cbolton 11 days ago

    Having a small remaining probability of error is not a problem at all, if it's small enough. There are always potential sources of failures (such as a meteorite crashing on your server). If the probability of one source is dwarfed by others then it doesn't matter.

  • sph 11 days ago

    It's not a "problem" if, as you say, perfection is an unreachable goal. The only question is whether it's an improvement over the status quo, and it clearly is.