fest 6 years ago

I've never personally encountered a bug like this but I have hit my fair share of weird/hard to track down bugs over my embedded software career.

Almost always, they leave me a) longing for the blissful ignorance of the low-level details our whole computing infrastructure is built upon and b) wondering how on Earth our technology works as well as it does, considering there are layers upon layers of abstractions, each of which could have a lot of issues that are either worked around or just never hit in a particular application.

  • rtpg 6 years ago

    This is dumb and "obvious", but the advantage of the layers of abstraction is that a lot of people share them. So each issue only needs to be found once, so to speak.

    Of course you lose specialization and you end up in "the image decoder supports some weird format which allows RPC" territory, but most software can barely handle the high-level bugs.

    It's pretty crazy how something like Django exists, powers thousands and thousands of websites, but has perhaps less than a thousand contributors overall? Less than one person per project. Now that's an amazing payoff!

  • jacquesm 6 years ago

    > how on Earth our technology is working as well it does, considering there are layers upon layers of abstractions which could have a lot of issues which are either worked around or just not hit in particular application.

    Our technology works as well as it does because of these layers upon layers of abstraction. That's the only way you are going to be able to construct something with a few billion components and a fighting chance at avoiding unwanted interference between parts. The amazing thing is how often we get it just right, not that there are super rare edge cases that were not taken into account during the abstraction process that lead to bugs.

    Every leaky abstraction is a bug in waiting: all it takes is for someone to focus on the discrepancy, and with enough time, effort and resources thrown at it, it might lead to a crash or an exploit.

    Also note that it is not as if we don't know that caching is a hard problem to get right; it is one of the three things explicitly mentioned in the 'there are two things hard about computing' joke.

    • flamedoge 6 years ago

      I feel like the more I learn, the more convinced I become that computing works because we build stupidly impenetrable abstractions that keep us from shooting ourselves in the foot. Yet I can't shake the feeling that we are leaving so much room for optimization on the table.

      • jacquesm 6 years ago

        That's true, but optimization is always an exercise in economy. If the money is there, someone will do the optimization; for instance, in the Bitcoin mining arms race you could see the writing on the wall for CPUs long before the jump to GPUs, FPGAs and eventually ASICs.

        In mobile phones I always expected battery life to cause a resurgence of things like assembly programming but it never happened, people are happy to recharge their phones. I wonder what would happen if someone introduced a smartphone OS based on old school principles jacking up the battery life to 5 days or so.

        • jl6 6 years ago

          It could still happen. Mobile phones have been riding the CPU speed improvement gravy train for a decade or so, but there are signs that this is coming to an end like it did for desktop CPUs.

          There will be increased demand for faster software when the hardware stops getting faster.

          Optimization is vertical integration. Guess which mobile phone manufacturer is best placed to pull that off!

  • dfox 6 years ago

    I have exactly the same experience.

    And it is somewhat reinforced by the kind of "software problems" that only manifest themselves on a new hardware revision or even a new production batch. The causes range from the wrong chip being assembled in production (SOT23-6, same package marking, but a dual JFET really does not work as a replacement for an LDO), through various signal integrity issues ("how did you determine that this makes a 100R microstrip?" "It looked right and makes the board look right"; somehow the "right-looking" diffpair on a 6+ layer board does not work on a 2 layer backplane), to real silicon errata that either can (MSP430G-series USCI interrupt handling; you will not even trigger that when using the TI-provided sample code) or cannot (Microchip's various pin routing issues, some Intel Atom errata and such) be sanely worked around in software.

codys 6 years ago

Reading the NXP thread, it is not yet clear that NXP considers this an erratum, only that it is something that is desirable to avoid.

Does anyone have a link to the changes to FreeRTOS & use of libclang mentioned in the article?

NXP thread: https://community.nxp.com/thread/459977

  • ChuckMcM 6 years ago

    In that thread: "After reproducing the issue and performing some tests, it was found that the issue is because “LDREX” and “STREX” instructions overlooked LMEM cache. That means those instructions always access external memory directly, which leads to data inconsistency. There’s no SW configuration to make the cacheable data consistent with those atomic instructions, and design team will fix it in later CM4 integration."

    It's a bug. But they see a workaround, so apparently they aren't in a hurry to fix it.

pslam 6 years ago

I’ve found my fair share of SoC bugs over the last couple of decades, and cache coherency is by far the most common problem. It’s complex to implement, and implementers are always messing with it to gain a cycle here and there. They get it wrong, frequently.

I would go so far as to bet every mainstream SoC has at least one cache coherency bug either already documented (errata) or undiscovered.

mbilker 6 years ago

This bug reminds me of the cache incoherency bug involving the xdcbt instruction on the Xbox 360's PowerPC CPU.

https://randomascii.wordpress.com/2018/01/07/finding-a-cpu-d...

  • comex 6 years ago

    There's an even more similar bug in the PowerPC CPU used by the Wii U: all atomic operations have to perform a cache flush (dcbst) between the load-linked (lwarx) and store-conditional (stwcx) instructions, or else they won't work properly. (But I believe the issue is that the operations aren't being propagated from per-CPU caches to main memory, so it's sort of the opposite of the I.MX7 bug where operations are skipping the cache.)

digikata 6 years ago

Nice writeup, with coverage of both the discovery and the fix. It makes me wonder where the best distribution point for this fix ends up being. The most convenient, but hidden, option is that it ends up in some embedded dev kit somewhere. I suppose it might go into the FreeRTOS software, but then it seems like it's not globally applicable to the ARM platform, just the i.MX7 (and likely not even all variants of i.MX7).

  • wallacoloo 6 years ago

    I believe the only generic fixes are to (a) ensure DRAM is never cached, (b) ensure no atomic operations address DRAM (as the author proposed, by using TCM), or (c) implement atomics by disabling interrupts, performing a normal RMW operation (i.e. no LL/SC, since even with IRQs disabled that would cause cache incoherency), and re-enabling interrupts.

    No library like FreeRTOS can guarantee (a). Not even the compiler can guarantee (a), since the user can control the cache by memory-mapped registers. (b) can't be guaranteed by a library, nor can it be guaranteed statically by a compiler, since only the linker knows where the atomic variables will reside in memory (and, atomic operations could be performed on an address that isn't a compile-time constant, e.g. dynamically allocated memory).

    (c) also can't be guaranteed by any library, but it could be guaranteed by a compiler that has access to the full source of the binary. That's a hefty limitation though, since it means you can't mix any other compiler/language (e.g. assembly, which is almost always used for the startup sequence) into the binary and still have these guarantees.

    For method (c), I believe gcc allows one to somehow override __atomic_load & other "builtins" - the use-case being that atomics can be implemented for new or uncommon architectures without modifying the compiler itself. If this is the case, then a potential fix could be shipped by a library (e.g. FreeRTOS) which defines __atomic_load as something like

      void __atomic_load(type *ptr, type *ret, int memorder)
      {
      #ifdef IMX7_<partnumber>
          __disable_irq();
          *ret = *ptr;
          __enable_irq();
      #else
          // insert code to perform a normal atomic load.
      #endif
      }
    
    In fact, gcc allows one to "wrap" functions - it might be possible to do something like

      void __wrap___atomic_load(type *ptr, type *ret, int memorder)
      {
      #ifdef IMX7_<partnumber>
          __disable_irq();
          *ret = *ptr;
          __enable_irq();
      #else
          __real___atomic_load(ptr, ret, memorder);
      #endif
      }
    
    which would have the benefit that the library doesn't need to know how to implement atomic ops on other platforms. This approach could also be used to implement (b) by performing a runtime assert that `ptr` lives in a cache-coherent section of memory.
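    A minimal sketch of that runtime check, with hypothetical TCM address bounds (the real ranges come from the part's reference manual):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical TCM address range; check the i.MX7 reference
   manual for the real one on your part. */
#define TCM_START 0x1FFF8000u
#define TCM_END   0x20008000u

/* Assert that an atomic variable lives in cache-coherent TCM,
   i.e. in memory where LDREX/STREX behave correctly. */
static inline void assert_in_tcm(const volatile void *ptr)
{
    uintptr_t a = (uintptr_t)ptr;
    assert(a >= TCM_START && a < TCM_END);
}
```

    A wrapped atomic op would call this on `ptr` before doing the real work, turning a silent coherency bug into an immediate, debuggable failure.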

    But again, this approach only works if you're not relying on any binary blobs that perform atomic ops. In the end, if you're doing anything nontrivial (e.g. atomic ops on heap-allocated memory, or even stack-allocated memory), it's impossible for a dev kit to completely hide this bug from the developer.

    Alternatively, do these M4 processors have some type of updatable microcode like X86 processors do? NXP might be able to push a fix that somehow patches LL/SC primitives, or traps when they're encountered at runtime and allows the user to decide how to handle them (e.g. putting them in a no-IRQ critical section like above, but since it's done at runtime now you can mix multiple languages / binary blobs, etc).

    • rschaefer2 6 years ago

      The M4, to my knowledge, has no microcode like x86 processors.

      For solution (b), while it can't be guaranteed by the compiler, it can be guaranteed with external tooling that manually ensures that all atomic variables have a gcc section attribute specifying the linker section in the TCM when all sources are available. This also will prevent heap and stack allocated atomics, as I believe the linker will error when specifying a section attribute that the linker cannot respect.
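      As a sketch, pinning an atomic into a TCM-backed section with that attribute might look like this (".tcm_data" is a made-up section name; the real one comes from the project's linker script):

```c
#include <stdatomic.h>

/* ".tcm_data" is a hypothetical section name that the linker
   script would map into TCM; the attribute just places the
   variable there. */
__attribute__((section(".tcm_data")))
static atomic_uint ref_count;

unsigned take_ref(void)
{
    /* On ARMv7-M this lowers to an LDREX/STREX retry loop,
       safe here because the variable lives in TCM. */
    return atomic_fetch_add(&ref_count, 1) + 1;
}
```

      External tooling would then only have to check that every `_Atomic` variable carries such a section attribute, which is much easier than proving anything about arbitrary pointers.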

      Solution (c) is actually the solution developed for use with some M0 implementations that don't support LL/SC. It works with gcc because the gcc-provided implementations have the __weak attribute, meaning that your implementation takes priority. An example override of fetch_add:

          uint32_t __atomic_fetch_add_4(uint32_t *addr, uint32_t value, int memmodel)
          {
              (void)memmodel;
              uint32_t mask = __get_PRIMASK();
              __disable_irq();
              uint32_t temp = *addr;
              *addr = temp + value;
              if (!mask) {  /* re-enable only if IRQs were enabled on entry */
                  __enable_irq();
              }
              return temp;
          }
    • burfog 6 years ago

      I think there is another fix: avoid all normal loads and stores to cache lines that contain atomic data structures.

      When allocating the data structure, make sure to grab a whole cache line or more. Flush and invalidate that, even if you have to do the entire cache. Now simply use the LL/SC operations (apparently named LDREX/STREX on ARM) exclusively. The bug will protect you from itself, preventing the data structure from ever being cached.

      The only danger left is that some processors will predict memory access, leading to a speculative load of the cache line. I don't know if this CPU has such a feature. If so, a larger allocation might be needed.
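      A rough sketch of that layout trick in C, assuming a 32-byte line size (the actual LMEM line size should be checked in the reference manual):

```c
#include <stdint.h>

#define CACHE_LINE 32u  /* assumed LMEM cache line size */

/* Give the atomic word a whole cache line to itself, so no
   ordinary load or store ever touches that line and it can
   only be reached via LDREX/STREX. */
typedef struct {
    _Alignas(CACHE_LINE) uint32_t word;          /* LL/SC access only */
    uint8_t pad[CACHE_LINE - sizeof(uint32_t)];  /* keep neighbours off the line */
} line_isolated_atomic;

_Static_assert(sizeof(line_isolated_atomic) == CACHE_LINE,
               "atomic must occupy exactly one cache line");
```

      After flushing and invalidating the line once at startup, nothing would ever pull it back into the cache, which is the "bug protects you from itself" effect described above.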

unwind 6 years ago

Learning the LDREX/STREX instructions a couple of years back was a great "aha moment". I was/am fairly new to the ARM platform, and never really dug into x86 so I'm not very familiar with the corresponding instructions there.

But it's a really elegant model, and it was really fun to use them directly to implement some primitives we needed.

Later, of course, I realized that since we build with GCC, we can use its atomic/sync builtins instead, which compile down to LDREX/STREX but are more high-level in the C code.
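For illustration, the higher-level version can be as simple as C11 `<stdatomic.h>`, which GCC lowers to an LDREX/ADD/STREX retry loop on ARMv7-M:

```c
#include <stdatomic.h>
#include <stdint.h>

static atomic_uint counter;

/* GCC compiles this to an LDREX/ADD/STREX retry loop on a
   Cortex-M4, giving the same primitive without hand-written
   assembly. */
uint32_t counter_bump(void)
{
    return atomic_fetch_add_explicit(&counter, 1,
                                     memory_order_relaxed) + 1;
}
```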

Great find, this must have been very frustrating.

epx 6 years ago

My case of a hardware bug was an FPU bug on a PC/104 platform that either returned an absurd value for a floating-point division or crashed the program with SIGFPE. It was the only FP operation in the program and I was lucky enough to log the result. I replaced it with a scaled integer division to avoid the bug, because replacing thousands of boards was not an option.
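For illustration, that kind of scaled-integer replacement can look like this (the scale factor of 1000 is just an example; it depends on the precision the application needed):

```c
#include <stdint.h>

/* Fixed-point stand-in for a/b: returns the quotient scaled
   by 1000 (three decimal places) using only integer math,
   never touching the FPU. */
static int32_t div_scaled_1000(int32_t a, int32_t b)
{
    int64_t num = (int64_t)a * 1000;
    /* round to nearest for positive operands */
    return (int32_t)((num + b / 2) / b);
}
```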