UnquietTinkerer 6 years ago

For anyone interested, here are links to the slides and the accompanying white paper.

[Slides] https://millcomputing.com/blog/wp-content/uploads/2018/04/20...

[White Paper] https://millcomputing.com/blog/wp-content/uploads/2018/01/Sp...

I haven't read the paper yet; hopefully it offers more detail than the talk does because I am still confused about how the Mill avoids cache pollution from speculative loads.

EDIT: Here is my attempt at a summary of the relevant bits of the whitepaper:

The Mill is immune to Meltdown for the same reason AMD et al. are: it does permission checks before loading rather than in parallel, and thus the load faults before going to memory.

The Mill is immune to Spectre because "Current Mill configurations will [speculatively] issue, and revoke, a maximum of two instructions. Revocation includes all cache and other micro-architectural side effects."

Neither of those points is covered in the talk. I don't know enough about the subject to judge, but the arguments in the paper seem a bit glib. I'd like to hear from an expert on the subject.

  • strstr 6 years ago

    I'm pretty surprised if they don't leave speculatively loaded (and still correct) data in the cache. My understanding is that this is sort of the point of speculation: you often won't compute the right value (because you'd have to be right in every instance), but you will have loaded nearly all of the relevant data into the cache, so it's comparatively fast the second time around.

    • Veedrac 6 years ago

      This argument holds better for an OoO CPU that is speculating 100 instructions ahead, so there's significant work done in this window. When your speculative execution is only 2 cycles ahead, you aren't throwing away much work; you'd be lucky to even have work to throw away by that point, at least as it applies to cache misses.

    • Symmetry 6 years ago

      I'd be very surprised if they didn't, too. But Spectre isn't just about what's in cache: you have to load secret data and then do another load at a location based on that secret data before the mis-predicted branch is caught. The number of clock cycles from branch prediction to branch resolution on the Mill is just too short to do all of that, just like on most in-order architectures. Merely loading the secret data into cache isn't enough to be a problem, since for the attack to work you already knew its address.
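
      To make the two-load pattern concrete, here's the classic Spectre v1 gadget shape in C (the array names and the 4096-byte stride are the conventional textbook illustration, not anything from the talk). Both loads have to issue before the mispredicted bounds check resolves; a speculation window of only two instructions doesn't leave room for the second one:

      ```c
      #include <stdint.h>
      #include <stddef.h>

      uint8_t array1[16];            /* attacker-indexed array */
      uint8_t array2[256 * 4096];    /* probe array: one cache line per possible byte value */
      size_t  array1_size = 16;

      uint8_t victim(size_t x) {
          if (x < array1_size) {            /* branch the attacker trains to mispredict */
              uint8_t secret = array1[x];   /* load 1: fetches the (possibly out-of-bounds) byte */
              return array2[secret * 4096]; /* load 2: address derived from the secret,
                                               leaving a secret-dependent cache footprint */
          }
          return 0;
      }
      ```

      Architecturally the out-of-bounds path just returns 0; the leak is purely in which line of array2 ends up cached.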

evancox100 6 years ago

(Started watching at 24:30, thanks y'all).

Everything he's saying misses the mark. If the only issue was hiding the memory latency of a load when you know the address, you could solve this with existing techniques like simultaneous multithreading (a la HyperThreading), prefetch hints, etc.

The need for speculation arises when you do not know ahead of time which address to access, which branch to take, etc. For example, you're accessing an element in an array and need to multiply the index by the element size. You don't know which address to load until the multiply completes, so you speculate. I don't see how the Mill's deferred load semantics help you any more than a prefetch or dummy load would. Actually, unless I'm missing something you couldn't even use the deferred load because, again, you don't have the address.
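
For what it's worth, the dependency described above is just an address computation feeding a load; a minimal C sketch (the struct and names are invented for illustration):

```c
#include <stddef.h>

typedef struct { int key; int payload[15]; } Elem;  /* hypothetical 64-byte element */

int lookup(const Elem *table, size_t idx) {
    /* The load address is table + idx * sizeof(Elem); no load, deferred
       or otherwise, can issue until that multiply completes. */
    return table[idx].key;
}
```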

  • snuxoll 6 years ago

    > The need for speculation arises when you do not know ahead of time which address to access, which branch to take, etc. For example, you're accessing an element in an array and need to multiply the index by the element size. You don't know which address to load until the multiply completes, so you speculate. I don't see how the Mill's deferred load semantics help you any more than a prefetch or dummy load would. Actually, unless I'm missing something you couldn't even use the deferred load because, again, you don't have the address.

    You kind of hit three different issues here; there are three completely different scenarios I can think of off the top of my head, and the Mill design ties with out-of-order designs in the worst case and beats them in the other two.

    1. Random I/O on array elements - nobody wins here, because branch prediction and speculative loads will consistently fail; you hope your data is in cache, and everybody stalls if it isn't.

    2. Sequential I/O on array elements - the Mill can perform equally to an out-of-order design in most cases and beat it in others; you don't rely on the CPU seeing far enough ahead to reorder loads, and you have much better facilities for parallelizing common operations (their strstr example using their smear instruction, NaR values, and pervasive vectors is truly mindblowing).

    3. Switch statements with jump tables - the Mill's wide-issue design handles many of these cases without needing jump tables to begin with, especially when paired with speculative operations on potential NaRs. When you need to call code at another address you are again at the mercy of the branch predictor and instruction prefetch, which the Mill does do and has some novel designs for that provide a low mispredict penalty and purportedly better prediction results. Ultimately, though, if you keep hitting mispredicts you're in the same worst case as on out-of-order designs.

    The Mill can't beat out-of-order designs where your code just thrashes cache, causes mispredicts all the time, etc, but it can match them without eating gobs of power.

    • twtw 6 years ago

      You talk about the mill as if it exists. There is no hardware, there are no benchmarks. Bloviating about the excellent performance of the mill is not valuable - showing SPEC CPU results is. VLIW performance was great too, until it wasn't. You can statically schedule everything in theory and the performance will be great, but experience suggests that giving hardware the capability to react dynamically cannot be replaced by static scheduling, except in code with limited branching and a known execution pattern. This is why VLIW works nicely for DSP, and fairly poorly for general purpose computing.

      The mill has been in development for 15 years, and almost done for 5. Forgive me for not holding my breath.

      • deepnotderp 6 years ago

        I don't understand why people are so willing to say "it won't work" without actually taking the time to understand it. They literally spend like every one of their talks addressing how they overcome traditional vliw problems.

        • evancox100 6 years ago

          He's not saying it won't work, he's saying it doesn't exist yet, so talking about it as if it does is a bit silly.

      • youdontknowtho 6 years ago

        It's the myth of the perfect system that is an underdog to industry giants whose products are inferior but dominant for (insert reason).

        It's like: 1. rewrite it in rust 2. Plan 9 3. Functional Programming for everything. 4. Lisp. 5. There are more, but you take my meaning...

      • wtallis 6 years ago

        > Forgive me for not holding my breath.

        You're the one who chose to click on the link to enter this discussion. You know that the Mill isn't in silicon yet and you're personally only interested in things that are, so why are you here? You're just trolling while other people are trying to have a productive academic discussion.

        • angry_octet 6 years ago

          It isn't a productive academic discussion if you're gatekeeping views that don't match your own, yet are also technically informed.

          If there were results from an FPGA synthesised version of the Mill there would be less scepticism. But as it is, the Mill is just a design, and claimed performance/features require more evidence than for an existing architecture.

        • twtw 6 years ago

          This "productive academic discussion" of the myriad benefits of the mill architecture has been repeated over and over again for at least 5 years, with very few new developments. There is great value to thinking about new and non traditional architectures, but discussion around this particular venture is pretty tired. I don't know that much more discussion is valuable at this point without some evidence.

          • wtallis 6 years ago

            > has been repeated over and over again for at least 5 years, with very few new developments.

            The same is true of discussions of Intel's architectures; they've only released one new microarchitecture in the past 5 years. Hardware development is slow, even for the people who are already shipping silicon.

            • sparkie 6 years ago

              The difference is Intel have a proven track record of producing actual products. Mill Computing have not produced anything tangible yet.

              The strategy appears from the outside to be one of aggregating IP in the hope that they'll license it (like ARM) or will get acquired for some big amount.

              I hope I'm wrong, but I'm not expecting to be able to pick up a Mill CPU in the next 5 years. Maybe even 10.

__s 6 years ago

24:30 to reach info about the Mill architecture; before that he's building context by explaining Spectre.

analognoise 6 years ago

Does this thing even exist on an FPGA yet?

jcranmer 6 years ago

Does anyone have a link to the slides? I find that a much preferable way to access this sort of stuff...

gizmo686 6 years ago

Discussion on Mill begins at about 24:30.

ptc 6 years ago

This Mill guy is the gift that keeps on giving. With any luck he’ll still be around to explain how the soon-to-be-released Mill 1.0 cpu would have avoided the year 2038 problem.

  • gbrown_ 6 years ago

    > With any luck he’ll still be around to explain how the soon-to-be-released Mill 1.0 cpu would have avoided the year 2038 problem.

    What? Software working with a 64-bit time_t is not the CPU's problem.

twtw 6 years ago

Talk is cheap.

The TL;DW is that the mill cpu will have better performance than existing CPUs without speculative execution because it has "deferred loads," while the straw man not-mill architecture doesn't and therefore stalls after every load. Also, newsflash - Spectre doesn't impact architectures that don't speculate.

This is great, except that existing CPUs don't stall after issuing a load. Scoreboarding + prefetch are together capable of more than this "deferred load," and require less work from the compiler. If you have independent instructions following a load, any existing architecture worth its salt will notice that and execute them while the load is in progress.

It's potentially a neat idea to include the number of cycles until load retire in the instruction, but it's a joke to pretend that it's higher performance than what x86 does and will get you back all the performance lost by not speculating.

I can't help but think that the mill architecture gets a lot of hype from a lot of people that don't know very much about computer architecture. There have been lots of great ideas that didn't pan out for general purpose computing, and I'm not sure that this vaporware architecture deserves to be thought about.

  • snuxoll 6 years ago

    > This is great, except for that existing CPUs don't stall after issuing a load. Scoreboarding + prefetch are together capable of more than this "deferred load" [...]

    Except existing CPUs spend a lot of die space and power budget on speculative execution to hide the stall; the point of a deferred load is that you don't need all this hardware to extract the same performance.

    > and require less work from the compiler. [...]

    Three words: static single assignment. If you can work out the dataflow of a function, you already have everything you need to order loads in the most efficient way possible; this is why all of Mill Computing's work has been around LLVM, because LLVM IR forces SSA by design. Hell, your compiler doesn't even need to think about the ordering if it relies on LLVM to do the native code generation, because the Mill backend is supposed to do all of this for you.

    > It's potentially a neat idea to include the number of cycles until load retire in the instruction, but it's a joke to pretend that it's higher performance than what x86 does and will get you back all the performance lost by not speculating.

    Deferred loads alone aren't there to beat x86 in terms of performance; they're there to avoid needing all the costly out-of-order hardware while avoiding the memory stalls that previous statically scheduled/in-order machines incur. There are other features in the architecture to bring better performance, but that's all around the VLIW-like design.
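
    A rough way to picture the deferred-load idea in ordinary C (an analogy only; the Mill's actual load instruction encodes the retire delay, which has no C equivalent):

    ```c
    int sum_with_hoisted_load(const int *a, int x, int y) {
        int v = a[0];           /* load issued early: a cache miss here overlaps... */
        int t = x * y + x - y;  /* ...with this independent arithmetic */
        return v + t;           /* the loaded value is consumed only at the end */
    }
    ```

    On a stall-on-use in-order machine this hides the miss latency exactly when the compiler can find independent work to put between issue and use.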

    • gpderetta 6 years ago

      As far as I know (and I don't know much, because I'm not a compiler guy), LLVM (and most compilers) doesn't keep everything in SSA form. Any value whose address has escaped (most things not on the C stack, and even some local variables) must be treated as memory. I think those non-automatic, not-recently-used variables would also be the values that benefit the most from deferred loads. So IIRC the Mill has hardware to help with aliasing, but it wouldn't in fact plug into LLVM out of the box.

      Do the Mill guys even have an LLVM backend yet? Or even any compiler at all?

      • Veedrac 6 years ago

        I believe the latest we've heard is that their LLVM backend is mostly working but still pretty buggy.

    • jcranmer 6 years ago

      > Three words: static single assignment. If you can work out the dataflow of a function you already have everything you need to order loads in the most efficient way possible

      The ordering of loads has almost nothing to do with dataflow (with the slight caveat that data dependencies from loads guarantee a small amount of the memory order). I'm speaking from experience here: any computation-DAG model is going to very quickly run into the problems of dealing with branches and the inherently undecidable problem of static alias analysis.
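
      A tiny example of the kind of memory dependence that dataflow alone can't see (the function is hypothetical): SSA tracks the values of p and q perfectly, yet says nothing about whether the load may move above the store:

      ```c
      int read_after_write(int *p, int *q) {
          *p = 1;      /* store through p */
          return *q;   /* cannot be hoisted above the store unless the compiler
                          proves p != q; if they alias, the load must see the 1 */
      }
      ```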

  • deepnotderp 6 years ago

    > This is great, except for that existing CPUs don't stall after issuing a load. Scoreboarding + prefetch are together capable of more than this "deferred load," and require less work from the compiler. If you have independent instructions following a load, any existing architecture worth its salt will notice that and execute them while the load is in progress.

    <Citation needed>. BTW, deferred loads were rediscovered as "decoupled load" (http://people.duke.edu/~bcl15/documents/huang2016-nisc.pdf) and achieved a respectable 8.4% avg speedup.

    > It's potentially a neat idea to include the number of cycles until load retire in the instruction, but it's a joke to pretend that it's higher performance than what x86 does and will get you back all the performance lost by not speculating.

    That's not at all what's claimed. The entire idea is that you try to approximate OoO performance, not beat OoO performance.

    > I can't help but think that the mill architecture gets a lot of hype from a lot of people that don't know very much about computer architecture. There have been lots of great ideas that didn't pan out for general purpose computing, and I'm not sure that this vaporware architecture deserves to be thought about.

    I mean, it's been discussed as a valid idea by people on the P6 team (one of the first commercial OoOs), but what would they know? ¯\_(ツ)_/¯

    • twtw 6 years ago

      "<Citation needed>"

      This is a joke, right? Or are you actually calling me out for no evidence on a discussion of the mill architecture, which has been proclaimed for a decade as the greatest thing since sliced bread without a shred of supporting evidence?

      • deepnotderp 6 years ago

        You claimed that deferred loads help less than prefetching and scoreboarding. You should substantiate that claim.

        In any case, see the linked paper for why you're probably wrong on that point.

  • petermcneeley 6 years ago

    "I can't help but think that the mill architecture gets a lot of hype from a lot of people that don't know very much about computer architecture. There have been lots of great ideas that didn't pan out for general purpose computing, and I'm not sure that this vaporware architecture deserves to be thought about."

    This architecture is basically Itanium++, which was a very serious arch but didn't make it (just like the PS3's Cell). To ask whether this arch has any possible future, one should really ask why Itanium didn't succeed.