104 points by rbanffy a year ago
There have been many previous attempts at "throughput computing", meaning many slow cores or threads for server workloads: 3 generations of Niagara (UltraSPARC T1 and descendants), SeaMicro's box full of Atoms, etc. It doesn't mean this attempt is doomed, but is there a compelling answer to "What's different this time?"
The difference here is that the individual cores on this are very fast. Unfortunately I cannot publish benchmarks because of NDAs, so you'll have to believe me on this until the hardware is more widely available.
Cloudflare did a quick review:
They are decent, but don't seem quite as good as the ones on AMD's (or, even more so, Intel's) processors.
Qualcomm Centriq 2434 [https://www.nextplatform.com/2017/11/08/qualcomms-amberwing-...]:
- 40 cores (no SMT)
- 2.5 GHz peak
- 4 uops/instructions per cycle [https://www.qualcomm.com/media/documents/files/qualcomm-cent...]
- 110 W TDP
- 10 Guops/s/core
- 0.011 Guops/s/core/$
- 400 Guops/s
- 0.45 Guops/s/$
- 3.63 Guops/s/W
AMD Epyc 7401P [https://en.wikipedia.org/wiki/Epyc]:
- 24 cores (2x SMT)
- 2.8 GHz all-core boost
- 6 uops per cycle [http://www.agner.org/optimize/microarchitecture.pdf]
- 170 W TDP
- 16.8 Guops/s/core
- 0.016 Guops/s/core/$
- 403 Guops/s
- 0.37 Guops/s/$
- 2.37 Guops/s/W
So based on this, the AMD processor has about 170% of the Qualcomm's per-core performance, equal total throughput, 83% of Qualcomm's total throughput per $, and 65% of Qualcomm's total throughput per W.
Note that the AMD CPU has SMT while the Qualcomm doesn't, which improves utilization, and its components are probably faster (due to higher TDP and more experience making CPUs), so it looks like the AMD CPUs are likely to be strictly better in practice except possibly on performance/watt.
Also, with AMD/Intel, albeit at much lower performance/$, you can have 32-core instead of 24-core CPUs and there is up to 4/8-way SMP support that Qualcomm doesn't mention.
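The peak figures in the two spec lists above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the `Chip` class and its field names are made up for illustration, and the prices (880 and 1075 USD) are assumptions back-derived from the quoted Guops/s/$ values, not numbers stated in this thread. Peak uops/s is an upper bound on issue rate, not a measured result.

```python
from dataclasses import dataclass


@dataclass
class Chip:
    """Hypothetical container for the spec-sheet numbers quoted above."""
    name: str
    cores: int
    ghz: float            # clock used in the comparison
    uops_per_cycle: int   # peak issue width per core
    tdp_w: float
    price_usd: float      # ASSUMPTION: inferred from the quoted Guops/s/$

    @property
    def guops_per_core(self) -> float:
        # Peak giga-uops per second for one core.
        return self.ghz * self.uops_per_cycle

    @property
    def guops_total(self) -> float:
        return self.guops_per_core * self.cores

    @property
    def guops_per_dollar(self) -> float:
        return self.guops_total / self.price_usd

    @property
    def guops_per_watt(self) -> float:
        return self.guops_total / self.tdp_w


centriq = Chip("Qualcomm Centriq 2434", 40, 2.5, 4, 110.0, 880.0)
epyc = Chip("AMD Epyc 7401P", 24, 2.8, 6, 170.0, 1075.0)

# AMD relative to Qualcomm on each peak metric.
ratios = {
    "per core": epyc.guops_per_core / centriq.guops_per_core,
    "total": epyc.guops_total / centriq.guops_total,
    "per $": epyc.guops_per_dollar / centriq.guops_per_dollar,
    "per W": epyc.guops_per_watt / centriq.guops_per_watt,
}

for metric, value in ratios.items():
    print(f"AMD / Qualcomm, {metric}: {value:.2f}")
```

Changing `uops_per_cycle` to a sustained rate measured on a real workload would give a very different picture, which is the objection raised in the replies.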
This is the most meaningless comparison you could possibly make. By this logic Pentium 4 is also better, because it had 3.8 peak GHz, SMT and could do 4 uops/cycle.
In reality things don't work like that. First of all, some of the uops can only be loads, others stores, and others branches. Second, you have factors like branch prediction, cache latency, and branch misprediction penalty that play a huge role in performance.
I have yet to see a workload that can saturate 6 execution ports, even with SMT.
In the real world Centriq is better in performance/watt, and is even better in performance/thread than an SMT thread on an Intel Skylake.
> In the real world Centriq is better in performance/watt, and is even better in performance/thread than an SMT thread on an Intel Skylake.
Is there such a thing as "in the real world"? I mean, isn't it use-case dependent, and if you want a build machine, a web server, or a database server you'll get different results out of your benchmarks?
I am sorry, you are 100% correct. There are workloads where you can't beat Intel. Pure number crunching for example, where you can utilize AVX-512.
I meant in the real world of (most) web servers in this case.
> There are workloads where you can't beat Intel. Pure number crunching for example, where you can utilize AVX-512.
That's also debatable. I've read HPC papers that show Opterons outperforming Xeons on heavy FP workloads due to the higher throughput and larger cache. Baseless claims regarding "real world performance" are only good for marketeers.
Edge cases are edge cases. I once made a semi-retired 5 year-old server crush a brand new one on a specific workload just because I noticed the working dataset did fit entirely in its L2 cache.
Will I ever do it again? I have no idea. At the time, I got a very nice bottle of wine for my bet.
> Edge cases are edge cases.
It's not an edge case when we're talking about basic BLAS kernels.
> I once made a semi-retired 5 year-old server crush a brand new one on a specific workload just because I noticed the working dataset did fit entirely in its L2 cache.
You seem to be oblivious to the fact that cache access has long been the main bottleneck in HPC applications. Although parallel programming gets all the attention, the bulk of the research in the field goes into figuring out ways to minimize cache misses while pumping data to the ever-growing number of registers. Opterons outperformed Xeons because researchers figured out how to harness the Opteron's larger cache and throughput to avoid the performance penalties imposed by cache misses and thrashing, and it showed. That's also one of the reasons why the old Bulldozer architecture showed linear per-core performance even when each pair of cores shared a floating point unit.
> so it looks like the AMD CPUs are likely to be strictly better in practice except possibly on performance/watt.
Cloud scale providers don't care about raw performance (within reason).
TCO wins the day, so if Qualcomm CPUs offer higher performance per watt than Intel or AMD, I can definitely see them buying these like hotcakes.
They do seem surprisingly decent. Not yet clearly better, but they are surprisingly decent.
It is too bad someone hasn't made a graph of these two trajectories, Intel/AMD performance and ARM performance, over time. I bet it would let us see whether the two are going to intersect and when that would happen. We have about 7 years of data now in this race, so a graph should be possible.
Hold up. Comparing uops/sec between CPU architectures is not meaningful.
Yes. Furthermore, the assumption that a core being capable of a maximum of x uops/cycle means that the core actually executes x uops/cycle when running a workload seems really far-fetched.
Some public benchmark data: https://blog.cloudflare.com/arm-takes-wing/
Re: "What's different this time?"
1. Tier-1 server suppliers like HPE plan to make a Centriq server. For customers to stick their neck out and start porting critical software to a new architecture, they have to believe that there's going to be a refresh next year, and the year after that. Some adventurous customers will be ready to explore this territory but the critical mass won't move until they see momentum shift that way.
HPE and Qualcomm each have a popular brand and big budgets that can sustain a slow ramp of a couple generations of these products before they start to see major adoption.
2. What else is different this time is the ever-increasing popularity of open source software, Linux, containerization, Python, Golang, Node/JS, Java, C#, etc. Red Hat announced at SC this year that they will offer a supported ARM release. That means that all of the above will Just Work.
We can roughly estimate generalized performance per core via their performance per dollar comparisons with Intel.
Their 48 core Centriq 2400 is listed at $2000. The 28 core Skylake they compare it to is $10000. They claim 4x better performance per dollar, which puts it at the chip level as 0.8x as fast as the Intel. That would peg every core as 0.5x as fast as the Intel.
It won't win a single-thread contest, but that puts it well within the "very fast" category.
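The estimate above is just a rearrangement of the performance-per-dollar claim. A minimal sketch of that arithmetic, using only the prices, core counts, and the 4x claim quoted in the comment (the function name is made up for illustration):

```python
def relative_chip_performance(perf_per_dollar_ratio: float,
                              price_a: float,
                              price_b: float) -> float:
    """Performance of chip A relative to chip B.

    From (perf_a / price_a) / (perf_b / price_b) = ratio
    it follows that perf_a / perf_b = ratio * price_a / price_b.
    """
    return perf_per_dollar_ratio * price_a / price_b


# Figures quoted in the comment above.
centriq_price, centriq_cores = 2000, 48
skylake_price, skylake_cores = 10000, 28
claimed_perf_per_dollar = 4.0

chip_ratio = relative_chip_performance(
    claimed_perf_per_dollar, centriq_price, skylake_price)

# Per-core ratio: divide each chip's performance by its core count.
per_core_ratio = chip_ratio * skylake_cores / centriq_cores

print(f"chip-level: {chip_ratio:.2f}x, per-core: {per_core_ratio:.2f}x")
```

The per-core figure comes out slightly under the rounded 0.5x, and it assumes the vendor's 4x claim holds for the workload in question.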
These go way back, the Thinking Machines Corporation was doing stuff like this back in the early 80s.
It didn't pan out that time either, not even on machines with 64k processors.
Intel's Knights Landing / Xeon Phi architecture is a bit like that.
Eventually we'll face the fact cores won't get any faster and that even phones will have a dozen of them.
Porting our code to run well on such machines is a bet people should start making.
... or a bet that vendors should start making.
Intel could sell bottom bin Xeon Phi chips as development systems. I don't care if half the cores have failed testing and are masked off or it won't run at full speed.
I would love if they did that.
Developers always have the computers of the future on their desks. If Intel wants Phi to be part of that future, they'd better put them on developers' desks.
Considering that even my far-from-flagship phone has an eight-core processor, I wouldn't be surprised to see a 12-core in the near future.
Sure, but that's more like dual-quad (since no one uses Asymmetrical MultiProcessing).
Which is also another bet people should start making.
Most of the time, my laptop is doing workloads that would leave an Atom bored to death, but when I need it, I sure love those i7 cores.
I would gladly sacrifice one i7 core for 4 Atom ones, provided the OS knew what to do with them.
Yeah, it's a latency vs power vs parallelism issue
This is one reason why WinRT initially only had async APIs.
Wasn't Phi more of an HPC/machine learning solution, not a server solution?
I don't think they target the same kind of applications anyway.
pushed by a bigger company using a better established architecture and a more reasonable core count.
Since Intel is hitting a wall on transistor size, AMD has a viable and competitive chip at roughly half the price for comparable performance, and Arm manufacturers keep making theirs faster, it's going to be an interesting next few years, to say the least.
> Since Intel is hitting a wall on transistor size
Well everyone is running into these problems.
However Intel's historical process advantage is probably drawing to a close (or at least the gap between Intel's best and say, TSMC's best, will be a lot smaller). So it's all about the micro-architecture from here on out.
That's not really true. Intel's competitors are doing a great job of finding metrics via which they can claim their transistors are '10nm' or '7nm', but if you look at the actual full dimensions of the transistors that they're using Intel retains an advantage. It's just that advantage is 7 vs 10 vs 14 rather than 90 vs 65. Back with 90 vs 65 it was much clearer cut what they were talking about.
On the flip side though, 65 vs 90 is basically the same % difference as 7 vs 10.
Obviously the performance difference of 7 vs 10 is smaller than historical changes (especially WRT power), but Intel's still ahead and looks like it will stay that way.
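As a quick check of the "same % difference" remark, here is the arithmetic on the node labels. It compares the marketing numbers only, which, as noted elsewhere in the thread, do not measure the same physical dimension across vendors; the function name is made up for illustration.

```python
def linear_shrink(new_nm: float, old_nm: float) -> float:
    """Fractional reduction in the nominal feature size."""
    return 1.0 - new_nm / old_nm


historical = linear_shrink(65, 90)   # 90 nm -> 65 nm
current = linear_shrink(7, 10)       # 10 nm -> 7 nm

print(f"90 -> 65: {historical:.1%} smaller")
print(f"10 -> 7:  {current:.1%} smaller")
```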
Putting the node naming issues aside, Intel used to lead the rest of the industry by 2-3 years. And not only was it technologically superior, Intel also had a scale enjoyed by no one else in the industry. From that point of view, Intel had a monopoly.
Now not only have TSMC and Samsung shrunk the gap to less than a year or so, they both also enjoy the economy of scale provided by the mobile phone market: 1.7 billion smartphones vs 250M PCs. While many of the smartphones aren't using cutting-edge tech, the same goes for the PC market as well.
Unless there is a change of pace from Intel, TSMC will likely have the same level of tech as Intel in 2019, and a scale that Intel won't match, assuming Apple doesn't change course and continues to fab with TSMC.
Intel will need to sell more chips and more transistors to keep its fabs fully utilized. The ~200M modems for Apple next year are going to help tremendously, but given that Apple switches suppliers from time to time, Intel's move into GPUs in 2-3 years' time seems like a backup plan in case anything goes south.
Yes, it's true. TSMC and Samsung may call their upcoming nodes 7nm, while Intel calls it 10nm, but what matters is that they have equivalent specifications and that they will arrive roughly at the same time.
Check out the last table comparing TSMC and Samsung's 7nm features vs Intel's 10nm:
Intel may still have a 6-month lead or so, but it's nowhere near the 2-3 year lead it had only 5 years ago, when it not only had smaller transistors than the competition but also used FinFET transistors, with much better thermal performance.
Intel won't even be the first to EUV now, which may not be a huge advantage at first (cost-wise), but it does give Samsung a lead in EUV expertise, from which I think they'll benefit in the long term.
Maybe a better metric would be transistors per cm^2, that way it shouldn't matter where you measure the transistor width. It's all marketing anyway.
System prices for current-generation AMD and Intel seem quite a bit higher than for previous-generation Intel. Price/performance for current-generation AMD and Intel seems very similar. If you have anything close to a factor of 2 difference, please post your numbers.
My Ryzen 1700 matches an E5 2630v4 that has a price differential of ~2.
Matches in what way? Certified ECC support? 4 memory channels? Works in dual socket motherboards? Kernel support for temp sensors? Can get with IPMI management? Video console over network? Chipkill?
> as shown by SPECint_rate2006 score estimates
"estimates"? Did they actually benchmark it?
Driving an open ecosystem around the Qualcomm Centriq 2400 processor is a critical pillar of our strategy.
Show us the datasheets. This being Qualcomm, I'm not holding my breath for that...
These are called estimated scores, as they have not yet gone through the SPEC.org reporting methods.
The specs look pretty interesting on paper, but it is just on paper until average developers are actually allowed to have access to such new ARM platforms at reasonable prices.
ARM managed to take the entire mobile market because those low-power ARM SoCs are everywhere: you can buy a tiny ARM board running Linux for $50 or less, and there are thousands of sellers on ebay.com/taobao.com. However, it is a completely different situation when it comes to ARM for data centres. It is damn hard to find any 64-bit ARM platform that allows you to have, say, 32 GB of RAM and a few PCIe slots at a reasonable price (e.g. comparable to Xeons). I'm not talking about cloud: I want 3 such ARM machines sitting in my home lab working 24/7 at 100% load for me, and I need to play on the hardware side a little bit (e.g. try different SSDs).
Actually, just ordered one. https://www.avantek.co.uk/store/avantek-32-core-cavium-thund... They’re pretty easy to find these days. SoftIron has a desktop box too.
You have your cause and effect backwards -- Arm cores had a huge percentage of the mobile market well before anybody produced a $50 devboard, and indeed the only reason you can get a $50 devboard is because there is enough volume selling SoCs to the mobile market to bring the chip costs down.
I built my first ARM boards for around $30 each; $50 is enough to have a pretty decent profit margin on top of that cost. It was an S3C4510 board running uClinux and I did it 14-15 years ago. That chip is one of the first few publicly available SoCs, and no, it was never used in mobile. There is no such volume.
As of writing, most of those devboards you can see on the market come with an application processor without modem. The ones that really see huge volumes are the ones with integrated modems, e.g. MTK processors with modem or the Snapdragon series, interestingly you are not going to get any of them for $50.
OTOH, the dev boards are the reason why people have been able to run desktop OSs on their ARM-based boards.
Do you know why no desktop software is optimized for Xeon Phi, SPARC, POWER9? Because almost nobody plays with them.
Also, many compilers are already heavily optimized for x86_64 but not so much for ARM64
If you look at the C language then there are two compilers: gcc and clang.
Clang uses LLVM as its back end, and is basically an Apple project. It is heavily optimized for arm64 (guess what: Apple uses arm64). LLVM is also the back end for Rust, Julia and many other languages. LuaJIT for example is pretty good on arm64 too.
Not true. Linaro is a thing, ARM also provides a first party compiler that is pretty good.
ARM has a team working on the arm64 GCC implementation so I expect it to be pretty well worked out like their own compiler.
When you look at hand-optimised assembly in languages and libraries, ARM is not even close to x64.
And what exactly are you comparing to reach this conclusion: "hand-optimised assembly in languages and libraries, ARM is not even close to x64"?
With programs like ExaGear (https://eltechs.com/product/exagear-desktop/exagear-desktop-... and https://eltechs.com/product/exagear-server/) and given a Linux environment, ARMv8 could be very powerful to replace x86 (not yet x86_64 since it's patented). You can even run Wine in the VM in near real-time. I wonder if ExaGear could be open sourced one day so we could see what magic it does to make it so fast.
Next generation or the one after will likely use RISC-V.
I think in a few decades, we'll all have in our computers a CPU, a GPU and a QPU (quantum processing unit).
We're going to reach the limit for CPU miniaturization soon.
I don't think the QPU is going to happen – quantum processing will be available, but in cloud.
And even then, I don't see quantum computing going the same way we went with our machines - QC looks like very batch-oriented with timeshared front-end processors, a bit like some ancient supercomputers.
I think this would enable new usages we can't even imagine yet. This is not a question of whether it is useful now.
That's like saying in the 90s, "why do we need 1Gbps internet, 56k is completely enough to check your emails or browse newsgroups."
And the ancient supercomputers are now in our pockets
> That's like saying in the 90s, "why do we need 1Gbps internet, 56k is completely enough to check your emails or browse newsgroups."
Not really, since QC is not computing like today except a lot faster. It's a very different way for solving some problems. They are not general purpose computing devices.
> And the ancient supercomputers are now in our pockets
Yeah, about that pocket-sized dilution refrigerator...
Would you say GPU compute was envisioned in the 80's?
I don't think we understand the quantum world and quantum computers well enough today to define 100% of the things we'll be using quantum computers for 50 years from now. We'll start with cryptography, molecule interaction simulations, and some optimization problems, but I think we'll think of new ways to use them in the future.
> Would you say GPU compute was envisioned in the 80's?
Sure, why not? What is a modern GPU if not a bunch (well, a very large bunch!) of vector processors? Vector supercomputers were a thing in the 1970s, parallel vector supercomputers in the early 1980s. The 1980s also saw the beginnings of massively parallel systems (e.g. Connection Machine, Transputer). So take the state of supercomputing in the 1980s, extrapolate using Moore's law up to today, and you might arrive at something quite close to a GPU chip.
Now, that this chip full of vector processors would be a side-effect of the $$$ available for graphics rendering, largely for entertainment purposes, is probably a historical fluke few would have predicted.
But my point was that Quantum Computing really is different. It's not a general purpose computing method (in the sense of Turing completeness etc.), and AFAIK so far all attempts at it require very low temperatures in order to drive down thermal fluctuations. Sub-K refrigerators have advanced considerably in the past few decades, but still it's far away from something portable.
> I don't think we understand the quantum world and quantum computers well enough today to define 100% of the things we'll be using quantum computers for 50 years from now. We'll start with cryptography, molecule interaction simulations, and some optimization problems, but I think we'll think of new ways to use them in the future.
Oh, absolutely. I'm just not convinced it'll be something every Tom, Dick, and Harry will use to go about their daily lives, even 50 years from now.
"The Cray-1 was the first supercomputer to successfully implement the vector processor design. These systems improve the performance of math operations by arranging memory and registers to quickly perform a single operation on a large set of data."
I would say the ILLIAC IV would be a better approximation, but the point is the same: massively parallel computers have been with us for a very long time before the first GPU.
What nobody anticipated is that we'd be using stuff that was originally designed for games.
Mostly yes: https://en.wikipedia.org/wiki/Connection_Machine
From the 60's: https://en.wikipedia.org/wiki/ILLIAC_IV
The problem with timesharing and QC is that you need to preserve and restore the quantum state of the computer when switching tasks. I am not sure how you would go about that, or if it's even possible.
GPGPU is basically "single instruction, multiple data" (SIMD) across thousands of cores. Supercomputers have been doing this for decades.
Quantum computing architectures are a whole different beast.
Can you please stop with the thousand cores bullshit? Even in Vega 64 there are only 64 cores with 64 ALUs per core.
Each core can still only process one instruction stream but if there are multiple threads sharing the same instruction stream they all get executed at once.
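A toy model may make the distinction between cores and ALU lanes concrete. The sketch below uses the 64 cores x 64 ALUs figure from the comment above; `run_core` and the two-step `program` are invented for illustration and say nothing about how real GPU hardware schedules work.

```python
CORES = 64            # compute units, per the comment above
ALUS_PER_CORE = 64    # lanes per compute unit, per the comment above

# The marketing "core count" is really the lane count.
total_lanes = CORES * ALUS_PER_CORE


def run_core(program, lane_inputs):
    """Run ONE instruction stream over all lanes of a core in lockstep.

    Every lane executes the same operation at each step; only the data
    differs from lane to lane.
    """
    values = list(lane_inputs)
    for op in program:
        values = [op(v) for v in values]
    return values


# Hypothetical two-instruction stream: multiply by 2, then add 1.
program = [lambda x: x * 2, lambda x: x + 1]
out = run_core(program, range(ALUS_PER_CORE))

print(total_lanes, out[:4])
```

Threads that need a different instruction stream cannot share a core's lanes in this model, which is why 64 cores with 64 lanes each are not equivalent to 4096 independent cores.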
> Not really, since QC is not computing like today except a lot faster. It's a very different way for solving some problems. They are not general purpose computing devices.
> Would you say GPU compute was envisioned in the 80's?
Well, GPUs aren't used for general computation either. There are specific sets of problems that they're good at, but we don't run the OS on them.
> There are specific sets of problems that they're good at, but we don't run the OS on them.
Xeon Phi's predecessor, the Larrabee, was designed as a GPU. Intel now has a Phi that can be the only CPU in your computer.
It all depends on the code that the GPU cores run.
Some time ago, while musing about what a modern-day Amiga would be, I'd imagined it'd have a GPU and run all (or, at least most of) its software on it.
> And the ancient supercomputers are now in our pockets
Actually, my cheap smartphone is orders of magnitude faster than the ancient supercomputers.
And yet the UI is still laggy :-\
They have to do a lot more work to render that UI. The framebuffer on an iPhone X is larger than the entire main memory of the Cray 1!
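A back-of-the-envelope check of the framebuffer claim. None of the numbers below come from this thread; they are assumptions: a 2436 x 1125 panel for the iPhone X, 4 bytes per pixel for a single buffer, and a maximum Cray-1 configuration of 1,048,576 64-bit words.

```python
# ASSUMPTIONS (not stated in the thread): panel resolution, pixel format,
# and the largest Cray-1 memory configuration.
IPHONE_X_WIDTH = 2436
IPHONE_X_HEIGHT = 1125
BYTES_PER_PIXEL = 4            # e.g. 8-bit RGBA, one buffer only

CRAY1_WORDS = 1_048_576        # 1 Mi words
CRAY1_BYTES_PER_WORD = 8       # 64-bit words

framebuffer_bytes = IPHONE_X_WIDTH * IPHONE_X_HEIGHT * BYTES_PER_PIXEL
cray1_memory_bytes = CRAY1_WORDS * CRAY1_BYTES_PER_WORD

print(f"framebuffer: {framebuffer_bytes / 2**20:.1f} MiB")
print(f"Cray-1 RAM:  {cray1_memory_bytes / 2**20:.1f} MiB")
print(f"ratio:       {framebuffer_bytes / cray1_memory_bytes:.2f}x")
```

Under these assumptions a single frame is somewhat larger than the Cray-1's whole main memory, and double or triple buffering would widen the gap.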
Mostly a software issue.
To me, this is reminiscent of a time when people never thought we'd own personal computers because "what's the use?" I'm putting my bets on the technology making its way into consumer products as it matures.
Not if the device has to be at absolute zero, it's not.
I don't see that requirement relaxing in the next ten years.
"There is no cloud it's just someone else's computer "
I like to think of the cloud as a ski-rental.