nkurz 5 years ago

Adding a little bit of background here might help. Daniel defines what he means by "lane" in an earlier post: https://lemire.me/blog/2018/11/05/measuring-the-memory-level....

Restating, what he's doing is "pointer chasing", which is one of the slowest ways of accessing memory, because each load is delayed by the full latency of the lookup. Set up an array of random numbers, look up the value at a location, use the value you find as the next array location, then repeat. Because each load is data-dependent on the previous one, no prefetching is possible. So far, all data reported are for a single core.

Each self-contained "pointer chase" is a "lane". To increase the number of lanes, do multiple pointer chases in parallel: look up N array values in quick succession on the same core, use the values you find as the next array locations, one per lane, and repeat. You find that, up to a point, the total number of "chases" per second increases as you increase the number of lanes.
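
In rough C, one pass of an N-lane chase might look something like this (just a sketch, not Daniel's actual code; 'array' is assumed to hold a random cycle of indices into itself):

    #include <stddef.h>
    #include <stdint.h>

    // One pass of an N-lane pointer chase: 'cursors' holds each lane's
    // current position in 'array', which is a random cycle of indices.
    void chase(const uint32_t *array, uint32_t *cursors,
               size_t nlanes, size_t steps) {
        for (size_t s = 0; s < steps; s++) {
            // Each lane's next load depends only on that lane's previous
            // result, so up to 'nlanes' misses can be outstanding at once.
            for (size_t lane = 0; lane < nlanes; lane++) {
                cursors[lane] = array[cursors[lane]];
            }
        }
    }

With nlanes = 1 this is the pure latency-bound chase; raising nlanes is what lets the total chase rate climb until the hardware runs out of miss-tracking resources.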

Different processors "plateau" at different levels of parallelism. Intel processors max out at about 10 lanes, which implies that they can keep about 10 requests to RAM "in flight" at a time. This matches the published number of "fill buffers" for Intel chips. Apple's processors max out at over 20 lanes, implying that they can keep more than 20 requests in flight. Apple hasn't published how many "fill buffers" their chips have, so this is one of the first direct measurements.

One of the big questions being asked is: "If increasing the number of requests-in-flight is important to performance, why doesn't Intel just increase the number of fill buffers?" If you have an answer, please let us know, either here, or as a comment on the blog, or in response to Travis's question at RWT: https://www.realworldtech.com/forum/?threadid=181499.

  • davrosthedalek 5 years ago

    It might be interesting to know if these lanes are per logical core, physical core or CPU. If it's per core, it might be hard to find real-life code using more than 10 lanes outside of micro-benchmarks.

    • BeeOnRope 5 years ago

      The simple answer is that these lanes (buffers, I prefer to call them) are per logical core.

      The more complicated answer is that there are a variety of buffers involved in the "path to memory", starting with the core-private buffers between the core and L1 and L2, and moving to socket-shared buffers at the L3, memory controllers and DRAM.

      In this particular single-threaded test, it is the 10 core-private "line fill buffers" between the L1 and L2 that are the bottleneck, which is where the simple answer comes from. If you ran this test on enough cores at the same time, I would entirely expect the bottleneck to shift to one of the shared buffers.

    • mooman219 5 years ago

      I would bet that Intel has decent statistics on the average number of lanes used, and opted for 10 based on that. Since die space is at a premium, they either used the extra space where it's needed more, or shrunk the die by some small amount.

    • nicoburns 5 years ago

      I'm not an expert (so I could be completely wrong), but I could imagine in-memory data stores being bound by this kind of limit...

      • dragontamer 5 years ago

        AMD processors have 44 stores in flight, but only 22 loads in flight.

        I presume Intel processors have a similar difference between their "store buffers" and their "load buffers".

        • nkurz 5 years ago

          I'm not sure if we are using the same terminology, but I think Intel has the opposite arrangement, with more "load buffers" than "store buffers". In Table 2-7. L1 Data Cache Components from https://www.intel.com/content/dam/doc/manual/64-ia-32-archit..., it says:

                                    Sandy Bridge
            Load buffers            64 entries 
            Store buffers           36 entries
            Line fill buffers (LFB) 10 entries
          
          I think Haswell (and later?) are 72/42/10 per physical(?) core, also with more load than store. How does this match up with what you know about AMD? And do you have a clear number of LFB's (or whatever they call the buffers which are allocated only for L1D misses) for AMD?

          Edit: I now see the link you gave in your other comment. Hmm. I think the numbers you gave do correspond to the inverted ratio of Intel's, and I couldn't find any claimed number corresponding to LFBs. We do have a couple of AMD machines we test on, and I think we did test this, but I don't recall the exact numbers, other than a general recollection that they were about the same as Intel's. In his RWT post, Travis claims Ryzen has 8-10.

          • dragontamer 5 years ago

            > In his RWT post, Travis claims Ryzen as having 8-10.

            I wonder if they're statically partitioned: so that you only get ~11 per thread on Ryzen when SMT is activated.

            The AMD optimization manual is clear that the LS unit handles cache misses. Intel's manual is clear that the LFB handles cache misses.

            Load/store buffers seem to be a more general mechanism for micro-op reordering.

            --------

            It appears I've been mistaken btw. Let me post the full text of the section:

            > The LS unit includes a 44-entry load queue (LDQ). The LDQ receives load operations at dispatch. Loads leave the LDQ when the load has completed and delivered data to the integer unit or the floating-point unit.

            > The LS unit utilizes a 44-entry store queue which holds stores from dispatch until the store data can be written to the data cache

            ...

            > The LS unit can track up to 22 outstanding in-flight cache misses.

            So that's a 44-entry load queue, a 44-entry store queue, and 22 in-flight cache misses for AMD Ryzen. Presumably this is split somehow between hyperthreads (which would lead to the ~10 number you suggested).

            • BeeOnRope 5 years ago

              Load and store queues are different from the "miss buffers" we are talking about here. Every load and store needs one of these entries from allocation until retirement, so there are usually a lot of them, something like 20% to 40% of the size of the ROB (re-order buffer) each, since many code sequences have many loads and stores.

              The miss buffers are different in that they only get involved when there is an L1 miss: they are only used from the moment the miss is detected until the miss completes. Also, load and store buffers are used 1:1 with load and store instructions (including instructions that combine a load, a store, or both with an ALU operation, such as load-op or RMW on x86). Miss buffers are 1:1 with outstanding misses at the granularity of cache lines. So if you load 8 consecutive 4-byte values that all fall in the same cache line, and that line misses in L1, you'll have 8 load buffer entries in use but only 1 fill buffer, since the buffer coalesces all misses to the same line.
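
              As a concrete illustration (my sketch, not code from the article): both functions below do 8 loads, but if everything misses in L1, only the second can occupy up to 8 fill buffers at once:

                #include <stdint.h>

                // 8 consecutive 4-byte loads from one 64-byte line: 8 load-buffer
                // entries, but all misses coalesce into a single fill buffer.
                uint32_t sum_one_line(const uint32_t *p) {      // p assumed 64-byte aligned
                    uint32_t s = 0;
                    for (int i = 0; i < 8; i++) s += p[i];
                    return s;
                }

                // 8 loads striding one full cache line apart: up to 8 distinct
                // misses, so up to 8 fill buffers can be in flight.
                uint32_t sum_eight_lines(const uint32_t *p) {
                    uint32_t s = 0;
                    for (int i = 0; i < 8; i++) s += p[i * 16]; // 16 * 4 B = 64 B stride
                    return s;
                }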

  • toast0 5 years ago

    At the end of the day, these requests are going to RAM, which fetches one line of memory at a time, per channel. More buffers here just becomes more queuing elsewhere, and means more cache pollution if you mispredict and have to flush the pipeline.

    Real-world programs are probably doing some work with the data they're reading, or their data structures may be at least somewhat contiguous.

    • tlb 5 years ago

      That's not true, as shown by the benchmarks where 10 lanes makes it go nearly 10x faster. Modern memory systems have a lot of banking and interleaving that allows several requests to be making progress simultaneously.

    • BeeOnRope 5 years ago

      The time to access the DRAM itself is actually a relatively small part of the total. Most of the time is spent getting from the core to the DRAM in the first place: as little as 10 ns out of the total of 50 to 100 ns is spent accessing the DRAM itself. So even if the DRAM access were totally serial, there would be a lot of opportunity to parallelize the rest of the path.

      Of course, as a sibling comment pointed out, DRAM access is generally far from serial: you have multiple channels which are inherently parallel, and even within a single channel you can use things like command overlapping to have the DIMM do more than one thing at once.

      The numbers bear this out: the single serial access pattern in Daniel's test gives me a latency of 80 ns per read on my Skylake client machine, but with a parallelism of about 10 that drops to about 8 ns per read, so the DRAM subsystem is serving up a cache line at least every 8 nanoseconds.

dragontamer 5 years ago

AMD's optimization guide for Zen (https://developer.amd.com/wordpress/media/2013/12/55723_SOG_...) suggests:

> The LS unit can track up to 22 outstanding in-flight cache misses.

So not as much as Apple, but more than Intel in this regard. I probably could test this, but I'm tired today, lol.

----------

10 outstanding loads per core seems like a small number, especially compared to contemporary processors (Apple's or AMD's). Maybe Intel knows something we don't.

  • awalton 5 years ago

    > Maybe Intel knows something we don't.

    Or maybe Intel's chips miss less often due to more sophisticated branch prediction and prefetching? The Pentium 4s were victims of excruciatingly painful pipeline flushes and stalls. It'd make sense that Intel would have invested heavily in understanding and mitigating these kinds of caching behaviors, and a lot of that surely was carried into the Nehalem/Core microarchitecture designs that most modern Intel chips are built around.

    • Taniwha 5 years ago

      I think the point of this benchmark is that you can't predict the load addresses, because they are effectively random.

dragontamer 5 years ago

Okay, I've given it some thought. Here's my theory:

ARM allows load/load, load/store, store/load, and store/store reordering, whereas Intel allows only store/load reordering. As such, ARM is more likely to benefit from deeper load buffering than Intel.

Remember that all of these systems are multi-core, which means we have to consider what these load buffers do in a multi-core situation. Here's some example code for "Thread 1":

Thread 1:

    register int RAX = 0;    // Register only: no load/store associated
    int a = 0, b = 0, c = 0; // "True memory" (shared with Core #2): accessing these involves a load/store
    while (1) {
        memory_barrier();
        a = RAX;
        RAX++;
        memory_barrier();
        b = RAX;
        RAX++;
        memory_barrier();
        c = RAX;
        RAX++;
    }
In effect, this thread would create the memory commands:

    memory barrier
    store [A], 0
    memory barrier
    store [B], 1
    memory barrier
    store [C], 2
    memory barrier
    store [A], 3
    memory barrier
    store [B], 4
    memory barrier
    store [C], 5
    (etc. etc)
For simplicity, I've added memory barriers to ensure that these writes happen in EXACTLY this order. No need to worry about ARM's store/store reordering in this example. (x86 does NOT need memory barriers, because store/store will always happen in order)

Now let's imagine a second core reading from these variables. Let's say Core #2 has the following loads in its load buffer:

    load rax, [A]
    load rbx, [B]
    load rcx, [C]
    # Note: no memory barriers!
With this ordering, an Intel processor guarantees that the loads appear to execute in program order: RBX can never observe a value of [B] older than the last store to [B] that preceded whatever store to [A] RAX observed, and likewise for RCX.

This is NOT true on ARM. On an ARM processor, the following result is possible:

    RAX = 6 # the load of [A] MAY execute last on ARM
    RBX = 1 # stale: store [B], 4 came before store [A], 6, so this can NEVER happen on x86
    RCX = 2 
Why? Because on ARM the load/load ordering can be broken by the core. That is, the "load rbx, [B]" instruction MAY be executed before the load of [A], despite coming later in program order.

This means that load buffers are far, far cheaper on ARM. But they also come at a cost to the programmer's sanity: if the ordering actually matters to you, you need to protect your code with memory barriers.

x86 has an implicit memory barrier on all stores and loads. This is slower and more costly. In effect, the load-buffer on x86 MUST be a proper queue: the first loads must execute first, while the last loads must execute last.

On ARM, this isn't the case: a load buffer of size 40 can have all of its loads execute simultaneously and out of order, much to the confusion of multi-threaded programmers, I'm sure. This example also demonstrates why, on weakly ordered systems, you need a memory barrier on BOTH threads. (Fixing the store/store reordering with barriers on the writer only solves half the problem.)

With that being said: modern programmers know that the only variables that need a defined ordering are synchronization structures, like spinlocks. Releasing a spinlock needs "release" semantics on its store (do NOT reorder anything from before this store to after it), and acquiring a spinlock needs "acquire" semantics on its load (do NOT reorder anything from after this load to before it).

Semaphores, atomics, and other synchronization structures also need a defined ordering. Accessing "normal" variables is understood by modern programmers to be unordered. So ARM has the advantage with regard to modern languages; x86's memory model is stricter than necessary.
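
For concreteness, here's a minimal sketch of that acquire/release pairing using C11 atomics (my own sketch, not code from the thread; spinlock_t / spin_lock / spin_unlock are just illustrative names):

    #include <stdatomic.h>

    typedef struct { atomic_flag locked; } spinlock_t;  // init with ATOMIC_FLAG_INIT

    void spin_lock(spinlock_t *l) {
        // Acquire: nothing after this may be reordered before the flag read.
        while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
            ;  // spin until we observe the flag clear
    }

    void spin_unlock(spinlock_t *l) {
        // Release: nothing before this may be reordered after the flag clear.
        atomic_flag_clear_explicit(&l->locked, memory_order_release);
    }

On x86, the acquire and release orderings compile down to ordinary loads and stores (plus the atomic RMW for the test-and-set); on ARM they become explicit acquire/release instructions or barriers, which is exactly the asymmetry described above.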

  • BeeOnRope 5 years ago

    Yes, this is a reasonable explanation of the ARM and Intel memory models, but I'm not following how this relates to miss buffers specifically.

    The load buffers which maintain the source/program ordering on a per-instruction basis as you describe above are called ... load buffers. These are different from the "miss buffers" which track outstanding misses. Intel chips have 70+ load buffers, so there is no shortage there. Yes, x86 chips have to jump through a bunch of hoops to keep ordering, which means their load buffers probably need to be more sophisticated (Intel sometimes refers to a MOB, a "memory order buffer", which tracks this stuff, but as far as I can tell this is not a separate buffer, just another name for the load and store buffers and associated machinery, with a focus on the memory ordering aspect).

    The way it works on Intel is that everything will speculatively execute out of order, loads passing loads and all the things the memory model doesn't officially allow, but the load buffer will track the original order. As long as there is no indication of an actual problem with the ordering (i.e., a store on another core which causes a snoop that hits the load buffer on this core), the out-of-order nature of the execution isn't visible, so everything can retire fine. If something happens that violates the order, a "MOB machine clear" (with a cost similar to a branch misprediction) occurs and the chip starts over.

    None of this directly implicates the miss buffers though. For one thing, they are not 1:1 with instructions and they are not instruction ordered, since multiple misses will coalesce to the same buffer.

    That said, I can certainly see the stronger x86 memory model playing a role here, but it isn't clear to me how.