arijun 10 days ago

In a few areas of rust we are starting to see convergence on lower-level libraries that can then be shared among the higher-level crates. For example, wgpu is seeing broad use across a bunch of libraries, from game engines to UI libraries. This allows the shared libraries to be made more robust, with more pooled resources going into them.

Does anyone know how much of this is happening in the matrix/array space in rust? There are several libraries that have overlapping goals: ndarray, nalgebra, etc. How much do they share in terms of underlying code? Do they share data structures, or is anything like that on the horizon?

  • sarah-ek 10 days ago

    as far as i know, very little is shared. ndarray-linalg is mostly a lapack wrapper. nalgebra and faer both implement the algorithms from scratch, with nalgebra focusing more on smaller matrices

  • antimora 10 days ago

WGPU is also used by the Burn deep learning framework and by WONNX for computation.

heinrichhartman 10 days ago

This does not seem to depend on BLAS/LAPACK.

Good to see LU decomposition with full pivoting being implemented here (which is missing from BLAS/LAPACK). This gives a fast, numerically stable way to compute the rank of a matrix (with a basis of the kernel and image spaces). Details: https://www.heinrichhartmann.com/posts/2021-03-08-rank-decom....
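
In symbols, for anyone who wants the sketch (my notation, not necessarily the post's): full pivoting computes permutations $P$, $Q$ with

\[
P A Q = L U .
\]

If $r$ pivots $u_{ii}$ are numerically nonzero, then $\operatorname{rank}(A) = r$, the first $r$ columns of $P^\top L$ span the image of $A$, and a kernel basis comes from back-substitution on $U y = 0$, taking $x = Q y$.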

  • sarah-ek 10 days ago

    lapack does expose a full pivoting lu as far as i can tell? https://netlib.org/lapack/explore-html/d8/d4d/group__getc2_g...

    • bee_rider 10 days ago

      If you are going to include vs-MKL benchmarks in your repo, full pivoting LU might be one to consider. I think most people are happy with partial pivoting, so I sorta suspect Intel hasn’t heavily tuned their implementation; there might be room to beat up on the king of the hill, haha.

      • sarah-ek 10 days ago

        funny you mention that, the full pivoting. it's one of the few benchmarks where faer wins by a huge margin

        n     faer       mkl        openblas
        1024  27.06 ms   186.33 ms  793.26 ms
        1536  73.57 ms   605.71 ms  2.65 s
        2048  280.74 ms  1.53 s     8.99 s
        2560  867.15 ms  3.31 s     17.06 s
        3072  1.87 s     6.13 s     55.21 s
        3584  3.42 s     10.18 s    71.56 s
        4096  6.11 s     15.70 s    168.88 s

        • adgjlsfhk1 10 days ago

          it might be interesting to add butterfly lu https://arxiv.org/abs/1901.11371. it's a way of doing a numerically stable lu-like factorization without any pivoting, which allows it to parallelize better.

          • bee_rider 10 days ago

            It looks like they are describing a preconditioner there.

            • adgjlsfhk1 10 days ago

              the key point is that the preconditioner allows you to skip pivoting, which is really nice because pivoting introduces a lot of data dependencies.
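
              (the scheme from the paper, as i understand it: pick random butterfly matrices $U$, $V$ and factor $\tilde A = U^\top A V$ with pivot-free lu; a solve $Ax = b$ then becomes $\tilde A y = U^\top b$ followed by $x = V y$, so the expensive factorization step has no pivot-induced data dependencies.)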

          • sarah-ek 10 days ago

            looks interesting! thanks for sharing

    • heinrichhartman 10 days ago

      Thanks for pointing this out. Looks like it's just the Python bindings that are missing from numpy.

  • bmitc 10 days ago

    Is there a collection of algorithms published somewhere, in a coherent manner? It seems to me there should be a collection of matrix algebra, symbolic algebra, etc. algorithms somewhere that would make it easy to implement them in any language or system. As it is, it seems you need to already be an expert or researcher in the space to make any headway.

    • sevagh 10 days ago

      BLIS is an interesting new direction in that regard: https://github.com/flame/blis

      >The BLAS-like Library Instantiation Software (BLIS) framework is a new infrastructure for rapidly instantiating Basic Linear Algebra Subprograms (BLAS) functionality. Its fundamental innovation is that virtually all computation within level-2 (matrix-vector) and level-3 (matrix-matrix) BLAS operations can be expressed and optimized in terms of very simple kernels.
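
      To make "very simple kernels" concrete, here is a hypothetical Rust sketch of the shape of such a microkernel (not BLIS code; BLIS itself is C, and the real kernels use SIMD registers and careful packing):

          // The whole level-3 BLAS is built around a tiny routine like this:
          // accumulate an MR x NR tile of C from packed panels of A and B.
          const MR: usize = 4;
          const NR: usize = 4;

          // a: MR x k panel, packed so a[p * MR + i] = A[i][p]
          // b: k x NR panel, packed so b[p * NR + j] = B[p][j]
          // c: row-major tile of C with row stride ldc
          fn microkernel(k: usize, a: &[f64], b: &[f64], c: &mut [f64], ldc: usize) {
              for p in 0..k {
                  for i in 0..MR {
                      for j in 0..NR {
                          c[i * ldc + j] += a[p * MR + i] * b[p * NR + j];
                      }
                  }
              }
          }

      Everything else in the framework (blocking, packing, threading) is generic machinery wrapped around that one routine.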

      • bmitc 9 days ago

        Interesting. Thanks for the heads up!

    • rvdca 10 days ago

      One could argue that libraries like BLAS/LAPACK are those collections...

      • bmitc 10 days ago

        That can be argued, but they're pretty esoteric. For example, what language are they written in? Fortran, assembly, etc.? Probably not languages the majority of developers or mathematicians can read. And creating bindings for them is non-trivial. I don't even know where to start from the website. And on the website, all the links to the routines are broken. https://www.netlib.org/blas/

        • 1980phipsi 9 days ago

          The naming convention is a bit esoteric, but once you figure it out it's not too bad. I don't think writing bindings is too hard once you know what's going on. You most likely would want to write some kind of wrapper on top of that in order to make things a little more user-friendly.
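
          For example, the names pack precision, matrix type, and operation into a few letters:

              dgemm = d (double) + ge (general) + mm (matrix-matrix multiply)
              sgemv = s (single) + ge (general) + mv (matrix-vector multiply)
              dsyrk = d (double) + sy (symmetric) + rk (rank-k update)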

          I will agree with you on the links to the routines being broken. That happened within the past year. Not sure why, but it's annoying. If you know the name of the routine you can usually find documentation on another site.

bayindirh 10 days ago

Why is Eigen not run in parallel mode with OpenMP?

Eigen handles most (if not all, I just skimmed the tables) of these tasks in parallel [0]. Plus, it has hand-tuned SIMD code inside, so it needs "-march=native -mtune=native -O3" to go "full send".

Some solvers' speeds change by more than 3x with "-O3", to begin with.

This is the Eigen benchmark file [1].

[0]: https://eigen.tuxfamily.org/dox/TopicMultiThreading.html

[1]: https://github.com/sarah-ek/faer-rs/blob/main/faer-bench/eig...

  • sarah-ek 10 days ago

    author here, eigen is compiled with -fopenmp, which enables parallelism by default

    • bayindirh 10 days ago

      Hi! Thanks for chiming in!

      Did you check resource utilization? If you don't set "OMP_NUM_THREADS=n", Eigen doesn't auto-parallelize by default.

  • dannyz 10 days ago

    Related to the Eigen benchmarking: I see a lot of use of auto in the benchmarks. Eigen recommends against using it like this (https://eigen.tuxfamily.org/dox/TopicPitfalls.html) because the template expressions can be quite complicated. I'm not sure if it matters here or not, but it would probably be better not to use it in the benchmarks.

    • sarah-ek 10 days ago

      i've contributed to eigen in the past and know enough about the internals of the codebase to know my way around safe `auto` usage

      • dannyz 10 days ago

        I wasn't worried about safe usage, more that some of the initialization may be moved inside the benchmarking function instead of staying outside of it as intended. I'm sure you know more about it than me though.

    • bayindirh 10 days ago

      Beyond that, auto is a compile-time burden: it creates a lot of load during compilation of this benchmark.

      My complete Ph.D. codebase, using a ton of Eigen components plus other libraries, compiled in 10 seconds flat on a much older computer. This benchmark requires gigs of RAM plus a minute.

      • sarah-ek 10 days ago

        the long compile times are mostly because im instantiating every dense decomposition in the library in one translation unit, for several data types (f32, f64, f128, c32, c64, c128)

owlbite 10 days ago

Something about the benchmarking here looks dubious to me.

Top-tier numerical linear algebra libraries should all hit the same number (give or take a few percent) for matrix multiply, because they're all achieving the same hardware peak performance.
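
For reference, a dense $n \times n$ multiply costs about $2n^3$ flops, so a tuned GEMM should take roughly

\[
t \approx \frac{2 n^3}{\text{sustained FLOP/s}} .
\]

As a made-up example, at $n = 4096$ that is about $1.37 \times 10^{11}$ flops, i.e. roughly 0.14 s on a machine sustaining 1 TFLOP/s, and every well-tuned library should land near that same figure.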

  • sarah-ek 10 days ago

    one issue that may be affecting the result is that openmp's threadpool doesn't play well with rayon's. i've seen some perf degradation in the past (on both sides) when both are used in the same program

    i plan to address that after refactoring the benches by executing each library individually

    • jimbobraggins 10 days ago

      Very cool project! I'd suggest, before running the new benchmarks, reaching out to the developers of the packages you are testing against to see if they think your benchmark is using efficient calling conventions. I work on a large open source software project and we've had people claim they are 10x faster than us while actually using our code in some very malformed ways.

      Also stops them from grumbling after you post good results!

      • sarah-ek 10 days ago

        fair enough. i try to stay in touch with the eigen and nalgebra developers so i have a good idea on how to write code efficiently with them. for openblas and mkl i've been trying recently to call into the lapack api (benches doing that are still unpublished at the moment), that way im using a universal interface for that kinda stuff.

        and of course i do check the cpu utilization to make sure that all threads are spinning for multithreaded benchmarks, and occasionally check the assembly of the hot loops to make sure that the libraries were built properly and are dispatching to the right code. (avx2, avx512, etc) so overall i try to take it seriously and i'll give credit where credit is due when it turns out another library is faster

meisel 10 days ago

Looking at thin matrix SVD, it appears much faster than everyone else. I’m curious what it’s doing differently at a high level and if there’s any tradeoff in accuracy. I also wonder how it compares to MKL, which is typically the winner in all these benchmarks on Intel.

  • sarah-ek 10 days ago

    im in the process of refactoring the benchmark code at the moment, and plan to include mkl in the benches soon.

    overall, the results show that faer is usually faster than or on par with openblas, and slower than mkl on my desktop

    • mjhay 10 days ago

      Wow, that's impressive! I wouldn't expect anything to be able to beat MKL, given the optimizations made based on proprietary information.

andyferris 10 days ago

Does anyone understand the difference (in the benchmark tables) between faer and faer(par)? Which number should be considered important?

  • llimllib 10 days ago

    the former is run with no parallelism, and the latter with `parallelism::Rayon(0)` [1], which presumably means that it has some parallelism, though I don't write rust or know what Rayon is.

    [1]: https://github.com/sarah-ek/faer-rs/blob/172f651fafe625a2534...

    • nindalf 10 days ago

      Rayon is a crate that makes it easy to write code that executes in parallel. Take a function that looks like this. It'll execute on a single thread.

          fn sum_of_squares(input: &[i32]) -> i32 {
              input.iter()
                   .map(|&i| i * i)
                   .sum()
          }

      If you change it to input.par_iter(), it executes in parallel on multiple threads.
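
      Put together, a minimal runnable sketch of the parallel version (essentially the example from rayon's own docs):

          use rayon::prelude::*; // brings par_iter() into scope

          fn sum_of_squares(input: &[i32]) -> i32 {
              input.par_iter()      // rayon splits the work across a thread pool
                   .map(|&i| i * i)
                   .sum()
          }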

snowAbstraction 9 days ago

For the larger performance diffs, has anyone looked into why? Are there a couple of common reasons? I'd really like to know. Thanks

  • sarah-ek 9 days ago

    i have, yes. i can't speak for openblas or mkl, but im familiar with eigen and nalgebra's implementations to some extent

    nalgebra doesn't use blocking, so decompositions are handled one column (or row) at a time. this is great for small matrices, but scales poorly for larger ones (see the sketch below)

    eigen uses blocking for most decompositions, other than the eigendecomposition, but they don't have a proper threading framework. the only operation that is properly multithreaded is matrix multiplication using openmp (and the unstable tensor module using a custom thread pool)
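
    to illustrate what blocking buys you, here's a hypothetical sketch (not faer's or nalgebra's actual code) of unblocked lu without pivoting. every column triggers a rank-1 update that re-touches the whole trailing submatrix, so the algorithm is memory-bound for large n:

        // hypothetical sketch: unblocked LU without pivoting,
        // on a flat row-major n x n matrix
        fn lu_unblocked(a: &mut [f64], n: usize) {
            for k in 0..n {
                for i in k + 1..n {
                    a[i * n + k] /= a[k * n + k]; // multiplier l[i][k]
                    for j in k + 1..n {
                        // rank-1 update of the trailing submatrix:
                        // O(n^2) memory traffic per column
                        a[i * n + j] -= a[i * n + k] * a[k * n + j];
                    }
                }
            }
        }

    a blocked version factors a panel of, say, 64 columns with a kernel like that, then applies the whole trailing update as one matrix multiply, which is compute-bound and parallelizes much better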

rayrrr 10 days ago

How exactly does this dovetail with https://github.com/rust-or/good_lp ? Will it be a replacement, an enhancement, or something else?

  • sestep 10 days ago

    Seems like a different focus, no? Aren't linear programming solvers a much narrower domain than general linear algebra libraries?

    • fastneutron 10 days ago

      Correct, linear algebra and linear programming are two very distinct things. The latter is a widely-used optimization technique, and computationally depends on the former, which is a general mathematical framework for essentially all numerical computing.
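
      Concretely, a linear program is an optimization problem of the form

      \[
      \min_{x}\; c^\top x \quad \text{subject to} \quad A x \le b,\; x \ge 0,
      \]

      and the solvers that attack it (simplex, interior-point methods) lean on linear algebra primitives, like the factorizations benchmarked here, under the hood.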

  • electrozav 10 days ago

    That's code for linear programming (optimization), not for linear algebra.

  • rayrrr 10 days ago

    I asked because Faer describes itself as a low-level linear algebra library with a high-level wrapper, whereas good_lp describes itself as a high-level wrapper around many relevant low-level libraries.

    • constantcrying 10 days ago

      Linear programming and linear algebra aren't the same thing.

Ericson2314 10 days ago

Nixpkgs has some pluggable BLAS/LAPACK implementation infra. If this offers a shim layer implementing exactly that interface, it would be nice to see it packaged as a new alternative!

netwrt 10 days ago

Does this include an einsum function? I find that it makes complex operations easier to work with

  • sarah-ek 10 days ago

    not yet. a tensor api is on the long term todo list, but it's a big undertaking and i'd like to focus on matrix operations for the time being

jdeaton 10 days ago

Why not just call Eigen from rust?

  • sarah-ek 10 days ago

    because we can do better

    • 1980phipsi 9 days ago

      What features of Rust let you write faster code than C/C++/Fortran?

      • adgjlsfhk1 9 days ago

        most of it's just general programming niceness. If you have to spend a few hours wrestling with make/bazel/etc every time you reach for a dependency, you don't depend on things and have to rewrite them yourself. If your programming language doesn't have good ways of writing generic code, you either have to write the code once per precision (and do all your bugfixes/perf improvements in triplicate), do very hacky metaprogramming where you use Python to generate your low-level code (yes, the Fortran people actually do this: https://github.com/aradi/fypp), or use the C preprocessor, which is even worse.
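
        As a minimal sketch of the "write it once per precision" point (using the num-traits crate; this is not faer's code):

            // one generic routine covers every float type; the compiler
            // emits a specialized, fully optimized version per precision.
            use num_traits::Float;

            fn dot<T: Float>(x: &[T], y: &[T]) -> T {
                x.iter()
                    .zip(y)
                    .fold(T::zero(), |acc, (&a, &b)| acc + a * b)
            }

            fn main() {
                let x: Vec<f32> = vec![1.0, 2.0, 3.0];
                let y: Vec<f32> = vec![4.0, 5.0, 6.0];
                let xd: Vec<f64> = vec![1.0, 2.0, 3.0];
                let yd: Vec<f64> = vec![4.0, 5.0, 6.0];
                // same source, two precisions, no duplication
                println!("{} {}", dot(&x, &y), dot(&xd, &yd));
            }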