arijun 10 days ago

In a few areas of rust we are starting to see convergence on lower-level libraries that can then be shared among the higher-level crates. For example, wgpu is seeing broad use across a bunch of libraries, from game engines to UI libraries. This allows the shared libraries to be made more robust, with more pooled resources going into them.

Does anyone know how much of this is happening in the matrix/array space in rust? There are several libraries that have overlapping goals: ndarray, nalgebra, etc. How much do they share in terms of underlying code? Do they share data structures, or is anything like that on the horizon?

  • sarah-ek 10 days ago

    as far as i know, very little is shared. ndarray-linalg is mostly a lapack wrapper. nalgebra and faer both implement the algorithms from scratch, with nalgebra focusing more on smaller matrices

  • antimora 10 days ago

WGPU is also used by the Burn deep learning framework and by WONNX for computation.

heinrichhartman 10 days ago

This does not seem to depend on BLAS/LAPACK.

Good to see LU decomposition with full pivoting being implemented here (which is missing from BLAS/LAPACK). This gives a fast, numerically stable way to compute the rank of a matrix (with a basis of the kernel and image spaces). Details: https://www.heinrichhartmann.com/posts/2021-03-08-rank-decom....
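
In symbols, for anyone who wants the sketch (my notation, not necessarily the post's): full pivoting computes permutations $P$, $Q$ with

\[
P A Q = L U .
\]

If $r$ pivots $u_{ii}$ are numerically nonzero, then $\operatorname{rank}(A) = r$, the first $r$ columns of $P^\top L$ span the image of $A$, and a kernel basis comes from back-substitution on $U y = 0$, taking $x = Q y$.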

  • sarah-ek 10 days ago

    lapack does expose a full pivoting lu as far as i can tell? https://netlib.org/lapack/explore-html/d8/d4d/group__getc2_g...

    • bee_rider 10 days ago

      If you are going to include vs-MKL benchmarks in your repo, full pivoting LU might be one to consider. I think most people are happy with partial pivoting, so I sorta suspect Intel hasn’t heavily tuned their implementation; there might be room to beat up on the king of the hill, haha.

      • sarah-ek 10 days ago

        funny you mention that, the full pivoting. it's one of the few benchmarks where faer wins by a huge margin

        n     faer       mkl        openblas
        1024  27.06 ms   186.33 ms  793.26 ms
        1536  73.57 ms   605.71 ms  2.65 s
        2048  280.74 ms  1.53 s     8.99 s
        2560  867.15 ms  3.31 s     17.06 s
        3072  1.87 s     6.13 s     55.21 s
        3584  3.42 s     10.18 s    71.56 s
        4096  6.11 s     15.70 s    168.88 s

        • adgjlsfhk1 10 days ago

          it might be interesting to add butterfly lu https://arxiv.org/abs/1901.11371. it's a way of doing a numerically stable lu-like factorization without any pivoting, which allows it to parallelize better.

          • bee_rider 10 days ago

            It looks like they are describing a preconditioner there.

            • adgjlsfhk1 10 days ago

              the key point is that the preconditioner allows you to skip pivoting, which is really nice because pivoting introduces a lot of data dependencies.
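
              (the scheme from the paper, as i understand it: pick random butterfly matrices $U$, $V$ and factor $\tilde A = U^\top A V$ with pivot-free lu; a solve $Ax = b$ then becomes $\tilde A y = U^\top b$ followed by $x = V y$, so the expensive factorization step has no pivot-induced data dependencies.)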

          • sarah-ek 10 days ago

            looks interesting! thanks for sharing

    • heinrichhartman 10 days ago

      Thanks for pointing this out. Looks like it's just the Python bindings that are missing from numpy.

  • bmitc 10 days ago

    Is there a collection of algorithms published somewhere, in a coherent manner? It seems to me there should be a collection of matrix algebra, symbolic algebra, etc. algorithms somewhere that would make it easy to implement them in any language or system. As it is, it seems you need to already be an expert or researcher in the space to make any headway.

    • sevagh 10 days ago

      BLIS is an interesting new direction in that regard: https://github.com/flame/blis

      >The BLAS-like Library Instantiation Software (BLIS) framework is a new infrastructure for rapidly instantiating Basic Linear Algebra Subprograms (BLAS) functionality. Its fundamental innovation is that virtually all computation within level-2 (matrix-vector) and level-3 (matrix-matrix) BLAS operations can be expressed and optimized in terms of very simple kernels.
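
      To make "very simple kernels" concrete, here is a hypothetical Rust sketch of the shape of such a microkernel (not BLIS code; BLIS itself is C, and the real kernels use SIMD registers and careful packing):

          // The whole level-3 BLAS is built around a tiny routine like this:
          // accumulate an MR x NR tile of C from packed panels of A and B.
          const MR: usize = 4;
          const NR: usize = 4;

          // a: MR x k panel, packed so a[p * MR + i] = A[i][p]
          // b: k x NR panel, packed so b[p * NR + j] = B[p][j]
          // c: row-major tile of C with row stride ldc
          fn microkernel(k: usize, a: &[f64], b: &[f64], c: &mut [f64], ldc: usize) {
              for p in 0..k {
                  for i in 0..MR {
                      for j in 0..NR {
                          c[i * ldc + j] += a[p * MR + i] * b[p * NR + j];
                      }
                  }
              }
          }

      Everything else in the framework (blocking, packing, threading) is generic machinery wrapped around that one routine.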

      • bmitc 9 days ago

        Interesting. Thanks for the heads up!

    • rvdca 10 days ago

      One could argue that libraries like BLAS/LAPACK are those collections...

      • bmitc 10 days ago

        That can be argued, but they're pretty esoteric. For example, what language are they written in? Fortran, assembly, etc.? Probably not languages the majority of developers or mathematicians can read. And creating bindings for them is non-trivial. I don't even know where to start from the website. And on the website, all the links to the routines are broken. https://www.netlib.org/blas/

        • 1980phipsi 9 days ago

          The naming convention is a bit esoteric, but once you figure it out it's not too bad. I don't think writing bindings is too hard once you know what's going on. You most likely would want to write some kind of wrapper on top of that in order to make things a little more user-friendly.
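
          For example, the names pack precision, matrix type, and operation into a few letters:

              dgemm = d (double) + ge (general) + mm (matrix-matrix multiply)
              sgemv = s (single) + ge (general) + mv (matrix-vector multiply)
              dsyrk = d (double) + sy (symmetric) + rk (rank-k update)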

          I will agree with you on the links to the routines being broken. That happened within the past year. Not sure why, but it's annoying. If you know the name of the routine you can usually find documentation on another site.

bayindirh 10 days ago

Why is Eigen not run in parallel mode with OpenMP?

Eigen handles most (if not all, I just skimmed the tables) of these tasks in parallel [0]. Plus, it has hand-tuned SIMD code inside, so it needs "-march=native -mtune=native -O3" to go "full send".

Some solvers' speeds change by more than 3x with "-O3", to begin with.

This is the Eigen benchmark file [1].

[0]: https://eigen.tuxfamily.org/dox/TopicMultiThreading.html

[1]: https://github.com/sarah-ek/faer-rs/blob/main/faer-bench/eig...

  • sarah-ek 10 days ago

    author here, eigen is compiled with -fopenmp, which enables parallelism by default

    • bayindirh 10 days ago

      Hi! Thanks for chiming in!

      Did you check resource utilization? If you don't set "OMP_NUM_THREADS=n", Eigen doesn't auto-parallelize by default.

  • dannyz 10 days ago

    Related to the Eigen benchmarking: I see a lot of use of auto in the benchmarks. Eigen recommends against using it like this (https://eigen.tuxfamily.org/dox/TopicPitfalls.html) because the template expressions can be quite complicated. I'm not sure if it matters here or not, but it would probably be better not to use it in the benchmarks.

    • sarah-ek 10 days ago

      i've contributed to eigen in the past and know enough about the internals of the codebase to know my way around safe `auto` usage

      • dannyz 10 days ago

        I wasn't worried about safe usage, more that some of the initialization may be moved inside the benchmarking function instead of staying outside of it as intended. I'm sure you know more about it than me though.

    • bayindirh 10 days ago

      Beyond that, auto is a compile-time burden: it creates a lot of load during compilation of this benchmark.

      My complete Ph.D. codebase, using a ton of Eigen components plus other libraries, compiled in 10 seconds flat on a much older computer. This benchmark requires gigs of RAM plus a minute.

      • sarah-ek 10 days ago

        the long compile times are mostly because im instantiating every dense decomposition in the library in one translation unit, for several data types (f32, f64, f128, c32, c64, c128)

owlbite 10 days ago

Something about the benchmarking here looks dubious to me.

Top-tier numerical linear algebra libraries should all hit the same number (give or take a few percent) for matrix multiply, because they're all achieving the same hardware peak performance.
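
For reference, a dense $n \times n$ multiply costs about $2n^3$ flops, so a tuned GEMM should take roughly

\[
t \approx \frac{2 n^3}{\text{sustained FLOP/s}} .
\]

As a made-up example, at $n = 4096$ that is about $1.37 \times 10^{11}$ flops, i.e. roughly 0.14 s on a machine sustaining 1 TFLOP/s, and every well-tuned library should land near that same figure.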

  • sarah-ek 10 days ago

    one issue that may be affecting the result is that openmp's threadpool doesn't play well with rayon's. i've seen some perf degradation in the past (on both sides) when both are used in the same program

    i plan to address that after refactoring the benches by executing each library individually

    • jimbobraggins 10 days ago

      Very cool project! I'd suggest, before running the new benchmarks, reaching out to the developers of the packages you are testing against to see if they think your benchmark is using efficient calling conventions. I work on a large open source software project and we've had people claim they are 10x faster than us while actually using our code in some very malformed ways.

      Also stops them from grumbling after you post good results!

      • sarah-ek 10 days ago

        fair enough. i try to stay in touch with the eigen and nalgebra developers so i have a good idea on how to write code efficiently with them. for openblas and mkl i've been trying recently to call into the lapack api (benches doing that are still unpublished at the moment), that way im using a universal interface for that kinda stuff.

        and of course i do check the cpu utilization to make sure that all threads are spinning for multithreaded benchmarks, and occasionally check the assembly of the hot loops to make sure that the libraries were built properly and are dispatching to the right code. (avx2, avx512, etc) so overall i try to take it seriously and i'll give credit where credit is due when it turns out another library is faster

meisel 10 days ago

Looking at thin matrix SVD, it appears much faster than everyone else. I’m curious what it’s doing differently at a high level and if there’s any tradeoff in accuracy. I also wonder how it compares to MKL, which is typically the winner in all these benchmarks on Intel.

  • sarah-ek 10 days ago

    im in the process of refactoring the benchmark code at the moment, and plan to include mkl in the benches soon.

    overall, the results show that faer is usually faster than or on par with openblas, and slower than mkl on my desktop

    • mjhay 10 days ago

      Wow, that's impressive! I wouldn't expect anything to be able to beat MKL, given the optimizations made based on proprietary information.

andyferris 10 days ago

Does anyone understand the difference (in the benchmark tables) between faer and faer(par)? Which number should be considered important?

  • llimllib 10 days ago

    the former is run with no parallelism, and the latter with `parallelism::Rayon(0)` [1], which presumably means that it has some parallelism, though I don't write rust or know what Rayon is.

    [1]: https://github.com/sarah-ek/faer-rs/blob/172f651fafe625a2534...

    • nindalf 10 days ago

      Rayon is a crate that makes it easy to write code that executes in parallel. Take a function that looks like this. It'll execute on a single thread.

          fn sum_of_squares(input: &[i32]) -> i32 {
              input.iter()
                   .map(|&i| i * i)
                   .sum()
          }

      If you change it to input.par_iter(), it executes in parallel on multiple threads.
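
      Put together, a minimal runnable sketch of the parallel version (essentially the example from rayon's own docs):

          use rayon::prelude::*; // brings par_iter() into scope

          fn sum_of_squares(input: &[i32]) -> i32 {
              input.par_iter()      // rayon splits the work across a thread pool
                   .map(|&i| i * i)
                   .sum()
          }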

snowAbstraction 9 days ago

For the larger performance diffs, has anyone looked into why? Are there a couple of common reasons? I'd really like to know. Thanks

  • sarah-ek 9 days ago

    i have, yes. i can't speak for openblas or mkl, but im familiar with eigen and nalgebra's implementations to some extent

    nalgebra doesn't use blocking, so decompositions are handled one column (or row) at a time. this is great for small matrices, but scales poorly for larger ones (see the sketch below)

    eigen uses blocking for most decompositions, other than the eigendecomposition, but they don't have a proper threading framework. the only operation that is properly multithreaded is matrix multiplication using openmp (and the unstable tensor module using a custom thread pool)
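
    to illustrate what blocking buys you, here's a hypothetical sketch (not faer's or nalgebra's actual code) of unblocked lu without pivoting. every column triggers a rank-1 update that re-touches the whole trailing submatrix, so the algorithm is memory-bound for large n:

        // hypothetical sketch: unblocked LU without pivoting,
        // on a flat row-major n x n matrix
        fn lu_unblocked(a: &mut [f64], n: usize) {
            for k in 0..n {
                for i in k + 1..n {
                    a[i * n + k] /= a[k * n + k]; // multiplier l[i][k]
                    for j in k + 1..n {
                        // rank-1 update of the trailing submatrix:
                        // O(n^2) memory traffic per column
                        a[i * n + j] -= a[i * n + k] * a[k * n + j];
                    }
                }
            }
        }

    a blocked version factors a panel of, say, 64 columns with a kernel like that, then applies the whole trailing update as one matrix multiply, which is compute-bound and parallelizes much better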

rayrrr 10 days ago

How exactly does this dovetail with https://github.com/rust-or/good_lp ? Will it be a replacement, an enhancement, or something else?

  • sestep 10 days ago

    Seems like a different focus, no? Aren't linear programming solvers a much narrower domain than general linear algebra libraries?

    • fastneutron 10 days ago

      Correct, linear algebra and linear programming are two very distinct things. The latter is a widely-used optimization technique, and computationally depends on the former, which is a general mathematical framework for essentially all numerical computing.
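
      Concretely, a linear program is an optimization problem of the form

      \[
      \min_{x}\; c^\top x \quad \text{subject to} \quad A x \le b,\; x \ge 0,
      \]

      and the solvers that attack it (simplex, interior-point methods) lean on linear algebra primitives, like the factorizations benchmarked here, under the hood.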

  • electrozav 10 days ago

    That's code for linear programming (optimization), not for linear algebra.

  • rayrrr 10 days ago

    I asked because Faer describes itself as a low-level linear algebra library with a high-level wrapper, whereas good_lp describes itself as a high-level wrapper around many relevant low-level libraries.

    • constantcrying 10 days ago

      Linear programming and linear algebra aren't the same thing.

Ericson2314 10 days ago

Nixpkgs has some pluggable BLAS/LAPACK implementation infra. If this offers a shim layer implementing exactly that interface, it would be nice to see it packaged as a new alternative!

netwrt 10 days ago

Does this include an einsum function? I find that it makes complex operations easier to work with

  • sarah-ek 10 days ago

    not yet. a tensor api is on the long term todo list, but it's a big undertaking and i'd like to focus on matrix operations for the time being

jdeaton 10 days ago

Why not just call Eigen from rust?

  • sarah-ek 10 days ago

    because we can do better

    • 1980phipsi 9 days ago

      What features of Rust let you write faster code than C/C++/Fortran?

      • adgjlsfhk1 9 days ago

        most of it's just general programming niceness. If you have to spend a few hours wrestling with make/bazel/etc every time you reach for a dependency, you don't depend on things and have to rewrite them yourself. If your programming language doesn't have good ways of writing generic code, you either have to write the code once per precision (and do all your bugfixes/perf improvements in triplicate), do very hacky metaprogramming where you use Python to generate your low-level code (yes, the Fortran people actually do this: https://github.com/aradi/fypp), or use the C preprocessor, which is even worse.
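
        As a minimal sketch of the "write it once per precision" point (using the num-traits crate; this is not faer's code):

            // one generic routine covers every float type; the compiler
            // emits a specialized, fully optimized version per precision.
            use num_traits::Float;

            fn dot<T: Float>(x: &[T], y: &[T]) -> T {
                x.iter()
                    .zip(y)
                    .fold(T::zero(), |acc, (&a, &b)| acc + a * b)
            }

            fn main() {
                let x: Vec<f32> = vec![1.0, 2.0, 3.0];
                let y: Vec<f32> = vec![4.0, 5.0, 6.0];
                let xd: Vec<f64> = vec![1.0, 2.0, 3.0];
                let yd: Vec<f64> = vec![4.0, 5.0, 6.0];
                // same source, two precisions, no duplication
                println!("{} {}", dot(&x, &y), dot(&xd, &yd));
            }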