zachbee 10 days ago

I'm curious what the motivation is here -- unfortunately, the dev blog is all in Chinese and I can't read it. If it's mostly to show a proof-of-concept of LLMs on a FPGA, that's awesome!

But if this is targeting real-world applications, I'd have concerns about price-to-performance. High-level synthesis tools often result in fairly poor performance compared to writing Verilog or SystemVerilog. Also, AI-focused SoCs like the Nvidia Jetson usually offer better price-to-performance and performance-per-watt than FPGA systems like the KV260.

Potentially focusing on specialized transformer architectures with high sparsity or significant quantization could give FPGAs an advantage over AI chips, though.

Not to toot my own horn, but I wrote up a piece on open-source FPGA development recently going a bit deeper into some of these insights, and why AI might not be the best use-case for open-source FPGA applications: https://www.zach.be/p/how-to-build-a-commercial-open-source

  • imtringued 10 days ago

    AMD hasn't shipped their "high compute" SOMs, so there is little point in building inference around them. Using programmable logic for machine learning is a complete waste, since Xilinx never shied away from sprinkling lots of "AI Engines" on their bigger FPGAs, to the point where buying the FPGA just for the AI Engines might be worth it: 100s of VLIW cores pack a serious punch for running numerical simulations.

    • khoss 9 days ago

      AMD actually also does inference on PL, with reasonable commercial success. Have a look at FINN.

  • UncleOxidant 10 days ago

    There have been some recent investigations into bitnets (1- or 2-bit weights for NNs, including LLMs) showing that 1.58-bit weights (with values -1, 0, 1) can achieve very good results. Effectively that's 2 bits per weight. The problem is that doing 2-bit math on a CPU or GPU isn't going to be very efficient (lots of shifting & masking). But doing 2-bit math on an FPGA is really easy and space-efficient. Another bonus is that many of the matrix multiplications are replaced by additions. Right now, if you want to investigate these smaller weight sizes, FPGAs are probably the best option.
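    To make the "multiplications become additions" point concrete, here's a minimal sketch (my own illustration, not from the thread) of a matrix-vector product with ternary weights in {-1, 0, +1}. Every "multiply" collapses into an add, a subtract, or a skip, which is exactly the kind of operation an FPGA can implement with a sliver of logic and no DSP multiplier:

    ```python
    def ternary_matvec(W, x):
        """W: rows of ternary weights (-1/0/+1); x: integer activations."""
        out = []
        for row in W:
            acc = 0
            for w, a in zip(row, x):
                if w == 1:       # +1 weight: accumulate the activation
                    acc += a
                elif w == -1:    # -1 weight: subtract it
                    acc -= a
                # w == 0: skip entirely (sparsity comes for free)
            out.append(acc)
        return out

    W = [[1, 0, -1], [-1, 1, 1]]
    x = [3, 5, 7]
    print(ternary_matvec(W, x))  # [-4, 9], same result as a full matmul
    ```

    No multiplier ever fires; the whole kernel is adders and a 2-bit decode.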

    > High-level synthesis tools often result in fairly poor performance compared to writing Verilog or SystemVerilog.

    Agreed.

    • rjeli 10 days ago

      I'm curious, do you have any intuition for what percent of the time is spent shifting & masking vs. adding & subtracting (int32s I think)? Probably about the same?
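      A rough picture of the overhead in question (my own sketch, with an assumed 2-bit encoding of 0b00 -> 0, 0b01 -> +1, 0b10 -> -1): on a CPU, four ternary weights get packed per byte, so every single weight access costs a shift and a mask before the add or subtract can even start. On an FPGA that decode is free wiring.

      ```python
      DECODE = {0b00: 0, 0b01: 1, 0b10: -1}  # assumed encoding, 2 bits per weight

      def unpack_weights(byte):
          """Extract four 2-bit ternary weights from one packed byte."""
          return [DECODE[(byte >> (2 * i)) & 0b11] for i in range(4)]

      packed = 0b10_00_01_01   # weights +1, +1, 0, -1 (low bits first)
      print(unpack_weights(packed))  # [1, 1, 0, -1]
      ```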

  • accurrent 10 days ago

    The blog's in Japanese, not Chinese.

Alifatisk 9 days ago

Is this what's known as FPGA acceleration?