Running LLMs and ML in Wasm

Searching for new runtimes for AI

I came up with an (obvious) idea the other day that led me to the following question: “would it be possible to run LLM inference from Wasm?” Being able to compile ML models into Wasm would allow us to run them on a heterogeneous set of runtimes and devices, including mobile or the browser. However, would running these models in Wasm offer access to the low-level computational resources of the device required for efficient execution?

And this is how my journey into ML in Wasm started.

A cartoon of a crab building an artificial intelligence robot and enjoying the work - SDXL

wasi-nn

A simple DuckDuckGo search for “ML in Wasm” immediately turns up wasi-nn as the key result. wasi-nn is a WASI API spec proposal for performing ML inference in Wasm. The proposal is not exactly oriented toward executing ML models inside Wasm, but toward exposing ML capabilities to Wasm modules. WebAssembly programs that want to use a host’s ML capabilities (like triggering the execution of computational graphs on the GPU) can access them through wasi-nn’s core abstractions: backends, graphs, and tensors.

“the nature of ML inference makes it amenable to hardware acceleration of various kinds; without this hardware acceleration, inference can suffer slowdowns of several hundred times. Hardware acceleration for ML is very diverse — SIMD (e.g., AVX512), GPUs, TPUs, FPGAs — and it is unlikely (impossible?) that all of these would be supported natively in WebAssembly”.

The best way to understand what wasi-nn proposes is to read the proposed API. It basically exposes to Wasm ways to create tensors, load a computational graph from a specific ML framework like TensorFlow or PyTorch, and trigger inference against a specific backend, as shown in the API walkthrough from the proposal:

// Load the model.
let encoding = wasi_nn::GRAPH_ENCODING_...;
let target = wasi_nn::EXECUTION_TARGET_CPU;
let graph = wasi_nn::load(&[bytes, more_bytes], encoding, target);

// Configure the execution context.
let context = wasi_nn::init_execution_context(graph);
let tensor = wasi_nn::Tensor { ... };
wasi_nn::set_input(context, 0, tensor);

// Compute the inference.
wasi_nn::compute(context);
wasi_nn::get_output(context, 0, &mut output_buffer, output_buffer.len());

This post is a good walk-through of the implementation rationale. And you may be wondering, “is someone already leveraging wasi-nn for their application?” They are. Second State has already been tinkering with wasi-nn to run LLM inference, as shown in this post, this talk, and run-llm.sh.

Re-write in Rust/Wasm

After going through wasi-nn I thought: “what if one could rewrite the inference graph of a model, in line with what llama.cpp did to run inference over Llama using C++, so that it can be compiled to Wasm?” “Rewriting in Rust” has been a trend in recent years, but it is sometimes not a good idea, and this may be one of those cases. The closest thing I could find that resembles running ML inference in Rust is tch-rs, which provides a wrapper around the C++ PyTorch API. However, the backend would still be C++, and this is far from allowing a Wasm-compiled inference implementation. Someone has even tried to rewrite a model from scratch from Python into Rust, with smaller improvements than originally expected.
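For illustration, inference through tch-rs looks roughly like the sketch below. This is a minimal, hand-written example: the “model.pt” path and the input shape are placeholders, and the method names reflect my reading of the crate rather than tested code. Everything interesting happens inside libtorch, the C++ library the crate links against at build time, which is precisely why this route doesn’t lead to a wasm32 target:

use tch::{CModule, Device, Kind, Tensor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // tch-rs is a binding over the C++ libtorch libraries, which the build
    // links against natively; that dependency is what rules out wasm32.
    let model = CModule::load("model.pt")?; // hypothetical TorchScript export

    // Dummy input; the (1, 3, 224, 224) shape is just an assumed image-like input.
    let input = Tensor::randn(&[1i64, 3, 224, 224], (Kind::Float, Device::Cpu));

    // Forward pass through the scripted graph, executed by libtorch (C++), not Rust.
    let output = model.forward_ts(&[input])?;
    println!("output shape: {:?}", output.size());
    Ok(())
}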

But if we think this through for a moment, it actually makes sense. One of the key requirements for ML inference (or training) is performing vast quantities of matrix multiplications. To accelerate these operations we use GPUs, and the most amenable language for interacting with GPUs is C++. We can write wrappers and bindings for the ML frameworks in many languages, but for now the backend will still be C++-based and compiled for specific hardware. So we may be far from “compiling once and running anywhere” for AI.
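To make the point concrete, here is the kind of kernel we’re talking about: a naive matrix multiply in plain Rust (an illustrative sketch of mine, not code from any of the projects mentioned). It compiles to wasm32 just fine, but it runs as scalar CPU code, nowhere near what tiled, massively parallel GPU kernels deliver:

/// Naive row-major matrix multiply: c = a (m x k) * b (k x n).
/// This triple loop is what GPUs and BLAS libraries accelerate with
/// tiling, SIMD, and thousands of parallel threads.
fn matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let a_ip = a[i * k + p];
            for j in 0..n {
                c[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
    c
}

fn main() {
    // A tiny 2x3 * 3x2 example; real transformer layers multiply matrices
    // with thousands of rows and columns, over and over per generated token.
    let a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let b = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0];
    let c = matmul(&a, &b, 2, 3, 2);
    println!("{c:?}"); // [4.0, 5.0, 10.0, 11.0]
}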

Rewrite in Rust

But don’t get me wrong, this doesn’t mean that Rust won’t end up having an important place in moving AI into production. A good example of this is how Exa.ai managed to improve the throughput of their real-time embeddings 4x by migrating from Python to Rust. This doesn’t exactly involve model inference (which was my original target), but it is something.

Candle - ML Rust Framework

And when I thought that everything was lost, and that running inference for AI models in Wasm was just not there yet, I came across Candle, a minimalist ML framework for Rust with a focus on performance. The framework includes GPU support, and supports compiling some models to Wasm! Skimming through their README you’ll already see that they provide implementations of several models like YOLO, Whisper, Phi, or LLaMA 2 that can be compiled to Wasm and run entirely in the browser. Just what I was looking for!
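To give a flavour of the API, the hello-world in Candle’s README looks roughly like this (reproduced from memory, so treat the exact signatures as approximate): the same matrix multiplication as before, but expressed through Candle’s Tensor and Device abstractions, which is what lets the same model code target different backends:

use candle_core::{Device, Tensor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Swap Device::Cpu for a CUDA device (or compile the crate to wasm32)
    // without touching the model code itself.
    let device = Device::Cpu;

    let a = Tensor::randn(0f32, 1.0, (2, 3), &device)?;
    let b = Tensor::randn(0f32, 1.0, (3, 4), &device)?;

    // The matmul is dispatched to whatever backend sits behind the Device.
    let c = a.matmul(&b)?;
    println!("{c}");
    Ok(())
}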

With this, I will be able to explore the idea that I mentioned at the beginning of the post. In the next few weeks I’ll tinker a bit and report back with the results. In the meantime, here’s an example of Phi1.5/Phi2 fully running in the browser.

I still need to dig a bit deeper into the codebase to see if these examples leverage the WebGL (or even the newer WebGPU) APIs to access the underlying GPU resources of a device, but in any case this gives me a great foundation to build upon.

ONNX Runtime

Throughout my research, I also came across ONNX Runtime, which seems to be the de facto runtime for running models implemented in different frameworks on a heterogeneous set of backends. “ONNX Runtime is compatible with different hardware, drivers, and operating systems, and provides optimal performance by leveraging hardware accelerators where applicable alongside graph optimizations and transforms”. ONNX Runtime offers ways to run inference over models on mobile devices and in web browsers.

How does this runtime work? You train your model in an ML framework like TensorFlow, PyTorch, or even traditional scikit-learn, export the model into the ONNX format, and then hand the output to ONNX Runtime, which runs inference on any of the backends it supports. So basically, as long as your model can be exported into the ONNX format, you can leverage the runtime to run it in the cloud, at the edge, in the browser, etc.

Similar to ONNX Runtime, there’s a project called MLC LLM focused on the deployment of LLMs over different runtimes.

Conclusion

A few years ago the bulk of AI was run in Jupyter notebooks and Python, but as the technology matures, and more and more of these models move from research into production, we see them deployed and executed on more and more runtimes and hardware. With my distributed systems background, I am honestly really excited to see (and hopefully contribute to) this jump of AI into every corner of the computational space.

Any comments, contributions, or feedback? Ping me!
