
WebAssembly, Large Language Models, and Kubernetes Matter

WebAssembly makes it quick and easy to download and run a complete LLM on a machine without any major setup.
Apr 30th, 2024 7:51am by
Feature image by Jose Castillo via Unsplash.

WebAssembly (WASM) makes it dead simple to develop, build, run, and operate the exact same code on any piece of hardware you can find under your desk, in your data center, in your AWS account, or on the control unit of a 30-ton harvester in a corn field.

I first talked with Fermyon’s CEO Matt Butcher about this vision back at KubeCon 2022 in Detroit; today, there are actual production-ready use cases that deliver tangible value.

LlamaEdge: One Line of Code to Run an LLM Anywhere

LlamaEdge, an open source project, promises that all it takes is pasting a single line of code into a terminal on basically any machine, and after a few seconds a browser will pop up showing a UI very similar to what we are used to from ChatGPT. Of course, we neither have the hardware to run ChatGPT on our laptops, nor does OpenAI offer that option from a licensing perspective. What we can run, however, are dozens of open source variants. By default, LlamaEdge installs a small version of Google’s Gemma LLM on the local machine for instant gratification, and it works great.

But how is it so quick and easy to download and run a complete LLM on my machine without any major setup? This is where WasmEdge comes in to save the day. LlamaEdge runs as pre-compiled code (bytecode) on top of the WasmEdge Runtime. All it takes is 30MB (not GB!) of disk space, plus the space needed to download your LLM of choice. Once downloaded, LlamaEdge takes advantage of WasmEdge’s ability to consistently provide CPU, GPU, RAM, and disk resources on top of basically any operating system (Windows, Linux and derivatives) and any silicon (Intel, AMD, Nvidia, etc.) without any advanced configuration needed. Crack open a terminal on your machine right now and check it out: This single command…

  … results in a UI without any further config needed.
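For reference, the quick-start one-liner published on the LlamaEdge GitHub page looked roughly like this at the time of writing (check the project README for the current version before running it):

    bash <(curl -sSfL 'https://raw.githubusercontent.com/LlamaEdge/LlamaEdge/main/run-llm.sh')

The script installs the WasmEdge runtime if it is not already present, pulls the default model, and opens the chat UI described above.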

Components Are the New Containers

“Components are the new containers,” said Liam Randall, CEO of Cosmonic. Considering that I was able to set up a complete LLM, including its ChatGPT-like UI, in under a minute on the same MacBook I’m writing this article on, Randall’s statement makes perfect sense. Had I installed the same LLM without WASM, I would have had to follow a number of macOS-specific steps: 1) install Homebrew, 2) install the required packages, 3) find and clone the desired Llama LLM, 4) install the Python dependencies, 5) convert and quantize the model file, and 6) test my install. But because I’m running WasmEdge, I do not have to worry about any of these steps, nor does a Python runtime even have to be present. LlamaEdge simply needs WasmEdge to run, and that’s it.
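For comparison, here is a rough sketch of that manual route, using the llama.cpp project purely as an illustration (script names, paths, and model files below are assumptions and vary by version):

    # 1-2) install build tooling via Homebrew
    brew install cmake python git
    # 3) clone and build a native inference engine such as llama.cpp
    git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make
    # 4) install the Python dependencies needed for model conversion
    pip install -r requirements.txt
    # 5) convert and quantize a downloaded model (paths are illustrative)
    python convert.py models/my-llama-model/
    ./quantize models/my-llama-model/ggml-model-f16.gguf models/my-llama-model/ggml-model-q4_0.gguf q4_0
    # 6) test the install
    ./main -m models/my-llama-model/ggml-model-q4_0.gguf -p "Hello"

None of this is needed once WasmEdge is in place.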

But Do I Need to Learn Rust?

As a Python developer, I would strongly prefer not having to learn Rust just to be able to use an LLM. All I need is one line on the command line to get the LLM set up, and another one if I want to select a specific LLM instead of the default:


The above command brings the user to a selection of out-of-the-box LLMs.
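For reference, my reading of the LlamaEdge README is that the model picker is invoked by adding a flag to the same quick-start script, roughly like this (the flag name is an assumption on my part and may have changed):

    bash <(curl -sSfL 'https://raw.githubusercontent.com/LlamaEdge/LlamaEdge/main/run-llm.sh') --interactive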

I still have not written a single line of actual Rust code; instead, I copied and pasted the needed commands from the LlamaEdge GitHub site, and now I can talk to my brand-new LLM. Going back to Randall’s statement that components are the new containers, I can now simply import this model as a component into any of my future Python apps. At the same time, I can share this component with my team or customers so that they can also incorporate my LLM into their own apps.
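To make that concrete: LlamaEdge’s API server speaks the familiar OpenAI-compatible chat API, so a Python app can talk to the locally running model over plain HTTP. A minimal sketch, assuming the server listens on its default port 8080 and that the model name below matches whatever the quick-start script installed:

    import requests  # third-party HTTP client: pip install requests

    # LlamaEdge exposes an OpenAI-compatible endpoint on the local machine.
    response = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "gemma-2b-it",  # illustrative; use the model you actually installed
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Explain WebAssembly in one sentence."},
            ],
        },
        timeout=120,
    )
    response.raise_for_status()
    print(response.json()["choices"][0]["message"]["content"])

Because the interface follows the OpenAI-style API my apps already know how to call, swapping the locally hosted WASM model in or out requires little more than changing the URL.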

This brings me back to a conversation I had with Fermyon’s Tim Enwall back at AWS re:Invent, where we talked about the possibility of offering WASM components in the form of a subscription service. If I, as an industry analyst, created my own LLM fine-tuned on all of my past publications, I could compile it for WASM and start selling subscriptions to this digital twin of mine.

Note that I could not achieve the same thing simply by placing my LLM into an application container, as containers rely on the underlying operating system and hardware infrastructure. That means thorough testing is necessary every time I make changes or upgrades, as my containerized LLM app might only work with specific versions of a particular operating system.

One More Use Case: Data Pipeline Management for Logging and Beyond

Calyptia, recently acquired by Chronosphere, offers a Fluent Bit-based observability data pipeline management platform that allows developers to write plugins in the form of WASM programs. Developers can simply use Rust, TinyGo, and, to a degree, Python to write complex functions for culling, enriching, converting, or otherwise processing pipeline data.

We could now connect this back to our LlamaEdge example, enabling our WASM pipeline program to “talk to” LlamaEdge to analyze logs in real time, extract meaningful insights, or even automate responses based on the content of the logs. Imagine a scenario where your WASM pipeline program detects an anomaly in the log data, such as an unusual spike in traffic or a potential security breach. It could then query the LlamaEdge LLM to understand the context better and suggest immediate actions or escalate the issue to the appropriate team members.
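A toy sketch of that scenario in Python: the spike threshold, endpoint, and model name are assumptions, and a production version would run as a compiled WASM program inside the pipeline itself rather than as a standalone script.

    import requests  # pip install requests

    LLM_URL = "http://localhost:8080/v1/chat/completions"  # local LlamaEdge endpoint (assumed default)

    def triage_log_batch(log_lines, baseline_per_minute=100):
        """Naive anomaly check: if a batch is far above the usual rate, ask the LLM to triage it."""
        if len(log_lines) < 5 * baseline_per_minute:
            return None  # traffic looks normal, nothing to do
        prompt = (
            "These log lines arrived in a sudden spike. Summarize the likely cause "
            "and suggest an immediate action:\n" + "\n".join(log_lines[:50])
        )
        resp = requests.post(
            LLM_URL,
            json={"model": "gemma-2b-it",
                  "messages": [{"role": "user", "content": prompt}]},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]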

By integrating LLMs into the data pipeline, the process of monitoring and responding to events can become significantly more intelligent and proactive. This could revolutionize how we handle log data, turning a reactive process into a dynamic, automated one that not only alerts but also provides potential solutions. This becomes even more interesting when considering that processing telemetry data in a decentralized manner within the data pipeline reduces the amount of data that needs to be ingested into one or more corporate observability platforms. This can lead to significant cost reduction as many of the observability platforms charge enterprise customers based on incoming data volume.

Fermyon Platform for Kubernetes: Higher Density, Lower Cost

Fermyon launched its SpinKube framework for Kubernetes, enabling WASM applications to run on Kubernetes with higher density and therefore at lower cost compared to containers. SpinKube leverages the lightweight nature of WebAssembly modules to pack more applications onto each server node, significantly reducing the required compute resources compared to traditional containerized applications.

The SpinKube framework is designed to be developer-friendly, offering seamless integration with existing Kubernetes environments. Developers can deploy their WASM applications as if they were traditional containerized apps, without needing to learn new tools or workflows. This ease of use accelerates the development cycle and simplifies the deployment process.
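A minimal sketch of what such a deployment can look like, using the SpinApp custom resource that the SpinKube operator adds to the cluster (the API version and field names below reflect the project’s early releases and may have changed):

    apiVersion: core.spinoperator.dev/v1alpha1
    kind: SpinApp
    metadata:
      name: hello-wasm                  # the WASM workload as a first-class Kubernetes object
    spec:
      image: "ghcr.io/my-org/hello-wasm:0.1.0"   # illustrative OCI image containing the Spin app
      replicas: 2
      executor: containerd-shim-spin    # runs the app via the Spin containerd shim instead of a container

Applying a manifest like this with kubectl is all it takes; the operator then schedules the WASM workload right next to traditional pods.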

Moreover, SpinKube ensures security and isolation at the application level, a critical aspect for multitenant environments. Each WASM application runs in its isolated sandbox, providing a secure execution environment that minimizes the risk of vulnerabilities affecting the host system or other applications.

Fermyon’s commitment to open standards and community-driven development is evident in SpinKube’s architecture. The platform supports a wide range of programming languages and tools, making it accessible to a broad developer community. This inclusivity fosters innovation and encourages the adoption of WASM technology across various industries.

In summary, the Fermyon Platform for Kubernetes represents a significant advancement in cloud native computing. By enabling higher density and lower cost while maintaining ease of use, security, and open standards, SpinKube positions itself as a key player in the future of Kubernetes application deployment. It is important to mention here that Fermyon donated SpinKube to the CNCF sandbox.

Final Words: LLMs, Developer Productivity, and Operations Cost Pressure as the Driving Force Behind WASM Success

WASM’s inherent ability to run consistently anywhere a WebAssembly runtime is available makes this technology predestined for “moving LLMs to where the data is.”

This is great for compliance reasons, as enterprises can simply “dock” the desired LLM to their relevant data sources without needing permission to move potentially sensitive data. This portability, combined with the small size of the WASM runtime and the ability to run WASM apps on Kubernetes right next to traditional containers, could make it cheaper and therefore easier to run some LLM inference or model training on server infrastructure that is sitting idle over the weekend anyway. Once Monday rolls around, we can terminate our WASM-LLM apps or move them somewhere else. Of course, this principle applies not only to LLMs but to many other use cases as well.

If the Bytecode Alliance and the W3C WebAssembly Community Group can accelerate the pace of implementing the WebAssembly Component Model so that WASM can become universally usable, this technology will become a true game changer. WASI 0.2 was a nice step forward, but to make this platform ready for the mass market there is still quite a bit of homework to be done.

TNS owner Insight Partners is an investor in: Fermyon, Kubernetes.