SGLang vs llama.cpp – A Quick Speed Test

Recently, I stumbled upon a post about SGLang, an open-source LLM inference engine that boasts 2-5x higher throughput compared to other solutions and a 1.76x speedup for DeepSeek R1 models!

"I'd be super happy even with a modest 1.5x speed-up over my LM Studio/llama.cpp setup!" was my first reaction...

A Closer Look

Just like llama.cpp, SGLang turned out to be a pretty low-level thing... I typically use LM Studio (and have spent some time with Ollama) for running models locally. They are very convenient and require minimal setup: in just minutes and a few clicks you can discover, download, and run models. Both provide an easy way to chat and can run local OpenAI Chat Completions endpoints, which is handy for integrating with various tools (e.g. using a local Web UI or experimenting with AI agents).
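Since both expose the same OpenAI-style API, switching a client between them is mostly a matter of changing the base URL. A minimal sketch using the openai Python package (the port and model name below are assumptions; LM Studio typically listens on 1234, SGLang on 30000):

```python
# Minimal sketch: calling a local OpenAI-compatible Chat Completions endpoint.
# The base URL/port and model name are assumptions - adjust to your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="gemma-2-9b-it",  # whatever model the local server has loaded
    messages=[{"role": "user", "content": "Explain the KV cache in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```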

SGLang is different: it was not created for LLM enthusiasts to run models on their home rigs. I started my research by looking for an Ollama/Jan-like solution, ideally with a GUI (e.g. LM Studio), that could integrate SGLang as a runtime, but I didn't find any.

Hence I spent a couple of hours configuring WSL2 and installing SGLang before I received my first generated tokens:

- I didn't find an explicit mention of supported platforms; it seems to be Linux-only, so I used WSL (Ubuntu 24) on Windows.
- There is no chat UI (not even through the CLI), only an OpenAI-compatible inference server.
- It supports downloading .safetensors models from Hugging Face (though you need to configure huggingface-cli first and log in to get gated models like Llama or Gemma).
- Besides the HF model format, there is some limited support for GGUF, i.e. models you might have downloaded already can be tried. For me, Llama 3.1 8B loaded fine, while Gemma 2 9B failed (both Q8 and Q4).
- It supports online quantization when loading a model (see the launch sketch after this list).
- I tested it via a custom Web UI which I had to run separately; it has a tokens-per-second counter.
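For reference, launching the server itself looked roughly like the sketch below. Treat it as a sketch under my assumptions, not a verified recipe: the flag names (--model-path, --port, --quantization) are what I gathered from the SGLang docs at the time, installation was basically pip install "sglang[all]" inside WSL, and gated Hugging Face repos need a prior huggingface-cli login.

```python
# Sketch: starting the SGLang OpenAI-compatible server from Python.
# Flag names and values are assumptions based on my reading of the SGLang docs;
# gated Hugging Face repos (Llama, Gemma) need `huggingface-cli login` first.
import subprocess

cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "google/gemma-2-9b-it",  # HF repo id, .safetensors format
    "--port", "30000",
    "--quantization", "fp8",                 # online quantization on load
]
server = subprocess.Popen(cmd)  # serves http://localhost:30000/v1 once loaded
```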

If you want to try SGLang yourself, I've compiled the notes I took while setting it up and benchmarking here.

Results

I tested inference speed with Gemma 2 9B. SGLang used the model from Google's HF hub in Safetensors format; LM Studio used a GGUF model from the LM Studio HF hub. Both models were tested in 16-bit and 8-bit variants, both on the CUDA backend with an RTX 4090 and 100% GPU offload.

| Runtime   | Quantization | VRAM    | Load Time | Speed      |
|-----------|--------------|---------|-----------|------------|
| SGLang    | fp8          | 21.1 GB | 4-5 min   | ~70 tok/s  |
| LM Studio | Q8           | 12.6 GB | ~10 sec   | ~65 tok/s  |
| SGLang    | bf16         | 20.7 GB | 4-5 min   | ~47 tok/s  |
| LM Studio | f16          | 20.7 GB | ~20 sec   | ~44 tok/s  |

With SGLang, generation is roughly 7% faster in tokens per second. Yet SGLang is very slow at loading models, taking minutes compared to seconds with llama.cpp. Besides, there's some odd behavior in terms of VRAM consumption: when loading the model in fp8 (doing online quantization), SGLang's memory use actually went up compared to bf16, so loading larger models might be a challenge.
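For context, I read the speed off the Web UI's tokens-per-second counter rather than scripting it, but an equivalent client-side check against either endpoint would look roughly like the sketch below (the port and model name are assumptions; the elapsed time also includes prompt processing, so it slightly understates pure generation speed).

```python
# Rough client-side tokens-per-second check against a local OpenAI-compatible
# server (SGLang or LM Studio); the port and model name are assumptions.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

start = time.time()
resp = client.chat.completions.create(
    model="google/gemma-2-9b-it",
    messages=[{"role": "user", "content": "Write a 300-word story about a robot."}],
    max_tokens=512,
)
elapsed = time.time() - start

generated = resp.usage.completion_tokens  # tokens the model actually produced
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```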

Sticking to llama.cpp

IMO, for local LLM tinkering the marginal difference in generation speed is not worth the hassle: painful installation, troublesome model discovery and downloading, longer load times, and odd VRAM consumption. That said, SGLang might be a good option for multi-user production environments serving multiple requests at a time.


This content originally appeared on DEV Community and was authored by Maxim Saplin