Making VLLM work on WSL2

This content originally appeared on DEV Community and was authored by Emilien Lancelot

Running a small llama3 model for demonstration purposes on WSL2 using VLLM

1. Requirements

  • WSL version 2
  • Python: 3.8–3.12
  • A GPU newer than the GTX 1080 (I did not manage to make it work on a 1080, as it told me the hardware was too old. :-( Ollama is less picky).

2. Preflight checks

Checking NVCC

In WSL, do:

nvcc --version

▶️ Command not found?

Fixing NVCC: installing the CUDA toolkit for WSL

Visit NVIDIA's official website to download and install the CUDA toolkit for WSL-Ubuntu (this is what provides nvcc; the regular NVIDIA driver only needs to be installed on the Windows side). Choose Linux > x86_64 > WSL-Ubuntu > 2.0 > deb (network)

Follow the instructions provided on the page.

Add the following lines to your .bashrc:

export PATH="/usr/local/cuda-12.6/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-12.6/lib64:$LD_LIBRARY_PATH"

⚠️ ⚠️ ⚠️ Check the content of "/usr/local" to be sure that you do have a "cuda-12.6" folder. Yours might have a different version number; adjust the paths above accordingly.

Reload your configuration and check that everything is working as expected:

source ~/.bashrc
nvcc --version
nvidia-smi.exe

ℹ️ "nvidia-smi" isn't available on WSL so just verify that the .exe one detects your hardware. Both commands should displayed gibberish but no apparent errors.

3. Creating the environment

python3 --version # copy the version
conda create -n myenv python=3.10 -y # Update the python version with your own

▶️ Don't have conda?

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
source ~/miniconda3/bin/activate

conda create -n myenv python=3.10 -y # Update the python version with your own
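
Optionally, once the environment is created and activated (conda activate myenv), you can confirm from inside it that the interpreter falls within the 3.8–3.12 range listed in the requirements. A minimal check, run with the environment's python:

# Quick sanity check of the interpreter inside the active conda env
import sys

print(sys.version)
# Tuple comparison: (3, 8) <= (major, minor) <= (3, 12)
assert (3, 8) <= sys.version_info[:2] <= (3, 12), "Python version outside the 3.8-3.12 range supported by VLLM"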

4. Installing VLLM

Activate the environment, then install VLLM:

conda activate myenv
pip install vllm

Trying to start the inference server with a tiny LLM:

vllm serve facebook/opt-125m

▶️ Runtime crash of VLLM?

    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: /home/xxxx/vllm_serve/lib/python3.11/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12

Well, this is where sh*t hits the fan. I recommend trying the fix below. If it doesn't work, I can only wish you good luck: it's another one of those technologies whose error messages seem to have been written by a depressive data scientist. Just scroll to the top of the stack trace and hope for the best. Google is your friend. Have faith.

Potential VLLM serve fix:

python -m pip uninstall torch torchvision torchaudio
python -m pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
vllm serve facebook/opt-125m # Should be working now...

https://github.com/pytorch/pytorch/issues/111469
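
Whether or not you needed the fix above, a quick way to confirm that PyTorch (which VLLM runs on) actually sees your GPU from inside WSL2 is a small check like the sketch below. The compute-capability line reflects VLLM's requirement of compute capability 7.0 or higher, which is why the GTX 1080 (6.1) mentioned in the requirements gets rejected:

# Sanity check: does PyTorch see the GPU inside WSL2?
import torch

print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print("GPU:", torch.cuda.get_device_name(0), f"(compute capability {major}.{minor})")
    # VLLM needs compute capability >= 7.0 (the GTX 1080 is only 6.1)
    print("Meets VLLM's requirement:", (major, minor) >= (7, 0))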

5. Running VLLM

Let's try with a tiny Facebook LLM.

  1. Create an account on Hugging Face and then create an API key (access token).
    Then go to the page of the model you want to try out.

  2. For us: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

  3. At the top of the page there will be a form you must fill in to request access.

Accepting the Facebook agreement. For the one time they do something good for humanity.

When that's done, you'll get an email about 10 minutes later telling you you've been granted access. 🎉

  4. To load the model with VLLM and send it a first prompt, try this piece of code:
from vllm import LLM
from transformers import AutoTokenizer
from huggingface_hub import login

login("<REPLACE-ME-WITH-YOUR-HUGGING-FACE_TOKEN>") # Replace this!

# Load the Llama 3.2 1B Instruct model
model = LLM("meta-llama/Llama-3.2-1B-Instruct")

# Load the matching tokenizer to build a chat-formatted prompt
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Prepare the input message
messages = [{"role": "user", "content": "What is the capital of France?"}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate the output
output = model.generate(formatted_prompt)
print(output)

Run it and, somewhere in all the log output, you'll see:

text='The capital of France is Paris.'
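
The call above uses VLLM's default sampling settings. If you want to control the temperature or the output length, LLM.generate() also accepts a SamplingParams object. A minimal sketch reusing the model and formatted_prompt variables from the snippet above:

from vllm import SamplingParams

# Shorter, more deterministic output
params = SamplingParams(temperature=0.2, max_tokens=64)
outputs = model.generate(formatted_prompt, params)

for out in outputs:
    # Each RequestOutput holds the prompt and its generated completions
    print(out.outputs[0].text)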

Serving using the OpenAI-compatible API

VLLM is OpenAI-compatible to some extent. We can use this to talk to it with the official OpenAI client instead of the Hugging Face one. Note that this requires the inference server to be running first (e.g. vllm serve meta-llama/Llama-3.2-1B-Instruct); if you didn't configure an API key on the server, the api_key passed to the client can be any placeholder string, like the token-abc123 below.

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
  model="meta-llama/Llama-3.2-1B-Instruct",
  messages=[
    {"role": "user", "content": "Hello!"}
  ]
)

print(completion.choices[0].message)

Should output something like:

ChatCompletionMessage(content='Hello! How can I assist you today?', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[])
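
Because the endpoint follows the OpenAI API, streaming also works the usual way with the same client object. A minimal sketch (the prompt is just an example):

# Stream tokens as they are generated instead of waiting for the full answer
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Tell me a short joke."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()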

6. Manage installed models

List models

ls ~/.cache/huggingface/hub
models--facebook--opt-125m  version.txt

Delete models

rm -rf ~/.cache/huggingface/hub/models--<model-name>
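
If you'd rather do this from Python, huggingface_hub ships a cache scanner that lists the cached repos and their size on disk. A minimal sketch:

# List what Hugging Face has cached locally, with sizes
from huggingface_hub import scan_cache_dir

cache_info = scan_cache_dir()
for repo in cache_info.repos:
    print(f"{repo.repo_id:50s} {repo.size_on_disk / 1e9:.2f} GB")

huggingface_hub also provides an interactive huggingface-cli delete-cache command if you prefer a guided cleanup over rm -rf.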

In need of a simple yet effective multi-agent framework with working tool calling for any LLM?
I've got you covered! Go try Yacana!

Yacana logo

VLLM support is in progress. For now, you'll need to use Ollama.

HF.

