Yesterday, DeepSeek released a series of very powerful language models including DeepSeek R1 and a number of distilled (smaller) models that are based on the Qwen and Llama architectures. These models have made a lot of noise in the AI community for their performance, reasoning capabilities and most importantly for being open source with an MIT license.
I have been testing these models both through their official API and locally on my MacBook Pro, and the performance has been impressive, even for the smaller models such as the 8B and 14B variants. DeepSeek has published benchmarks comparing R1 against other state-of-the-art models from OpenAI and Anthropic.
In this guide, I'm going to walk you through how to set up Ollama and run the latest DeepSeek R1 models locally on your own computer. But before we get started, let's take a look at the models themselves.
DeepSeek R1
DeepSeek R1 is a large language model that focuses on reasoning. It can handle tasks that need multi-step problem-solving and logical thinking. The model uses a special training method that puts more emphasis on Reinforcement Learning (RL) instead of Supervised Fine-Tuning (SFT). This approach helps the model to be better at figuring things out on its own.
The model is open source, which means its weights are available under the MIT license. This allows people to use it for commercial purposes, make changes to it, and create new versions based on it. This is different from many other big language models that are not open source.
Distilled Models: Smaller but Still Powerful
DeepSeek AI also released smaller versions of the model. These distilled models come in different sizes like 1.5B, 7B, 8B, 14B, 32B, and 70B parameters. They are based on Qwen and Llama architectures. These smaller models keep a lot of the reasoning power of the bigger model but are easier to use on personal computers.
The smaller models, especially the 8B and smaller ones, can run on regular computers with CPUs, GPUs, or Apple Silicon. This makes them easy for people to experiment with at home.
What is Ollama?
Ollama is a tool that lets you run and manage large language models (LLMs) on your own computer. It makes it easier to download, run, and use these models without needing a powerful server. Ollama supports various operating systems, including macOS, Linux, and Windows. It is designed to be simple to use, with basic commands to pull, run, and manage models.
Ollama also provides a way to use the models through an API, which allows you to integrate them into other applications. Importantly, Ollama offers an experimental compatibility layer with the OpenAI API. This means you can often use existing applications and tools designed for OpenAI with your local Ollama server. It can be configured to use GPUs for faster processing, and it offers features like custom model creation and model sharing. Ollama is a great way to explore and use LLMs without relying on cloud-based services.
Installing Ollama
Before you can use DeepSeek models, you need to install Ollama. Here's how to do it on different operating systems:
macOS
- Go to the Ollama website and download the macOS installer.
- Open the downloaded file and drag the Ollama application to your Applications folder.
- Start the Ollama application. It will run in the background and appear in your menu bar.
- Open a terminal and type `ollama -v` to check if the installation was successful.
Linux
- Open a terminal and run the following command to install Ollama:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

- If you prefer a manual install, download the correct `.tgz` package from the Ollama website. Then, extract the package to `/usr` using these commands:

```bash
curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz
sudo tar -C /usr -xzf ollama-linux-amd64.tgz
```

To start Ollama, run `ollama serve`. You can check if it's working by typing `ollama -v` in another terminal.

- For a more reliable setup, create a systemd service. First, create a user and group for Ollama:

```bash
sudo useradd -r -s /bin/false -U -m -d /usr/share/ollama ollama
sudo usermod -a -G ollama $(whoami)
```

- Then, create a service file at `/etc/systemd/system/ollama.service` with the following content:

```ini
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=$PATH"

[Install]
WantedBy=default.target
```

- Finally, start and enable the service:

```bash
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
sudo systemctl status ollama
```
Windows
- Go to the Ollama website and download the Windows installer (`OllamaSetup.exe`).
- Run the installer. Ollama will be installed in your user profile.
- Ollama will run in the background and show up in your system tray.
- Open a command prompt or PowerShell and type `ollama -v` to check if the installation was successful.
Understanding Ollama Commands
Ollama uses simple commands to manage models. Here are some key commands you'll need:
- `ollama -v`: Checks the installed version of Ollama.
- `ollama pull <model_name>:<tag>`: Downloads a model from the Ollama library.
- `ollama run <model_name>:<tag>`: Runs a model and starts an interactive chat session.
- `ollama create <model_name> -f <Modelfile>`: Creates a custom model using a Modelfile.
- `ollama show <model_name>`: Shows information about a model.
- `ollama ps`: Lists the models that are currently running.
- `ollama stop <model_name>`: Unloads a model from memory.
- `ollama cp <source_model> <destination_model>`: Copies a model.
- `ollama rm <model_name>`: Deletes a model.
- `ollama push <model_name>:<tag>`: Uploads a model to a model library.
DeepSeek Models on Ollama
DeepSeek models are available on the Ollama library in different sizes and formats. Here's a breakdown:
- Model Sizes: The models come in various sizes, such as 1.5b, 7b, 8b, 14b, 32b, 70b, and 671b. The 'b' stands for billion parameters. Larger models usually perform better but need more resources.
- Quantized Versions: Some models are available in quantized versions (e.g., `q4_K_M`, `q8_0`). These versions use less memory and can run faster, but may have a slight drop in quality.
- Distilled Versions: DeepSeek also offers distilled versions (e.g., `qwen-distill`, `llama-distill`). These are smaller models that have been trained to act like the larger ones, balancing performance and resource use.
- Tags: Each model has a `latest` tag and specific tags that show the size, quantization, and distillation method.
Using DeepSeek Models
Here's how to use DeepSeek models with Ollama:
Pulling a Model
To download a DeepSeek model, use the command:
```bash
ollama pull deepseek-r1:<model_tag>
```

Replace `<model_tag>` with the specific tag of the model you want to use. For example:

- To download the latest 7B model:

```bash
ollama pull deepseek-r1:7b
```

- To download the 14B Qwen-distilled model with `q4_K_M` quantization:

```bash
ollama pull deepseek-r1:14b-qwen-distill-q4_K_M
```

- To download the 70B Llama-distilled model with `fp16` precision:

```bash
ollama pull deepseek-r1:70b-llama-distill-fp16
```
Here are some of the available tags:
- `latest`
- `1.5b`
- `7b`
- `8b`
- `14b`
- `32b`
- `70b`
- `671b`
- `1.5b-qwen-distill-fp16`
- `1.5b-qwen-distill-q4_K_M`
- `1.5b-qwen-distill-q8_0`
- `7b-qwen-distill-fp16`
- `7b-qwen-distill-q4_K_M`
- `7b-qwen-distill-q8_0`
- `8b-llama-distill-fp16`
- `8b-llama-distill-q4_K_M`
- `8b-llama-distill-q8_0`
- `14b-qwen-distill-fp16`
- `14b-qwen-distill-q4_K_M`
- `14b-qwen-distill-q8_0`
- `32b-qwen-distill-fp16`
- `32b-qwen-distill-q4_K_M`
- `32b-qwen-distill-q8_0`
- `70b-llama-distill-fp16`
- `70b-llama-distill-q4_K_M`
- `70b-llama-distill-q8_0`
Running a Model
After downloading a model, you can run it using the command:
```bash
ollama run deepseek-r1:<model_tag>
```

For example:

- To run the latest 7B model:

```bash
ollama run deepseek-r1:7b
```

- To run the 14B Qwen-distilled model with `q4_K_M` quantization:

```bash
ollama run deepseek-r1:14b-qwen-distill-q4_K_M
```

- To run the 70B Llama-distilled model with `fp16` precision:

```bash
ollama run deepseek-r1:70b-llama-distill-fp16
```
This will start an interactive chat session where you can ask the model questions.
Using the API
You can also use the Ollama API with DeepSeek models. Here's an example using `curl`:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "prompt": "Write a short poem about the stars."
}'
```

For chat completions:

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1:7b",
  "messages": [
    {
      "role": "user",
      "content": "Write a short poem about the stars."
    }
  ]
}'
```
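If you prefer calling the native API from Python rather than `curl`, the same chat endpoint works with the `requests` library. This is a minimal sketch, assuming Ollama is running locally on the default port and `deepseek-r1:7b` has already been pulled:

```python
# Minimal sketch: call Ollama's native chat endpoint from Python.
# Assumes a local Ollama server on port 11434 and `pip install requests`.
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-r1:7b",
        "messages": [
            {"role": "user", "content": "Write a short poem about the stars."}
        ],
        "stream": False,  # return one JSON object instead of streamed chunks
    },
    timeout=300,
)
response.raise_for_status()
print(response.json()["message"]["content"])
```

Note that the R1-series models include their reasoning in the output, so you may see a `<think>...</think>` block before the final answer.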
Using the OpenAI-Compatible API
Ollama provides an experimental compatibility layer with parts of the OpenAI API. This allows you to use existing applications and tools designed for OpenAI with your local Ollama server.
Key Concepts
- API Endpoint: Ollama's OpenAI-compatible API is served at `http://localhost:11434/v1`.
- Authentication: Ollama's API doesn't require an API key for local use. You can often use a placeholder like `"ollama"` for the `api_key` parameter in your client.
- Partial Compatibility: Ollama's compatibility is experimental and partial. Not all features of the OpenAI API are supported, and there may be some differences in behavior.
- Focus on Core Functionality: Ollama primarily aims to support the core functionality of the OpenAI API, such as chat completions, text completions, model listing, and embeddings.
Supported Endpoints and Features
Here's a breakdown of the supported endpoints and their features:
- `/v1/chat/completions`
  - Purpose: Generate chat-style responses.
  - Supported Features:
    - Chat completions (multi-turn conversations).
    - Streaming responses (real-time output).
    - JSON mode (structured JSON output).
    - Reproducible outputs (using a `seed`).
    - Vision (multimodal models like `llava` that can process images).
    - Tools (function calling).
  - Supported Request Fields:
    - `model`: The name of the Ollama model to use.
    - `messages`: An array of message objects, each with a `role` (`system`, `user`, `assistant`, or `tool`) and `content` (text or image).
    - `frequency_penalty`, `presence_penalty`: Controls repetition.
    - `response_format`: Specifies the output format (e.g., `json`).
    - `seed`: For reproducible outputs.
    - `stop`: Sequences to stop generation.
    - `stream`: Enables/disables streaming.
    - `stream_options`: Additional options for streaming.
      - `include_usage`: Includes usage information in the stream.
    - `temperature`: Controls randomness.
    - `top_p`: Controls diversity.
    - `max_tokens`: Maximum tokens to generate.
    - `tools`: List of tools the model can access.
- `/v1/completions`
  - Purpose: Generate text completions.
  - Supported Features:
    - Text completions (single-turn generation).
    - Streaming responses.
    - JSON mode.
    - Reproducible outputs.
  - Supported Request Fields:
    - `model`: The name of the Ollama model.
    - `prompt`: The input text.
    - `frequency_penalty`, `presence_penalty`: Controls repetition.
    - `seed`: For reproducible outputs.
    - `stop`: Stop sequences.
    - `stream`: Enables/disables streaming.
    - `stream_options`: Additional options for streaming.
      - `include_usage`: Includes usage information in the stream.
    - `temperature`: Controls randomness.
    - `top_p`: Controls diversity.
    - `max_tokens`: Maximum tokens to generate.
    - `suffix`: Text to append after the model's response.
- `/v1/models`: Lists the models available locally.
- `/v1/models/{model}`: Retrieves information about a specific model.
- `/v1/embeddings`: Generates embeddings.
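To make the request fields above concrete, here is a hedged sketch that exercises a few of them (`temperature`, `seed`, `max_tokens`, and JSON mode via `response_format`) against `/v1/chat/completions` using the OpenAI Python library. It assumes a local Ollama server with `deepseek-r1:7b` pulled; smaller models may not always emit perfectly valid JSON, so treat it as illustrative:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Chat completion exercising several of the supported request fields.
resp = client.chat.completions.create(
    model="deepseek-r1:7b",
    messages=[
        {"role": "system", "content": "Reply only with a JSON object."},
        {"role": "user", "content": 'Return a JSON object with keys "name" and "type" describing the star Sirius.'},
    ],
    temperature=0.2,                           # lower randomness
    seed=42,                                   # reproducible outputs
    max_tokens=200,                            # cap on generated tokens
    response_format={"type": "json_object"},   # JSON mode
)
print(resp.choices[0].message.content)
```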
How to Use Ollama with OpenAI Clients
Here's how to configure popular OpenAI clients to work with Ollama:
- OpenAI Python Library:

```python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama',  # Required but ignored
)

# Example chat completion
chat_completion = client.chat.completions.create(
    messages=[
        {'role': 'user', 'content': 'Say this is a test'},
    ],
    model='deepseek-r1:7b',
)

# Example text completion
completion = client.completions.create(
    model="deepseek-r1:7b",
    prompt="Say this is a test",
)

# Example list models
list_completion = client.models.list()

# Example get model info
model = client.models.retrieve("deepseek-r1:7b")
```
- OpenAI JavaScript Library:

```javascript
import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'http://localhost:11434/v1/',
  apiKey: 'ollama', // Required but ignored
});

// Example chat completion
const chatCompletion = await openai.chat.completions.create({
  messages: [{ role: 'user', content: 'Say this is a test' }],
  model: 'deepseek-r1:7b',
});

// Example text completion
const completion = await openai.completions.create({
  model: "deepseek-r1:7b",
  prompt: "Say this is a test.",
});

// Example list models
const listCompletion = await openai.models.list();

// Example get model info
const model = await openai.models.retrieve("deepseek-r1:7b");
```
- `curl` (Direct API Calls):

```bash
# Chat completion
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1:7b",
    "messages": [
      { "role": "user", "content": "Hello!" }
    ]
  }'

# Text completion
curl http://localhost:11434/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1:7b",
    "prompt": "Say this is a test"
  }'

# List models
curl http://localhost:11434/v1/models

# Get model info
curl http://localhost:11434/v1/models/deepseek-r1:7b

# Embeddings
curl http://localhost:11434/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "all-minilm",
    "input": ["why is the sky blue?", "why is the grass green?"]
  }'
```
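Streaming is listed as a supported feature but not shown in the examples above. Here is a small sketch, again with the OpenAI Python library and the same local-server assumptions, that prints tokens as they arrive:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Stream a chat completion and print each token delta as it is generated.
stream = client.chat.completions.create(
    model="deepseek-r1:7b",
    messages=[{"role": "user", "content": "Explain model distillation in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```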
Choosing the Right Model
When choosing a DeepSeek model, consider these factors:
- Size: Larger models generally perform better but need more resources. Start with smaller models if you have limited resources.
- Quantization: Quantized models use less memory but may have slightly lower quality.
- Distillation: Distilled models offer a good balance between performance and resource usage.
It's best to experiment with different models to see which one works best for you.
Additional Tips
- Always check the Ollama library for the latest models and tags.
- Use `ollama ps` to monitor the resources used by your models.
- You can adjust parameters like `temperature`, `top_p`, and `num_ctx` to fine-tune the model's output (see the sketch below).
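`num_ctx` (the context window size) is not part of the OpenAI-compatible API, but all three parameters can be passed through the `options` object of Ollama's native endpoints. A minimal sketch, assuming a local server and the `requests` library:

```python
import requests

# Sketch: tune sampling and context-window parameters via Ollama's native API.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:7b",
        "prompt": "Summarize why open-weight models matter in one paragraph.",
        "stream": False,
        "options": {
            "temperature": 0.6,  # randomness
            "top_p": 0.9,        # nucleus sampling / diversity
            "num_ctx": 8192,     # context window size in tokens
        },
    },
    timeout=300,
)
response.raise_for_status()
print(response.json()["response"])
```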
Troubleshooting
If you have any issues, check the Ollama logs:
- macOS: `~/.ollama/logs/server.log`
- Linux: `journalctl -u ollama --no-pager`
- Windows: `%LOCALAPPDATA%\Ollama\server.log`
You can also set the `OLLAMA_DEBUG=1` environment variable for more detailed logs.
Going Further with LLMs
Of course, running these models locally is just the beginning. You can integrate them into your own applications using the API, and build custom tools such as chatbots, research assistants with Retrieval-Augmented Generation (RAG), and more.
I have written a number of guides on exploring these models further such as:
Setting up Postgres and pgvector with Docker for building RAG applications - Learn how to set up Postgres and pgvector with Docker for RAG (Retrieval-Augmented Generation) in this step-by-step guide.
Deep Dive into Vector Similarity Search within Postgres and pgvector - Learn how to use pgvector to make vector similarity search easier in Postgres. Discover functions for creating indexes, querying vectors, and more.
Creating AI Agents in Node Using the AI SDK - Learn how to create AI agents in Node using the AI SDK to automate workflows and tasks.
How to Enrich Customer Data with LLMs and Web Crawling - Learn how to use LLMs and Puppeteer to crawl customer websites and enrich their data within your SaaS product.
Conclusion
I hope this guide has been helpful in showing you how easy it is to get started with Ollama and run state-of-the-art language models on your own computer. Remember that you are not limited to DeepSeek models; you can run any model available in the Ollama library, or even models from other platforms like Hugging Face.
If you have any questions or feedback, please let me know in the comments below.