Build Your Own Language Model: A Simple Guide with Python and NumPy

This content originally appeared on DEV Community and was authored by Santhosh Vijayabaskar

Artificial Intelligence is everywhere these days, and language models are a big part of that. Have you ever wondered how AI can predict the next word in a sentence or even write entire paragraphs? In this tutorial, we’ll build a super simple language model without relying on fancy frameworks like TensorFlow or PyTorch—just plain Python and NumPy.

Sounds cool? Let’s get started!🚀

What We’re Building:

We'll be creating a bigram model. It predicts the next word in a sentence based on the current word. We’ll keep it straightforward and easy to follow so you’ll learn how things work without getting buried in too much detail.🧠💡

Step 1: Set Up

Before we begin, let's make sure you’ve got Python and NumPy ready to go. If you don’t have NumPy installed, quickly install it with:

pip install numpy

Step 2: Understanding the Basics

A language model assigns probabilities to what comes next in a piece of text. A bigram model is the simplest useful version: it predicts the next word from the current word alone and ignores everything that came before it.

We’ll start with a short text to train the model. Here’s a small sample we’ll use:

import numpy as np

# Sample dataset: A small text corpus
corpus = """Artificial Intelligence is the new electricity.
Machine learning is the future of AI.
AI is transforming industries and shaping the future."""

Step 3: Preparing the Text

First things first, we need to break this text into individual words and create a vocabulary (basically a list of all unique words). This gives us something to work with.

# Tokenize the corpus into words
words = corpus.lower().split()

# Create a vocabulary of unique words (sorted so the order is the same on every run)
vocab = sorted(set(words))
vocab_size = len(vocab)

print(f"Vocabulary: {vocab}")
print(f"Vocabulary size: {vocab_size}")

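Here, we’re converting the text to lowercase and splitting it on whitespace. After that, we create a list of unique words to serve as our vocabulary. Note that split() leaves punctuation attached, so "ai." and "ai" count as two different words, as do "future." and "future".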
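If you'd rather not have punctuation glued to your words, one option is a small regex-based tokenizer. This is only a sketch, assuming a simple pattern is good enough for this corpus; the tokenize helper is a hypothetical name and not part of the code above.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

corpus = """Artificial Intelligence is the new electricity.
Machine learning is the future of AI.
AI is transforming industries and shaping the future."""

whitespace_tokens = corpus.lower().split()  # punctuation stays attached
regex_tokens = tokenize(corpus)             # punctuation is dropped

print(f"Unique tokens with split(): {len(set(whitespace_tokens))}")
print(f"Unique tokens with regex:   {len(set(regex_tokens))}")
```

With the regex version, "ai." and "ai" collapse into one vocabulary entry, which gives the model more examples per word. The rest of this tutorial sticks with split() to keep things simple.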

Step 4: Map Words to Numbers

Computers work with numbers, not words. So, we’ll map each word to an index and create a reverse mapping too (this will help when we convert them back to words later).

word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

# Convert the words in the corpus to indices
corpus_indices = [word_to_idx[word] for word in words]

Basically, we’re just turning words into numbers that our model can understand. Each word gets its own number: "ai" might become 0 and "learning" might become 1, depending on where each word sits in the vocabulary list.
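To see the two mappings in action, here's a quick round trip on a tiny made-up sentence (the sentence is just an illustration, not the tutorial corpus). It encodes the words as indices and decodes them back again.

```python
# A tiny illustrative sentence, not the tutorial corpus
words = "ai is the future of ai".split()

# Sorted vocabulary so indices are the same on every run
vocab = sorted(set(words))
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

# Encode words as indices, then decode them back to words
indices = [word_to_idx[word] for word in words]
decoded = [idx_to_word[idx] for idx in indices]

print(f"Vocabulary: {vocab}")
print(f"Encoded:    {indices}")
print(f"Decoded:    {decoded}")
```

Notice that both occurrences of "ai" map to the same index, and decoding gives back exactly the words we started with.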

Step 5: Building the Model

Now, let’s get to the heart of it: building the bigram model. We want to figure out the probability of one word following another. To do that, we’ll count how often each word pair (bigram) shows up in our dataset.

# Initialize bigram counts matrix
bigram_counts = np.zeros((vocab_size, vocab_size))

# Count occurrences of each bigram in the corpus
for i in range(len(corpus_indices) - 1):
    current_word = corpus_indices[i]
    next_word = corpus_indices[i + 1]
    bigram_counts[current_word, next_word] += 1

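# Normalize the counts to get probabilities.
# The last token in the corpus ("future.") is never followed by anything, so its
# row is all zeros. One simple choice is to give such rows a uniform distribution
# instead of dividing by zero (which would produce NaN values).
row_sums = bigram_counts.sum(axis=1, keepdims=True)
bigram_probabilities = np.where(
    row_sums > 0, bigram_counts / np.maximum(row_sums, 1), 1.0 / vocab_size
)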

print("Bigram probabilities matrix: ", bigram_probabilities)

Here’s what’s happening:

We’re counting how often each word follows another (that's the bigram).
Then, we turn those counts into probabilities by normalizing them.
In simple terms, this means that if "ai" is often followed by "is", the probability for that pair will be higher.
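Here's the same counting and normalizing on a toy sentence that is small enough to check by hand. The sentence is made up for illustration; it was chosen so that every word has at least one follower.

```python
import numpy as np

# Toy sentence for illustration: "the" is followed by "cat" twice and "mat" once
tokens = "the cat sat on the mat the cat sat".split()
vocab = sorted(set(tokens))
word_to_idx = {word: idx for idx, word in enumerate(vocab)}

# Count each (current word, next word) pair
counts = np.zeros((len(vocab), len(vocab)))
for current_word, next_word in zip(tokens, tokens[1:]):
    counts[word_to_idx[current_word], word_to_idx[next_word]] += 1

# Divide each row by its total so the row becomes a probability distribution
probs = counts / counts.sum(axis=1, keepdims=True)

the, cat, mat = word_to_idx["the"], word_to_idx["cat"], word_to_idx["mat"]
print(f"P(cat | the) = {probs[the, cat]:.3f}")
print(f"P(mat | the) = {probs[the, mat]:.3f}")
```

"the" appears three times as a current word, so "cat" gets 2 out of 3 and "mat" gets 1 out of 3. Every row of the matrix adds up to 1.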

Step 6: Predicting the Next Word

Let’s now test our model by making it predict the next word based on any given word. We do this by sampling from the probability distribution of the next word.

def predict_next_word(current_word, bigram_probabilities):
    word_idx = word_to_idx[current_word]
    next_word_probs = bigram_probabilities[word_idx]
    next_word_idx = np.random.choice(range(vocab_size), p=next_word_probs)
    return idx_to_word[next_word_idx]

# Test the model with a word
current_word = "ai"
next_word = predict_next_word(current_word, bigram_probabilities)
print(f"Given '{current_word}', the model predicts '{next_word}'.")

This function takes a word, looks up its row of probabilities, and randomly selects the next word according to them. The word must be lowercase and present in the vocabulary; passing "AI" raises a KeyError. In this corpus "ai" is only ever followed by "is", so the model will always predict "is" for it.
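Sampling isn't the only way to pick the next word. The sketch below compares it with a greedy pick (always take the most likely word) and uses a seeded generator so runs are repeatable. The three-word vocabulary and its probabilities are invented for illustration.

```python
import numpy as np

# Hypothetical vocabulary and next-word distribution, made up for this example
vocab = ["ai", "is", "the"]
next_word_probs = np.array([0.1, 0.7, 0.2])

# A seeded generator makes the random choices repeatable between runs
rng = np.random.default_rng(seed=0)

# Sampling: each word is chosen in proportion to its probability
sampled_idx = rng.choice(len(vocab), p=next_word_probs)

# Greedy: always choose the single most likely word
greedy_idx = int(np.argmax(next_word_probs))

# Over many draws, the sampled frequencies approach the probabilities
draws = rng.choice(len(vocab), size=10_000, p=next_word_probs)
share_of_is = (draws == 1).mean()

print(f"Sampled: {vocab[sampled_idx]}")
print(f"Greedy:  {vocab[greedy_idx]}")
print(f"Share of 'is' over 10,000 samples: {share_of_is:.2f}")
```

Greedy picking gives the same output every time, while sampling adds variety. For text generation, sampling usually produces less repetitive results.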

Step 7: Generate a Sentence

Finally, let's generate a whole sentence! We’ll start with a word and keep predicting the next word a few times.

def generate_sentence(start_word, bigram_probabilities, length=5):
    sentence = [start_word]
    current_word = start_word

    for _ in range(length):
        next_word = predict_next_word(current_word, bigram_probabilities)
        sentence.append(next_word)
        current_word = next_word

    return ' '.join(sentence)

# Generate a sentence starting with "artificial"
generated_sentence = generate_sentence("artificial", bigram_probabilities, length=10)
print(f"Generated sentence: {generated_sentence}")

This function takes an initial word and predicts the next one, then uses that word to predict the following one, and so on. The result is a chain of words that each plausibly follow the previous one, even if the sentence as a whole doesn't always make sense.
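Generating a fixed number of words can run straight past the end of a sentence. Here's a sketch of a variant that stops as soon as it produces a word ending in a period. It assumes whitespace tokenization (so periods stay attached to words); generate_until_period and the small corpus are hypothetical and only for illustration.

```python
import numpy as np

# Small illustrative corpus; split() keeps the periods attached to the words
corpus = "ai is the future. the future is ai."
words = corpus.lower().split()
vocab = sorted(set(words))
word_to_idx = {word: idx for idx, word in enumerate(vocab)}

# Count each (current word, next word) pair
counts = np.zeros((len(vocab), len(vocab)))
for current_word, next_word in zip(words, words[1:]):
    counts[word_to_idx[current_word], word_to_idx[next_word]] += 1

def generate_until_period(start_word, max_words=10, seed=0):
    """Generate words until one ends with '.' or max_words is reached."""
    rng = np.random.default_rng(seed)
    sentence = [start_word]
    while len(sentence) < max_words and not sentence[-1].endswith("."):
        row = counts[word_to_idx[sentence[-1]]]
        if row.sum() == 0:
            break  # dead end: this word was never followed by anything
        next_idx = rng.choice(len(vocab), p=row / row.sum())
        sentence.append(vocab[next_idx])
    return " ".join(sentence)

result = generate_until_period("ai")
print(f"Generated sentence: {result}")
```

The max_words cap is still there as a safety net, since a bigram model can wander in loops without ever reaching a period.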

Wrapping Up

There you have it: a simple bigram language model built from scratch using just Python and NumPy. We didn’t use any fancy libraries, and you now have a basic understanding of how AI can predict text. You can play around with this code, feed it different text, or extend it, for example by predicting from the previous two words instead of one.
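To make it easier to try different text, you could wrap the steps into one function. This is a minimal sketch rather than the definitive way to do it; train_bigram is a hypothetical name, and the uniform fallback for words that are never followed by anything is an assumption made to avoid dividing by zero.

```python
import numpy as np

def train_bigram(text):
    """Build a vocabulary, word index, and bigram probability matrix from text."""
    words = text.lower().split()
    vocab = sorted(set(words))
    word_to_idx = {word: idx for idx, word in enumerate(vocab)}

    # Count each (current word, next word) pair
    counts = np.zeros((len(vocab), len(vocab)))
    for current_word, next_word in zip(words, words[1:]):
        counts[word_to_idx[current_word], word_to_idx[next_word]] += 1

    # Normalize rows; rows with no followers fall back to a uniform distribution
    row_sums = counts.sum(axis=1, keepdims=True)
    probs = np.where(
        row_sums > 0, counts / np.maximum(row_sums, 1), 1.0 / len(vocab)
    )
    return vocab, word_to_idx, probs

# Try it on a different piece of text
vocab, word_to_idx, probs = train_bigram("to be or not to be that is the question")
p_to_be = probs[word_to_idx["to"], word_to_idx["be"]]
print(f"P(be | to) = {p_to_be}")
```

Swap in any text you like and look at which word pairs get the highest probabilities.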

Give it a try, and let me know how it goes. Happy coding!
