This content originally appeared on DEV Community and was authored by Santhosh Vijayabaskar
Artificial Intelligence is everywhere these days, and language models are a big part of that. Have you ever wondered how AI can predict the next word in a sentence or even write entire paragraphs? In this tutorial, we’ll build a super simple language model without relying on fancy frameworks like TensorFlow or PyTorch—just plain Python and NumPy.
Sounds cool? Let’s get started!🚀
What We’re Building:
We'll be creating a bigram model. It predicts the next word in a sentence based on the current word. We’ll keep it straightforward and easy to follow so you’ll learn how things work without getting buried in too much detail.🧠💡
Step 1: Set Up
Before we begin, let's make sure you’ve got Python and NumPy ready to go. If you don’t have NumPy installed, quickly install it with:
pip install numpy
Step 2: Understanding the Basics
A language model predicts the next word in a sentence. A bigram model is the simplest useful version of that idea: it predicts the next word using only the current word, based on how often each pair of adjacent words appears in the training text.
We’ll start with a short text to train the model. Here’s a small sample we’ll use:
import numpy as np
# Sample dataset: A small text corpus
corpus = """Artificial Intelligence is the new electricity.
Machine learning is the future of AI.
AI is transforming industries and shaping the future."""
Step 3: Preparing the Text
First things first, we need to break this text into individual words and create a vocabulary (basically a list of all unique words). This gives us something to work with.
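# Tokenize the corpus into lowercase words, stripping surrounding punctuation
# so that "ai." and "ai" count as the same word
words = [word.strip(".,!?;:") for word in corpus.lower().split()]
# Create a vocabulary of unique words (sorted so the order is the same on every run)
vocab = sorted(set(words))
vocab_size = len(vocab)
print(f"Vocabulary: {vocab}")
print(f"Vocabulary size: {vocab_size}")
Here, we’re converting the text to lowercase, splitting it on whitespace, and stripping punctuation from the ends of each word. Without that last step, "ai" and "ai." would be treated as two different words. After that, we create a sorted list of unique words to serve as our vocabulary.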
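To see why punctuation matters for the vocabulary, here is a small sketch that compares plain whitespace splitting with a regular-expression tokenizer. The `sample` text and the regex pattern are assumptions chosen for illustration, not part of the tutorial's code.

```python
import re

# Hypothetical sample text, chosen so that "AI" appears with and without a period.
sample = "Machine learning is the future of AI. AI is transforming industries."

# Plain whitespace splitting keeps punctuation attached to the words.
split_tokens = sample.lower().split()

# A regex that keeps only runs of lowercase letters, digits, and apostrophes.
regex_tokens = re.findall(r"[a-z0-9']+", sample.lower())

print(split_tokens)
print(regex_tokens)
print(len(set(split_tokens)), "unique tokens vs", len(set(regex_tokens)))
```

With whitespace splitting, "ai." and "ai" end up as separate vocabulary entries, so the model would learn separate statistics for each.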
Step 4: Map Words to Numbers
Computers work with numbers, not words. So, we’ll map each word to an index and create a reverse mapping too (this will help when we convert them back to words later).
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}
# Convert the words in the corpus to indices
corpus_indices = [word_to_idx[word] for word in words]
Basically, we’re just turning words into numbers that our model can understand. Each word gets its own number: for example, “ai” might become 0 and “learning” might become 1. The exact numbers depend on the order of the words in the vocabulary list.
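A quick way to check the two mappings is a round trip: words to indices and back again. The sketch below uses a tiny made-up vocabulary, and the helper names `encode` and `decode` are hypothetical, introduced only for this example.

```python
# Hypothetical toy text; sorting the vocabulary keeps the indices stable between runs.
toy_words = "ai is the future of ai".split()
toy_vocab = sorted(set(toy_words))  # ['ai', 'future', 'is', 'of', 'the']
toy_word_to_idx = {word: idx for idx, word in enumerate(toy_vocab)}
toy_idx_to_word = {idx: word for word, idx in toy_word_to_idx.items()}

def encode(tokens):
    """Turn a list of words into a list of integer indices."""
    return [toy_word_to_idx[token] for token in tokens]

def decode(indices):
    """Turn a list of integer indices back into words."""
    return [toy_idx_to_word[index] for index in indices]

encoded = encode(toy_words)
print(encoded)          # [0, 2, 4, 1, 3, 0]
print(decode(encoded))  # the original list of words
```

If decoding the encoded list gives back the original words, the two dictionaries are consistent with each other.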
Step 5: Building the Model
Now, let’s get to the heart of it: building the bigram model. We want to figure out the probability of one word following another. To do that, we’ll count how often each word pair (bigram) shows up in our dataset.
# Initialize bigram counts matrix
bigram_counts = np.zeros((vocab_size, vocab_size))
# Count occurrences of each bigram in the corpus
for i in range(len(corpus_indices) - 1):
    current_word = corpus_indices[i]
    next_word = corpus_indices[i + 1]
    bigram_counts[current_word, next_word] += 1
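# Normalize each row of counts to get probabilities.
# A word that never has a successor (such as the last word of the corpus) has a row sum of 0;
# give it a uniform distribution instead of dividing by zero and getting NaN.
row_sums = bigram_counts.sum(axis=1, keepdims=True)
bigram_probabilities = np.where(row_sums > 0, bigram_counts / np.maximum(row_sums, 1), 1.0 / vocab_size)
print("Bigram probabilities matrix:\n", bigram_probabilities)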
Here’s what’s happening:
We’re counting how often each word follows another (that's the bigram).
Then, we turn those counts into probabilities by normalizing them.
In simple terms, this means that if "AI" is often followed by "is," the probability for that pair will be higher.
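Here is the normalization worked through on a tiny count matrix. The numbers are made up for illustration, and the uniform fallback for a word with no observed successor is one possible choice, not the only one.

```python
import numpy as np

# Hypothetical counts for a 3-word vocabulary: rows = current word, columns = next word.
toy_counts = np.array([
    [0.0, 3.0, 1.0],  # word 0 was followed by word 1 three times and by word 2 once
    [2.0, 0.0, 2.0],  # word 1 was followed by word 0 twice and by word 2 twice
    [0.0, 0.0, 0.0],  # word 2 was never followed by anything
])

row_sums = toy_counts.sum(axis=1, keepdims=True)

# Divide each row by its total; rows that sum to zero get a uniform distribution.
# np.maximum avoids a 0/0 division in the branch that np.where discards.
toy_probs = np.where(
    row_sums > 0,
    toy_counts / np.maximum(row_sums, 1.0),
    1.0 / toy_counts.shape[1],
)

print(toy_probs)
print(toy_probs.sum(axis=1))  # every row sums to 1
```

Row 0 becomes 0, 0.75, 0.25 because 3 of its 4 observed successors were word 1. Row 2 has no data at all, so each word gets an equal 1/3.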
Step 6: Predicting the Next Word
Let’s now test our model by making it predict the next word based on any given word. We do this by sampling from the probability distribution of the next word.
def predict_next_word(current_word, bigram_probabilities):
    word_idx = word_to_idx[current_word]
    next_word_probs = bigram_probabilities[word_idx]
    next_word_idx = np.random.choice(range(vocab_size), p=next_word_probs)
    return idx_to_word[next_word_idx]
# Test the model with a word
current_word = "ai"
next_word = predict_next_word(current_word, bigram_probabilities)
print(f"Given '{current_word}', the model predicts '{next_word}'.")
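This function takes a word, looks up its probabilities, and randomly selects the next word based on those probabilities. If you pass in "ai" (lowercase, to match the vocabulary keys), the model might predict something like "is" as the next word. Passing "AI" would raise a KeyError, because the corpus was lowercased before the vocabulary was built.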
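To get a feel for what sampling from a probability distribution means, the sketch below draws many samples and compares the observed frequencies with the probabilities. The candidate words and their probabilities are invented for the example, and it uses NumPy's seeded `default_rng` generator so the draws are repeatable.

```python
import numpy as np

# Hypothetical next-word distribution for some current word.
candidates = ["is", "transforming", "and"]
probs = np.array([0.6, 0.3, 0.1])

# A seeded Generator makes the random draws repeatable between runs.
rng = np.random.default_rng(seed=42)
draws = rng.choice(len(candidates), size=10_000, p=probs)

# Observed frequencies should land close to the probabilities we sampled from.
frequencies = np.bincount(draws, minlength=len(candidates)) / draws.size
for word, p, f in zip(candidates, probs, frequencies):
    print(f"{word:>12}: expected {p:.2f}, observed {f:.3f}")

# Greedy alternative: always take the single most likely word.
greedy_word = candidates[int(np.argmax(probs))]
print("Greedy choice:", greedy_word)
```

Sampling keeps the generated text varied, while the greedy choice always returns the same word for a given input.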
Step 7: Generate a Sentence
Finally, let's generate a whole sentence! We’ll start with a word and keep predicting the next word a few times.
def generate_sentence(start_word, bigram_probabilities, length=5):
    sentence = [start_word]
    current_word = start_word
    for _ in range(length):
        next_word = predict_next_word(current_word, bigram_probabilities)
        sentence.append(next_word)
        current_word = next_word
    return ' '.join(sentence)
# Generate a sentence starting with "artificial"
generated_sentence = generate_sentence("artificial", bigram_probabilities, length=10)
print(f"Generated sentence: {generated_sentence}")
This function takes an initial word and predicts the next one, then uses that word to predict the following one, and so on. Before you know it, you’ve got a full sentence!
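One case worth guarding against is a word that has no usable row of probabilities, for instance a word that only ever appears at the very end of the training text. The sketch below is one possible way to handle it: stop generating early. The function name `generate_until_dead_end` and the three-word toy model are assumptions made for this example.

```python
import numpy as np

def generate_until_dead_end(start_idx, probs, idx_to_word, max_words=10, rng=None):
    """Sample a word sequence, stopping early if the current word has no valid distribution."""
    if rng is None:
        rng = np.random.default_rng()
    indices = [start_idx]
    for _ in range(max_words):
        row = probs[indices[-1]]
        # A usable row sums to 1; an all-zero or NaN row means there is nothing to sample from.
        if not np.isclose(row.sum(), 1.0):
            break
        indices.append(int(rng.choice(len(row), p=row)))
    return " ".join(idx_to_word[i] for i in indices)

# Hypothetical toy model: "hello" -> "world" -> "end", and "end" has no successor.
toy_idx_to_word = {0: "hello", 1: "world", 2: "end"}
toy_probs = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
])

print(generate_until_dead_end(0, toy_probs, toy_idx_to_word))  # hello world end
```

Stopping early is a design choice. Another option is to give such words a uniform distribution when the probability matrix is built, so generation can always continue.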
Wrapping Up
There you have it: a simple bigram language model built from scratch using just Python and NumPy. We didn’t use any deep learning frameworks, and you now have a basic understanding of how a model can predict text from word-pair statistics. You can play around with this code, feed it different text, or even expand it by using more advanced models.
Give it a try, and let me know how it goes. Happy coding!
Santhosh Vijayabaskar | Sciencx (2024-10-18T23:35:15+00:00) Build Your Own Language Model: A Simple Guide with Python and NumPy. Retrieved from https://www.scien.cx/2024/10/18/build-your-own-language-model-a-simple-guide-with-python-and-numpy/