This content originally appeared on DEV Community and was authored by Brian Douglas
Some of the best implementations of AI take existing knowledge bases and make them searchable through prompts. While scrolling through Twitter, I saw a tweet about paul-graham-gpt.
> Mckay Wrigley (@mckaywrigley): I embedded all of @paulg’s essays.
> All 605,870 tokens worth.
> Use OpenAI’s new model to search & chat with them at paul-graham-gpt.vercel.app.
> Code & dataset is 100% open-source for anyone to use.
> GitHub: github.com/mckaywrigley/p…
> — 16:19 · 02 Mar 2023
In this 30 days of OpenAI series, I am looking at AI projects and their code to help demystify how any dev can build AI-generated projects.
Paul Graham is best known for his work on the programming language Lisp and for cofounding the influential startup accelerator and seed capital firm Y Combinator. He also writes essays that offer a wealth of knowledge to current and future startup founders.
Mckay Wrigley built a tool called paul-graham-gpt (you can use it live here) to navigate all of PaulG's essays, and I am excited to jump in and take a look at how he did it.
mckaywrigley / paul-graham-gpt
AI search & chat for all of Paul Graham’s essays.
Paul Graham GPT
AI-powered search and chat for Paul Graham's essays.
All code & data used is 100% open-source.
Dataset
The dataset is a CSV file containing all text & embeddings used.
Download it here.
I recommend getting familiar with fetching, cleaning, and storing data as outlined in the scraping and embedding scripts below, but feel free to skip those steps and just use the dataset.
How It Works
Paul Graham GPT provides 2 things:
- A search interface.
- A chat interface.
Search
Search was created with OpenAI Embeddings (text-embedding-ada-002).
First, we loop over the essays and generate embeddings for each chunk of text.
Then in the app we take the user's search query, generate an embedding, and use the result to find the most similar passages from the book.
The comparison is done using cosine similarity across our database of vectors.
Our database is a Postgres…
How was mckaywrigley/paul-graham-gpt made?
paul-graham-gpt is described as "AI search & chat for all of Paul Graham’s essays." Looking closer at the code, we can see it uses OpenAI's embeddings API.
_OpenAI’s text embeddings measure the relatedness of text strings. Embeddings are commonly used for search, clustering, recommendations, anomaly detection, diversity measurement, and classification._
This is my first look at embeddings; if you read my previous post on aicommits, that project used the completions API instead. Based on my reading of the docs, embeddings are useful for traversing existing data and surfacing what's relevant. The docs use Amazon food reviews as their code sample: you might be looking for reviews of condiments, and specifically for negative ones. Embeddings capture the tone of the review text, not just the star rating.
That is my best explanation after a first look, but check out the embeddings use cases in the docs for more context.
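Under the hood, "relatedness" boils down to cosine similarity between embedding vectors. Here's a minimal TypeScript sketch of that math (my own illustration, not code from the repo):

```ts
// Cosine similarity: values near 1 mean the vectors point the same way
// (very related text), values near 0 mean unrelated. OpenAI's
// text-embedding-ada-002 returns vectors of 1,536 numbers.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```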
How does it work?
The project's README does a great job explaining the techniques. The author is looping over all essays and generating embeddings for each text chunk. This is done in the generateEmbeddings function.
All essay content is stored in scripts/pg.json.
```ts
// scripts/embed.ts
// This response is looped over using the essay content in the generateEmbeddings function
const embeddingResponse = await openai.createEmbedding({
  model: "text-embedding-ada-002",
  input: content
});

// ...

// This parses the essays from the JSON file
(async () => {
  const book: PGJSON = JSON.parse(fs.readFileSync("scripts/pg.json", "utf8"));
  await generateEmbeddings(book.essays);
})();
```
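To make the loop concrete, here is a hedged sketch of what a generateEmbeddings function like the one above might look like; the PGEssay type and the saveChunk helper are my own placeholders for illustration, not the repo's actual code:

```ts
import { Configuration, OpenAIApi } from "openai";

const openai = new OpenAIApi(
  new Configuration({ apiKey: process.env.OPENAI_API_KEY })
);

// Placeholder type and helper, assumed for this sketch.
type PGEssay = { title: string; chunks: { content: string }[] };

async function saveChunk(title: string, content: string, embedding: number[]) {
  // In the real app this would insert the row into the Supabase table.
}

async function generateEmbeddings(essays: PGEssay[]) {
  for (const essay of essays) {
    for (const chunk of essay.chunks) {
      const embeddingResponse = await openai.createEmbedding({
        model: "text-embedding-ada-002",
        input: chunk.content
      });
      // The v3 SDK wraps the API payload in an axios response object,
      // hence data.data: the first element holds our chunk's vector.
      const [{ embedding }] = embeddingResponse.data.data;
      await saveChunk(essay.title, chunk.content, embedding);
    }
  }
}
```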
Then they take the user's search query to generate an embedding and use the result to find the most relevant passages from the essays.
```ts
// pages/api/search.ts
const res = await fetch("https://api.openai.com/v1/embeddings", {
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${apiKey}`
  },
  method: "POST",
  body: JSON.stringify({
    model: "text-embedding-ada-002",
    input
  })
});

// ...
```
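The embedding vector comes back in the response's data array; pulling it out looks roughly like this (shape per OpenAI's embeddings API, error handling omitted):

```ts
// The first element of data holds the embedding for our single input string.
const json: { data: { embedding: number[] }[] } = await res.json();
const [{ embedding }] = json.data;
```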
The comparison is done using cosine similarity across our database of vectors.
```ts
// pages/api/search.ts
const { data: chunks, error } = await supabaseAdmin.rpc("pg_search", {
  query_embedding: embedding,
  similarity_threshold: 0.01, // cosine similarity
  match_count: matches
});
```
The Postgres database, hosted on Supabase, uses the pgvector extension, which Supabase announced just last month.
Results are ranked by similarity score and returned to the user.
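If the RPC didn't already return rows in order, ranking them client-side would be trivial. Here's a hypothetical shape for the returned chunks (the field names are my assumption, not pulled from the repo):

```ts
// Each matched passage comes back with a similarity score we can sort on.
type Chunk = { essay_title: string; content: string; similarity: number };

function rankChunks(chunks: Chunk[]): Chunk[] {
  return [...chunks].sort((a, b) => b.similarity - a.similarity);
}
```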
I enjoyed walking through the code and learning how this works. If I need to correct something, or if you have some insight into the code, please comment. Thanks to McKay for sharing this with us; be sure to give them a follow and check out their other work in AI, like Codewand, AI-powered tools to help your team build software faster.
Also, if you have a project leveraging OpenAI, leave a link in the comments. I'd love to take a look and include it in my 30 days of OpenAI series.
Stay saucy.
Image was generated using Midjourney.