AI powered search with OpenAI embeddings

Some of the best implementations of AI take existing knowledge bases and make them searchable through prompts. While scrolling through Twitter, I came across a tweet about paul-graham-gpt.


This content originally appeared on DEV Community and was authored by Brian Douglas

In this 30 days of OpenAI series, I am looking at AI projects and their code to help demystify how any dev can build AI-generated projects.

Paul Graham is best known for his work on the programming language Lisp and for cofounding the influential startup accelerator and seed capital firm Y Combinator. He also writes many essays that provide a lot of knowledge for current and future startup founders.

McKay Wrigley built a tool called paul-graham-gpt (you can use it live here) to navigate all of PaulG's essays, and I am excited to jump in and take a look at how he did this.

mckaywrigley / paul-graham-gpt

AI search & chat for all of Paul Graham’s essays.

Paul Graham GPT

AI-powered search and chat for Paul Graham's essays.

All code & data used is 100% open-source.

Dataset

The dataset is a CSV file containing all text & embeddings used.

Download it here.

I recommend getting familiar with fetching, cleaning, and storing data as outlined in the scraping and embedding scripts below, but feel free to skip those steps and just use the dataset.

How It Works

Paul Graham GPT provides 2 things:

  1. A search interface.
  2. A chat interface.

Search

Search was created with OpenAI Embeddings (text-embedding-ada-002).

First, we loop over the essays and generate embeddings for each chunk of text.

Then in the app we take the user's search query, generate an embedding, and use the result to find the most similar passages from the book.

The comparison is done using cosine similarity across our database of vectors.

Our database is a Postgres…

How was mckaywrigley/paul-graham-gpt made?

paul-graham-gpt is described as AI search & chat for all of Paul Graham’s essays. When looking closer at the code, it uses the embeddings API from OpenAI.

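According to the docs, OpenAI's text embeddings measure the relatedness of text strings, and are commonly used for search, clustering, recommendations, anomaly detection, diversity measurement, and classification.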

This is my first look at embeddings. My previous post covered aicommits, which used completions instead. Based on my reading of the docs, embeddings are useful for searching existing data by relevance. The docs use Amazon food reviews as the example: you might be looking for reviews of condiments, and the ones relevant to you are the negative reviews. Because embeddings capture the meaning of the review text, tone included, they can be used alongside the ratings to find them.

That is my best explanation after a first look, but check out the embeddings use cases for more context.

How does it work?

The project's README does a great job explaining the techniques. The author is looping over all essays and generating embeddings for each text chunk. This is done in the generateEmbeddings function.

All essay content is stored in scripts/pg.json.

// scripts/embed.ts

// This call runs for each chunk of essay content inside the generateEmbeddings function
const embeddingResponse = await openai.createEmbedding({
  model: "text-embedding-ada-002",
  input: content
});

...

// This is parsing the essays from the JSON
(async () => {
  const book: PGJSON = JSON.parse(fs.readFileSync("scripts/pg.json", "utf8"));

  await generateEmbeddings(book.essays);
})();
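
The chunking step itself is not shown in the snippet above. As a minimal sketch, assuming essays are split into sentences and packed into chunks under a character budget (the repo may well split by token count instead), it could look something like this. `chunkEssay` and `maxChars` are hypothetical names for illustration, not taken from the project.

```typescript
// Hypothetical helper: pack sentences into chunks, aiming for at most
// maxChars characters each. A single sentence longer than the budget
// is kept whole rather than cut mid-sentence.
function chunkEssay(content: string, maxChars = 200): string[] {
  // Grab runs of text up to and including their closing punctuation.
  const matches: string[] = content.match(/[^.!?]+[.!?]*/g) ?? [];
  const sentences = matches.map((s) => s.trim()).filter((s) => s.length > 0);

  const chunks: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (candidate.length > maxChars && current) {
      // Adding this sentence would exceed the budget: close the chunk.
      chunks.push(current);
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const sample =
  "Startups are hard. Most fail. The ones that succeed make something people want.";
const sampleChunks = chunkEssay(sample, 40);
console.log(sampleChunks);
```

Each chunk would then be passed as `input` to the embeddings call, so one essay ends up with several vectors rather than one.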

Then they take the user's search query to generate an embedding and use the result to find the most relevant passages from the essays.

// pages/api/search.ts

const res = await fetch("https://api.openai.com/v1/embeddings", {
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${apiKey}`
  },
  method: "POST",
  body: JSON.stringify({
    model: "text-embedding-ada-002",
    input
  })
});

...
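
The elided code presumably reads the vector out of the JSON response. As a sketch: the embeddings endpoint returns an object whose `data` array holds one entry per input, each with an `embedding` array of numbers. The `extractEmbedding` helper and the `EmbeddingResponse` type are names made up for this example, not the project's code.

```typescript
// Minimal shape of the JSON returned by POST /v1/embeddings.
// Only the fields used here are typed.
interface EmbeddingResponse {
  data: { index: number; embedding: number[] }[];
}

// Hypothetical helper: pull the first embedding vector out of the response.
function extractEmbedding(json: EmbeddingResponse): number[] {
  const first = json.data[0];
  if (!first || first.embedding.length === 0) {
    throw new Error("Embeddings response contained no vector");
  }
  return first.embedding;
}

// Stand-in response for illustration. A real response from
// text-embedding-ada-002 carries 1536 numbers per input.
const fakeResponse: EmbeddingResponse = {
  data: [{ index: 0, embedding: [0.1, -0.2, 0.3] }],
};
const queryEmbedding = extractEmbedding(fakeResponse);
console.log(queryEmbedding.length);
```

In the real handler the argument would be the result of `await res.json()`.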

The comparison is done using cosine similarity across the database of vectors.

// pages/api/search.ts

const {
  data: chunks,
  error
} = await supabaseAdmin.rpc("pg_search", {
  query_embedding: embedding,
  similarity_threshold: 0.01, // cosine similarity
  match_count: matches
});

The Postgres database is hosted on Supabase and uses the pgvector extension. Supabase announced support for it just last month.

Results are ranked by similarity score and returned to the user.
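
In the project this ranking happens inside Postgres through the `pg_search` function. To make the idea concrete, here is a rough in-memory equivalent, assuming each chunk carries its text and its embedding. `cosineSimilarity`, `searchChunks`, and the `Chunk` type are illustrative names rather than the repo's code, and the tiny two-dimensional vectors stand in for real embeddings.

```typescript
interface Chunk {
  content: string;
  embedding: number[];
}

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have equal length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk, drop weak matches, and return the best ones first.
function searchChunks(
  queryEmbedding: number[],
  chunks: Chunk[],
  similarityThreshold: number,
  matchCount: number
): { content: string; similarity: number }[] {
  return chunks
    .map((chunk) => ({
      content: chunk.content,
      similarity: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .filter((result) => result.similarity > similarityThreshold)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, matchCount);
}

const demoChunks: Chunk[] = [
  { content: "same direction", embedding: [1, 0] },
  { content: "unrelated", embedding: [0, 1] },
  { content: "partly related", embedding: [1, 1] },
];
const demoResults = searchChunks([1, 0], demoChunks, 0.01, 2);
console.log(demoResults.map((r) => r.content));
```

Scanning every vector like this is fine for a demo, but a database extension such as pgvector is the better fit once there are thousands of chunks.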

I enjoyed walking through the code and learning how this works. If I need to correct something, or if you have some insight into the code, please comment. Thanks to McKay for sharing this with us. Be sure to give them a follow and check out their other work in AI, including Codewand: AI-powered tools to help your team build software faster.

Also, if you have a project leveraging OpenAI, leave a link in the comments. I'd love to take a look and include it in my 30 days of OpenAI series.

Stay saucy.

Image was generated using Midjourney.



Brian Douglas | Sciencx (2023-03-04T14:57:11+00:00) AI powered search with OpenAI embeddings. Retrieved from https://www.scien.cx/2023/03/04/ai-powered-search-with-openai-embeddings/
