Distributed embeddings cluster

This content originally appeared on DEV Community and was authored by David Mezzetti

This article is part of a tutorial series on txtai, an AI-powered search engine.

The txtai API is a web-based service backed by FastAPI. All txtai functionality is available via the API. The API can also cluster multiple embeddings indices into a single logical index to horizontally scale over multiple nodes.

This notebook installs the txtai API and shows an example of building an embeddings cluster.

Install dependencies

Install txtai and all dependencies.

pip install txtai

Start distributed embeddings cluster

First we'll start multiple API instances that will serve as embeddings index shards. Each shard stores a subset of the indexed data and these shards work in tandem to form a single logical index.

Then we'll start the main API instance that clusters the shards together into a logical instance.

The API instances are all started in the background.

import os
os.chdir("/content")

index.yml (configuration for each shard):

writable: true

# Embeddings settings
embeddings:
    method: transformers
    path: sentence-transformers/bert-base-nli-mean-tokens

cluster.yml (configuration for the main instance):

# Embeddings cluster
cluster:
    shards:
        - http://127.0.0.1:8001
        - http://127.0.0.1:8002

# Start embeddings shards
CONFIG=index.yml nohup uvicorn --port 8001 "txtai.api:app" &> shard-1.log &
CONFIG=index.yml nohup uvicorn --port 8002 "txtai.api:app" &> shard-2.log &

# Start main instance
CONFIG=cluster.yml nohup uvicorn --port 8000 "txtai.api:app" &> main.log &

# Wait for startup
sleep 90
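
A fixed sleep 90 may wait longer than needed, or not long enough on a slow machine. One possible alternative, sketched below, polls each instance until it answers. The helper names (http_probe, wait_until_ready) are hypothetical and not part of txtai. Probing /count assumes that endpoint responds with HTTP 200 once an instance has finished loading.

```python
import http.client
import time
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError are subclasses of OSError
        return False


def wait_until_ready(urls, probe=http_probe, timeout=90.0, interval=1.0, sleep=time.sleep):
    """Poll every URL until all respond or the timeout expires."""
    pending = list(urls)
    deadline = time.monotonic() + timeout
    while pending:
        # Keep only the instances that are still not answering
        pending = [url for url in pending if not probe(url)]
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
    return True


# Example call (assumption: /count is a suitable readiness check):
# wait_until_ready([f"http://127.0.0.1:{port}/count" for port in (8001, 8002, 8000)])
```

The probe and sleep functions are passed in as parameters so the loop can be exercised without a running server.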

Python

Let's first try the cluster out directly in Python. The code below aggregates the two shards into a single cluster and executes actions against the cluster.

from txtai.api import Cluster

cluster = Cluster({"shards": ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]})

data = [
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
    "The National Park Service warns against sacrificing slower friends in a bear attack",
    "Maine man wins $1M from $25 lottery ticket",
    "Make huge profits without work, earn up to $100,000 a day",
]

# Index data
cluster.add([{"id": x, "text": row} for x, row in enumerate(data)])
cluster.index()

# Test search
uid = cluster.search("feel good story", 1)[0]["id"]
print("Query: feel good story\nResult:", data[uid])
Query: feel good story
Result: Maine man wins $1M from $25 lottery ticket

JavaScript

Next, let's run the same logic through the API using JavaScript.

npm install txtai

For this example, we'll clone the txtai.js project to import the example build configuration.

git clone https://github.com/neuml/txtai.js

Run cluster.js

The following script is a JavaScript version of the logic above.

import {Embeddings} from "txtai";
import {sprintf} from "sprintf-js";

const run = async () => {
    try {
        let embeddings = new Embeddings(process.argv[2]);

        let data  = ["US tops 5 million confirmed virus cases",
                     "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
                     "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
                     "The National Park Service warns against sacrificing slower friends in a bear attack",
                     "Maine man wins $1M from $25 lottery ticket",
                     "Make huge profits without work, earn up to $100,000 a day"];

        console.log();
        console.log("Querying an Embeddings cluster");
        console.log(sprintf("%-20s %s", "Query", "Best Match"));
        console.log("-".repeat(50));

        for (let query of ["feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"]) {
            let results = await embeddings.search(query, 1);
            let uid = results[0].id;
            console.log(sprintf("%-20s %s", query, data[uid]))
        }
    }
    catch (e) {
        console.trace(e);
    }
};

run();

Build and run cluster.js

cd txtai.js/examples/node
npm install
npm run build

Next, let's run the code against the main cluster URL.

node dist/cluster.js http://127.0.0.1:8000
Querying an Embeddings cluster
Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
health               US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
north america        US tops 5 million confirmed virus cases
dishonest junk       Make huge profits without work, earn up to $100,000 a day

For the shared query, "feel good story", the JavaScript program returns the same result as the Python code above. Each query runs against both shards in the cluster, and the results are aggregated into a single ranked list.
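
The aggregation step can be pictured as a top-k merge: each shard returns its own best matches with scores, and the cluster keeps the highest-scoring ones overall. The sketch below illustrates that idea only. The merge_results helper is hypothetical, the scores are made up for illustration, and txtai's actual implementation may differ.

```python
import heapq


def merge_results(shard_results, limit):
    """Merge per-shard hits into one list of the top `limit` hits, best score first."""
    combined = [hit for hits in shard_results for hit in hits]
    return heapq.nlargest(limit, combined, key=lambda hit: hit["score"])


# Hypothetical per-shard answers for a single query (scores are invented)
shard_1 = [{"id": 4, "score": 0.08}, {"id": 0, "score": 0.02}]
shard_2 = [{"id": 5, "score": 0.05}, {"id": 1, "score": 0.01}]

top = merge_results([shard_1, shard_2], limit=1)
print(top)
```

A shard's local best match only survives the merge if it also outscores the best matches from every other shard.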

The same queries can also be run against each shard directly to see what each one returns on its own.

node dist/cluster.js http://127.0.0.1:8001
Querying an Embeddings cluster
Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Beijing mobilises invasion craft along coast as Taiwan tensions escalate
health               US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             Beijing mobilises invasion craft along coast as Taiwan tensions escalate
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
north america        US tops 5 million confirmed virus cases
dishonest junk       Beijing mobilises invasion craft along coast as Taiwan tensions escalate
node dist/cluster.js http://127.0.0.1:8002
Querying an Embeddings cluster
Query                Best Match
--------------------------------------------------
feel good story      Make huge profits without work, earn up to $100,000 a day
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
health               Make huge profits without work, earn up to $100,000 a day
war                  Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Make huge profits without work, earn up to $100,000 a day
north america        Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
dishonest junk       Make huge profits without work, earn up to $100,000 a day

Note the differences: each shard can only return matches from the subset of records it stores. The commands below run a count against the full cluster and each shard to show how many records each one holds.

curl http://127.0.0.1:8000/count
printf "\n"
curl http://127.0.0.1:8001/count
printf "\n"
curl http://127.0.0.1:8002/count
6
3
3
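
The even 3/3 split, together with the per-shard results above (ids 0, 2 and 4 on port 8001; ids 1, 3 and 5 on port 8002), is consistent with documents being routed by id modulo the number of shards. That routing rule is an assumption inferred from the output, not something this article states. A minimal sketch of it, using a hypothetical route helper:

```python
def route(documents, shard_count):
    """Assign each document to a shard using id modulo shard count (assumed rule)."""
    batches = [[] for _ in range(shard_count)]
    for document in documents:
        batches[document["id"] % shard_count].append(document)
    return batches


# Six documents with integer ids 0..5, as in the example above
documents = [{"id": x} for x in range(6)]
batches = route(documents, 2)

for number, batch in enumerate(batches, start=1):
    ids = [document["id"] for document in batch]
    print(f"shard {number}: {len(batch)} records, ids {ids}")
```

Under this rule, sequential integer ids spread evenly across shards. Ids that are not integers would need a different scheme, such as hashing them first.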

This notebook showed how a distributed embeddings cluster can be created with txtai. This example can be further scaled out on Kubernetes with StatefulSets, which will be covered in a future tutorial.

