Web3 AI/Machine Learning with GlacierDB
In this tutorial, we introduce the Glacier VectorDB and show DApp developers how to leverage GlacierDB for AI/machine learning.
Concepts
Before following the quickstart, you should have a look at the basic concepts. Here are references to the core ones:
- VectorDB: https://en.wikipedia.org/wiki/Vector_database
- Embedding: https://en.wikipedia.org/wiki/Word_embedding
- LLM: https://en.wikipedia.org/wiki/Large_language_model
- Huggingface: https://huggingface.co/docs/api-inference/quicktour
Prerequisites
- Glacier VectorDB Endpoint: https://greenfield.onebitdev.com/glacier-gateway/
- Huggingface API Token: https://huggingface.co/docs/api-inference/quicktour
- Glacier SDK NPM: @glacier-network/client
- Demo wallet: create a new one that is only for test purposes!
QuickStart
In this quickstart, we store a dataset of programming languages and retrieve documents using vector queries.
Create a GlacierClient
```javascript
import { GlacierClient } from '@glacier-network/client';

const privateKey = `<your-wallet-privateKey>`;
const endpoint = 'https://greenfield.onebitdev.com/glacier-gateway/';
const client = new GlacierClient(endpoint, {
  privateKey,
});
```
Create VectorDB collection
All vector features start from here. We simply define the schema of a collection to enable the vector feature.
```javascript
const schema = {
  title: "programming-lang",
  type: "object",
  properties: {
    name: {
      type: "string",
    },
    nameEmbedding: {
      type: "string",
      vectorIndexOption: {
        type: "knnVector",
        dimensions: 384,
        similarity: "euclidean",
      },
    },
    link: {
      type: "string",
    },
    type: {
      type: "string",
    },
  },
};
```
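Because the schema can't be changed after the collection is created, it's worth checking the vector options client-side before creating it. The helper below is a hypothetical sketch (not part of the Glacier SDK) that enforces the constraints described in this tutorial:

```javascript
// Hypothetical helper, not part of the Glacier SDK: checks each
// vectorIndexOption against the constraints described in this tutorial.
function validateVectorSchema(schema) {
  const errors = [];
  for (const [field, prop] of Object.entries(schema.properties || {})) {
    const opt = prop.vectorIndexOption;
    if (!opt) continue; // not a vector field
    if (opt.type !== 'knnVector') {
      errors.push(`${field}: type must be "knnVector"`);
    }
    if (!Number.isInteger(opt.dimensions) || opt.dimensions < 1 || opt.dimensions > 2048) {
      errors.push(`${field}: dimensions must be an integer between 1 and 2048`);
    }
    if (!['euclidean', 'cosine', 'dotProduct'].includes(opt.similarity)) {
      errors.push(`${field}: unsupported similarity "${opt.similarity}"`);
    }
  }
  return errors;
}
```

Run it against the schema above before calling the SDK; an empty array means the vector options look valid.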
`nameEmbedding` is a vector field that supports vector queries for `name`. We use the `vectorIndexOption` to describe the vector feature. The `Embedding` suffix is not required, but it's a good convention to name vector fields with it.
- We don't support updating the schema after the collection has been created. This feature is under development!
`vectorIndexOption` defines the vector options:
- Type: Value must be `knnVector`.
- Dimensions: Number of vector dimensions. This value can't be greater than 2048. Here we use 384 dimensions, as produced by the embedding model all-MiniLM-L6-v2. The dimensions must match the embedding model you are using. If you are using OpenAI's text-embedding-ada-002 or text-embedding-3-small, the dimensions are 1536.
- Similarity: Vector similarity function to use to search for the top K-nearest neighbors. Value can be one of the following: `euclidean`, `cosine`, `dotProduct`.
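To build intuition for the similarity options, here is a plain-JavaScript sketch of the three functions (the actual scoring happens inside GlacierDB; these are the standard textbook definitions). Note that `euclidean` is a distance (smaller means closer), while `cosine` and `dotProduct` are similarities (larger means closer):

```javascript
// Textbook definitions of the three similarity functions, shown only
// to illustrate how each one scores vector closeness.
function euclidean(a, b) {
  // Distance: smaller values mean the vectors are closer.
  return Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));
}

function dotProduct(a, b) {
  // Similarity: larger values mean the vectors are closer.
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

function cosine(a, b) {
  // Dot product normalized by vector lengths; ranges from -1 to 1.
  const norm = (v) => Math.sqrt(dotProduct(v, v));
  return dotProduct(a, b) / (norm(a) * norm(b));
}
```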
Load Embedding Documents
We use the Huggingface API to embed documents; any other embedding tool works as well.
- Model: all-MiniLM-L6-v2
- API: https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/all-MiniLM-L6-v2
```javascript
// namespace, dataset and collection are the names of the namespace,
// dataset and collection you created above.
let doc = {
  'link': 'http://en.wikipedia.org/wiki/A-0_System',
  'type': 'ComputerLanguage',
  'name': 'A-0 System',
  'nameEmbedding': await getEmbeddingFromHF('A-0 System'),
};
let coll = client.namespace(namespace).dataset(dataset).collection(collection);
let resp = await coll.insertOne(doc);
```
The value of `nameEmbedding` must be an embedding array. If the value is a string, the query will behave incorrectly.
```javascript
const axios = require('axios');
const hf_token = '<your-huggingface-api-token>';

async function getEmbeddingFromHF(input) {
  const embedding_url = 'https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/all-MiniLM-L6-v2';
  let response = await axios.post(embedding_url, {
    inputs: input,
  }, {
    headers: {
      'Authorization': `Bearer ${hf_token}`,
      'Content-Type': 'application/json',
    },
  });
  if (response.status === 200) {
    return response.data;
  } else {
    throw new Error(`Failed to get embedding. Status code: ${response.status}`);
  }
}
```
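Since a string value silently breaks vector queries, a small guard before `insertOne` can catch mistakes early. This helper is a hypothetical addition, not part of the Glacier SDK:

```javascript
// Hypothetical guard, not part of the Glacier SDK: verify a value is a
// finite-number array of the expected dimension before inserting it.
function assertEmbedding(value, dimensions) {
  const ok = Array.isArray(value)
    && value.length === dimensions
    && value.every((x) => typeof x === 'number' && Number.isFinite(x));
  if (!ok) {
    throw new Error(`expected a numeric embedding array of length ${dimensions}`);
  }
  return value;
}
```

For example, `assertEmbedding(await getEmbeddingFromHF('A-0 System'), 384)` would throw if the API returned an error string instead of a 384-dimension vector.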
VectorQuery Documents
Now, we can query the documents using vectors.
```javascript
const embedding = await getEmbeddingFromHF(text);
let result = await coll.find({
  'numCandidates': 10,
  'vectorPath': 'nameEmbedding',
  'queryVector': embedding,
}).toArray();
```
Only the following options can be used for a vector query:
- numCandidates: Number of nearest neighbors to use during the search. Value must be less than or equal to (<=) 10000.
- vectorPath: The vector field defined by the schema.
- queryVector: Array of numbers that represents the query vector. The array size must match the number of vector dimensions specified in the schema.
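These constraints can also be checked client-side before sending the query. Below is a hypothetical pre-flight helper (not part of the Glacier SDK), assuming the schema object defined earlier:

```javascript
// Hypothetical pre-flight check, not part of the Glacier SDK: mirrors
// the vector query constraints listed above.
function validateVectorQuery(query, schema) {
  const { numCandidates, vectorPath, queryVector } = query;
  if (!Number.isInteger(numCandidates) || numCandidates < 1 || numCandidates > 10000) {
    throw new Error('numCandidates must be an integer <= 10000');
  }
  const prop = schema.properties[vectorPath];
  const opt = prop && prop.vectorIndexOption;
  if (!opt) {
    throw new Error(`"${vectorPath}" is not a vector field in the schema`);
  }
  if (!Array.isArray(queryVector) || queryVector.length !== opt.dimensions) {
    throw new Error(`queryVector must be an array of ${opt.dimensions} numbers`);
  }
}
```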