Web3 AI/Machine Learning with GlacierDB

In this tutorial, we introduce the Glacier VectorDB and show DApp developers how to leverage GlacierDB for AI/machine learning.

Concepts

Before following the quickstart, you should have a look at the basic concepts. Here we reference the core concepts for you.

Prerequisites

QuickStart

In this quickstart, we store a dataset of programming languages and retrieve documents using vector queries.

Create a GlacierClient

// Assumed import; adjust the package name to match your installed GlacierDB SDK.
import { GlacierClient } from '@glacier-network/client';

const privateKey = `<your-wallet-privateKey>`;
const endpoint = 'https://greenfield.onebitdev.com/glacier-gateway-vector/';
const client = new GlacierClient(endpoint, {
  privateKey,
});

Create VectorDB collection

All vector features start from here. We simply define the schema of a collection to enable the vector feature.

const schema = {
  title: "programming-lang",
  type: "object",
  properties: {
    name: {
      type: "string",
    },
    nameEmbedding: {
      type: "string",
      vectorIndexOption: {
        "type": "knnVector",
        "dimensions": 384,
        "similarity": "euclidean",
      },
    },
    link: {
      type: "string",
    },
    type: {
      type: "string",
    },
  },
};
  • nameEmbedding is a vector field that supports vector queries over name. We use the vectorIndexOption to describe the vector feature. The Embedding suffix is not required, but it is a good naming convention for vector fields.
  • Updating the schema after the collection has been created is not supported yet. This feature is under development!

VectorIndexOption: define vector options

  • type: The value must be knnVector.

  • dimensions: The number of vector dimensions. This value can't be greater than 2048. Here we use 384 dimensions, as produced by the embedding model all-MiniLM-L6-v2.

    • The dimensions must match the embedding model you are using. If you are using OpenAI's text-embedding-ada-002 or text-embedding-3-small, the dimensions value is 1536.

  • similarity: The vector similarity function used to search for the top K nearest neighbors. The value can be one of the following: euclidean, cosine, dotProduct.
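For intuition, the three similarity functions can be sketched in plain JavaScript as below (GlacierDB computes these server-side; this is only illustrative). Note that with euclidean, a smaller distance means a closer match, while cosine and dotProduct score higher for closer matches.

```javascript
// Euclidean distance: square root of the sum of squared coordinate differences.
function euclidean(a, b) {
  return Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));
}

// Dot product: sum of element-wise products.
function dotProduct(a, b) {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

// Cosine similarity: dot product normalized by the vector magnitudes.
function cosine(a, b) {
  const norm = (v) => Math.sqrt(dotProduct(v, v));
  return dotProduct(a, b) / (norm(a) * norm(b));
}
```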

Load Embedding Documents

We use the Hugging Face API to embed the documents; any other embedding tool works as well.

Model: all-MiniLM-L6-v2
API: https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/all-MiniLM-L6-v2

let doc = {
  'link': 'http://en.wikipedia.org/wiki/A-0_System',
  'type': 'ComputerLanguage',
  'name': 'A-0 System',
  'nameEmbedding': await getEmbeddingFromHF('A-0 System'),
};

let coll = client.namespace(namespace).dataset(dataset).collection(collection);
let resp = await coll.insertOne(doc);
  • The value of nameEmbedding must be an embedding array. If the value is a string, the query will behave incorrectly.
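To catch that mistake before it reaches the database, a small guard like the hypothetical `assertEmbedding` helper below (not part of the GlacierDB SDK) can fail fast when the value is not a numeric array of the dimensions declared in the schema:

```javascript
// Hypothetical guard: throw if `value` is not a numeric array of the given
// length, so a raw string never reaches insertOne.
function assertEmbedding(value, dimensions) {
  if (
    !Array.isArray(value) ||
    value.length !== dimensions ||
    !value.every((x) => typeof x === 'number' && Number.isFinite(x))
  ) {
    throw new Error(`expected a numeric array of length ${dimensions}`);
  }
  return value;
}
```

For example, `assertEmbedding(await getEmbeddingFromHF('A-0 System'), 384)` would reject a string response before the insert.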

// axios and a Hugging Face access token are assumed to be available, e.g.:
// const axios = require('axios');
// const hf_token = '<your-huggingface-token>';
async function getEmbeddingFromHF(input) {
  const embedding_url = "https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/all-MiniLM-L6-v2";
  let response = await axios.post(embedding_url, {
    inputs: input,
  }, {
    headers: {
      'Authorization': `Bearer ${hf_token}`,
      'Content-Type': 'application/json'
    }
  });

  if (response.status === 200) {
    return response.data;
  } else {
    throw new Error(`Failed to get embedding. Status code: ${response.status}`);
  }
}

VectorQuery Documents

Now, we can query the documents using vectors.

const embedding = await getEmbeddingFromHF(text);

let result = await coll.find({
  'numCandidates': 10,
  'vectorPath': 'nameEmbedding',
  'queryVector': embedding,
}).toArray();

Only the following options can be used for a vector query:

  • numCandidates: The number of nearest neighbors to use during the search. The value must be less than or equal to (<=) 10000.
  • vectorPath: The vector field defined by the schema.
  • queryVector: An array of numbers representing the query vector. The array size must match the number of vector dimensions specified in the schema.
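The constraints above can be enforced client-side before sending the query. The `buildVectorQuery` helper below is hypothetical (not part of the SDK) and simply validates the options against the documented limits:

```javascript
// Hypothetical helper: build the options object for a vector query and
// enforce the documented constraints before sending it.
function buildVectorQuery(vectorPath, queryVector, numCandidates, dimensions) {
  if (numCandidates > 10000) {
    throw new Error('numCandidates must be <= 10000');
  }
  if (queryVector.length !== dimensions) {
    throw new Error(`queryVector must have ${dimensions} dimensions`);
  }
  return { numCandidates, vectorPath, queryVector };
}
```

With the schema above, a call would look like `coll.find(buildVectorQuery('nameEmbedding', embedding, 10, 384)).toArray()`.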

Reference