Introducing sqlite-lembed: A SQLite extension for generating text embeddings locally
2024-07-24 by Alex Garcia
tl;dr —
sqlite-lembed
is a SQLite extension for generating text embeddings, meant to work alongsidesqlite-vec
. With a single embeddings model file provided in the.gguf
format, you can generate embeddings using regular SQL functions, and store them directly inside your SQLite database. No extra server, process, or configuration needed!
I've been working on sqlite-vec
for quite some time now - 3 months since I first announced it, More than 7 months since my first prototype, and more than 2 years since my first SQLite vector search attempt. And the initial stable version coming soon, I promise! v0.1.0
is scheduled for next week.
But one weakness of sqlite-vec
compared to other vector storage tools is that you must generate embeddings yourself. Some vector databases have helper functions and wrappers that automatically generate embeddings for you when inserting text.
But this feature never made sense for sqlite-vec
. It's a single C file with no external dependencies. Adding embedding model inference would drastically add scope and make things too complicated.
At the same time, I don't want to pip install openai
or pip install sentence-transformers
every time I want to generate embeddings on some text. I want something that is lightweight, a single binary, and works with SQLite.
So, with the help of llama.cpp
's embeddings support, sqlite-lembed
is born!
¶ Usage
There are a few ways to install sqlite-lembed
- npm install sqlite-lembed
, pip install sqlite-lembed
, gem install sqlite-lembed
, or grabbing pre-compiled extension from the Releases page. Or if you want to directly install and give your IT admins a scare, install with:
curl -L https://github.com/asg017/sqlite-lembed/releases/download/v0.0.1-alpha.4/install.sh | sh
You now have a lembed0.dylib
(MacOS) or lembed0.so
(Linux) file in your current directory!
Now you'll need an embeddings models in .gguf
format. A few open source options include nomic-embed-text-v1.5
and mxbai-embed-large-v1
, but here we will download the smaller and older all-MiniLM-L6-v2
model like so:
curl -L -o all-MiniLM-L6-v2.e4ce9877.q8_0.gguf https://huggingface.co/asg017/sqlite-lembed-model-examples/resolve/main/all-MiniLM-L6-v2/all-MiniLM-L6-v2.e4ce9877.q8_0.gguf
Now we can generate some embeddings! Fire up the sqlite3
CLI and run these setup commands.
.load ./lembed0
INSERT INTO temp.lembed_models(name, model)
select 'all-MiniLM-L6-v2', lembed_model_from_file('all-MiniLM-L6-v2.e4ce9877.q8_0.gguf');
The temp.lembed_model
virtual table lets you "register" models with pure INSERT INTO
statements. The name
field is a unique identifier for a given model, and model
is provided as a path to the .gguf
model, on disk, with the lembed_model_from_file()
function.
Let's try out this new 'all-MiniLM-L6-v2'
model with the lembed()
function.
select lembed(
'all-MiniLM-L6-v2',
'The United States Postal Service is an independent agency...'
); -- X'A402...09C3' (1536 bytes)
That's out first embedding! A 384 dimensional floating point vector (defined as part of the all-MiniLM-L6-v2
model), taking up 1,536 bytes of space with 4 bytes per element.
Now a single embedding of a single sentence isn't that exciting — let's try a larger sample. Since we will be comparing multiple vectors together, let's bring in sqlite-vec
into our project. Again you can npm install
or gem install
or pip install
sqlite-vec
, but if you live dangerously you can install with:
curl -L https://github.com/asg017/sqlite-vec/releases/download/0.0.1-alpha.37/install.sh | sh
Let's create a corpus of some random news headlines and store that in a "regular" SQLite table.
create table articles(
headline text
);
-- Random NPR headlines from 2024-06-04
insert into articles VALUES
('Shohei Ohtani''s ex-interpreter pleads guilty to charges related to gambling and theft'),
('The jury has been selected in Hunter Biden''s gun trial'),
('Larry Allen, a Super Bowl champion and famed Dallas Cowboy, has died at age 52'),
('After saying Charlotte, a lone stingray, was pregnant, aquarium now says she''s sick'),
('An Epoch Times executive is facing money laundering charge');
Ok now let's generate some embeddings! We will store the embedding directly into a new vec0
virtual table. We can always join this new table back with the articles
table for metadata.
.load ./vec0
-- Build a vector table with embeddings of article headlines
create virtual table vec_articles using vec0(
headline_embeddings float[384]
);
insert into vec_articles(rowid, headline_embeddings)
select rowid, lembed('all-MiniLM-L6-v2', headline)
from articles;
Now every headline
in articles
has been embed and stored in vec_articles
. To perform a KNN-style search, we can do:
param set :query 'firearm courtroom'
with matches as (
select
rowid,
distance
from vec_articles
where headline_embeddings match lembed('all-MiniLM-L6-v2', :query)
order by distance
limit 3
)
select
headline,
distance
from matches
left join articles on articles.rowid = matches.rowid;
/*
+--------------------------------------------------------------+------------------+
| headline | distance |
+--------------------------------------------------------------+------------------+
| Shohei Ohtani's ex-interpreter pleads guilty to charges rela | 1.14812409877777 |
| ted to gambling and theft | |
+--------------------------------------------------------------+------------------+
| The jury has been selected in Hunter Biden's gun trial | 1.18380105495453 |
+--------------------------------------------------------------+------------------+
| An Epoch Times executive is facing money laundering charge | 1.27715671062469 |
+--------------------------------------------------------------+------------------+
*/
And there we go! Notice how "firearm courtroom" doesn't appear in any of these headlines, but it can still figure out that "Hunter Biden's gun trial" is related, and the other two justice-related articles appear on top.
So there you have it - text embeddings and vector search, all with the sqlite3
CLI, two extensions, and a single .gguf
file.
¶ Last notes
It is not required to use sqlite-lembed
with sqlite-vec
, or vice-versa. You can use any embeddings provider with sqlite-vec
— the OpenAI API, other JSON endpoints, PyTorch models, etc. As long as your embeddings can be provided as JSON or a compact BLOG format, you're good to go.
Similarly, it is not required to use sqlite-vec
with sqlite-lembed
. You can dump embeddings generated by sqlite-lembed
into any other vector store you like, or in regular SQLite tables with sqlite-vec
.
Also, Windows isn't supported yet. Sorry! Hopefully soon, llama.cpp
does support Windows, but Github Actions can be quite a nightmare. WASM is also not supported yet, but hoping to figure that out in the near future.
And lastly — sqlite-lembed
is still in beta! While sqlite-vec
stabilized on v0.1.0 next week, sqlite-lembed
will be actively developed for the near future. Mostly because the llama.cpp
dependency is also under active deveopment, but I hope that the main SQL API won't change much.