Alex Garcia's Blog

sqlite-vec now supports metadata columns and filtering

2024-11-20 by Alex Garcia

tl;dr — sqlite-vec, a SQLite extension for vector search, now supports metadata columns, auxiliary columns, and partitioning in vec0 virtual tables! You can use these to store metadata like user_id or created_at fields, add additional WHERE clauses in KNN queries, and make certain selective queries much faster. Try it out!


As of the latest v0.1.6 release of sqlite-vec, you can now store non-vector data in vec0 virtual tables! For example:

create virtual table vec_articles using vec0(

  article_id integer primary key,

  -- Vector text embedding of the `headline` column, with 384 dimensions
  headline_embedding float[384],

  -- Partition key, internally shard vector index on article published year
  year integer partition key,

  -- Metadata columns, can appear in `WHERE` clause of KNN queries
  news_desk text,
  word_count integer,
  pub_date text,

  -- Auxiliary columns, unindexed but fast lookups
  +headline text,
  +url text
);

Here we are storing a New York Time article headlines dataset from the past 30 years, where we embed the headlines with mixedbread-ai/mxbai-embed-xsmall-v1.

If we wanted to see the closest related headlines to 'pandemic' on article published in 2020 by the 'Sports' or 'Business' new desk with more than 500 but less than 1000 words, we can perform a KNN query like so:

select
  article_id,
  headline,
  news_desk,
  word_count,
  url,
  pub_date,
  distance
from vec_articles
where headline_embedding match lembed('pandemic')
  and k = 8
  and year = 2020
  and news_desk in ('Sports', 'Business')
  and word_count between 500 and 1000;
┌────────────┬──────────────────────────────────────────────────────────────────────┬───────────┬────────────┬─────────────────────────────┬──────────────────────────┬───────────┐
│ article_id │ headline                                                             │ news_desk │ word_count │ url                         │ pub_date                 │ distance  │
├────────────┼──────────────────────────────────────────────────────────────────────┼───────────┼────────────┼─────────────────────────────┼──────────────────────────┼───────────┤
│    2911716 │ The Pandemic’s Economic Damage Is Growing                            │ Business  │        910 │ https://www.nytimes.com/... │ 2020-07-07T18:12:40+0000 │ 0.8928120 │
│    2892929 │ As Coronavirus Spreads, Olympics Face Ticking Clock and a Tough Call │ Sports    │        987 │ https://www.nytimes.com/... │ 2020-03-06T01:34:36+0000 │ 0.9608180 │
│    2932041 │ The Pandemic Is Already Affecting Next Year’s Sports Schedule        │ Sports    │        620 │ https://www.nytimes.com/... │ 2020-11-11T13:56:25+0000 │ 0.9802038 │
│    2915381 │ The Week in Business: Getting Rich Off the Pandemic                  │ Business  │        814 │ https://www.nytimes.com/... │ 2020-08-02T11:00:03+0000 │ 1.0064692 │
│    2896043 │ The Coronavirus and the Postponement of the Olympics, Explained      │ Sports    │        798 │ https://www.nytimes.com/... │ 2020-03-25T17:45:58+0000 │ 1.0115833 │
│    2898566 │ Robots Welcome to Take Over, as Pandemic Accelerates Automation      │ Business  │        871 │ https://www.nytimes.com/... │ 2020-04-10T09:00:27+0000 │  1.019637 │
│    2898239 │ The Pandemic Feeds Tech Companies’ Power                             │ Business  │        784 │ https://www.nytimes.com/... │ 2020-04-08T16:43:13+0000 │ 1.0200014 │
│    2929224 │ In M.L.S., the Pandemic Changes the Playoff Math                     │ Sports    │        859 │ https://www.nytimes.com/... │ 2020-10-29T17:09:10+0000 │ 1.0238885 │
└────────────┴──────────────────────────────────────────────────────────────────────┴───────────┴────────────┴─────────────────────────────┴──────────────────────────┴───────────┘

Here we used sqlite-lembed to embed our query, but any other embeddings provider could be used!

We can reference those metadata columns and parition key columns in the WHERE clause of the KNN query, and get the exact results we want!

Now, what's the difference between metadata, partition key, and auxiliary columns?

Metadata columns for WHERE clause filtering

Metadata columns are declared with normal column declartions in the vec0 constructor. Metadata columns are stored and indexed alongside vectors, and can appear in the WHERE clause of KNN queries.

create virtual table vec_articles using vec0(
  article_id integer primary key,
  headline_embedding float[384],
  news_desk text,
  word_count integer,
  pub_date text
);

select
  article_id,
  headline,
  news_desk,
  word_count,
  pub_date,
  distance
from vec_articles
where headline_embedding match lembed('new york city housing')
  and k = 20
  and news_desk = 'Metro'
  and word_count < 1000
  and pub_date between '2004-01-20' and '2009-01-20';
┌────────────┬──────────────────────────────────────────────────────────────────────┬───────────┬────────────┬──────────────────────────┬────────────────────┐
│ article_id │ headline                                                             │ news_desk │ word_count │ pub_date                 │ distance           │
├────────────┼──────────────────────────────────────────────────────────────────────┼───────────┼────────────┼──────────────────────────┼────────────────────┤
│    1717598 │ Manhattan: City to Expand Housing Program                            │ Metro     │         83 │ 2007-02-28T05:00:00+0000 │ 0.7736235857009888 │
│    1607183 │ Manhattan: More Money for Housing                                    │ Metro     │         96 │ 2006-06-16T04:00:00+0000 │ 0.7818768620491028 │
│                                                                                  ...                                                                       │
│    1772158 │ Ask About New York Architecture, On Screen and Off                   │ Metro     │        241 │ 2007-09-17T18:25:57+0000 │  0.930429220199585 │
│    1673007 │ Manhattan: City Balances Budget for 26th Year                        │ Metro     │         87 │ 2006-11-01T05:00:00+0000 │ 0.9327330589294434 │
│    1616702 │ Little Shift in Prices of Manhattan Apartments                       │ Metro     │        615 │ 2006-07-06T04:00:00+0000 │ 0.9354249238967896 │
└────────────┴──────────────────────────────────────────────────────────────────────┴───────────┴────────────┴──────────────────────────┴────────────────────┘

There we retrieved the 20 most related article headlines to 'new york city housing', published by the 'Metro' news desk, with less than 1000 words, published during the George W Bush administration.

Metadata columns can be boolean, integer, floats, or text values. More types like BLOBs, dates, and UUID/ULIDs are coming soon!

Only a subset of operators are supported during metadata filtering, including:

Notably absent: REGEXP, LIKE, GLOB, and other custom scalar functions. Also NULL values are not supported yet,

Partition keys for faster WHERE clause filtering

Now the above query was actually a bit slow! There are 3 million rows in the table, and metadata filters need to visit every single row to do a comparison. Metadata comparison are quite fast and built for fast filtering, but they have their limits.

But notice how we only wanted a small subset of values – between '2004-01-20' and '2009-01-20' is only 5 years out of 30 years of data. We can tell the vec0 virtual table to internally shard the vector index on a given key, using partition keys!

create virtual table vec_articles using vec0(
  article_id integer primary key,
  headline_embedding float[384],

  -- shard the vector index based on published year
  year integer partition key,

  news_desk text,
  word_count integer,
  pub_date text
);

select
  article_id,
  headline,
  news_desk,
  word_count,
  pub_date,
  distance
from vec_articles
where headline_embedding match lembed('new york city housing')
  and k = 20
  -- narrow search to these years only
  and year between 2004 and 2009
  and news_desk = 'Metro'
  and word_count < 1000
  -- finer filtering for exact dates we care about
  and pub_date between '2004-01-20' and '2009-01-20';

This KNN query returns the same exact results as the one above - but is 3x faster! This is because internally, vectors are stored based on the year value of its row. In that KNN query, sqlite-vec will recognize constraints on partition keys, and quickly pre-filter rows before any vectors are compared.

But beware! It's easy to accidentally over-shard a vector index on the wrong values and cause performance issues. Partition keys are great for date-based items like year or month, particulary when each unique partition key value has 100's or 1000's of vectors. They are also great for user IDs or document IDs, for "per-user" or "per-document" vector indexes.

Partition key columns can only be TEXT or INTEGER values, file an issue if you want to see some other type support. Currently column in (...) constraints are not supported for partition key columns, but will be soon!

Auxiliary columns

Some columns never need to be indexed! You can always store addtionally SELECT-only metadata in separate tables and do a JOIN yourself, or you can use auxiliary columns:

create virtual table vec_articles using vec0(
  article_id integer primary key,
  headline_embedding float[384],
  +headline text,
  +url text
);

select
  article_id,
  headline,
  url,
  distance
from vec_articles
where headline_embedding match lembed('dodgers game')
  and k = 20;
┌────────────┬─────────────────────────────────────────────────────────────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────────────┐
│ article_id │ headline                                                                            │ url                                                                                                                               │ distance           │
├────────────┼─────────────────────────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────┤
│    1896278 │ Attention Dodgers Fans: There’s a Game Tonight                                      │ https://bats.blogs.nytimes.com/2008/10/15/attention-dodgers-fans-theres-a-game-tonight/                                           │ 0.6733786463737488 │
│    2556896 │ Dodgers, in Flurry of Activity, Move to Revamp Their Infield                        │ https://www.nytimes.com/2014/12/11/sports/baseball/mlb-jimmy-rollins.html                                                         │ 0.7796685099601746 │
│    2382487 │ Keeping Up With the Dodgers                                                         │ https://www.nytimes.com/2012/12/15/sports/angels-keeping-up-with-the-dodgers-leading-off.html                                     │ 0.7849781513214111 │
│    2585169 │ New Life for the Dodgers’ Old Digs                                                  │ https://www.nytimes.com/slideshow/2015/04/19/sports/baseball/20150419DODGERTOWN.html                                              │ 0.7894293665885925 │
│    1032111 │ Not Dodgers II, but It's Baseball; The Game Is Back in Brooklyn, on a Smaller Scale │ https://www.nytimes.com/2001/06/23/nyregion/not-dodgers-ii-but-it-s-baseball-the-game-is-back-in-brooklyn-on-a-smaller-scale.html │ 0.7978747487068176 │
└────────────┴─────────────────────────────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────┘

Auxiliary columns are denoted by a + prefix in the column definition, modeled after the same feature in the SQLite R*Tree extension. These columns are unindex, stored in a separate internal table and JOIN'ed at SELECT time. They cannot appear in a KNN WHERE query, as performance would worsen dramatically.

But it saves you from dealing with additional JOINs yourself! They are especially great for longer TEXT or BLOB values.

Roadmap and the future of sqlite-vec

Metadata column support is the biggest update to sqlite-vec since the initial v0.1.0 launch 3 months ago, but I have a lot planned for the project!

First off: ANN indexes. The vec0 virtual table is brute-force only, which really slows down KNN queries on larger datasets. There are strategies like binary quantization or Matryoshka embeddings that can help, but sqlite-vec won't be fast until ANN indexes are supported.

I delayed working on ANN indexes until metadata columns were supported, because its much easier to build an ANN index with metaddata filtering on day 1 than it is to retroactively try to support them. I think this was the right call — metadata columns are hard! Follow issue #25 for future update on this!

Next: Quantizers. Currently sqlite-vec only supported simple binary quantization and scalar quantization with int8 vectors. But I want to support float16, float8, "smarter" binary quantization (ie custom thresholds instead of just > 0), and other techniques that have come about the last few months. This will also help support ANN indexes, as many of them rely on vector compression for fast queries.

There's also a ton of performance work that sqlite-vec needs, especially with these new metadata column features. This initial release was more of a "make it work" and not "make it fast", so expect much faster metadata filtering in upcoming releases!

Sister projects sqlite-lembed and sqlite-rembed also need a ton of love, they both have some older PRs that need merging. Expect releases of both of these projects very soon!

And finally, a ton of smaller integrations! For example, Rody Davis submitted Dart and Flutter bindings that I have not yet merged, Oscar Franco contributed Android and iOS bindings that needs love, and Pyodide support is on the horizon.