At TipTap, we are building a deeply personalized site; think “OKCupid for product recommendations” or “Pinterest for people with your tastes”. We use psychology research to measure and predict your personality and traits along a number of scales (dimensions), and then we connect you with people, products and content we think you’ll like. There’s no machine learning involved; we use 50 years of human learning instead.
Problem: That’s Not Relational
From a database perspective, what this means is we’d like to do things roughly equivalent to
SELECT * FROM products JOIN users ON products.author_id = users.id JOIN similarity(your_user_id, products.id) AS s ON s.user_id = users.id ORDER BY s.similarity DESC LIMIT 10;
where similarity() is a set-returning function—essentially a virtual table of all users, ordered by their similarity to you, in the context of what “similar” means for each product.
Of course, you can’t do this in PostgreSQL, not least of all because there’s no LATERAL support yet. You have to move the similarity() function into the SELECT list, and that means it needs to be super-fast, because it’s getting called once per product and returning only the one row. Realistically, we’re doing something very much against the relational grain, and pure-SQL functions don’t get us where we need to be. A straight procedural implementation in PL/pgSQL is far too slow. There are papers on extending the relational algebra to personalized or contextualized queries; maybe someday when we’re rich, we can fund that.
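To make the intended shape concrete, here is a rough Python model of what similarity() is supposed to behave like: a virtual table of (user_id, similarity) rows, ordered by similarity to the querying user. Everything here is a hypothetical stand-in, not our actual implementation; `score` is an invented callback for the pairwise, context-aware similarity of two users.

```python
# Sketch only: a Python model of the set-returning similarity()
# function described above. "score" is a hypothetical callback that
# computes the context-aware similarity of two users in [0, 1].
def similarity(your_user_id, product_id, all_users, score):
    """Return (user_id, similarity) rows, most similar first --
    the 'virtual table' the query above wants to join against."""
    rows = ((uid, score(your_user_id, uid, product_id))
            for uid in all_users
            if uid != your_user_id)
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

The catch, of course, is that materializing this whole virtual table per query is exactly what we can't afford, which is what the rest of this post is about.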
Meanwhile, PostgreSQL isn’t just a relational database; it’s a robust, scalable framework for indexing and querying data. I think it can do this.
Essentially, what we’re doing is finding the similarity between you and another user, represented as points or vectors in an N-dimensional feature space, where:
- Similarity is 1 − distance; your similarity to yourself is 1, your similarity to anyone else is never less than 0, and in general similarity is a salient, consistent (but not mathematically pure) percentage measure.
- Features can be arbitrary data types in arbitrary domains. On some scales, they’re pre-normalized floats; on others, they’re dates, or latitude/longitude points.
- The distance function depends on the dimension, and is normalized to the range 0–1. On our pre-normalized scales, that distance function is “subtraction”. When comparing ages, the distance function is “map the age difference along a curve such that a 70-year-old and a 75-year-old are similar, but a 15-year-old and a 20-year-old are not”. And so on.
- If a feature is unknown in either user, treat the missing data appropriately. In the simplest case, perhaps the similarity is always 0 along that feature.
- Aggregate the feature-specific similarities via some aggregation function. Perhaps you want a weighted average, or Euclidean distance, or cosine similarity.
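The steps above can be sketched as follows. The feature names, the shape of the age curve, and the weights are all invented for illustration; the aggregation shown is the simple weighted average, with the simplest missing-data policy (an unknown feature scores 0).

```python
# A minimal sketch of the per-feature machinery described above.
# Feature names, the age curve, and the weights are hypothetical.
def scale_distance(a, b):
    # Pre-normalized floats in [0, 1]: distance is plain subtraction.
    return abs(a - b)

def age_distance(a, b):
    # Invented curve: the same gap matters less at higher ages, so
    # divide the gap by the pair's mean age, then clamp to [0, 1].
    mean = (a + b) / 2.0
    return min(1.0, abs(a - b) / mean) if mean > 0 else 0.0

DISTANCES = {"tidiness": scale_distance, "age": age_distance}
WEIGHTS = {"tidiness": 1.0, "age": 2.0}

def similarity(user_a, user_b, distances=DISTANCES, weights=WEIGHTS):
    """Weighted average of per-feature similarities (1 - distance);
    a feature missing in either user contributes similarity 0."""
    num = den = 0.0
    for name, dist in distances.items():
        weight = weights.get(name, 1.0)
        den += weight
        if name in user_a and name in user_b:
            num += weight * (1.0 - dist(user_a[name], user_b[name]))
    return num / den if den else 0.0
```

With these toy numbers, a 70- and a 75-year-old come out more similar than a 15- and a 20-year-old, matching the curve described above.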
Once we’ve calculated similarity between you and other users, we layer relevance on top. Relevance is calculated much like similarity — the aggregate of some set of feature-specific relevance factors — but relevance measures the connection between you and the piece of content. One relevance factor is your similarity to the user who posted that piece of content, and here’s where it gets interesting:
We calculate that similarity in the context of that piece of content. Some factors matter more than others. If you are considering buying a car, you may value advice from others in your age group, or who share your tastes in performance vs. efficiency, or style vs. practicality. But you don’t much care where they live. If you’re looking for a restaurant, you want somebody who’s a healthy eater if you are, and you don’t care if they’re stylish or disorganized, but most importantly you probably want to talk to people in your neighborhood! Unless you’re asking about a restaurant in a popular tourist location, in which case you may want to see other frequent travelers… and so on.
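That context-dependence might look roughly like this: the same per-feature similarities, aggregated under a weight profile chosen by the piece of content. Every context, feature, and weight here is made up for illustration.

```python
# Hedged sketch of context-dependent weighting: the content picks a
# weight profile, and similarity is re-aggregated under it. All the
# contexts, features, and weights below are invented.
CONTEXT_WEIGHTS = {
    "car":        {"age": 2.0, "performance_vs_efficiency": 3.0,
                   "location": 0.0},
    "restaurant": {"healthy_eating": 2.0, "location": 3.0,
                   "style": 0.0},
}

def contextual_similarity(feature_sims, context):
    """Weighted average of precomputed per-feature similarities
    (name -> value in [0, 1]) under the given context's weights;
    weight 0 drops a feature from consideration entirely."""
    weights = CONTEXT_WEIGHTS[context]
    num = sum(w * feature_sims.get(f, 0.0) for f, w in weights.items())
    den = sum(weights.values())
    return num / den if den else 0.0
```

A neighbor who shares your eating habits scores high in the restaurant context; the very same person can score low in the car context, where location carries no weight at all.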
This means we can’t even use a pre-calculated index of similarity; we must calculate it on the fly. Which is why, at least without LATERAL support (and possibly with), similarity() goes back up to the SELECT list. Which means, again: it’s got to be fast.
In my next post, I’ll talk about how we might accomplish that.