Building a news recommendation engine
We explore the MIND dataset to build a quick news recommendation engine from scratch using Curiosity
Recommendation systems are a key part of how we interact with modern websites, where relevant information can otherwise be very hard to find. Websites embed these recommendations in sections such as “Suggested Read”, “You might also be interested in”, and “Relevant for you”. These sections help users discover content and give them new opportunities to engage with it, ultimately improving their experience and increasing retention and interest.
In my last post, Exploring CodeProject using Curiosity, I showed how one can build a graph embedding model and use it to recommend similar CodeProject posts in real-time using the built-in HNSW index. In that case, we built our recommendation model based entirely on the extracted entities from the articles’ text content and the relationships that were captured in the knowledge graph.
But another interesting source of signals for providing recommendations is user behavior: the premise is that similar users will have similar “interests”, and we can capture these interests from their interactions with the website.
There is plenty of research on the topic, and while we’ll not present any novel algorithm here, I’ll show how one can model this data as a set of relationships in the graph, and a few different approaches to provide recommendations based on this.
As we don’t have user-behavior data for CodeProject, for this post we’ll switch to another interesting public dataset that has all the information we need to demonstrate how to build a recommendation algorithm: enter MIND, the Microsoft News Recommendation Dataset. More details about this dataset can be found in the related paper, but in short, it consists of an extract of news articles and user interactions with them over time.
Data Modeling for Recommendations
Similar to my previous post, the first step in building our demo is to get the data inside the system — so let’s start with our graph schema. From a quick inspection, the key data types in the MIND dataset are:
- News Articles (from msn.com)
- Anonymized Users
- Entities (from WikiData)
- Categories & Subcategories
The resulting graph schema is quite similar to the one from the CodeProject article, with the only change being that the Tags type has been replaced with WikiData Entities that were pre-captured by Microsoft when they generated the dataset.
We start building our MIND-connector (no, not this one) as usual. You can check the entire source code for the connector on GitHub. We add the Curiosity Library package to be able to talk to the graph, and also add a CSV parser and an HTML parser to help us read and transform the data.
For reading the data from each of the files, we can use the neat record objects to create the supporting classes to read the CSV data into, and then asynchronously iterate through the files to read each object:
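As a minimal sketch of the same idea in Python (not the connector's own code), the two main MIND files can be read into small typed records. The class and function names below are hypothetical, and the column order is an assumption based on the dataset's published tab-separated layout for `news.tsv` and `behaviors.tsv`:

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, List, TextIO, Tuple


@dataclass
class NewsRow:  # hypothetical record for one line of news.tsv
    news_id: str
    category: str
    subcategory: str
    title: str
    abstract: str
    url: str


@dataclass
class BehaviorRow:  # hypothetical record for one line of behaviors.tsv
    user_id: str
    history: List[str]                   # articles clicked before this impression
    impressions: List[Tuple[str, bool]]  # (news_id, clicked) pairs


def read_news(handle: TextIO) -> Iterator[NewsRow]:
    # QUOTE_NONE: quote characters inside titles are plain text, not CSV quoting.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue  # skip blank lines
        # Assumed column order: id, category, subcategory, title, abstract, url, ...
        yield NewsRow(*row[:6])


def read_behaviors(handle: TextIO) -> Iterator[BehaviorRow]:
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue
        # Assumed column order: impression id, user id, time, history, impressions
        _impression_id, user_id, _time, history, impressions = row[:5]
        shown = []
        for item in impressions.split():
            # Each item looks like "N123-1" (clicked) or "N123-0" (not clicked).
            news_id, _, label = item.rpartition("-")
            shown.append((news_id, label == "1"))
        yield BehaviorRow(user_id, history.split(), shown)
```

Both readers are generators, so a large file can be streamed line by line, for example with `read_news(open("news.tsv", encoding="utf-8", newline=""))`.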
We now define our supporting Node schema classes for the Curiosity graph:
And proceed to write the logic to ingest all articles and impressions:
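To make the shape of the ingestion concrete, here is a rough Python sketch that loads articles, entities, and reads into a toy in-memory graph. It is a stand-in and does not use the Curiosity Library; `TinyGraph`, `ingest`, and the edge names (`Mentions`, `MentionedIn`, `Read`, `ReadBy`) are all hypothetical names chosen for illustration:

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Node = Tuple[str, str]  # (node type, key), e.g. ("User", "U1")


class TinyGraph:
    """A toy in-memory stand-in for a graph database (illustration only)."""

    def __init__(self) -> None:
        self._out: Dict[Tuple[Node, str], Set[Node]] = defaultdict(set)

    def link(self, source: Node, edge: str, target: Node, reverse: str) -> None:
        # Store both directions so the graph can be walked either way.
        self._out[(source, edge)].add(target)
        self._out[(target, reverse)].add(source)

    def out(self, source: Node, edge: str) -> Set[Node]:
        # Return a copy so callers cannot mutate the stored edge set.
        return set(self._out.get((source, edge), set()))


def ingest(
    graph: TinyGraph,
    mentions: Iterable[Tuple[str, str]],  # (news_id, entity_id) pairs
    clicks: Iterable[Tuple[str, str]],    # (user_id, news_id) pairs
) -> None:
    for news_id, entity_id in mentions:
        graph.link(("Article", news_id), "Mentions",
                   ("Entity", entity_id), "MentionedIn")
    for user_id, news_id in clicks:
        graph.link(("User", user_id), "Read",
                   ("Article", news_id), "ReadBy")
```

Because edges are stored as sets, ingesting the same click twice is harmless, which is convenient when the same article shows up in several impressions of one user.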
Now we just need to generate an API token for our Curiosity instance, sit back, and wait while the ingestion downloads all the data and populates our graph:
This might take a while, as it will also read the HTML pages of each article to try to extract the full text and date. If you run the code with the large dataset, this might be a good time to go refill your coffee mug.
As we already explored the idea of using graph embeddings to recommend similar articles to any given article, I’ll now cover two other approaches that we can use to suggest articles:
- Articles related to the user’s interests
- Articles viewed by Similar Users
Articles related to a user’s interests
This algorithm works for existing users, as it depends on previous interests captured from the user’s interactions with news on the site. We want to exploit the fact that if a user reads an article in which a given entity appears, there is a good chance that the user will be interested in new articles that also talk about that entity. This can be represented as a set of graph operations, in which we start at the user, and navigate the graph towards articles, then entities, and then back to articles.
This operation can be modeled with a graph query like this:
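As a rough sketch of the same traversal in plain Python (not Curiosity's query API), the user → articles → entities → articles walk can be written over two dictionaries. The function name, the dictionary layout, and the scoring rule (each candidate scores the number of times its entities appeared in the user's read articles) are assumptions made for this illustration:

```python
from collections import Counter
from typing import Dict, List, Set


def recommend_by_interests(
    user: str,
    reads: Dict[str, Set[str]],        # user id -> ids of articles the user read
    entities_of: Dict[str, Set[str]],  # article id -> ids of entities it mentions
    top_k: int = 5,
) -> List[str]:
    seen = reads.get(user, set())

    # User -> read articles -> entities: count how often each entity shows up.
    interest: Counter = Counter()
    for article in seen:
        interest.update(entities_of.get(article, set()))

    # Entities -> back to articles: score unread articles by shared entities.
    scores: Dict[str, int] = {}
    for article, entities in entities_of.items():
        if article in seen:
            continue  # never recommend something already read
        score = sum(interest[entity] for entity in entities)
        if score > 0:
            scores[article] = score

    # Highest score first; ties are broken by article id to keep output stable.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [article for article, _ in ranked[:top_k]]
```

A user with no reading history gets an empty list, which matches the note above that this approach only works for existing users.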
Articles viewed by Similar Users
This algorithm uses a graph embedding model to find similar users based on their interests in the graph. First, we train a new graph similarity model, but this time for users instead of articles. As the number of users is quite high, we configure the model to only consider users that have read at least 50 articles.
We will also use an interesting capability of the Curiosity graph database that allows us to expand the knowledge graph by defining inference rules that create “virtual edges” at query time. For this model, we’ll define the following rule:
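To illustrate the general idea of a virtual edge, here is a small Python sketch in which an edge is defined as the composition of two stored edges and is only computed when queried. This is not Curiosity's rule syntax, and the example rule (a user is `InterestedIn` every entity mentioned by an article they read) is a hypothetical one picked for illustration, as are all class, function, and edge names:

```python
from typing import Callable, Dict, Set

Edges = Dict[str, Dict[str, Set[str]]]  # edge name -> source id -> target ids
Rule = Callable[[Edges, str], Set[str]]


def compose(first: str, second: str) -> Rule:
    """Build a rule that follows `first` and then `second` when queried."""

    def rule(edges: Edges, source: str) -> Set[str]:
        targets: Set[str] = set()
        for middle in edges.get(first, {}).get(source, set()):
            targets |= edges.get(second, {}).get(middle, set())
        return targets

    return rule


class GraphWithRules:
    def __init__(self, edges: Edges) -> None:
        self.edges = edges
        self.rules: Dict[str, Rule] = {}

    def add_rule(self, name: str, rule: Rule) -> None:
        self.rules[name] = rule

    def out(self, source: str, edge: str) -> Set[str]:
        if edge in self.rules:
            # Virtual edge: derived on the fly, never stored in the graph.
            return self.rules[edge](self.edges, source)
        return set(self.edges.get(edge, {}).get(source, set()))


graph = GraphWithRules({
    "Read": {"U1": {"A1", "A2"}},
    "Mentions": {"A1": {"Q1"}, "A2": {"Q1", "Q2"}},
})
graph.add_rule("InterestedIn", compose("Read", "Mentions"))
```

Since nothing is materialized, a newly stored `Read` edge is reflected in `graph.out("U1", "InterestedIn")` on the next query without any re-ingestion.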
We now create the similarity index and configure it as follows:
Note that in this case, we disable the similarity for tokens option, as we’ll not be handling any “Concepts” related to the user. We also add the two edge and node types that we want to use for modeling similarity, including the inferred edge defined earlier. Once the model is trained, we can use a similar graph query to fetch recommended articles based on user similarity:
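The sketch below shows the overall shape of a similar-users recommendation in Python. It deliberately swaps the trained graph embedding model and its index for a much simpler technique: cosine similarity over per-user entity counts, computed by brute force. The function names, parameters, and scoring are assumptions for illustration only; `min_reads` mirrors the 50-article threshold mentioned earlier:

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set


def user_vector(user: str, reads: Dict[str, Set[str]],
                entities_of: Dict[str, Set[str]]) -> Counter:
    # Represent a user by how often each entity appears in what they read.
    vector: Counter = Counter()
    for article in reads.get(user, set()):
        vector.update(entities_of.get(article, set()))
    return vector


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(weight * b[key] for key, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def recommend_from_similar_users(
    user: str,
    reads: Dict[str, Set[str]],        # user id -> ids of articles read
    entities_of: Dict[str, Set[str]],  # article id -> ids of entities mentioned
    min_reads: int = 50,               # ignore users with too little history
    neighbours: int = 10,
    top_k: int = 5,
) -> List[str]:
    seen = reads.get(user, set())
    target = user_vector(user, reads, entities_of)

    # Brute-force neighbour search (stands in for an approximate index).
    scored = []
    for other, articles in reads.items():
        if other == user or len(articles) < min_reads:
            continue
        similarity = cosine(target, user_vector(other, reads, entities_of))
        if similarity > 0:
            scored.append((similarity, other))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))

    # Each neighbour votes for its unread articles, weighted by similarity.
    votes: Dict[str, float] = defaultdict(float)
    for similarity, other in scored[:neighbours]:
        for article in reads[other] - seen:
            votes[article] += similarity

    ranked = sorted(votes.items(), key=lambda pair: (-pair[1], pair[0]))
    return [article for article, _ in ranked[:top_k]]
```

The brute-force loop compares the target against every eligible user, which is fine for a small sample but is exactly the cost that an approximate nearest-neighbour index avoids on the full dataset.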
Then, we configure our User view to show the response of these three endpoints in separate tabs, using the Content From Endpoint component.
Finally, we can check that it all works by opening any user and clicking on each of the “recommend” tabs we created above (Similar for similar articles, Interests for articles related to the user’s interests, and Similar Users for articles read by similar users):
In this post, we showed how to quickly put together a set of algorithms to recommend related news articles. One key aspect that is not covered here, but that would improve these algorithms, is a way to mix the different techniques depending on the profile of the user, and to use other signals such as freshness or length to extend the predictions and re-rank the results.
If you’re interested in reproducing this locally, you can find the connector code on GitHub.