Building a news recommendation engine

We explore the MIND dataset to build a quick news recommendation engine from scratch using Curiosity

Recommendation systems are a key part of modern websites, as it can be very hard to find information without them. Websites embed recommendations directly in their flow through sections like “Suggested Read”, “You might also be interested in”, or “Relevant for you”. These sections help users discover and engage with new content, which ultimately improves their experience and increases retention and interest.

Recommending similar CodeProject articles using graph-embeddings

In my last post, Exploring CodeProject using Curiosity, I showed how one can build a graph embedding model and use it to recommend similar CodeProject posts in real-time using the built-in HNSW index. In that case, we built our recommendation model based entirely on the extracted entities from the articles’ text content and the relationships that were captured in the knowledge graph.

But another interesting source of signals for providing recommendations is user behavior: the premise is that similar users will have similar “interests”, and we can capture these interests based on their interactions with the website.

There is plenty of research on the topic, and while we’ll not present any novel algorithm here, I’ll show how one can model this data as a set of relationships in the graph, and a few different approaches to provide recommendations based on this.

As we don’t have user-behavior data for CodeProject, for this post we’ll switch to another interesting public dataset that has all the information we need to demonstrate how to build a recommendation algorithm: enter MIND, the Microsoft News Dataset for news recommendation. More details about this dataset can be found in the related paper, but in short, it consists of a sample of news articles and user interactions with them over time.

Data Modeling for Recommendations

Similar to my previous post, the first step in building our demo is to get the data into the system, so let’s start with our graph schema. From a quick inspection, the key data types in the MIND dataset are:

  • News Articles (from msn.com)
  • Anonymized Users
  • Entities (from WikiData)
  • Categories & Subcategories

The resulting graph schema is quite similar to the one from the CodeProject article, with the only change being that the Tags type has been replaced with WikiData Entities that were pre-captured by Microsoft when they generated the dataset.

Graph Schema for the MIND dataset

We start building our MIND connector (no, not this one) as usual. You can check the entire source code for the connector on GitHub. We add the Curiosity Library package to talk to the graph, plus a CSV parser and an HTML parser to help us read and transform the data.

Creating a new data-connector and adding a few useful NuGet packages

To read the data from each file, we can use the neat record types to define the supporting classes that the CSV rows are read into, and then asynchronously iterate through the files to read each object:

News on the record
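The connector itself lives in the GitHub repository; as a rough illustration of the same idea, here is a minimal Python sketch. It assumes the public MIND file layout: tab-separated files without a header row, where the news file has eight columns (ID, category, subcategory, title, abstract, URL, title entities, abstract entities) and the behaviors file has five (impression ID, user ID, time, click history, impressions). The class and function names are hypothetical and not part of the original connector.

```python
import csv
import json
from dataclasses import dataclass
from typing import Iterator, List, TextIO


@dataclass
class NewsRecord:
    news_id: str
    category: str
    subcategory: str
    title: str
    abstract: str
    url: str
    entity_ids: List[str]  # WikiData IDs found in the title and abstract


@dataclass
class ImpressionRecord:
    user_id: str
    history: List[str]  # news IDs the user clicked before this impression
    clicked: List[str]  # news IDs clicked within this impression


def _entity_ids(raw: str) -> List[str]:
    # Entity columns hold a JSON array; an empty cell means no entities.
    return [e["WikidataId"] for e in json.loads(raw)] if raw.strip() else []


def read_news(stream: TextIO) -> Iterator[NewsRecord]:
    # QUOTE_NONE because titles and abstracts may contain bare double quotes.
    for row in csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE):
        news_id, category, subcategory, title, abstract, url, t_ent, a_ent = row
        # Merge both entity columns, dropping duplicates but keeping order.
        ids = list(dict.fromkeys(_entity_ids(t_ent) + _entity_ids(a_ent)))
        yield NewsRecord(news_id, category, subcategory, title, abstract, url, ids)


def read_impressions(stream: TextIO) -> Iterator[ImpressionRecord]:
    for row in csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE):
        _impression_id, user_id, _time, history, impressions = row
        # Each impression entry looks like "N123-1" (clicked) or "N123-0" (not).
        clicked = [p[:-2] for p in impressions.split() if p.endswith("-1")]
        yield ImpressionRecord(user_id, history.split(), clicked)
```

Both readers are generators, so a large file can be streamed row by row (for example `for record in read_news(open(path, encoding="utf-8"))`) instead of being loaded into memory at once.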

We now define our supporting Node schema classes for the Curiosity graph:

And proceed to write the logic to ingest all articles and impressions:

Logic to bring the MIND dataset into the graph

Now we just need to generate an API token for our Curiosity instance, then sit back and wait while the ingestion downloads all the data and populates our graph:

This might take a while…

This might take a while, as it will also read the HTML page of each article to try to extract the full text and date. If you run the code with the large dataset, it might be a good time to go refill your coffee mug.

Obligatory XKCD

Recommendation Algorithms

As we already explored the idea of using graph embeddings to recommend articles similar to any given article, I’ll now cover two other approaches that we can use to suggest articles:

  • Articles related to the user’s interests
  • Articles viewed by Similar Users

Articles related to a user’s interests

This algorithm works for existing users, as it depends on interests captured from the user’s previous interactions with news on the site. We want to exploit the fact that if a user reads an article in which a given entity appears, there is a good chance that the user will be interested in new articles that also talk about that entity. This can be represented as a set of graph operations, in which we start at the user and navigate the graph towards articles, then entities, and then back to articles.

This operation can be modeled with a graph query like this:

Graph query for articles related to entities of interest
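To make the traversal concrete outside of Curiosity, here is a small Python sketch of the same user → articles → entities → articles walk over plain in-memory dictionaries. It is not the Curiosity query language; the function name, the dictionary layout, and the scoring (counting how many paths lead to each candidate) are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Set


def recommend_by_interests(
    user: str,
    read_by_user: Dict[str, Set[str]],
    entities_of_article: Dict[str, Set[str]],
    top_n: int = 10,
) -> List[str]:
    # Invert article -> entities so we can hop from an entity back to articles.
    articles_of_entity: Dict[str, Set[str]] = {}
    for article, entities in entities_of_article.items():
        for entity in entities:
            articles_of_entity.setdefault(entity, set()).add(article)

    already_read = read_by_user.get(user, set())
    scores: Counter = Counter()
    for article in already_read:                                # user -> articles
        for entity in entities_of_article.get(article, set()):  # -> entities
            for candidate in articles_of_entity[entity]:        # -> articles
                if candidate not in already_read:
                    scores[candidate] += 1  # one point per connecting path

    # Highest score first; ties broken by article ID for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, _ in ranked[:top_n]]
```

Counting paths means an unread article that shares several entities with the user’s reading history ranks above one that shares a single entity. Very common entities can dominate such a count, so a real system would likely down-weight them.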

Articles viewed by Similar Users

This algorithm uses a graph embedding model to find similar users based on their interests in the graph. First, we train a new graph similarity model, but this time for users instead of articles. As the number of users is quite high, we configure the model to only consider users that have read at least 50 articles.

We will also use an interesting capability of the Curiosity graph database that allows us to expand the knowledge graph by defining inferencing rules to create “virtual edges” at query time. For this model, we’ll define the following rule:

Inferred edges can be used to extend our data-model without having to increase our graph size

We now create the similarity index and configure it as follows:

Settings for the user graph embeddings model

Note that in this case, we disable the similarity for tokens option, as we’ll not be handling any “Concepts” related to the user. We also add the edge and node types that we want to use for modeling similarity, including the inferred edge defined earlier. Once the model is trained, we can use a similar graph query to fetch recommended articles based on user similarity:

Graph query for articles related to similar users
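As a rough stand-in for the approach above, the Python sketch below swaps the graph embedding model and its index for a much simpler technique: Jaccard similarity between the sets of entities each user has read about. The user → entity link is derived on demand from user → article and article → entity edges, which is an assumed example of an inferred edge and not necessarily the rule used in the post. All names are hypothetical; only the 50-article threshold comes from the text.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def inferred_interests(
    user: str,
    read_by_user: Dict[str, Set[str]],
    entities_of_article: Dict[str, Set[str]],
) -> Set[str]:
    # "Virtual" user -> entity edges, computed at query time instead of stored.
    interests: Set[str] = set()
    for article in read_by_user.get(user, set()):
        interests |= entities_of_article.get(article, set())
    return interests


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_from_similar_users(
    user: str,
    read_by_user: Dict[str, Set[str]],
    entities_of_article: Dict[str, Set[str]],
    min_reads: int = 50,  # only compare against users with enough history
    k: int = 5,           # number of similar users to consider
    top_n: int = 10,
) -> List[str]:
    mine = inferred_interests(user, read_by_user, entities_of_article)
    already_read = read_by_user.get(user, set())

    neighbours: List[Tuple[float, str]] = []
    for other, articles in read_by_user.items():
        if other == user or len(articles) < min_reads:
            continue
        theirs = inferred_interests(other, read_by_user, entities_of_article)
        similarity = jaccard(mine, theirs)
        if similarity > 0:
            neighbours.append((similarity, other))
    neighbours.sort(key=lambda pair: (-pair[0], pair[1]))

    # Each similar user "votes" for their unread articles, weighted by similarity.
    scores: Counter = Counter()
    for similarity, other in neighbours[:k]:
        for article in read_by_user[other] - already_read:
            scores[article] += similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, _ in ranked[:top_n]]
```

This brute-force comparison scans every user, so it only suits small samples; an embedding model with an approximate nearest-neighbor index, as used in the post, is what makes the lookup practical at the scale of the full dataset.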

Visualizing Recommendations

A quite handy feature of Curiosity is the option to customize the front-end directly from the application, without having to write a single line of JavaScript. We can use this here to build a quick prototype UI to visualize the results from the different algorithms. In our case, we’ll add three tabs and link each tab to a Custom Endpoint. First, we create the endpoints for each algorithm:

Articles from users that have similar “reading taste” to the user
Articles that talk about entities the user is interested in
Articles that are similar to the articles read by the user

Then, we configure our User view to show the response of these three endpoints in separate tabs, using the Content From Endpoint component.

Finally, we can check that it all works by opening any user and clicking on each of the “recommend” tabs we created above (Similar for similar articles, Interests for articles related to what the user is interested in, and Similar Users for articles read by similar users):

Viewing the predictions from the three recommendation algorithms

Final Notes

In this post, we showed how to quickly put together a set of algorithms to recommend related news articles. One key aspect that is not covered here, and is left as a possible improvement, is a way to mix the different techniques depending on the profile of the user, and to use other signals such as freshness or length to extend the predictions and re-rank the results.
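One possible shape for such a mix, sketched in Python under simple assumptions: each algorithm returns a score per article, the scores are normalized and combined with fixed weights, and the result is discounted by an exponential freshness decay. The function name, the weights, and the 24-hour half-life are illustrative choices only, not something evaluated in this post.

```python
from typing import Dict, List


def blend_and_rerank(
    score_lists: Dict[str, Dict[str, float]],  # algorithm name -> {article: score}
    weights: Dict[str, float],                 # algorithm name -> blend weight
    age_hours: Dict[str, float],               # article -> age in hours
    half_life_hours: float = 24.0,
    top_n: int = 10,
) -> List[str]:
    blended: Dict[str, float] = {}
    for name, scores in score_lists.items():
        top = max(scores.values(), default=0.0)
        if top <= 0:
            continue  # nothing useful from this algorithm
        for article, score in scores.items():
            # Scale each algorithm to [0, 1] so the weights are comparable.
            contribution = weights.get(name, 0.0) * score / top
            blended[article] = blended.get(article, 0.0) + contribution

    def freshness(article: str) -> float:
        # Halves every `half_life_hours`; unknown ages count as brand new.
        return 0.5 ** (age_hours.get(article, 0.0) / half_life_hours)

    ranked = sorted(blended, key=lambda a: (-blended[a] * freshness(a), a))
    return ranked[:top_n]
```

The weights could also depend on the user, for example leaning on the interest-based scores for users with a short reading history and on similar users once enough history exists.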

If you’re interested in reproducing this locally, you can find the connector code on GitHub.

CTO @curiosity_ai
