Exploring CodeProject With Curiosity

Rafael Oliveira
Published in The Startup
11 min read · Nov 25, 2020


I was reading my email this morning when I stumbled upon an interesting question in the daily newsletter from CodeProject, to which, like the author, I had no idea of the answer: “How many articles are there on CP?”

Did you know…

Turns out CodeProject is quite a large library after all, with more than 14 million users and 63 thousand articles! We’ve been looking for interesting datasets to demonstrate how to use Curiosity Search, and this seemed like a great one!

You can reproduce this on your own machine — all the code is available in this repository, and you can download a copy of Curiosity to run locally, or deploy it to your favorite container platform using the Docker image.

Time to check those 63,000+ articles

Adding data to a Curiosity application is easy using our C# data connector — it provides an easy-to-use interface that helps you write data ingestion jobs and create the graph relationships that will define your search.

As we’ll be extracting the data from the Site Map, let’s take a look and decide on a data schema for our own CodeProject knowledge graph. First, from the Site Map, we can see that CodeProject is structured in Categories such as Desktop Development, and Subcategories such as Clipboard.

Site map: Categories and Subcategories

If we open one of the subcategory pages, we can see the entire list of articles tagged in that subcategory. These pages might take a while to load, as they contain a link to every single article in the subcategory. The next data type we can derive is the Article. Let’s open one to see what’s inside!

Articles all the way

Finally, if we open an article, we can see a few more things that can be interesting to capture:

Finally some content to read!

If we take a quick look at the article structure, we notice two more data types: Authors and Tags. We can also decide what metadata to import from the article, like the Title, the Short Description, the article content and the article’s stats.

What we will be extracting from an article

This gives us this final schema for our CodeProject graph:

Structuring CodeProject in a graph

First let’s take a look at how to define a schema for this data using the Article type. We start by creating a class and tagging it with the [Node] attribute. Each node schema in a Curiosity application has to have a unique string key that identifies it — we mark that field with the [Key] attribute. For the Article type, we use the URL of the article, as that already gives us a nice unique identifier for each article. The other fields are marked with the [Property] attribute and will contain the rest of the data we extract from each article webpage. For the node timestamp (which Curiosity uses for time filtering), we use a DateTimeOffset field marked with the [Timestamp] attribute. For articles where we fail to parse the date from the HTML, we use the default value DateTimeOffset.UnixEpoch — a special value that is ignored when searching.

The final code for the Article type looks like this:

[Node]
public sealed class Article
{
    [Key]       public string Url { get; set; }
    [Timestamp] public DateTimeOffset Timestamp { get; set; }
    [Property]  public string Title { get; set; }
    [Property]  public string Description { get; set; }
    [Property]  public string Text { get; set; }
    [Property]  public string Html { get; set; }
    [Property]  public int Views { get; set; }
    [Property]  public int Bookmarks { get; set; }
    [Property]  public int Downloads { get; set; }
}
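
The other node types follow the same pattern. As an illustration, a minimal Author schema could look like the sketch below (based on the fields we fill in when linking authors later; the full schemas are in the repository):

[Node]
public sealed class Author
{
    // The author's profile URL doubles as the unique key, just like for articles
    [Key]      public string Url { get; set; }
    [Property] public string Name { get; set; }
}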

You can see the remaining data schemas in the attached GitHub repository. For the edges, we’ll use a simple naming convention (HasAuthor / AuthorOf), and we create a small static class so that we don’t have to repeat the names as strings later in the code:

public static class Edges
{
    public const string AuthorOf = nameof(AuthorOf);
    public const string HasAuthor = nameof(HasAuthor);
    public const string TagOf = nameof(TagOf);
    public const string HasTag = nameof(HasTag);
    public const string CategoryOf = nameof(CategoryOf);
    public const string HasCategory = nameof(HasCategory);
    public const string SubcategoryOf = nameof(SubcategoryOf);
    public const string HasSubcategory = nameof(HasSubcategory);
}

Finally we can start putting our data connector together:

using (var graph = Graph.Connect(server, token, "CodeProject"))
{
    await graph.CreateNodeSchemaAsync<Article>();
    await graph.CreateNodeSchemaAsync<Author>();
    await graph.CreateNodeSchemaAsync<Tag>();
    await graph.CreateNodeSchemaAsync<Category>();
    await graph.CreateNodeSchemaAsync<Subcategory>();

    await graph.CreateEdgeSchemaAsync(Edges.AuthorOf,
                                      Edges.HasAuthor,
                                      Edges.CategoryOf,
                                      Edges.HasCategory,
                                      Edges.SubcategoryOf,
                                      Edges.HasSubcategory,
                                      Edges.TagOf,
                                      Edges.HasTag);

    await IngestCodeProject(graph);
    await graph.CommitPendingAsync();
}

The token variable contains the API token you can generate in the application. The server variable should point to the address where Curiosity is hosted — when testing locally, that’s “http://localhost:8080/”. Remember that in a real deployment you shouldn’t store the API token directly in the code.
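
To keep the token out of the source, one option is to read both values from the command line (as we do when running the crawler below), falling back to an environment variable. A minimal sketch, assuming a Main(string[] args) entry point and a hypothetical variable name:

// Sketch: read the server address and API token from the command-line arguments,
// falling back to a hypothetical CURIOSITY_API_TOKEN environment variable.
var server = args.Length > 0 ? args[0] : "http://localhost:8080/api/";
var token  = args.Length > 1
           ? args[1]
           : Environment.GetEnvironmentVariable("CURIOSITY_API_TOKEN");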

All that is left to do is to implement the IngestCodeProject method to crawl the CodeProject site map and get all articles from it.

Crawling CodeProject

I was happy to see there is an official API for CodeProject, but after a second look it seems they started building the API and then abandoned it long ago, with the latest update from 2015:

Not a good sign when the API last changed 5 years ago…

So we fall back to a web crawler instead. To implement the crawler in C#, we can use the good-ol’ HtmlAgilityPack parsing library. We start with a few helpful methods to download a page and extract links:

static async Task<HtmlDocument> GetPage(string url)
{
    using (var c = new HttpClient())
    {
        var html = await c.GetStringAsync(url);
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }
}

static IEnumerable<string> GetLinks(string baseUrl, HtmlDocument doc)
{
    return doc.DocumentNode
              .SelectNodes("//a[@href]")
              .Select(n => ToAbsolute(baseUrl, n.Attributes["href"].Value))
              .Where(u => u is object)
              .Distinct();
}

static string ToAbsolute(string baseUrl, string url)
{
    if (string.IsNullOrWhiteSpace(url)) return null;

    var uri = new Uri(url, UriKind.RelativeOrAbsolute);
    if (!uri.IsAbsoluteUri)
    {
        uri = new Uri(new Uri(baseUrl), uri);
    }
    return uri.ToString();
}

From navigating around the website HTML, we can see that the hierarchy of categories can be easily read from the side menu that appears in any section page:

Structure, structure, structure…
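
The overall crawl simply follows that hierarchy. A simplified skeleton of the IngestCodeProject method could look like the sketch below; the entry URL, the link filters and the AddArticleToGraph helper are placeholders, not the exact logic used against the real pages:

// Rough sketch of IngestCodeProject: walk Categories -> Subcategories -> Articles
// using the GetPage/GetLinks helpers above. The link filters and AddArticleToGraph
// are placeholders; the real crawler in the repository does the actual parsing.
// The parameter type is shown as Graph for illustration.
static async Task IngestCodeProject(Graph graph)
{
    var siteMapUrl = "<CodeProject Site Map URL>"; // see the repository for the real entry URL

    var siteMap = await GetPage(siteMapUrl);
    foreach (var categoryLink in GetLinks(siteMapUrl, siteMap).Where(IsCategoryLink))
    {
        var categoryPage = await GetPage(categoryLink);
        foreach (var subcategoryLink in GetLinks(categoryLink, categoryPage).Where(IsSubcategoryLink))
        {
            var subcategoryPage = await GetPage(subcategoryLink);
            foreach (var articleLink in GetLinks(subcategoryLink, subcategoryPage).Where(IsArticleLink))
            {
                var articlePage = await GetPage(articleLink);
                // Extract title, description, text, stats, authors and tags from the page,
                // then create the nodes and edges as shown below.
                AddArticleToGraph(graph, articleLink, articlePage);
            }
        }
    }
}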

The full code for the crawler won’t fit here — but you can check it on GitHub. Once we have the crawler working, we just need to write a few lines of code to create the right nodes in the knowledge graph. For example, this is how we connect articles, authors and tags:

var articleNode = graph.AddOrUpdate(
    new Article()
    {
        Url         = articleLink,
        Bookmarks   = stats.bookmarked,
        Views       = stats.views,
        Downloads   = stats.downloads,
        Description = content.description,
        Title       = content.title,
        Text        = content.text,
        Html        = content.html,
        Timestamp   = date
    });

foreach (var author in authors)
{
    var authorNode = graph.AddOrUpdate(
        new Author()
        {
            Name = author.name,
            Url  = author.url
        });

    graph.Link(articleNode, authorNode, Edges.HasAuthor, Edges.AuthorOf);
}

foreach (var tag in tags)
{
    var tagNode = graph.AddOrUpdate(
        new Tag()
        {
            Name = tag.tag,
            Url  = tag.url
        });

    graph.Link(articleNode, tagNode, Edges.HasTag, Edges.TagOf);
}

graph.Link(articleNode, subcategoryNode, Edges.HasSubcategory, Edges.SubcategoryOf);
graph.Link(articleNode, categoryNode, Edges.HasCategory, Edges.CategoryOf);

As you can see, we create each article in the graph and link it via the respective edges to its authors and tags.

Armed with an API token, we can now run the crawler from the command line with

dotnet run http://localhost:8080/api/ {API_TOKEN}

You’ll see it start downloading all articles and uploading them to your Curiosity application:

The 🕷 is alive!

You can open the Data Hub page in your browser to see the data starting to appear as the articles are downloaded:

Now that we have the crawler doing the hard work for us, it’s time to check how we can configure the search experience on Curiosity.

Making CodeProject Searchable

We have two last things to do here:

  • Configure search, autocomplete and filtering
  • Create data views for each type with the relevant information

Let’s start with configuring search!

First we head to the Data Hub again, open the Article data type, click on Text Search, and add the Title, Description and Text fields as searchable. For the other data types, like Author, we’ll configure Name as searchable.

Making things searchable takes a few clicks

This way, we can already search for our favorite authors and look for information in the content of the articles. You’ll notice that I’ve also changed the boost for each field, so that text matches in the Article’s Title and Description fields contribute more to the ranking of an article.

For Autocomplete and Filtering, we would like the search box to offer Authors, Tags and Categories/Subcategories to search for.

So let’s configure it:

Finally, we also want our search box to automatically capture Authors, Tags and Categories/Subcategories if the user types them as text directly in the search box — we can configure the required NLP models like this:

This means that if we type something like “windows” in the search box, it will automatically be recognized as a Tag in the knowledge graph, and we get better search results based on it:

You might have noticed that we’re still showing our Key fields (i.e. the URLs) on the filters and card titles — this is because we have not yet configured how to render our data. Now that we’ve configured search, it’s time to make our search experience more useful and pretty!

Improving our Search Experience

Curiosity has an embedded interface editor that allows you to define your own custom data views based not only on each data point, but also on its connections on the knowledge graph.

For example, for articles, we might want to show not only the content, but also the related Author as a link the user can easily navigate to in order to find other Articles. We might also want to configure similarity for articles, so we can suggest other articles users might be interested in — we’ll get to that later in this article!

Starting with the Article type, we head to the Data Hub again, and configure the following:

  • Style: Change the Label to the field Title, and change the icon and color
  • Renderer: Add related Author, Category and Tags to the footer, add the article URL as a link to the card, use the HTML when we open the card preview, and add a tab to show related Authors.

We will do something similar for the other data types, so that we can see and search in related articles for Categories, Subcategories and Authors:

Searching our data

Now that we’re done configuring the system, we can play around and see what the data looks like. We can, for example, start with the legend Sacha Barber: you can see how the autocomplete we configured earlier already suggests his name as we start typing in the search box. From there, it’s easy to navigate the categories he has published in, and find interesting articles he wrote on niche topics like Machine Learning.

The legend, Sacha Barber

One can also explore the graph we created while ingesting the data, and take a look “behind the scenes” at how the data is connected:

Graphs, graphs everywhere

Recommendations for Similar Articles

Now that we have our articles in the system, we can use the relationships each article has (such as Category, Subcategory and Tags, plus the internal types captured by the built-in unsupervised models for Abbreviations & Concepts) to train a graph-based embeddings model and use it to suggest similar articles. Internally, Curiosity uses a C# port of Facebook’s open-source FastText code, adapted to handle graph relationships instead of words as in the traditional word2vec model. For fast vector search, it uses the approximate nearest-neighbours algorithm Hierarchical Navigable Small World graphs (you can see the implementation we use here: HNSW) — this lets us consume the embeddings to provide recommendations in real time, without having to scan all vectors for the closest neighbours.
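
To make the vector-search part concrete: a naive nearest-neighbour lookup would score every article embedding against the query vector and sort, roughly as in the sketch below. HNSW returns approximately the same neighbours without visiting every vector, which is what makes real-time recommendations practical.

// Naive illustration only (requires System, System.Collections.Generic and System.Linq):
// rank all article embeddings by cosine similarity to a query vector.
// HNSW avoids this full scan by navigating a layered proximity graph.
static IEnumerable<(string ArticleUrl, float Score)> MostSimilar(
    Dictionary<string, float[]> embeddings, float[] query, int count)
{
    return embeddings
        .Select(kv => (ArticleUrl: kv.Key, Score: Cosine(kv.Value, query)))
        .OrderByDescending(x => x.Score)
        .Take(count);
}

static float Cosine(float[] a, float[] b)
{
    float dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot   += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB) + 1e-9f);
}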

We train this model by adding an embeddings index in the Settings page:

Adding a similarity index for articles

And configure it as follows:

Note that this model can also use a pre-trained Token Embeddings model to improve the training data by expanding the captured “Concepts” in the text. To do this, we first manually train the Token Embeddings model once:

First step: Training the token embeddings for word similarity

Once this model is trained, we can then train our Article Similarity model:

Second step: Training the article similarity model

Once the model is trained, we can modify our Article view in the Data Hub to show similar articles. For this, we add one new tab to the Full View, and add a Similar Search component inside of it:

Publishing a recommendation endpoint

Now that we have our models trained, we can also make them available for use outside our system. Curiosity offers a built-in way to write your own custom endpoints: we just need to go to Menu > Endpoints and create a new endpoint called recommend-articles:

The code for this endpoint is quite simple: it receives the URL of the article in the body and uses the Curiosity query language to retrieve 10 similar articles:

var url = Body.Trim('"');

if (Graph.TryGet(N.Article.Type, url, out var article))
{
    return Q().StartAt(article.UID)
              .Similar(count: 10)
              .EmitWithScores("Similar",
                              fields: new[] { "Url", "UID", "Timestamp" });
}
return "{}";

In order to call the endpoint from outside, you’ll need to generate a token for it (Settings > Tokens > New Token > Endpoint). You can then test it using curl, for example, with the following command:

curl -X POST -H "Accept: application/json" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer ${TOKEN}" \
     --data "ARTICLE URL" \
     http://localhost:8080/api/cce/token/run/recommend-articles
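
The same call from C# could look roughly like this (a sketch with HttpClient, reusing the endpoint path and token from the curl example):

// Sketch: call the recommend-articles endpoint from C# (assumes an async context
// and the System.Net.Http namespace).
using var client = new HttpClient();
client.DefaultRequestHeaders.Authorization =
    new System.Net.Http.Headers.AuthenticationHeaderValue("Bearer", token);

var body = new StringContent("\"ARTICLE URL\"", System.Text.Encoding.UTF8, "application/json");
var response = await client.PostAsync(
    "http://localhost:8080/api/cce/token/run/recommend-articles", body);

Console.WriteLine(await response.Content.ReadAsStringAsync());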

Conclusion

Funny how sometimes a simple question at the right time can lead us down the rabbit hole! I hope this post helped show how simple it is to create your own intelligent search — the hard part was really writing the crawling logic for the website, something I want to try again in the future using Selenium and the soon-to-be-released Python connector.
