NLP in C# made easy with spaCy & Catalyst

I often see the question of how to do natural language processing from C#. Almost all NLP engineers end up using Java or Python, as the most widely used open source NLP packages, such as Stanford’s CoreNLP, NLTK or spaCy, are developed in those languages. If you are a .NET developer, the options are unfortunately a bit more restricted.

We at Curiosity developed our own high-performance Catalyst NLP toolkit, which we use in our day-to-day work. But while Catalyst is quite powerful (and fully written in C#), it doesn’t provide exactly the same models and algorithms as a larger library such as spaCy. We would often like to compare and explore results and use-cases from the Python world, to evaluate how to bring them to our .NET library.

As an example, I was recently asked how to get a dependency parse tree using Catalyst. I started adding a dependency parsing model to Catalyst some time ago, but there is still some work to be done before it can be used. That reminded me that I had recently been playing with the Python.Included package, which lets you embed the Python runtime in any modern .NET project. Using it is as simple as adding their NuGet package and bootstrapping the Python engine.
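The bootstrapping step can be sketched roughly as follows. This is a minimal example based on the Python.Included and pythonnet APIs as I understand them, not the exact code used in Catalyst; it assumes a project with top-level statements and the Python.Included NuGet package installed.

```csharp
// Requires the Python.Included NuGet package (which brings in pythonnet).
using System;
using Python.Included;
using Python.Runtime;

// Downloads and unpacks an embedded Python distribution on the first run;
// later runs reuse the existing copy.
await Installer.SetupPython();

// Start the embedded interpreter inside the current .NET process.
PythonEngine.Initialize();

// Calls into Python should hold the Global Interpreter Lock.
using (Py.GIL())
{
    dynamic sys = Py.Import("sys");
    Console.WriteLine("Embedded Python version: " + sys.version);
}
```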

The nice thing about this library is that it will automatically install the Python environment on the first run, so you don’t need to have or manage an existing Python installation.

With this in mind, I decided to try integrating spaCy and Catalyst, providing access to the latest and greatest NLP models developed by the amazing Explosion team, while keeping things accessible to C# developers and taking advantage of the strongly-typed development experience of the .NET world.

Installing spaCy & models

The spaCy website is really helpful for getting started, and includes a widget that generates the right commands to install spaCy and any language-specific models you might need.

The obvious first step was to install spaCy (i.e. the pip install spacy command generated by that widget). This ended up being easier than expected with the Python C# integration.
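A sketch of that installation step, following the pattern shown in the Python.Included documentation. It assumes the synchronous form of TryInstallPip and PipInstallModule; newer releases of the package may return a Task from these methods, in which case they should be awaited.

```csharp
using System;
using Python.Included;
using Python.Runtime;

await Installer.SetupPython();

// Assumption: synchronous API. In newer Python.Included releases these two
// calls may be asynchronous and need an `await`.
Installer.TryInstallPip();           // make sure pip is available
Installer.PipInstallModule("spacy"); // the equivalent of `pip install spacy`

PythonEngine.Initialize();
using (Py.GIL())
{
    // If the install worked, the module can be imported like any other.
    dynamic spacy = Py.Import("spacy");
    Console.WriteLine("spaCy version: " + spacy.__version__);
}
```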

The next step was getting it to install the necessary models. After a bit of fiddling with how spaCy downloads and installs models (and how it handles model compatibility across versions), I ended up reverse engineering the download logic and reimplementing it in C#. The C# code calls the Installer.PipInstallModule method directly, with a URL built from the installed spaCy version and the language and model size requested by the user (similar to how the spaCy CLI itself invokes pip).
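To give an idea of what building that URL involves, here is a simplified sketch. BuildModelUrl is a hypothetical helper, not Catalyst's actual code, and the model name and version are example values. The URL layout is the one used by the spacy-models GitHub releases at the time of writing; picking a model version that matches the installed spaCy version (which spaCy resolves through a compatibility table published in the same repository) is left out here.

```csharp
using System;

// Example values only: a small English model and one of its released versions.
string url = BuildModelUrl("en_core_web_sm", "2.3.1");
Console.WriteLine(url);

// Hypothetical helper: builds the download URL of a packaged spaCy model,
// assuming the release layout of the explosion/spacy-models repository.
static string BuildModelUrl(string modelName, string modelVersion)
{
    const string baseUrl = "https://github.com/explosion/spacy-models/releases/download";
    string release = $"{modelName}-{modelVersion}";
    return $"{baseUrl}/{release}/{release}.tar.gz";
}
```

The resulting string is what would then be handed to Installer.PipInstallModule in place of a plain package name.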

And it was as easy as that ✨

A little bit of refactoring and clean-up, and we’re ready to use spaCy from C#.
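Usage looks roughly like the sketch below, written from my recollection of the Catalyst.spaCy README. Treat the exact names (Spacy.Initialize, Spacy.For, Spacy.ModelSize, ProcessSingle) and the namespaces as assumptions that may differ between package versions.

```csharp
// Requires the Catalyst.spaCy NuGet package.
// Assumption: the Spacy class lives in the Catalyst namespace and the
// Language enum in Mosaik.Core.
using System;
using Catalyst;
using Mosaik.Core;

// Sets up the embedded Python environment and installs the requested models.
using (await Spacy.Initialize(Spacy.ModelSize.Small, Language.Any, Language.English))
{
    // Get a pipeline backed by the small English spaCy model.
    var nlp = Spacy.For(Spacy.ModelSize.Small, Language.English);

    // A regular Catalyst document, processed in place by spaCy.
    var doc = new Document("This is a test of integrating Spacy and Catalyst", Language.English);
    nlp.ProcessSingle(doc);

    Console.WriteLine(doc.ToJson());
}
```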

The nice thing about this integration is that you can use any existing spaCy model to process your text data, and write the rest of your logic using the strongly-typed Catalyst C# API. The magic happens in a method that copies the data back from the doc container in the Python world into a Catalyst Document object.
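The general idea of copying data out of a spaCy doc can be illustrated with plain pythonnet. This is not Catalyst's implementation: TokenInfo is a hypothetical record standing in for the Catalyst Document, and the sketch assumes spaCy and the en_core_web_sm model are already installed in the embedded environment.

```csharp
using System;
using System.Collections.Generic;
using Python.Included;
using Python.Runtime;

await Installer.SetupPython();
PythonEngine.Initialize();

var tokens = new List<TokenInfo>();

using (Py.GIL())
{
    dynamic spacy = Py.Import("spacy");
    dynamic nlp = spacy.load("en_core_web_sm"); // assumes this model is installed
    dynamic doc = nlp("Catalyst and spaCy work together.");

    // Copy each token's attributes into strongly-typed .NET values
    // while the GIL is still held.
    foreach (dynamic token in doc)
    {
        tokens.Add(new TokenInfo(
            (string)token.text,
            (string)token.pos_,   // coarse part-of-speech tag
            (string)token.dep_,   // dependency relation to the head token
            (int)token.head.i));  // index of the head token in the doc
    }
}

// From here on, no Python objects are needed.
foreach (var t in tokens)
{
    Console.WriteLine($"{t.Text}\t{t.Pos}\t{t.Dependency}\thead={t.HeadIndex}");
}

// Hypothetical container used only for this illustration.
record TokenInfo(string Text, string Pos, string Dependency, int HeadIndex);
```

Converting everything to .NET types inside the Py.GIL() block keeps the rest of the program free of Python object lifetimes, which is the same benefit the Catalyst Document gives you.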

You can even batch-process documents, which is the recommended approach for processing lots of text with spaCy. To do that, just use the Process method and pass any IEnumerable<Document> to it.

There are now lots of possibilities to expose more of spaCy to the C# world, like being able to load your own spaCy models, which I’ll try to add soon. If you have any ideas or suggestions, feel free to drop us a line on the Catalyst issues page or in our Gitter chat!

And if you want to give it a try, just add the new Catalyst.spaCy NuGet package to your C# project to get started!

CTO @curiosity_ai