Relating Disparate Things with NLP

Whenever you read an article on, say, the BBC’s website, it presents you with a list of other articles related to the one you’re reading.

I never used to pay that list any mind -- I just used it for further reading -- but a new project came in recently and made me reconsider how these relations are actually drawn up. The project requires a mechanism for drawing up lists of items related to a given news article, based on the metadata attached to each item -- not just lists of related articles. As it turns out, there are lots of ways one might draw up a set of many-to-one correlations like a ‘related articles’ list, and which one you use depends on the requirements of the task at hand.

Comparing text with natural language processing

Very broadly speaking, natural language processing is the discipline of creating computer programs which aim to work with human languages in a human-like manner. It is a field of intense research today, and helps drive the practice of machine learning: the creation and refinement of computer programs which can improve their own mechanisms. With NLP, it is possible to create programs which can ‘classify’ a body of text, placing it in a category based on how it compares to other texts of known classification. For example, classification is commonly used in spam filtering: a spam filter can classify a given email as either ‘spam’ or ‘not spam’ based on how its text compares to a corpus of known spam -- a corpus generally built up and updated by email users marking messages as spam.

Text classification is not limited to just two categories, though: it can be used with many categories, in order to produce a more broadly capable classification program. I ended up focussing my research for this project on text classification, as it sounded like something I could apply to many projects in the future.

Text classification algorithms

Text classification, as a problem, is fairly straightforward to state; implementing a technique to solve it is where things get complicated. There are many techniques for classifying text, but arguably the most straightforward is naive Bayes classification. In essence, a naive Bayes classifier counts and compares the frequency of words, and suggests probabilities that a given text belongs to any particular category based on how word frequencies in the text compare to word frequencies in other texts of known classification. Because it is quite a simple algorithm, there are many implementations of naive Bayes classification out on the Web, for many different languages. One such implementation is Natural, an NLP library written in JavaScript for use in Node.js apps.
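The word-counting at the heart of the algorithm can be sketched in plain JavaScript. This is a simplified illustration of the technique, not how Natural implements it internally -- in a real Node.js app you would reach for Natural’s BayesClassifier class instead -- and the training sentences are made up:

```javascript
// A minimal naive Bayes text classifier -- an illustrative sketch only.
class NaiveBayes {
  constructor() {
    this.wordCounts = {}; // category -> word -> count
    this.totals = {};     // category -> total word count
    this.docCounts = {};  // category -> number of training documents
    this.totalDocs = 0;
    this.vocab = new Set();
  }

  tokenize(text) {
    return text.toLowerCase().match(/[a-z']+/g) || [];
  }

  addDocument(text, category) {
    this.wordCounts[category] = this.wordCounts[category] || {};
    this.totals[category] = this.totals[category] || 0;
    this.docCounts[category] = (this.docCounts[category] || 0) + 1;
    this.totalDocs += 1;
    for (const word of this.tokenize(text)) {
      this.vocab.add(word);
      this.wordCounts[category][word] = (this.wordCounts[category][word] || 0) + 1;
      this.totals[category] += 1;
    }
  }

  // Score the document against each category by summed log-probabilities,
  // with Laplace (add-one) smoothing so unseen words don't zero a score out.
  classify(text) {
    const words = this.tokenize(text);
    let best = null;
    let bestScore = -Infinity;
    for (const category of Object.keys(this.wordCounts)) {
      let score = Math.log(this.docCounts[category] / this.totalDocs);
      for (const word of words) {
        const count = this.wordCounts[category][word] || 0;
        score += Math.log((count + 1) / (this.totals[category] + this.vocab.size));
      }
      if (score > bestScore) {
        bestScore = score;
        best = category;
      }
    }
    return best;
  }
}

const classifier = new NaiveBayes();
classifier.addDocument('cheap pills buy now limited offer', 'spam');
classifier.addDocument('win money fast click here', 'spam');
classifier.addDocument('meeting agenda for tuesday morning', 'not spam');
classifier.addDocument('minutes from the project meeting', 'not spam');

console.log(classifier.classify('buy cheap pills'));         // 'spam'
console.log(classifier.classify('tuesday project meeting')); // 'not spam'
```

Note that nothing here understands the words themselves: the classifier only compares counts, which is precisely why it is called ‘naive’.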

Using Natural

Simply passing a bunch of files into a naive Bayes classifier and expecting magic to happen is, as I discovered, really naive. The classifier is called ‘naive’ for a reason: on its own, it can’t distinguish between words, give words different weightings, or apply any sort of ‘intelligent’ guesswork. Given sufficiently small data sets -- on the order of three or four categories, and a corpus of a few hundred words -- a naive Bayes classifier can work quite well. Once the data set grows beyond these thresholds, however, probabilities begin to level off rapidly, and the classifier becomes unable to return useful results. As the size of the corpus increases, common words become more common and enter into multiple categories, meaning the classifier struggles to determine anything useful other than the fact that a given document relates, to some degree, to just about every category the classifier knows about.
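This flattening is easy to see in a contrived demonstration. Below, three made-up categories share all of their most frequent words (the category names and texts are purely illustrative), and a document made of those shared words ends up with an identical smoothed probability under every category:

```javascript
// Contrived demo: when categories share their most frequent words,
// naive Bayes probabilities flatten out. Categories and texts are made up.
const common = 'the of and to in it was said that';
const corpora = {
  politics: `${common} minister`,
  sport:    `${common} striker`,
  business: `${common} shares`,
};

const tokenize = (t) => t.toLowerCase().match(/[a-z']+/g) || [];
const vocab = new Set(Object.values(corpora).flatMap(tokenize));

// Smoothed log-likelihood of the document under one category's corpus.
function logLikelihood(doc, corpus) {
  const tokens = tokenize(corpus);
  const counts = {};
  for (const w of tokens) counts[w] = (counts[w] || 0) + 1;
  return tokenize(doc).reduce(
    (sum, w) => sum + Math.log(((counts[w] || 0) + 1) / (tokens.length + vocab.size)),
    0
  );
}

// A document made entirely of words every category shares (or none has).
const doc = 'it was said that the meeting was in the morning';
const scores = Object.fromEntries(
  Object.entries(corpora).map(([cat, text]) => [cat, logLikelihood(doc, text)])
);

// Normalise the log scores into probabilities for readability.
const max = Math.max(...Object.values(scores));
const exp = Object.entries(scores).map(([c, s]) => [c, Math.exp(s - max)]);
const z = exp.reduce((acc, [, v]) => acc + v, 0);
const probs = Object.fromEntries(exp.map(([c, v]) => [c, v / z]));

for (const [c, p] of Object.entries(probs)) console.log(c, p.toFixed(3));
// politics 0.333
// sport 0.333
// business 0.333
```

Every category comes out equally likely -- the shared common words dominate, so the classifier can say nothing useful about where the document belongs.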

Natural and friends

Pre-processing, I learnt, is vital to making a naive Bayes classifier effective. Breaking words down into common lemmas, combining categories into broader ones based on common metadata, and picking out known keywords to give them an exceptional weighting all helped make the classifier much more useful. Wordpos, Stopwords, and a little regex helped me break texts down into useful lists of nouns in their citation form -- the form of a word before inflection.
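A pipeline along those lines can be sketched as below. This is a simplified stand-in, not the real thing: the stopword list, the ‘boost’ keywords, and the crude suffix-stripping are all illustrative, where the Stopwords package supplies a proper stopword list and Wordpos does real part-of-speech tagging and noun extraction:

```javascript
// Simplified pre-processing pipeline -- an illustrative stand-in for what
// the Stopwords and Wordpos packages do properly. The stopword list and
// the 'boost' keywords below are made up for this example.
const STOPWORDS = new Set([
  'the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'on', 'is', 'was',
  'it', 'that', 'this', 'for', 'with', 'as', 'by', 'at', 'be',
]);

// Hypothetical keywords we want to weight more heavily during training.
const KEYWORDS = new Set(['election', 'economy']);

function preprocess(text, keywordWeight = 3) {
  const tokens = (text.toLowerCase().match(/[a-z']+/g) || [])
    .filter((w) => !STOPWORDS.has(w))
    // Crude 'lemmatisation': strip common plural endings. A real pipeline
    // would use proper lemmatisation (e.g. via WordNet) instead.
    .map((w) => w.replace(/(ies)$/, 'y').replace(/(e?s)$/, ''));

  // Repeat known keywords so they carry extra weight in the word counts.
  return tokens.flatMap((w) =>
    KEYWORDS.has(w) ? Array(keywordWeight).fill(w) : [w]
  );
}

console.log(preprocess('The economy dominated the election debate coverage'));
// → ['economy', 'economy', 'economy', 'dominated',
//    'election', 'election', 'election', 'debate', 'coverage']
```

Feeding token lists like this into the classifier, instead of raw text, is what stopped the probabilities from flattening out so quickly.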

However, there is still much to do, and many other classification techniques to investigate. There is also plenty of research to be done into building a multi-threaded text classifier -- but that is another story altogether.