Topic modelling for short-text documents (like tweets) comes with its own set of challenges… which require a dedicated set of tools.
In this post on Towards Data Science, I compare the widely popular LDA topic modelling approach to its less famous younger brother: GSDMM. I explain the main differences between the two algorithms, provide intuitions about how they operate under the hood, describe the pre-processing each one requires, and compare how well they cluster varying numbers of short-text documents.
Spoiler alert: each has its own advantages, and a lot depends on your use case.
The great news? You don’t actually have to choose! In the Medium post, I also showcase a method for combining LDA and GSDMM into a single model that gets you the best of both worlds!