Pro Tips for Short-Text Topic Modelling in Python

Topic modelling for short-text documents (like tweets) comes with its own set of challenges, which require a dedicated set of tools.

In this post on Towards Data Science, I compare the widely popular LDA topic modelling approach to its less famous younger brother: GSDMM. I explain the main differences between the two algorithms, provide intuitions about how they operate under the hood, describe the pre-processing each one requires, and evaluate how well they cluster varying amounts of short-text documents.

Spoiler alert: each has its own advantages, and a lot depends on your use case.

The great news? You don’t actually have to choose! In the Medium post I also showcase a method for combining LDA and GSDMM into a single model to get you the best of both worlds!
