Intro

In 2016 I was studying Psychology at the University of Innsbruck, and I had no idea where to go afterwards. I had just finished my Bachelor’s thesis and knew one thing: I wasn’t made for clinical psychology. I didn’t really want to work with patients. But I had also discovered the research process, and I felt there might be a path forwards for me within science.

What interested me most was how modern big data technologies could be applied to questions in Psychology. How we could use the massive amounts of unstructured text and image data that are generated daily to gain insights into the human mind.

I jumped right in and two years later analysed over 20 million comments on Reddit to examine how the moral values we express in text change through group dynamics over long periods of time.

Reddit is an under-utilized source of data. It’s relatively anonymous, messy, and huge. It seemed like a natural choice to go back to in order to learn more about transformers and their applications. So here it is: my first attempt at using BERTopic to extract what users of r/HillaryForPrison and r/ImpeachTrump (for some balance) talked about.

The great thing about BERTopic is that it doesn’t just provide out of the box topic modelling, it also allows us to examine how the topics differ between these two subreddits and how they developed over time.

Data Cleaning

From the BERTopic documentation:

“Should I preprocess the data?

No. By using document embeddings there is typically no need to preprocess the data as all parts of a document are important in understanding the general topic of the document. Although this holds true in 99% of cases, if you have data that contains a lot of noise, for example, HTML-tags, then it would be best to remove them. HTML-tags typically do not contribute to the meaning of a document and should therefore be removed.”

Seems like I don’t have to do much data cleaning. Because Reddit comments are not necessarily ‘clean’, I quickly looped over the data to remove any non-alphanumeric characters (except for punctuation). Empty comments & known bots are already removed. In addition, after some trial and error, I also removed all links and hashtags, because these resulted in several rather meaningless topics.
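The cleaning step described above could look roughly like the sketch below. The regular expressions and the `clean_comment` helper are my own stand-ins for illustration, not the exact rules I ran on the data, and the set of punctuation that is kept is an assumption.

```python
import re

# Hypothetical patterns: the exact cleaning rules are not shown in this post.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
# Keep letters, digits, whitespace and common punctuation; drop everything else.
DISALLOWED_RE = re.compile(r"[^A-Za-z0-9\s.,;:!?'\"()\-]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_comment(text: str) -> str:
    """Remove links, hashtags and stray symbols from a single comment."""
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = DISALLOWED_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


raw = ["Lock her up!!! https://example.com/x #MAGA", "   ", "Impeach now 🙂 (please)."]
# Drop comments that are empty after cleaning.
cleaned = [c for c in (clean_comment(r) for r in raw) if c]
print(cleaned)
```

Links and hashtags are removed before the generic character filter, so no fragments of a URL survive as pseudo-words.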

Fitting the Model

In the first step the sentence transformer model is imported and document embeddings are created.

import os
import pickle

# Read in comments (`inpath` and `outpath` are assumed to be defined beforehand)
comments_path = os.path.join(inpath, 'comments.txt')

with open(comments_path, 'rb') as fp:
    comments = pickle.load(fp)


# Load sentence transformer model
from sentence_transformers import SentenceTransformer
# Good, relatively fast all-purpose model
sentence_model = SentenceTransformer("all-MiniLM-L6-v2") 


# Create documents embeddings (with progress bar because it takes a while)
embeddings = sentence_model.encode(comments, show_progress_bar=True) 

# save because this takes a while 
embeddings_path = os.path.join(outpath, 'embeddings.pickle')

with open(embeddings_path, 'wb') as fp:
    pickle.dump(embeddings, fp)
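Since encoding is the slow part, it can be handy to only compute the embeddings when no saved file exists yet. Below is a minimal sketch of that pattern. `load_or_compute` is a hypothetical helper and `fake_encode` merely stands in for the `sentence_model.encode(...)` call, so the example runs without downloading a model.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the pickled object at `path`; compute and save it first if missing."""
    if os.path.exists(path):
        with open(path, "rb") as fp:
            return pickle.load(fp)
    result = compute()
    with open(path, "wb") as fp:
        pickle.dump(result, fp)
    return result


# Stand-in for the slow embedding step; it records how often it is called.
calls = []


def fake_encode():
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.pickle")
first = load_or_compute(cache_path, fake_encode)   # computes and writes the file
second = load_or_compute(cache_path, fake_encode)  # reads the file, no recomputation
print(len(calls))
```

With the real model, the second argument would be something like `lambda: sentence_model.encode(comments, show_progress_bar=True)`.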

Then we can define custom UMAP & HDBSCAN models to reduce the embedding dimension and perform document clustering. In order to speed up this process we preprocess and initialize with PCA embeddings. Originally I was planning to do this for r/TheDonald and r/SandersForPresident (and I’m still planning to at some point), but even with 128 GB of RAM the model wouldn’t fit into memory, so I had to settle for two smaller subreddits.

There’s also the option to move computation to GPU, but with this data set the topic model takes only about 5-10 minutes to execute, so the added work is not necessary.

import numpy as np
from hdbscan import HDBSCAN  
from umap import UMAP 
from sklearn.decomposition import PCA

# Speed up the process by initializing with PCA values
def rescale(x, inplace=False):
    """ Rescale an embedding so optimization will not have convergence issues.
    """
    if not inplace:
        x = np.array(x, copy=True)

    x /= np.std(x[:, 0]) * 10000

    return x


pca_embeddings = rescale(PCA(n_components = 5).fit_transform(embeddings))
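To get a feeling for what the rescaling does, here is a small check on random toy data (a stand-in for the real PCA output, which I'm assuming behaves similarly): after rescaling, the first component has a standard deviation of about 0.0001, and the original array is left untouched.

```python
import numpy as np


def rescale(x, inplace=False):
    """Same scaling as above: divide by 10000 times the std of the first column."""
    if not inplace:
        x = np.array(x, copy=True)
    x /= np.std(x[:, 0]) * 10000
    return x


rng = np.random.default_rng(0)
toy = rng.normal(scale=3.0, size=(200, 5))  # toy stand-in for PCA output

scaled = rescale(toy)
print(np.std(toy[:, 0]))     # spread of the first component before rescaling
print(np.std(scaled[:, 0]))  # spread after rescaling, about 1e-4
```

The relative positions of the points don't change, only the overall scale, so UMAP starts from a compact version of the PCA layout.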

While the UMAP model uses mostly the default values, I chose to set the min_cluster_size for the HDBSCAN to 1500 - otherwise we’d get lots of very rare topics.

# Define UMAP model to reduce embeddings dimension
umap_model = UMAP(n_neighbors = 15, # simply the default value.
                  n_components = 5,
                  min_dist = 0.1,
                  metric = 'cosine',
                  init = pca_embeddings,
                  low_memory = False,
                  random_state = 101)

# Define HDBSCAN model to perform document clustering
hdbscan_model = HDBSCAN(min_cluster_size = 1500,
                        metric = 'euclidean',
                        cluster_selection_method = 'eom',
                        prediction_data = True)


# Removing stopwords
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=0.001, max_df=0.99)

All of this is put together to define the final model.

# Create BERTopic model
from bertopic import BERTopic
topic_model = BERTopic(top_n_words = 8,
                       min_topic_size = 1500,
                       calculate_probabilities = False,
                       # custom umap to reduce embeddings dimension
                       umap_model = umap_model, 
                       # custom clustering model
                       hdbscan_model = hdbscan_model, 
                       # diversify topic representations, easier interpretation
                       diversity = 0.6,  
                       # remove stopwords after fitting
                       vectorizer_model = vectorizer_model, 
                       verbose = False)

Once the model is defined we can train it on the ~450,000 comments.

# read comments
comments_path = os.path.join(inpath, 'comments.txt')
comments = []

with open(comments_path, 'rb') as fp:
    comments = pickle.load(fp)

# Train model, extract topics and probabilities
topics, probabilities = topic_model.fit_transform(comments, embeddings)
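A quick sanity check after fitting is to look at how many comments ended up in each topic and how many were labelled as outliers (topic -1). The sketch below uses a short made-up `topics` list in place of the real model output.

```python
from collections import Counter

# Toy stand-in for the `topics` list returned by fit_transform;
# -1 is the label for outliers that were not assigned to any topic.
topics = [0, 0, 1, -1, 0, 2, -1, 1, 0, -1]

sizes = Counter(topics)
outlier_share = sizes[-1] / len(topics)
# Topics ordered from largest to smallest, outliers excluded.
ranked = [(topic, n) for topic, n in sizes.most_common() if topic != -1]

print(f"outliers: {outlier_share:.0%}")
print(ranked)
```

If the outlier share is very high, lowering min_cluster_size is one knob to try, at the cost of more and smaller topics.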

Visualization

There are few hard rules for what constitutes a good topic model. In the end it all boils down to whether the topics that are extracted are meaningful and interpretable. BERTopic provides a host of out of the box visualizations that can help to better understand the extracted topics and to ultimately make sense of the underlying documents.

The first of these is the intertopic distance map. Topics are projected into two-dimensional space to enable us to interpret how they relate to each other (or rather how the algorithm projects those relations).

I chose a relatively low number of topics (or rather a large min_topic_size) for easier interpretability, but of course this can obscure the hierarchical clustering and relations between topics to some extent.

topic_model.visualize_topics()

The hierarchy of topics can be visualized through a dendrogram, just like the output of any hierarchical clustering algorithm. If two topics are joined close to the origin of the x-axis, their representations are similar - and there’s a good chance they might be overlapping. Topics that are only joined further along the x-axis are less closely related to each other.

topic_model.visualize_hierarchy()

Another way to look at how the topics are related is by using a heatmap - though at this point I feel I’m almost repeating myself. Because we set diversity to a relatively high value (0.6), the topics are overall relatively dissimilar.

topic_model.visualize_heatmap()
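As far as I understand it, each cell of the heatmap is a cosine similarity between two topic vectors. The following sketch computes such a matrix by hand; the three-dimensional vectors and topic names are made up purely for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up topic vectors; real ones have many more dimensions.
topic_vectors = {
    "emails": [1.0, 0.0, 0.0],
    "server": [1.0, 1.0, 0.0],
    "taxes": [0.0, 0.0, 2.0],
}

names = list(topic_vectors)
similarity = [
    [cosine_similarity(topic_vectors[a], topic_vectors[b]) for b in names]
    for a in names
]
for name, row in zip(names, similarity):
    print(name, [round(s, 2) for s in row])
```

Topics pointing in similar directions score close to 1, unrelated topics close to 0, regardless of how long the vectors are.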

Visualize documents

Just like the topics we can project the comments themselves into two-dimensional space to visualize their relations in terms of the topics they contain (as opposed to, for example, their syntactic structures).

# Build the topic hierarchy used for the plot
hierarchical_topics = topic_model.hierarchical_topics(comments, topics)

# Reduce dimensionality of embeddings to 2D for plotting
reduced_embeddings = UMAP(
  n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine'
  ).fit_transform(embeddings)

By hovering over the plot you can see the content of a small proportion of example comments to get a feeling for what the topics represent.

topic_model.visualize_hierarchical_documents(
  comments, hierarchical_topics, reduced_embeddings = reduced_embeddings,
  sample = 0.003, hide_document_hover = False,
  width = 800, height = 550
  )

Topics by subreddit

What sets BERTopic apart is that by leveraging c-TF-IDF we can examine how the topics differ between specific predefined classes, such as the subreddits they were posted in. Up until this point we treated r/HillaryForPrison and r/ImpeachTrump as if they were one entity, which they are clearly not. Even from the names alone it’s obvious that they represent completely different political ideologies. However, they share the hope of legal action against their political opponent, so we’ll likely find shared as well as separate topics.
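To make the idea behind c-TF-IDF a bit more concrete: all comments of a class are treated as one long document, and each word is weighted by how frequent it is within the class, scaled by log(1 + A / f), where A is the average number of words per class and f is the word's frequency across all classes. The function below is a simplified pure-Python sketch of that formula with toy data, not the implementation used inside the library.

```python
import math
from collections import Counter


def c_tf_idf(class_docs):
    """Simplified class-based TF-IDF: one weight per (class, word)."""
    # Join all comments of a class into one long "document" and count its words.
    counts = {
        c: Counter(" ".join(docs).lower().split()) for c, docs in class_docs.items()
    }
    total_per_word = Counter()
    for counter in counts.values():
        total_per_word.update(counter)
    avg_words_per_class = sum(total_per_word.values()) / len(counts)

    weights = {}
    for c, counter in counts.items():
        n_words = sum(counter.values())
        weights[c] = {
            word: (freq / n_words)
            * math.log(1 + avg_words_per_class / total_per_word[word])
            for word, freq in counter.items()
        }
    return weights


# Toy comments, invented for this example.
toy = {
    "r/HillaryForPrison": ["emails emails server", "lock her up"],
    "r/ImpeachTrump": ["impeach him now", "taxes taxes emails"],
}
weights = c_tf_idf(toy)
for subreddit, w in weights.items():
    print(subreddit, sorted(w, key=w.get, reverse=True)[:2])
```

Words that are frequent in one class but rare overall get the highest weights, which is what lets the same topic have a different representation per subreddit.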

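# get topic representations per class
# (`subreddits` is assumed to hold one subreddit label per comment, in the same order as `comments`)
topics_per_class = topic_model.topics_per_class(comments, topics, classes = subreddits)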

Clicking through the visualization below it can be seen that r/HillaryForPrison talked about topics 0 (putting Hillary Clinton & Bernie Sanders in prison) and 7 (the email scandal) vastly more than r/ImpeachTrump. r/ImpeachTrump talked more about the electoral system (topic 1) and about taxes, jobs and the economy (topic 4), as well as impeaching Trump (duh!) and fascism.

On most other topics, there seems to be more activity in r/HillaryForPrison - this could be because the subreddit contains more comments (though not by much!) or because conservatives rally behind a smaller selection of key topics, while the range of topics is bigger for progressives.

It might be interesting to examine how sentiment varies between these topics and/or to set lower limits for min_topic_size so subtopics can be explored in more detail. An interesting example of this is topic 5, where both subreddits talk about the media. The topic contains the keywords fox news and breitbart, two right-wing news outlets/tabloids, and CNN, which is more aligned with the Democrats. Both subreddits engage in this topic frequently, but likely with very different talking points and sentiment.

# visualize
topic_model.visualize_topics_per_class(
    topics_per_class, 
    top_n_topics = 14,
    width = 800, height = 400
    )

Topics over time

Similar to how we can examine how the topics differ and align between the two subreddits, we can also check and visualize how they develop over time. Especially in political subreddits like r/HillaryForPrison and r/ImpeachTrump it stands to reason that some topics would be greatly influenced by current events. Perhaps these could also be the topics that trend in both subreddits at the same time. Other topics might be ‘evergreens’, like Trump fans bitching about the bad, bad system media.

Timestamps for the comments are saved in POSIX epoch time (seconds since Jan 01, 1970) and need to be converted to date time format in order to model the topics over time. Re-calculating the c-TF-IDF for each time point is computationally very expensive, so it’s necessary to reduce the number of time points by retaining only year and month. Even with this greatly reduced input space and the relatively small corpus size, training takes over 1 hour.

import datetime

time_path = os.path.join(inpath, 'timestamps.txt')
timestamps = []

with open(time_path, 'rb') as fp:
    timestamps = pickle.load(fp)

dates = [datetime.date.fromtimestamp(timestamp) for timestamp in timestamps]

# Retain only year and month, otherwise calculation will take too long
yearmon = [datetime.date(d.year, d.month, 1) for d in dates]
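One detail worth knowing: `datetime.date.fromtimestamp` converts using the local timezone of the machine, so comments posted around midnight at a month boundary can land in different months on different computers. A variant that pins the conversion to UTC and counts how many time points remain could look like this; `to_month` and the three timestamps are made up for the example.

```python
import datetime
from collections import Counter


def to_month(ts):
    """Map a POSIX timestamp (seconds) to the first day of its month, in UTC."""
    dt = datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc)
    return datetime.date(dt.year, dt.month, 1)


# Three made-up timestamps: two in November 2016, one in December 2016.
toy_timestamps = [1478563200, 1478649600, 1481155200]
months = [to_month(ts) for ts in toy_timestamps]

# Each distinct month is one time point for which c-TF-IDF is recalculated.
bins = Counter(months)
print(sorted(bins.items()))
```

The number of keys in `bins` is the number of time points the model has to process, which is what drives the run time.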

Training the model is again very easy and can be done in a single line. It would perhaps be even more interesting to combine the topics_per_class and topics_over_time, but this is not yet implemented in BERTopic, so it would require a lot more custom code - that’s for another day.

topics_over_time = topic_model.topics_over_time(comments, topics, yearmon)

It’s not surprising to see that in general most activity is centered around the 2016 election. After all, r/HillaryForPrison is the bigger subreddit and Hillary Clinton became less politically relevant than Donald Trump in the aftermath.

Some topics, however, appear to be influenced by political events of the day (other than the election) more than others. These are topics 6 (Russian involvement in US politics), 8 (racism & white privilege) and 10 (actually impeaching Trump). All of these are more aligned to r/ImpeachTrump and therefore could retain importance even after the election.

topic_model.visualize_topics_over_time(
    topics_over_time, 
    top_n_topics = 14,
    width = 800, height = 300)

Citation

For attribution, please cite this work as

Jonas Schropp (Jan 1, 0001) Hillary vs. the Donald. Retrieved from /blog/2022-08-09-hillary-vs-td/

BibTeX citation

@misc{ 0001-hillary-vs.-the-donald,
 author = { Jonas Schropp },
 title = { Hillary vs. the Donald },
 url = { /blog/2022-08-09-hillary-vs-td/ },
 year = { 0001 },
 updated = { Jan 1, 0001 }
}