Introducing the New Topic Analyzer in Communalytic

November 19, 2024

Discovering Latent Topics

We are thrilled to announce the release of the new Topic Analyzer in Communalytic, a computational social science research tool designed to study online communities and public discourse on social media. Communalytic is entirely web-based and requires no special programming or coding experience.

The Topic Analyzer can automatically group social media posts that are semantically similar (i.e., similar in their meaning). The tool is designed to expedite the analysis of a large corpus of text data without the need to read and review every post.

Researchers can use the Analyzer to discover latent topics in a dataset, i.e., abstract topics that may not be directly observable from reading the posts alone. It can also be used to discover communities of users who share a similar interest in a topic but do not necessarily communicate with each other.

Creating Embeddings with Social Media Posts

The Topic Analyzer uses sentence-transformer models to convert social media posts into vector embeddings, numerical representations that capture the semantic meaning of each post.

Communalytic generates embeddings from social media posts using a multilingual text embedding model from VoyageAI: Voyage-3-lite (512 dimensions) in Communalytic EDU and Voyage-3 (1024 dimensions) in Communalytic PRO. These general-purpose models are optimized for multilingual retrieval, making them ideal for identifying semantic similarities between sentences in a dataset. These models were selected because they outperform similar models for creating embeddings from texts in 27 different languages: Arabic, Bengali, Czech, Danish, Dutch, English, French, Georgian, German, Greek, Hungarian, Italian, Japanese, Korean, Kurdish, Norwegian, Persian, Polish, Portuguese, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Urdu, Vietnamese.

Auto-clustering of Semantically Similar Posts

After transforming posts into embeddings, Communalytic uses a dimensionality reduction technique called UMAP to compress the embeddings from 512 or 1024 dimensions down to three dimensions. These reduced embeddings are then visualized with Communalytic’s interactive 3D Semantic Similarity Map, where each dot represents a post. Semantically similar posts are automatically grouped into clusters using one of three clustering algorithms: HDBSCAN, K-Means, or Gaussian Mixture. Each cluster represents a distinct latent topic. Dots within the same cluster are displayed in the same colour, making it easy to distinguish between topics.
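The clustering step can be illustrated with a minimal, pure-Python k-means sketch. It is not Communalytic's implementation: the six 3D points are hand-made stand-ins for UMAP-reduced embeddings, the UMAP step itself is omitted, and the function name, starting centroids, and colour palette are assumptions made for illustration.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means: returns one cluster index per point."""
    # Simplification: start from the first k points instead of random centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels

# Hand-made 3D coordinates standing in for reduced embeddings (assumed values).
reduced_points = [
    (0.1, 0.2, 0.1),
    (5.0, 5.1, 4.9),
    (0.2, 0.1, 0.3),
    (4.8, 5.2, 5.0),
    (0.0, 0.3, 0.2),
    (5.1, 4.9, 5.2),
]

labels = kmeans(reduced_points, k=2)

# One colour per cluster, mirroring how dots are coloured on the map.
palette = ["red", "blue"]
colours = [palette[label] for label in labels]
```

K-Means needs the number of clusters up front, whereas HDBSCAN infers it from the density of the data, so the choice of algorithm affects how many topics appear.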

For instance, consider a dataset containing three posts:

“I like apples.”
“I hate oranges.”
“I need a car.”

The Topic Analyzer would cluster the first two posts into a single group because they are semantically similar: both discuss preferences related to fruit, despite expressing opposite sentiments. The third post, however, would form its own cluster, as it focuses on an unrelated topic (a need for a car). This demonstrates how the Topic Analyzer attempts to group content based on meaning, rather than just surface-level keywords.
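One common way to measure how close two embeddings are is cosine similarity. The sketch below applies it to the three posts, assuming hand-made 3-dimensional vectors in place of real model output; the numbers are invented for illustration, and the source does not state which similarity measure Communalytic uses internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (assumed values, not model output).
toy_embeddings = {
    "I like apples.": [0.9, 0.1, 0.2],
    "I hate oranges.": [0.8, 0.2, 0.1],
    "I need a car.": [0.1, 0.9, 0.3],
}

# Score every pair of posts once.
posts = list(toy_embeddings)
similarities = {}
for i, first in enumerate(posts):
    for second in posts[i + 1:]:
        similarities[(first, second)] = cosine_similarity(
            toy_embeddings[first], toy_embeddings[second]
        )

# The pair with the highest score is the most semantically similar.
closest_pair = max(similarities, key=similarities.get)
```

With these toy vectors the two fruit posts score far higher with each other than either does with the car post, which is the pattern a clustering algorithm picks up on.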

Auto-Labelling of Semantically Similar Posts

After the posts are clustered and visualized with the 3D Semantic Similarity Map, researchers can review each cluster manually and assign a descriptive label. Alternatively, they can use one of the available LLMs (e.g., llama-3.1 or mistral-7b) to suggest a topic label for each cluster automatically.

A preview of the 3D Semantic Similarity Map

Below is a video demo of the Topic Analyzer and the companion 3D Semantic Similarity Map.

About Communalytic

Communalytic is a no-code computational social science research tool for studying online communities and public discourse on social media. It is designed to provide researchers, journalists, and students with essential resources and infrastructure for conducting independent, public-interest research. It has a full suite of easy-to-use social media data collectors and analyzers – no coding is required.

Users can bring their own data or use one of Communalytic’s various social media data collectors to collect data from platforms such as Bluesky, Mastodon, Reddit, Telegram, X (formerly Twitter), and YouTube.

There are two versions of Communalytic. Each is designed for different purposes and different sets of users:

Communalytic EDU is designed to help students learn about social media data analytics.

Communalytic PRO is designed for academic researchers and journalists and is ideal for large-scale research projects.

About Communalytic’s Data Analyzer Modules

Communalytic also comes with a set of built-in data analytics modules, including a:

Civility Analyzer that can identify toxic and prosocial interactions in a dataset using the latest machine-learning models (Perspective API and Detoxify),
Sentiment Analyzer that can calculate sentiment polarity scores to determine whether the text in a dataset expresses a positive, negative or neutral sentiment,
Topic Analyzer that can automatically group social media posts that are semantically similar to identify latent topics in a dataset (i.e., abstract topics that may not be directly observable from just reading the posts),
Network Analyzer that can generate and visualize various types of networks in a dataset, including signed and unsigned communication networks, as well as link-sharing networks. A signed network is one where the nodes and edges carry additional information such as weights (i.e., toxicity and prosocial scores or sentiment scores).
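To make the idea of edges that carry scores more concrete, here is a minimal sketch that collapses reply records into weighted, directed edges. The usernames, scores, and the function name are hypothetical, and the data structure is an assumption for illustration rather than Communalytic's internal format.

```python
from collections import defaultdict

# Hypothetical reply records: (author, replied_to, toxicity score from 0 to 1).
replies = [
    ("alice", "bob", 0.10),
    ("alice", "bob", 0.30),
    ("bob", "alice", 0.80),
    ("carol", "alice", 0.05),
]

def build_weighted_network(records):
    """Collapse reply records into directed edges with a reply count and mean score."""
    scores = defaultdict(list)
    for source, target, score in records:
        scores[(source, target)].append(score)
    return {
        edge: {"replies": len(values), "mean_score": sum(values) / len(values)}
        for edge, values in scores.items()
    }

network = build_weighted_network(replies)
```

Each key is a (source, target) pair, so the same dictionary answers both who talks to whom and how those interactions scored on average.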

These Analyzers can automatically:

Detect antisocial (Toxicity, Insults, Threats, etc.) and prosocial (Compassion, Curiosity, Respect, etc.) interactions in any text-based dataset,
Assess sentiments in online discourse (i.e., opinion mining),
Group together social media posts that are semantically similar and identify latent topics, uncovering hidden communities of users who share an interest in a topic but may not know each other or have ever communicated with one another,
Find out who talks to whom, who shares whose content, who shares the same links or resources, etc.

When used together, these analytical modules can help researchers study online communities and influencers, map shared interests among community members, study the spread of misinformation and disinformation, and detect signs of possible coordination among seemingly disparate actors.