As part of the Social Media Lab’s ongoing efforts to study the changing landscape of social media websites and their communities, in this post we share some of our preliminary data analysis of a popular social news aggregation and discussion site called reddit. On reddit, users can submit content such as text or links, and can also comment or vote on the content posted by others. The site is organized by areas of interest called subreddits (think of them as separate online groups).

Inspired by some of the previous work on mapping linkages between reddit’s various groups, we want to better understand the rapidly growing community of over 200M reddit users and their posting practices. As an initial step, we are developing a typology of users based on how they engage with the site and with each other. In particular, we ask whether there are different types of users on reddit, and whether these types differ across subreddits.
For our exploratory analysis we used Tableau, a visualization engine that helps to summarize and discover interesting patterns in structured data. The subreddits that we chose for this analysis were “ask” subreddits, where users ask and answer questions on topics ranging from history (r/AskHistorians/) to astronomy (r/AskAstronomy/). In total, using reddit’s API, we collected ~250,000 publicly available posts and comments submitted to 13 “ask” subreddits over a one-year period in 2015.
Below is an interactive visualization showing the relationship between posting and commenting behaviour. Each data point/shape represents a reddit user who either asked a question (submitted a post), answered or commented on another user’s question, or both. The X axis represents the number of posts (~questions) and the Y axis represents the number of comments (~answers). Different shapes correspond to different subreddits (see the legend to the right), and the size of each shape represents the number of likes the user received. We wanted to know whether there are groups of users who tend to either ask or answer questions and why, and whether this differs across subreddits.
Note: Since we applied a logarithmic (base-10) transformation to the counts to reduce the effect of outliers, and the logarithm of zero is undefined, users who submitted only posts or only comments are not visible in this chart.
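To illustrate why such users disappear from the chart, here is a minimal Python sketch of a base-10 log transform. It is not the code behind the visualization (that was built in Tableau); the function name `log_point` and the sample users are hypothetical.

```python
import math

def log_point(posts, comments):
    """Return (log10(posts), log10(comments)), or None if either count is 0."""
    if posts <= 0 or comments <= 0:
        # log10(0) is undefined, so this user cannot be placed on the chart
        return None
    return (math.log10(posts), math.log10(comments))

# Hypothetical users, given as (number of posts, number of comments)
users = {"user_a": (10, 100), "user_b": (0, 25), "user_c": (3, 0)}

# Keep only the users that can actually be plotted on log-log axes
plotted = {}
for name, (posts, comments) in users.items():
    point = log_point(posts, comments)
    if point is not None:
        plotted[name] = point
```

In this made-up sample only `user_a` survives the transform; `user_b` (no posts) and `user_c` (no comments) are dropped.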
The three different colors in the visualization represent different types of users, detected automatically by Tableau using a k-means clustering algorithm that takes into account the number of posts, the number of comments, and the number of likes. The clustering algorithm allows us to group users with similar posting behaviour. Each line shown in the visualization represents a linear relationship (regression) between the number of posts and the number of comments for each cluster.
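For readers unfamiliar with k-means, the idea can be sketched in a few lines of plain Python. This is only an illustration of the general technique (Lloyd's algorithm with a simple farthest-point start), not Tableau's implementation; the function names and the sample data are hypothetical, and each point is assumed to be a (posts, comments, likes) triple that has already been log-transformed.

```python
import math

def farthest_point_init(points, k):
    """Pick k starting centroids: the first point, then repeatedly the
    point farthest from all centroids chosen so far (deterministic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    centroids = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each user to the nearest centroid
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments have stabilised
        centroids = new_centroids
    return centroids, labels

# Hypothetical users as (log posts, log comments, log likes)
sample = [
    (0.1, 0.3, 0.5), (0.2, 0.4, 0.6), (0.0, 0.5, 0.4),  # moderately active
    (1.2, 0.5, 1.0), (1.4, 0.6, 1.1),                   # heavy posters
    (0.5, 2.2, 2.0), (0.6, 2.4, 2.1),                   # heavy commenters
]
centroids, labels = kmeans(sample, k=3)
```

The cluster numbers themselves are arbitrary; what matters is which users end up sharing a label.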
Overall, we found that active users contribute both posts/questions and comments/answers with a slight preference towards commenting/answering. This suggests the presence of a generally attentive community of users who are willing to help and contribute to the group by answering and commenting on other people’s posts and questions, and are not just there to get their own questions answered. This trend is especially visible among users grouped in the largest (orange) cluster, labelled as
Based on the clustering analysis, we also found two extremes:
users who contributed 10 or more posts/questions (log>=1); these are the users who presumably found the group’s answers helpful in the past and came back to ask more questions;
users who contributed 100 or more comments/answers (log>=2); these users are especially active communicators who engage others in discussion by answering and contributing to other people’s questions.
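The two thresholds above can be written as a simple rule of thumb. The sketch below is a simplification for illustration only: in our analysis the groups come from the clustering, not from fixed cut-offs, and the function name `extremes` and its labels are hypothetical.

```python
POST_THRESHOLD = 10      # 10 or more posts, i.e. log10(posts) >= 1
COMMENT_THRESHOLD = 100  # 100 or more comments, i.e. log10(comments) >= 2

def extremes(posts, comments):
    """Return the set of 'extreme' labels that apply to one user."""
    labels = set()
    if posts >= POST_THRESHOLD:
        labels.add("extreme poster")
    if comments >= COMMENT_THRESHOLD:
        labels.add("extreme commenter")
    return labels

# Hypothetical users
frequent_asker = extremes(posts=12, comments=5)
frequent_answerer = extremes(posts=2, comments=150)
typical_user = extremes(posts=3, comments=20)
```

A user who crosses neither threshold gets an empty set, which corresponds to the large group of moderately active users.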
One of the most interesting features of this interactive visualization is that it lets us view the prevalence of the two extremes across different subreddits by using the highlight subreddit feature (at the bottom left of the visualization). Using this feature, for example, we can see that AskPhysics has some extreme posters/questioners (red cluster) but no extreme commentators/answerers (blue cluster), while AskLiteraryStudies has no extreme posters/questioners (red cluster) and only three extreme commentators/answerers (blue cluster). This suggests that there may be slight variations in posting behaviour among members of different subreddits.
Our future work will examine why some users like to publish more posts than comments and vice versa, and why some subreddits encourage different posting behaviour than others. We also plan to use Social Network Analysis to discover and compare posting practices at the group level across different subreddits.
Note: The analysis was carried out by Bradly Dahdaly, a data science intern at the Social Media Lab, with contributions from Anatoliy Gruzd, Philip Mai and other members of the Lab.