Classifying the Ideological Orientation of User-Submitted Texts in Social Media

With the long-term goal of understanding how language is used and evolves within online communities, this work explores the application of natural language processing techniques to classify text articles according to their ideological orientation (i.e., conservative or liberal). We first collect a balanced corpus of text articles posted to the online communities r/Liberal and r/Conservative on the social media website Reddit. Using the corpus, we develop and apply three classifiers. The baseline classifier is a Bayes model that accounts only for each text article's web domain; as such, its classification is independent of content. Next, we develop a support vector machine (SVM) model with term frequency-inverse document frequency (TF-IDF) features; this approach highlights differences in language by using a count-based feature space to differentiate text articles. Last, we evaluate a context-based transformer model (RoBERTa) and discuss its underperformance relative to the baseline and SVM models.
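To make the content-independent baseline concrete, the following is a minimal sketch of a Bayes classifier that looks only at an article's web domain. It is not the authors' implementation: the class name `DomainBayes`, the helper `domain_of`, the Laplace smoothing parameter `alpha`, and the example URLs are all assumptions introduced for illustration.

```python
from collections import Counter, defaultdict
from math import log
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


class DomainBayes:
    """Hypothetical domain-only Bayes classifier with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                         # smoothing strength (assumed)
        self.label_counts = Counter()              # articles per label
        self.domain_counts = defaultdict(Counter)  # label -> domain -> count
        self.domains = set()                       # all domains seen in training

    def fit(self, urls, labels):
        for url, label in zip(urls, labels):
            d = domain_of(url)
            self.label_counts[label] += 1
            self.domain_counts[label][d] += 1
            self.domains.add(d)
        return self

    def predict(self, url):
        d = domain_of(url)
        total = sum(self.label_counts.values())
        vocab = len(self.domains) + 1  # one extra slot for unseen domains

        def score(label):
            # log P(label) + log P(domain | label), smoothed
            prior = log(self.label_counts[label] / total)
            seen = self.domain_counts[label][d]
            denom = self.label_counts[label] + self.alpha * vocab
            return prior + log((seen + self.alpha) / denom)

        # Sorting makes tie-breaking deterministic.
        return max(sorted(self.label_counts), key=score)


# Made-up training data; the domains are placeholders, not real sources.
train_urls = [
    "https://www.left.example/a", "https://left.example/b",
    "https://www.right.example/c", "https://right.example/d",
]
train_labels = ["liberal", "liberal", "conservative", "conservative"]
model = DomainBayes().fit(train_urls, train_labels)
```

Because the text of the article never enters the model, any accuracy it achieves reflects only how strongly each community favors particular sources.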

Datasets


Introduced in the Paper:

Reddit Ideology Database

Used in the Paper:

Reddit
Task            Dataset                   Model    Metric Name          Metric Value  Global Rank
Classification  Reddit Ideology Database  SVM      F1-score (Weighted)  86.19         # 1
Classification  Reddit Ideology Database  RoBERTa  F1-score (Weighted)  78.13         # 2
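
For readers unfamiliar with the metric, a weighted F1-score is commonly computed as the per-class F1 averaged with weights proportional to each class's share of the true labels. The sketch below assumes that definition; the function name `weighted_f1` and the toy labels are illustrative and are not taken from the paper. The values in the table above appear to be on a 0-100 scale, whereas this function returns a value between 0 and 1.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (range 0 to 1)."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    score = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Toy labels for illustration only.
y_true = ["liberal", "liberal", "liberal", "conservative"]
y_pred = ["liberal", "liberal", "conservative", "conservative"]
toy_score = weighted_f1(y_true, y_pred)
```

On a balanced corpus such as the one described, the weighted and unweighted (macro) averages coincide, since every class carries the same weight.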

Methods