What’s really the difference? Developing machine learning classifiers for identifying Russian state-funded news in Serbia

by Ognjan Denkovski
Thesis supervisor: dr. Damian Trilling

Topic overview
Democratic nations globally are facing increasing levels of false and misleading information circulating through social media and political websites, often propagating alternative socio-political realities. One of the main actors in this process has been the Russian state, whose organized disinformation campaigns have influenced elections and narratives surrounding major social events throughout the Western world (Helmus et al., 2018). An essential element of these campaigns is the content produced by state-funded outlets like RT and Sputnik – content thereafter spread by underfunded or sympathetic local media and organized social network groups (Bradshaw & Howard, 2018). In response to a lack of comprehensive research examining the characteristics of this content, this study examines: a) whether, and if so how, news articles from Russian state-funded outlets are distinct from those from Western media (represented by U.S. state-funded outlets) and b) whether these differences can be used to automatically determine the source of an article. The investigation uses the case study of Serbia, a country characterized by severe restrictions on critical media and deep divisions between pro-Western and pro-Eastern segments of the population. The findings are promising for future text classification research.

Data and methods
A total of 10,132 articles were analyzed from three U.S. and two Russian state-funded outlets. Fourteen features, grouped into three feature sets, were obtained for each article: structural frame presence, thematic frame presence and linguistic properties of text. The structural frames were used to capture the differences in routines and standards of news production across U.S. and Russian outlets, such as the use of conflict frames in news presentation, while the thematic frames refer to narratives ascribed to Russian news in the Balkans, such as anti-Western attitudes. Finally, the linguistic properties of text capture variations in the linguistic habits of authors from both countries, such as the complexity of the language used (Schoonvelde, Brosius, Schumacher & Bakker, 2019). In the analysis of frame presence, the study combined manual content analysis with supervised machine learning, while all linguistic properties of text were obtained automatically with text analysis packages in Python.
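As a rough illustration of what automatically obtained linguistic properties can look like, the sketch below computes a few surface-level measures using only the Python standard library. This summary does not list the study's exact linguistic features or the packages used, so the function name `linguistic_features` and the four measures are assumptions chosen for illustration.

```python
import re


def linguistic_features(text: str) -> dict:
    """Compute a few simple surface properties of an article.

    Illustrative only: these four measures are assumed examples,
    not the feature set used in the study.
    """
    # Split on sentence-final punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Runs of letters only; works for Latin and Cyrillic script alike.
    words = re.findall(r"[^\W\d_]+", text.lower())
    n_words = len(words)
    return {
        "n_words": n_words,
        "mean_sentence_length": n_words / len(sentences) if sentences else 0.0,
        "mean_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
    }


features = linguistic_features("Ovo je test. Ovo je drugi test!")
print(features)
```

Measures of this kind need no dictionary or parser for the target language, which is why they are cheap to compute across a large corpus.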

Feature distribution
The variance and distribution of features across articles from both countries were examined to identify features that do not vary significantly and are thus unlikely to inform the country source classification task. Statistical analyses demonstrated that five of the seven frames, as well as all linguistic properties of text, are useful for country source classification. The distribution of the relevant features (N = 12) is presented in the following figures.
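The summary does not say which statistical tests were used. One common way to check whether the presence of a frame differs between two sources is a Pearson chi-square test on a 2x2 table, sketched below with the standard library only. The function name `chi_square_2x2` and the article counts in the usage example are made up for illustration and are not the study's data.

```python
import math


def chi_square_2x2(present_a: int, total_a: int, present_b: int, total_b: int):
    """Pearson chi-square test (df = 1, no continuity correction).

    Compares how often a frame is present in group A versus group B.
    Returns (chi2, p_value).
    """
    table = [
        [present_a, total_a - present_a],
        [present_b, total_b - present_b],
    ]
    n = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [present_a + present_b, n - present_a - present_b]
    # A frame that is always or never present cannot separate the groups.
    if 0 in col_totals or 0 in row_totals:
        return 0.0, 1.0
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For one degree of freedom, the upper tail of the chi-square
    # distribution equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: frame present in 190 of 1000 vs. 50 of 1000 articles.
chi2, p = chi_square_2x2(190, 1000, 50, 1000)
print(f"chi2 = {chi2:.2f}, p = {p:.3g}")
```

A frame whose test statistic is close to zero occurs at similar rates in both sources and is a natural candidate for removal before classification.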

Figure 1: Frame presence as a proportion of total articles per country source
Figure 2: Linguistic feature distribution by country source, standardized
Figure 3a: Word Cloud NE – U.S.
Figure 3b: Word Cloud NE – Russia

The feature distribution shows that U.S. news is characterized by human interest coverage, while Russian news has a high prevalence of anti-Western articles (19%). Russian news is more complex and succinct, while U.S. news is on average longer and less substantive, with simpler language. Named entity use shows that Russian news primarily discusses Russia and the U.S., while U.S. news is largely focused on Serbian affairs.

Country source classification
Nine feature combinations were tested for their potential to automatically recognize whether an article was written by a U.S. or a Russian outlet, including four feature combinations suggested by feature selection analyses. The classifying potential of each feature combination is presented in Figure 4, with the weighted F1 score (the "harmonic mean" of precision and recall, averaged over both classes in proportion to their size) serving as a practical indicator of classifier accuracy (Forman, 2003, p. 1294). The best performance was achieved with all remaining features (N = 12), with an F1 score of 75%. The best performing individual feature set was the linguistic feature set, with an F1 score of 71%.
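To make the metric concrete, the sketch below computes a weighted F1 score from true and predicted labels using only the standard library. It is intended to match what scikit-learn's `f1_score(..., average="weighted")` returns, although this summary does not state which library the study used. The labels in the usage example are invented.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: items of this class that were predicted as this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += f1 * n_true / total
    return score


# Invented example: five articles, one U.S. article misclassified as Russian.
y_true = ["US", "US", "US", "RU", "RU"]
y_pred = ["US", "US", "RU", "RU", "RU"]
print(round(weighted_f1(y_true, y_pred), 3))
```

Weighting by class size keeps the larger class from being under-represented in the summary score when the two sources contribute unequal numbers of articles.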

Figure 4: Country source classification scores

Significance and implications
The findings from this study are significant from a theoretical, methodological and practical perspective. Theoretically, the study demonstrates that news production values and linguistic habits vary significantly across U.S. and Russian state-funded outlets, suggesting that quantitatively profiling outlets' news production values and journalists' linguistic habits has considerable merit for political communication research and news source classification.

Methodologically, the study shows that the distinction between U.S. and Russian news in Serbia is best captured through linguistic properties of text. This suggests that a simple yet informative cross-country analysis of the differences between Russian news and Western or local media is possible without being constrained by the researcher's language skills.

Practically, classifiers such as the one developed in this study can be applied to the identification of extremist social media communities, such as the Sputnik-linked group recently shut down by Facebook for spreading anti-NATO propaganda. They can also help monitor the development of new narratives promoted by Russian outlets among vulnerable political groups – detection mechanisms which Western governments are increasingly in need of (Woolley & Howard, 2016; Bradshaw & Howard, 2018; Hanlon, 2018; Helmus et al., 2018; Cerulus, 2019).

The GitHub repository of the project can be accessed via this link.

References

Bradshaw, S., & Howard, P. N. (2018). Challenging truth and trust: A global inventory of organized social media manipulation. The Computational Propaganda Project.

Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3, 1289-1305.

Hanlon, B. (2018). A Long Way To Go: Analyzing Facebook, Twitter, and Google’s Efforts to Combat Foreign Interference. Alliance for Securing Democracy Executive Summary. Retrieved from: https://securingdemocracy.gmfus.org/a-long-way-to-go-analyzing-facebook-twitter-and-googles-efforts-to-combat-foreign-interference/

Helmus, T. C., Bodine-Baron, E., Radin, A., Magnuson, M., Mendelson, J., Marcellino, W., Bega, A., & Winkelman, Z. (2018). Russian social media influence: Understanding Russian propaganda in Eastern Europe. Rand Corporation Report. Retrieved from: https://www.rand.org/content/dam/rand/pubs/research_reports/RR2200/RR2237/RAND_RR2237.pdf. doi: 10.7249/RR2237

Schoonvelde, M., Brosius, A., Schumacher, G., & Bakker, B. N. (2019). Liberals lecture, conservatives communicate: Analyzing complexity and ideology in 381,609 political speeches. PloS one, 14(2), e0208450. doi: 10.1371/journal.pone.0208450

Woolley, S. C., & Howard, P. N. (2016). Automation, algorithms, and politics| Political communication, computational propaganda, and autonomous agents—Introduction. International Journal of Communication, 10, 9.

Cerulus, L. (2019, Jan 17). Facebook takes down two Russian disinformation networks in Eastern Europe. Politico. Retrieved from: https://www.politico.eu/article/facebook-takes-down-two-russian-disinformation-networks-in-eastern-europe/