Automated Content Analysis with R

Welcome to this beginner’s guide to automated content analysis with R. This guide is a fork of the seminal German-language guide at inhaltsanalyse-mit-r.de by Cornelius Puschmann and is maintained by Cornelius Puschmann and Mario Haim.

Currently, this guide is divided into nine chapters, which present essential approaches to automated content analysis with R through numerous examples. The chapters are written as R notebooks, which combine explanations with R code that can be executed and adapted together with the provided corpora and other resources. The latest (development) version of the R notebooks can be found on GitHub.

Download

You can download all R notebooks, corpora, dictionaries, and other resources used in this guide as one large ZIP file.

R packages

This guide’s most important foundation is the R package quanteda, which has been developed by Ken Benoit and colleagues. It provides a sophisticated infrastructure for the analysis of texts in R. Using quanteda, you can easily import text data, create corpora, count words, and apply dictionaries, which makes quanteda considerably more extensive than most comparable packages. Packages of similar breadth include tm and (to some extent) tidytext; in contrast to tm, however, quanteda is both younger and faster, provides a huge variety of functions, and has excellent documentation. Importantly, quite a few examples discussed here are taken directly from the quanteda documentation.
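As a rough illustration of the workflow described above (create a corpus, count words, apply a dictionary), here is a minimal sketch. It assumes a recent version of quanteda (3.x or later) is installed. The two toy sentences and the one-category dictionary `crime` are made up for this example and are not part of the guide's resources.

```r
library(quanteda)

# Two toy documents, invented for illustration only
texts <- c(
  doc1 = "The detective solved the case.",
  doc2 = "The case was strange, and the detective was puzzled."
)

# Create a corpus from the named character vector
corp <- corpus(texts)

# Split into tokens and drop punctuation
toks <- tokens(corp, remove_punct = TRUE)

# Build a document-feature matrix (features are lowercased by default)
dfmat <- dfm(toks)

# Show the three most frequent words across both documents
print(topfeatures(dfmat, n = 3))

# A hypothetical one-category dictionary
dict <- dictionary(list(crime = c("detective", "case")))

# Count dictionary matches per document
dict_counts <- dfm_lookup(dfmat, dictionary = dict)
print(dict_counts)
```

The same steps carry over to real corpora: only the first step changes, since the texts are then read from files rather than typed in by hand.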

Other packages used in this guide depend on the specific chapter. For example, in the chapter on supervised machine learning, we will be using RTextTools. For topic modelling, we will build on the packages topicmodels and stm. For tagging, parsing, and entity recognition, we need tools for linguistic annotation, which are provided, among others, by udpipe and spacyr.

Finally, we will make extensive use of the tidyverse packages. The tidyverse (as in “universe”) is a project by New Zealand statistician Hadley Wickham to turn R into the leading data-science programming language (despite numerous syntactic and performance challenges). If you are new to the tidyverse, it may be a bit hard to grasp at first, but trust us, it’s worth it. Once you have got the gist of, for example, tidyr, dplyr, or ggplot2, you will not want to do without them. The must-read introduction to the tidyverse, by the way, is the open-source book R for Data Science by Garrett Grolemund and Hadley Wickham himself.
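To give a first impression of the tidyverse style, here is a small sketch using dplyr. It assumes dplyr is installed. The figures are the text and word counts of three corpora from the corpus overview in this guide; the column names and the derived column `words_per_text` are chosen for this example only.

```r
library(dplyr)

# Text and word counts for three corpora, taken from the corpus overview
corpora <- tibble(
  corpus = c("Sherlock Holmes", "Twitter", "Facebook"),
  texts  = c(12, 18826, 20000),
  words  = c(126804, 458764, 1054477)
)

# Chain steps with the pipe: add a derived column, then sort by it
result <- corpora %>%
  mutate(words_per_text = words / texts) %>%
  arrange(desc(words_per_text))

print(result)
```

Each step takes a data frame and returns a data frame, so further steps such as `filter()` or `summarise()` can be appended to the same chain.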

Corpora

A corpus (plural: corpora) is a body of text to be analyzed. We will be using a wide variety of corpora throughout this guide. While you might find the consistent use of a single corpus more plausible for an introductory guide, we deliberately chose a multitude of corpora that differ in language, beat, news outlet, structure, and volume. By working with social-media data, emails, press releases, political speeches, petitions, and other sorts of texts, you will get a broad overview of these distinct types and their capabilities. Some corpora, such as the Sherlock Holmes corpus, are included because of their ease of use; others, such as the EUspeech corpus, are employed because of their relevance for social-scientific research. Importantly, the corpora are free to use, because copyright has either expired or does not protect the content (for example, for tweets or comments).

Corpus	Description	Texts	Words	Genre	Language	Source	Chapters
Sherlock Holmes	Detective novels by Arthur Conan Doyle	12	126,804	Literature	en	archive.org	1, 2, 3, 6
Twitter	Tweets by Donald Trump and Hillary Clinton during the 2016 US presidential campaign	18,826	458,764	Social Media	en	trumptwitterarchive.com, own collection	3
Finanzkrise	Articles from five Swiss daily newspapers, based on a keyword search for ‘Finanzkrise’	21,280	3,989,262	Press	de	COSMAS	3
EU	Speeches of EU politicians 2007-2015	17,505	14,279,385	Politics	en	Schumacher et al., 2016	4
UN	Transcripts from the annual United Nations General Assembly debate 1970-2017	7,897	24,420,083	Politics	en	Mikhaylov et al., 2017	4, 6
Facebook	Random sample of comments from six public pages, posted 2015-2016	20,000	1,054,477	Social Media	de	own collection	4
New York Times	Articles from the New York Times as used in the ‘Making the News’ project (1996-2006)	30,862	215,275	Press	en	Boydstun, 2013	5
Enron	Enron emails	341,071	178,908,873	Economy	en	Klimt & Yang, 2004	-

Cornelius Puschmann & Mario Haim
