Welcome to this beginner’s guide to automated content analysis with R. This guide is a fork of the seminal German-language guide inhaltsanalyse-mit-r.de by Cornelius Puschmann and is maintained by both Cornelius and Mario Haim.

Currently, this guide is divided into nine chapters, which present essential approaches to automated content analysis with R on the basis of numerous examples. The chapters are written as so-called R notebooks, which combine explanations with R code that can be executed and adapted together with the provided corpora and other resources. The latest (development) version of the R notebooks can be found on GitHub.

Table of Contents

  1. Introduction
  2. quanteda basics
  3. Word and text metrics
  4. Sentiment analysis
  5. Topic-specific dictionaries
  6. Supervised machine learning
  7. Topic modeling
  8. Multiple languages

Download

You may download all R notebooks, corpora, dictionaries, and other resources used in this guide as one large ZIP file.

R packages

This guide’s most important foundation is the R package quanteda, which has been developed by Ken Benoit and colleagues. It includes a sophisticated infrastructure for analyzing text in R. Using quanteda, you can easily import text data, create corpora, count words, and even apply dictionaries, which makes quanteda considerably more extensive than most comparable packages. Comparably broad packages include tm and (to some extent) tidytext; however, in contrast to tm, quanteda is both younger and faster, provides a huge variety of functions, and has excellent documentation. Importantly, quite a few examples discussed here are taken directly from the quanteda documentation.
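To give a first impression of the workflow just described, here is a minimal sketch, assuming quanteda (version 3 or later) is installed. The two toy texts and the tiny dictionary are invented for illustration only and are not part of this guide's corpora or resources.

```r
library(quanteda)

# Two toy documents (hypothetical, for illustration only)
texts <- c(
  doc1 = "The detective was happy to solve the case.",
  doc2 = "The sad client left the room."
)

# Create a corpus from the named character vector
corp <- corpus(texts)

# Split the texts into tokens, dropping punctuation
toks <- tokens(corp, remove_punct = TRUE)

# Count words: tokens per document and a document-feature matrix
ntoken(toks)
dfmat <- dfm(toks)
topfeatures(dfmat, n = 3)

# Apply a (hypothetical) two-category dictionary to the matrix
dict <- dictionary(list(positive = "happy", negative = "sad"))
sentiment <- dfm_lookup(dfmat, dictionary = dict)
sentiment
```

The same four steps (corpus, tokens, document-feature matrix, dictionary lookup) recur throughout the chapters, only with real corpora and far larger dictionaries.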

Other packages used in this guide depend on the specific chapter. For example, in the chapter on supervised machine learning, we will be using RTextTools. For topic modeling, we will build on the packages topicmodels and stm. And for tagging, parsing, and entity recognition, we need tools for linguistic annotation, which are provided, among others, by udpipe and spacyr.

Finally, we will be making extensive use of the tidyverse packages. The tidyverse (as in “universe”) is a project by New Zealand statistician Hadley Wickham to turn R into the leading data-science programming language (despite numerous syntactic and performance challenges). If you are new to the tidyverse, it may be a bit hard to grasp at first, but trust us, it’s worth it. And once you have gotten the gist of, for example, tidyr, dplyr, or ggplot2, you will not want to miss them anymore. The must-read introduction to the tidyverse, by the way, is the open-source book R for Data Science by Garrett Grolemund and Hadley Wickham himself.

Corpora

A corpus (plural: corpora) is a body of text to be analyzed. We will be using a wide variety of corpora throughout this guide. While you might find the consistent use of a single corpus more plausible for an introductory guide, we deliberately decided to use a multitude of corpora, which differ in their language, beat, news outlets, structure, and volume. By building on social-media data, emails, press releases, political talks, petitions, and other sorts of texts, you will get a broad overview of these distinct types and their capabilities. Some corpora, such as the Sherlock Holmes corpus, are included because of their ease of use; others, such as the EUspeech corpus, are employed because of their relevance for social-scientific research. Importantly, the corpora are free to use, because copyright has either expired or does not protect the content (for example, for tweets or comments).

| Corpus | Description | Texts | Words | Genre | Language | Source | Chapter |
|---|---|---|---|---|---|---|---|
| Sherlock Holmes | Detective novels by Arthur Conan Doyle | 12 | 126,804 | Literature | en | archive.org | 1, 2, 3, 6 |
| Twitter | Tweets by Donald Trump and Hillary Clinton during the 2016 US presidential campaign | 18,826 | 458,764 | Social Media | en | trumptwitterarchive.com, own collection | 3 |
| Finanzkrise | Articles from five Swiss daily newspapers, based on a keyword search for ‘Finanzkrise’ | 21,280 | 3,989,262 | Press | de | COSMAS | 3 |
| EU | Speeches of EU politicians, 2007-2015 | 17,505 | 14,279,385 | Politics | en | Schumacher et al., 2016 | 4 |
| UN | Transcripts from the annual United Nations General Assembly debate, 1970-2017 | 7,897 | 24,420,083 | Politics | en | Mikhaylov et al., 2017 | 4, 6 |
| Facebook | Random sample of comments from six public pages, posted 2015-2016 | 20,000 | 1,054,477 | Social Media | de | own collection | 4 |
| New York Times | Articles from the New York Times as used in the ‘Making the News’ project (1996-2006) | 30,862 | 215,275 | Press | en | Boydstun, 2013 | 5 |
| Enron | Enron emails | 341,071 | 178,908,873 | Economy | en | Klimt & Yang, 2004 | - |