Welcome to this beginner’s guide to automated content analysis with R. This guide is an English-language fork of the seminal German guide inhaltsanalyse-mit-r.de by Cornelius Puschmann and is maintained by Cornelius Puschmann and Mario Haim.
Currently, this guide is divided into nine chapters, which present essential approaches to automated content analysis with R on the basis of numerous examples. The chapters are written as R notebooks, which combine explanations with R code that can be executed and adapted together with the provided corpora and other resources. The latest (development) version of the R notebooks can be found on GitHub.
You can download all R notebooks, corpora, dictionaries, and other resources used in this guide as one large ZIP file.
This guide’s most important foundation is the R package quanteda, which has been developed by Ken Benoit and colleagues. It provides a sophisticated infrastructure for the analysis of text in R. Using quanteda, you can easily import text data, create corpora, count words, and even apply dictionaries, which makes quanteda considerably more extensive than comparable packages. Comparably broad packages include tm and (to some extent) tidytext; in contrast to tm, however, quanteda is both younger and faster, provides a huge variety of functions, and has excellent documentation. Importantly, quite a few examples discussed here are taken directly from the quanteda documentation.
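To give a first impression of the steps just mentioned (creating a corpus, counting words, applying a dictionary), here is a minimal sketch. It assumes a recent version of quanteda (version 3 or later) is installed; the two toy texts and the dictionary categories are made up for illustration and are not part of this guide's corpora. Importing text from files is left out here.

```r
# Minimal sketch of a quanteda workflow.
# Assumption: quanteda (>= 3.0) is installed. The texts and the dictionary
# are hypothetical examples, not resources shipped with this guide.
library(quanteda)

# Hypothetical toy texts as a named character vector
texts <- c(
  doc1 = "The detective solved the case. The case was strange.",
  doc2 = "A strange crime puzzled the police and the detective."
)

# Create a corpus from the character vector
toy_corpus <- corpus(texts)

# Split the texts into tokens, drop punctuation, and lowercase everything
toy_tokens <- tokens(toy_corpus, remove_punct = TRUE)
toy_tokens <- tokens_tolower(toy_tokens)

# Count words: a document-feature matrix has one row per document
# and one column per distinct word
toy_dfm <- dfm(toy_tokens)
topfeatures(toy_dfm, n = 3)

# Apply a hypothetical dictionary with two illustrative categories
toy_dict <- dictionary(list(
  crime  = c("crime", "case"),
  people = c("detective", "police")
))

# Replace word counts by counts per dictionary category
toy_counts <- dfm_lookup(toy_dfm, dictionary = toy_dict)
toy_counts
```

The resulting `toy_counts` object has one row per document and one column per dictionary category, so it can be converted with `as.matrix()` or `convert(toy_counts, to = "data.frame")` for further analysis.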
Other packages used in this guide depend on the specific chapter. For example, in the chapter on supervised machine learning, we will be using RTextTools. For topic modelling, we will build on the packages topicmodels and stm. For tagging, parsing, and entity recognition, we need tools for linguistic annotation, which are provided, among others, by udpipe and spacyr.
Finally, we will make extensive use of the tidyverse packages. The tidyverse (as in “universe”) is a project by New Zealand statistician Hadley Wickham to turn R into the leading data-science programming language (despite numerous syntactic and performance challenges). If you are new to the tidyverse, it may be a bit hard to grasp at first, but trust us, it’s worth it. And once you have got the hang of, for example, tidyr, dplyr, or ggplot2, you will not want to miss them anymore. The must-read introduction to the tidyverse, by the way, is the open-source book R for Data Science by Garrett Grolemund and Hadley Wickham himself.
A corpus (plural: corpora) is a body of text to be analyzed. We will be using a wide variety of corpora throughout this guide. While you might find the consistent use of a single corpus more plausible for an introductory guide, we deliberately chose a multitude of corpora that differ in language, beat, news outlet, structure, and volume. By working with social-media data, emails, press releases, political speeches, petitions, and other sorts of texts, you will get a broad overview of these distinct types and their capabilities. Some corpora, such as the Sherlock Holmes corpus, are included because of their ease of use; others, such as the EUspeech corpus, are employed because of their relevance for social-scientific research. Importantly, the corpora are free to use, either because copyright has expired or because it does not protect the content (for example, tweets or comments).
Corpus | Description | Texts | Words | Genre | Language | Source | Chapter |
---|---|---|---|---|---|---|---|
Sherlock Holmes | Detective novels by Arthur Conan Doyle | 12 | 126,804 | Literature | en | archive.org | 1, 2, 3, 6 |
 | Tweets by Donald Trump and Hillary Clinton during the 2016 US presidential campaign | 18,826 | 458,764 | Social Media | en | trumptwitterarchive.com, own collection | 3 |
Finanzkrise | Articles from five Swiss daily newspapers, based on a keyword search for ‘Finanzkrise’ | 21,280 | 3,989,262 | Press | de | COSMAS | 3 |
EU | Speeches of EU politicians 2007-2015 | 17,505 | 14,279,385 | Politics | en | Schumacher et al., 2016 | 4 |
UN | Transcripts from the annual United Nations General Assembly debate 1970-2017 | 7,897 | 24,420,083 | Politics | en | Mikhaylov et al., 2017 | 4, 6 |
 | Random sample of comments from six public pages, posted 2015-2016 | 20,000 | 1,054,477 | Social Media | de | own collection | 4 |
New York Times | Articles from the New York Times as used in the ‘Making the News’ project (1996-2006) | 30,862 | 215,275 | Press | en | Boydstun, 2013 | 5 |
Enron | Enron emails | 341,071 | 178,908,873 | Economy | en | Klimt & Yang, 2004 | - |