Why an independent introduction to automated content analysis with R? Classic (i.e. manual) standardized content analysis is one of the most important methods in the empirical social sciences, and there are numerous widely circulated publications on it, which this brief overview certainly does not come close to matching, especially in richness of detail, depth of methodological classification, and degree of practical experience. However, the range of textbooks on offer is much more limited if one turns to (partially) automated content analysis and looks for a sufficiently application-oriented description that does not shy away from code and is also freely available. In this admittedly narrow category there is considerably less choice, and the focus is mostly on individual proprietary programs with a graphical user interface, which are usually not available free of charge, are not very powerful, and also become obsolete relatively quickly.
Programming languages for data science, especially R and Python, offer many new possibilities for social-science research, and not only in the field of statistics. Such languages are often more flexible, versatile, and powerful than standard commercial tools such as SPSS and MaxQDA, which does not mean that these tools and languages cannot be used side by side. Especially in recent years, R has grown tremendously in its potential for areas such as content analysis, where a combination of manual approaches and inflexible standard programs previously prevailed. The development of powerful R packages geared specifically toward social-science research, such as quanteda, stm, and RTextTools, makes working with content data so easy that R no longer has to lag behind its main competitor, Python, when it comes to the efficient analysis of text data.
What do we mean by content analysis here, anyway? To prevent disappointment: in this introduction, only text is (currently) used, even though images and video content are also part of classic content analysis. Although a lot has happened in these areas in recent years, procedures for the analysis of non-textual content would clearly go beyond the scope of this introduction. However, we have deliberately preferred the term content analysis to related terms, such as text mining, in order to make it clear that our focus is on social-science knowledge rather than on the subtleties of individual technological processes. At the same time, we do not use the term in the relatively narrow reading that often prevails in communication and media studies, where it usually means the classification of texts by human coders on the basis of a corresponding codebook. The reason is that, on the one hand, this introduction is ideally useful for communication and media scholars as well as for sociologists and political scientists (and, of course, beyond these disciplines), and, on the other hand, the following chapters repeatedly refer to approaches that clearly come from computational linguistics and computer science and that enrich the repertoire of social-science methods without falling under this classic definition of content analysis. Moreover, there are important techniques in these disciplines that are comparatively irrelevant for social scientists, such as tagging or parsing, which we discuss here only peripherally, even if they offer interesting potential to support other methods. Instead, computer-aided analysis within computational communication science is about the competent application of such techniques, with the clear goal of gaining knowledge about social phenomena from texts.
Therefore, we deliberately do not speak of text or data mining, but of content analysis, even if in the course of the following chapters we deal a lot with concepts like corpus, word frequency, and text statistics, which are probably missing from most classic introductions to content analysis. The fact that the path via R is chosen anyway, and not a tool with a graphical user interface, is no contradiction. The prejudice that programming equals computer science unfortunately still persists in the social sciences, even though the Internet has radically changed how and what can and should be coded, and thus also how relevant coding is for the social sciences.
A freely available introduction to automated content analysis with R that has precisely this application-related and social-scientific focus, and that at the same time provides very concrete code examples instead of primarily explaining content analysis in abstract terms, has been missing so far, despite advanced overviews on topics such as the linking of machine learning and content analysis. Whether this experiment is successful is, as always, up to the reader. In the following chapters, concrete application is clearly preferred over theoretical reflection, which is covered thoroughly by the well-known publications on content analysis. If you are on the lookout for an almost direct translation of classic content analysis with manual coding, we particularly recommend Chapter 5, which deals with the derivation of content analysis categories from structural features (usually words) using methods of supervised machine learning. This is where the relationship between traditional and automated content analysis is most evident.
Last but not least: this introduction requires not only knowledge about content analysis and some social-science education, but also basic R knowledge. There are many introductions to R. Beyond "normal R," many packages from the so-called tidyverse are used, above all ggplot2 and dplyr. The book R for Data Science by Hadley Wickham and Garrett Grolemund is especially useful for understanding the code examples. And of course, this introduction is a work in progress; feedback and criticism are very welcome!
CoRnelius Puschmann, puschmann@gmail.com / cbpuschmann, Hamburg, May 2019
MaRio Haim, mario@haim.it / drfollowmario, Berlin, September 2019