So far, we have worked with neatly prepared datasets. In everyday academic work, however, we regularly have to deal with messier data as well. A common challenge, in particular, is working with texts in a variety of languages; think, for example, of news reports from different countries. Luckily, the commonly known providers of online dictionaries also offer APIs to translate texts.
For the purpose of this demonstration, we will be using Google Translate within its free limits. We do so not least because there is an R package that provides access to Google’s APIs, namely googleLanguageR. Hence, we’ll start by installing it along with the usual suspects, tidyverse and quanteda.
if(!require("quanteda")) {install.packages("quanteda"); library("quanteda")}
if(!require("tidyverse")) {install.packages("tidyverse"); library("tidyverse")}
if(!require("googleLanguageR")) {install.packages("googleLanguageR"); library("googleLanguageR")}
theme_set(theme_bw())
For all APIs generally, but for the Google API specifically, one needs to register and retrieve some form of API authentication. This can be a “token” (basically a secret text) or a pair of two secret strings forming an “authenticator.” Typically, during this registration process one also needs to provide credit-card information, even if working with the free volume only. At the Google Developer Console, enable the Translation API and create service-account credentials (whereby you need to enter a project name and grant access to the Translation API), which provides you with a downloadable key file in JSON format. Download it and put it into your R working directory (sorry, we cannot provide you with one here).
Google claims, by the way, to warn you and actually ask for your permission before really charging you. For those control maniacs out there (like me), though, you can also regularly check the request monitor for your current quota usage.
After having the credential file available at a specific file location (for ease of use, we’ll just assume that this file path is stored in the variable credential.json), we now need to tell googleLanguageR to use it for authentication against Google’s API.
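As a minimal sketch of this step, you could store the path like so; the file name used here is only a hypothetical placeholder, yours will differ.
# Path to the service-account key file downloaded from the Google Developer Console.
# The file name below is a hypothetical placeholder.
credential.json <- "my-translation-project-key.json"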
gl_auth(credential.json)
If this passes without errors, we are good to go. That is, we can use the API to detect the language of a text. We’ll demonstrate that with the first paragraph of the first Sherlock Holmes novel.
load("data/sherlock/sherlock.absaetze.RData")
text <- sherlock.absaetze[[1, 'content']]
gl_translate_detect(text)
## 2019-09-11 20:08:21 -- Detecting language: 4294 characters
The result is by no means surprising: it’s English. In addition, we are presented with a confidence indicator and an estimate of whether the result actually “isReliable,” both of which are deprecated, though, and will be removed in future versions of the API. Just focus on the “language” column.
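If you only need the language code itself, you can store the result and extract just that column, as in this small sketch:
# Store the detection result and keep only the detected language code.
detection <- gl_translate_detect(text)
detection$language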
Of course, the main function is to actually translate a text into one of the many supported languages. Aside from the text to be translated, you need to specify the target language. It is also recommended to specify whether your text contains HTML or is simple plain text, although some preprocessing to remove unnecessary markup is advisable to reduce the number of characters to be translated (and thus charged for). If you know the text’s current language, you can also specify that; if omitted, Google will automatically try to detect it (see above), which might add to your quota usage.
gl_translate(text, format = 'text', source = 'en', target = 'de')
## 2019-09-11 20:08:21 -- Translating text: 4294 characters -
As you can (or probably cannot) see, the translation is decent but certainly not perfect. Quality varies from language to language, and the aforementioned alternative APIs produce differing results as well, but hey, it’s automated large-scale text translation into a whole multitude of languages. In fact, for Google Translate the package also provides us with an up-to-date list of what’s possible.
gl_translate_languages()
A final word on mass translation: you can easily use the API to translate a lot of texts at once (we will do so in a second), but the API also has limits per call. While you now know this and can keep it in mind for other APIs, the googleLanguageR package automatically splits large portions of text into separate API calls. This should affect neither the translation nor your quota usage, but it may cause large-scale translations to take considerably more time. Then again, you are already used to long-running computational tasks anyhow …
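If you prefer to control the batching yourself, a sketch like the following could be used; it is not run here, many.texts is a hypothetical character vector, and the chunk size of 50 texts (as well as the ten-second pause) is an arbitrary assumption rather than an official limit.
#chunks <- split(many.texts, ceiling(seq_along(many.texts) / 50)) # hypothetical vector, arbitrary chunk size
#translations.chunked <- purrr::map_dfr(chunks, function(chunk) {
#  result <- gl_translate(chunk, format = 'text', source = 'de', target = 'en')
#  Sys.sleep(10) # short pause between calls to spread out quota usage
#  result
#})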
Let’s assume we want to apply a dictionary, be it a sentiment or any topic-specific dictionary, to a corpus, but the dictionary’s language and the corpus’ language do not match. Specifically, for this example we will build on the already well-known English Bing Liu dictionary and a German news corpus of roughly 21,000 articles on the financial crisis (retrieved with the search query “Finanzkrise”), published between 2007 and 2012 in various Swiss daily newspapers.
We will first load both elements and also draw a sample from the corpus in order not to overstrain the API limits.
positive.words.bl <- scan("dictionaries/bingliu/positive-words.txt", what = "char", sep = "\n", skip = 35, quiet = T)
negative.words.bl <- scan("dictionaries/bingliu/negative-words.txt", what = "char", sep = "\n", skip = 35, quiet = T)
sentiment.dict.en <- dictionary(list(positive = positive.words.bl, negative = negative.words.bl))
load("data/finanzkrise.korpus.RData")
crisis.sample <- corpus_sample(korpus.finanzkrise, 20)
crisis.sample
## Corpus consisting of 20 documents and 6 docvars.
Essentially, we now have three options to sentiment-analyze our data: (1) translate the corpus into English and apply the original English dictionary, (2) translate the dictionary into German and apply it to the original corpus, or (3) use a dictionary that already exists in German.
We start with the first option, translating the corpus. For this we need to push our current sample to Google’s API and store the result. Since some of the texts include a little HTML, we specify the format accordingly.
translations <- gl_translate(crisis.sample$documents$texts, format = 'html', source = 'de', target = 'en')
## 2019-09-11 20:08:22 -- Translating html: 26096 characters -
translations
crisis.sample.translated <- crisis.sample
crisis.sample.translated$documents$texts <- translations$translatedText
The last line of the code above writes the translated texts back into the (copied) corpus, to which we can now simply apply the Bing Liu dictionary.
dfm.sentiment.translatedcorpus <-
crisis.sample.translated %>%
dfm(dictionary = sentiment.dict.en) %>%
dfm_weight(scheme = "prop")
dfm.sentiment.translatedcorpus
## Document-feature matrix of: 20 documents, 2 features (0.0% sparse).
## 20 x 2 sparse Matrix of class "dfm"
## features
## docs positive negative
## NZZ11/AUG.04114 0.87500000 0.1250000
## BUN12/SEP.00304 0.27272727 0.7272727
## BUN08/DEZ.00225 0.55555556 0.4444444
## NLZ08/DEZ.01218 0.61904762 0.3809524
## BAZ08/MAI.01310 0.33333333 0.6666667
## BEZ08/OKT.05282 0.16666667 0.8333333
## NZZ09/SEP.01035 0.50000000 0.5000000
## BAZ08/DEZ.00837.1 0.11111111 0.8888889
## NZZ08/OKT.04847 0.50000000 0.5000000
## NZZ12/MAR.00484 0.31578947 0.6842105
## NZZ08/JUN.02490.1 0.37500000 0.6250000
## NZZ12/JAN.03253 0.57142857 0.4285714
## NLZ10/FEB.03632 0.60000000 0.4000000
## NZZ09/AUG.03062 0.14285714 0.8571429
## NZZ11/AUG.03923 0.18181818 0.8181818
## NZZ09/SEP.03423 0.50000000 0.5000000
## BAZ09/NOV.04495 0.25000000 0.7500000
## BEZ08/NOV.04372.2 0.64285714 0.3571429
## NZZ08/JUL.04514 0.08333333 0.9166667
## NZZ09/NOV.03203 0.35714286 0.6428571
We can see some sentiment detected in all texts, so our procedure worked in essence. But how good is this procedure? Well, answering that requires comparative research …
Instead of translating the texts, we could also translate the Bing Liu dictionary. For that, we need to translate both lists of words (i.e., the positive and the negative ones) from their current language, English, into German. This takes a while, mainly because googleLanguageR splits our almost 7,000 words into a lot of separate API calls. Hence, we do not run these commands here but instead load the already translated lists.
#positive.translations <- gl_translate(positive.words.bl, format = 'text', source = 'en', target = 'de')
#negative.translations <- gl_translate(negative.words.bl, format = 'text', source = 'en', target = 'de')
load('data/bl_translated.RData')
sentiment.dict.en.de <- dictionary(list(positive = positive.translations$translatedText, negative = negative.translations$translatedText))
Once this is done, we can apply this newly translated dictionary to our original texts.
dfm.sentiment.translateddict <-
crisis.sample %>%
dfm(dictionary = sentiment.dict.en.de) %>%
dfm_weight(scheme = "prop")
dfm.sentiment.translateddict
## Document-feature matrix of: 20 documents, 2 features (10.0% sparse).
## 20 x 2 sparse Matrix of class "dfm"
## features
## docs positive negative
## NZZ11/AUG.04114 1.0000000 0
## BUN12/SEP.00304 0.4545455 0.5454545
## BUN08/DEZ.00225 1.0000000 0
## NLZ08/DEZ.01218 0.5714286 0.4285714
## BAZ08/MAI.01310 0 1.0000000
## BEZ08/OKT.05282 0 1.0000000
## NZZ09/SEP.01035 0.6666667 0.3333333
## BAZ08/DEZ.00837.1 0.3333333 0.6666667
## NZZ08/OKT.04847 0.5000000 0.5000000
## NZZ12/MAR.00484 0.4166667 0.5833333
## NZZ08/JUN.02490.1 0.3000000 0.7000000
## NZZ12/JAN.03253 0.6666667 0.3333333
## NLZ10/FEB.03632 0.7000000 0.3000000
## NZZ09/AUG.03062 0.3333333 0.6666667
## NZZ11/AUG.03923 0.4000000 0.6000000
## NZZ09/SEP.03423 0.7272727 0.2727273
## BAZ09/NOV.04495 0.2500000 0.7500000
## BEZ08/NOV.04372.2 0.8000000 0.2000000
## NZZ08/JUL.04514 0.3333333 0.6666667
## NZZ09/NOV.03203 0.5000000 0.5000000
Is it different from our result before? Let’s compare visually.
dfm.sentiment.translatedcorpus %>%
convert('data.frame') %>%
gather(positive, negative, key = "Polarity", value = "Sentiment") %>%
mutate(Type = 'translated corpus and Bing Liu') %>%
bind_rows(
dfm.sentiment.translateddict %>%
convert('data.frame') %>%
gather(positive, negative, key = 'Polarity', value = 'Sentiment') %>%
mutate(Type = 'translated Bing Liu')
) %>%
ggplot(aes(document, Sentiment, fill = Polarity)) +
geom_bar(stat="identity") +
scale_fill_brewer(palette = "Set1") +
ggtitle("Sentiment scores in news reports on the financial crisis in Swiss newspapers") +
xlab("") +
ylab("Sentiment share (%)") +
facet_grid(rows = vars(Type)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Based on our small 20-case sample, the results are pretty similar. One notable difference, though, is that with this second approach we only translate the dictionary once, no matter whether we have 20 or 20 million news reports. Translation costs may thus be considerably lower.
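To get a rough feel for this difference, here is a minimal sketch comparing the number of characters sent to the API (which is what Google bills) under both approaches, reusing the objects created above:
# Characters needed to translate the sampled corpus vs. the full Bing Liu word lists.
chars.corpus <- sum(nchar(crisis.sample$documents$texts))
chars.dictionary <- sum(nchar(c(positive.words.bl, negative.words.bl)))
chars.corpus
chars.dictionary
Keep in mind that the corpus figure refers to only 20 sampled texts, while the dictionary figure is a one-off cost for all future analyses.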
The third option, using a dictionary originally created in the corpus’ language, is the most straightforward way. We just need such a dictionary and repeat what we are already well used to. For the dictionary, we will use the German SentiWS dictionary, which here is available as an RData file.
load("dictionaries/sentiWS.RData")
sentiment.dict.de <- dictionary(list(positive = positive.woerter.senti, negative = negative.woerter.senti))
And then it’s business as usual:
dfm.sentiment.germandict <-
crisis.sample %>%
dfm(dictionary = sentiment.dict.de) %>%
dfm_weight(scheme = "prop")
dfm.sentiment.germandict
## Document-feature matrix of: 20 documents, 2 features (2.5% sparse).
## 20 x 2 sparse Matrix of class "dfm"
## features
## docs positive negative
## NZZ11/AUG.04114 0.8000000 0.2000000
## BUN12/SEP.00304 0.2500000 0.7500000
## BUN08/DEZ.00225 0.3333333 0.6666667
## NLZ08/DEZ.01218 0.2727273 0.7272727
## BAZ08/MAI.01310 0.3333333 0.6666667
## BEZ08/OKT.05282 0 1.0000000
## NZZ09/SEP.01035 0.3333333 0.6666667
## BAZ08/DEZ.00837.1 0.2500000 0.7500000
## NZZ08/OKT.04847 0.5714286 0.4285714
## NZZ12/MAR.00484 0.1000000 0.9000000
## NZZ08/JUN.02490.1 0.3333333 0.6666667
## NZZ12/JAN.03253 0.5555556 0.4444444
## NLZ10/FEB.03632 0.6250000 0.3750000
## NZZ09/AUG.03062 0.1666667 0.8333333
## NZZ11/AUG.03923 0.6000000 0.4000000
## NZZ09/SEP.03423 0.5714286 0.4285714
## BAZ09/NOV.04495 0.2500000 0.7500000
## BEZ08/NOV.04372.2 0.4444444 0.5555556
## NZZ08/JUL.04514 0.3333333 0.6666667
## NZZ09/NOV.03203 0.3333333 0.6666667
This way seems the most natural one, as native speakers of German have coded the words as positive or negative. Double meanings that exist in one language but not the other (for example, the word “home” is on neither Bing Liu’s positive nor negative list; translated to German, though, it might become either “zuhause,” an equally neutral term, or the rather emotion-laden term “heimat”) are only captured adequately in this approach.
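As a quick, illustrative check of the “home” example, we can look up the relevant words in the word lists loaded above; whether “Heimat” actually appears in SentiWS depends on the dictionary version, so treat this as a sketch rather than a definitive test.
# "home" should be on neither of the English Bing Liu lists.
"home" %in% positive.words.bl
"home" %in% negative.words.bl
# Check whether the emotion-laden German "Heimat" is part of SentiWS
# (both spellings, since entries may be lower- or uppercase).
any(c("Heimat", "heimat") %in% c(positive.woerter.senti, negative.woerter.senti))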
But let’s compare all three approaches with each other.
dfm.sentiment.translatedcorpus %>%
convert('data.frame') %>%
gather(positive, negative, key = "Polarity", value = "Sentiment") %>%
mutate(Type = 'translated corpus, original Bing Liu') %>%
bind_rows(
dfm.sentiment.translateddict %>%
convert('data.frame') %>%
gather(positive, negative, key = 'Polarity', value = 'Sentiment') %>%
mutate(Type = 'original corpus, translated Bing Liu')
) %>%
bind_rows(
dfm.sentiment.germandict %>%
convert('data.frame') %>%
gather(positive, negative, key = 'Polarity', value = 'Sentiment') %>%
mutate(Type = 'original corpus, original SentiWS')
) %>%
ggplot(aes(document, Sentiment, fill = Polarity)) +
geom_bar(stat="identity") +
scale_fill_brewer(palette = "Set1") +
ggtitle("Sentiment scores in news reports on the financial crisis in Swiss newspapers") +
xlab("") +
ylab("Sentiment share (%)") +
facet_grid(rows = vars(Type)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
We can conclude that, in essence, all three approaches yield pretty similar results. The general trend of the articles is not far off in any of them. However, specific pieces are indeed coded differently, although we need to remind ourselves that we are looking at only 20 texts. Based on this small sample, though, all three approaches seem viable.
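As a small numeric complement to the visual comparison, a sketch like the following correlates the positive-sentiment shares per document across the three approaches; it relies on the objects computed above and on the fact that all three document-feature matrices keep the documents in the same order.
# Positive sentiment share per document under each of the three approaches.
comparison <- data.frame(
  translated.corpus = convert(dfm.sentiment.translatedcorpus, 'data.frame')$positive,
  translated.dict = convert(dfm.sentiment.translateddict, 'data.frame')$positive,
  german.dict = convert(dfm.sentiment.germandict, 'data.frame')$positive
)
cor(comparison)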
The following characteristics of multiple-language corpora are worth bearing in mind: