So far, we have worked with neatly prepared datasets. In everyday academic work, however, we regularly have to deal with messier data as well. A common challenge, in particular, is working with texts in a variety of languages; think, for example, of news reports from different countries. Luckily, the commonly known providers of online dictionaries also offer APIs to translate texts.
For the purpose of this demonstration, we will be using Google Translate within its free limits. We do so not least because there is an R package that provides access to Google’s APIs, namely googleLanguageR. Hence, we’ll start by installing it along with the usual suspects, tidyverse and quanteda.
if(!require("quanteda")) {install.packages("quanteda"); library("quanteda")}
if(!require("tidyverse")) {install.packages("tidyverse"); library("tidyverse")}
if(!require("googleLanguageR")) {install.packages("googleLanguageR"); library("googleLanguageR")}
theme_set(theme_bw())
For all APIs generally, but for the Google API specifically, one needs to register and retrieve some form of API authentication. This can be a “token” (basically a secret text) or a pair of two secret strings forming an “authenticator.” Typically, during this registration process one also needs to provide credit-card information, even if working with the free volume only. At the Google Developer Console, enable the Translation API and create service-account credentials (whereby you need to enter a project name and grant access to the Translation API), which provides you with a downloadable key file in JSON format. Download it and put it into your R working directory (sorry, we cannot provide you with one here).
Google claims, by the way, to warn you and actually ask for your permission before really charging you. For those control maniacs out there (like me), though, you can also regularly check the request monitor for your current quota usage.
After having the credential file available at a specific file location (for ease of use, we’ll just assume that this file path is stored in the variable credential.json), we now need to tell googleLanguageR to use it for authentication against Google’s API.
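As a minimal sketch of this step, you could store the path like so; the file name used here is only a hypothetical placeholder, yours will differ.
# Path to the service-account key file downloaded from the Google Developer Console.
# The file name below is a hypothetical placeholder.
credential.json <- "my-translation-project-key.json"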
gl_auth(credential.json)
If this passes without errors, we are good to go. That is, we can use the API to detect the language of a text. We’ll demonstrate that with the first paragraph of the first Sherlock Holmes novel.
load("data/sherlock/sherlock.absaetze.RData")
text <- sherlock.absaetze[[1, 'content']]
gl_translate_detect(text)
## 2019-09-11 20:08:21 -- Detecting language: 4294 characters
The result is by no means surprising: it’s English. In addition, we are presented with a confidence indicator and an estimate of whether the result actually “isReliable,” both of which are deprecated, though, and will be removed in future versions of the API. Just focus on the “language” column.
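If you only need the language code itself, you can store the result and extract just that column, as in this small sketch:
# Store the detection result and keep only the detected language code.
detection <- gl_translate_detect(text)
detection$language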
Of course, the main function is to actually translate a text into one of the many supported languages. Aside from the text to be translated, you need to specify the target language. It is also recommended to specify whether your text contains HTML or is simple plain text, although some preprocessing to remove unnecessary markup is advisable to reduce the number of characters to be translated (and thus charged for). If you know the text’s current language, you can also specify that; if omitted, Google will automatically try to detect it (see above), which might add to your quota usage.
gl_translate(text, format = 'text', source = 'en', target = 'de')
## 2019-09-11 20:08:21 -- Translating text: 4294 characters -
As you can (or probably cannot) see, the translation is decent but certainly not perfect. Quality varies from language to language, and the aforementioned alternative APIs produce differing results as well, but hey, it’s automated large-scale text translation into a whole multitude of languages. In fact, for Google Translate the package also provides us with an up-to-date list of what’s possible.
gl_translate_languages()
A final word on mass translation: you can easily use the API to translate a lot of texts at once (we will do so in a second), but the API also has limits per call. While you now know this and can keep it in mind for other APIs, the googleLanguageR package automatically splits large portions of text into separate API calls. This should affect neither the translation nor your quota usage, but it may cause large-scale translations to take considerably more time. Then again, you are already used to long-running computational tasks anyhow …
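If you prefer to control the batching yourself, a sketch like the following could be used; it is not run here, many.texts is a hypothetical character vector, and the chunk size of 50 texts (as well as the ten-second pause) is an arbitrary assumption rather than an official limit.
#chunks <- split(many.texts, ceiling(seq_along(many.texts) / 50)) # hypothetical vector, arbitrary chunk size
#translations.chunked <- purrr::map_dfr(chunks, function(chunk) {
#  result <- gl_translate(chunk, format = 'text', source = 'de', target = 'en')
#  Sys.sleep(10) # short pause between calls to spread out quota usage
#  result
#})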
Let’s assume we want to apply a dictionary, be it a sentiment or any topic-specific dictionary, to a corpus, but the dictionary’s language and the corpus’ language do not match. Specifically, for this example we will build on the already well-known English Bing Liu dictionary and a German news corpus of roughly 21,000 articles on the financial crisis (retrieved with the search query “Finanzkrise”), published between 2007 and 2012 in various Swiss daily newspapers.
We will first load both elements and also draw a sample from the corpus in order not to overstrain the API limits.
positive.words.bl <- scan("dictionaries/bingliu/positive-words.txt", what = "char", sep = "\n", skip = 35, quiet = T)
negative.words.bl <- scan("dictionaries/bingliu/negative-words.txt", what = "char", sep = "\n", skip = 35, quiet = T)
sentiment.dict.en <- dictionary(list(positive = positive.words.bl, negative = negative.words.bl))
load("data/finanzkrise.korpus.RData")
crisis.sample <- corpus_sample(korpus.finanzkrise, 20)
crisis.sample
## Corpus consisting of 20 documents and 6 docvars.
Essentially, we now have three options to sentiment-analyze our data: (1) translate the corpus into English and apply the original English dictionary, (2) translate the dictionary into German and apply it to the original corpus, or (3) use a dictionary that already exists in German.
We start with the first option, translating the corpus. For this we need to push our current sample to Google’s API and store the result. Since some of the texts include a little HTML, we specify the format accordingly.
translations <- gl_translate(crisis.sample$documents$texts, format = 'html', source = 'de', target = 'en')
## 2019-09-11 20:08:22 -- Translating html: 26096 characters -
translations
crisis.sample.translated <- crisis.sample
crisis.sample.translated$documents$texts <- translations$translatedText
The last line of the code above writes the translated texts back into the (copied) corpus, to which we can now simply apply the Bing Liu dictionary.
dfm.sentiment.translatedcorpus <-
crisis.sample.translated %>%
dfm(dictionary = sentiment.dict.en) %>%
dfm_weight(scheme = "prop")
dfm.sentiment.translatedcorpus
## Document-feature matrix of: 20 documents, 2 features (0.0% sparse).
## 20 x 2 sparse Matrix of class "dfm"
## features
## docs positive negative
## NZZ11/AUG.04114 0.87500000 0.1250000
## BUN12/SEP.00304 0.27272727 0.7272727
## BUN08/DEZ.00225 0.55555556 0.4444444
## NLZ08/DEZ.01218 0.61904762 0.3809524
## BAZ08/MAI.01310 0.33333333 0.6666667
## BEZ08/OKT.05282 0.16666667 0.8333333
## NZZ09/SEP.01035 0.50000000 0.5000000
## BAZ08/DEZ.00837.1 0.11111111 0.8888889
## NZZ08/OKT.04847 0.50000000 0.5000000
## NZZ12/MAR.00484 0.31578947 0.6842105
## NZZ08/JUN.02490.1 0.37500000 0.6250000
## NZZ12/JAN.03253 0.57142857 0.4285714
## NLZ10/FEB.03632 0.60000000 0.4000000
## NZZ09/AUG.03062 0.14285714 0.8571429
## NZZ11/AUG.03923 0.18181818 0.8181818
## NZZ09/SEP.03423 0.50000000 0.5000000
## BAZ09/NOV.04495 0.25000000 0.7500000
## BEZ08/NOV.04372.2 0.64285714 0.3571429
## NZZ08/JUL.04514 0.08333333 0.9166667
## NZZ09/NOV.03203 0.35714286 0.6428571
We can see some sentiment detected in all texts, so our procedure worked in essence. But how good is this procedure? Well, answering that requires comparative research …
Instead of translating the texts, we could also translate the Bing Liu dictionary. For that, we need to translate both lists of words (i.e., the positive and the negative ones) from their current language, English, into German. This takes a while, mainly because googleLanguageR splits our almost 7,000 words into a lot of separate API calls. Hence, we do not run these commands here but instead load the already translated lists.
#positive.translations <- gl_translate(positive.words.bl, format = 'text', source = 'en', target = 'de')
#negative.translations <- gl_translate(negative.words.bl, format = 'text', source = 'en', target = 'de')
load('data/bl_translated.RData')
sentiment.dict.en.de <- dictionary(list(positive = positive.translations$translatedText, negative = negative.translations$translatedText))
Once this is done, we can apply this newly translated dictionary to our original texts.
dfm.sentiment.translateddict <-
crisis.sample %>%
dfm(dictionary = sentiment.dict.en.de) %>%
dfm_weight(scheme = "prop")
dfm.sentiment.translateddict
## Document-feature matrix of: 20 documents, 2 features (10.0% sparse).
## 20 x 2 sparse Matrix of class "dfm"
## features
## docs positive negative
## NZZ11/AUG.04114 1.0000000 0
## BUN12/SEP.00304 0.4545455 0.5454545
## BUN08/DEZ.00225 1.0000000 0
## NLZ08/DEZ.01218 0.5714286 0.4285714
## BAZ08/MAI.01310 0 1.0000000
## BEZ08/OKT.05282 0 1.0000000
## NZZ09/SEP.01035 0.6666667 0.3333333
## BAZ08/DEZ.00837.1 0.3333333 0.6666667
## NZZ08/OKT.04847 0.5000000 0.5000000
## NZZ12/MAR.00484 0.4166667 0.5833333
## NZZ08/JUN.02490.1 0.3000000 0.7000000
## NZZ12/JAN.03253 0.6666667 0.3333333
## NLZ10/FEB.03632 0.7000000 0.3000000
## NZZ09/AUG.03062 0.3333333 0.6666667
## NZZ11/AUG.03923 0.4000000 0.6000000
## NZZ09/SEP.03423 0.7272727 0.2727273
## BAZ09/NOV.04495 0.2500000 0.7500000
## BEZ08/NOV.04372.2 0.8000000 0.2000000
## NZZ08/JUL.04514 0.3333333 0.6666667
## NZZ09/NOV.03203 0.5000000 0.5000000
Is it different from our result before? Let’s compare visually.
dfm.sentiment.translatedcorpus %>%
convert('data.frame') %>%
gather(positive, negative, key = "Polarity", value = "Sentiment") %>%
mutate(Type = 'translated corpus and Bing Liu') %>%
bind_rows(
dfm.sentiment.translateddict %>%
convert('data.frame') %>%
gather(positive, negative, key = 'Polarity', value = 'Sentiment') %>%
mutate(Type = 'translated Bing Liu')
) %>%
ggplot(aes(document, Sentiment, fill = Polarity)) +
geom_bar(stat="identity") +
scale_fill_brewer(palette = "Set1") +
ggtitle("Sentiment scores in news reports on the financial crisis in Swiss newspapers") +
xlab("") +
ylab("Sentiment share (%)") +
facet_grid(rows = vars(Type)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Based on our small 20-case sample, the results are pretty similar. One notable difference, though, is that with this second approach we only translate the dictionary once, no matter whether we have 20 or 20 million news reports. Translation costs may thus be considerably lower.
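To get a rough feel for this difference, here is a minimal sketch comparing the number of characters sent to the API (which is what Google bills) under both approaches, reusing the objects created above:
# Characters needed to translate the sampled corpus vs. the full Bing Liu word lists.
chars.corpus <- sum(nchar(crisis.sample$documents$texts))
chars.dictionary <- sum(nchar(c(positive.words.bl, negative.words.bl)))
chars.corpus
chars.dictionary
Keep in mind that the corpus figure refers to only 20 sampled texts, while the dictionary figure is a one-off cost for all future analyses.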
The third option, using a dictionary originally created in the corpus’ language, is the most straightforward way. We just need such a dictionary and repeat what we are already well used to. For the dictionary, we will use the German SentiWS dictionary, which here is available as an RData file.
load("dictionaries/sentiWS.RData")
sentiment.dict.de <- dictionary(list(positive = positive.woerter.senti, negative = negative.woerter.senti))
And then it’s business as usual:
dfm.sentiment.germandict <-
crisis.sample %>%
dfm(dictionary = sentiment.dict.de) %>%
dfm_weight(scheme = "prop")
dfm.sentiment.germandict
## Document-feature matrix of: 20 documents, 2 features (2.5% sparse).
## 20 x 2 sparse Matrix of class "dfm"
## features
## docs positive negative
## NZZ11/AUG.04114 0.8000000 0.2000000
## BUN12/SEP.00304 0.2500000 0.7500000
## BUN08/DEZ.00225 0.3333333 0.6666667
## NLZ08/DEZ.01218 0.2727273 0.7272727
## BAZ08/MAI.01310 0.3333333 0.6666667
## BEZ08/OKT.05282 0 1.0000000
## NZZ09/SEP.01035 0.3333333 0.6666667
## BAZ08/DEZ.00837.1 0.2500000 0.7500000
## NZZ08/OKT.04847 0.5714286 0.4285714
## NZZ12/MAR.00484 0.1000000 0.9000000
## NZZ08/JUN.02490.1 0.3333333 0.6666667
## NZZ12/JAN.03253 0.5555556 0.4444444
## NLZ10/FEB.03632 0.6250000 0.3750000
## NZZ09/AUG.03062 0.1666667 0.8333333
## NZZ11/AUG.03923 0.6000000 0.4000000
## NZZ09/SEP.03423 0.5714286 0.4285714
## BAZ09/NOV.04495 0.2500000 0.7500000
## BEZ08/NOV.04372.2 0.4444444 0.5555556
## NZZ08/JUL.04514 0.3333333 0.6666667
## NZZ09/NOV.03203 0.3333333 0.6666667
This way seems the most natural one, as native speakers of German have coded the words as positive or negative. Double meanings that exist in one language but not the other (for example, the word “home” is on neither Bing Liu’s positive nor negative list; translated to German, though, it might become either “zuhause,” an equally neutral term, or the rather emotion-laden term “heimat”) are only captured adequately in this approach.
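As a quick, illustrative check of the “home” example, we can look up the relevant words in the word lists loaded above; whether “Heimat” actually appears in SentiWS depends on the dictionary version, so treat this as a sketch rather than a definitive test.
# "home" should be on neither of the English Bing Liu lists.
"home" %in% positive.words.bl
"home" %in% negative.words.bl
# Check whether the emotion-laden German "Heimat" is part of SentiWS
# (both spellings, since entries may be lower- or uppercase).
any(c("Heimat", "heimat") %in% c(positive.woerter.senti, negative.woerter.senti))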
But let’s compare all three approaches with each other.
dfm.sentiment.translatedcorpus %>%
convert('data.frame') %>%
gather(positive, negative, key = "Polarity", value = "Sentiment") %>%
mutate(Type = 'translated corpus, original Bing Liu') %>%
bind_rows(
dfm.sentiment.translateddict %>%
convert('data.frame') %>%
gather(positive, negative, key = 'Polarity', value = 'Sentiment') %>%
mutate(Type = 'original corpus, translated Bing Liu')
) %>%
bind_rows(
dfm.sentiment.germandict %>%
convert('data.frame') %>%
gather(positive, negative, key = 'Polarity', value = 'Sentiment') %>%
mutate(Type = 'original corpus, original SentiWS')
) %>%
ggplot(aes(document, Sentiment, fill = Polarity)) +
geom_bar(stat="identity") +
scale_fill_brewer(palette = "Set1") +
ggtitle("Sentiment scores in news reports on the financial crisis in Swiss newspapers") +
xlab("") +
ylab("Sentiment share (%)") +
facet_grid(rows = vars(Type)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
We can conclude that, in essence, all three approaches yield pretty similar results. The general trend of the articles is not far off in any of them. However, specific pieces are indeed coded differently, although we need to remind ourselves that we are looking at only 20 texts. Based on this small sample, though, all three approaches seem viable.
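As a small numeric complement to the visual comparison, a sketch like the following correlates the positive-sentiment shares per document across the three approaches; it relies on the objects computed above and on the fact that all three document-feature matrices keep the documents in the same order.
# Positive sentiment share per document under each of the three approaches.
comparison <- data.frame(
  translated.corpus = convert(dfm.sentiment.translatedcorpus, 'data.frame')$positive,
  translated.dict = convert(dfm.sentiment.translateddict, 'data.frame')$positive,
  german.dict = convert(dfm.sentiment.germandict, 'data.frame')$positive
)
cor(comparison)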
The following characteristics of multiple-language corpora are worth bearing in mind: