This project was going to be way shorter and simpler, but it got out of hand. I challenged myself to create an app able to handle text files with millions of rows and several gigabytes in size. What a cocktail. As a bonus, I created a way to generate tweets in different styles. A real test, but also a fun one.
Introduction
Everything started with the final capstone of the Coursera "Data Science Specialization", which I finished a few weeks ago. They set the objectives and published the data; we had to do the rest.
The main goal was to create a function to predict the next word in a sentence. The source was a series of English texts in a 550 MB file. Although that doesn't look like a big file, it was plain text, and the result after processing was a file of about 3.5 GB and 200 million rows. That's something my laptop wasn't happy about.
The sensible way to handle this volume is to work with a percentage of the file and measure how much accuracy you lose. That's how I started, but since I wanted a challenge, I opted to use the whole file. It was hard: 8 GB of RAM aren't enough, and the system crashed way too often.
My take was to save partial files to the hard drive instead of keeping them in memory. The process was slow and painful, but at least I would be able to handle the entire corpus of text.
In this post, I will explain the main steps I took to finish the assignment.
Note: You can go straight to the fun part and try the app here.
Main idea
I started by thinking about how I would predict the next word of a sentence myself. I'd need the context of the sentence. Having the context, I'd look for a similar context in a table and choose the most probable next word.
The main idea was to 1) process the text, removing numbers, symbols, profanities, and URLs; 2) tokenize the text into words; 3) group consecutive words into n-grams; and 4) suggest the last word of the n-gram as the next word for the sentence, using the rest as the context.
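To make the idea concrete, here is a toy sketch (mine, not part of the final code) that builds 3-grams from a single sentence by hand; the suggestion for the context "want to" is simply the last word of the matching 3-gram.
# Toy sketch: build 3-grams from one sentence and read off the suggestion.
toySentence <- "i want to eat pizza"
toyWords <- strsplit (toySentence, " ")[[1]]
toyTrigrams <- sapply (1:(length (toyWords) - 2),
                       function (i) paste (toyWords [i:(i + 2)], collapse = " "))
toyTrigrams
# "i want to"  "want to eat"  "to eat pizza"
# The context "want to" matches "want to eat", so the suggested next word is "eat".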
Hands on.
First, I downloaded a list of English profanities from a public repository (NSFW).
library (readr)   # read_lines()

if (!file.exists ("profanity.txt")) {
  download.file ("https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en",
                 "profanity.txt")
}
profanityWords <- read_lines ("profanity.txt")
Processing
The function below cleans the text and tokenizes it into n-grams. I want the grams to make sense, so the longer the better, but that has trade-offs: the average frequency decreases, because long combinations of words are hard to find more than once in the text. I also wanted a suggestion every time, without forcing the user to type more words than needed. I made an executive decision: n-grams of 2 to 6 words.
library (magrittr)   # %>% pipe
library (tm)         # removeNumbers(), removePunctuation()
library (quanteda)   # tokens(), tokens_remove(), tokens_tolower(), tokens_ngrams()

ngMin <- 2
ngMax <- 6

processTextToToken <- function (text, pred = F) {
  # Because the user's input must be processed identically, I added a flag so
  # the function returns plain words (unigrams) in case of a prediction request.
  if (pred) {
    ng <- 1
  } else {
    ng <- ngMin:ngMax
  }
  # Using the [tm] package I clean the text before tokenizing it.
  corpus <- text %>%
    as.character () %>%
    removeNumbers () %>%
    removePunctuation (preserve_intra_word_contractions = T)
  corpusToken <- tokens (corpus, what = "fastestword")
  # Remove words present in the profanity file and lowercase every letter.
  corpusToken <- corpusToken %>%
    tokens_remove (profanityWords) %>%
    tokens_tolower ()
  # Create the n-grams.
  corpusToken <- tokens_ngrams (corpusToken, n = ng, concatenator = " ")
  # For a prediction request, collapse the cleaned words back into one string.
  if (pred) corpusToken <- paste (corpusToken, collapse = " ")
  return (corpusToken)
}
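As a quick sanity check, this is roughly how I would call it from the console (the output in the comments is approximate, since the exact n-grams depend on quanteda's tokenization):
# Hypothetical call on a single sentence; numbers and punctuation are dropped.
processTextToToken ("I'd like 2 pizzas tonight!")
# Tokens consisting of 2- to 6-grams, e.g. "i'd like", "like pizzas", "i'd like pizzas", ...
processTextToToken ("I'd like 2 pizzas tonight!", pred = T)
# A single cleaned string: "i'd like pizzas tonight"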
The result of the previous function is a tokens object that feeds the function below, which calculates the frequency of each n-gram across the text. Because the object was too big to fit in RAM, I divided the token table from the previous function into 50 chunks, computed the frequencies per chunk, and "rbinded" them into one big object.
library (data.table)          # data.table(), rbindlist(), fwrite(), fread()
library (quanteda.textstats)  # textstat_frequency() (part of quanteda itself in older versions)

tokenFreq <- function (processedTokens, pred = F, pos) {
  if (pred) {
    # Prediction path: just clean the query and return it as one string.
    x <- processTextToToken (text = processedTokens, pred = T)
    return (x)
  # I included this if-else section because the function can also be used for shorter texts.
  } else {
    if (length (processedTokens) > 1e5) {
      len <- 50
    } else {
      len <- 10
    }
    segment <- seq.int (from = 1, to = length (processedTokens), length.out = len)
    count <- 0
    preX <- data.table ()
    for (i in 1:(len - 1)) {
      count <- count + 1
      if (count %% 10 == 0) print (paste0 (count, "/", len))
      # Frequency table of one chunk, dropping the columns I don't need.
      corpus_short <- processedTokens [segment[i]:segment[i+1]]
      x <- dfm (corpus_short)
      x <- as.data.table (textstat_frequency (x))
      x <- x [, c ("rank", "docfreq", "group") := NULL]
      preX <- rbindlist (list (preX, x))
    }
    # Add up the counts of n-grams that appeared in more than one chunk.
    preX <- preX [, .(frequency = sum (frequency)), by = feature]
    # If the text is really big, I had to sacrifice the n-grams with low frequency.
    if (length (processedTokens) > 1e5) {
      preX <- preX [frequency > 1]
    }
    return (preX)
  }
}
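Both paths can be tried interactively. A hedged sketch of the calls (exampleTexts stands for any character vector of documents; the exact counts depend on the corpus):
# Prediction path: pred = T only cleans the query and returns it as one string.
tokenFreq ("How are YOU doing, today?", pred = T)
# Frequency path: pred = F expects the tokens object built by processTextToToken().
exampleTokens <- processTextToToken (exampleTexts)
exampleFreq <- tokenFreq (exampleTokens, pred = F)   # data.table with feature / frequency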
Because of the limitations of my laptop, I decided to remove the n-grams that show up only once. However, there is a risk of removing combinations of words that could be useful. If I ever run the script on a more powerful machine, I'll get rid of that filter.
Chop chop chop
Although the previous two functions are ready to digest some big texts, my laptop is not ready to withstand that load. I had to figure out how to recycle objects to save memory. I came up with this ugly loop:
parts <- 50
seq4Corpora <- seq.int (from = 1, to = nrow (Corpora), length.out = parts + 1)
for (i in 1:parts) {
  print (paste0 ("*****", i, "/", parts, "*****"))
  # Skip chunks that were already processed in a previous run.
  if (!file.exists (paste0 ("./tok", i, ".csv"))) {
    Corporai <- Corpora [seq4Corpora[i]:seq4Corpora[i+1]]
    print ("Creating and processing tokens")
    Corporai <- processTextToToken (Corporai$text)
    print ("Calculating token frequency")
    toki <- tokenFreq (Corporai, pred = F)
    fwrite (toki, paste0 ("./tok", i, ".csv"))
  }
}
I split the text again into 50 chunks, processed them independently, and saved the results into individual files on the hard drive.
The upside of this method is that I was able to analyze 100% of the dataset. The downside is that some n-grams were present in more than one chunk, adding weight to the final file.
I couldn't load the files one by one and merge them in memory because my laptop would explode, but Linux came to the rescue: from R I can run commands in the terminal and "rbind" the files there, and then load the single result back into R.
Once loaded, I sum the frequencies of each n-gram (the feature column), which were scattered across the file. I also split each n-gram into two parts, preWord and lastWord, using the preAndLastWord function: the former is the "context" I mentioned before, and the latter is the suggestion the function will give. The same function saves the smaller table to a csv file, compresses it, and returns it so I can retrieve it anytime.
preAndLastWord <- function (term) {
  # words() splits the n-gram string into its individual words.
  wor <- words (term)
  n <- length (wor)
  list (paste (wor [1:(n - 1)], collapse = " "), wor [n])
}

fuseNload <- function () {
  if (!file.exists ("./ngrams.csv")) {
    # Concatenate the chunk files in the shell, keeping only the first header.
    system ("head -1 ./tok1.csv > ./ngramFullTable.csv;
            for filename in $(ls ./tok*.csv);
            do sed 1d $filename >> ./ngramFullTable.csv;
            rm $filename;
            done")
    CorporaLoaded <- fread ("./ngramFullTable.csv")
    # Add up the frequencies of n-grams that were scattered across chunks.
    CorporaLoaded <- CorporaLoaded [, .(frequency = sum (frequency)), by = feature]
    # Split each n-gram into its context (preWord) and its suggestion (lastWord).
    CorporaLoaded [, c ("preWord", "lastWord") := preAndLastWord (feature),
                   by = feature][, feature := NULL]
    fwrite (CorporaLoaded, "./ngrams.csv")
    zip (zipfile = "./ngrams.zip", files = "./ngrams.csv")
  }
  return (fread ("./ngrams.csv"))
}
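After the chunk files exist on disk, a single call gives me the final table (assuming the tok*.csv files sit in the working directory):
# First run: concatenates the chunks, aggregates, and writes ./ngrams.csv and ./ngrams.zip.
# Later runs: just reads ./ngrams.csv back.
ngrams <- fuseNload ()
head (ngrams)   # columns: frequency, preWord, lastWord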
Prediction
The following function started as a very simple one: look up the input in the first column (preWord) and show the value of the second column (lastWord) with the highest value in the third (frequency). Easy peasy. Not quite so.
Because this is the function that deals with the user's input, it needs to 1) be ready for any input and 2) give a direct result every time.
predictionModel <- function (corpora, query,
                             ngMinPred = ngMin, ngMaxPred = ngMax, full = F) {
  # Process the input (`query`) the same way the corpus was processed.
  prepQuery <- tokenFreq (query, pred = T)
  # Loop from the largest n-gram to the smallest.
  fullResult <- data.table ()
  for (i in ngMaxPred:ngMinPred) {
    lastQuery <- tail (strsplit (prepQuery, split = " ")[[1]], i - 1)
    toFind <- paste (lastQuery, collapse = " ")
    result <- corpora [preWord == toFind][order (-frequency)]
    fullResult <- rbind (fullResult, result, fill = T)
  }
  # In the unlikely case the last word is not found, remove it and repeat the process.
  if (nrow (fullResult) == 0) {
    prepQuery <- tokenFreq (preAndLastWord (prepQuery)[[1]], pred = T)
    fullResult <- data.table ()
    for (i in ngMaxPred:ngMinPred) {
      lastQuery <- tail (strsplit (prepQuery, split = " ")[[1]], i - 1)
      toFind <- paste (lastQuery, collapse = " ")
      result <- corpora [preWord == toFind][order (-frequency)]
      fullResult <- rbind (fullResult, result, fill = T)
    }
  }
  # For plotting purposes, record the order (n) of the n-gram each row came from.
  fullResult [, n := (length (words (preWord)) + 1), by = .(preWord, lastWord)]
  # Sum the frequency of each `lastWord` in the table and order the result.
  fullResult <- fullResult [, .(frequency = sum (frequency)),
                            by = .(preWord, lastWord, n)][order (-frequency)][order (-n)]
  fullResult [, prob := (frequency * 100) / sum (fullResult$frequency)]
  # For plotting, I want the 10 most common words.
  # For tweeting, I want to add some randomness: not only the first word, but
  # any word can be chosen according to its frequency.
  if (full) {
    return (fullResult [!duplicated (lastWord)])
  } else {
    return (fullResult [!duplicated (lastWord)][1:10])
  }
}
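A hypothetical call, with ngrams being the table returned by fuseNload():
predictionModel (corpora = ngrams, query = "I would like to")
# The 10 best candidate rows: preWord, lastWord, n, frequency and prob, ordered by n and frequency.
predictionModel (corpora = ngrams, query = "I would like to", full = T)
# Every unique candidate; this is the version the tweeting function uses.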
Tweeting
After solving the main goal, I thought, "What if I feed the result back?" I have to say it was a fun afternoon creating sentences out of one word. Some of them made sense; some of them made none at all. Along the way, I noticed some bugs, which helped me improve the whole script and make it more resilient.
tweeting <- function (ngram, beginning, random = F) {
  sen <- c ("")
  # Keep appending predicted words until the sentence reaches tweet length.
  while (sum (nchar (sen), na.rm = T) < 140) {
    if (paste (sen, collapse = " ") == c ("")) {
      # First iteration: predict from the user's starting words.
      nex <- predictionModelLoaded (corpora = ngram,
                                    query = beginning,
                                    ngMinPred = ngMin, ngMaxPred = ngMax,
                                    full = T)
      if (random) {
        nex <- sample (nex$lastWord, size = 1, prob = nex$prob)
      } else {
        topPerc <- floor (nrow (nex) * 0.05)
        nex <- sample (x = nex$lastWord [1:topPerc], size = 1)
      }
      # If nothing is found, fall back to a common English word.
      if (is.na (nex)) nex <- sample (size = 1, x = c ("of", "the", "in", "to", "for"))
      sen <- c (beginning, nex)
    } else {
      # Later iterations: feed the sentence built so far back into the model.
      nex <- predictionModelLoaded (corpora = ngram,
                                    query = paste (sen, collapse = " "),
                                    ngMinPred = ngMin, ngMaxPred = ngMax,
                                    full = T)
      if (random) {
        nex <- sample (nex$lastWord, size = 1, prob = nex$prob)
      } else {
        topPerc <- floor (nrow (nex) * 0.05)
        nex <- sample (x = nex$lastWord [1:topPerc], size = 1)
      }
      if (is.na (nex)) nex <- sample (size = 1, x = c ("of", "the", "in", "to", "for"))
      sen <- c (sen, paste (nex))
    }
  }
  return (paste (sen, collapse = " "))
}
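Here, predictionModelLoaded is just the predictionModel shown above working against the already loaded table. A hypothetical call looks like this:
# Build a ~140-character sentence starting from "I love".
tweeting (ngram = ngrams, beginning = "I love")              # picks from the top 5% of candidates
tweeting (ngram = ngrams, beginning = "I love", random = T)  # samples any candidate, weighted by prob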
The regular behavior of the function is to internally build a table with the rows whose preWord column matches the processed input. From there, it rearranges the rows in descending order by n-gram length and probability, takes the top 5% of the rows, and randomly selects one word. The results get strange if I let the function choose any word according to its probability.
App
All this effort would be fruitless if it stayed enclosed in a folder, forgotten.
I created a Shiny app that any user can access to play around, test the functions, and create some tweets.
As a side project, I used Donald Trump's infamous tweets, available as a json file from the "Trump Twitter Archive", as the data source. He has written more than 30k tweets and, although that is prolific, it doesn't constitute a big corpus to work with. However, with some modifications here and there, I made the functions work with it. I can say that some of "his" tweets are hilarious.
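The pipeline is the same; only the source changes. A minimal sketch, assuming the archive has been exported to a local trump_tweets.json file whose tweet text lives in a "text" field (both the file name and the field name are assumptions; the real export may differ):
library (jsonlite)
# Read the exported tweets and run them through the same functions as before.
trumpTweets <- fromJSON ("trump_tweets.json")
trumpTokens <- processTextToToken (trumpTweets$text)
trumpNgrams <- tokenFreq (trumpTokens, pred = F)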
You can access the final app here.