WordCloud in R – Mythological twist

A WordCloud in R


Let Noble thoughts come to us from every side

 – Rigveda, I-89-i


Have you ever wondered what it would be like to do a textual analysis of some ancient texts? Would it not be nice to ‘mine’ insights from Valmiki’s Ramayana? Or Veda Vyasa’s Mahabharata? The Ramayana arguably happened about 9,300 years ago, in the Treta Yuga. The wiki page for the Ramayana is here.

The original Ramayana consists of seven sections called kandas, these have varying numbers of chapters as follows: Bala-kanda—77 chapters, Ayodhya-kanda—119 chapters, Aranya-kanda—75 chapters, Kishkindha-kanda—67 chapters, Sundara-kanda—68 chapters, Yuddha-kanda—128 chapters, and Uttara-kanda—111 chapters.

So, there are 24,000 verses in total. Well, I don’t really have a pdf of the ‘original’ version, so I thought I could use C. Rajagopalachari’s English retelling of the epic. This particular book is quite popular and has sold over a million copies. It is a page-turner and runs to around 300 pages.

[Image: cover page of the book]

How about analyzing the text in this book?

Wouldn’t it be EPIC?!

That is exactly what I want to embark on in this blog. Text mining helps derive valuable insights into the mind of the writer. It can also be leveraged to gain intangible insights like sentiment, relevance, mood, relations, emotion, summarization and so on.

The first part of this series is to run a descriptive analysis on the text and generate a word cloud. Tag clouds or word clouds add simplicity and clarity: the most frequently used words are displayed with sizes weighted by their counts, so the higher a word’s count, the bigger it appears. After all, isn’t it more visually engaging than looking at a table?

Firstly, we would need to install the relevant packages in R and load them –

[Screenshot: installing and loading the required packages]
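Since the screenshot is not reproduced here, here is a minimal sketch of what that setup likely looks like (the exact package list is my assumption, based on the functions used later in this post; SnowballC is needed for the stemming step):

> install.packages(c("tm", "wordcloud", "RColorBrewer", "SnowballC"))  # run once
> library(tm)            # Corpus, readPDF, tm_map, TermDocumentMatrix
> library(wordcloud)     # wordcloud()
> library(RColorBrewer)  # brewer.pal() colour palettes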

The second step would be to read the pdf (which is currently in my working directory).

I first validate that the pdf is indeed in my working directory.

[Screenshot: listing the pdf files in the working directory]
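A minimal sketch of that check (the pattern simply matches any pdf in the folder; this also creates the files variable used when building the corpus below):

> getwd()                                # confirm the working directory
> files <- list.files(pattern = "pdf$")  # all pdf files in it
> files                                  # should list the Ramayana pdf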

The ‘tm’ package only provides a readPDF wrapper – the actual pdf engine needs to be downloaded separately. Let us use a pdf engine called xpdf. The link for setting up the pdf engine (and updating the system path) is here.

Great, now we can get rolling.

Let us create a pdf reader called ‘Rpdf’ using the code below; this instructs pdftotext.exe to maintain the original physical layout of the text.

>  Rpdf <- readPDF(control = list(text = "-layout"))

Now, we need to convert the pdf to text and store it in a corpus. Basically, we need to tell the function which resource to read – the files variable holding the pdf name. The second parameter is the reader that we created in the previous line.

>  ramayana <- Corpus(URISource(files), readerControl = list(reader = Rpdf))

Now, let us check what the variable ‘ramayana’ contains

[Screenshot: printing the ramayana corpus object]

If I look at the summary of the variable, it shows the following details.

[Screenshot: summary of the ramayana corpus]
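The commands behind those two screenshots are simply the following (the exact output will vary with your version of tm):

> ramayana            # prints the corpus object: a corpus holding one document
> summary(ramayana)   # a short summary of each document in the corpus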

The next step would be to do some transformations on the text. Let us use the tm_map() function to replace special characters in the text – we could use it to replace single quotes (‘) and full stops (.) with spaces.

[Screenshot: replacing special characters with spaces]
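One common way of doing this with tm_map() is a small custom content_transformer; here is a hedged sketch (the toSpace helper is my own naming, not necessarily what the original screenshot used):

> toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x, fixed = TRUE))
> ramayana <- tm_map(ramayana, toSpace, "'")  # single quotes
> ramayana <- tm_map(ramayana, toSpace, ".")  # full stops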

Also, don’t you think we need to remove all the stop words? Words like ‘will’, ‘shall’, ‘the’, ‘we’ etc. do not add much to a word cloud. These are called stopwords, and tm_map() together with removeWords lets us strip them out.

> ramayana <- tm_map(ramayana, removeWords, stopwords("english"))

Let us also convert all the text to lower case. (Strictly speaking, it is better to do this before removing stopwords, since the stopword list is in lower case; the custom list a little further down catches some of what slips through.)

> ramayana <- tm_map(ramayana, content_transformer(tolower))

I could also specify some stop-words that I would want to remove using the code:

> ramayana <- tm_map(ramayana, removeWords, c("the", "will", "like", "can", "and", "shall")) 

Let us also remove white spaces and remove the punctuation.

> ramayana <- tm_map(ramayana, removePunctuation)
> ramayana <- tm_map(ramayana, stripWhitespace)

Any other pre-processing that you can think of? How about removing suffixes and stripping the tense from words? Is ‘kill’ different from ‘killed’? Do they not originate from the same stem ‘kill’? Or ‘big’, ‘bigger’, ‘biggest’? Can’t we just have ‘big’ with a weight of 3 instead of three separate words? We use the stemDocument transformation for this.

> ramayana <- tm_map(ramayana, stemDocument)

The next step would be to create a term-document matrix – a table containing the frequency of each term across the documents. We use TermDocumentMatrix(), provided by the text mining package, to do this.

> dtm <- TermDocumentMatrix(ramayana)        # terms x documents, with counts
> m <- as.matrix(dtm)
> v <- sort(rowSums(m), decreasing = TRUE)   # total frequency of each term
> d <- data.frame(word = names(v), freq = v) # tidy table: word and its frequency

Now, let us look at a sample of the words and their frequencies. We pick the first 20.

[Screenshot: the 20 most frequent terms and their counts]
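That table comes from something along these lines (findFreqTerms() is an optional extra from tm for a quick look at high-frequency terms):

> head(d, 20)                        # the 20 most frequent terms
> findFreqTerms(dtm, lowfreq = 100)  # optional: every term appearing at least 100 times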

Not surprising, is it? ‘Rama’ is indeed the centre of the story.

Now, let us generate the word cloud

> wordcloud(words = d$word, freq = d$freq, min.freq = 3, max.words=100, random.order=FALSE, rot.per=0.60,  colors=brewer.pal(8, "Dark2"))

Voila! The word cloud of all the words of the Ramayana.

[Screenshot: the generated word cloud]

A view of the plot downloaded from R.

[Screenshot: the word cloud plot as exported from R]


If you like this, you could comment below. If you would like to connect with me, then be sure to find me on Twitter, Facebook, LinkedIn. The links are on the side navigation. Or you could drop an email to shivdeep.envy@gmail.com

Data Manipulation in R with dplyr

  • The R language is widely used among data scientists, statisticians, researchers and students.

    It is simply the leading tool for statistics, data analysis and machine learning. It is platform-independent, open-source, and has a large, vibrant community of users.

    The Comprehensive R Archive Network (CRAN) is the one-stop shop for all R packages.

    This brings us to the package to be discussed in this post – dplyr. The CRAN documentation for dplyr can be found here.

    For this post, I will be demonstrating the five core operations of the package. The first thing we need to do is install the package and load the library.

    > install.packages("dplyr")

    > library(dplyr)

    We then need to find a dataset on which we can run these operations. CRAN makes the download logs of its packages publicly available here – CRAN package download logs. Let us download the file for July 8, 2014 (we could really pick a log from any date) into RStudio’s working directory.

    Once the file has been copied into the working directory of R, execute the line below (where the variable path2csv stores the location of the csv).

    > mydf <- read.csv(path2csv, stringsAsFactors = FALSE)
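    For completeness, a hedged sketch of how path2csv might be set (the file name is my assumption, based on the chosen date and on how CRAN names its daily logs):

    > path2csv <- "2014-07-08.csv"   # hypothetical: the unzipped log for July 8, 2014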

     

    We then save the data frame into a variable called cran by converting it to a tbl_df, which improves readability when printing. Calling the variable cran prints out its contents.

    > cran <- tbl_df(mydf)
    > cran
    
    

    [Screenshot: the cran tbl_df printed in the console]

    The dplyr philosophy is to have small functions that do one thing well. There are basically 5 commands that cover most of the fundamental data manipulation tasks.

    • select()
    Usually, out of the entire data set that we use for analysis, we are really interested in only a few columns. This function is used to select / fetch the columns that are required. If I only need the columns ip_id, package and country, I execute the following statement –
    > select(cran, ip_id, package, country)

    [Screenshot: output of select(cran, ip_id, package, country)]

    It is important to note that the columns are returned in the order in which we specified them, irrespective of their order in the original data frame.
    We could also use the ‘-’ sign to omit the columns we do not need.
    > select(cran, -time)
    [Screenshot: output of select(cran, -time)]
    
    • filter()
    Now that we know how to select columns, the next logical thing would be to be able to select rows. That is where the filter() function comes in.
    This is like the ‘where’ clause in SQL. Let us understand this by an example –
    > filter(cran, package == "swirl")

    [Screenshot: output of filter(cran, package == "swirl")]

    If you look at the column ‘package’, we now see that the resulting data frame only has rows where the package is ‘swirl’.
    Multiple conditions can be passed to filter(), one after the other. For example, if I want to fetch all swirl downloads on Linux in India:
    > filter(cran, package == "swirl", r_os == "linux-gnu", country == "IN")

    [Screenshot: output of the three-condition filter]

    • arrange()
    This is used to order the rows of a dataset according to the values of a particular variable, in ascending or descending order. Here cran2 is simply a subset of cran created with select(); notice the ip_id column listed in descending order.
    > arrange(cran2, desc(ip_id))

    [Screenshot: output of arrange(cran2, desc(ip_id))]

    • mutate()

    This function is used to modify existing columns or add new ones to the data frame. Suppose I want to convert the size column, which is in bytes, to megabytes and store the values in a new column called size_mb (cran3, like cran2, is a subset of cran).

    > mutate(cran3, size_mb = size / 2^20)

    [Screenshot: output of mutate(cran3, size_mb = size / 2^20)]

    • summarize()

    This function is used to collapse the dataset into a single row; it is the go-to function for calculating something like the mean over a sanitized data frame.

    For example – I want to know the average download size from the size column.

    > summarize(cran, avg_bytes = mean(size))

    [Screenshot: output of summarize(cran, avg_bytes = mean(size))]

    summarize() can also be used to compute values for groups of records when combined with group_by(), as shown in the sketch below.
    Disclosure: The above examples are from the dplyr lesson in the swirl package.
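    A hedged sketch of that grouped usage (my own illustration, assuming the cran data frame from above; group_by(), summarize() and n() are all standard dplyr):

    > by_package <- group_by(cran, package)
    > summarize(by_package, count = n(), avg_bytes = mean(size))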

Phew, finally

My First Blog Post


It has been years since I bought this domain. I finally managed to get it hosted (on a shared hosting space from my friend). This has been on my bucket list for this year, and I am happy to have ticked it off before the year ends (BTW, I have two different bucket lists – one for the financial year ending (for my financial goals) and the other for the Julian Calendar). I just realized writing this that I put two sets of brackets in my previous statement. Well, I shall let that pass for now.

Coming to the purpose of this site (Indian English?). I intend to use this as a personal blog/ space. My digital presence. My online avatar. My springboard into the web. My Hangar. It would also be the one-stop-shop (no, no, don’t start thinking of e-comm, I do not intend to sell anything here) for all the information that I would really want to share with the world out there. To establish a digital presence, it looks like I need the following –

  • Content – Shall build it in a few days
  • Strategy – Interesting, strategy for a personal blog? This needs some thought
  • Design – I totally intend to use some customized templates; I would also use this to play around with some UI/ UX
  • Technology – well, you might see some of my escapades in web technology (read Dashboards, Shiny Apps, JavaScript frameworks, spa.js) here.

Well, that’s pretty much the time I had for this right now.

Cheers,

Shivdeep