The R language is widely used among data scientists, statisticians, researchers and students.
It is one of the leading tools for statistics, data analysis and machine learning. It is open-source, runs on all major platforms, and has a large, vibrant community of users.
The Comprehensive R Archive Network (CRAN) is the main repository for R packages.
This brings us to the package discussed in this post: dplyr. The CRAN documentation for dplyr can be found here.
In this post, I will demonstrate the five core operations of the package. First, we need to install the package and load the library.
> install.packages("dplyr")
> library(dplyr)
We then need a dataset on which to run these operations. CRAN makes the download logs of its packages publicly available here: CRAN package download logs. Let us download the file for July 8, 2014 (a log from any other date would work just as well) into RStudio’s working directory.
Once the file is in R’s working directory, run the line below (where the variable path2csv stores the location of the CSV file):
> mydf <- read.csv(path2csv, stringsAsFactors = FALSE)
We then convert the data frame to a tbl_df, which prints more readably, and save it in a variable called cran. Typing cran at the prompt prints its contents. (In dplyr 1.0.0 and later, tbl_df() is deprecated in favour of as_tibble().)
> cran <- tbl_df(mydf)
> cran
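If you do not have the log file at hand, the read-and-convert steps above can be tried on a small stand-in. This is only a sketch: the rows are made up, the size column and the toy_ names are hypothetical, and I use as_tibble() on the assumption that a recent dplyr is installed.

```r
library(dplyr)

# Hypothetical stand-in for the CRAN log: a few made-up rows using the
# column names that appear later in this post (package, r_os, country).
toy_df <- data.frame(
  package = c("swirl", "dplyr", "swirl"),
  r_os    = c("linux-gnu", "mingw32", "darwin13.1.0"),
  country = c("IN", "US", "US"),
  size    = c(1200L, 3400L, 1200L)
)

# Write the rows to a temporary CSV so the read step mirrors the one above.
toy_path <- tempfile(fileext = ".csv")
write.csv(toy_df, toy_path, row.names = FALSE)

# Read the file back without converting strings to factors.
toy_mydf <- read.csv(toy_path, stringsAsFactors = FALSE)

# Convert to a tibble (tbl_df(toy_mydf) in older dplyr versions) and print it.
toy_cran <- as_tibble(toy_mydf)
toy_cran
```

Printing a tibble shows the dimensions and column types along with the rows, which is the readability gain mentioned above.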
The dplyr philosophy is to have small functions that each do one thing well. Five functions cover most of the fundamental data manipulation tasks: select(), filter(), arrange(), mutate() and summarize().
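The first of these, select(), keeps only the columns you name. Here is a minimal sketch on a made-up tibble; the rows, the size column and the toy_cran name are hypothetical and stand in for the real log.

```r
library(dplyr)

# Hypothetical rows standing in for the CRAN download log.
toy_cran <- tibble(
  package = c("swirl", "dplyr", "swirl"),
  r_os    = c("linux-gnu", "mingw32", "darwin13.1.0"),
  country = c("IN", "US", "US"),
  size    = c(1200L, 3400L, 1200L)
)

# Keep only the named columns, in the order given.
pkg_country <- select(toy_cran, package, country)
pkg_country

# Drop a column by prefixing its name with a minus sign.
no_size <- select(toy_cran, -size)
no_size
```

Note that the columns are referred to by bare name, without quotes or the toy_cran$ prefix.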
Now that we know how to select columns, the next logical step is to select rows. That is where the filter() function comes in.
It works like the WHERE clause in SQL. Here is an example:
> filter(cran, package == "swirl")
Looking at the ‘package’ column, we see that the resulting data frame contains only the rows where the package is ‘swirl’.
Multiple conditions can be passed to filter(), separated by commas; a row is kept only if it satisfies all of them. For example, to fetch all downloads of swirl made from Linux machines in India:
> filter(cran, package == "swirl", r_os == "linux-gnu", country == "IN")
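To see how the conditions combine, here is a small sketch on made-up rows (the data and the toy_cran name are hypothetical). It compares comma-separated conditions with an explicit &, and shows | for cases where either condition is enough.

```r
library(dplyr)

# Hypothetical rows standing in for the CRAN download log.
toy_cran <- tibble(
  package = c("swirl", "dplyr", "swirl", "ggplot2"),
  r_os    = c("linux-gnu", "mingw32", "darwin13.1.0", "linux-gnu"),
  country = c("IN", "US", "US", "DE")
)

# Comma-separated conditions must all hold (logical AND).
both_comma <- filter(toy_cran, package == "swirl", country == "IN")

# The same filter written with an explicit `&`.
both_amp <- filter(toy_cran, package == "swirl" & country == "IN")

# Use `|` when either condition is enough (logical OR).
either <- filter(toy_cran, country == "IN" | country == "US")

both_comma
either
```

The first two calls return the same rows, so choosing between the comma and & is a matter of readability.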