We’re done with the basics of handling data in R. Now we want to know
how to make sense of it. We know what kind of data it is, we know how to
look at column names, dimensions and the like. If you’re trying to add
value to this data however, that very often isn’t enough, so here’s a
look at using the tools available to you to start figuring out how to do
what you want.
R packages
An R package is a bundle of functions and/or datasets. It extends the
capabilities that the “base” and “recommended” R packages have. By using
packages we can do data manipulation in a variety of ways, produce all
sorts of awesome charts, generate books like this, use other languages
like Python and JavaScript, and of course, do all sorts of data
analysis.
Installing packages
Once you’ve identified a package that contains functions or data you’re
interested in using, we need to get the package onto our machine.
To get the package, you can use an R function or you can use the Install
button on the Packages tab.
install.packages("datasauRus")
If you need to install a number of packages, install.packages()
takes
a vector of package names.
install.packages(c("datasauRus","tidyverse"))
Updating packages involves re-running install.packages()
and it’s
usually easier to trigger this by using the Update button on the
Packages tab and selecting all the packages you want to update.
Installing from GitHub and other sources
The install.packages()
function works with CRAN, CRAN mirrors, and
CRAN-like repositories
If you want to install BioConductor packages, there are some helper
scripts available from the BioConductor website,
bioconductor.org.
Other package sources, such as GitHub, will involve building packages
before they can be installed. If you’re on Windows, this means you need
an additional piece of software called
Rtools. The other handy
thing you’ll need is the package devtools
(available from CRAN).
devtools
provides a number of functions designed to make it easier to
install from GitHub, BitBucket, and other sources.
library(devtools)
install_github("lockedata/pRojects")
Recommended packages
Here are my recommended packages – look out for books and blogposts on
these in the future!
tidyverse
The tidyverse
is a suite of packages designed to make your life
easier. It’s well worth installing and many of the packages in this
recommendations section are part of the tidyverse
.
install.packages("tidyverse")
Getting data in and out of R
The following packages can be used to get data into, and out of R:
- Working with databases, you can use the
DBI
package and it’s
companionodbc
to connect to most databases - To get data from web pages, you can use
rvest
- To work with APIs, you use
httr
- To work with CSVs, you can use
readr
ordata.table
.[6] - To work with SPSS, SAS, and Stata files, use
readr
andhaven
Data manipulation
The tidyverse
contains great packages for data manipulation including
dplyr
and purrr
.
Additionally, a favourite data manipulation package of mine is
data.table
. data.table
tends to have a bit of a steeper learning
curve than the tidyverse
but it’s phenomenal for brevity and
performance.
Data visualisation
- For static graphics
ggplot2
is fantastic – it adds a sensible
vocabulary to help you construct charts with ease plotly
helps you build interactive charts from scratch or make
ggplot2
charts interactiveleaflet
is a great maps packageggraph
helps you build effective network diagrams
Data science
caret
is an interface package to many model algorithms and has a
raft of insanely useful features itselfbroom
takes outputs from model functions and makes them into nice
data.framesmodelr
helps build samples and supplement result setsreticulate
is a package for talking to Python and, therefore,
enables you to work with any deep learning framework that is based
in Python.tensorflow
is a package based onreticulate
and
allows you to work with tensorflow in Rsparklyr
allows you to run and work with Spark processes on your R
datah2o
is a package for working with H2O, a super nifty machine
learning platform
Presenting results
rmarkdown
is the core package for combining text and code and
being able to produce outputs like HTML pages, PDFs, and Word
documentsbookdown
facilitates books like thisrevealjs
allows you to make slide decks usingrmarkdown
flexdashboard
andshiny
allow you to make interactive, reactive
dashboards and other analytical apps
Finding packages
As well as using online search facilities like
CRAN and
rdrr.io for packages, there are some handy packages
that help you find other packages!
ctv
allows you to get all the packages in a given CRAN task
view, which are maintained
lists of package for various taskssos
allows you to search for packages and functions that match a
keyword
Loading packages
To make functions and data from a package available to use, we need to
run the library()
function.
library("utils")
The library()
function accepts a vector of length 1, so you need to
perform multiple calls to the function to load up multiple packages.
library("utils")
library("stats")
Once a package is loaded, you can then use any of it’s functions.
You can find what functions are available in a package by looking at
it’s help page.
Alternatively, you can type the package’s name and hit Tab. This
auto-completes the package’s name, adds two colons (::
) and then shows
the list of available functions for that package. The double colon trick
is very helpful for when you want to browse package functionality, e.g.
utils::find()
.
Any function in R can be prefixed with it’s package name and the double
colon (::
) – this is great for telling people where
functions are coming from and for tracking dependencies in long scripts.
It is also really useful when you have two packages loaded that might
have a function of the same name. This is because the order the packages
are loaded in dictates which one gets overridden.
Learning how to use a package
R documentation is some of the best out there.
Yes, I will complain about the impenetrable statistical jargon some
package authors use, but the CRAN gatekeepers require that packages
generally have a really high standard of documentation.
Every function you use will have a help page associated with it. This
page usually contains a description, shows what parameters the function
has, what those parameters are, and most importantly, there’s usually
examples.
To navigate to the help page of an individual function in an R package
you:
- Hit F1 on a function name in a script
- Type
??fnName
and send to the console
??mean
- Search in the Help tab
- Use the
help()
function to open up the packages index page and
navigate to the relevant function
help(package="utils")
- Find the relevant package in the Packages tab and click on it.
Scroll through the index that opens up on the Help page to find the
right function
As well as the function level documentation, good packages also provide
a higher level of documentation that covers workflows using the
packages, how to extend package functionality, or outlines any
methodologies or research that led to the package.
These pieces of documentation are called vignettes. They are
accessible on the package’s index page or you can use the function
vignette()
to read them.
vignette("multi")
Using functions in packages
In previous sections we’ve seen R functions that are used on objects
to perform some activity. Functions seen so far include:
class()
andis.*()
functions for checking datatypesas.*
for converting to datatypeslength()
andnames()
for metadatahead()
andtail()
for getting a small amount of elements from an
objectncol()
,nrow()
,colnames()
, andrownames()
for getting
data.frame metadataSys.Date()
andSys.time()
for getting current date-time values
There are a huge range of functions out there, whether available in R
straight away, or from adding extra functionality.
Understanding how functions work and being able to use them correctly
will help you learn, and use R effectively.
Using a function
A function does some computation on an object. The use of a function
consists of:
- A function’s name
- Parentheses
- 0 or more inputs
Each input is provided to an argument or parameter within a
function.
These arguments have names, although you don’t often need to provide the
names.
You can find out what arguments a function takes by using the code
completion and it’s help snippet, or by searching for the function in
the Rstudio Help tab.
When you’re inside the brackets of a function you can get the list of
available arguments and auto-complete them.
Examining functions
One of the niftiest things about R is being able to see the code for a
function. You can examine how many functions work by just typing their
name without any parentheses.
You can find out what arguments a function takes by using the code
completion and it’s help snippet, or by searching for the function in
the Rstudio Help tab.
When you’re inside the brackets of a function you can get the list of
available arguments and auto-complete them.
Examining functions
One of the niftiest things about R is being able to see the code for a
function. You can examine how many functions work by just typing their
name without any parentheses.
Sys.Date
## function ()
## as.Date(as.POSIXlt(Sys.time()))
## <bytecode: 0x10df91748>
## <environment: namespace:base>
The first line(s) show how the arguments are specified. Subsequent lines
show the code and the final lines starting with <
can be mostly
ignored.
Function input patterns
Functions tend to conform to certain patterns of inputs.
No inputs
Some functions don’t require the user to provide info and so they don’t
have any arguments. Sys.Date()
and similar functions do not need user
input because the functions provide information about the system.
Looking at the function definition above, we can see that there are no
arguments specified in the first line.
Single inputs
Other functions only have a single allowed input. length()
returns the
length of an object so it only allows you to provide it with an object.
length
## function (x) .Primitive("length")
We can see in this definition that the function takes the argument x
.
Many inputs
Some functions have multiple inputs, although not all of them are
necessarily mandatory. head()
and tail()
have been used so far
with only a single input but they take an optional argument as to how
many elements should be returned.
head(letters)
## [1] "a" "b" "c" "d" "e" "f"
head(letters, 2)
## [1] "a" "b"
The rnorm()
function allows us to generate a vector of values from a
normal distribution. We can tell it how many values we need (n
), and
we can optionally provide the mean (mean
) and standard deviation
(sd
) to describe the Normal curve that values should be selected from.
rnorm
## function (n, mean = 0, sd = 1)
## .Call(C_rnorm, n, mean, sd)
## <bytecode: 0x10ded12a8>
## <environment: namespace:stats>
Looking at how rnorm
is specified we can see that we’re expected to
provide n
, but mean
and sd
are given values of 0 and 1
respectively by default.
rnorm(n=5)
## [1] -1.4818734 -1.0309718 1.4056332 0.4328255 -0.6992250
rnorm(n = 5, mean = 10, sd = 2)
## [1] 6.759442 8.453144 8.035506 12.549855 11.781241
Unlimited inputs
Other functions can take an unlimited amount of input values. Functions
like sum()
will sum the values from a number of objects.
sum
## function (..., na.rm = FALSE) .Primitive("sum")
The ellipsis (...
) is used to denote when the user can provide any
number of values.
sum(1:3, 1:9, pi)
## [1] 54.14159
Naming arguments
Every input provided to a function is associated with an argument.
Each argument must have a name. Even functions that allow unlimited
inputs assign these inputs to a name. Behind the scenes, they get put
into a list object and the list gets called ...
(or ellipsis).
There are some typical names for arguments that take your data object.
These include:
x
data
.data
df
You don’t usually have to provide the argument names, just put things in
the relevant places in the function. Sometimes though, you will need
to use argument names.
Here are my rules of thumb for knowing when you need to name names:
- You’re using the arguments in an order that is different from the
function author’s intended order (you might be skipping some
arguments as the default values are fine or you might just prefer a
different order) - The arguments you want to specify show up after the
...
in a
function’s argument list - You want to give a specific name to a value in a
...
argument
We can provide names for clarity or so we can use arguments out of order
if we prefer to.
rnorm(n = 5, mean = 10, sd = 2)
## [1] 10.775754 10.104590 7.055771 14.544010 9.135413
rnorm(mean = 10, sd = 2, n = 5)
## [1] 9.625624 8.345576 8.794756 8.914196 10.417960
A common behaviour change that you’ll need to work with is how missing
(NA
) values get handled. Functions that allow you change this
behaviour, usually have an argument called things like na.rm
,
na.omit
, and na.action
.
sum(1:5, NA)
## [1] NA
sum(1:5, NA, na.rm = TRUE)
## [1] 15
In the sum()
example, I used the na.rm
argument’s name. This is
because otherwise the TRUE
would be considered part of the values
being passed for summing. Without the name, the value gets considered as
part of the ...
.
sum(1:5, NA, TRUE)
## [1] NA
A function will sometimes have ...
at the end of it’s list of
arguments when it utilises other functions and those have optional /
default values.
For instance the predict()
function allows us to take a model we’ve
built and apply it to some new data.
It works for many different types of model and these different models
expect different types of inputs. Some models expect data.frames, others
expect time series data, etc.
There’s lots of potential variations, the only thing that is mandatory
is the model object.
predict
## function (object, ...)
## UseMethod("predict")
## <bytecode: 0x103e21638>
## <environment: namespace:stats>
The predict()
function then determines what type of model object
you’ve provided it and passes the model, and any other values you
provided, to the relevant function, returning back the results.
linearMod<-lm(Sepal.Length~., data=iris)
predict(linearMod, iris[1,])
## 1
## 5.004788
And so very quickly before I summarise all that for you, just a note to
say that that’s the very basics of R covered, but look out soon for a
couple of posts on making R work for you – R projects (A very good
habit) and a Github 101 coming up!
Summary
R uses functions as the means of performing operations.
Functions can take 0 or more arguments. All arguments may be mandatory,
but some can be optional or even undefined.
You can use argument names to provide arguments in different orders to
that defined by the function author or to provide them in the case where
an ellipsis (...
) is used in a function.
R packages bundle functionality and/or data.
You can install packages from the central public repository (CRAN) via
install.packages()
or install them from GitHub with the package
devtools
. R packages contain documentation that helps you understand
how functions work and how the package overall works.
When you want to make use of functionality from a package you can either
load all of a package’s functionality by using the library()
function
or refer to a specific function by prefixing the function with the
package name and two colons (::
) e.g. utils::help("mean")
.
There are many packages out there for different activities and
domain-specific types of analysis. Use online search facilities like
rdrr.io or CRAN task
views to find ones specific to
your requirements.
As usual, all the code from this installments video is included below in
one fell swoop.
Video code
install.packages("tidyverse")
library("tidyverse")
help(package="dplyr")
and then
?bind_rows
one <- mtcars[1:4, ]
two <- mtcars[11:14, ]
# You can supply data frames as arguments:
bind_rows(one, two) -> THREE
Sys.Date
length
length(THREE)
head
head(THREE, 5)
rnorm
rnorm(n=10)
rnorm(n=5, mean = 2, sd = 7)
#Have a read of the blog post here to find out why we sometimes use the argument names inside the brackets!
sum
sum(1:9, 7, pi)
predict
linearMod<-lm(Sepal.Length~., data=iris)
# logisticMod<-glm(Species~., data=iris, family=binomial)
predict(linearMod, iris[1,])
Happy coding 🙂 Ellen!