Introduction
One of the great strengths of the R ecosystem is the tidyverse, a
collection of R packages that are easy for beginners to learn and
provide a consistent environment for data manipulation and
visualisation. These tools have proved so valuable that many of them
have been ported to Python, which is why we wrote this introduction to
the tidyverse for Python.
What is tidyverse?
Tidyverse is an opinionated collection of
R packages designed for data science. All packages share an underlying
design philosophy, grammar, and data structures. The core R tidyverse
packages are: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr and
forcats.
Python implementation of dplyr
The tidyverse package dplyr is a grammar
of data manipulation, providing a consistent set of verbs that help you
solve the most common data manipulation challenges. Here are some of the
functions dplyr provides that are commonly used:
- mutate() – adds new variables that are functions of existing
variables.
- select() – picks variables based on their names.
- filter() – picks cases based on their values.
- summarise() – reduces multiple values down to a single summary.
- arrange() – changes the ordering of the rows.
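To make the verbs concrete, here is a minimal sketch of what each one does, written in plain Python on a tiny table of row dictionaries. The table and its values are made up for illustration, and this is not how dplyr or any of its Python ports is implemented.

```python
# Toy table: each row is a dict (hypothetical data, not the diamonds data-set).
rows = [
    {"name": "a", "carat": 0.5, "price": 1000},
    {"name": "b", "carat": 1.0, "price": 4000},
    {"name": "c", "carat": 2.0, "price": 7000},
]

# mutate(): add a new variable computed from existing variables.
mutated = [{**r, "price_per_carat": r["price"] / r["carat"]} for r in rows]

# select(): keep only the named variables.
selected = [{k: r[k] for k in ("name", "price")} for r in rows]

# filter(): keep only the cases whose values meet a condition.
filtered = [r for r in rows if r["price"] > 2000]

# summarise(): reduce many values down to a single summary.
mean_price = sum(r["price"] for r in rows) / len(rows)

# arrange(): change the ordering of the rows (here: most expensive first).
arranged = sorted(rows, key=lambda r: r["price"], reverse=True)

print(mutated[0])
print(selected[0])
print([r["name"] for r in filtered])
print(mean_price)
print([r["name"] for r in arranged])
```

Each verb takes a table and returns a new one (or a single value for summarise()), which is what makes them easy to chain.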
Dplython is a Python
implementation of dplyr which can be installed using pip and the
following command:
pip install dplython
Instructions on how to use pip to install Python packages can be found
here.
The Dplython README provides
some clear examples of how the package can be used. Below is a summary
of the common functions:
- select() – used to get specific columns of the data-frame.
- sift() – used to filter out rows based on the value of a variable in
that row.
- sample_n() and sample_frac() – used to provide a random sample of
rows from the data-frame.
- arrange() – used to sort results.
- mutate() – used to create new columns based on existing columns.
For more functions and example code visit the Dplython
README page.
At the bottom of the README there is a comparison with
pandas-ply, another
Python implementation of dplyr.
Dplython comes with a sample data-set called ‘diamonds’. Here are some
basic examples of how to use Dplython.
Import Python packages and the ‘diamonds’ data-frame:
import pandas
from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
sample_frac, head, arrange, mutate, group_by, summarize, DelayFunction)
Create a new data-frame by selecting columns of the ‘diamonds’
data-frame:
diamondsSmall = diamonds >> select(X.carat, X.cut, X.price, X.color, X.clarity, X.depth, X.table)
Display the top 4 rows of the ‘diamondsSmall’ data-frame:
print(diamondsSmall >> head(4))
## carat cut price color clarity depth table
## 0 0.23 Ideal 326 E SI2 61.5 55.0
## 1 0.21 Premium 326 E SI1 59.8 61.0
## 2 0.23 Good 327 E VS1 56.9 65.0
## 3 0.29 Premium 334 I VS2 62.4 58.0
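If the >> syntax looks like magic, it may help to see how such a pipe can be built with Python's operator overloading. The sketch below is a simplified illustration only, not Dplython's actual code: the Table class is hypothetical, and its select() and head() take plain strings and integers rather than Dplython's X expressions.

```python
class Table:
    """Minimal stand-in for a data-frame: a list of row dicts."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __rshift__(self, verb):
        # `table >> verb` calls the verb with the table and returns the
        # result, so verbs can be chained from left to right.
        return verb(self)


def select(*columns):
    """Return a verb that keeps only the named columns."""
    def verb(table):
        return Table({c: row[c] for c in columns} for row in table.rows)
    return verb


def head(n):
    """Return a verb that keeps only the first n rows."""
    def verb(table):
        return Table(table.rows[:n])
    return verb


table = Table([
    {"carat": 0.23, "cut": "Ideal", "price": 326},
    {"carat": 0.21, "cut": "Premium", "price": 326},
    {"carat": 0.23, "cut": "Good", "price": 327},
])
result = table >> select("carat", "price") >> head(2)
print(result.rows)
```

Because every verb returns a new Table, the output of one step is always a valid input for the next.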
Filter the data-frame for rows where the price is higher than 18,000 and
the carat is less than 1.2, then sort them by depth:
print((diamondsSmall >> sift(X.price > 18000, X.carat < 1.2) >> arrange(X.depth)))
## carat cut price color clarity depth table
## 27455 1.14 Very Good 18112 D IF 59.1 58.0
## 27457 1.07 Very Good 18114 D IF 60.9 58.0
## 27530 1.07 Premium 18279 D IF 60.9 58.0
## 27635 1.04 Very Good 18542 D IF 61.3 56.0
## 27507 1.09 Very Good 18231 D IF 61.7 58.0
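An expression such as X.price > 18000 cannot be evaluated straight away, because X is not tied to any data yet. One way to get this behaviour is to have the comparison return a function that is applied to each row later. The sketch below shows that idea under our own assumptions; the Column and Placeholder classes are hypothetical, and this stand-alone sift() takes the rows as its first argument, unlike Dplython's.

```python
class Column:
    """Placeholder for a column; comparisons build a predicate, not a bool."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, value):
        return lambda row: row[self.name] > value

    def __lt__(self, value):
        return lambda row: row[self.name] < value


class Placeholder:
    def __getattr__(self, name):
        # X.price -> Column("price")
        return Column(name)


X = Placeholder()


def sift(rows, *predicates):
    """Keep the rows for which every predicate is true."""
    return [row for row in rows if all(p(row) for p in predicates)]


rows = [
    {"carat": 1.14, "price": 18112},
    {"carat": 0.23, "price": 326},
    {"carat": 2.00, "price": 18500},  # hypothetical row for illustration
]
kept = sift(rows, X.price > 18000, X.carat < 1.2)
print(kept)
```

Only the first row passes both conditions: the second is too cheap and the third is too heavy.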
Provide a random sample of 5 rows from the data-frame:
print(diamondsSmall >> sample_n(5))
## carat cut price color clarity depth table
## 320 0.71 Good 2801 F VS2 57.8 60.0
## 9813 0.91 Premium 4670 H VS1 61.8 54.0
## 11795 1.18 Very Good 5088 E SI2 62.5 60.0
## 11845 0.95 Very Good 5101 D SI1 63.7 55.0
## 11552 1.17 Ideal 5032 F SI1 63.0 54.0
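A random sample will differ from run to run, so your rows will not match the ones shown above. The sketch below uses only Python's standard library to show the difference between sampling a fixed number of rows and a fraction of them, and how a seed makes a sample repeatable. These functions are stand-ins that share names with Dplython's verbs; the seed parameter is our own assumption and we are not claiming Dplython accepts one.

```python
import random


def sample_n(rows, n, seed=None):
    """Return n rows chosen at random, without replacement."""
    return random.Random(seed).sample(rows, n)


def sample_frac(rows, frac, seed=None):
    """Return a random fraction of the rows, e.g. frac=0.1 for 10%."""
    return sample_n(rows, round(len(rows) * frac), seed)


rows = list(range(100))  # stand-in for 100 data-frame rows

first = sample_n(rows, 5, seed=42)
again = sample_n(rows, 5, seed=42)
print(first == again)  # the same seed gives the same sample

tenth = sample_frac(rows, 0.1, seed=42)
print(len(tenth))
```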
Add a column to the data-frame containing the rounded value of ‘carat’:
print((diamondsSmall >> mutate(carat_bin=X.carat.round()) >> sample_n(5)))
## carat cut price color clarity depth table carat_bin
## 11883 0.99 Very Good 5112 F SI1 62.5 58.0 1.0
## 45123 0.77 Fair 1651 D SI2 65.1 63.0 1.0
## 51630 0.31 Premium 544 E SI1 59.2 60.0 0.0
## 49382 0.51 Very Good 2102 G IF 62.6 56.0 1.0
## 18296 1.54 Very Good 7437 I SI2 63.3 60.0 2.0
Python implementation of ggplot2
The tidyverse package ggplot2 is a
system for declaratively creating graphics, based on The Grammar of
Graphics. You provide the data, tell ggplot2 how to map variables to
aesthetics, what graphical primitives to use, and it takes care of the
details.
A Python port of ggplot2 has long been requested and there are now a few
Python implementations of it; Plotnine
is the one we will explore here. Plotting with a grammar is powerful: it
makes custom (and otherwise complex) plots easy to think about and
create, while simple plots remain simple.
Plotnine can be installed using pip:
pip install plotnine
Plotnine splits plotting into three distinct parts which are data,
aesthetics and layers. The data step adds the data to the graph, the
aesthetics (aes) step adds visual attributes and the layers step creates
the objects on a plot. Multiple aesthetics and layers functions can be
added to a Plotnine graph.
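The way data, aesthetics and layers are combined with + can be modelled in a few lines of plain Python. The sketch below is a toy model of the idea only; the Plot, Aes and Layer classes are hypothetical and are not Plotnine's classes or implementation.

```python
class Plot:
    """Toy plot specification: data plus accumulated aesthetics and layers."""

    def __init__(self, data, aesthetics=None, layers=()):
        self.data = data
        self.aesthetics = dict(aesthetics or {})
        self.layers = tuple(layers)

    def __add__(self, component):
        # Each component returns a new Plot, leaving the original unchanged.
        return component.add_to(self)


class Aes:
    """Maps visual attributes (x, y, color, ...) to column names."""

    def __init__(self, **mapping):
        self.mapping = mapping

    def add_to(self, plot):
        return Plot(plot.data, {**plot.aesthetics, **self.mapping}, plot.layers)


class Layer:
    """Names the kind of object to draw, e.g. points or a smoothed line."""

    def __init__(self, kind):
        self.kind = kind

    def add_to(self, plot):
        return Plot(plot.data, plot.aesthetics, plot.layers + (self.kind,))


spec = (Plot(data=[{"carat": 0.23, "price": 326, "cut": "Ideal"}])
        + Aes(x="carat", y="price")
        + Aes(color="cut")
        + Layer("point")
        + Layer("smooth"))
print(spec.aesthetics)
print(spec.layers)
```

Nothing is drawn until the finished specification is rendered, which is why components can be added one at a time.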
If you are a Python user accustomed to Matplotlib, a Grammar of Graphics
plotting tool can take some getting used to, partly because of the
difference in philosophy. Plotnine provides some tutorials to help you
get to grips with the package, and there is also the Plotnine README.
However, if you are new to Grammar of Graphics plotting, this highly
recommended Kaggle notebook for Plotnine is probably the best place to
start.
Here are some examples of how to use plotnine to visualize data from the
‘diamonds’ data-frame that comes with Dplython.
Import Python packages, the ‘diamonds’ data-frame and create a sample
data-frame:
import warnings
warnings.filterwarnings("ignore") # hide Python warnings
import pandas
import dplython
from plotnine import *
diamondsSample = dplython.diamonds >> dplython.sample_n(5000)
Create a scatter plot of ‘carat’ vs ‘price’:
print(ggplot(diamondsSample) # diamondsSample is the data
+ aes('carat', 'price') # plot 'carat' vs 'price'
+ geom_point() # display the results as a scatter plot
)
## <ggplot: (41012744)>
Add additional layers e.g. a line of best fit:
print(ggplot(diamondsSample)
+ aes('carat', 'price')
+ stat_smooth() # add a line of best fit
+ geom_point())
## <ggplot: (-9223372036813567705)>
Add another aesthetic; here the data is coloured by the ‘cut’ variable:
print(ggplot(diamondsSample)
+ aes('carat', 'price')
+ aes(color='cut') # colour the data by the variable cut and create a legend
+ geom_point())
## <ggplot: (-9223372036816020904)>
Add a facet which separates the data into one graph per value of ‘color’:
print(ggplot(diamondsSample)
+ aes('carat', 'price')
+ aes(color='cut')
+ facet_wrap('color') # separate the data by 'color' and graph separately
+ geom_point())
## <ggplot: (64014519)>
This article compares a variety of alternative
plotting packages for Python.
Next steps
- Read the documents that are linked in this blog post.
- Learn the basics of Pandas.
- Use Dplython and Plotnine to practice data manipulation and
visualization. For example, complete some of the exercises at
Kaggle.
Do you know of other good Python implementations of tidyverse? If so,
let us know about them!