Python and Tidyverse

Introduction

One of the great things about the R world is the tidyverse, a collection
of R packages that are easy for beginners to learn and provide a
consistent environment for data manipulation and visualisation. These
tools have proved so valuable that many of them have been ported to
Python, so in this post we introduce the tidyverse for Python users.

What is tidyverse?

Tidyverse is an opinionated collection of
R packages designed for data science. All packages share an underlying
design philosophy, grammar, and data structures. The core R tidyverse
packages are: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr and
forcats.

Python implementation of dplyr

The tidyverse package dplyr is a grammar
of data manipulation, providing a consistent set of verbs that help you
solve the most common data manipulation challenges. Here are some of the
functions dplyr provides that are commonly used:

  • mutate() – adds new variables that are functions of existing
    variables
  • select() – picks variables based on their names.
  • filter() – picks cases based on their values.
  • summarise() – reduces multiple values down to a single summary.
  • arrange() – changes the ordering of the rows.
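For readers who already know some pandas, the verbs above map roughly
onto operations that pandas provides out of the box. The sketch below is
only an illustration: the tiny data-frame is made up for the example
(loosely modelled on the 'diamonds' columns used later in this post), and
the variable names are our own.

```python
import pandas as pd

# Hypothetical toy data, loosely modelled on the 'diamonds' columns.
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29, 1.10],
    "cut": ["Ideal", "Premium", "Premium", "Good"],
    "price": [326, 326, 334, 4500],
})

# mutate(): add a column computed from existing columns.
mutated = df.assign(price_per_carat=df["price"] / df["carat"])

# select(): keep columns by name.
selected = df[["carat", "price"]]

# filter(): keep rows whose values match a condition.
filtered = df[df["price"] > 330]

# summarise(): reduce many values to one summary value per group.
summary = df.groupby("cut")["price"].mean()

# arrange(): reorder the rows.
arranged = df.sort_values("carat", ascending=False)

print(summary)
```

Each step returns a new object rather than changing df in place, which
is the same habit the dplyr verbs encourage.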

Dplython is a Python implementation of dplyr. It can be installed with
pip using the following command:

pip install dplython

Instructions on how to use pip to install python packages can be found
here.

The Dplython README provides some clear examples of how the package can
be used. Below is a summary of the common functions:

  • select() – used to get specific columns of the data-frame.
  • sift() – used to filter out rows based on the value of a variable in
    that row.
  • sample_n() and sample_frac() – used to provide a random sample of
    rows from the data-frame.
  • arrange() – used to sort results.
  • mutate() – used to create new columns based on existing columns.

For more functions and example code visit the Dplython
README page.

The bottom of the README compares Dplython with pandas-ply, another
Python implementation of dplyr.

Dplython comes with a sample data-set called ‘diamonds’. Here are some
basic examples of how to use Dplython.

Import Python packages and the ‘diamonds’ data-frame:

import pandas
from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
    sample_frac, head, arrange, mutate, group_by, summarize, DelayFunction) 

Create a new data-frame by selecting columns of the ‘diamonds’
data-frame:

diamondsSmall = diamonds >> select(X.carat, X.cut, X.price, X.color, X.clarity, X.depth, X.table)

Display the top 4 rows of the ‘diamondsSmall’ data-frame:

print(diamondsSmall >> head(4)) 

##    carat      cut  price color clarity  depth  table
## 0   0.23    Ideal    326     E     SI2   61.5   55.0
## 1   0.21  Premium    326     E     SI1   59.8   61.0
## 2   0.23     Good    327     E     VS1   56.9   65.0
## 3   0.29  Premium    334     I     VS2   62.4   58.0
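The >> used in the examples above is ordinary Python operator
overloading rather than special syntax. The sketch below shows one way
such a pipe can be built, using plain lists of dictionaries instead of
data-frames. It is not Dplython's code: the Verb class and the toy head()
and sift() functions are hypothetical names chosen to mirror the verbs in
this post.

```python
class Verb:
    """Wraps a function so it can sit on the right-hand side of >>."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # Python evaluates `data >> verb` as verb.__rrshift__(data)
        # when `data` does not itself handle >> for this operand.
        return self.func(data)


def head(n):
    """Toy verb: keep the first n rows."""
    return Verb(lambda rows: rows[:n])


def sift(predicate):
    """Toy verb: keep the rows for which predicate(row) is true."""
    return Verb(lambda rows: [row for row in rows if predicate(row)])


# Made-up rows for illustration.
rows = [
    {"carat": 0.23, "price": 326},
    {"carat": 0.21, "price": 326},
    {"carat": 0.29, "price": 334},
]

# Evaluated left to right: filter first, then take the first row.
result = rows >> sift(lambda row: row["price"] < 330) >> head(1)
print(result)
```

Because each step hands its result to the next, a chain reads in the
order the operations happen, which is the main appeal of the pipe style.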

Filter the data-frame for rows where the price is above 18,000 and the
carat is below 1.2, then sort them by depth:

print((diamondsSmall >> sift(X.price > 18000, X.carat < 1.2) >> arrange(X.depth)))

##        carat        cut  price color clarity  depth  table
## 27455   1.14  Very Good  18112     D      IF   59.1   58.0
## 27457   1.07  Very Good  18114     D      IF   60.9   58.0
## 27530   1.07    Premium  18279     D      IF   60.9   58.0
## 27635   1.04  Very Good  18542     D      IF   61.3   56.0
## 27507   1.09  Very Good  18231     D      IF   61.7   58.0

Provide a random sample of 5 rows from the data-frame:

print(diamondsSmall >> sample_n(5))

##        carat        cut  price color clarity  depth  table
## 320     0.71       Good   2801     F     VS2   57.8   60.0
## 9813    0.91    Premium   4670     H     VS1   61.8   54.0
## 11795   1.18  Very Good   5088     E     SI2   62.5   60.0
## 11845   0.95  Very Good   5101     D     SI1   63.7   55.0
## 11552   1.17      Ideal   5032     F     SI1   63.0   54.0

Add a column to the data-frame containing the value of ‘carat’ rounded
to the nearest whole number:

print((diamondsSmall >> mutate(carat_bin=X.carat.round()) >>  sample_n(5)))

##        carat        cut  price color clarity  depth  table  carat_bin
## 11883   0.99  Very Good   5112     F     SI1   62.5   58.0        1.0
## 45123   0.77       Fair   1651     D     SI2   65.1   63.0        1.0
## 51630   0.31    Premium    544     E     SI1   59.2   60.0        0.0
## 49382   0.51  Very Good   2102     G      IF   62.6   56.0        1.0
## 18296   1.54  Very Good   7437     I     SI2   63.3   60.0        2.0

Python implementation of ggplot2

The tidyverse package ggplot2 is a
system for declaratively creating graphics, based on The Grammar of
Graphics. You provide the data, tell ggplot2 how to map variables to
aesthetics, what graphical primitives to use, and it takes care of the
details.

A Python port of ggplot2 has long been requested and there are now a few
Python implementations of it; Plotnine is the one we will explore here.
Plotting with a grammar is powerful: it makes custom (and otherwise
complex) plots easy to think about and create, while the plots remain
simple.

Plotnine can be installed using pip:

pip install plotnine

Plotnine splits plotting into three distinct parts: data, aesthetics and
layers. The data step adds the data to the graph, the aesthetics (aes)
step maps variables to visual attributes and the layers step creates the
objects on a plot. Multiple aesthetic and layer functions can be added
to a Plotnine graph.
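To make the data, aesthetics and layers split more concrete, here is a
toy sketch of how a plot description can be built up with +. It draws
nothing and is not how Plotnine is implemented: the Plot, Aes and Layer
classes are hypothetical stand-ins that only record what has been added.

```python
class Plot:
    """Toy plot specification: data plus accumulated aesthetics and layers."""

    def __init__(self, data, aesthetics=None, layers=None):
        self.data = data
        self.aesthetics = dict(aesthetics or {})
        self.layers = list(layers or [])

    def __add__(self, part):
        # Each + returns a new Plot and leaves the original unchanged.
        return part.add_to(self)


class Aes:
    """Maps visual attributes (x, y, color, ...) to column names."""

    def __init__(self, **mapping):
        self.mapping = mapping

    def add_to(self, plot):
        merged = {**plot.aesthetics, **self.mapping}
        return Plot(plot.data, merged, plot.layers)


class Layer:
    """Names one kind of object to put on the plot."""

    def __init__(self, name):
        self.name = name

    def add_to(self, plot):
        return Plot(plot.data, plot.aesthetics, plot.layers + [self.name])


# Made-up single row of data for illustration.
data = [{"carat": 0.23, "price": 326, "cut": "Ideal"}]

spec = (Plot(data)
        + Aes(x="carat", y="price")
        + Aes(color="cut")
        + Layer("point")
        + Layer("smooth"))

print(spec.aesthetics)
print(spec.layers)
```

The point of the sketch is that every + adds one small, independent
piece to the description, so extending a plot means adding another piece
rather than rewriting what is already there.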

If you are a Python user used to Matplotlib, a Grammar of Graphics
plotting tool can take some getting used to, partly because of the
difference in philosophy. Plotnine provides some tutorials to help you
get to grips with the package, and there is also the Plotnine README.
However, if you are new to Grammar of Graphics plotting, this highly
recommended Kaggle notebook for Plotnine is probably the best place to
start.

Here are some examples of how to use plotnine to visualize data from the
‘diamonds’ data-frame that comes with Dplython.

Import Python packages, the ‘diamonds’ data-frame and create a sample
data-frame:

import warnings; warnings.filterwarnings("ignore") # hide Python warnings
import pandas
import dplython
from plotnine import *
diamondsSample = dplython.diamonds >> dplython.sample_n(5000) # random sample of 5,000 rows

Create a scatter plot of ‘carat’ vs ‘price’:

print(ggplot(diamondsSample) # diamondsSample is the data  
 + aes('carat', 'price') # plot 'carat' vs 'price'
 + geom_point() # display the results as a scatter plot
 )

## <ggplot: (41012744)>

Add additional layers e.g. a line of best fit:

print(ggplot(diamondsSample)  
 + aes('carat', 'price') 
 + stat_smooth() # add a line of best fit
 + geom_point()) 

## <ggplot: (-9223372036813567705)>

Add another aesthetic; here the data is coloured by the ‘cut’ variable:

print(ggplot(diamondsSample)
 + aes('carat', 'price')
 + aes(color='cut') # colour the data by the variable cut and create a legend
 + geom_point())

## <ggplot: (-9223372036816020904)>

Add a layer which separates the data into graphs based on ‘color’:

print(ggplot(diamondsSample)
 + aes('carat', 'price')
 + aes(color='cut')
 + facet_wrap('color') # separate the data by 'color' and graph separately
 + geom_point())

## <ggplot: (64014519)>

This article compares a variety of alternative
plotting packages for Python.

Next steps

  • Read the documents that are linked in this blog post.
  • Learn the basics of Pandas.
  • Use Dplython and Plotnine to practice data manipulation and
    visualization. For example, complete some of the exercises on
    Kaggle.

Do you know of other good Python implementations of tidyverse? If so,
let us know about them!