The article above, which deals with a single variable, shows several graphs of its distribution.

In short, if you have a single variable and want to explore it, you can look at a dot plot, a histogram, or a kernel density estimate curve.

I also recommend the violin plot, which combines a box plot with a kernel density estimate and so gives a more detailed view of the data.

Two Variables: Establishing Relationships

Let’s look at some examples.

Scatter Plots and Smooth lines

The smoothing lines used here are a linear regression line and a loess (locally weighted polynomial regression) curve.

### the relationship between two variables
# smoothing lines: a linear regression line and loess (locally weighted polynomial)
library(ggplot2)
library(gcookbook)  # assumed source of the heightweight data set

ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point(colour = 'grey60') +
  stat_smooth(method = lm, se = FALSE, colour = "blue") +
  stat_smooth(method = loess, se = FALSE, colour = 'red') +
  ggtitle('Relationship between two variables in R by ggplot2') + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"))

### the relationship between two variables
# smoothing lines: a linear regression line and lowess (locally weighted regression)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

tips = pd.read_csv('E:\\git\\blog_rmarkdown\\data\\tips.csv')

ax = sns.regplot(x = "total_bill", y = "tip", data = tips, fit_reg = True, ci = None, lowess = False)
sns.regplot(x = "total_bill", y = "tip", data = tips, fit_reg = True, ci = None, lowess = True, ax = ax)
plt.title('Relationship between two variables in Python by seaborn')
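
The straight line drawn by `method = lm` in ggplot2 and by `regplot` in seaborn is an ordinary least-squares fit. As a rough illustration of what that fit computes, here is a minimal sketch in plain Python. The function name `least_squares_line` and the bill and tip values are made up for this example and do not come from the `tips` data set.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical bills and tips, invented for illustration (tip = 15% of bill)
bills = [10.0, 20.0, 30.0, 40.0]
tips_paid = [1.5, 3.0, 4.5, 6.0]

slope, intercept = least_squares_line(bills, tips_paid)
print(slope, intercept)
```

Loess differs in that it repeats a weighted fit of this kind locally around each point, which is why it can follow curvature that a single straight line cannot.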

# smoothing lines use logistic regression line
library(MASS)  # provides the biopsy data set

b <- biopsy
b$classn[b$class == "benign"] <- 0
b$classn[b$class == "malignant"] <- 1
ggplot(b, aes(x = V1, y = classn)) +
  geom_point(position = position_jitter(width = 0.3, height = 0.06),
             alpha = 0.4, shape = 21, size = 1.5) +
  stat_smooth(method = glm, method.args = list(family = "binomial")) +
  ggtitle('Relationship between two variables in R by ggplot2') +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"))

# smoothing lines use logistic regression line
tips["big_tip"] = (tips.tip / tips.total_bill) > .15

sns.lmplot(x = "total_bill", y = "big_tip", data = tips, logistic = True, y_jitter = .03)

Logarithmic Plots

In a logarithmic plot, we graph the logarithm of the data instead of the raw data.

Logarithmic plots turn multiplicative variations into additive ones.

They also reveal exponential and power-law behavior.
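
To see why a logarithmic plot reveals power-law behavior, note that y = a * x^b becomes log y = log a + b * log x, a straight line with slope b on log-log axes. The sketch below checks this numerically. The parameters `a` and `b` and the x values are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical power law y = a * x**b with made-up parameters
a, b = 3.0, 0.75
xs = [1, 10, 100, 1000, 10000]
ys = [a * x ** b for x in xs]

# Take logarithms of both variables, as a log-log plot does
log_xs = [math.log10(x) for x in xs]
log_ys = [math.log10(y) for y in ys]

# Slope between consecutive points on the log-log scale
slopes = [
    (log_ys[i + 1] - log_ys[i]) / (log_xs[i + 1] - log_xs[i])
    for i in range(len(xs) - 1)
]
print(slopes)  # every slope is (numerically) the exponent b
```

An exponential relationship y = a * exp(k * x) behaves the same way when only the y axis is logarithmic: log y is then linear in x.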

# logarithmic plots

ggplot(Animals, aes(x = body, y = brain, label = rownames(Animals))) +
  geom_text(size = 3) +
  ggtitle('No logarithmic plot in R by ggplot2') +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"))

ggplot(Animals, aes(x = body, y = brain, label = rownames(Animals))) +
  geom_text(size = 3) +
  scale_x_log10() + scale_y_log10() +
  ggtitle('Logarithmic plot in R by ggplot2') +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"))

It’s also possible to use a logarithmic scale on just one axis.

It is often useful to represent financial data this way, because it better represents proportional changes.
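
A small numeric sketch of the point about proportional changes: on a linear axis a doubling from 2 to 4 looks tiny next to a doubling from 50 to 100, while on a logarithmic axis both cover the same distance. The prices are hypothetical and are not taken from the `aapl` data.

```python
import math

# Hypothetical price moves: each pair is a doubling
pairs = [(2.0, 4.0), (50.0, 100.0)]

linear_gaps = [end - start for start, end in pairs]
log_gaps = [math.log10(end) - math.log10(start) for start, end in pairs]

for (start, end), lin, log in zip(pairs, linear_gaps, log_gaps):
    print(f"{start} -> {end}: linear gap {lin}, log10 gap {log:.4f}")
```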

ggplot(aapl, aes(x = date, y = adj_price)) +
  geom_line() +
  ggtitle('No logarithmic plot in R by ggplot2') +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"))

ggplot(aapl, aes(x = date, y = adj_price)) +
  geom_line() +
  scale_y_log10(breaks = c(2, 10, 50, 250)) +
  ggtitle('Logarithmic plot in R by ggplot2') +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"))

Smoothing methods and logarithmic plots are both tools that help us recognize structure in a data set.

Smoothing methods reduce noise, and logarithmic plots help with data sets spanning many orders of magnitude.


Banking

Banking is another graphical method for recognizing structure. Unlike the two above, it does not work on the data but on the plot as a whole, by changing its aspect ratio.

ggplot(marathon, aes(x = Half, y = Full)) +
  geom_point() +
  coord_fixed() +
  scale_y_continuous(breaks = seq(0, 420, 30)) +
  scale_x_continuous(breaks = seq(0, 420, 30)) +
  ggtitle('Different banking in R by ggplot2') +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"))

ggplot(marathon, aes(x = Half, y = Full)) +
  geom_point() +
  coord_fixed(ratio = 1/2) +
  scale_y_continuous(breaks = seq(0, 420, 30)) +
  scale_x_continuous(breaks = seq(0, 420, 15)) +
  ggtitle('Different banking in R by ggplot2') +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"))
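
One common rule for choosing the aspect ratio is "banking to 45 degrees": pick the ratio so that the typical line segment in the plot rises at about 45 degrees. The sketch below uses the median absolute slope of consecutive segments. It assumes the result is read the way `coord_fixed(ratio = ...)` reads it, as the length of one y unit relative to one x unit. The function name `banking_ratio` and the times are hypothetical.

```python
from statistics import median


def banking_ratio(xs, ys):
    """Length of one y unit relative to one x unit that puts the median
    absolute slope of consecutive segments at 45 degrees."""
    points = sorted(zip(xs, ys))
    slopes = [
        abs((y1 - y0) / (x1 - x0))
        for (x0, y0), (x1, y1) in zip(points, points[1:])
        if x1 != x0  # skip vertical segments
    ]
    # Assumes the median slope is non-zero; a flat series has no 45-degree banking
    return 1.0 / median(slopes)


# Hypothetical half- and full-marathon times in minutes (invented for illustration)
half = [80, 90, 100, 110, 120]
full = [160, 182, 200, 221, 240]

ratio = banking_ratio(half, full)
print(ratio)
```

With full times roughly twice the half times, the sketch lands near the `ratio = 1/2` used in the second plot. For noisy scatter data, the segment slopes would normally be taken from a smoothed trend rather than from the raw points.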

In the book Data Analysis with Open Source Tools, the author offers some suggestions on how to make a graph.

Let’s look at the three-step (maybe four-step) process the author describes.

Time as a variable: Time-Series Analysis

Every time series has several components: trend, seasonality, noise, and others.

Given these components, we can summarize what it means to “analyze” a time series.

We have three basic tasks: description, prediction and control.

Smoothing the time series - removing noise

There are several methods for doing this, such as the moving average, the weighted moving average, and exponential smoothing (the Holt-Winters method).

# exponential smoothing
dd_past <- window(AirPassengers, end = c(1957, 12))
m <- HoltWinters(dd_past, seasonal = "mult")
dd_pred <- predict(m, n.ahead = 36)
plot(m, dd_pred)
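
The two simplest methods named above can be sketched in a few lines of plain Python. This is a minimal sketch rather than a Holt-Winters implementation: it shows an unweighted moving average and single exponential smoothing, which Holt-Winters extends with trend and seasonal terms. The function names and the series are made up for illustration.

```python
def moving_average(values, window):
    """Unweighted mean over a sliding window.

    Returns len(values) - window + 1 points, one per full window.
    """
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def exponential_smoothing(values, alpha):
    """Single exponential smoothing: s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    smoothed = [values[0]]  # start from the first observation
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical noisy series, invented for illustration
series = [10, 12, 9, 11, 13, 10, 12, 14]

ma = moving_average(series, window=3)
es = exponential_smoothing(series, alpha=0.5)
print(ma)
print(es)
```

A smaller `alpha` gives a smoother curve that reacts more slowly to new values. A larger `window` has the same effect for the moving average.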

More than two variables: graphical multivariate analysis

The scatter-plot matrix

# scatter plot matrix

c2009 <- subset(countries, Year == 2009,
                select = c(Name, GDP, laborrate, healthexp, infmortality))

# upper panel: print the absolute correlation, scaled by its size
panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...) {
  usr <- par("usr")
  on.exit(par(usr))  # restore the original plot coordinates on exit
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y, use="complete.obs"))
  txt <- format(c(r, 0.123456789), digits=digits)[1]
  txt <- paste(prefix, txt, sep="")
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex =  cex.cor * (1 + r) / 2)
}

# diagonal panel: histogram of each variable
panel.hist <- function(x, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5) )
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}

pairs(c2009[,2:5], upper.panel = panel.cor,
      diag.panel  = panel.hist,
      lower.panel = panel.smooth)

iris = pd.read_csv('E:\\git\\blog_rmarkdown\\data\\iris.csv')


The conditional plots

Conditional plots are especially useful if some of the variables in a data set are clearly “control” variables, because they provide a systematic way to study the dependence of the remaining variables on the controls.

# conditional plots
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl) +
  ggtitle('The conditional plots in R by ggplot2') +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"))

sns.lmplot(x="total_bill", y="tip", hue="smoker", col="time", data=tips)

There are many other ways to make a graph, and you can explore them on your own.


