
Lessons

- Overview of EDA
- Examining numerical data
- Considering categorical data
- Centrality and variability
- Amounts and proportions
- Comparisons
- Trends

Lesson 1: What is EDA?

EDA: an introduction

EDA is an iterative cycle that helps you understand what your data says. In each cycle, you:
- Generate questions about your data
- Search for answers by visualizing, transforming, and modeling your data
- Use what you learn to refine your questions and/or generate new questions

EDA: an introduction

Your goal during EDA is to develop an understanding of your data.

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey

EDA: two useful questions

There is no rule about which questions you should ask to guide your research. However, two questions are particularly useful:

What type of variation occurs within my variables?
What type of covariation occurs between my variables?
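
As a rough illustration of the two questions, the sketch below uses R's built-in mtcars data. That dataset is an assumption chosen for convenience; it is not used elsewhere in these slides. Variation is summarized for a single variable, covariation for a pair of variables.

```r
# Illustration only: mtcars ships with base R and is not part of these slides.

# Variation: how do the values of a single variable differ from one another?
mpg_summary <- summary(mtcars$mpg)   # min, quartiles, median, mean, max
mpg_sd <- sd(mtcars$mpg)             # spread around the mean

# Covariation: do two variables tend to move together?
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)   # negative: heavier cars, lower mpg

print(mpg_summary)
print(mpg_sd)
print(mpg_wt_cor)
```

In practice a histogram and a scatter plot would usually accompany these numbers.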

Is EDA a tool for discovery or confirmation?

Discovery
Confirmation

When you begin to explore data, is it better to formulate one or two high-quality questions to ask, or many, many questions to explore?

One or two high-quality questions
Many, many questions

Lesson 2: Examining numerical data

scatter plots

- many borrowers with an income below $100,000
- a handful of borrowers with income above $250,000

scatter plots: R code

loan50 |> 
  mutate(total_income = total_income / 1000) |> 
  mutate(loan_amount = loan_amount / 1000) |>
  ggplot(aes(total_income, loan_amount)) +
  geom_point(
    size = 3, 
    color = "#377eb8"
  ) +
  scale_x_continuous(
    breaks = seq(0, 300, 50), 
    labels = scales::label_currency(suffix = "K"), 
    limits = c(0, 350), 
    expand = c(0, 0)
  ) +
  scale_y_continuous(
    breaks = seq(0, 40, 10), 
    labels = scales::label_currency(suffix = "K"), 
    limits = c(0, 42), 
    expand = c(0, 0)) +
  coord_cartesian(clip = "off") +
  labs(x = "Total Income", 
       y = "Loan Amount",
       title = "Scatter Plot of Total Income vs. Loan Amount"
      )

scatter plots

- The relationship is nonlinear, as highlighted by the dashed line.
- What implications can you draw from this pattern?

scatter plots: R code

county |> 
  mutate(median_hh_income = median_hh_income / 1000) |>
  ggplot(aes(poverty, median_hh_income)) +
  geom_point(
    size = 3, 
    color = "#377eb8", 
    alpha = 0.8
  ) +
  geom_point(
    size = 0.5, 
    color = "gray40"
  ) +
  geom_smooth(
    color = "grey30", 
    lty = "dashed"
  ) +
  scale_y_continuous(
    breaks = seq(0, 130, 20), 
    labels = scales::label_currency(suffix = "K"), 
    limits = c(0, 130), 
    expand = c(0, 0)
  ) +
  scale_x_continuous(
    breaks = seq(0, 50, 10), 
    labels = scales::label_number(suffix = "%"), 
    limits = c(0, 50), 
    expand = c(0, 0)
  ) +
  coord_cartesian(clip = "off") +
  labs(x = "Poverty Rate (%)", 
       y = "Median Household Income",
       title = "Scatter Plot of Poverty Rate vs. Median Household Income"
      )

scatter plots

- (a) A scatter plot of population change against the population before the change.
- (b) A scatter plot of the same data, but with the population size log-transformed.
- What can we infer from these plots?
- Why is the log transformation important?
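
One way to see why the log transformation matters: when values span several orders of magnitude, most points are squeezed near zero on a linear axis. The sketch below uses made-up population values (not the county data) to compare how much of each axis the smaller values occupy.

```r
# Hypothetical populations spanning several orders of magnitude
pop <- c(1e3, 5e3, 1e4, 5e4, 1e5, 1e6, 1e7)
n <- length(pop)

# Linear scale: share of the axis (0 to max) covered by all but the largest value
linear_share <- max(pop[-n]) / max(pop)

# Log scale: share of the axis (min to max) covered by the same values
log_pop <- log10(pop)
log_share <- (log_pop[n - 1] - min(log_pop)) / (max(log_pop) - min(log_pop))

print(linear_share)  # 0.1: six of the seven points sit in the lowest 10% of the axis
print(log_share)     # 0.75: the same six points now span three quarters of the axis
```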

scatter plots: R code

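# TeX() comes from latex2exp and plot_layout() from patchwork.
# plot_theme is assumed to be a ggplot2 theme object defined elsewhere in the course materials.
library(latex2exp)
library(patchwork)

p_231 <- 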
  county |> 
  mutate(pop2017 = pop2017 / 1e6) |> 
  ggplot(aes(pop2017, pop_change)) +
  geom_point(
    color = "#377eb8", 
    size = 3, 
    alpha = 0.7
  ) +
  scale_x_continuous(
    breaks = seq(0, 10, 2), 
    limits = c(0, 10), 
    expand = c(0, 0), 
    labels = scales::label_number(suffix = "m")
  ) +
  scale_y_continuous(
    labels = scales::label_number(suffix = "%")
  ) +
  coord_cartesian(clip = "off") +
  labs(x = "(a) Population before change (m = millions)", 
       y = "Population Change (%)") +
  plot_theme

p_232 <- 
  county |> 
  mutate(pop2017 = pop2017 / 1e6) |> 
  ggplot(aes(pop2017, pop_change)) +
  geom_point(
    color = "#377eb8", 
    size = 3, 
    alpha = 0.7
  ) +
  scale_y_continuous(labels = scales::label_number(suffix = "%")) +
  scale_x_log10() +
  coord_cartesian(clip = "off") +
  labs(x = TeX("(b) $\\log_{10}$ Population before change (m = millions)"), 
       y = "Population Change (%)") +
  plot_theme

p_c_233 <- p_231 + p_232 + plot_layout(ncol = 2)

dot plots

- A dot plot is a one-variable scatter plot. The example here shows the interest rates of the 50 loans in loan50.
- Sometimes two variables are one too many: only one variable may be of interest.

dot plots: R code

loan50 |>
  ggplot(aes(x = interest_rate)) +
  geom_dotplot(
    binwidth = 1, 
    method = "histodot",
    fill = "#5A9BD5", 
    color = "#5A9BD5",
    dotsize = 0.8,
    stackratio = 1.2
  ) +
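  # Mark the mean with a single red triangle; annotate() draws it once,
  # whereas geom_point() with aes() would redraw it for every row of loan50
  annotate(
    "point",
    x = mean(loan50$interest_rate),
    y = -0.05,
    color = "red",
    size = 4,
    shape = 17,
    stroke = 5
  ) +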
  geom_hline(yintercept = 0, color = "gray60") +
  scale_y_continuous(NULL, breaks = NULL) +
  scale_x_continuous(
    breaks = seq(5, 30, 5),
    labels = scales::percent_format(scale = 1),
    limits = c(5, 30)
  ) +
  coord_cartesian(ylim = c(-0.05, 1), clip = "off") +
  labs(x = "Interest Rate, Rounded to Nearest Percent")

histogram and shapes

Lesson 4: Centrality and variability

Centrality (aka the “Average” value)

A single number representing the middle of a set of numbers

Mean: $\frac{\text{Sum of values}}{\text{Number of values}}$
Median: “Middle” value (50% of data above & below)
Mode: Most frequent value (usually for categorical data)
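
A minimal sketch of the three measures on made-up values. Base R has `mean()` and `median()`, but its `mode()` reports an object's storage type, so the statistical mode is taken from a frequency table here.

```r
x <- c(2, 3, 3, 5, 7, 10)   # made-up values for illustration

x_mean <- sum(x) / length(x)   # sum of values / number of values, same as mean(x)
x_median <- median(x)          # average of the two middle values when n is even

# Most frequent value, read off a frequency table
counts <- table(x)
x_mode <- as.numeric(names(counts)[which.max(counts)])

print(c(mean = x_mean, median = x_median, mode = x_mode))
```

If several values tie for most frequent, `which.max()` returns only the first of them.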

Centrality (aka the “Average” value)

The mean is not always the “best” choice

wildlife_impacts %>%
    filter(! is.na(height)) %>%
    summarise(
      mean = mean(height),
      median = median(height))
# A tibble: 1 × 2
   mean median
  <dbl>  <dbl>
1  984.     50

Percent of data below mean:

percentiles <- ecdf(wildlife_impacts$height)
meanP <- percentiles(mean(wildlife_impacts$height, na.rm = TRUE))
paste0(round(100*meanP, 1), '%')
[1] "73.9%"

Variability (“spread”)

Standard deviation: the typical distance of values from the mean
- $s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}}$
Interquartile range (IQR): $Q_3 - Q_1$ (middle 50% of data)
Range: max - min
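
The three measures can be checked by hand against R's built-in functions. This sketch uses the same values as data1 later in this lesson and assumes R's default quantile method.

```r
x <- c(3, 3, 4, 5, 5, 6, 6, 7, 8, 9)

n <- length(x)
s_manual <- sqrt(sum((x - mean(x))^2) / (n - 1))   # the formula above, term by term

iqr_manual <- unname(quantile(x, 0.75) - quantile(x, 0.25))   # Q3 - Q1
range_manual <- max(x) - min(x)                               # max - min

# The built-in functions should agree with the manual versions
stopifnot(isTRUE(all.equal(s_manual, sd(x))))
stopifnot(isTRUE(all.equal(iqr_manual, IQR(x))))

print(c(sd = s_manual, iqr = iqr_manual, range = range_manual))
```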

Variability (“spread”)

Complaints are coming in about orders shipped from warehouse B, so you collect some data.
Here the averages are misleading: both warehouses have the same mean and median.

days_to_ship
# A tibble: 12 × 3
   order warehouseA warehouseB
   <int>      <dbl>      <dbl>
 1     1          3          1
 2     2          3          1
 3     3          3          1
 4     4          4          3
 5     5          4          3
 6     6          4          4
 7     7          5          5
 8     8          5          5
 9     9          5          5
10    10          5          6
11    11          5          7
12    12          5         10

days_to_ship |> 
  pivot_longer(-order, names_to = "warehouse", values_to = "days") |> 
  group_by(warehouse) |>
  summarise(
    mean = mean(days),
    median = median(days))

# A tibble: 2 × 3
  warehouse   mean median
  <chr>      <dbl>  <dbl>
1 warehouseA  4.25    4.5
2 warehouseB  4.25    4.5

Variability (“spread”)

Complaints are coming in about orders shipped from warehouse B, so you collect some data:
Measures of variability reveal the difference in days to ship.

days_to_ship |> 
  pivot_longer(-order, names_to = "warehouse", values_to = "days") |> 
  group_by(warehouse) |>
  summarise(
    mean = mean(days),
    sd = sd(days),
    iqr = IQR(days),
    range = max(days) - min(days))

# A tibble: 2 × 5
  warehouse   mean    sd   iqr range
  <chr>      <dbl> <dbl> <dbl> <dbl>
1 warehouseA  4.25 0.866  1.25     2
2 warehouseB  4.25 2.70   2.75     9

Variability (“spread”)

Outliers

Mean and standard deviation are sensitive to outliers

Outliers: values below $Q_1 - 1.5 \times \text{IQR}$ or above $Q_3 + 1.5 \times \text{IQR}$
Extreme values: values below $Q_1 - 3 \times \text{IQR}$ or above $Q_3 + 3 \times \text{IQR}$

data1 <- c(3,3,4,5,5,6,6,7,8,9)

Mean: 5.6
Standard deviation: 2.01
Median: 5.5
IQR: 2.5

data2 <- c(3,3,4,5,5,6,6,7,8,20)

Mean: 6.7
Standard deviation: 4.95
Median: 5.5
IQR: 2.5
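
The fence rules can be applied to data2 directly. This is a sketch that assumes R's default quantile method; other quartile definitions can shift the fences slightly.

```r
data2 <- c(3, 3, 4, 5, 5, 6, 6, 7, 8, 20)

q1 <- unname(quantile(data2, 0.25))
q3 <- unname(quantile(data2, 0.75))
iqr <- q3 - q1

# Outlier fences: 1.5 * IQR beyond the quartiles
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
outliers <- data2[data2 < lower | data2 > upper]

# Extreme-value fences: 3 * IQR beyond the quartiles
extreme <- data2[data2 < q1 - 3 * iqr | data2 > q3 + 3 * iqr]

print(c(lower = lower, upper = upper))
print(outliers)
print(extreme)
```

The value 20 falls outside both sets of fences, which is why it moves the mean and standard deviation so much while leaving the median and IQR unchanged.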

Outliers

Source: Data Science Discovery

Outliers

Robust statistics for continuous data (less sensitive to outliers)

Centrality: use median rather than mean
Variability: use IQR rather than standard deviation

“Visualizing data helps us think”

anscombe |> tibble()
# A tibble: 11 × 8
      x1    x2    x3    x4    y1    y2    y3    y4
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1    10    10    10     8  8.04  9.14  7.46  6.58
 2     8     8     8     8  6.95  8.14  6.77  5.76
 3    13    13    13     8  7.58  8.74 12.7   7.71
 4     9     9     9     8  8.81  8.77  7.11  8.84
 5    11    11    11     8  8.33  9.26  7.81  8.47
 6    14    14    14     8  9.96  8.1   8.84  7.04
 7     6     6     6     8  7.24  6.13  6.08  5.25
 8     4     4     4    19  4.26  3.1   5.39 12.5 
 9    12    12    12     8 10.8   9.13  8.15  5.56
10     7     7     7     8  4.82  7.26  6.42  7.91
11     5     5     5     8  5.68  4.74  5.73  6.89

anscombe_summary_stats
# A tibble: 2 × 9
  statistic    x1    x2    x3    x4    y1    y2    y3    y4
  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mean       9     9     9     9     7.50  7.50  7.5   7.50
2 sd         3.32  3.32  3.32  3.32  2.03  2.03  2.03  2.03
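
The slides do not show how anscombe_summary_stats was built. One possible base R reconstruction is sketched below; the object names are hypothetical. The x-y correlations are included because they are also nearly identical across the four pairs.

```r
# anscombe is a built-in R dataset; summary_stats is a hypothetical stand-in
# for the anscombe_summary_stats object shown above
summary_stats <- rbind(
  mean = sapply(anscombe, mean),
  sd   = sapply(anscombe, sd)
)
print(round(summary_stats, 2))

# Correlation between x and y within each of the four pairs
correlations <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
print(round(correlations, 3))
```

Matching summaries like these are the reason the four datasets need to be plotted before drawing conclusions.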

Anscombe’s Quartet

Stephen Few (2009, p6)

Data type determines how to summarize it

|          | Nominal (categorical) | Ordinal (categorical) | Numerical (continuous) |
|----------|-----------------------|-----------------------|------------------------|
| Measures | Frequency counts, proportions | Frequency counts, proportions, median, mode, IQR | Mean, median, range, standard deviation, IQR |
| Charts   | Bars | Bars | Histogram, boxplot |

Summarizing Nominal data

Summarize with counts/percentages

wildlife_impacts |> 
  count(operator, sort = TRUE) |> 
  mutate(percent = n / sum(n))
# A tibble: 4 × 3
  operator               n percent
  <chr>              <int>   <dbl>
1 SOUTHWEST AIRLINES 17970   0.315
2 UNITED AIRLINES    15116   0.265
3 AMERICAN AIRLINES  14887   0.261
4 DELTA AIR LINES     9005   0.158

Visualize with bars

wildlife_impacts |> 
  count(operator, sort = TRUE) |> 
  ggplot(aes(x = fct_reorder(operator, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Operator", y = "Count") +
  theme_minimal()

Summarizing Ordinal data

Summarize: counts/percentages

wildlife_impacts |> 
  count(incident_month, sort = TRUE) |> 
  mutate(percent = n / sum(n))
# A tibble: 12 × 3
   incident_month     n percent
            <dbl> <int>   <dbl>
 1              9  7980  0.140 
 2             10  7754  0.136 
 3              8  7104  0.125 
 4              5  6161  0.108 
 5              7  6133  0.108 
 6              6  4541  0.0797
 7              4  4490  0.0788
 8             11  4191  0.0736
 9              3  2678  0.0470
10             12  2303  0.0404
11              1  1951  0.0342
12              2  1692  0.0297

Visualize: bars

wildlife_impacts |> 
  count(incident_month, sort = TRUE) |> 
  ggplot(aes(x = as.factor(incident_month), y = n)) +
  geom_col() +
  labs(x = "Incident Month", y = "Count")
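
Because months are ordinal, it can help to store them as an ordered factor so that tables and bar charts follow calendar order rather than count order. A minimal base R sketch with made-up month numbers standing in for incident_month:

```r
# Hypothetical month numbers (not the wildlife_impacts data)
incident_month <- c(9, 10, 8, 5, 9, 1, 12, 9)

# month.abb is a base R constant: "Jan", "Feb", ..., "Dec"
month_fct <- factor(
  month.abb[incident_month],
  levels = month.abb,
  ordered = TRUE
)

# table() keeps calendar order and includes months with zero counts
month_counts <- table(month_fct)
print(month_counts)
```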

Summarizing continuous data

Histograms:

Skewness
Number of modes

Boxplots:

Outliers
Comparing variables

Histogram: Identify Skewness & # of Modes

Summarise:

Min, quartiles, median, and mean (the range and IQR follow from these; sd needs a separate call):

summary(wildlife_impacts$height)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    0.0     0.0    50.0   983.8  1000.0 25000.0   18038

Visualize:

Histogram (identify skewness & modes)

Histogram: Identify Skewness & # of Modes

[Figure: histograms of Height and Speed]

Boxplot: Identify outliers

[Figure: boxplots of Height and Speed]

Histogram and Boxplot

Histogram

Skewness
Modes

Boxplot

Outliers
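
The numbers behind both charts can be inspected without drawing them. The sketch below uses made-up, right-skewed values rather than the wildlife_impacts data. Note that boxplot.stats() uses hinges, which can differ slightly from the quartiles returned by quantile().

```r
# Made-up, right-skewed values (not the wildlife_impacts data)
height <- c(0, 0, 0, 10, 50, 50, 100, 500, 1000, 1000, 5000, 25000)

# A mean far above the median is a quick hint of right skew
skew_hint <- mean(height) > median(height)

# hist(plot = FALSE) returns the bin counts without drawing the histogram
bins <- hist(height, breaks = seq(0, 25000, by = 5000), plot = FALSE)

# boxplot.stats() returns the points a boxplot would draw as outliers
box_out <- boxplot.stats(height)$out

print(skew_hint)
print(bins$counts)
print(box_out)
```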