EDA is an iterative cycle that helps you understand what your data says.
Your goal during EDA is to develop an understanding of your data.
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey
There is no rule about which questions you should ask to guide your research. However, two questions are particularly useful:
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
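A minimal sketch of how these two questions might translate into code. It uses R's built-in `anscombe` data purely as an illustration (the choice of the `x1` and `y1` columns is ours): variation is examined one variable at a time, covariation for a pair of variables.

```r
# Illustration with R's built-in anscombe data (our choice of example).
x <- anscombe$x1
y <- anscombe$y1

# Variation: how do the values of a single variable differ from each other?
variation <- c(min = min(y), median = median(y), max = max(y), sd = sd(y))

# Covariation: do two variables tend to move together?
covariation <- cor(x, y)

print(round(variation, 2))
print(round(covariation, 2))
```

Numbers like these are only a starting point; plots usually tell you more about both kinds of variation.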
Discovery
Confirmation
One or two high-quality questions
Many, many questions
# Packages assumed by the code in this section
library(tidyverse)   # dplyr, tidyr, ggplot2
library(openintro)   # loan50 and county datasets

# Loan amount vs. total income, both rescaled to thousands of dollars
loan50 |>
  mutate(total_income = total_income / 1000) |>
  mutate(loan_amount = loan_amount / 1000) |>
  ggplot(aes(total_income, loan_amount)) +
  geom_point(
    size = 3,
    color = "#377eb8"
  ) +
  scale_x_continuous(
    breaks = seq(0, 300, 50),
    labels = scales::label_currency(suffix = "K"),
    limits = c(0, 350),
    expand = c(0, 0)
  ) +
  scale_y_continuous(
    breaks = seq(0, 40, 10),
    labels = scales::label_currency(suffix = "K"),
    limits = c(0, 42),
    expand = c(0, 0)
  ) +
  coord_cartesian(clip = "off") +
  labs(
    x = "Total Income",
    y = "Loan Amount",
    title = "Scatter Plot of Total Income vs. Loan Amount"
  )

# Median household income (in thousands) vs. poverty rate, by county
county |>
  mutate(median_hh_income = median_hh_income / 1000) |>
  ggplot(aes(poverty, median_hh_income)) +
  geom_point(
    size = 3,
    color = "#377eb8",
    alpha = 0.8
  ) +
  geom_point(
    size = 0.5,
    color = "gray40"
  ) +
  geom_smooth(
    color = "grey30",
    lty = "dashed"
  ) +
  scale_y_continuous(
    breaks = seq(0, 130, 20),
    labels = scales::label_currency(suffix = "K"),
    limits = c(0, 130),
    expand = c(0, 0)
  ) +
  scale_x_continuous(
    breaks = seq(0, 50, 10),
    labels = scales::label_number(suffix = "%"),
    limits = c(0, 50),
    expand = c(0, 0)
  ) +
  coord_cartesian(clip = "off") +
  labs(
    x = "Poverty Rate (%)",
    y = "Median Household Income",
    title = "Scatter Plot of Poverty Rate vs. Median Household Income"
  )

What can we infer from these plots?
Why is log transformation important?
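One way to see the effect, as a rough sketch: the values below are simulated from a log-normal distribution (a hypothetical stand-in, not the `county` data). On the raw scale a handful of very large values dominate; on the log10 scale the distribution is close to symmetric.

```r
set.seed(1)
# Simulated, strongly right-skewed values (hypothetical; not the county data).
pop <- rlnorm(1000, meanlog = 10, sdlog = 1.5)

# Raw scale: a few huge values pull the mean far above the median.
raw_ratio <- mean(pop) / median(pop)

# log10 scale: roughly symmetric, so mean and median nearly coincide.
log_pop <- log10(pop)
log_ratio <- mean(log_pop) / median(log_pop)

print(c(raw = raw_ratio, log10 = log_ratio))
```

A mean-to-median ratio well above 1 signals right skew; a ratio near 1 after the transformation suggests the log scale spreads the data more evenly, which is what makes patterns visible in a plot.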
# Extra packages assumed by the two-panel plot below
library(patchwork)   # plot_layout() and `+` for combining plots
library(latex2exp)   # TeX()

# plot_theme is not defined in this section; it is assumed to be a ggplot2
# theme object created elsewhere.

# (a) Population change vs. population on the original scale (millions)
p_231 <-
  county |>
  mutate(pop2017 = pop2017 / 1e6) |>
  ggplot(aes(pop2017, pop_change)) +
  geom_point(
    color = "#377eb8",
    size = 3,
    alpha = 0.7
  ) +
  scale_x_continuous(
    breaks = seq(0, 10, 2),
    limits = c(0, 10),
    expand = c(0, 0),
    labels = scales::label_number(suffix = "m")
  ) +
  scale_y_continuous(
    labels = scales::label_number(suffix = "%")
  ) +
  coord_cartesian(clip = "off") +
  labs(
    x = "(a) Population before change (m = millions)",
    y = "Population Change (%)"
  ) +
  plot_theme

# (b) Same data with a log10-scaled x-axis
p_232 <-
  county |>
  mutate(pop2017 = pop2017 / 1e6) |>
  ggplot(aes(pop2017, pop_change)) +
  geom_point(
    color = "#377eb8",
    size = 3,
    alpha = 0.7
  ) +
  scale_y_continuous(labels = scales::label_number(suffix = "%")) +
  scale_x_log10() +
  coord_cartesian(clip = "off") +
  labs(
    x = TeX("(b) $\\log_{10}$ Population before change (m = millions)"),
    y = "Population Change (%)"
  ) +
  plot_theme

# Place the two panels side by side
p_c_233 <- p_231 + p_232 + plot_layout(ncol = 2)

# Dot plot of interest rates; the red triangle marks the mean
loan50 |>
  ggplot(aes(x = interest_rate)) +
  geom_dotplot(
    binwidth = 1,
    method = "histodot",
    fill = "#5A9BD5",
    color = "#5A9BD5",
    dotsize = 0.8,
    stackratio = 1.2
  ) +
  geom_point(
    aes(y = -0.05, x = mean(loan50$interest_rate)),
    color = "red",
    size = 4,
    shape = 17,
    fill = "red",
    stroke = 5
  ) +
  geom_hline(yintercept = 0, color = "gray60") +
  scale_y_continuous(NULL, breaks = NULL) +
  scale_x_continuous(
    breaks = seq(5, 30, 5),
    labels = scales::percent_format(scale = 1),
    limits = c(5, 30)
  ) +
  coord_cartesian(ylim = c(-0.05, 1), clip = "off") +
  labs(x = "Interest Rate, Rounded to Nearest Percent")

Centrality: a single number representing the middle of a set of numbers
Mean: \(\frac{\text{Sum of values}}{\text{Number of values}}\)
Median: “Middle” value (50% of data above & below)
Mode: Most frequent value (usually for categorical data)
Percent of data below mean:
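A small sketch of these measures in base R. The vector `x` is a made-up, right-skewed sample; base R has no built-in statistical mode, so the most frequent value is looked up with `table()`.

```r
# Hypothetical right-skewed sample, used only for illustration.
x <- c(2, 3, 3, 4, 5, 6, 20)

mean_x <- mean(x)      # sum of values / number of values
median_x <- median(x)  # middle value of the sorted data

# Mode: the most frequent value (the first one, if there are ties).
counts <- table(x)
mode_x <- as.numeric(names(counts)[which.max(counts)])

# Percent of observations that fall below the mean.
pct_below_mean <- mean(x < mean_x) * 100

print(c(mean = mean_x, median = median_x, mode = mode_x,
        pct_below_mean = pct_below_mean))
```

With one large value in the sample, the mean sits above most of the data, so well over half of the observations fall below it.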

Standard deviation: how far values typically spread from the mean
Interquartile range (IQR): \(Q_3 - Q_1\) (middle 50% of data)
Range: max - min
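The three measures of variability can be computed in base R as sketched below (the vector `x` is hypothetical). Note that base R's `range()` returns the minimum and maximum rather than their difference, and `IQR()` uses R's default quantile method.

```r
# Hypothetical sample, used only for illustration.
x <- c(2, 4, 4, 5, 6, 7, 9)

sd_x <- sd(x)               # sample standard deviation (n - 1 denominator)
iqr_x <- IQR(x)             # Q3 - Q1, default quantile method (type 7)
range_x <- max(x) - min(x)  # range() alone would return c(min, max)

# The quartiles behind the IQR.
q <- quantile(x, c(0.25, 0.75), names = FALSE)

print(c(sd = sd_x, iqr = iqr_x, range = range_x))
```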
Complaints are coming in about orders shipped from warehouse B, so you collect some data.
… here averages are misleading
Variability reveals the difference in days to ship
days_to_ship |>
  pivot_longer(-order, names_to = "warehouse", values_to = "days") |>
  group_by(warehouse) |>
  summarise(
    mean = mean(days),
    sd = sd(days),
    iqr = IQR(days),
    range = max(days) - min(days)
  )
# A tibble: 2 × 5
  warehouse   mean    sd   iqr range
  <chr>      <dbl> <dbl> <dbl> <dbl>
1 warehouseA  4.25 0.866  1.25     2
2 warehouseB  4.25 2.70   2.75     9
Mean and standard deviation are sensitive to outliers
Outliers: below \(Q_1 - 1.5 \times \text{IQR}\) or above \(Q_3 + 1.5 \times \text{IQR}\)
Extreme values: below \(Q_1 - 3 \times \text{IQR}\) or above \(Q_3 + 3 \times \text{IQR}\)

Source: Data Science Discovery
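The 1.5 and 3 IQR rules above can be sketched as follows. The data are made up, and the quartiles use R's default `quantile()` method, so other software may place the fences slightly differently.

```r
# Hypothetical sample with two unusually large values.
x <- c(3, 4, 4, 5, 5, 5, 6, 6, 7, 11, 30)

q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Outliers: beyond 1.5 IQR from the quartiles.
is_outlier <- x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr

# Extreme values: beyond 3 IQR from the quartiles.
is_extreme <- x < q1 - 3 * iqr | x > q3 + 3 * iqr

print(x[is_outlier])
print(x[is_extreme])
```

Every extreme value is also an outlier, but not the other way round: here 11 is flagged only by the 1.5 IQR rule, while 30 is flagged by both.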
Robust statistics for continuous data (less sensitive to outliers)
Centrality: use median rather than mean
Variability: use IQR rather than standard deviation
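A quick way to see why these are called robust, as a sketch with made-up numbers: add a single wild value to a small sample and compare how much each statistic moves. The helper `summarise_both()` is a hypothetical name introduced only for this example.

```r
# Hypothetical sample, then the same sample with one wild value added.
clean <- c(4, 5, 5, 6, 6, 7, 7, 8)
dirty <- c(clean, 100)

# Hypothetical helper: centrality and variability, classic and robust.
summarise_both <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v), iqr = IQR(v))
}

result <- rbind(clean = summarise_both(clean), dirty = summarise_both(dirty))
print(round(result, 2))
```

In this example the mean and standard deviation jump when the outlier is added, while the median and IQR stay where they were.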
anscombe |> tibble()
# A tibble: 11 × 8
      x1    x2    x3    x4    y1    y2    y3    y4
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1    10    10    10     8  8.04  9.14  7.46  6.58
 2     8     8     8     8  6.95  8.14  6.77  5.76
 3    13    13    13     8  7.58  8.74 12.7   7.71
 4     9     9     9     8  8.81  8.77  7.11  8.84
 5    11    11    11     8  8.33  9.26  7.81  8.47
 6    14    14    14     8  9.96  8.1   8.84  7.04
 7     6     6     6     8  7.24  6.13  6.08  5.25
 8     4     4     4    19  4.26  3.1   5.39 12.5
 9    12    12    12     8 10.8   9.13  8.15  5.56
10     7     7     7     8  4.82  7.26  6.42  7.91
11     5     5     5     8  5.68  4.74  5.73  6.89

| Nominal (categorical) | Ordinal (categorical) | Numerical (continuous) |
|---|---|---|
| Measures | Measures | Measures |
| Charts | Charts | Charts |
Summarize with counts/percentages
wildlife_impacts |>
  count(incident_month, sort = TRUE) |>
  mutate(percent = n / sum(n))
# A tibble: 12 × 3
   incident_month     n percent
            <dbl> <int>   <dbl>
 1              9  7980  0.140
 2             10  7754  0.136
 3              8  7104  0.125
 4              5  6161  0.108
 5              7  6133  0.108
 6              6  4541  0.0797
 7              4  4490  0.0788
 8             11  4191  0.0736
 9              3  2678  0.0470
10             12  2303  0.0404
11              1  1951  0.0342
12              2  1692  0.0297

Histograms:
Skewness
Number of modes
Boxplots:
Outliers
Comparing variables
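A sketch of what each chart is computing underneath, using simulated right-skewed data (hypothetical, drawn from an exponential distribution). `plot = FALSE` returns the histogram bins without drawing; drop it, or call `boxplot(x)`, to see the actual charts.

```r
set.seed(42)
# Hypothetical right-skewed sample, used only for illustration.
x <- rexp(500, rate = 1 / 10)

# Histogram: bin counts show the shape (skewness, number of modes).
h <- hist(x, breaks = 20, plot = FALSE)
print(h$counts)

# Boxplot: five-number summary plus the points flagged as outliers
# (more than 1.5 times the box length beyond the box).
b <- boxplot.stats(x)
print(b$stats)        # lower whisker, lower hinge, median, upper hinge, upper whisker
print(length(b$out))  # number of flagged points
```

Bin counts that tail off slowly to the right indicate right skew, and the same long tail shows up in the boxplot as flagged points above the upper whisker.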

Summarize:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    0.0     0.0    50.0   983.8  1000.0 25000.0   18038
Visualize:
[Figure: histograms and boxplots of Height and Speed]

Econ 115a: Econometrics