Scatter Plots, Correlation & Trend Lines

Scatter Plot

A graph of paired observations $(x_i, y_i)$. Each point represents one case. Used to detect direction, form, and strength of a relationship.

Pearson Correlation Coefficient

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i-\bar{x})^2 \sum (y_i-\bar{y})^2}}$$

$-1 \leq r \leq 1$. Measures the direction and strength of linear association: the sign of $r$ gives the direction; $|r|$ near 1 = strong, near 0 = weak.

Least-Squares Regression Line

$$\hat{y} = a + bx, \quad b = r\frac{s_y}{s_x}, \quad a = \bar{y} - b\bar{x}$$

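Minimizes the sum of squared residuals $\sum (y_i - \hat{y}_i)^2$. Here $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$ (divisor $n-1$).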
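
The slope can also be written directly in terms of deviation sums, which is often easier for hand calculation. A short derivation, where $S_{xy} = \sum (x_i-\bar{x})(y_i-\bar{y})$, $S_{xx} = \sum (x_i-\bar{x})^2$, and $S_{yy} = \sum (y_i-\bar{y})^2$ are shorthand introduced here for illustration (not notation from the lesson itself):

$$b = r\frac{s_y}{s_x} = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} \cdot \frac{\sqrt{S_{yy}/(n-1)}}{\sqrt{S_{xx}/(n-1)}} = \frac{S_{xy}}{S_{xx}}$$

The $n-1$ divisors and the $\sqrt{S_{yy}}$ factors cancel, so the slope depends only on $S_{xy}$ and $S_{xx}$.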

Coefficient of Determination

$$r^2 = \text{proportion of variance in } y \text{ explained by the linear relationship with } x$$

Example 1

Study hours: $(1,50), (2,55), (3,65), (4,70), (5,80)$. Find $r$.

$\bar{x}=3$, $\bar{y}=64$. Numerator $= (-2)(-14)+(-1)(-9)+0(1)+(1)(6)+(2)(16) = 28+9+0+6+32 = 75$.

$\sum(x_i-\bar{x})^2 = 4+1+0+1+4 = 10$, $\sum(y_i-\bar{y})^2 = 196+81+1+36+256 = 570$.

$r = 75/\sqrt{10 \cdot 570} = 75/75.5 \approx 0.993$. Strong positive.
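
The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `pearson_r` and the variable names are illustrative choices, not part of the lesson.

```python
from math import sqrt

# Data from Example 1, split into x values and y values.
xs = [1, 2, 3, 4, 5]
ys = [50, 55, 65, 70, 80]

def pearson_r(xs, ys):
    """Return (r, S_xy, S_xx, S_yy) from the deviation-sum formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy), s_xy, s_xx, s_yy

r, s_xy, s_xx, s_yy = pearson_r(xs, ys)
print(s_xy, s_xx, s_yy)  # 75.0 10.0 570.0
print(round(r, 3))       # 0.993
```

Returning the three sums alongside $r$ makes it easy to compare each intermediate value with the hand calculation.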

Example 2

Using Example 1: find the regression line.

$s_x = \sqrt{10/4} = 1.581$, $s_y = \sqrt{570/4} = 11.937$, using $\sum(y_i-\bar{y})^2 = 196+81+1+36+256 = 570$ and $r \approx 0.993$.

$b = 0.993 \cdot 11.937/1.581 \approx 7.5$. $a = 64 - 7.5(3) = 41.5$.

$\hat{y} = 41.5 + 7.5x$.
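
As a check on the fitted line, the sketch below recomputes $a$ and $b$ and illustrates the least-squares property numerically. It assumes the equivalent slope form $b = \sum (x_i-\bar{x})(y_i-\bar{y}) / \sum (x_i-\bar{x})^2$ rather than $b = r\,s_y/s_x$; the helper names `least_squares` and `sse` are hypothetical.

```python
# Data from Example 1.
xs = [1, 2, 3, 4, 5]
ys = [50, 55, 65, 70, 80]

def least_squares(xs, ys):
    """Return intercept a and slope b of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar  # forces the line through (x_bar, y_bar)
    return a, b

def sse(a, b, xs, ys):
    """Sum of squared residuals for the line y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(a, b)                   # 41.5 7.5
print(residuals)              # [1.0, -1.5, 1.0, -1.5, 1.0]
print(sum(residuals))         # 0.0
print(sse(a, b, xs, ys))      # 7.5
print(sse(a + 1, b, xs, ys))  # 12.5 (shifting the intercept increases it)
print(sse(a, b + 0.5, xs, ys))  # 21.25 (changing the slope increases it)
```

The residuals sum to zero, and nudging either coefficient away from the fitted values raises the sum of squared residuals, which is what "least squares" means in practice.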

Example 3

$r^2 = 0.81$. Interpret.

81% of the variability in $y$ is explained by the linear relationship with $x$.
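
The "proportion explained" reading can be verified numerically: for a least-squares line, $r^2$ equals $1 - \text{SSE}/\text{SST}$, where SSE is the sum of squared residuals and SST is $\sum (y_i-\bar{y})^2$. The sketch below checks this on the Example 1 data (not on the $r^2 = 0.81$ case, for which no data are given); the variable names are illustrative.

```python
# Data from Example 1.
xs = [1, 2, 3, 4, 5]
ys = [50, 55, 65, 70, 80]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
sst = sum((y - y_bar) ** 2 for y in ys)  # total variation in y

# Least-squares line.
b = s_xy / s_xx
a = y_bar - b * x_bar

# Variation left unexplained by the line.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r_squared_from_r = s_xy ** 2 / (s_xx * sst)  # square of Pearson r
r_squared_from_fit = 1 - sse / sst           # explained share of variation

print(round(r_squared_from_r, 4))    # 0.9868
print(round(r_squared_from_fit, 4))  # 0.9868
```

Both routes give the same value, so about 98.7% of the variability in $y$ for that data set is accounted for by the fitted line.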

Practice Problems

1. If $r = -0.92$, describe the relationship.
2. Does $r = 0$ mean no relationship?
3. Does correlation imply causation?
4. Residual = ?
5. The regression line always passes through what point?
6. If $r = 0.6$, what is $r^2$?
7. Outlier in a scatter plot: effect on $r$?
8. What does a residual plot check?
9. $r$ is sensitive to ___.
10. Predict $\hat{y}$ for $x = 6$ using $\hat{y} = 10 + 3x$.
11. What is extrapolation?
12. Can $r$ detect a curved relationship?

Answer Key

1. Strong negative linear relationship

2. No — just no linear relationship; could be nonlinear

3. No. A correlation can arise from a confounding variable or from coincidence

4. $y_i - \hat{y}_i$ (observed minus predicted)

5. $(\bar{x}, \bar{y})$

6. $0.36$ (36% of variability explained)

7. Can inflate or deflate $r$ substantially

8. Whether a linear model is appropriate (look for patterns)

9. Outliers and non-linearity

10. $\hat{y} = 10+18 = 28$

11. Predicting beyond the range of observed $x$; risky because the linear pattern may not hold there

12. No, $r$ only measures linear association
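
Answers 2, 7, and 12 can be illustrated with a short experiment. The sketch below uses two hypothetical data sets invented for illustration: a perfect parabola, and the Example 1 points with one made-up outlier $(6, 40)$ appended. The function name `pearson_r` is likewise an illustrative choice.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation from the deviation-sum formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical curved data: y = x^2 on x values symmetric about 0.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [x ** 2 for x in curve_x]
r_curve = pearson_r(curve_x, curve_y)
print(r_curve)  # 0.0, even though y is completely determined by x

# Example 1 data, then the same data plus one hypothetical outlier.
xs = [1, 2, 3, 4, 5]
ys = [50, 55, 65, 70, 80]
r_clean = pearson_r(xs, ys)
r_outlier = pearson_r(xs + [6], ys + [40])
print(round(r_clean, 3))    # 0.993
print(round(r_outlier, 3))  # 0.111
```

A single unusual point is enough here to pull $r$ from about 0.99 down to about 0.11, which is why a scatter plot should be inspected before $r$ is trusted.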