R — Statistical Computing & Graphics

What Is R?

R is a free, open-source programming language and environment designed for statistical computing and graphics. It is widely used in academia, biostatistics, data science, and machine learning. CRAN (Comprehensive R Archive Network) hosts over 20 000 add-on packages.

Key Data Types & Variables
  • Numeric: x <- 3.14 — a decimal number. <- is the assignment operator.
  • Integer: n <- 5L — the L suffix forces integer type.
  • Character: name <- "voltage" — a text string.
  • Logical: TRUE, FALSE — boolean values.
  • Vector: v <- c(1, 4, 9, 16) — the fundamental data structure; c() combines elements.
  • Data frame: df <- data.frame(x=1:5, y=c(2,4,6,8,10)) — a table of columns.
  • Factor: categorical variable, e.g. factor(c("A","B","A")).
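
As a quick check on the types listed above, the following sketch creates one object of each kind and inspects it. It uses only base R; the variable name flag is an illustrative addition, the rest mirror the bullets.

```r
x    <- 3.14                       # numeric (stored as a double)
n    <- 5L                         # integer, because of the L suffix
name <- "voltage"                  # character string
flag <- TRUE                       # logical (hypothetical name, for illustration)
v    <- c(1, 4, 9, 16)             # numeric vector built with c()
df   <- data.frame(x = 1:5, y = c(2, 4, 6, 8, 10))
f    <- factor(c("A", "B", "A"))   # categorical variable with two levels

class(x)     # "numeric"
class(n)     # "integer"
class(name)  # "character"
class(flag)  # "logical"
length(v)    # 4
dim(df)      # 5 2  (rows, columns)
levels(f)    # "A" "B"
```
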
Essential Functions
| Function | Purpose | Example |
|---|---|---|
| mean(x) | Arithmetic mean $\bar{x} = \frac{1}{n}\sum x_i$ | mean(c(2,4,6)) → 4 |
| sd(x) | Sample std dev $s = \sqrt{\frac{\sum(x_i-\bar{x})^2}{n-1}}$ | sd(c(2,4,6)) → 2 |
| cor(x, y) | Pearson $r$ | cor(x, y) |
| lm(y ~ x) | Linear model (regression) | fit <- lm(y ~ x, data=df) |
| t.test(x) | One-sample $t$-test | t.test(x, mu=0) |
| summary() | Summary statistics or model summary | summary(fit) |
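
To connect the formulas for $\bar{x}$ and $s$ with the built-in functions, the sketch below computes the mean, sample standard deviation, and Pearson $r$ by hand and compares each with R's result. The vector y is made-up illustration data, not part of the lesson.

```r
x <- c(2, 4, 6)
y <- c(1, 3, 8)   # hypothetical second variable, for cor() only

n     <- length(x)
x_bar <- sum(x) / n                              # arithmetic mean
s     <- sqrt(sum((x - x_bar)^2) / (n - 1))      # sample standard deviation
# Pearson r = sample covariance divided by the product of the standard deviations
r     <- sum((x - x_bar) * (y - mean(y))) / ((n - 1) * s * sd(y))

all.equal(x_bar, mean(x))    # TRUE
all.equal(s, sd(x))          # TRUE
all.equal(r, cor(x, y))      # TRUE
```
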
Base R Plotting
  • plot(x, y) — scatter/line plot. x, y: numeric vectors.
  • hist(x, breaks=n) — histogram. breaks: suggested number of bins; R treats a single number as a hint and may choose a nearby "pretty" value.
  • boxplot(x ~ group) — box plots by group.
  • barplot(heights) — bar chart.
  • abline(fit) — add regression line to current plot.
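
The base plotting calls above can be combined as in the following sketch. The data are simulated purely for illustration, and pdf(NULL) is used here only so the script also runs without a display; in an interactive session both the pdf(NULL) and dev.off() lines can be dropped.

```r
pdf(NULL)                      # null graphics device: nothing is written to disk

set.seed(1)                    # make the simulated data reproducible
x     <- 1:20
y     <- 2 * x + rnorm(20, sd = 3)          # hypothetical noisy linear data
group <- rep(c("A", "B"), each = 10)

plot(x, y, main = "Scatter with fitted line")   # scatter plot
fit <- lm(y ~ x)
abline(fit, col = "red")                        # add the regression line

h <- hist(y, breaks = 5)       # breaks is a suggestion; see length(h$counts)
boxplot(y ~ group)             # one box per group
barplot(c(A = 3, B = 7))       # bar heights taken from a named vector

invisible(dev.off())           # close the device
```
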
ggplot2 — Grammar of Graphics

The ggplot2 package builds plots in layers:

ggplot(data, aes(x=var1, y=var2)) + geom_point() + geom_smooth(method="lm")

  • aes() — maps variables to aesthetics (x, y, color, size).
  • geom_point() — scatter points.
  • geom_line() — line chart.
  • geom_histogram() — histogram.
  • geom_boxplot() — box plot.
  • geom_smooth() — fitted trend with confidence band.
  • facet_wrap(~var) — small multiples by a grouping variable.
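
A minimal sketch of layering and faceting, assuming the ggplot2 package is installed and using the built-in mtcars dataset. The choice of cyl for color and gear for the facets is arbitrary and only illustrates the idea.

```r
library(ggplot2)

# Map weight and mpg to position, cylinder count to color,
# then draw one panel per number of gears.
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  facet_wrap(~gear)
```

Typing p at the console (or calling print(p)) renders the plot; building it first and printing later makes it easy to add further layers with +.
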
Example 1 — Linear Regression in R

Fit a regression line to study hours vs. exam score.

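hours <- c(1, 2, 3, 4, 5)        # predictor: hours studied
score <- c(50, 55, 65, 70, 80)   # response: exam score
fit <- lm(score ~ hours)         # least-squares fit of score on hours
summary(fit)                     # coefficients, R-squared, p-values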

Output includes: $\hat{y} = 41.5 + 7.5x$, $R^2 \approx 0.987$, $p < 0.01$ for the slope.

Variables: hours — predictor vector; score — response vector; fit — fitted model object.
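
Rather than reading values off the printed summary, the fitted object can be queried directly. The sketch below repeats the fit and extracts the coefficients, $R^2$, and a prediction; the value of 6 hours is a hypothetical new input chosen for illustration.

```r
hours <- c(1, 2, 3, 4, 5)
score <- c(50, 55, 65, 70, 80)
fit   <- lm(score ~ hours)

coef(fit)                 # intercept 41.5, slope 7.5
summary(fit)$r.squared    # about 0.987

# Predicted score for a student who studies 6 hours (hypothetical input)
predict(fit, newdata = data.frame(hours = 6))   # 86.5
```
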

Example 2 — ggplot2 Scatter with Trend

Create a scatter plot of mpg vs. weight from the mtcars dataset with a regression line.

library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "MPG vs Weight", x = "Weight (1000 lb)", y = "MPG")

Variables: wt — car weight; mpg — miles per gallon; se=TRUE — show confidence band.
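
The line drawn by geom_smooth(method="lm") is an ordinary least-squares fit of mpg on wt. One way to inspect that fit numerically, sketched here with base R only, is to run lm() on the same variables.

```r
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)      # intercept about 37.29, slope about -5.34 mpg per 1000 lb
confint(fit)   # 95% confidence intervals for both coefficients
```
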

Example 3 — Hypothesis Test

Test if the mean reaction time is significantly different from 250 ms.

times <- c(243, 251, 260, 238, 255, 247)
t.test(times, mu=250)

Output: $t \approx -0.31$ on 5 degrees of freedom, $p \approx 0.77$. Because $p > 0.05$, we fail to reject $H_0$: the sample gives no evidence that the mean differs from 250 ms.

Variables: times — sample data; mu — hypothesized population mean; p — p-value.
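
The one-sample statistic is $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ with $n - 1$ degrees of freedom. The sketch below computes it and the two-sided p-value by hand, then checks both against t.test(); the names mu0, t_stat, p_val, and res are illustrative.

```r
times <- c(243, 251, 260, 238, 255, 247)
mu0   <- 250                                  # hypothesized mean

n      <- length(times)
t_stat <- (mean(times) - mu0) / (sd(times) / sqrt(n))
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)    # two-sided p-value

res <- t.test(times, mu = mu0)
all.equal(t_stat, unname(res$statistic))      # TRUE
all.equal(p_val, res$p.value)                 # TRUE
```
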

Practice Problems

1. What is R's assignment operator?
2. Create a vector of the first 10 positive integers.
3. What does c() stand for?
4. How do you compute the standard deviation in R?
5. What function fits a linear regression model?
6. In ggplot2, what does aes() do?
7. How do you add a regression line to a base R plot?
8. What does facet_wrap do?
9. What is CRAN?
10. How do you create a histogram in base R?
11. What type of test does t.test(x, mu=5) perform?
12. What is a data frame?
Answer Key

1. <- (also = works, but <- is conventional)

2. v <- 1:10 or v <- c(1,2,3,4,5,6,7,8,9,10)

3. Combine (concatenate elements into a vector)

4. sd(x)

5. lm() (linear model)

6. Maps data variables to visual aesthetics (x, y, color, size, etc.)

7. abline(lm(y ~ x))

8. Creates small multiples — the same plot repeated for each level of a variable

9. Comprehensive R Archive Network — the central repository for R packages

10. hist(x) or hist(x, breaks=10)

11. One-sample $t$-test: tests whether the population mean equals 5

12. A tabular data structure with named columns (like a spreadsheet or SQL table)