You understand statistics. You know how to code. This guide isn’t about either of those. It’s about mapping the concepts you already know onto R’s syntax so you can start analyzing data immediately. We’ll go from the foundational concepts to data manipulation, visualization, and a complete, guided statistical analysis.
This section covers the two concepts you must grasp. Everything else is just syntax.
In most programming languages, you loop through elements. In R, you operate on the entire collection at once. This concept is called vectorization, and it is the single most important idea for writing efficient and readable R code.
Example:
# Create a vector (a one-dimensional array)
my_numbers <- c(10, 20, 30, 40)
# In another language, you might write a loop. In R, you do this:
my_numbers * 2
Output:
[1] 20 40 60 80
Your first instinct should always be to find a function that works on the whole object, not to write a for loop. Vectorized operations are faster because the looping is performed in a lower-level, compiled language (like C or Fortran), which greatly reduces the overhead of an interpreted loop in R.
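To make the contrast concrete, here is a minimal sketch that doubles the same vector both ways and then applies a few other vectorized base R functions. The variable names (doubled_loop, doubled_vec, is_large, size_label, total) are illustrative only.

```r
my_numbers <- c(10, 20, 30, 40)

# Loop version: preallocate the result, then fill it element by element
doubled_loop <- numeric(length(my_numbers))
for (i in seq_along(my_numbers)) {
  doubled_loop[i] <- my_numbers[i] * 2
}

# Vectorized version: one expression operates on the whole vector
doubled_vec <- my_numbers * 2

# Both approaches give the same result
identical(doubled_loop, doubled_vec)

# Comparisons and many functions are vectorized too
is_large   <- my_numbers > 25                            # logical vector
size_label <- ifelse(my_numbers > 25, "large", "small")  # element-wise choice
total      <- sum(my_numbers)                            # reduces to one number
```

The loop is not wrong, it is just more code to read and slower on large vectors.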
The data.frame is the central object in R for statistics. It is a table, with columns (variables) and rows (observations). Nearly every statistical function is designed to work with data frames. Think of it as an in-memory database table or a list where every element is a vector of the same length.
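The "list of equal-length vectors" view can be checked directly. The sketch below builds a tiny data frame from made-up values (cars_df and its contents are hypothetical, for illustration only) and inspects it with base R functions.

```r
# A small, made-up data frame (hypothetical values, for illustration only)
cars_df <- data.frame(
  model = c("A", "B", "C"),
  mpg   = c(21.0, 22.8, 18.7),
  cyl   = c(6, 4, 8)
)

is.list(cars_df)         # a data frame is a list under the hood
length(cars_df)          # number of list elements, i.e. columns
sapply(cars_df, length)  # every column has the same length (the row count)
dim(cars_df)             # rows, then columns
sapply(cars_df, class)   # each column keeps its own type
```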
We will use the built-in mtcars dataset for most examples. It contains fuel consumption and design details for 32 automobiles. To load it, simply type the following into your R console:
data(mtcars)
head(mtcars): See the first 6 rows.
str(mtcars): See the structure of the object. This is the most useful inspection function, showing variable names, their data types, and the first few values.
summary(mtcars): Get quick descriptive statistics for every variable (min, max, mean, quartiles).
Subsetting is the process of extracting specific portions of your data.
Accessing Columns (Variables):
$ operator: mtcars$mpg
[[ ]] operator: mtcars[["mpg"]]
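As a quick check, the sketch below confirms that both forms return the same vector, and shows one case where [[ ]] is the more convenient choice: when the column name is stored in a variable. The names by_dollar, by_double, col_name, by_variable, and as_frame are illustrative only.

```r
data(mtcars)

by_dollar <- mtcars$mpg        # column as a plain numeric vector
by_double <- mtcars[["mpg"]]   # same vector, selected by name as a string
identical(by_dollar, by_double)

# [[ ]] also accepts a name stored in a variable, which $ does not
col_name    <- "mpg"           # hypothetical variable, for illustration
by_variable <- mtcars[[col_name]]

# Single brackets with one name keep the result as a one-column data frame
as_frame <- mtcars["mpg"]
class(by_double)
class(as_frame)
```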
Subsetting with [rows, columns] Syntax:
The square bracket [] is used for accessing data. The format is dataframe[rows_to_select, columns_to_select]. Leaving a position blank selects all.
# Get all cars with exactly 6 cylinders
mtcars[mtcars$cyl == 6, ]
# Get only the miles-per-gallon, horsepower, and weight columns
mtcars[ , c("mpg", "hp", "wt")]
# Get the mpg and hp columns for all cars with horsepower greater than 200
high_hp_cars <- mtcars[mtcars$hp > 200, c("mpg", "hp")]
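The row selector in these examples is an ordinary logical vector, so conditions can be combined with & (and) and | (or). The following is a minimal sketch of that idea; the object names are illustrative only.

```r
data(mtcars)

# The row selector is just a logical vector, one TRUE/FALSE per row
is_six_cyl <- mtcars$cyl == 6
sum(is_six_cyl)   # how many rows match

# Combine conditions with & (and) or | (or)
efficient_six   <- mtcars[mtcars$cyl == 6 & mtcars$mpg > 20, c("mpg", "cyl", "hp")]
light_or_frugal <- mtcars[mtcars$wt < 2 | mtcars$mpg > 30, ]

# Selecting a single column returns a vector unless drop = FALSE is set
mpg_vector <- mtcars[mtcars$hp > 200, "mpg"]
mpg_frame  <- mtcars[mtcars$hp > 200, "mpg", drop = FALSE]
```

The drop = FALSE argument is worth remembering when downstream code expects a data frame rather than a bare vector.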
Creating New Variables:
You can create a new column by assigning to it with the $ operator.
# Create a new variable for weight in kilograms (the original wt is in units of 1000 lbs)
# Multiply wt by 1000 to get pounds, then convert to kg (1 lb = 0.453592 kg)
mtcars$wt_kg <- (mtcars$wt * 1000) * 0.453592
# Verify the new column
head(mtcars[ , c("wt", "wt_kg")])
Effective visualization is key to any analysis. R has two major graphics systems: Base R graphics and ggplot2.
Base R graphics are great for quick, exploratory plots. Think of it like painting on a canvas: you start with a plot, then add layers on top.
# 1. Create the initial canvas with a scatterplot
plot(mtcars$wt, mtcars$mpg,
     main = "MPG vs. Vehicle Weight",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles Per Gallon",
     pch = 19,      # Use a solid circle for points
     col = "blue")  # Set the color to blue
# 2. Add a regression line from a linear model
model <- lm(mpg ~ wt, data = mtcars)
abline(model, col = "red", lwd = 2) # Add the line from the model
# 3. Add a legend
legend("topright", legend = "Regression Line", col = "red", lty = 1, lwd = 2)
This approach is imperative; you issue a sequence of commands to build the plot step-by-step.
ggplot2 is an extremely popular package that implements the “Grammar of Graphics.” This declarative approach is powerful, flexible, and often produces more professional-looking plots with less effort.
First, install and load the package. You only need to install it once.
install.packages("ggplot2") # Run this once
library(ggplot2) # Run this every time you start a new R session
The core idea is to map variables from your data frame to aesthetic properties of the plot.
data: The data frame to use.
aes(): The aesthetic mappings (e.g., which variable goes on the x-axis, which on the y-axis, which should determine color).
geom: The geometric object to draw (e.g., points, lines, bars).
# Build a similar plot with ggplot2, this time coloring the points by cylinder count
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  # as.factor() tells ggplot to treat the numeric 'cyl' variable as a
  # discrete category (4, 6, 8) rather than a continuous gradient
  geom_point(aes(color = as.factor(cyl)), size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") + # Add a linear model smoothing line
  labs(title = "MPG vs. Vehicle Weight",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon",
       color = "Cylinders") + # Change the legend title
  theme_minimal() # Apply a clean theme
Notice how we declared what we wanted (points mapped to weight and MPG, color mapped to cylinders), and ggplot2 handled the details of creating the plot, axes, and legend.
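Because a ggplot is built from a declaration rather than a sequence of drawing commands, it can be stored as an object and extended later. The sketch below assumes ggplot2 is installed; the object names (base_plot, scatter, scatter_line) are illustrative only.

```r
library(ggplot2)

# Store the data and aesthetic mappings once
base_plot <- ggplot(data = mtcars, aes(x = wt, y = mpg))

# Add different geoms to the same base to get different plots
scatter      <- base_plot + geom_point()
scatter_line <- scatter + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Printing a plot object is what actually draws it
print(scatter_line)
```

Reusing a base object like this is handy when comparing several views of the same variables.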
Let’s put everything together to answer a research question. We’ll use the built-in iris dataset.
The Question: Is there a statistically significant difference in Sepal Length across the three iris species (setosa, versicolor, and virginica)?
Step 1: Load and Inspect the Data
data(iris)
str(iris)
summary(iris)
Step 2: Visual Exploration
Before running a statistical test, we should always visualize the data. A boxplot is perfect for comparing a continuous variable across different categories.
library(ggplot2)
ggplot(data = iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(aes(fill = Species)) +
  geom_jitter(width = 0.1, alpha = 0.5) + # Add individual points for better visibility
  labs(title = "Sepal Length Distribution by Iris Species",
       x = "Species",
       y = "Sepal Length (cm)") +
  theme_light()
The plot clearly shows differences. The sepal lengths for setosa appear much shorter than for the other two, and virginica appears to be the longest. Now we can formally test this observation.
Step 3: Formal Testing (ANOVA)
The appropriate statistical test for comparing the means of three or more groups is an Analysis of Variance (ANOVA).
In R, statistical models are often specified using formula notation. The format is dependent_variable ~ independent_variable(s).
# Perform the ANOVA
# We are testing if Sepal.Length depends on Species
iris_anova <- aov(Sepal.Length ~ Species, data = iris)
# Look at the results
summary(iris_anova)
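The same formula notation is accepted by many other base R functions, and a formula is itself an object that can be stored and reused. As a small sketch, the code below uses it to compute the group means that the ANOVA compares; group_means, sepal_formula, and iris_anova_again are illustrative names.

```r
data(iris)

# The same formula, read as "Sepal.Length grouped by Species"
group_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
group_means

# A formula is an ordinary R object that can be stored and reused
sepal_formula <- Sepal.Length ~ Species   # hypothetical name, for illustration
class(sepal_formula)
all.vars(sepal_formula)

# Passing the stored formula to aov() fits the same model as before
iris_anova_again <- aov(sepal_formula, data = iris)
```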
Step 4: Interpreting the Output
The summary() command gives you a classic ANOVA table:
             Df Sum Sq Mean Sq F value Pr(>F)
Species       2  63.21  31.606   119.3 <2e-16 ***
Residuals   147  38.96   0.265
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Your stats background tells you what to look for: the F-value (119.3) and the p-value (Pr(>F)), which is extremely small (<2e-16). Since the p-value is far less than our typical alpha of 0.05, we can reject the null hypothesis of equal means and conclude that mean sepal length differs significantly among the three species. Note that ANOVA only tells us that at least one species mean differs from the others; it does not say which ones.
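If you need these numbers in code rather than on screen, they can be pulled out of the summary object. The sketch below is one way to do it; it also recomputes F as the ratio of the two mean squares in the table. The names anova_table, f_value, p_value, and f_by_hand are illustrative only.

```r
data(iris)
iris_anova <- aov(Sepal.Length ~ Species, data = iris)

# summary() returns a list; its first element is the ANOVA table
anova_table <- summary(iris_anova)[[1]]

# Row 1 is the Species term, row 2 is Residuals
f_value <- anova_table[1, "F value"]
p_value <- anova_table[1, "Pr(>F)"]

# The F statistic is the ratio of the two mean squares
f_by_hand <- anova_table[1, "Mean Sq"] / anova_table[2, "Mean Sq"]

round(f_value, 1)
p_value < 0.05
all.equal(f_value, f_by_hand)
```

Rows are indexed by position here because the row names in this table are padded with trailing spaces, which makes matching by name error-prone.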
Now, apply what you’ve learned to the mtcars dataset.
Manipulation: Create a new variable in mtcars called hp_per_cyl which is the horsepower (hp) divided by the number of cylinders (cyl).
Visualization: Using ggplot2, create a scatterplot that shows the relationship between a car’s displacement (disp) on the x-axis and its miles-per-gallon (mpg) on the y-axis. Color the points based on the number of cylinders.
The Capstone Challenge (Linear Regression): A common hypothesis is that a car’s weight is a strong predictor of its fuel efficiency.
Fit a linear model with lm(). Use the formula syntax lm(mpg ~ wt, data = mtcars).
Store the result in an object (for example, mpg_model).
Use the summary() function on your model object to see the results.
Decide whether weight is a statistically significant predictor (look at the p-value, Pr(>|t|), for the wt coefficient).
Congratulations! You’ve just covered the core workflow of R: inspecting, manipulating, visualizing, and modeling data. You’ve learned the most important concepts like vectorization and the data frame, and you’ve seen how to build powerful visualizations and run formal statistical tests.
The key to mastering R is to use it for your next project. When you get stuck, remember R has excellent built-in help (use ?function_name) and is supported by a massive, active community online. You now have the foundation to start tackling the data problems you encounter. Happy coding!