| Title: | Datasets and Supplemental Functions from 'OpenIntro' Textbooks and Labs | 
| Version: | 2.5.0 | 
| Description: | Supplemental functions and data for 'OpenIntro' resources, which includes open-source textbooks and resources for introductory statistics (https://www.openintro.org/). The package contains datasets used in our open-source textbooks along with custom plotting functions for reproducing book figures. Note that many functions and examples include color transparency; some plotting elements may not show up properly (or at all) when run in some versions of Windows operating system. | 
| License: | GPL-3 | 
| Encoding: | UTF-8 | 
| LazyData: | true | 
| LazyDataCompression: | xz | 
| RoxygenNote: | 7.3.1 | 
| Suggests: | broom, dplyr, forcats, knitr, lubridate, scales, testthat (≥ 3.0.0), tidyr, tidytext, stringr, maps | 
| Imports: | ggplot2 (≥ 2.2.1), graphics, readr, rmarkdown, tibble | 
| Depends: | R (≥ 2.10), airports, cherryblossom, usdata | 
| URL: | http://openintrostat.github.io/openintro/, https://github.com/OpenIntroStat/openintro/ | 
| BugReports: | https://github.com/OpenIntroStat/openintro/issues | 
| VignetteBuilder: | knitr | 
| Config/testthat/edition: | 3 | 
| NeedsCompilation: | no | 
| Packaged: | 2024-05-28 12:59:17 UTC; mine | 
| Author: | Mine Çetinkaya-Rundel | 
| Maintainer: | Mine Çetinkaya-Rundel <cetinkaya.mine@gmail.com> | 
| Repository: | CRAN | 
| Date/Publication: | 2024-05-31 13:40:02 UTC | 
openintro: Datasets and Supplemental Functions from 'OpenIntro' Textbooks and Labs
Description
Supplemental functions and data for 'OpenIntro' resources, which includes open-source textbooks and resources for introductory statistics (https://www.openintro.org/). The package contains datasets used in our open-source textbooks along with custom plotting functions for reproducing book figures. Note that many functions and examples include color transparency; some plotting elements may not show up properly (or at all) when run in some versions of Windows operating system.
Author(s)
Maintainer: Mine Çetinkaya-Rundel cetinkaya.mine@gmail.com (ORCID)
Authors:
David Diez david.m.diez@gmail.com
Andrew Bray andrew.bray@gmail.com
Albert Y. Kim albert.ys.kim@gmail.com (ORCID)
Ben Baumer ben.baumer@gmail.com
Chester Ismay chester.ismay@gmail.com
Nick Paterno nicholas.paterno@gmail.com
Christopher Barr
See Also
Useful links:
Report bugs at https://github.com/OpenIntroStat/openintro/issues
Create a Line That may have Arrows on the Ends
Description
Similar to lines, this function will include
endpoints that are solid points, open points, or arrows (mix-and-match
ready).
Usage
ArrowLines(
  x,
  y,
  lty = 1,
  lwd = 2.5,
  col = 1,
  length = 0.1,
  af = 3,
  cex.pch = 1.2,
  ends = c("a", "a"),
  ...
)
Arguments
x | 
 A vector of the x-coordinates of the line to be drawn.  | 
y | 
 A vector of the y-coordinates of the line to be drawn. This vector
should have the same length as that of x.  | 
lty | 
 The line type.  | 
lwd | 
 The line width.  | 
col | 
 The line and endpoint color.  | 
length | 
 If an end point is an arrow, then this specifies the sizing of
the arrow. See the   | 
af | 
 A tuning parameter for creating the arrow. Usually the default
(  | 
cex.pch | 
 Plotting character size (if open or closed point at the end).  | 
ends | 
 A character vector of length 2, where the first value
corresponds to the start of the line and the second to the end of the line.
A value of   | 
... | 
 All additional arguments are passed to the
  | 
Author(s)
David Diez
Examples
CCP(xlim = c(-6, 6), ylim = c(-6, 6), ticklabs = 2)
x <- c(-2, 0, 2, 4)
y <- c(0, 3, 0, 3)
ArrowLines(x, y, col = COL[1], ends = c("c", "c"))
points(x, y, col = COL[1], pch = 19, cex = 1.2)
CCP(xlim = c(-6, 6), ylim = c(-6, 6), ticklabs = 2)
x <- c(-3, 0, 1, 3)
y <- c(2, 1, -2, 1)
ArrowLines(x, y, col = COL[1], ends = c("c", "c"))
points(x, y, col = COL[1], pch = 19, cex = 1.2)
CCP(xlim = c(-6, 6), ylim = c(-6, 6), ticklabs = 2)
x <- seq(-2, 2, 0.01)
y <- x^2 - 3
ArrowLines(x, y, col = COL[1], ends = c("c", "c"))
x <- seq(-2, 2, 1)
y <- x^2 - 3
points(x, y, col = COL[1], pch = 19, cex = 1.2)
Build Better Looking Axis Labels for US Dollars
Description
Convert and simplify axis labels that are in US Dollars.
Usage
AxisInDollars(side, at, include.symbol = TRUE, simplify = TRUE, ...)
Arguments
side | 
 An integer specifying which side of the plot the axis is to be drawn on. The axis is placed as follows: 1 = below, 2 = left, 3 = above and 4 = right.  | 
at | 
 The points at which tick-marks are to be drawn.  | 
include.symbol | 
 Whether to include a dollar or percent symbol, where the symbol chosen depends on the function.  | 
simplify | 
 For dollars, simplify the amount to use abbreviations of
  | 
... | 
 Arguments passed to   | 
Value
The numeric locations on the axis scale at which tick marks were drawn when the plot was first drawn.
Author(s)
David Diez
See Also
buildAxis AxisInDollars AxisInPercent
Examples
x <- sample(50e6, 100)
hist(x, axes = FALSE)
AxisInDollars(1, pretty(x))
Build Better Looking Axis Labels for Percentages
Description
Convert and simplify axis labels that are in percentages.
Usage
AxisInPercent(side, at, include.symbol = TRUE, simplify = TRUE, ...)
Arguments
side | 
 An integer specifying which side of the plot the axis is to be drawn on. The axis is placed as follows: 1 = below, 2 = left, 3 = above and 4 = right.  | 
at | 
 The points at which tick-marks are to be drawn.  | 
include.symbol | 
 Whether to include a dollar or percent symbol, where the symbol chosen depends on the function.  | 
simplify | 
 For dollars, simplify the amount to use abbreviations of
  | 
... | 
 Arguments passed to   | 
Value
The numeric locations on the axis scale at which tick marks were drawn when the plot was first drawn.
Author(s)
David Diez
See Also
buildAxis AxisInDollars AxisInPercent
Examples
x <- sample(50e6, 100)
hist(x, axes = FALSE)
AxisInPercent(1, pretty(x))
Add background color to a plot
Description
Overlays a colored rectangle over the entire plotting region.
Usage
BG(col = openintro::COL[5, 9])
Arguments
col | 
 Color to overlay.  | 
Examples
Test <- function(col) {
  plot(1:7,
    col = COL[1:7], pch = 19, cex = 5,
    xlim = c(0, 8),
    ylim = c(0, 9)
  )
  BG(col)
  points(2:8, col = COL[1:7], pch = 19, cex = 5)
  text(2, 6, "Correct Color")
  text(6, 2, "Affected Color")
}
# Works well since the black color is almost fully transparent
Test(COL[5, 9])
# Works less well since transparency isn't as significant
Test(COL[5, 6])
# Pretty ugly due to overlay
Test(COL[5, 3])
# Basically useless due to heavy color gradient
Test(COL[4, 2])
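The comments above depend on how transparent the overlay color is. As a rough sketch using only base R (it makes no assumption about how COL itself is stored), an 8-digit hex color such as "#00000022", the default col of MosaicPlot in this package, carries its opacity in the last two hex digits:

```r
# Build black with alpha 0x22 (34 out of 255) using base R only.
faded_black <- rgb(0, 0, 0, alpha = 34, maxColorValue = 255)
faded_black

# Read the channels back; the "alpha" row holds the opacity on a 0-255 scale.
channels <- col2rgb(faded_black, alpha = TRUE)
opacity <- channels["alpha", 1] / 255
opacity
```

Smaller opacity values leave the points underneath the overlaid rectangle closer to their original color.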
Plot a Braces Symbol
Description
This function is not yet very flexible.
Usage
Braces(x, y, face.radians = 0, long = 1, short = 0.2, ...)
Arguments
x | 
 x-coordinate of the center of the braces.  | 
y | 
 y-coordinate of the center of the braces.  | 
face.radians | 
 Radians of where the braces should face.  For example,
the default with   | 
long | 
 The units for the long dimension of the braces.  | 
short | 
 The units for the short dimension of the braces. This must be less than or equal to half of the long dimension.  | 
... | 
 Arguments passed to   | 
Author(s)
David Diez
Examples
plot(0:1, 0:1, type = "n")
Braces(0.5, 0.5, face.radians = 3 * pi / 2)
Plot a Cartesian Coordinate Plane
Description
Create a Cartesian Coordinate Plane.
Usage
CCP(
  xlim = c(-4, 4),
  ylim = c(-4, 4),
  mar = rep(0, 4),
  length = 0.1,
  tcl = 0.007,
  xylab = FALSE,
  ticks = 1,
  ticklabs = 1,
  xpos = 1,
  ypos = 2,
  cex.coord = 1,
  cex.xylab = 1.5,
  add = FALSE
)
Arguments
xlim | 
 The x-limits for the plane (vector of length 2).  | 
ylim | 
 The y-limits for the plane (vector of length 2).  | 
mar | 
 Plotting margins.  | 
length | 
 The   | 
tcl | 
 Tick size.  | 
xylab | 
 Whether x and y should be shown next to the labels.  | 
ticks | 
 How frequently tick marks should be shown on the axes. If a vector of length 2, the first argument will correspond to the x-axis and the second to the y-axis.  | 
ticklabs | 
 How frequently tick labels should be shown on the axes. If a vector of length 2, the first argument will correspond to the x-axis and the second to the y-axis.  | 
xpos | 
 The position of the labels on the x-axis. See the   | 
ypos | 
 The position of the labels on the y-axis. See the   | 
cex.coord | 
 Inflation factor for font size of the coordinates, where
any value larger than zero is acceptable and   | 
cex.xylab | 
 Inflation factor for font size of the x and y labels, where
any value larger than zero is acceptable and   | 
add | 
 Indicate whether a new plot should be created (FALSE) or the plane should be added to an existing plot (TRUE).  | 
Author(s)
David Diez
See Also
lsegments, dlsegments,
ArrowLines
Examples
CCP()
CCP(xylab = TRUE, ylim = c(-3.5, 2), xpos = 3, cex.coord = 1)
CCP(xlim = c(-8, 8), ylim = c(-10, 6), ticklabs = c(2, 2), cex.xylab = 0.8)
OpenIntro Statistics colors
Description
These are the core colors used for the OpenIntro Statistics textbook. The blue, green, yellow, and red colors are also gray-scaled, meaning no changes are required when printing black and white copies.
Usage
COL
Format
A 7-by-13 matrix of 7 colors with thirteen fading scales: blue, green, yellow, red, black, gray, and light gray.
Source
Colors selected by OpenIntro's in-house graphic designer, Meenal Patel.
Examples
plot(1:7, 7:1,
  col = COL, pch = 19, cex = 6, xlab = "", ylab = "",
  xlim = c(0.5, 7.5), ylim = c(-2.5, 8), axes = FALSE
)
text(1:7, 7:1 + 0.7, paste("COL[", 1:7, "]", sep = ""), cex = 0.9)
points(1:7, 7:1 - 0.7, col = COL[, 2], pch = 19, cex = 6)
points(1:7, 7:1 - 1.4, col = COL[, 3], pch = 19, cex = 6)
points(1:7, 7:1 - 2.1, col = COL[, 4], pch = 19, cex = 6)
Contingency Table to Data Frame
Description
Take a 2D contingency table and create a data frame representing the individual cases.
Usage
CT2DF(x, rn = row.names(x), cn = colnames(x), dfn = c("row.var", "col.var"))
Arguments
x | 
 Contingency table as a matrix.  | 
rn | 
 Character vector of the row names.  | 
cn | 
 Character vector of the column names.  | 
dfn | 
 Character vector with 2 values for the variable representing the rows and columns.  | 
Value
A data frame with two columns.
Author(s)
David Diez
Examples
a <- matrix(
  c(459, 727, 854, 385, 99, 4198, 6245, 4821, 1634, 578),
  2,
  byrow = TRUE
)
b <-
  CT2DF(
    a,
    c("No", "Yes"),
    c("Excellent", "Very good", "Good", "Fair", "Poor"),
    c("coverage", "health_status")
  )
table(b)
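For readers who want to see what "representing the individual cases" means, here is a minimal base-R sketch of the same idea. It is not the package's implementation of CT2DF; the object names (counts, cells, cases) are made up for illustration, and the counts are copied from the example above.

```r
# Hypothetical base-R sketch; not how CT2DF() is implemented.
counts <- matrix(
  c(459, 727, 854, 385, 99, 4198, 6245, 4821, 1634, 578),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    coverage = c("No", "Yes"),
    health_status = c("Excellent", "Very good", "Good", "Fair", "Poor")
  )
)

# One row per table cell, with the cell count in the Freq column.
cells <- as.data.frame(as.table(counts))

# Repeat each cell's row Freq times to get one row per individual case.
cases <- cells[rep(seq_len(nrow(cells)), cells$Freq), c("coverage", "health_status")]
rownames(cases) <- NULL

nrow(cases)   # equals sum(counts)
table(cases)  # cross-tabulating the cases recovers the original table
```

Tabulating the expanded data frame is a quick check that no cases were lost or duplicated in the conversion.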
Plot upper tail in chi-square distribution
Description
Plot a chi-square distribution and shade the upper tail.
Usage
ChiSquareTail(
  U,
  df,
  xlim = c(0, 10),
  col = fadeColor("black", "22"),
  axes = TRUE,
  ...
)
Arguments
U | 
 Cut off for the upper tail.  | 
df | 
 Degrees of freedom.  | 
xlim | 
 Limits for the plot.  | 
col | 
 Color of the shading.  | 
axes | 
 Whether to plot an x-axis.  | 
... | 
 Currently ignored.  | 
Value
Nothing is returned from the function.
Author(s)
David Diez
Examples
data(COL)
ChiSquareTail(11.7,
  7,
  c(0, 25),
  col = COL[1]
)
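The shaded region in the example above corresponds to an upper-tail probability. Since the function itself returns nothing, one way to get the numeric area is base R's pchisq(); the sketch below reuses the cutoff and degrees of freedom from the example and is independent of this package.

```r
# Cutoff and degrees of freedom taken from the example above.
U <- 11.7
df <- 7

# Area to the right of U under the chi-square density with df degrees of freedom.
upper_tail <- pchisq(U, df = df, lower.tail = FALSE)
upper_tail

# The same area computed as a complement of the lower tail.
1 - pchisq(U, df = df)
```

Using lower.tail = FALSE is usually preferable to taking the complement, since it avoids loss of precision when the tail area is very small.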
Introduction to Modern Statistics (IMS) Colors
Description
These are the core colors used for the Introduction to Modern Statistics textbook. The blue, green, pink, yellow, and red colors are also gray-scaled, meaning no changes are required when printing black and white copies.
Usage
IMSCOL
Format
An 8-by-13 matrix of 8 colors with thirteen fading scales: blue,
green, pink, yellow, red, black, gray, and light gray.
Examples
plot(1:7, 7:1,
  col = IMSCOL, pch = 19, cex = 6, xlab = "", ylab = "",
  xlim = c(0.5, 7.5), ylim = c(-2.5, 8), axes = FALSE
)
text(1:7, 7:1 + 0.7, paste("IMSCOL[", 1:7, "]", sep = ""), cex = 0.9)
points(1:7, 7:1 - 0.7, col = IMSCOL[, 2], pch = 19, cex = 6)
points(1:7, 7:1 - 1.4, col = IMSCOL[, 3], pch = 19, cex = 6)
points(1:7, 7:1 - 2.1, col = IMSCOL[, 4], pch = 19, cex = 6)
LAhomes
Description
Data collected by Andrew Bray at Reed College on characteristics of LA Homes in 2010.
Usage
LAhomes
Format
A data frame with 1594 observations on the following 8 variables.
- city
 City where the home is located.
- type
 Type of home with levels Condo/Twh - condo or townhouse, SFR - single family residence, and NA.
- bed
 Number of bedrooms in the home.
- bath
 Number of bathrooms in the home.
- garage
 Number of cars that can be parked in the garage. Note that a value of 4 refers to 4 or more garage spaces.
- sqft
 Square footage of the home.
- pool
 Indicates if the home has a pool.
- price
 Listing price of the home.
Examples
library(ggplot2)
ggplot(LAhomes, aes(sqft, price)) +
  geom_point(alpha = 0.2) +
  theme_minimal() +
  labs(
    title = "Can we predict list price from square footage?",
    subtitle = "Homes in the Los Angeles area",
    x = "Square feet",
    y = "List price"
  )
Custom Mosaic Plot
Description
Plot a mosaic plot custom built for a particular figure.
Usage
MosaicPlot(
  formula,
  data,
  col = "#00000022",
  border = 1,
  dir = c("v", "h"),
  off = 0.01,
  cex.axis = 0.7,
  col.dir = "v",
  flip = c("v"),
  ...
)
Arguments
formula | 
 Formula describing the variable relationship.  | 
data | 
 Data frame for the variables, optional.  | 
col | 
 Colors for plotting.  | 
border | 
 Ignored.  | 
dir | 
 Ignored.  | 
off | 
 Fraction of white space between each box in the plot.  | 
cex.axis | 
 Axis label size.  | 
col.dir | 
 Direction to lay out colors.  | 
flip | 
 Whether to flip the ordering of the vertical (  | 
... | 
 Ignored.  | 
Author(s)
David Diez
Examples
data(email)
data(COL)
email$spam <- ifelse(email$spam == 0, "not\nspam", "spam")
MosaicPlot(number ~ spam, email, col = COL[1:3], off = 0.02)
Plot data and add a regression line
Description
Plot data and add a regression line.
Usage
PlotWLine(
  x,
  y,
  xlab = "",
  ylab = "",
  col = fadeColor(4, "88"),
  cex = 1.2,
  pch = 20,
  n = 4,
  nMax = 4,
  yR = 0.1,
  axes = TRUE,
  ...
)
Arguments
x | 
 Predictor variable.  | 
y | 
 Outcome variable.  | 
xlab | 
 x-axis label.  | 
ylab | 
 y-axis label.  | 
col | 
 Color of points.  | 
cex | 
 Size of points.  | 
pch | 
 Plotting character.  | 
n | 
 The preferred number of axis labels.  | 
nMax | 
 The maximum number of axis labels.  | 
yR | 
 y-limit buffer factor.  | 
axes | 
 Boolean to indicate whether or not to include axes.  | 
... | 
 Passed to   | 
Examples
PlotWLine(1:10, seq(-5, -2, length.out = 10) + rnorm(10))
Absenteeism from school in New South Wales
Description
Researchers interested in the relationship between absenteeism from school and certain demographic characteristics of children collected data from 146 randomly sampled students in rural New South Wales, Australia, in a particular school year.
Usage
absenteeism
Format
A data frame with 146 observations on the following 5 variables.
- eth
 Ethnicity, representing Aboriginal (A) or not (N).
- sex
 Gender.
- age
 Age bucket.
- lrn
 Learner status, with average learner (AL) and slow learner (SL).
- days
 Number of days absent.
Source
Venables WN, Ripley BD. 2002. Modern Applied Statistics with S. Fourth Edition. New York: Springer.
Data can also be found in the R MASS package under the dataset name quine.
Examples
library(ggplot2)
ggplot(absenteeism, aes(x = eth, y = days)) +
  geom_boxplot() +
  coord_flip()
American Community Survey, 2012
Description
Results from the US Census American Community Survey, 2012.
Usage
acs12
Format
A data frame with 2000 observations on the following 13 variables.
- income
 Annual income.
- employment
 Employment status.
- hrs_work
 Hours worked per week.
- race
 Race.
- age
 Age, in years.
- gender
 Gender.
- citizen
 Whether the person is a U.S. citizen.
- time_to_work
 Travel time to work, in minutes.
- lang
 Language spoken at home.
- married
 Whether the person is married.
- edu
 Education level.
- disability
 Whether the person is disabled.
- birth_qrtr
 The quarter of the year that the person was born, e.g. Jan thru Mar.
Source
https://www.census.gov/programs-surveys/acs
Examples
library(dplyr)
library(ggplot2)
library(broom)
# employed only
acs12_emp <- acs12 |>
  filter(
    age >= 30, age <= 60,
    employment == "employed",
    income > 0
  )
# linear model
ggplot(acs12_emp, mapping = aes(x = age, y = income)) +
  geom_point() +
  geom_smooth(method = "lm")
lm(income ~ age, data = acs12_emp) |>
  tidy()
# log-transformed model
ggplot(acs12_emp, mapping = aes(x = age, y = log(income))) +
  geom_point() +
  geom_smooth(method = "lm")
lm(log(income) ~ age, data = acs12_emp) |>
  tidy()
Age at first marriage of 5,534 US women.
Description
Age at first marriage of 5,534 US women who responded to the National Survey of Family Growth (NSFG) conducted by the CDC in the 2006-2010 cycle.
Usage
age_at_mar
Format
A data frame with 5,534 observations and 1 variable.
- age
 Age at first marriage.
Source
National Survey of Family Growth, 2006-2010 cycle, https://www.cdc.gov/nchs/nsfg/nsfg_2006_2010_puf.htm.
Examples
library(ggplot2)
ggplot(age_at_mar, mapping = aes(x = age)) +
  geom_histogram(binwidth = 3) +
  labs(
    x = "Age", y = "Count", title = "Age at first marriage, US Women",
    subtitle = "Source: National Survey of Family Growth, 2006 - 2010"
  )
Housing prices in Ames, Iowa
Description
This data set contains information from the Ames Assessor's Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010.
Usage
ames
Format
A tbl_df with 2930 rows and 82 variables:
- Order
 Observation number.
- PID
 Parcel identification number - can be used with city web site for parcel review.
- area
 Above grade (ground) living area square feet.
- price
 Sale price in USD.
- MS.SubClass
 Identifies the type of dwelling involved in the sale.
- MS.Zoning
 Identifies the general zoning classification of the sale.
- Lot.Frontage
 Linear feet of street connected to property.
- Lot.Area
 Lot size in square feet.
- Street
 Type of road access to property.
- Alley
 Type of alley access to property.
- Lot.Shape
 General shape of property.
- Land.Contour
 Flatness of the property.
- Utilities
 Type of utilities available.
- Lot.Config
 Lot configuration.
- Land.Slope
 Slope of property.
- Neighborhood
 Physical locations within Ames city limits (map available).
- Condition.1
 Proximity to various conditions.
- Condition.2
 Proximity to various conditions (if more than one is present).
- Bldg.Type
 Type of dwelling.
- House.Style
 Style of dwelling.
- Overall.Qual
 Rates the overall material and finish of the house.
- Overall.Cond
 Rates the overall condition of the house.
- Year.Built
 Original construction date.
- Year.Remod.Add
 Remodel date (same as construction date if no remodeling or additions).
- Roof.Style
 Type of roof.
- Roof.Matl
 Roof material.
- Exterior.1st
 Exterior covering on house.
- Exterior.2nd
 Exterior covering on house (if more than one material).
- Mas.Vnr.Type
 Masonry veneer type.
- Mas.Vnr.Area
 Masonry veneer area in square feet.
- Exter.Qual
 Evaluates the quality of the material on the exterior.
- Exter.Cond
 Evaluates the present condition of the material on the exterior.
- Foundation
 Type of foundation.
- Bsmt.Qual
 Evaluates the height of the basement.
- Bsmt.Cond
 Evaluates the general condition of the basement.
- Bsmt.Exposure
 Refers to walkout or garden level walls.
- BsmtFin.Type.1
 Rating of basement finished area.
- BsmtFin.SF.1
 Type 1 finished square feet.
- BsmtFin.Type.2
 Rating of basement finished area (if multiple types).
- BsmtFin.SF.2
 Type 2 finished square feet.
- Bsmt.Unf.SF
 Unfinished square feet of basement area.
- Total.Bsmt.SF
 Total square feet of basement area.
- Heating
 Type of heating.
- Heating.QC
 Heating quality and condition.
- Central.Air
 Central air conditioning.
- Electrical
 Electrical system.
- X1st.Flr.SF
 First Floor square feet.
- X2nd.Flr.SF
 Second floor square feet.
- Low.Qual.Fin.SF
 Low quality finished square feet (all floors).
- Bsmt.Full.Bath
 Basement full bathrooms.
- Bsmt.Half.Bath
 Basement half bathrooms.
- Full.Bath
 Full bathrooms above grade.
- Half.Bath
 Half baths above grade.
- Bedroom.AbvGr
 Bedrooms above grade (does NOT include basement bedrooms).
- Kitchen.AbvGr
 Kitchens above grade.
- Kitchen.Qual
 Kitchen quality.
- TotRms.AbvGrd
 Total rooms above grade (does not include bathrooms).
- Functional
 Home functionality (Assume typical unless deductions are warranted).
- Fireplaces
 Number of fireplaces.
- Fireplace.Qu
 Fireplace quality.
- Garage.Type
 Garage location.
- Garage.Yr.Blt
 Year garage was built.
- Garage.Finish
 Interior finish of the garage.
- Garage.Cars
 Size of garage in car capacity.
- Garage.Area
 Size of garage in square feet.
- Garage.Qual
 Garage quality.
- Garage.Cond
 Garage condition.
- Paved.Drive
 Paved driveway.
- Wood.Deck.SF
 Wood deck area in square feet.
- Open.Porch.SF
 Open porch area in square feet.
- Enclosed.Porch
 Enclosed porch area in square feet.
- X3Ssn.Porch
 Three season porch area in square feet.
- Screen.Porch
 Screen porch area in square feet.
- Pool.Area
 Pool area in square feet.
- Pool.QC
 Pool quality.
- Fence
 Fence quality.
- Misc.Feature
 Miscellaneous feature not covered in other categories.
- Misc.Val
 Dollar value of miscellaneous feature.
- Mo.Sold
 Month Sold (MM).
- Yr.Sold
 Year Sold (YYYY).
- Sale.Type
 Type of sale.
- Sale.Condition
 Condition of sale.
Source
De Cock, Dean. "Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project." Journal of Statistics Education 19.3 (2011).
Acute Myocardial Infarction (Heart Attack) Events
Description
This dataset is simulated but contains realistic occurrences of AMI in NY City.
Usage
ami_occurrences
Format
A data frame with 365 observations on the following variable.
- ami
 Number of daily occurrences of heart attacks in NY City.
Examples
library(ggplot2)
ggplot(ami_occurrences, mapping = aes(x = ami)) +
  geom_bar() +
  labs(
    x = "Acute Myocardial Infarction events",
    y = "Count",
    title = "Acute Myocardial Infarction events in NYC"
  )
Pre-existing conditions in 92 children
Description
Pre-existing medical conditions of 92 children involved in a study on the optimal duration of antibiotic use in treatment of tracheitis, which is an upper respiratory infection.
Usage
antibiotics
Format
A data frame with 92 observations, each representing a child, on the following variable.
- condition
 Pre-existing medical condition.
Examples
library(ggplot2)
ggplot(antibiotics, aes(x = condition)) +
  geom_bar() +
  labs(
    x = "Condition", y = "Count",
    title = "Pre-existing conditions of children",
    subtitle = "in antibiotic use study"
  ) +
  coord_flip()
Male and female births in London
Description
Arbuthnot's data describes male and female christenings (births) for London from 1629-1710.
Usage
arbuthnot
Format
A tbl_df with 82 rows and 3 variables:
- year
 year, ranging from 1629 to 1710
- boys
 number of male christenings (births)
- girls
 number of female christenings (births)
Details
John Arbuthnot (1710) used these time series data to carry out the first known significance test. During every one of the 82 years, there were more male christenings than female christenings. As Arbuthnot wondered, we might also wonder if this could be due to chance, or whether it meant the birth ratio was not actually 1:1.
Source
These data are excerpted from the Arbuthnot dataset in the HistData package.
Examples
library(ggplot2)
library(tidyr)
# All births
ggplot(arbuthnot, aes(x = year, y = boys + girls, group = 1)) +
  geom_line()
# Boys and girls
arbuthnot |>
  pivot_longer(cols = -year, names_to = "sex", values_to = "n") |>
  ggplot(aes(x = year, y = n, color = sex, group = sex)) +
  geom_line()
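The Details above describe a sign-test style argument: if a male-majority year and a female-majority year were equally likely, 82 male-majority years out of 82 would be extremely improbable. A minimal sketch of that calculation with base R's binom.test() follows. The counts are typed in from the text rather than computed from the data, and treating the years as independent fair coin flips is an assumption of the sketch, not a claim about Arbuthnot's exact procedure.

```r
# Counts taken from the Details text: boys outnumbered girls in all 82 years.
n_years <- 82
n_male_majority <- 82

# Chance of 82 "heads" in 82 fair, independent coin flips.
p_by_hand <- 0.5^n_years
p_by_hand

# The same one-sided probability via an exact binomial test.
sign_test <- binom.test(n_male_majority, n_years, p = 0.5, alternative = "greater")
sign_test$p.value
```

With the package loaded, the count of male-majority years could instead be obtained as sum(arbuthnot$boys > arbuthnot$girls).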
How important is it to ask pointed questions?
Description
In this experiment, each individual was asked to be a seller of an iPod (a product commonly used to store music on before smart phones...). Each participant received $10 + 5% of the sale price for participating. The iPod they were selling had frozen twice in the past inexplicably but otherwise worked fine. The prospective buyer starts off and then asks one of three final questions, depending on the seller's treatment group.
Usage
ask
Format
A data frame with 219 observations on the following 3 variables.
- question_class
 The type of question: general, pos_assumption, and neg_assumption.
- question
 The question corresponding to the question_class.
- response
 The classified response from the seller, either disclose or hide.
Details
The three possible questions:
General: What can you tell me about it?
Positive Assumption: It doesn't have any problems, does it?
Negative Assumption: What problems does it have?
The outcome variable is whether or not the participant discloses or hides the problem with the iPod.
Source
Minson JA, Ruedy NE, Schweitzer ME. There is such a thing as a stupid question: Question disclosure in strategic communication.
Examples
library(dplyr)
library(ggplot2)
# Distribution of responses based on question type
ask |>
  count(question_class, response)
# Visualize relative frequencies of responses based on question type
ggplot(ask, aes(x = question_class, fill = response)) +
  geom_bar(position = "fill")
# Perform chi-square test
(test <- chisq.test(table(ask$question_class, ask$response)))
# Check the test's assumption around sufficient expected observations
# per table cell.
test$expected
Simulated data for association plots
Description
Simulated dataset.
Usage
association
Format
A data frame with 121 observations on the following 15 variables.
- x1
 a numeric vector
- x2
 a numeric vector
- x3
 a numeric vector
- y1
 a numeric vector
- y2
 a numeric vector
- y3
 a numeric vector
- y4
 a numeric vector
- y5
 a numeric vector
- y6
 a numeric vector
- y7
 a numeric vector
- y8
 a numeric vector
- y9
 a numeric vector
- y10
 a numeric vector
- y11
 a numeric vector
- y12
 a numeric vector
Examples
library(ggplot2)
ggplot(association, aes(x = x1, y = y1)) +
  geom_point()
ggplot(association, aes(x = x2, y = y4)) +
  geom_point()
ggplot(association, aes(x = x3, y = y7)) +
  geom_point()
Eye color of couples
Description
Eye colors of male and female partners.
Usage
assortative_mating
Format
A data frame with 204 observations on the following 2 variables.
- self_male
 a factor with levels blue, brown, and green
- partner_female
 a factor with levels blue, brown, and green
Source
B. Laeng et al. Why do blue-eyed men prefer women with the same eye color? In: Behavioral Ecology and Sociobiology 61.3 (2007), pp. 371-384.
Examples
data(assortative_mating)
table(assortative_mating)
Cardiovascular problems for two types of Diabetes medicines
Description
A comparison of cardiovascular problems for Rosiglitazone and Pioglitazone.
Usage
avandia
Format
A data frame with 227571 observations on the following 2 variables.
- treatment
 a factor with levels Pioglitazone and Rosiglitazone
- cardiovascular_problems
 a factor with levels no and yes
Source
D.J. Graham et al. Risk of acute myocardial infarction, stroke, heart failure, and death in elderly Medicare patients treated with rosiglitazone or pioglitazone. In: JAMA 304.4 (2010), p. 411. issn: 0098-7484.
Examples
table(avandia)
The Child Health and Development Studies
Description
The Child Health and Development Studies investigate a range of topics. One
study, in particular, considered all pregnancies between 1960 and 1967 among
women in the Kaiser Foundation Health Plan in the San Francisco East Bay
area. We do not have ideal provenance for these data. For a better documented
and more recent dataset on a similar topic with similar variables,
see births14. Additionally, the Gestation dataset in the mosaicData
package contains similar data.
Usage
babies
Format
A data frame with 1236 rows and 8 variables:
- case
 id number
- bwt
 birthweight, in ounces
- gestation
 length of gestation, in days
- parity
 binary indicator for a first pregnancy (0 = first pregnancy)
- age
 mother's age in years
- height
 mother's height in inches
- weight
 mother's weight in pounds
- smoke
 binary indicator for whether the mother smokes
Source
These data come from Child Health and Development Studies.
Crawling age
Description
Crawling age of babies along with the average outdoor temperature at 6 months of age.
Usage
babies_crawl
Format
A data frame with 12 observations on the following 5 variables.
- birth_month
 A factor with levels corresponding to months
- avg_crawling_age
 a numeric vector
- sd
 a numeric vector
- n
 a numeric vector
- temperature
 a numeric vector
Source
J.B. Benson. Season of birth and onset of locomotion: Theoretical and methodological implications. In: Infant behavior and development 16.1 (1993), pp. 69-81. issn: 0163-6383.
Examples
library(ggplot2)
ggplot(babies_crawl, aes(x = temperature, y = avg_crawling_age)) +
  geom_point() +
  labs(x = "Temperature", y = "Average crawling age")
Beer and blood alcohol content
Description
Here we examine data from sixteen student volunteers at Ohio State University who each drank a randomly assigned number of cans of beer.
Usage
bac
Format
A data frame with 16 observations on the following 3 variables.
- student
 a numeric vector
- beers
 a numeric vector
- bac
 a numeric vector
Source
J. Malkevitch and L.M. Lesser. For All Practical Purposes: Mathematical Literacy in Today's World. WH Freeman & Co, 2008. The data origin is given in the Electronic Encyclopedia of Statistical Examples and Exercises, 1992.
Examples
library(ggplot2)
ggplot(bac, aes(x = beers, y = bac)) +
  geom_point() +
  labs(x = "Number of beers", y = "Blood alcohol content")
Lifespan of ball bearings
Description
A simulated dataset on lifespan of ball bearings.
Usage
ball_bearing
Format
A data frame with 75 observations on the following variable.
- life_span
 Lifespan of ball bearings (in hours).
Source
Simulated data.
Examples
library(ggplot2)
ggplot(ball_bearing, aes(x = life_span)) +
  geom_histogram(binwidth = 1)
qqnorm(ball_bearing$life_span)
Body measurements of 507 physically active individuals.
Description
Body girth measurements and skeletal diameter measurements, as well as age, weight, height and gender, are given for 507 physically active individuals - 247 men and 260 women. These data can be used to provide statistics students practice in the art of data analysis. Such analyses range from simple descriptive displays to more complicated multivariate analyses such as multiple regression and discriminant analysis.
Usage
bdims
Format
A data frame with 507 observations on the following 25 variables.
- bia_di
 A numerical vector, respondent's biacromial diameter in centimeters.
- bii_di
 A numerical vector, respondent's biiliac diameter (pelvic breadth) in centimeters.
- bit_di
 A numerical vector, respondent's bitrochanteric diameter in centimeters.
- che_de
 A numerical vector, respondent's chest depth in centimeters, measured between spine and sternum at nipple level, mid-expiration.
- che_di
 A numerical vector, respondent's chest diameter in centimeters, measured at nipple level, mid-expiration.
- elb_di
 A numerical vector, respondent's elbow diameter in centimeters, measured as sum of two elbows.
- wri_di
 A numerical vector, respondent's wrist diameter in centimeters, measured as sum of two wrists.
- kne_di
 A numerical vector, respondent's knee diameter in centimeters, measured as sum of two knees.
- ank_di
 A numerical vector, respondent's ankle diameter in centimeters, measured as sum of two ankles.
- sho_gi
 A numerical vector, respondent's shoulder girth in centimeters, measured over deltoid muscles.
- che_gi
 A numerical vector, respondent's chest girth in centimeters, measured at nipple line in males and just above breast tissue in females, mid-expiration.
- wai_gi
 A numerical vector, respondent's waist girth in centimeters, measured at the narrowest part of torso below the rib cage as average of contracted and relaxed position.
- nav_gi
 A numerical vector, respondent's navel (abdominal) girth in centimeters, measured at umbilicus and iliac crest using iliac crest as a landmark.
- hip_gi
 A numerical vector, respondent's hip girth in centimeters, measured at level of bitrochanteric diameter.
- thi_gi
 A numerical vector, respondent's thigh girth in centimeters, measured below gluteal fold as the average of right and left girths.
- bic_gi
 A numerical vector, respondent's bicep girth in centimeters, measured when flexed as the average of right and left girths.
- for_gi
 A numerical vector, respondent's forearm girth in centimeters, measured when extended, palm up as the average of right and left girths.
- kne_gi
 A numerical vector, respondent's knee girth in centimeters, measured as average of right and left girths.
- cal_gi
 A numerical vector, respondent's calf maximum girth in centimeters, measured as average of right and left girths.
- ank_gi
 A numerical vector, respondent's ankle minimum girth in centimeters, measured as average of right and left girths.
- wri_gi
 A numerical vector, respondent's wrist minimum girth in centimeters, measured as average of right and left girths.
- age
 A numerical vector, respondent's age in years.
- wgt
 A numerical vector, respondent's weight in kilograms.
- hgt
 A numerical vector, respondent's height in centimeters.
- sex
 A categorical vector, 1 if the respondent is male, 0 if female.
Source
Heinz G, Peterson LJ, Johnson RW, Kerk CJ. 2003. Exploring Relationships in Body Dimensions. Journal of Statistics Education 11(2).
Examples
library(ggplot2)
ggplot(bdims, aes(x = hgt)) +
  geom_histogram(binwidth = 5)
ggplot(bdims, aes(x = hgt, y = wgt)) +
  geom_point() +
  labs(x = "Height", y = "Weight")
ggplot(bdims, aes(x = hgt, y = sho_gi)) +
  geom_point() +
  labs(x = "Height", y = "Shoulder girth")
ggplot(bdims, aes(x = hgt, y = hip_gi)) +
  geom_point() +
  labs(x = "Height", y = "Hip girth")
Efficacy of Pfizer-BioNTech COVID-19 vaccine on adolescents
Description
On March 31, 2021, Pfizer and BioNTech announced that "in a Phase 3 trial in adolescents 12 to 15 years of age with or without prior evidence of SARS-CoV-2 infection, the Pfizer-BioNTech COVID-19 vaccine BNT162b2 demonstrated 100% efficacy and robust antibody responses, exceeding those recorded earlier in vaccinated participants aged 16 to 25 years old, and was well tolerated." These results are from a Phase 3 trial in 2,260 adolescents 12 to 15 years of age in the United States. In the trial, 18 cases of COVID-19 were observed in the placebo group (n = 1,129) versus none in the vaccinated group (n = 1,131).
Usage
biontech_adolescents
Format
A data frame with 2260 observations on the following 2 variables.
- group
 Study group: vaccine (Pfizer-BioNTech COVID-19 vaccine administered) or placebo.
- outcome
 Study outcome: COVID-19 or no COVID-19.
Source
"Pfizer-Biontech Announce Positive Topline Results Of Pivotal Covid-19 Vaccine Study In Adolescents". March 31, 2021. (Retrieved April 25, 2021.)
Examples
library(dplyr)
library(ggplot2)
biontech_adolescents |>
  count(group, outcome)
ggplot(biontech_adolescents, aes(y = group, fill = outcome)) +
  geom_bar()
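The counts quoted in the Description are enough to reproduce the headline efficacy figure without loading the data. The following is a minimal sketch, assuming efficacy is taken as one minus the ratio of attack rates (vaccine over placebo); the trial's own analysis may define it differently, and the object names are illustrative only.

```r
# Case counts and group sizes as quoted in the Description
cases <- c(placebo = 18, vaccine = 0)
n     <- c(placebo = 1129, vaccine = 1131)

# Attack rate (proportion of participants with COVID-19) in each group
attack_rate <- cases / n

# Assumed definition: efficacy = 1 - (vaccine attack rate / placebo attack rate)
efficacy <- 1 - attack_rate[["vaccine"]] / attack_rate[["placebo"]]
efficacy
```

With no cases in the vaccinated group, the ratio of attack rates is zero, which is where the reported 100% comes from.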
Aircraft-Wildlife Collisions
Description
A collection of all collisions between aircraft and wildlife that were reported to the US Federal Aviation Administration between 1990 and 1997, with details on the circumstances of the collision.
Usage
birds
Format
A data frame with 19302 observations on the following 17 variables.
- opid
 Three letter identification code for the operator (carrier) of the aircraft.
- operator
 Name of the aircraft operator.
- atype
 Make and model of aircraft.
- remarks
 Verbal remarks regarding the collision.
- phase_of_flt
 Phase of the flight during which the collision occurred: Approach, Climb, Descent, En Route, Landing Roll, Parked, Take-off run, Taxi.
- ac_mass
 Mass of the aircraft classified as 2250 kg or less (1), 2251-5700 kg (2), 5701-27000 kg (3), 27001-272000 kg (4), above 272000 kg (5).
- num_engs
 Number of engines on the aircraft.
- date
 Date of the collision (MM/DD/YYYY).
- time_of_day
 Light conditions: Dawn, Day, Dusk, Night.
- state
 Two letter abbreviation of the US state in which the collision occurred.
- height
 Feet above ground level.
- speed
 Knots (indicated air speed).
- effect
 Effect on flight: Aborted Take-off, Engine Shut Down, None, Other, Precautionary Landing.
- sky
 Type of cloud cover, if any: No Cloud, Overcast, Some Cloud.
- species
 Common name for bird or other wildlife.
- birds_seen
 Number of birds/wildlife seen by pilot: 1, 2-10, 11-100, Over 100.
- birds_struck
 Number of birds/wildlife struck: 0, 1, 2-10, 11-100, Over 100.
Details
The FAA National Wildlife Strike Database contains strike reports that are voluntarily reported to the FAA by pilots, airlines, airports and others. Current research indicates that only about 20% of strikes are reported. Wildlife strike reporting is not uniform, as some organizations have more robust voluntary reporting procedures. Because of variations in reporting, users are cautioned that comparisons between individual airports or airlines may be misleading.
Source
Aircraft Wildlife Strike Data: Search Tool - FAA Wildlife Strike Database. Available at https://datahub.transportation.gov/Aviation/Aircraft-Wildlife-Strike-Data-Search-Tool-FAA-Wild/jhay-dgxy. Retrieval date: Feb 4, 2012.
Examples
library(dplyr)
library(ggplot2)
library(forcats)
library(tidyr)
# Phase of the flight during which the collision occurred, tabular
birds |>
  count(phase_of_flt, sort = TRUE)
# Phase of the flight during which the collision occurred, barplot
ggplot(birds, aes(y = fct_infreq(phase_of_flt))) +
  geom_bar() +
  labs(x = "Count", y = "Phase of flight")
# Height summary statistics
summary(birds$height)
# Phase of flight vs. effect of crash
birds |>
  drop_na(phase_of_flt, effect) |>
  ggplot(aes(y = phase_of_flt, fill = effect)) +
  geom_bar(position = "fill") +
  labs(x = "Proportion", y = "Phase of flight", fill = "Effect")
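The ac_mass codes and the MM/DD/YYYY dates described in the Format section are easier to work with once decoded. A minimal sketch in base R follows. The three example records are made up for illustration and are not rows from birds, and the date parsing assumes the column holds strings in exactly the documented format.

```r
# Hypothetical records using the coding described under Format
example <- data.frame(
  ac_mass = c(1, 4, 3),
  date    = c("09/30/1990", "11/29/1993", "08/13/1996")
)

# Labels for the five mass classes, in code order 1 to 5
mass_labels <- c(
  "2250 kg or less", "2251-5700 kg", "5701-27000 kg",
  "27001-272000 kg", "above 272000 kg"
)
example$ac_mass_label <- factor(example$ac_mass, levels = 1:5, labels = mass_labels)

# Parse MM/DD/YYYY strings into Date objects
example$date_parsed <- as.Date(example$date, format = "%m/%d/%Y")
example
```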
North Carolina births, 100 cases
Description
Data on a random sample of 100 births for babies in North Carolina where the mother was not a smoker and another 50 where the mother was a smoker.
Usage
births
Format
A data frame with 150 observations on the following 9 variables.
- f_age
 Father's age.
- m_age
 Mother's age.
- weeks
 Weeks at which the mother gave birth.
- premature
 Indicates whether the baby was premature or not.
- visits
 Number of hospital visits.
- gained
 Weight gained by mother.
- weight
 Birth weight of the baby.
- sex_baby
 Gender of the baby.
- smoke
 Whether or not the mother was a smoker.
Source
Birth records released by North Carolina in 2004.
See Also
We do not have ideal provenance for these data. For a better documented and more recent dataset on a similar topic with similar variables, see births14. Additionally, ncbirths also contains similar data.
Examples
library(ggplot2)
ggplot(births, aes(x = smoke, y = weight)) +
  geom_boxplot()
US births
Description
Every year, the US releases to the public a large dataset containing information on births recorded in the country. This dataset has been of interest to medical researchers who are studying the relation between habits and practices of expectant mothers and the birth of their children. This is a random sample of 1,000 cases from the dataset released in 2014.
Usage
births14
Format
A data frame with 1,000 observations on the following 13 variables.
- fage
 Father's age in years.
- mage
 Mother's age in years.
- mature
 Maturity status of mother.
- weeks
 Length of pregnancy in weeks.
- premie
 Whether the birth was classified as premature (premie) or full-term.
- visits
 Number of hospital visits during pregnancy.
- gained
 Weight gained by mother during pregnancy in pounds.
- weight
 Weight of the baby at birth in pounds.
- lowbirthweight
 Whether baby was classified as low birthweight (low) or not (not low).
- sex
 Sex of the baby, female or male.
- habit
 Status of the mother as a nonsmoker or a smoker.
- marital
 Whether mother is married or not married at birth.
- whitemom
 Whether mom is white or not white.
Source
United States Department of Health and Human Services. Centers for Disease Control and Prevention. National Center for Health Statistics. Natality Detail File, 2014 United States. Inter-university Consortium for Political and Social Research, 2016-10-07. doi:10.3886/ICPSR36461.v1.
Examples
library(ggplot2)
ggplot(births14, aes(x = habit, y = weight)) +
  geom_boxplot() +
  labs(x = "Smoking status of mother", y = "Birth weight of baby (in lbs)")
ggplot(births14, aes(x = whitemom, y = visits)) +
  geom_boxplot() +
  labs(x = "Mother's race", y = "Number of doctor visits during pregnancy")
ggplot(births14, aes(x = mature, y = gained)) +
  geom_boxplot() +
  labs(x = "Mother's age category", y = "Weight gained during pregnancy")
Blizzard Employee Voluntary Salary Info.
Description
Employee-generated anonymous survey of salary information.
Usage
blizzard_salary
Format
A data frame with 466 rows and 9 variables.
- timestamp
 Time the data was entered.
- status
 Specifies employment status.
- current_title
 Current job title.
- current_salary
 Current salary (in USD).
- salary_type
 Frequency with levels year, hour, week.
- percent_incr
 Raise given July 2020.
- other_info
 Other information submitted by employee.
- location
 Current office of employment.
- performance_rating
 Most recent review performance rating.
Source
Bloomberg - Blizzard workers share salaries in revolt over wage disparities.
Examples
library(ggplot2)
library(dplyr)
plot_data <- blizzard_salary |>
  mutate(annual_salary = case_when(
    salary_type == "week" ~ current_salary * 52,
    salary_type == "hour" ~ current_salary * 40 * 52,
    TRUE ~ current_salary
  ))
ggplot(plot_data, aes(annual_salary)) +
  geom_histogram(binwidth = 25000, color = "white") +
  labs(
    title = "Current Salary of Blizzard Employees",
    x = "Salary",
    y = "Number of Employees"
  )
Sample of books on a shelf
Description
Simulated dataset.
Usage
books
Format
A data frame with 95 observations on the following 2 variables.
- type
 a factor with levels fiction and nonfiction
- format
 a factor with levels hardcover and paperback
Examples
table(books)
Box plot
Description
An alternative to boxplot. Formulas (such as y ~ x) are not accepted. Instead, the
second argument, fact, is used to split the data.
Usage
boxPlot(
  x,
  fact = NULL,
  horiz = FALSE,
  width = 2/3,
  lwd = 1,
  lcol = "black",
  medianLwd = 2,
  pch = 20,
  pchCex = 1.8,
  col = grDevices::rgb(0, 0, 0, 0.25),
  add = FALSE,
  key = NULL,
  axes = TRUE,
  xlab = "",
  ylab = "",
  xlim = NULL,
  ylim = NULL,
  na.rm = TRUE,
  ...
)
Arguments
x | 
 A numerical vector.  | 
fact | 
 A character or factor vector defining the grouping for side-by-side box plots.  | 
horiz | 
 If   | 
width | 
 The width of the boxes in the plot. Value between   | 
lwd | 
 Width of lines used in box and whiskers.  | 
lcol | 
 Color of the box, median, and whiskers.  | 
medianLwd | 
 Width of the line marking the median.  | 
pch | 
 Plotting character of outliers.  | 
pchCex | 
 Size of outlier character.  | 
col | 
 Color of outliers.  | 
add | 
 If   | 
key | 
 The order in which to display the side-by-side boxplots. If
locations are specified in   | 
axes | 
 Whether to plot the axes.  | 
xlab | 
 Label for the x axis.  | 
ylab | 
 Label for the y axis.  | 
xlim | 
 Limits for the x axis.  | 
ylim | 
 Limits for the y axis.  | 
na.rm | 
 Indicate whether   | 
... | 
 Additional arguments to plot.  | 
Author(s)
David Diez
See Also
histPlot, dotPlot,
densityPlot
Examples
# univariate
boxPlot(email$num_char, ylab = "Number of characters in emails")
# bivariate
boxPlot(email$num_char, email$spam,
  xlab = "Spam",
  ylab = "Number of characters in emails"
)
# faded outliers
boxPlot(email$num_char, email$spam,
  xlab = "Spam",
  ylab = "Number of characters in emails",
  col = fadeColor("black", 18)
)
# horizontal plots
boxPlot(email$num_char, email$spam,
  horiz = TRUE,
  xlab = "Spam",
  ylab = "Number of characters in emails",
  col = fadeColor("black", 18)
)
# bivariate relationships where categorical data have more than 2 levels
boxPlot(email$num_char, email$image,
  horiz = TRUE,
  xlab = "Number of attached images",
  ylab = "Number of characters in emails",
  col = fadeColor("black", 18)
)
# key can be used to restrict to only the desired groups
boxPlot(email$num_char, email$image,
  horiz = TRUE, key = c(0, 1, 2),
  xlab = "Number of attached images (limited to 0, 1, 2)",
  ylab = "Number of characters in emails",
  col = fadeColor("black", 18)
)
# combine boxPlot and dotPlot
boxPlot(tips$tip, tips$day,
  horiz = TRUE, key = c("Tuesday", "Friday")
)
dotPlot(tips$tip, tips$day,
  add = TRUE, at = 1:2 + 0.05,
  key = c("Tuesday", "Friday")
)
# adding a box
boxPlot(email$num_char[email$spam == 0], xlim = c(0, 3))
boxPlot(email$num_char[email$spam == 1], add = 2, axes = FALSE)
axis(1, at = 1:2, labels = c(0, 1))
boxPlot(email$num_char[email$spam == 0], ylim = c(0, 3), horiz = TRUE)
boxPlot(email$num_char[email$spam == 1], add = 2, horiz = TRUE, axes = FALSE)
axis(2, at = 1:2, labels = c(0, 1))
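Since boxPlot does not take a formula, the grouping variable is passed separately through fact. The sketch below contrasts the two calling conventions on simulated data. The vectors x and g are made up for illustration, and the call assumes the openintro package is installed.

```r
library(openintro)

set.seed(1)
# Simulated measurements for two hypothetical groups
x <- c(rnorm(50, mean = 10), rnorm(50, mean = 12))
g <- rep(c("a", "b"), each = 50)

# Base R: grouping expressed through a formula
boxplot(x ~ g)

# openintro: grouping passed as the second argument
boxPlot(x, fact = g)
```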
Axis function substitute
Description
The function buildAxis is built to provide more control of the number
of labels on the axis. This function is still under development.
Usage
buildAxis(side, limits, n, nMin = 2, nMax = 10, extend = 2, eps = 10^-12, ...)
Arguments
side | 
 The side of the plot where to add the axis.  | 
limits | 
 Either lower and upper limits on the axis or a dataset.  | 
n | 
 The preferred number of axis labels.  | 
nMin | 
 The minimum number of axis labels.  | 
nMax | 
 The maximum number of axis labels.  | 
extend | 
 How far the axis may extend beyond   | 
eps | 
 The smallest increment allowed.  | 
... | 
 Arguments passed to   | 
Details
The primary reason behind building this function was to allow a plot to be
created with similar features but with different datasets. For instance, if
a set of code was written for one dataset and the function axis had
been utilized with pre-specified values, the axis may not match the plot of
a new set of data. The function buildAxis addresses this problem by
allowing the number of axis labels to be specified and controlled.
The axis is built by assigning penalties to a variety of potential axis setups, ranking them based on these penalties and then selecting the axis with the best score.
Value
A vector of the axis plotted.
Author(s)
David Diez
See Also
histPlot, dotPlot,
boxPlot, densityPlot
Examples
# ===> 0 <===#
limits <- rnorm(100, 605490, 10)
hist(limits, axes = FALSE)
buildAxis(1, limits, 2, nMax = 4)
# ===> 1 <===#
x <- seq(0, 500, 10)
y <- 8 * x + rnorm(length(x), mean = 6000, sd = 200)
plot(x, y, axes = FALSE)
buildAxis(1, limits = x, n = 5)
buildAxis(2, limits = y, n = 3)
# ===> 2 <===#
x <- 9528412 + seq(0, 200, 10)
y <- 8 * x + rnorm(length(x), mean = 6000, sd = 200)
plot(x, y, axes = FALSE)
temp <- buildAxis(1, limits = x, n = 4)
buildAxis(2, y, 3)
# ===> 3 <===#
x <- seq(367, 1251, 10)
y <- 7.5 * x + rnorm(length(x), mean = 6000, sd = 800)
plot(x, y, axes = FALSE)
buildAxis(1, limits = x, n = 4, nMin = 3, nMax = 3)
buildAxis(2, limits = y, n = 4, nMin = 3, nMax = 5)
# ===> 4 <===#
x <- seq(367, 367.1, 0.001)
y <- 7.5 * x + rnorm(length(x), mean = 6000, sd = 0.01)
plot(x, y, axes = FALSE)
buildAxis(1, limits = x, n = 4, nMin = 5, nMax = 6)
buildAxis(2, limits = y, n = 2, nMin = 3, nMax = 4)
# ===> 5 <===#
x <- seq(-0.05, -0.003, 0.0001)
y <- 50 + 20 * x + rnorm(length(x), sd = 0.1)
plot(x, y, axes = FALSE)
buildAxis(1, limits = x, n = 4, nMin = 5, nMax = 6)
buildAxis(2, limits = y, n = 4, nMax = 5)
abline(lm(y ~ x))
# ===> 6 <===#
x <- seq(-0.0097, -0.008, 0.0001)
y <- 50 + 20 * x + rnorm(length(x), sd = 0.1)
plot(x, y, axes = FALSE)
buildAxis(1, limits = x, n = 4, nMin = 2, nMax = 5)
buildAxis(2, limits = y, n = 4, nMax = 5)
abline(lm(y ~ x))
# ===> 7 <===#
x <- seq(0.03, -0.003099, -0.00001)
y <- 50 + 20 * x + rnorm(length(x), sd = 0.1)
plot(x, y, axes = FALSE)
buildAxis(1, limits = x, n = 4, nMin = 2, nMax = 5)
buildAxis(2, limits = y, n = 4, nMax = 6)
abline(lm(y ~ x))
# ===> 8 - repeat <===#
m <- runif(1) / runif(1) +
  rgamma(1, runif(1) / runif(1), runif(1) / runif(1))
s <- rgamma(1, runif(1) / runif(1), runif(1) / runif(1))
x <- rnorm(50, m, s)
hist(x, axes = FALSE)
buildAxis(1, limits = x, n = 5, nMin = 4, nMax = 6, eps = 10^-12)
if (diff(range(x)) < 10^-12) {
  cat("too small\n")
}
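The Details section describes the axis as being chosen by penalizing candidate setups and keeping the best one. The following sketch illustrates that general idea only; it is not the buildAxis implementation. The helper choose_ticks, its candidate generator (base R's pretty()), and its two penalty terms are all assumptions made for the example.

```r
# Hypothetical helper, not part of openintro
choose_ticks <- function(limits, n, nMin = 2, nMax = 10) {
  rng <- range(limits)
  # Candidate tick sets: one per requested label count
  candidates <- lapply(nMin:nMax, function(k) pretty(rng, n = k))
  # Penalty: distance from the preferred label count, plus how far
  # the ticks extend beyond the data relative to the data's width
  penalty <- vapply(candidates, function(ticks) {
    overshoot <- (rng[1] - min(ticks)) + (max(ticks) - rng[2])
    abs(length(ticks) - n) + overshoot / diff(rng)
  }, numeric(1))
  # Keep the candidate with the smallest penalty
  candidates[[which.min(penalty)]]
}

x <- seq(0, 500, 10)
plot(x, x, axes = FALSE)
axis(1, at = choose_ticks(x, n = 5))
```

Ranking whole candidate axes, rather than adjusting a single axis in place, keeps the preference for n labels soft while nMin and nMax bound the search.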
Burger preferences
Description
Sample burger place preferences versus gender.
Usage
burger
Format
A data frame with 500 observations on the following 2 variables.
- best_burger_place
 Burger place.
- gender
 a factor with levels Female and Male
Source
SurveyUSA, Results of SurveyUSA News Poll #17718, data collected on December 2, 2010.
Examples
table(burger)
Calculate hit streaks
Description
Calculate hit streaks
Usage
calc_streak(x)
Arguments
x | 
 A character vector of hits (  | 
Value
A data frame with one column, length, containing the length of
each hit streak.
Examples
data(kobe_basket)
calc_streak(kobe_basket$shot)
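To make the returned streak lengths concrete, here is a small base R sketch of one way such lengths can be computed. It is an illustration rather than the package source: the function name streak_lengths is hypothetical, hits are assumed to be coded as "H", and a streak is taken to be the run of hits between two consecutive misses, so back-to-back misses yield a streak of length 0.

```r
# Hypothetical re-implementation for illustration; not the package source
streak_lengths <- function(x, hit = "H") {
  is_hit <- as.integer(x == hit)
  # Pad with a miss at each end so every streak is bounded by misses
  padded <- c(0L, is_hit, 0L)
  miss_pos <- which(padded == 0L)
  # Number of hits between each pair of consecutive misses
  data.frame(length = diff(miss_pos) - 1L)
}

# Made-up shot sequence
shots <- c("H", "M", "M", "H", "H", "M", "H")
streak_lengths(shots)
```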
Cancer in dogs
Description
A study in 1994 examined 491 dogs that had developed cancer and 945 dogs as a control group to determine whether there is an increased risk of cancer in dogs that are exposed to the herbicide 2,4-Dichlorophenoxyacetic acid (2,4-D).
Usage
cancer_in_dogs
Format
A data frame with 1436 observations on the following 2 variables.
- order
 a factor with levels 2,4-D and no 2,4-D
- response
 a factor with levels cancer and no cancer
Source
Hayes HM, Tarone RE, Cantor KP, Jessen CR, McCurnin DM, and Richardson RC. 1991. Case-Control Study of Canine Malignant Lymphoma: Positive Association With Dog Owner's Use of 2,4-Dichlorophenoxyacetic Acid Herbicides. Journal of the National Cancer Institute 83(17):1226-1231.
Examples
table(cancer_in_dogs)
Deck of cards
Description
All the cards in a standard deck.
Usage
cards
Format
A data frame with 52 observations on the following 4 variables.
- value
 a factor with levels 10, 2, 3, 4, 5, 6, 7, 8, 9, A, J, K, Q
- color
 a factor with levels black, red
- suit
 a factor with levels Club, Diamond, Heart, Spade
- face
 a logical vector
Examples
table(cards$value)
table(cards$color)
table(cards$suit)
table(cards$face)
table(cards$suit, cards$face)
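The structure of the deck can also be rebuilt from the levels listed under Format, which makes the 52 rows easy to verify. This is a minimal base R sketch; the object deck is illustrative, and treating jacks, queens, and kings as the face cards is an assumption about how the face column is defined.

```r
# Levels as listed under Format
values <- c(2:10, "J", "Q", "K", "A")
suits  <- c("Club", "Diamond", "Heart", "Spade")

# Every combination of value and suit
deck <- expand.grid(value = values, suit = suits, stringsAsFactors = FALSE)

# Clubs and spades are black; diamonds and hearts are red
deck$color <- ifelse(deck$suit %in% c("Club", "Spade"), "black", "red")

# Assumption: face cards are jacks, queens, and kings
deck$face <- deck$value %in% c("J", "Q", "K")

nrow(deck)
table(deck$suit, deck$face)
```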
cars04
Description
A data frame with 428 rows and 19 columns. This is a record of characteristics on all of the new models of cars for sale in the US in the year 2004.
Usage
cars04
Format
A data frame with 428 observations on the following 19 variables.
- name
 The name of the vehicle including manufacturer and model.
- sports_car
 Logical variable indicating if the vehicle is a sports car.
- suv
 Logical variable indicating if the vehicle is an SUV.
- wagon
 Logical variable indicating if the vehicle is a wagon.
- minivan
 Logical variable indicating if the vehicle is a minivan.
- pickup
 Logical variable indicating if the vehicle is a pickup.
- all_wheel
 Logical variable indicating if the vehicle is all-wheel drive.
- rear_wheel
 Logical variable indicating if the vehicle is rear-wheel drive.
- msrp
 Manufacturer suggested retail price of the vehicle.
- dealer_cost
 Amount of money the dealer paid for the vehicle.
- eng_size
 Displacement of the engine - the total volume of all the cylinders, measured in liters.
- ncyl
 Number of cylinders in the engine.
- horsepwr
 Amount of horsepower produced by the engine.
- city_mpg
 Gas mileage for city driving, measured in miles per gallon.
- hwy_mpg
 Gas mileage for highway driving, measured in miles per gallon.
- weight
 Total weight of the vehicle, measured in pounds.
- wheel_base
 Distance between the center of the front wheels and the center of the rear wheels, measured in inches.
- length
 Total length of the vehicle, measured in inches.
- width
 Total width of the vehicle, measured in inches.
Examples
library(ggplot2)
# Highway gas mileage
ggplot(cars04, aes(x = hwy_mpg)) +
  geom_histogram(
    bins = 15, color = "white",
    fill = openintro::IMSCOL["green", "full"]
  ) +
  theme_minimal() +
  labs(
    title = "Highway gas mileage for cars from 2004",
    x = "Gas Mileage (miles per gallon)",
    y = "Number of cars"
  )
cars93
Description
A data frame with 54 rows and 6 columns. This data is a subset of the
Cars93 dataset from the MASS package.
Usage
cars93
Format
A data frame with 54 observations on the following 6 variables.
- type
 The vehicle type with levels large, midsize, and small.
- price
 Vehicle price (USD).
- mpg_city
 Vehicle mileage in city (miles per gallon).
- drive_train
 Vehicle drive train with levels 4WD, front, and rear.
- passengers
 The vehicle passenger capacity.
- weight
 Vehicle weight (lbs).
Details
These cars represent a random sample for 1993 models that were in both
Consumer Reports and PACE Buying Guide. Only vehicles of type
small, midsize, and large were included.
Further description can be found in Lock (1993). Use the URL http://jse.amstat.org/v1n1/datasets.lock.html.
Source
Lock, R. H. (1993) 1993 New Car Data. Journal of Statistics Education 1(1).
Examples
library(ggplot2)
# Vehicle price by type
ggplot(cars93, aes(x = price)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~type)
# Vehicle price vs. weight
ggplot(cars93, aes(x = weight, y = price)) +
  geom_point()
# Mileage vs. weight
ggplot(cars93, aes(x = weight, y = mpg_city)) +
  geom_point() +
  geom_smooth()
Community college housing (simulated data)
Description
These are simulated data and intended to represent housing prices of students at a community college.
Usage
cchousing
Format
A data frame with 75 observations on the following variable.
- price
 Monthly housing price, simulated.
Examples
hist(cchousing$price)
Random sample of 2000 U.S. Census Data
Description
A random sample of 500 observations from the 2000 U.S. Census Data.
Usage
census
Format
A data frame with 500 observations on the following 8 variables.
- census_year
 Census Year.
- state_fips_code
 Name of state.
- total_family_income
 Total family income (in U.S. dollars).
- age
 Age.
- sex
 Sex with levels Female and Male.
- race_general
 Race with levels American Indian or Alaska Native, Black, Chinese, Japanese, Other Asian or Pacific Islander, Two major races, White and Other.
- marital_status
 Marital status with levels Divorced, Married/spouse absent, Married/spouse present, Never married/single, Separated and Widowed.
- total_personal_income
 Total personal income (in U.S. dollars).
Source
https://data.census.gov/cedsci
Examples
library(dplyr)
library(ggplot2)
census |>
  filter(total_family_income > 0) |>
  ggplot(aes(x = total_family_income)) +
  geom_histogram(binwidth = 25000)
Summary information for 31 cherry trees
Description
Researchers wanting to understand the relationship between these variables for black cherry trees collected data from 31 trees in the Allegheny National Forest, Pennsylvania.
Usage
cherry
Format
A data frame with 31 observations on the following 3 variables.
- diam
 diameter in inches (at 54 inches above ground)
- height
 height in feet
- volume
 volume in cubic feet
Source
D.J. Hand. A handbook of small data sets. Chapman & Hall/CRC, 1994.
Examples
library(ggplot2)
library(broom)
ggplot(cherry, aes(x = diam, y = volume)) +
  geom_point() +
  geom_smooth(method = "lm")
mod <- lm(volume ~ diam + height, cherry)
tidy(mod)
Gender Stereotypes in 5-7 year old Children
Description
Stereotypes are common, but at what age do they start? This study investigates stereotypes in young children aged 5-7 years old. There are four studies reported in the paper, and all four datasets are provided here.
Usage
children_gender_stereo
Format
This data object is more unusual than most. It is a list of 4 data frames. The four data frames correspond to the data used in Studies 1-4 of the referenced paper, and these data frames each have variables (columns) that are among the following:
- subject
 Subject ID. Note that Subject 1 in the first data frame (dataset) does not correspond to Subject 1 in the second data frame.
- gender
 Gender of the subject.
- age
 Age of the subject, in years.
- trait
 The trait that the children were making a judgement about, which was either nice or smart.
- target
 The age group of the people the children were making judgements about (as being either nice or smart): children or adults.
- stereotype
 The proportion of trials where the child picked a gender target that matched the trait that was the same as the gender of the child. For example, suppose we had 18 pictures, where each picture showed 2 men and 2 women (and a different set of people in each photo). Then if we asked a boy to pick the person in each picture who they believed to be really smart, this stereotype variable would report the fraction of pictures where the boy picked a man. When a girl reviews the photos, then this stereotype variable reports the fraction of photos where she picked a woman. That is, this variable differs in meaning depending on the gender of the child. (This variable design is a little confusing, but it is useful when analyzing the data.)
- high_achieve_caution
 The proportion of trials where the child said that children of their own gender were high-achieving in school.
- interest
 Average score that measured the interest of the child in the game.
- difference
 A difference score between the interest of the child in the “smart” game and their interest in the “try-hard” game.
Details
The structure of the data object is a little unusual, so we recommend reviewing the Examples section before starting your analysis.
Thank you to Nicholas Horton for pointing us to this study and the data!
Most of the results in the paper can be reproduced using the data provided here.
Source
Bian L, Leslie SJ, Cimpian A. 2017. "Gender stereotypes about intellectual ability emerge early and influence children's interests". Science 355:6323 (389-391). https://www.science.org/doi/10.1126/science.aah6524.
The original data may be found here.
Examples
# This dataset is a little funny to work with.
# If wanting to review the data for a study, we
# recommend first assigning the corresponding
# data frame to a new variable. For instance,
# below we assign the second study's data to an
# object called `d` (d is for data!).
d <- children_gender_stereo[[2]]
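Because the object is a list of data frames, the usual list tools give a quick overview of all four studies at once. A short sketch, assuming the openintro package is installed:

```r
library(openintro)

# Number of rows in each of the four study data frames
vapply(children_gender_stereo, nrow, integer(1))

# Variables available in each study
lapply(children_gender_stereo, names)
```

Checking the variable names first helps, since each study carries only a subset of the columns listed under Format.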
Child care hours
Description
The China Health and Nutrition Survey aims to examine the effects of the health, nutrition, and family planning policies and programs implemented by national and local governments.
Usage
china
Format
A data frame with 9788 observations on the following 3 variables.
- gender
 a numeric vector
- edu
 a numeric vector
- child_care
 a numeric vector
Source
UNC Carolina Population Center, China Health and Nutrition Survey, 2006.
Examples
summary(china)
CIA Factbook Details on Countries
Description
Country-level statistics from the US Central Intelligence Agency (CIA).
Usage
cia_factbook
Format
A data frame with 259 observations on the following 11 variables.
- country
 Country name.
- area
 Land area, in square kilometers. (1 square kilometer is 0.386 square miles.)
- birth_rate
 Birth rate, in births per 1,000 people.
- death_rate
 Death rate, in deaths per 1,000 people.
- infant_mortality_rate
 Infant mortality, in deaths per 1,000 live births.
- internet_users
 Total number of internet users.
- life_exp_at_birth
 Life expectancy at birth, in years.
- maternal_mortality_rate
 Number of female deaths per 100,000 live births where the death is related to pregnancy or birth.
- net_migration_rate
 Net migration rate.
- population
 Total population.
- population_growth_rate
 Population growth rate.
Source
CIA Factbook, Country Comparisons, 2014. https://www.cia.gov/the-world-factbook/references/guide-to-country-comparisons/
Examples
library(dplyr)
library(ggplot2)
cia_factbook_iup <- cia_factbook |>
  mutate(internet_users_percent = 100 * internet_users / population)
ggplot(cia_factbook_iup, aes(x = internet_users_percent, y = life_exp_at_birth)) +
  geom_point() +
  labs(x = "Percentage of internet users", y = "Life expectancy at birth")
Simulated class data
Description
These simulated data represent the exam scores of students from three different lectures who were all given the same exam.
Usage
classdata
Format
A data frame with 164 observations on the following 2 variables.
- m1
 Represents a first midterm score.
- lecture
 Three classes: a, b, and c.
References
OpenIntro Statistics, Chapter 8.
Examples
anova(lm(m1 ~ lecture, classdata))
Cleveland and Sacramento
Description
Data on a sample of 500 people from the Cleveland, OH and Sacramento, CA metro areas.
Usage
cle_sac
Format
A data frame with 500 observations representing people on the following 8 variables.
- year
 Year the data was collected.
- state
 State where person resides.
- city
 City.
- age
 Age.
- sex
 Sex.
- race
 Race.
- marital_status
 Marital status.
- personal_income
 Personal income.
Examples
library(ggplot2)
ggplot(cle_sac, aes(x = personal_income)) +
  geom_histogram(binwidth = 20000) +
  facet_wrap(~city)
Temperature Summary Data, Geography Limited
Description
A random set of monitoring locations were taken from NOAA data that had both years of interest (1948 and 2018) as well as data for both summary metrics of interest (dx70 and dx90, which are described below).
Usage
climate70
Format
A data frame with 197 observations on the following 7 variables.
- station
 Station ID.
- latitude
 Latitude of the station.
- longitude
 Longitude of the station.
- dx70_1948
 Number of days above 70 degrees in 1948.
- dx70_2018
 Number of days above 70 degrees in 2018.
- dx90_1948
 Number of days above 90 degrees in 1948.
- dx90_2018
 Number of days above 90 degrees in 2018.
Details
Please keep in mind that these are two annual snapshots, and a complete analysis would consider much more than two years of data and much additional information for those years.
Source
https://www.ncdc.noaa.gov/cdo-web, retrieved 2019-04-24.
Examples
# Data sampled are from the US, Europe, and Australia.
# This geographic limitation may be due to the particular
# years considered, since locations without both 1948 and
# 2018 were discarded for this (simple) dataset.
plot(climate70$longitude, climate70$latitude)
plot(climate70$dx70_1948, climate70$dx70_2018)
abline(0, 1, lty = 2)
plot(climate70$dx90_1948, climate70$dx90_2018)
abline(0, 1, lty = 2)
hist(climate70$dx70_2018 - climate70$dx70_1948)
hist(climate70$dx90_2018 - climate70$dx90_1948)
t.test(climate70$dx70_2018 - climate70$dx70_1948)
t.test(climate70$dx90_2018 - climate70$dx90_1948)
Climber Drugs Data.
Description
Anonymous data was collected from urine samples at huts along the climb of Mont Blanc. Several types of drugs were tested, and proportions were reported.
Usage
climber_drugs
Format
A data frame with 211 rows and 6 variables.
- positive_sample
 Identification number of a specific urine sample.
- hut
 Location where the sample was taken.
- substance
 Substance detected to be present in the urine sample.
- concentration
 Amount of substance found measured in ng/ml.
- screening_analysis
 Indicates that the concentration was determined by screening analysis.
- concomitant
 Indicates that this substance was always detected concomitantly with the previous one, within the same urine sample.
Source
PLOS One - Drug Use on Mont Blanc: A Study Using Automated Urine Collection
Examples
library(dplyr)
# Calculate the average concentration of each substance and number of occurrences.
climber_drugs |>
  group_by(substance) |>
  summarize(count = n(), mean_con = mean(concentration))
# Proportion of samples in which each substance was detected.
climber_drugs |>
  group_by(substance) |>
  summarize(prop = n() / 154)
Coast Starlight Amtrak train
Description
Travel times and distances.
Usage
coast_starlight
Format
A data frame with 16 observations on the following 3 variables.
- station
 Station.
- dist
 Distance.
- travel_time
 Travel time.
Examples
library(ggplot2)
ggplot(coast_starlight, aes(x = dist, y = travel_time)) +
  geom_point()
comics
Description
A data frame containing information about comic book characters from Marvel Comics and DC Comics.
Usage
comics
Format
A data frame with 21821 observations on the following 11 variables.
- name
 Name of the character. May include: Real name, hero or villain name, alias(es) and/or which universe they live in (e.g. Earth-616 in Marvel's multiverse).
- id
 Status of the character's identity with levels Secret, Public, No Dual and Unknown.
- align
 Character's alignment with levels Good, Bad, Neutral and Reformed Criminals.
- eye
 Character's eye color.
- hair
 Character's hair color.
- gender
 Character's gender.
- gsm
 Character's classification as a gender or sexual minority.
- alive
 Is the character dead or alive?
- appearances
 Number of comic books the character appears in.
- first_appear
 Date of publication for the comic book the character first appeared in.
- publisher
 Publisher of the comic with levels Marvel and DC.
Examples
library(ggplot2)
library(dplyr)
# Good v Bad
plot_data <- comics |>
  filter(align == "Good" | align == "Bad")
ggplot(plot_data, aes(x = align, fill = align)) +
  geom_bar() +
  facet_wrap(~publisher) +
  scale_fill_manual(values = c(IMSCOL["red", "full"], IMSCOL["blue", "full"])) +
  theme_minimal() +
  labs(
    title = "Is there a balance of power",
    x = "",
    y = "Number of characters",
    fill = ""
  )
Generate Contingency Tables for LaTeX
Description
Input a data frame or a table, and the LaTeX output will be returned. Options exist for row and column proportions as well as for showing work.
Usage
contTable(
  x,
  prop = c("none", "row", "col"),
  show = FALSE,
  digits = 3,
  caption = NULL,
  label = NULL
)
Arguments
x | 
 A data frame (with two columns) or a table.  | 
prop | 
 Indicate whether row (  | 
show | 
 If row or column proportions are specified, indicate whether work should be shown.  | 
digits | 
 The number of digits after the decimal that should be shown for row or column proportions.  | 
caption | 
 A string that contains the table caption. The default value is NULL.  | 
label | 
 The LaTeX table label. The default value is NULL.  | 
Details
The contTable function makes substantial use of the
cat function.
Author(s)
David Diez
See Also
email, cars93, possum,
mariokart
Examples
data(email)
table(email[, c("spam", "sent_email")])
contTable(email[, c("spam", "sent_email")])
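As a further sketch, the call below requests row proportions using only the arguments listed under Usage. It assumes the openintro package is installed; the caption and label strings are placeholders chosen for illustration, not values from the textbooks.

```r
library(openintro)

# Row proportions, rounded to 2 digits after the decimal.
# The caption and label are placeholder strings (assumptions).
contTable(
  email[, c("spam", "sent_email")],
  prop = "row",
  digits = 2,
  caption = "Spam status by prior correspondence",
  label = "spamBySentEmail"
)
```

Setting prop = "col" instead should give column proportions, and show = TRUE asks for the work to be shown.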
Sample datasets for correlation problems
Description
Simulated data.
Usage
corr_match
Format
A data frame with 121 observations on the following 9 variables.
- x
 a numeric vector
- y1
 a numeric vector
- y2
 a numeric vector
- y3
 a numeric vector
- y4
 a numeric vector
- y5
 a numeric vector
- y6
 a numeric vector
- y7
 a numeric vector
- y8
 a numeric vector
Source
Simulated dataset.
Examples
library(ggplot2)
ggplot(corr_match, aes(x = x, y = y1)) +
  geom_point()
cor(corr_match$x, corr_match$y1)
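Since the dataset holds eight response columns, it can be convenient to compute all eight correlations with x at once. The following is a minimal sketch, assuming the openintro package is installed and the columns are named as listed under Format.

```r
library(openintro)

# Names of the eight response columns listed under Format.
y_cols <- paste0("y", 1:8)

# Correlation of x with each response column.
cors <- sapply(
  corr_match[y_cols],
  function(y) cor(corr_match$x, y, use = "complete.obs")
)
round(cors, 2)
```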
Country ISO information
Description
Country International Organization for Standardization (ISO) information.
Usage
country_iso
Format
A data frame with 249 observations on the following 4 variables.
- country_code
 Two-letter ISO country code.
- country_name
 Country name.
- year
 Year the two-letter ISO country code was assigned.
- top_level_domain
 Top-level domain name.
Source
Wikipedia, retrieved 2018-11-18. https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
Examples
country_iso
CPR dataset
Description
These patients were randomly divided into a treatment group where they received a blood thinner or the control group where they did not receive a blood thinner. The outcome variable of interest was whether the patients survived for at least 24 hours.
Usage
cpr
Format
A data frame with 90 observations on the following 2 variables.
- group
 a factor with levels control and treatment
- outcome
 a factor with levels died and survived
Source
Efficacy and safety of thrombolytic therapy after initially unsuccessful cardiopulmonary resuscitation: a prospective clinical trial, by Bottiger et al., The Lancet, 2001.
Examples
table(cpr)
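Because the outcome of interest is survival within each randomized group, row proportions are often more informative than raw counts. The sketch below assumes the openintro package is installed and uses the factor levels listed under Format.

```r
library(openintro)

# Two-way table: rows are groups, columns are outcomes.
tab <- table(cpr$group, cpr$outcome)

# Proportion of each outcome within each group (each row sums to 1).
props <- prop.table(tab, margin = 1)
props

# Difference in survival proportions: treatment minus control.
props["treatment", "survived"] - props["control", "survived"]
```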
CPUs Released between 2010 and 2020.
Description
Data on computer processors released between 2010 and 2020.
Usage
cpu
Format
A data frame with 875 rows and 12 variables.
- company
 Manufacturer of the CPU.
- name
 Model name of the processor.
- codename
 Name given by manufacturer to all chips with this architecture.
- cores
 Number of compute cores per processor.
- threads
 The number of threads represents the number of simultaneous calculations that can be ongoing in the processor.
- base_clock
 Base speed for the CPU in GHz.
- boost_clock
 Single-core max speed for the CPU in GHz.
- socket
 Specifies the type of connection to the motherboard.
- process
 Size of the process node used in production in nm.
- l3_cache
 Size of the level 3 cache on the processor in MB.
- tdp
 Thermal design power (TDP) of the processor.
- released
 Date on which the processor was released to the public.
Source
Examples
library(ggplot2)
# CPU base speed
ggplot(cpu, aes(x = company, y = base_clock)) +
  geom_boxplot() +
  labs(
    x = "Company",
    y = "Base Clock (GHz)",
    title = "CPU base speed"
  )
# Process node size vs. boost speed
ggplot(cpu, aes(x = process, y = boost_clock)) +
  geom_point() +
  labs(
    x = "Process node size (nm)",
    y = "Boost Clock (GHz)",
    title = "Process node size vs. boost speed"
  )
College credits.
Description
A simulated dataset of number of credits taken by college students each semester.
Usage
credits
Format
A data frame with 100 observations on the following variable.
- credits
 Number of credits.
Source
Simulated data.
Examples
library(ggplot2)
ggplot(credits, aes(x = credits)) +
  geom_histogram(binwidth = 1)
Daycare fines
Description
Researchers tested the deterrence hypothesis, which predicts that introducing a penalty will reduce the occurrence of the penalized behavior, provided the penalty leaves everything else unchanged. To do so, they instituted a fine for late pickup at daycare centers. They worked with 10 volunteer daycare centers that did not originally fine parents for picking up their kids late. They randomly selected 6 of these centers, instituted a monetary fine (of a considerable amount) for picking up children late, and later removed it. In the remaining 4 centers no fine was introduced. The study period was divided into four: before the fine (weeks 1-4), the first 4 weeks with the fine (weeks 5-8), the entire period with the fine (weeks 5-16), and the after-fine period (weeks 17-20). Throughout the study, the number of kids who were picked up late was recorded each week for each daycare. The study found that the number of late-coming parents increased significantly when the fine was introduced, and no reduction occurred after the fine was removed.
Usage
daycare_fines
Format
A data frame with 200 observations on the following 7 variables.
- center
 Daycare center id.
- group
 Study group: test (fine instituted) or control (no fine).
- children
 Number of children at daycare center.
- week
 Week of study.
- late_pickups
 Number of late pickups for a given week and daycare center.
- study_period_4
 Period of study, divided into 4 periods: before fine, first 4 weeks with fine, last 8 weeks with fine, after fine
- study_period_3
 Period of study, divided into 3 periods: before fine, with fine, after fine
Source
Gneezy, Uri, and Aldo Rustichini. "A fine is a price." The Journal of Legal Studies 29, no. 1 (2000): 1-17.
Examples
library(dplyr)
library(tidyr)
library(ggplot2)
# The following tables roughly match results presented in Table 2 of the source article
# The results are only off by rounding for some of the weeks
daycare_fines |>
  group_by(center, study_period_4) |>
  summarise(avg_late_pickups = mean(late_pickups), .groups = "drop") |>
  pivot_wider(names_from = study_period_4, values_from = avg_late_pickups)
daycare_fines |>
  group_by(center, study_period_3) |>
  summarise(avg_late_pickups = mean(late_pickups), .groups = "drop") |>
  pivot_wider(names_from = study_period_3, values_from = avg_late_pickups)
# The following plot matches Figure 1 of the source article
daycare_fines |>
  group_by(week, group) |>
  summarise(avg_late_pickups = mean(late_pickups), .groups = "drop") |>
  ggplot(aes(x = week, y = avg_late_pickups, group = group, color = group)) +
  geom_point() +
  geom_line()
Density plot
Description
Compute kernel density plots, written in the same structure as
boxPlot. Histograms can be automatically added for teaching
purposes.
Usage
densityPlot(
  x,
  fact = NULL,
  bw = "nrd0",
  histo = c("none", "faded", "hollow"),
  breaks = "Sturges",
  fading = "0E",
  fadingBorder = "25",
  lty = NULL,
  lwd = 1,
  col = c("black", "red", "blue"),
  key = NULL,
  add = FALSE,
  adjust = 1,
  kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight",
    "cosine", "optcosine"),
  weights = NULL,
  n = 512,
  from,
  to,
  na.rm = FALSE,
  xlim = NULL,
  ylim = NULL,
  main = "",
  ...
)
Arguments
x | 
 A numerical vector.  | 
fact | 
 A character or factor vector defining the grouping for data in
  | 
bw | 
 Bandwidth. See   | 
histo | 
 Whether to plot a faded histogram (  | 
breaks | 
 The   | 
fading | 
 Character value of hexadecimal, e.g.   | 
fadingBorder | 
 Character value of hexadecimal, e.g.   | 
lty | 
 Numerical vector describing the line type for the density
curve(s). Each element corresponds to a different level of the
argument  | 
lwd | 
 Numerical vector describing the line width for the density
curve(s). Each element corresponds to a different level of the
argument  | 
col | 
 Numerical vector describing the line color for the density
curve(s). Each element corresponds to a different level of the
argument  | 
key | 
 An argument to specify ordering of the factor levels.  | 
add | 
 If   | 
adjust | 
 Argument passed to   | 
kernel | 
 Argument passed to   | 
weights | 
 Argument passed to   | 
n | 
 Argument passed to   | 
from | 
 Argument passed to   | 
to | 
 Argument passed to   | 
na.rm | 
 Argument passed to   | 
xlim | 
 x-axis limits.  | 
ylim | 
 y-axis limits.  | 
main | 
 Title for the plot.  | 
... | 
 If   | 
Author(s)
David Diez
See Also
Examples
# hollow histograms
histPlot(tips$tip[tips$day == "Tuesday"],
  hollow = TRUE, xlim = c(0, 30),
  lty = 1, main = "Tips by day"
)
histPlot(tips$tip[tips$day == "Friday"],
  hollow = TRUE, border = "red",
  add = TRUE, main = "Tips by day"
)
legend("topright",
  col = c("black", "red"),
  lty = 1:2, legend = c("Tuesday", "Friday")
)
# density plots
densityPlot(tips$tip, tips$day,
  col = c("black", "red"), main = "Tips by day"
)
legend("topright",
  col = c("black", "red"),
  lty = 1:2, legend = c("Tuesday", "Friday")
)
densityPlot(tips$tip,
  histo = "faded",
  breaks = 15, main = "Tips by day"
)
densityPlot(tips$tip,
  histo = "hollow",
  breaks = 30, fadingBorder = "66",
  lty = 1, main = "Tips by day"
)
Type 2 Diabetes Clinical Trial for Patients 10-17 Years Old
Description
Three treatments were compared to test their relative efficacy (effectiveness) in treating Type 2 Diabetes in patients aged 10-17 who were being treated with metformin. The primary outcome was lack of glycemic control (or not); lacking glycemic control means the patient still needed insulin, which is not the preferred outcome for a patient.
Usage
diabetes2
Format
A data frame with 699 observations on the following 2 variables.
- treatment
 The treatment the patient received.
- outcome
 Whether the patient still needs insulin (failure) or met a basic positive outcome bar (success).
Details
Each of the 699 patients in the experiment was randomized to one of the
following treatments: (1) continued treatment with metformin
(coded as met), (2) metformin combined with rosiglitazone (coded as
rosi), or (3) a lifestyle-intervention program (coded as
lifestyle).
Source
Zeitler P, et al. 2012. A Clinical Trial to Maintain Glycemic Control in Youth with Type 2 Diabetes. N Engl J Med.
Examples
lapply(diabetes2, table)
(cont.table <- table(diabetes2))
(m <- chisq.test(cont.table))
m$expected
Create a Double Line Segment Plot
Description
Create a plot showing two line segments. The union or intersection of those
line segments can also be generated by utilizing the type argument.
Usage
dlsegments(
  x1 = c(3, 7),
  x2 = c(5, 9),
  l = c("o", "o"),
  r = c("c", "c"),
  type = c("n", "u", "i"),
  COL = 2,
  lwd = 2.224,
  ylim = c(-0.35, 2),
  mar = rep(0, 4),
  hideOrig = FALSE
)
Arguments
x1 | 
 The endpoints of the first interval. Values larger (smaller) than 999 (-999) will be interpreted as (negative) infinity.  | 
x2 | 
 The endpoints of the second interval. Values larger (smaller) than 999 (-999) will be interpreted as (negative) infinity.  | 
l | 
 A vector of length 2, where the values correspond to the left end
point of each interval. A value of   | 
r | 
 A vector of length 2, where the values correspond to the right end
point of each interval. A value of   | 
type | 
 By default, no intersection or union of the two lines will be
shown (value of   | 
COL | 
 If the union or intersection is to be shown (see the   | 
lwd | 
 If the union or intersection is to be shown (see the   | 
ylim | 
 A vector of length 2 specifying the vertical plotting limits,
which may be useful for fine-tuning plots. The default is   | 
mar | 
 A vector of length 4 that represents the plotting margins.  | 
hideOrig | 
 An optional argument to specify whether the two line
segments should be shown (  | 
Author(s)
David Diez
See Also
Examples
dlsegments(c(-3, 3), c(1, 1000),
  r = c("o", "o"), l = c("c", "o"), COL = COL[4]
)
dlsegments(c(-3, 3), c(1, 1000),
  r = c("o", "o"), l = c("c", "o"), type = "un", COL = COL[4]
)
dlsegments(c(-3, 3), c(1, 1000),
  r = c("o", "o"), l = c("c", "o"), type = "in", COL = COL[4]
)
Dot plot
Description
Plot observations as dots.
Usage
dotPlot(
  x,
  fact = NULL,
  vertical = FALSE,
  at = 1,
  key = NULL,
  pch = 20,
  col = fadeColor("black", "66"),
  cex = 1.5,
  add = FALSE,
  axes = TRUE,
  xlim = NULL,
  ylim = NULL,
  ...
)
Arguments
x | 
 A numerical vector.  | 
fact | 
 A character or factor vector defining the grouping for data in
  | 
vertical | 
 If   | 
at | 
 The vertical coordinate of the points, or the horizontal
coordinate if   | 
key | 
 The factor levels corresponding to   | 
pch | 
 Plotting character. If   | 
col | 
 Plotting character color. If   | 
cex | 
 Plotting character size. If   | 
add | 
 If   | 
axes | 
 If   | 
xlim | 
 Limits for the x axis.  | 
ylim | 
 Limits for the y axis.  | 
... | 
 Additional arguments to be passed to   | 
Author(s)
David Diez
See Also
histPlot, densityPlot,
boxPlot
Examples
library(dplyr)
# Price by type
dotPlot(cars93$price,
  cars93$type,
  key = c("large", "midsize", "small"),
  cex = 1:3
)
# Hours worked by educational attainment or degree
gss2010_nona <- gss2010 |>
  filter(!is.na(hrs1) & !is.na(degree))
dotPlot(gss2010_nona$hrs1,
  gss2010_nona$degree,
  col = fadeColor("black", "11")
)
# levels reordered
dotPlot(gss2010_nona$hrs1,
  gss2010_nona$degree,
  col = fadeColor("black", "11"),
  key = c("LT HIGH SCHOOL", "HIGH SCHOOL", "BACHELOR", "JUNIOR COLLEGE", "GRADUATE")
)
# with boxPlot() overlaid
dotPlot(mariokart$total_pr,
  mariokart$cond,
  ylim = c(0.5, 2.5), xlim = c(25, 80), cex = 1
)
boxPlot(mariokart$total_pr,
  mariokart$cond,
  add = 1:2 + 0.1,
  key = c("new", "used"), horiz = TRUE, axes = FALSE
)
Add a Stacked Dot Plot to an Existing Plot
Description
Add a stacked dot plot to an existing plot. The locations for the points in the dot plot are returned from the function in a list.
Usage
dotPlotStack(x, radius = 1, seed = 1, addDots = TRUE, ...)
Arguments
x | 
 A vector of numerical observations for the dot plot.  | 
radius | 
 The approximate distance that should separate each point.  | 
seed | 
 A random seed (integer). Different values will produce different variations.  | 
addDots | 
 Indicate whether the points should be added to the plot.  | 
... | 
 Additional arguments are passed to
  | 
Value
Returns a list with a height that can be used as the upper bound of ylim for a plot, then also the x- and y-coordinates of the points in the stacked dot plot.
Author(s)
David Diez
See Also
Examples
#
Survey on views of the DREAM Act
Description
A SurveyUSA poll.
Usage
dream
Format
A data frame with 910 observations on the following 2 variables.
- ideology
 a factor with levels Conservative, Liberal, Moderate
- stance
 a factor with levels No, Not sure, Yes
Source
SurveyUSA, News Poll #18927, data collected Jan 27-29, 2012.
Examples
table(dream)
Quadcopter Drone Blades
Description
Quality control dataset for quadcopter drone blades. The data were made up for an example.
Usage
drone_blades
Format
A data frame with 2000 observations on the following 2 variables.
- supplier
 The supplier for the blade.
- inspection
 The inspection conclusion.
References
OpenIntro Statistics, Third Edition and Fourth Edition.
Examples
library(dplyr)
drone_blades |>
  count(supplier, inspection)
Drug use of students and parents
Description
Summary of 445 student-parent pairs.
Usage
drug_use
Format
A data frame with 445 observations on the following 2 variables.
- student
 a factor with levels not and uses
- parents
 a factor with levels not and used
Source
Ellis GJ and Stone LH. 1979. Marijuana Use in College: An Evaluation of a Modeling Explanation. Youth and Society 10:323-334.
Examples
table(drug_use)
Sale prices of houses in Duke Forest, Durham, NC
Description
Data on houses that were recently sold in the Duke Forest neighborhood of Durham, NC in November 2020.
Usage
duke_forest
Format
A data frame with 98 rows and 13 variables.
- address
 Address of house.
- price
 Sale price, in USD.
- bed
 Number of bedrooms.
- bath
 Number of bathrooms.
- area
 Area of home, in square feet.
- type
 Type of home (all are Single Family).
- year_built
 Year the home was built.
- heating
 Heating system.
- cooling
 Cooling system (other or central).
- parking
 Type of parking available and number of parking spaces.
- lot
 Area of the entire property, in acres.
- hoa
 If the home belongs to a Home Owners Association, the associated fee (NA otherwise).
- url
 URL of the listing.
Source
Data were collected from Zillow in November 2020.
Examples
library(ggplot2)
# Number of bedrooms and price
ggplot(duke_forest, aes(x = as.factor(bed), y = price)) +
  geom_boxplot() +
  labs(
    x = "Number of bedrooms",
    y = "Sale price (USD)",
    title = "Homes for sale in Duke Forest, Durham, NC",
    subtitle = "Data are from November 2020"
  )
# Area and price
ggplot(duke_forest, aes(x = area, y = price)) +
  geom_point() +
  labs(
    x = "Area (square feet)",
    y = "Sale price (USD)",
    title = "Homes for sale in Duke Forest, Durham, NC",
    subtitle = "Data are from November 2020"
  )
Earthquakes
Description
A select set of notable earthquakes from 1900 to 1999.
Usage
earthquakes
Format
A data frame with 123 rows and 7 variables.
- year
 Year the earthquake took place.
- month
 Month the earthquake took place.
- day
 Day the earthquake took place
- richter
 Magnitude of earthquake using the Richter Scale.
- area
 City or geographic location of earthquakes.
- region
 Country or countries if the earthquake occurred on a border.
- deaths
 Approximate number of deaths caused by earthquake
Source
World Almanac and Book of Facts: 2011.
Examples
library(ggplot2)
ggplot(earthquakes, aes(x = richter, y = deaths)) +
  geom_point()
ggplot(earthquakes, aes(x = log(deaths))) +
  geom_histogram()
Survey on Ebola quarantine
Description
In New York City on October 23rd, 2014, a doctor who had recently been treating Ebola patients in Guinea went to the hospital with a slight fever and was subsequently diagnosed with Ebola. Soon thereafter, an NBC 4 New York/The Wall Street Journal/Marist Poll asked New Yorkers whether they favored a "mandatory 21-day quarantine for anyone who has come in contact with an Ebola patient". This poll included responses of 1,042 New York adults between October 26th and 28th, 2014.
Usage
ebola_survey
Format
A data frame with 1042 observations on the following variable.
- quarantine
 Indicates whether the respondent is in favor or against the mandatory quarantine.
Source
Poll ID NY141026 on maristpoll.marist.edu.
Examples
table(ebola_survey)
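A poll like this is typically summarized by the sample proportion in favor together with an interval estimate. The sketch below assumes the openintro package is installed, that "favor" is one of the levels of quarantine (as described under Format), and that a normal-approximation 95% interval is acceptable for illustration.

```r
library(openintro)

# Sample size and number of respondents in favor.
n <- nrow(ebola_survey)
n_favor <- sum(ebola_survey$quarantine == "favor")

# Sample proportion and its standard error.
p_hat <- n_favor / n
se <- sqrt(p_hat * (1 - p_hat) / n)

# Normal-approximation 95% confidence interval for the proportion in favor.
p_hat + c(-1, 1) * qnorm(0.975) * se
```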
Exploratory data analysis plot
Description
Explore different plotting methods using a click interface.
Usage
edaPlot(
  dataFrame,
  Col = c("#888888", "#FF0000", "#222222", "#FFFFFF", "#CCCCCC", "#3377AA")
)
Arguments
dataFrame | 
 A data frame.  | 
Col | 
 A vector containing six colors. The colors may be given in any form.  | 
Author(s)
David Diez
See Also
histPlot, densityPlot,
boxPlot, dotPlot
Examples
data(mlbbat10)
bat <- mlbbat10[mlbbat10$at_bat > 200, ]
# edaPlot(bat)
data(mariokart)
mk <- mariokart[mariokart$total_pr < 100, ]
# edaPlot(mk)
Elmhurst College gift aid
Description
Gift aid for a random sample of 50 students at Elmhurst College.
Usage
elmhurst
Format
A data frame with 50 observations on the following 3 variables.
- family_income
 Family income of the student.
- gift_aid
 Gift aid, in $1000s.
- price_paid
 Price paid by the student (tuition - gift aid).
Source
These data were sampled from a table of data for all freshmen from the 2011 class at Elmhurst College that accompanied an article titled What Students Really Pay to Go to College published online by The Chronicle of Higher Education: https://www.chronicle.com/article/what-students-really-pay-to-go-to-college/?sra=true.
Examples
library(ggplot2)
library(broom)
ggplot(elmhurst, aes(x = family_income, y = gift_aid)) +
  geom_point() +
  geom_smooth(method = "lm")
mod <- lm(gift_aid ~ family_income, data = elmhurst)
tidy(mod)
Data frame representing information about a collection of emails
Description
These data represent incoming emails for the first three months of 2012 for an email account (see Source).
Usage
email
Format
The email (email_sent) data frame has 3921 (1252)
observations on the following 21 variables.
- spam
 Indicator for whether the email was spam.
- to_multiple
 Indicator for whether the email was addressed to more than one recipient.
- from
 Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
- cc
 Number of people cc'ed.
- sent_email
 Indicator for whether the sender had been sent an email in the last 30 days.
- time
 Time at which email was sent.
- image
 The number of images attached.
- attach
 The number of attached files.
- dollar
 The number of times a dollar sign or the word “dollar” appeared in the email.
- winner
 Indicates whether “winner” appeared in the email.
- inherit
 The number of times “inherit” (or an extension, such as “inheritance”) appeared in the email.
- viagra
 The number of times “viagra” appeared in the email.
- password
 The number of times “password” appeared in the email.
- num_char
 The number of characters in the email, in thousands.
- line_breaks
 The number of line breaks in the email (does not count text wrapping).
- format
 Indicates whether the email was written using HTML (e.g. may have included bolding or active links).
- re_subj
 Whether the subject started with “Re:”, “RE:”, “re:”, or “rE:”
- exclaim_subj
 Whether there was an exclamation point in the subject.
- urgent_subj
 Whether the word “urgent” was in the email subject.
- exclaim_mess
 The number of exclamation points in the email message.
- number
 Factor variable saying whether there was no number, a small number (under 1 million), or a big number.
Source
David Diez's Gmail Account, early months of 2012. All personally identifiable information has been removed.
See Also
Examples
e <- email
# ______ Variables For Logistic Regression ______#
# Variables are modified to match
#   OpenIntro Statistics, Second Edition
# As Is (6): spam, to_multiple, winner, format,
#            re_subj, exclaim_subj
# Omitted (7): from, sent_email, time, image,
#              viagra, urgent_subj, number
# Become Indicators (5): cc, attach, dollar,
#                        inherit, password
e$cc <- ifelse(email$cc > 0, 1, 0)
e$attach <- ifelse(email$attach > 0, 1, 0)
e$dollar <- ifelse(email$dollar > 0, 1, 0)
e$inherit <- ifelse(email$inherit > 0, 1, 0)
e$password <- ifelse(email$password > 0, 1, 0)
# Transform (3): num_char, line_breaks, exclaim_mess
# e$num_char     <- cut(email$num_char, c(0,1,5,10,20,1000))
# e$line_breaks  <- cut(email$line_breaks, c(0,10,100,500,10000))
# e$exclaim_mess <- cut(email$exclaim_mess, c(-1,0,1,5,10000))
g <- glm(
  spam ~ to_multiple + winner + format +
    re_subj + exclaim_subj +
    cc + attach + dollar +
    inherit + password, # +
  # num_char + line_breaks + exclaim_mess,
  data = e, family = binomial
)
summary(g)
# ______ Variable Selection Via AIC ______#
g. <- step(g)
plot(predict(g., type = "response"), e$spam)
# ______ Splitting num_char by html ______#
x <- log(email$num_char)
bw <- 0.4
R <- range(x) + c(-1, 1)
wt <- sum(email$format == 1) / nrow(email)
htmlAll <- density(x, bw = bw, from = R[1], to = R[2])
htmlNo <- density(x[email$format != 1],
  bw = bw,
  from = R[1], to = R[2]
)
htmlYes <- density(x[email$format == 1],
  bw = bw,
  from = R[1], to = R[2]
)
htmlNo$y <- htmlNo$y #* (1-wt)
htmlYes$y <- htmlYes$y #* wt + htmlNo$y
plot(htmlAll, xlim = c(-4, 6), ylim = c(0, 0.4))
lines(htmlNo, col = 4)
lines(htmlYes, lwd = 2, col = 2)
Sample of 50 emails
Description
This is a subsample of the email dataset.
Usage
email50
Format
A data frame with 50 observations on the following 21 variables.
- spam
 Indicator for whether the email was spam.
- to_multiple
 Indicator for whether the email was addressed to more than one recipient.
- from
 Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
- cc
 Number of people cc'ed.
- sent_email
 Indicator for whether the sender had been sent an email in the last 30 days.
- time
 Time at which email was sent.
- image
 The number of images attached.
- attach
 The number of attached files.
- dollar
 The number of times a dollar sign or the word “dollar” appeared in the email.
- winner
 Indicates whether “winner” appeared in the email.
- inherit
 The number of times “inherit” (or an extension, such as “inheritance”) appeared in the email.
- viagra
 The number of times “viagra” appeared in the email.
- password
 The number of times “password” appeared in the email.
- num_char
 The number of characters in the email, in thousands.
- line_breaks
 The number of line breaks in the email (does not count text wrapping).
- format
 Indicates whether the email was written using HTML (e.g. may have included bolding or active links).
- re_subj
 Whether the subject started with “Re:”, “RE:”, “re:”, or “rE:”
- exclaim_subj
 Whether there was an exclamation point in the subject.
- urgent_subj
 Whether the word “urgent” was in the email subject.
- exclaim_mess
 The number of exclamation points in the email message.
- number
 Factor variable saying whether there was no number, a small number (under 1 million), or a big number.
Source
David Diez's Gmail Account, early months of 2012. All personally identifiable information has been removed.
See Also
Examples
index <- c(
  101, 105, 116, 162, 194, 211, 263, 308, 361, 374,
  375, 465, 509, 513, 571, 691, 785, 842, 966, 968,
  1051, 1201, 1251, 1433, 1519, 1727, 1760, 1777, 1899, 1920,
  1943, 2013, 2052, 2252, 2515, 2629, 2634, 2710, 2823, 2835,
  2944, 3098, 3227, 3360, 3452, 3496, 3530, 3665, 3786, 3877
)
order <- c(
  3, 33, 12, 1, 21, 15, 43, 49, 8, 6,
  34, 25, 24, 35, 41, 9, 22, 50, 4, 48,
  7, 14, 46, 10, 38, 32, 26, 18, 23, 45,
  30, 16, 17, 20, 40, 47, 31, 37, 27, 11,
  5, 44, 29, 19, 13, 36, 39, 42, 28, 2
)
d <- email[index, ][order, ]
identical(d, email50)
American Adults on Regulation and Renewable Energy
Description
Pew Research conducted a poll to find whether American adults support regulation or believe the private market will move the American economy towards renewable energy.
Usage
env_regulation
Format
A data frame with 705 observations on the following variable.
- statement
 There were three possible outcomes for each person: "Regulations necessary", "Private marketplace will ensure", and "Don't know".
Details
The exact statements being selected were: (1) Government regulations are necessary to encourage businesses and consumers to rely more on renewable energy sources. (2) The private marketplace will ensure that businesses and consumers rely more on renewable energy sources, even without government regulations.
The actual sample size was 1012. However, the original data were not from a simple random sample; after accounting for the design, the equivalent sample size was about 705, which is the size used for this dataset to keep introductory analyses simple.
Source
Examples
table(env_regulation)
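The two sample sizes quoted under Details can be related by simple arithmetic. The sketch below uses base R only. Calling the ratio a design effect is an interpretation rather than something stated in the source, and the proportion p_hat is a hypothetical value chosen for illustration.

```r
# Sample sizes quoted under Details.
n_actual <- 1012
n_effective <- 705

# Ratio of actual to equivalent sample size
# (interpreted here as a design effect; an assumption).
design_effect <- n_actual / n_effective
design_effect

# Standard error of a sample proportion under each sample size,
# for a hypothetical proportion of 0.5.
p_hat <- 0.5
se_actual <- sqrt(p_hat * (1 - p_hat) / n_actual)
se_effective <- sqrt(p_hat * (1 - p_hat) / n_effective)

# The standard error grows by the square root of the ratio above.
se_effective / se_actual
```

Using 705 rather than 1012 observations therefore gives wider intervals, which reflects the extra uncertainty from the survey design.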
Vehicle info from the EPA for 2012
Description
Details from the EPA.
Usage
epa2012
Format
A data frame with 1129 observations on the following 28 variables.
- model_yr
 a numeric vector
- mfr_name
 Manufacturer name.
- division
 Vehicle division.
- carline
 Vehicle line.
- mfr_code
 Manufacturer code.
- model_type_index
 Model type index.
- engine_displacement
 Engine displacement.
- no_cylinders
 Number of cylinders.
- transmission_speed
 Transmission speed.
- city_mpg
 City mileage.
- hwy_mpg
 Highway mileage.
- comb_mpg
 Combined mileage.
- guzzler
 Whether the car is considered a "guzzler" or not, a factor with levels N and Y.
- air_aspir_method
 Air aspiration method.
- air_aspir_method_desc
 Air aspiration method description.
- transmission
 Transmission type.
- transmission_desc
 Transmission type description.
- no_gears
 Number of gears.
- trans_lockup
 Whether transmission locks up, a factor with levels N and Y.
- trans_creeper_gear
 A factor with level N only.
- drive_sys
 Drive system, a factor with levels.
- drive_desc
 Drive system description.
- fuel_usage
 Fuel usage, a factor with levels.
- fuel_usage_desc
 Fuel usage description.
- class
 Class of car.
- car_truck
 Car or truck, a factor with levels car, 1, 2.
- release_date
 Date of vehicle release.
- fuel_cell
 Whether the car has a fuel cell or not, a factor with levels N, Y.
Source
Fueleconomy.gov, Shared MPG Estimates: Toyota Prius 2012.
See Also
epa2021
Examples
library(ggplot2)
library(dplyr)
# Variable descriptions
distinct(epa2012, air_aspir_method_desc, air_aspir_method)
distinct(epa2012, transmission_desc, transmission)
distinct(epa2012, drive_desc, drive_sys)
distinct(epa2012, fuel_usage_desc, fuel_usage)
# Guzzlers and their mileages
ggplot(epa2012, aes(x = city_mpg, y = hwy_mpg, color = guzzler)) +
  geom_point() +
  facet_wrap(~guzzler, ncol = 1)
Vehicle info from the EPA for 2021
Description
Details from the EPA.
Usage
epa2021
Format
A data frame with 1108 observations on the following 28 variables.
- model_yr
 a numeric vector
- mfr_name
 Manufacturer name.
- division
 Vehicle division.
- carline
 Vehicle line.
- mfr_code
 Manufacturer code.
- model_type_index
 Model type index.
- engine_displacement
 Engine displacement.
- no_cylinders
 Number of cylinders.
- transmission_speed
 Transmission speed.
- city_mpg
 City mileage.
- hwy_mpg
 Highway mileage.
- comb_mpg
 Combined mileage.
- guzzler
 Whether the car is considered a "guzzler" or not, a factor with levels N and Y.
- air_aspir_method
 Air aspiration method.
- air_aspir_method_desc
 Air aspiration method description.
- transmission
 Transmission type.
- transmission_desc
 Transmission type description.
- no_gears
 Number of gears.
- trans_lockup
 Whether transmission locks up, a factor with levels N and Y.
- trans_creeper_gear
 A factor with level N only.
- drive_sys
 Drive system, a factor with levels.
- drive_desc
 Drive system description.
- fuel_usage
 Fuel usage, a factor with levels.
- fuel_usage_desc
 Fuel usage description.
- class
 Class of car.
- car_truck
 Car or truck, a factor with levels car, 1, ??, 1.
- release_date
 Date of vehicle release.
- fuel_cell
 Whether the car has a fuel cell or not, a factor with levels N, NA.
Source
Fuel Economy Data from fueleconomy.gov. Retrieved 6 May, 2021.
See Also
epa2012
Examples
library(ggplot2)
library(dplyr)
# Variable descriptions
distinct(epa2021, air_aspir_method_desc, air_aspir_method)
distinct(epa2021, transmission_desc, transmission)
distinct(epa2021, drive_desc, drive_sys)
distinct(epa2021, fuel_usage_desc, fuel_usage)
# Guzzlers and their mileages
ggplot(epa2021, aes(x = city_mpg, y = hwy_mpg, color = guzzler)) +
  geom_point() +
  facet_wrap(~guzzler, ncol = 1)
# Compare to 2012
epa2021 |>
  bind_rows(epa2012) |>
  group_by(model_yr) |>
  summarise(
    mean_city = mean(city_mpg),
    mean_hwy  = mean(hwy_mpg)
  )
Environmental Sustainability Index 2005
Description
This dataset comes from the 2005 Environmental Sustainability Index: Benchmarking National Environmental Stewardship. Countries are given an overall sustainability score as well as scores in each of several different environmental areas.
Usage
esi
Format
A data frame with 146 observations on the following 29 variables.
- code
 ISO3 country code.
- country
 Country.
- esi
 Environmental Sustainability Index.
- system
 ESI core component: systems
- stress
 ESI core component: stresses
- vulner
 ESI core component: vulnerability
- cap
 ESI core component: capacity
- global
 ESI core component: global stewardship
- sys_air
 Air quality.
- sys_bio
 Biodiversity.
- sys_lan
 Land.
- sys_wql
 Water quality.
- sys_wqn
 Water quantity.
- str_air
 Reducing air pollution.
- str_eco
 Reducing ecosystem stress.
- str_pop
 Reducing population pressure.
- str_was
 Reducing waste and consumption pressures.
- str_wat
 Reducing water stress.
- str_nrm
 Natural resource management.
- vul_hea
 Environmental health.
- vul_sus
 Basic human sustenance.
- vul_dis
 Exposure to natural disasters.
- cap_gov
 Environmental governance.
- cap_eff
 Eco-efficiency.
- cap_pri
 Private sector responsiveness.
- cap_st
 Science and technology.
- glo_col
 Participation in international collaboration efforts.
- glo_ghg
 Greenhouse gas emissions.
- glo_tbp
 Reducing transboundary environmental pressures.
Details
ESI and Component scores are presented as standard normal percentiles. Indicator scores are in the form of z-scores. See Appendix A of the report for information on the methodology and Appendix C for more detail on original data sources.
For more information on how each of the indices was calculated, see the documentation linked below.
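As a rough illustration of the two scales mentioned above, the following sketch converts z-scores to standard normal percentiles. The z-scores are made up for illustration (they are not taken from esi), and the report's actual aggregation procedure may differ.

```r
# Hypothetical indicator z-scores (assumed values, not from the esi data)
z <- c(-1.5, 0, 0.8)

# Standard normal percentile: percentage of the N(0, 1) distribution below z
percentile <- 100 * pnorm(z)
round(percentile, 1)
```

A z-score of 0 sits at the 50th percentile; negative z-scores fall below it and positive ones above it.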
Source
ESI Component Indicators. 2005 Environmental Sustainability Index: Benchmarking National Environmental Stewardship, Yale Center for Environmental Law and Policy, Yale University & Center for International Earth Science Information Network (CIESIN), Columbia University.
In collaboration with: World Economic Forum, Geneva, Switzerland, and Joint Research Centre of the European Commission, Ispra, Italy.
Available at https://www.earth.columbia.edu/news/2005/images/ESI2005_policysummary.pdf.
References
Esty, Daniel C., Marc Levy, Tanja Srebotnjak, and Alexander de Sherbinin (2005). 2005 Environmental Sustainability Index: Benchmarking National Environmental Stewardship. New Haven: Yale Center for Environmental Law and Policy.
Examples
library(ggplot2)
ggplot(esi, aes(x = cap_st, y = glo_col)) +
  geom_point(color = ifelse(esi$code == "USA", "red", "black")) +
  geom_text(
    aes(label = ifelse(code == "USA", as.character(code), "")),
    hjust = 1.2, color = "red"
  ) +
  labs(x = "Science and technology", y = "Participation in international collaboration efforts")
ggplot(esi, aes(x = vulner, y = cap)) +
  geom_point(color = ifelse(esi$code == "USA", "red", "black")) +
  geom_text(
    aes(label = ifelse(code == "USA", as.character(code), "")),
    hjust = 1.2, color = "red"
  ) +
  labs(x = "Vulnerability", y = "Capacity")
Ethanol Treatment for Tumors Experiment
Description
Experiment in which three different ethanol-based treatments were tested on oral cancer tumors in hamsters.
Usage
ethanol
Format
A data frame with 24 observations, each representing one hamster, on the following 2 variables.
- treatment
 Treatment the hamster received.
- regress
 a factor with levels no and yes
Details
The ethyl_cellulose and pure_ethanol treatments consisted of
about a quarter of the volume of the tumors, while the
pure_ethanol_16x treatment was 16x that, so about 4 times the size of
the tumors.
Source
Morhard R, et al. 2017. Development of enhanced ethanol ablation as an alternative to surgery in treatment of superficial solid tumors. Scientific Reports 7:8750.
Examples
table(ethanol)
fisher.test(table(ethanol))
Professor evaluations and beauty
Description
The data are gathered from end of semester student evaluations for 463 courses taught by a sample of 94 professors from the University of Texas at Austin. In addition, six students rate the professors' physical appearance. The result is a data frame where each row contains a different course and each column has information on the course and the professor who taught that course.
Usage
evals
Format
A data frame with 463 observations on the following 23 variables.
- course_id
 Variable identifying the course (out of 463 courses).
- prof_id
 Variable identifying the professor who taught the course (out of 94 professors).
- score
 Average professor evaluation score: (1) very unsatisfactory - (5) excellent.
- rank
 Rank of professor: teaching, tenure track, tenured.
- ethnicity
 Ethnicity of professor: not minority, minority.
- gender
 Gender of professor: female, male.
- language
 Language of school where professor received education: English or non-English.
- age
 Age of professor.
- cls_perc_eval
 Percent of students in class who completed evaluation.
- cls_did_eval
 Number of students in class who completed evaluation.
- cls_students
 Total number of students in class.
- cls_level
 Class level: lower, upper.
- cls_profs
 Number of professors teaching sections in course in sample: single, multiple.
- cls_credits
 Number of credits of class: one credit (lab, PE, etc.), multi credit.
- bty_f1lower
 Beauty rating of professor from lower level female: (1) lowest - (10) highest.
- bty_f1upper
 Beauty rating of professor from upper level female: (1) lowest - (10) highest.
- bty_f2upper
 Beauty rating of professor from second upper level female: (1) lowest - (10) highest.
- bty_m1lower
 Beauty rating of professor from lower level male: (1) lowest - (10) highest.
- bty_m1upper
 Beauty rating of professor from upper level male: (1) lowest - (10) highest.
- bty_m2upper
 Beauty rating of professor from second upper level male: (1) lowest - (10) highest.
- bty_avg
 Average beauty rating of professor.
- pic_outfit
 Outfit of professor in picture: not formal, formal.
- pic_color
 Color of professor's picture: color, black & white.
Source
Daniel S. Hamermesh, Amy Parker, Beauty in the classroom: instructors’ pulchritude and putative pedagogical productivity, Economics of Education Review, Volume 24, Issue 4, 2005. doi:10.1016/j.econedurev.2004.07.013.
Examples
evals
Exam and course grades for statistics students
Description
Grades on three exams and overall course grade for 233 students during several years for a statistics course at a university.
Usage
exam_grades
Format
A data frame with 233 observations, each representing a student.
- semester
 Semester when grades were recorded.
- sex
 Sex of the student as recorded on the university registration system: Man or Woman.
- exam1
 Exam 1 grade.
- exam2
 Exam 2 grade.
- exam3
 Exam 3 grade.
- course_grade
 Overall course grade.
Examples
library(ggplot2)
library(dplyr)
# Course grade vs. each exam
ggplot(exam_grades, aes(x = exam1, y = course_grade)) +
  geom_point()
ggplot(exam_grades, aes(x = exam2, y = course_grade)) +
  geom_point()
ggplot(exam_grades, aes(x = exam3, y = course_grade)) +
  geom_point()
# Semester averages
exam_grades |>
  group_by(semester) |>
  summarise(across(exam1:course_grade, \(x) mean(x, na.rm = TRUE)))
Exam scores
Description
Exam scores from a class of 19 students.
Usage
exams
Format
A data frame with 19 observations on the following variable.
- scores
 a numeric vector
Examples
hist(exams$scores)
Number of Exclusive Relationships
Description
A survey conducted on a reasonably random sample of 203 undergraduates asked, among many other questions, about the number of exclusive relationships these students have been in.
Usage
exclusive_relationship
Format
A data frame with 218 observations on the following variable.
- num
 Number of exclusive relationships.
Examples
summary(exclusive_relationship$num)
table(exclusive_relationship$num)
hist(exclusive_relationship$num)
Can Americans categorize facts and opinions?
Description
Pew Research Center conducted a survey in 2018, asking a sample of U.S. adults to categorize five factual and five opinion statements. This dataset provides data from this survey, with information on the age group of the participant as well as the number of factual and opinion statements they classified correctly (out of 5).
Usage
fact_opinion
Format
A data frame with 5,035 rows and 3 variables.
- age_group
 Age group of survey participant.
- fact_correct
 Number of factual statements classified correctly (out of 5).
- opinion_correct
 Number of opinion statements classified correctly (out of 5).
Source
Younger Americans are better than older Americans at telling factual news statements from opinions, Pew Research Center, October 23, 2018.
Examples
library(ggplot2)
library(dplyr)
library(tidyr)
library(forcats)
# Distribution of fact_correct by age group
ggplot(fact_opinion, aes(x = age_group, y = fact_correct)) +
  geom_boxplot() +
  labs(
    x = "Age group",
    y = "Number correct (factual)",
    title = "Number of factual statements classified correctly by age group"
  )
# Distribution of opinion_correct by age group
ggplot(fact_opinion, aes(x = age_group, y = opinion_correct)) +
  geom_boxplot() +
  labs(
    x = "Age group",
    y = "Number correct (opinion)",
    title = "Number of opinion statements classified correctly by age group"
  )
# Replicating the figure from Pew report (see source for link)
fact_opinion |>
  mutate(
    facts = case_when(
      fact_correct <= 2 ~ "Two or fewer",
      fact_correct %in% c(3, 4) ~ "Three or four",
      fact_correct == 5 ~ "All five"
    ),
    facts = fct_relevel(facts, "Two or fewer", "Three or four", "All five"),
    opinions = case_when(
      opinion_correct <= 2 ~ "Two or fewer",
      opinion_correct %in% c(3, 4) ~ "Three or four",
      opinion_correct == 5 ~ "All five"
    ),
    opinions = fct_relevel(opinions, "Two or fewer", "Three or four", "All five")
  ) |>
  select(-fact_correct, -opinion_correct) |>
  pivot_longer(cols = -age_group, names_to = "question_type", values_to = "n_correct") |>
  ggplot(aes(y = fct_rev(age_group), fill = n_correct)) +
  geom_bar(position = "fill") +
  facet_wrap(~question_type, ncol = 1) +
  scale_fill_viridis_d(guide = guide_legend(reverse = TRUE)) +
  labs(
    x = "Proportion",
    y = "Age group",
    fill = "Number of\ncorrect\nclassifications"
  )
Fade colors
Description
Fade colors so they are transparent.
Usage
fadeColor(col, fade = "FF")
Arguments
| col | An integer, color name, or RGB hexadecimal. |
| fade | The amount to fade col, given as a two-digit hexadecimal string such as "22"; the default is "FF". |
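To make the role of the two arguments concrete, here is a minimal sketch of one way such fading could be done in base R. The helper name fade_sketch is hypothetical and this is not the package's implementation; it simply appends the hexadecimal fade value as an alpha channel, where "00" is fully transparent and "FF" is fully opaque.

```r
# Hypothetical helper for illustration only (not the openintro source code)
fade_sketch <- function(col, fade = "FF") {
  # col2rgb() accepts integers, color names, and hexadecimal strings
  rgb_vals <- grDevices::col2rgb(col)
  # Convert the two-digit hexadecimal string to an integer in 0-255
  alpha <- strtoi(fade, base = 16L)
  grDevices::rgb(
    rgb_vals[1, ], rgb_vals[2, ], rgb_vals[3, ],
    alpha = alpha, maxColorValue = 255
  )
}

fade_sketch("red", "22")
fade_sketch("#0000FF", "22")
```

The result is an "#RRGGBBAA" string that can be passed to the col argument of base plotting functions.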
Author(s)
David Diez
Examples
data(mariokart)
new <- mariokart$cond == "new"
used <- mariokart$cond == "used"
# ===> color numbers <===#
dotPlot(mariokart$total_pr[new],
  ylim = c(0, 3), xlim = c(25, 80), pch = 20,
  col = 2, cex = 2, main = "using regular colors"
)
dotPlot(mariokart$total_pr[used], at = 2, add = TRUE, col = 4, pch = 20, cex = 2)
dotPlot(mariokart$total_pr[new],
  ylim = c(0, 3), xlim = c(25, 80),
  col = fadeColor(2, "22"), pch = 20, cex = 2,
  main = "fading the colors first"
)
dotPlot(mariokart$total_pr[used],
  at = 2, add = TRUE,
  col = fadeColor(4, "22"), pch = 20, cex = 2
)
# ===> color names <===#
dotPlot(mariokart$total_pr[new],
  ylim = c(0, 3), xlim = c(25, 80), pch = 20,
  col = "red", cex = 2, main = "using regular colors"
)
dotPlot(mariokart$total_pr[used], at = 2, add = TRUE, col = "blue", pch = 20, cex = 2)
dotPlot(mariokart$total_pr[new],
  ylim = c(0, 3), xlim = c(25, 80),
  col = fadeColor("red", "22"), pch = 20, cex = 2,
  main = "fading the colors first"
)
dotPlot(mariokart$total_pr[used],
  at = 2, add = TRUE,
  col = fadeColor("blue", "22"), pch = 20, cex = 2
)
# ===> hexadecimal <===#
dotPlot(mariokart$total_pr[new],
  ylim = c(0, 3), xlim = c(25, 80), pch = 20,
  col = "#FF0000", cex = 2, main = "using regular colors"
)
dotPlot(mariokart$total_pr[used],
  at = 2, add = TRUE, col = "#0000FF", pch = 20,
  cex = 2
)
dotPlot(mariokart$total_pr[new],
  ylim = c(0, 3), xlim = c(25, 80),
  col = fadeColor("#FF0000", "22"), pch = 20, cex = 2,
  main = "fading the colors first"
)
dotPlot(mariokart$total_pr[used],
  at = 2, add = TRUE,
  col = fadeColor("#0000FF", "22"), pch = 20, cex = 2
)
# ===> alternative: rgb function <===#
dotPlot(mariokart$total_pr[new],
  ylim = c(0, 3), xlim = c(25, 80), pch = 20,
  col = rgb(1, 0, 0), cex = 2, main = "using regular colors"
)
dotPlot(mariokart$total_pr[used],
  at = 2, add = TRUE, col = rgb(0, 0, 1),
  pch = 20, cex = 2
)
dotPlot(mariokart$total_pr[new],
  ylim = c(0, 3), xlim = c(25, 80),
  col = rgb(1, 0, 0, 1 / 8), pch = 20, cex = 2,
  main = "fading the colors first"
)
dotPlot(mariokart$total_pr[used],
  at = 2, add = TRUE,
  col = rgb(0, 0, 1, 1 / 8), pch = 20, cex = 2
)
Simulated sample of parent / teen college attendance
Description
A simulated dataset based on real population summaries.
Usage
family_college
Format
A data frame with 792 observations on the following 2 variables.
- teen
 Whether the teen goes to college or not.
- parents
 Whether the parent holds a college degree or not.
Source
Simulation based on summary information provided at https://eric.ed.gov/?id=ED460660.
Examples
library(dplyr)
family_college |>
  count(teen, parents)
Nutrition in fast food
Description
Nutrition amounts in 515 fast food items. The author of the data scraped only entrees (not sides, drinks, desserts, etc.).
Usage
fastfood
Format
A data frame with 515 observations on the following 17 variables.
- restaurant
 Name of restaurant
- item
 Name of item
- calories
 Number of calories
- cal_fat
 Calories from fat
- total_fat
 Total fat
- sat_fat
 Saturated fat
- trans_fat
 Trans fat
- cholesterol
 Cholesterol
- sodium
 Sodium
- total_carb
 Total carbs
- fiber
 Fiber
- sugar
 Sugar
- protein
 Protein
- vit_a
 Vitamin A
- vit_c
 Vitamin C
- calcium
 Calcium
- salad
 Salad or not
Source
Retrieved from Tidy Tuesday Fast food entree data.
Summary of male heights from USDA Food Commodity Intake Database
Description
Sample of heights based on the weighted sample in the survey.
Usage
fcid
Format
A data frame with 100 observations on the following 2 variables.
- height
 a numeric vector
- num_of_adults
 a numeric vector
Examples
fcid
Female college student heights, in inches
Description
24 sample observations.
Usage
fheights
Format
A data frame with 24 observations on the following variable.
- heights
 height, in inches
Examples
hist(fheights$heights)
Young fish in the North Sea.
Description
Samples of 50 Tobis fish, or Sand Eels, were collected at three different locations in the North Sea and the number of one-year-old fish was counted.
Usage
fish_age
Format
A data frame with 300 rows and 3 variables:
- year
 Year the fish was caught with levels 1997 and 1998.
- location
 Site the fish was caught with levels A, B and C.
- one_year_old
 Is the fish one-year-old, yes or no?
Source
Henrik Madsen, Paul Thyregod. 2011. Introduction to General and Generalized Linear Models. CRC Press, Boca Raton, FL. ISBN: 978-1-4200-9155-7.
Examples
library(dplyr)
library(tidyr)
# Count the number of one-year-old fish at each location.
fish_age |>
  filter(one_year_old == "yes") |>
  count(year, location) |>
  pivot_wider(names_from = location, values_from = n)
Findings on n-3 Fatty Acid Supplement Health Benefits
Description
The results summarize each of the health outcomes for an experiment where 12,933 subjects received a 1g fish oil supplement daily and 12,938 received a placebo daily. The experiment lasted 5 years.
Usage
fish_oil_18
Format
The format is a list of 24 matrices. Each matrix is a 2x2 table, and below are the named items in the list, which also represent the outcomes.
- major_cardio_event
 Major cardiovascular event. (Primary end point.)
- cardio_event_expanded
 Cardiovascular event in expanded composite endpoint.
- myocardioal_infarction
 Total myocardial infarction. (Heart attack.)
- stroke
 Total stroke.
- cardio_death
 Death from cardiovascular causes.
- PCI
 Percutaneous coronary intervention.
- CABG
 Coronary artery bypass graft.
- total_coronary_heart_disease
 Total coronary heart disease.
- ischemic_stroke
 Ischemic stroke.
- hemorrhagic_stroke
 Hemorrhagic stroke.
- chd_death
 Death from coronary heart disease.
- myocardial_infarction_death
 Death from myocardial infarction.
- stroke_death
 Death from stroke.
- invasive_cancer
 Invasive cancer of any type. (Primary end point.)
- breast_cancer
 Breast cancer.
- prostate_cancer
 Prostate cancer.
- colorectal_cancer
 Colorectal cancer.
- cancer_death
 Death from cancer.
- death
 Death from any cause.
- major_cardio_event_after_2y
 Major cardiovascular event, excluding the first 2 years of follow-up.
- myocardial_infarction_after_2y
 Total myocardial infarction, excluding the first 2 years of follow-up.
- invasive_cancer_after_2y
 Invasive cancer of any type, excluding the first 2 years of follow-up.
- cancer_death_after_2y
 Death from cancer, excluding the first 2 years of follow-up.
- death_after_2y
 Death from any cause, excluding the first 2 years of follow-up.
Source
Manson JE, et al. 2018. Marine n-3 Fatty Acids and Prevention of Cardiovascular Disease and Cancer. NEJMoa1811403. doi:10.1056/NEJMoa1811403.
Examples
names(fish_oil_18)
(tab <- fish_oil_18[["major_cardio_event"]])
chisq.test(tab)
fisher.test(tab)
(tab <- fish_oil_18[["myocardioal_infarction"]])
chisq.test(tab)
fisher.test(tab)
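Because each outcome is stored as a 2x2 table, risks in the two arms can also be compared directly. The sketch below uses made-up counts, and the row and column labels are assumptions rather than the labels used in fish_oil_18.

```r
# Hypothetical 2x2 table (assumed counts and labels, not from fish_oil_18)
tab_sketch <- matrix(
  c(40, 60, 960, 940),
  nrow = 2,
  dimnames = list(c("treatment", "placebo"), c("event", "no event"))
)

# Risk of the event within each arm
risk <- tab_sketch[, "event"] / rowSums(tab_sketch)
risk

# Relative risk of treatment versus placebo
risk["treatment"] / risk["placebo"]
```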
River flow data
Description
Flow rates (measured in cubic feet per second) of Clarks Creek, Leach Creek, Silver Creek, and Wildwood Creek Spring collected by volunteers of the Pierce Conservation District in the State of Washington in the US.
Usage
flow_rates
Format
A data frame with 31 rows and 3 variables.
- site
 Location where measurements were taken.
- date
 Date measurements were taken.
- flow
 Flow rate of the river in cubic feet per second.
Source
Pierce County Water Data Viewer.
Examples
library(ggplot2)
# River flow rates by site
ggplot(flow_rates, aes(x = site, y = flow)) +
  geom_boxplot() +
  labs(
    title = "River flow rates by site",
    x = "Site",
    y = expression(paste("Flow (ft"^3 * "/s)"))
  )
# River flow rates over time
ggplot(flow_rates, aes(x = date, y = flow, color = site, shape = site)) +
  geom_point(size = 2) +
  labs(
    title = "River flow rates over time",
    x = "Date",
    y = expression(paste("Flow (ft"^3 * "/s)")),
    color = "Site", shape = "Site"
  )
Friday the 13th
Description
This dataset addresses issues of how superstitions regarding Friday the 13th affect human behavior, and whether Friday the 13th is an unlucky day. Scanlon, et al. collected data on traffic and shopping patterns and accident frequency for Fridays the 6th and 13th between October of 1989 and November of 1992.
Usage
friday
Format
A data frame with 61 observations and 6 variables.
- type
 Type of observation, traffic, shopping, or accident.
- date
 Year and month of observation.
- sixth
 Counts on the 6th of the month.
- thirteenth
 Counts on the 13th of the month.
- diff
 Difference between the sixth and the thirteenth.
- location
 Location where data is collected.
Details
There are three types of observations: traffic, shopping, and accident. For traffic, the researchers obtained information from the British Department of Transport regarding the traffic flows between junctions 7 to 8 and junctions 9 to 10 of the M25 motorway. For shopping, they collected the numbers of shoppers in nine different supermarkets in southeast England. For accidents, they collected numbers of emergency admissions to hospitals due to transport accidents.
Source
Scanlon, T.J., Luben, R.N., Scanlon, F.L., Singleton, N. (1993), "Is Friday the 13th Bad For Your Health?," BMJ, 307, 1584-1586. https://dasl.datadescription.com/datafile/friday-the-13th-traffic and https://dasl.datadescription.com/datafile/friday-the-13th-accidents.
Examples
library(dplyr)
library(ggplot2)
friday |>
  filter(type == "traffic") |>
  ggplot(aes(x = sixth)) +
  geom_histogram(binwidth = 2000) +
  xlim(110000, 140000)
friday |>
  filter(type == "traffic") |>
  ggplot(aes(x = thirteenth)) +
  geom_histogram(binwidth = 2000) +
  xlim(110000, 140000)
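Since each row pairs a count on the 6th with a count on the 13th of the same month, a paired comparison is a natural follow-up. The sketch below uses made-up counts (not values from friday) to show how the difference relates to a paired t-test.

```r
# Hypothetical paired counts (assumed values, not from the friday data)
sixth <- c(1200, 1350, 1280, 1410)
thirteenth <- c(1180, 1320, 1275, 1380)

# Difference between the sixth and the thirteenth, as in the diff variable
d <- sixth - thirteenth
mean(d)

# A paired t-test is equivalent to a one-sample t-test on the differences
t.test(sixth, thirteenth, paired = TRUE)
```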
Poll about use of full-body airport scanners
Description
Poll about use of full-body airport scanners, where about 4-in-5 people supported the use of the scanners.
Usage
full_body_scan
Format
A data frame with 1137 observations on the following 2 variables.
- answer
 a factor with levels do not know / no answer, should, should not
- party.affiliation
 a factor with levels Democrat, Independent, Republican
Source
S. Condon. Poll: 4 in 5 Support Full-Body Airport Scanners. In: CBS News (2010).
Examples
full_body_scan
GDP Countries Data.
Description
From World Bank, GDP in current U.S. dollars 1960-2020 by decade
Usage
gdp_countries
Format
A data frame with 659 rows and 9 variables.
- country
 Name of country.
- description
 description of data: GDP (in current US$), GDP growth (annual %), GDP per capita (in current US$)
- year_1960
 value in 1960
- year_1970
 value in 1970
- year_1980
 value in 1980
- year_1990
 value in 1990
- year_2000
 value in 2000
- year_2010
 value in 2010
- year_2020
 value in 2020
Source
Examples
library(dplyr)
# don't use scientific notation
options(scipen = 999)
# List the top 10 countries by GDP (There is a row for World)
gdp_countries |>
  filter(description == "GDP") |>
  # sort on the numeric column before formatting it as text
  arrange(desc(year_2020)) |>
  slice_head(n = 11) |>
  mutate(year2020 = format(year_2020, big.mark = ",")) |>
  select(country, year2020)
# List the 10 countries with the biggest GDP per capita change from 1960 to 2020
gdp_countries |>
  filter(description == "GDP per capita") |>
  mutate(change = round(year_2020 - year_1960, 0)) |>
  select(country, change, year_1960, year_2020) |>
  na.omit() |>
  # sort on the numeric change before formatting it as text
  arrange(desc(change)) |>
  slice_head(n = 10) |>
  mutate(change = format(change, big.mark = ","))
Fake data for a gear company example
Description
Made-up data for whether a sample of two gear companies' parts pass inspection.
Usage
gear_company
Format
A data frame with 2000 observations on the following 2 variables.
- company
 a factor with levels current and prospective
- outcome
 a factor with levels not and pass
Examples
gear_company
Bank manager recommendations based on gender
Description
Study from the 1970s about whether gender influences hiring recommendations.
Usage
gender_discrimination
Format
A data frame with 48 observations on the following 2 variables.
- gender
 a factor with levels female and male
- decision
 a factor with levels not promoted and promoted
Source
Rosen B and Jerdee T. 1974. Influence of sex role stereotypes on personnel decisions. Journal of Applied Psychology 59(1):9-14.
Examples
library(ggplot2)
table(gender_discrimination)
ggplot(gender_discrimination, aes(y = gender, fill = decision)) +
  geom_bar(position = "fill")
Get it Dunn Run, Race Times
Description
Get it Dunn is a small regional run that got extra attention when a runner, Nichole Porath, made the Guinness Book of World Records for the fastest time pushing a double stroller in a half marathon. This dataset contains results from the 2017 and 2018 races.
Usage
get_it_dunn_run
Format
A data frame with 978 observations on the following 10 variables.
- date
 Date of the run.
- race
 Run distance.
- bib_num
 Bib number of the runner.
- first_name
 First name of the runner.
- last_initial
 Initial of the runner's last name.
- sex
 Sex of the runner.
- age
 Age of the runner.
- city
 City of residence.
- state
 State of residence.
- run_time_minutes
 Run time, in minutes.
Source
Data were collected from GSE Timing: 2018 data, 2017 race data.
Examples
d <- subset(
  get_it_dunn_run,
  race == "5k" & date == "2018-05-12" &
    !is.na(age) & state %in% c("MN", "WI")
)
head(d)
m <- lm(run_time_minutes ~ sex + age + state, d)
summary(m)
plot(m$fitted.values, m$residuals)
boxplot(m$residuals ~ d$sex)
plot(m$residuals ~ d$age)
hist(m$residuals)
Analytical skills of young gifted children
Description
An investigator is interested in understanding the relationship, if any, between the analytical skills of young gifted children and the following variables: father's IQ; mother's IQ; age in months when the child first said "mummy" or "daddy"; age in months when the child first counted to 10 successfully; average number of hours per week the child's mother or father reads to the child; average number of hours per week the child watched an educational program on TV during the past three months; and average number of hours per week the child watched cartoons on TV during the past three months. The analytical skills are evaluated using a standard testing procedure, and the score on this test is used as the response variable.
Usage
gifted
Format
A data frame with 36 observations and 8 variables.
- score
 Score in test of analytical skills.
- fatheriq
 Father's IQ.
- motheriq
 Mother's IQ.
- speak
 Age in months when the child first said "mummy" or "daddy".
- count
 Age in months when the child first counted to 10 successfully.
- read
 Average number of hours per week the child's mother or father reads to the child.
- edutv
 Average number of hours per week the child watched an educational program on TV during the past three months.
- cartoons
 Average number of hours per week the child watched cartoons on TV during the past three months.
Details
Data were collected from schools in a large city on a set of thirty-six children who were identified as gifted children soon after they reached the age of four.
Source
Graybill, F.A. & Iyer, H.K., (1994) Regression Analysis: Concepts and Applications, Duxbury, p. 511-6.
Examples
gifted
Pew survey on global warming
Description
A 2010 Pew Research poll asked 1,306 Americans, "From what you've read and heard, is there solid evidence that the average temperature on earth has been getting warmer over the past few decades, or not?"
Usage
global_warming_pew
Format
A data frame with 2253 observations on the following 2 variables.
- party_or_ideology
 a factor with levels Conservative Republican, Liberal Democrat, Mod/Cons Democrat, Mod/Lib Republican
- response
 Response.
Source
Pew Research Center, Majority of Republicans No Longer See Evidence of Global Warming, data collected on October 27, 2010.
Examples
global_warming_pew
Google stock data
Description
Google stock data from 2006 to early 2014, where data from the first day each month was collected.
Usage
goog
Format
A data frame with 98 observations on the following 7 variables.
- date
 a factor with levels 2006-01-03, 2006-02-01, and so on
- open
 a numeric vector
- high
 a numeric vector
- low
 a numeric vector
- close
 a numeric vector
- volume
 a numeric vector
- adj_close
 a numeric vector
Source
Yahoo! Finance.
Examples
goog
Pew Research poll on government approval ratings
Description
The poll's focus is on Obama and then Democrats and Republicans in Congress.
Usage
gov_poll
Format
A data frame with 4223 observations on the following 2 variables.
- poll
 a factor with levels approve and disapprove
- eval
 a factor with levels Democrats, Obama, Republicans
Source
See the Pew Research website: www.people-press.org/2012/03/14/romney-leads-gop-contest-trails-in-matchup-with-obama. The counts in Table 6.19 are approximate.
Examples
gov_poll
Survey of Duke students on GPA, studying, and more
Description
A survey of 55 Duke University students asked about their GPA, number of hours they study at night, number of nights they go out, and their gender.
Usage
gpa
Format
A data frame with 55 observations on the following 5 variables.
- gpa
 a numeric vector
- studyweek
 a numeric vector
- sleepnight
 a numeric vector
- out
 a numeric vector
- gender
 a factor with levels female and male
Examples
gpa
Sample of students and their GPA and IQ
Description
Data on 78 students including GPA, IQ, and gender.
Usage
gpa_iq
Format
A data frame with 78 observations representing students on the following 5 variables.
- obs
 a numeric vector
- gpa
 Grade point average (GPA).
- iq
 IQ.
- gender
 Gender.
- concept
 a numeric vector
Examples
gpa_iq
gpa_study_hours
Description
A data frame with 193 rows and 2 columns. The columns represent the
variables gpa and study_hours for a sample of 193
undergraduate students who took an introductory statistics course in 2012 at
a private US university.
Usage
gpa_study_hours
Format
A data frame with 193 observations on the following 2 variables.
- gpa
 Grade point average (GPA) of student.
- study_hours
 Number of hours students study per week.
Details
GPA ranges from 0 to 4 points; however, one student reported a GPA > 4. This is a data error, but the observation has been left in the dataset because it illustrates issues with real survey data. Both variables are self-reported and hence may not be accurate.
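A quick range check is one way to surface this kind of data error. The sketch below runs on a small made-up data frame (the name survey_sketch and its values are hypothetical) rather than on gpa_study_hours itself.

```r
# Hypothetical survey responses (assumed values, not from gpa_study_hours)
survey_sketch <- data.frame(
  gpa = c(3.2, 3.9, 4.3, 2.8),
  study_hours = c(10, 25, 12, 8)
)

# Flag GPAs outside the valid 0-4 range
out_of_range <- survey_sketch$gpa < 0 | survey_sketch$gpa > 4
survey_sketch[out_of_range, ]
```

The same logical condition could be applied to the gpa column of the real dataset to locate the reported GPA above 4.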
Source
Collected at a private US university as part of an anonymous survey in an introductory statistics course.
Examples
library(ggplot2)
ggplot(gpa_study_hours, aes(x = study_hours, y = gpa)) +
  geom_point(alpha = 0.5) +
  labs(x = "Study hours/week", y = "GPA")
Simulated data for analyzing the relationship between watching TV and grades
Description
This is a simulated dataset to be used to estimate the relationship between number of hours per week students watch TV and the grade they got in a statistics class.
Usage
gradestv
Format
A data frame with 25 observations on the following 2 variables.
- tv
 Number of hours per week students watch TV.
- grades
 Grades students got in a statistics class (out of 100).
Details
There are a few potential outliers in this dataset. When analyzing the data one should consider how (if at all) these outliers may affect the estimates of correlation coefficient and regression parameters.
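To see how much a single outlier can matter, the following sketch compares the correlation and the fitted slope with and without one unusual point. The values are made up for illustration and are not taken from gradestv.

```r
# Hypothetical data (assumed values): the sixth point is an outlier
tv <- c(2, 4, 6, 8, 10, 30)
grades <- c(90, 85, 80, 75, 70, 95)

# Correlation with and without the outlier
cor_all <- cor(tv, grades)
cor_trim <- cor(tv[-6], grades[-6])
c(all = cor_all, without_outlier = cor_trim)

# Regression slope with and without the outlier
slope_all <- coef(lm(grades ~ tv))[["tv"]]
slope_trim <- coef(lm(grades[-6] ~ tv[-6]))[[2]]
c(all = slope_all, without_outlier = slope_trim)
```

In this toy example the first five points lie exactly on a decreasing line, so removing the outlier flips the sign of both the correlation and the slope.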
Source
Simulated data
Examples
library(ggplot2)
ggplot(gradestv, aes(x = tv, y = grades)) +
  geom_point() +
  geom_smooth(method = "lm")
Simulated Google search experiment
Description
The data were simulated to look like sample results from a Google search experiment.
Usage
gsearch
Format
A data frame with 10000 observations on the following 2 variables.
- type
 a factor with levels new search and no new search
- outcome
 a factor with levels current, test 1, test 2
Examples
library(ggplot2)
table(gsearch$type, gsearch$outcome)
ggplot(gsearch, aes(x = type, fill = outcome)) +
  geom_bar(position = "fill") +
  labs(y = "proportion")
2010 General Social Survey
Description
Data from the 2010 General Social Survey.
Usage
gss2010
Format
A data frame with 2044 observations on the following 5 variables.
- hrsrelax
 After an average work day, about how many hours do you have to relax or pursue activities that you enjoy?
- mntlhlth
 For how many days during the past 30 days was your mental health, which includes stress, depression, and problems with emotions, not good?
- hrs1
 Hours worked each week.
- degree
 Educational attainment or degree.
- grass
 Do you think the use of marijuana should be made legal, or not?
Source
US 2010 General Social Survey.
Examples
gss2010
gss_wordsum_class
Description
A data frame containing data from the General Social Survey.
Usage
gss_wordsum_class
Format
A data frame with 795 observations on the following 2 variables.
- wordsum
 A vocabulary score calculated based on a ten question vocabulary test, where a higher score means better vocabulary. Scores range from 1 to 10.
- class
 Self-identified social class has 4 levels: lower, working, middle, and upper class.
Examples
library(dplyr)
gss_wordsum_class |>
  group_by(class) |>
  summarize(mean_wordsum = mean(wordsum))
Health Coverage and Health Status
Description
Survey responses from 20,000 respondents to the Behavioral Risk Factor Surveillance System.
Usage
health_coverage
Format
A data frame with 20000 observations on the following 2 variables.
- coverage
 Whether the person had health coverage or not.
- health_status
 The person's health status.
Source
Office of Surveillance, Epidemiology, and Laboratory Services Behavioral Risk Factor Surveillance System, BRFSS 2010 Survey Data.
Examples
table(health_coverage)
Pew Research Center poll on health care, including question variants
Description
Pew Research Center conducted a survey with the following question: "As you may know, by 2014 nearly all Americans will be required to have health insurance. [People who do not buy insurance will pay a penalty] while [people who cannot afford it will receive financial help from the government]. Do you approve or disapprove of this policy?" For each randomly sampled respondent, the statements in brackets were randomized: either they were kept in the order given above, or the two statements were reversed.
Usage
healthcare_law_survey
Format
A data frame with 1503 observations on the following 2 variables.
- order
 a factor with levels cannot_afford_second and penalty_second
- response
 a factor with levels approve, disapprove, and other
Source
www.people-press.org/2012/03/26/public-remains-split-on-health-care-bill-opposed-to-mandate/. Sample sizes for each polling group are approximate.
Examples
healthcare_law_survey
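To compare responses across the two question orderings, a row-wise proportion table is one simple option. This is a minimal sketch using base R only, assuming the openintro package is installed.

```r
library(openintro)

# Cross-tabulate statement order against response
counts <- table(healthcare_law_survey$order, healthcare_law_survey$response)
counts

# Convert to proportions within each ordering (each row sums to 1)
props <- prop.table(counts, margin = 1)
round(props, 3)
```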
Heart Transplant Data
Description
The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was officially designated a heart transplant candidate, meaning that he was gravely ill and would most likely benefit from a new heart. The actual transplant then occurred a few weeks to several months later, depending on the availability of a donor. A very small number of candidates improved during this waiting period and were deselected as heart transplant candidates, but for the purposes of this experiment those patients were kept in the data as continuing candidates.
Usage
heart_transplant
Format
A data frame with 103 observations on the following 8 variables.
- id
 ID number of the patient.
- acceptyear
 Year of acceptance as a heart transplant candidate.
- age
 Age of the patient at the beginning of the study.
- survived
 Survival status with levels alive and dead.
- survtime
 Number of days patients were alive after the date they were determined to be a candidate for a heart transplant until the termination date of the study.
- prior
 Whether or not the patient had prior surgery, with levels yes and no.
- transplant
 Transplant status with levels control (did not receive a transplant) and treatment (received a transplant).
- wait
 Waiting time for transplant.
Source
http://www.stat.ucla.edu/~jsanchez/data/stanford.txt
References
Turnbull B, Brown B, and Hu M (1974). "Survivorship of heart transplant data." Journal of the American Statistical Association, vol. 69, pp. 74-80.
Examples
library(ggplot2)
ggplot(heart_transplant, aes(x = transplant, y = survtime)) +
  geom_boxplot() +
  labs(x = "Transplant", y = "Survival time (days)")
ggplot(heart_transplant, aes(x = transplant, fill = survived)) +
  geom_bar(position = "fill") +
  labs(x = "Transplant", y = "Proportion", fill = "Outcome")
Helium football
Description
At the 1976 Pro Bowl, Ray Guy, a punter for the Oakland Raiders, punted a ball that hung mid-air long enough for officials to question whether the pigskin was filled with helium. The ball was found to be filled with air, but since then many have tossed around the idea that a helium-filled football would outdistance an air-filled one. Students at Ohio State University conducted an experiment to test this myth. They used two identical footballs, one filled with air and one filled with helium. Each football was kicked 39 times and the two footballs were alternated with each kick.
Usage
helium
Format
A data frame with 39 observations on the following 3 variables.
- trial
 Trial number.
- air
 Distance in yards for air-filled football.
- helium
 Distance in yards for helium-filled football.
Details
Lafferty, M. B. (1993), "OSU scientists get a kick out of sports controversy," The Columbus Dispatch (November 21, 1993), B7.
Source
Previously part of the Data and Story Library, https://dasl.datadescription.com. Removed as of 2020.
Examples
boxPlot(helium$air, xlab = "air")
boxPlot(helium$helium, xlab = "helium")
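Because each trial records one kick of each ball, one possible analysis looks at the per-trial difference in distance. The sketch below assumes the openintro package is installed. Treating the two kicks in a trial as a matched pair is an assumption made for illustration, not something the dataset documentation prescribes.

```r
library(openintro)

# Per-trial difference in distance: helium-filled minus air-filled
diff_dist <- helium$helium - helium$air
summary(diff_dist)

# Paired t-test of the mean difference (assumes trials can be treated as pairs)
paired_test <- t.test(helium$helium, helium$air, paired = TRUE)
paired_test
```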
Socioeconomic status and reduced-fee school lunches
Description
Examining the relationship between socioeconomic status measured as the percentage of children in a neighborhood receiving reduced-fee lunches at school (lunch) and the percentage of bike riders in the neighborhood wearing helmets (helmet).
Usage
helmet
Format
A data frame with 12 observations representing neighborhoods on the following 2 variables.
- lunch
 Percent of students receiving reduced-fee school lunches.
- helmet
 Percent of bike riders wearing helmets.
Examples
library(ggplot2)
ggplot(helmet, aes(x = lunch, y = helmet)) +
  geom_point()
Human Freedom Index
Description
The Human Freedom Index is a report that attempts to summarize the idea of "freedom" through a large set of variables for many countries around the globe. It serves as a rough objective measure for the relationships between the different types of freedom (political, religious, economic, or personal) and other social and economic circumstances. The Human Freedom Index is an annual report co-published by the Cato Institute, the Fraser Institute, and the Liberales Institut at the Friedrich Naumann Foundation for Freedom.
Usage
hfi
Format
A data frame with 1458 observations on the following 123 variables.
- year
 Year
- ISO_code
 ISO code of country
- countries
 Name of country
- region
 Region where country is located
- pf_rol_procedural
 Procedural justice
- pf_rol_civil
 Civil justice
- pf_rol_criminal
 Criminal justice
- pf_rol
 Rule of law
- pf_ss_homicide
 Homicide
- pf_ss_disappearances_disap
 Disappearances
- pf_ss_disappearances_violent
 Violent conflicts
- pf_ss_disappearances_organized
 Organized conflicts
- pf_ss_disappearances_fatalities
 Terrorism fatalities
- pf_ss_disappearances_injuries
 Terrorism injuries
- pf_ss_disappearances
 Disappearances, conflict, and terrorism
- pf_ss_women_fgm
 Female genital mutilation
- pf_ss_women_missing
 Missing women
- pf_ss_women_inheritance_widows
 Inheritance rights for widows
- pf_ss_women_inheritance_daughters
 Inheritance rights for daughters
- pf_ss_women_inheritance
 Inheritance
- pf_ss_women
 Women's security
- pf_ss
 Security and safety
- pf_movement_domestic
 Freedom of domestic movement
- pf_movement_foreign
 Freedom of foreign movement
- pf_movement_women
 Women's movement
- pf_movement
 Freedom of movement
- pf_religion_estop_establish
 Freedom to establish religious organizations
- pf_religion_estop_operate
 Freedom to operate religious organizations
- pf_religion_estop
 Freedom to establish and operate religious organizations
- pf_religion_harassment
 Harassment and physical hostilities
- pf_religion_restrictions
 Legal and regulatory restrictions
- pf_religion
 Religious freedom
- pf_association_association
 Freedom of association
- pf_association_assembly
 Freedom of assembly
- pf_association_political_establish
 Freedom to establish political parties
- pf_association_political_operate
 Freedom to operate political parties
- pf_association_political
 Freedom to establish and operate political parties
- pf_association_prof_establish
 Freedom to establish professional organizations
- pf_association_prof_operate
 Freedom to operate professional organizations
- pf_association_prof
 Freedom to establish and operate professional organizations
- pf_association_sport_establish
 Freedom to establish educational, sporting, and cultural organizations
- pf_association_sport_operate
 Freedom to operate educational, sporting, and cultural organizations
- pf_association_sport
 Freedom to establish and operate educational, sporting, and cultural organizations
- pf_association
 Freedom to associate and assemble with peaceful individuals or organizations
- pf_expression_killed
 Press killed
- pf_expression_jailed
 Press jailed
- pf_expression_influence
 Laws and regulations that influence media content
- pf_expression_control
 Political pressures and controls on media content
- pf_expression_cable
 Access to cable/satellite
- pf_expression_newspapers
 Access to foreign newspapers
- pf_expression_internet
 State control over internet access
- pf_expression
 Freedom of expression
- pf_identity_legal
 Legal gender
- pf_identity_parental_marriage
 Parental rights in marriage
- pf_identity_parental_divorce
 Parental rights after divorce
- pf_identity_parental
 Parental rights
- pf_identity_sex_male
 Male-to-male relationships
- pf_identity_sex_female
 Female-to-female relationships
- pf_identity_sex
 Same-sex relationships
- pf_identity_divorce
 Divorce
- pf_identity
 Identity and relationships
- pf_score
 Personal Freedom (score)
- pf_rank
 Personal Freedom (rank)
- ef_government_consumption
 Government consumption
- ef_government_transfers
 Transfers and subsidies
- ef_government_enterprises
 Government enterprises and investments
- ef_government_tax_income
 Top marginal income tax rate - Top marginal income tax rates
- ef_government_tax_payroll
 Top marginal income tax rate - Top marginal income and payroll tax rate
- ef_government_tax
 Top marginal tax rate
- ef_government
 Size of government
- ef_legal_judicial
 Judicial independence
- ef_legal_courts
 Impartial courts
- ef_legal_protection
 Protection of property rights
- ef_legal_military
 Military interference in rule of law and politics
- ef_legal_integrity
 Integrity of the legal system
- ef_legal_enforcement
 Legal enforcement of contracts
- ef_legal_restrictions
 Regulatory restrictions on the sale of real property
- ef_legal_police
 Reliability of police
- ef_legal_crime
 Business costs of crime
- ef_legal_gender
 Gender adjustment
- ef_legal
 Legal system and property rights
- ef_money_growth
 Money growth
- ef_money_sd
 Standard deviation of inflation
- ef_money_inflation
 Inflation - most recent year
- ef_money_currency
 Freedom to own foreign currency bank account
- ef_money
 Sound money
- ef_trade_tariffs_revenue
 Tariffs - Revenue from trade taxes (percentage of trade sector)
- ef_trade_tariffs_mean
 Tariffs - Mean tariff rate
- ef_trade_tariffs_sd
 Tariffs - Standard deviation of tariffs rates
- ef_trade_tariffs
 Tariffs
- ef_trade_regulatory_nontariff
 Regulatory trade barriers - Nontariff trade barriers
- ef_trade_regulatory_compliance
 Regulatory trade barriers - Compliance costs of importing and exporting
- ef_trade_regulatory
 Regulatory trade barriers
- ef_trade_black
 Black-market exchange rates
- ef_trade_movement_foreign
 Controls of the movement of capital and people - Foreign ownership/investment restrictions
- ef_trade_movement_capital
 Controls of the movement of capital and people - Capital controls
- ef_trade_movement_visit
 Controls of the movement of capital and people - Freedom of foreigners to visit
- ef_trade_movement
 Controls of the movement of capital and people
- ef_trade
 Freedom to trade internationally
- ef_regulation_credit_ownership
 Credit market regulations - Ownership of banks
- ef_regulation_credit_private
 Credit market regulations - Private sector credit
- ef_regulation_credit_interest
 Credit market regulations - Interest rate controls/negative real interest rates
- ef_regulation_credit
 Credit market regulation
- ef_regulation_labor_minwage
 Labor market regulations - Hiring regulations and minimum wage
- ef_regulation_labor_firing
 Labor market regulations - Hiring and firing regulations
- ef_regulation_labor_bargain
 Labor market regulations - Centralized collective bargaining
- ef_regulation_labor_hours
 Labor market regulations - Hours regulations
- ef_regulation_labor_dismissal
 Labor market regulations - Dismissal regulations
- ef_regulation_labor_conscription
 Labor market regulations - Conscription
- ef_regulation_labor
 Labor market regulation
- ef_regulation_business_adm
 Business regulations - Administrative requirements
- ef_regulation_business_bureaucracy
 Business regulations - Bureaucracy costs
- ef_regulation_business_start
 Business regulations - Starting a business
- ef_regulation_business_bribes
 Business regulations - Extra payments/bribes/favoritism
- ef_regulation_business_licensing
 Business regulations - Licensing restrictions
- ef_regulation_business_compliance
 Business regulations - Cost of tax compliance
- ef_regulation_business
 Business regulation
- ef_regulation
 Economic freedom regulation score
- ef_score
 Economic freedom score
- ef_rank
 Economic freedom rank
- hf_score
 Human freedom score
- hf_rank
 Human freedom rank
- hf_quartile
 Human freedom quartile
Details
This dataset contains information from Human Freedom Index reports from 2008-2016.
Source
Ian Vasquez and Tanja Porcnik, The Human Freedom Index 2018: A Global Measurement of Personal, Civil, and Economic Freedom (Washington: Cato Institute, Fraser Institute, and the Friedrich Naumann Foundation for Freedom, 2018). https://www.cato.org/sites/cato.org/files/human-freedom-index-files/human-freedom-index-2016.pdf. https://www.kaggle.com/gsutters/the-human-freedom-index.
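This entry has no examples of its own, so here is a minimal sketch of one way to explore the data, assuming the openintro and ggplot2 packages are installed. It plots the personal freedom score against the economic freedom score for 2016, the last year the Details section says is covered. The choice of year and of these two variables is purely illustrative.

```r
library(openintro)
library(ggplot2)

# Keep only rows for the year 2016
hfi_2016 <- subset(hfi, year == 2016)

# Personal vs. economic freedom; na.rm = TRUE silently drops countries
# with a missing score instead of emitting a warning
ggplot(hfi_2016, aes(x = pf_score, y = ef_score)) +
  geom_point(na.rm = TRUE) +
  labs(
    x = "Personal freedom score",
    y = "Economic freedom score",
    title = "Human Freedom Index, 2016"
  )
```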
Histogram or hollow histogram
Description
Create histograms and hollow histograms. This function permits easy color and appearance customization.
Usage
histPlot(
  x,
  col = fadeColor("black", "22"),
  border = "black",
  breaks = "default",
  probability = FALSE,
  hollow = FALSE,
  add = FALSE,
  lty = 2,
  lwd = 1,
  freqTable = FALSE,
  right = TRUE,
  axes = TRUE,
  xlab = NULL,
  ylab = NULL,
  xlim = NULL,
  ylim = NULL,
  ...
)
Arguments
x | 
 Numerical vector or a frequency table (matrix) where the first
column represents the observed values and the second column the frequencies.
See also   | 
col | 
 Shading of the histogram bins.  | 
border | 
 Color of histogram bin borders.  | 
breaks | 
 A vector for the bin boundaries or an approximate number of bins.  | 
probability | 
 If   | 
hollow | 
 If   | 
add | 
 If   | 
lty | 
 Line type. Applies only if   | 
lwd | 
 Line width. Applies only if   | 
freqTable | 
 Set to   | 
right | 
 Set to   | 
axes | 
 If   | 
xlab | 
 Label for the x axis.  | 
ylab | 
 Label for the y axis.  | 
xlim | 
 Limits for the x axis.  | 
ylim | 
 Limits for the y axis.  | 
... | 
 Additional arguments to   | 
Author(s)
David Diez
Examples
histPlot(tips$tip, main = "Tips")
# overlaid hollow histograms
histPlot(tips$tip[tips$day == "Tuesday"],
  probability = TRUE,
  hollow = TRUE,
  main = "Tips by day"
)
histPlot(tips$tip[tips$day == "Friday"],
  probability = TRUE,
  hollow = TRUE,
  add = TRUE,
  lty = 3,
  border = "red"
)
legend("topright",
  col = c("black", "red"),
  lty = 2:3, # match the line types drawn above (default 2, then 3)
  legend = c("Tuesday", "Friday")
)
# breaks and colors
histPlot(tips$tip,
  col = fadeColor("yellow", "33"),
  border = "darkblue",
  probability = TRUE,
  breaks = 30,
  lwd = 3
)
# custom breaks
brks <- c(-1, 0, 1, 2, 3, 4, seq(5, 20, 5), 22, 24, 26)
histPlot(tips$tip,
  probability = TRUE,
  breaks = brks,
  col = fadeColor("darkgoldenrod4", "33"),
  xlim = c(0, 26)
)
United States House of Representatives historical make-up
Description
The make-up of the United States House of Representatives every two years since 1789. The last Congress included is the 112th Congress, which completed its term in 2013.
Usage
house
Format
A data frame with 112 observations on the following 12 variables.
- congress
 The number of that year's Congress
- year_start
 Starting year
- year_end
 Ending year
- seats
 Total number of seats
- p1
 Name of the first political party
- np1
 Number of seats held by the first political party
- p2
 Name of the second political party
- np2
 Number of seats held by the second political party
- other
 Other
- vac
 Vacancy
- del
 Delegate
- res
 Resident commissioner
Source
Party Divisions of the House of Representatives, 1789 to Present. https://history.house.gov/Institution/Party-Divisions/Party-Divisions.
Examples
library(dplyr)
library(ggplot2)
library(forcats)
# Examine two-party relationship since 1855
house_since_1855 <- house |>
  filter(year_start >= 1855) |>
  mutate(
    p1_perc = 100 * np1 / seats,
    p2_perc = 100 * np2 / seats,
    era = case_when(
      between(year_start, 1861, 1865) ~ "Civil War",
      between(year_start, 1914, 1918) ~ "World War I",
      between(year_start, 1929, 1939) ~ "Great Depression",
      between(year_start, 1940, 1945) ~ "World War II",
      between(year_start, 1960, 1965) ~ "Vietnam War Start",
      between(year_start, 1965, 1975) ~ "Vietnam War Escalated",
      TRUE ~ NA_character_
    ),
    era = fct_relevel(
      era, "Civil War", "World War I",
      "Great Depression", "World War II",
      "Vietnam War Start", "Vietnam War Escalated"
    )
  )
ggplot(house_since_1855, aes(x = year_start)) +
  geom_rect(aes(
    xmin = year_start, xmax = lead(year_start),
    ymin = -Inf, ymax = Inf, fill = era
  )) +
  geom_line(aes(y = p1_perc, color = "Democrats")) + # Democrats
  geom_line(aes(y = p2_perc, color = "Republicans")) + # Republicans
  scale_fill_brewer(palette = "Pastel1", na.translate = FALSE) +
  scale_color_manual(
    name   = "Party",
    values = c("Democrats" = "blue", "Republicans" = "red"),
    labels = c("Democrats", "Republicans")
  ) +
  theme_minimal() +
  ylim(0, 100) +
  labs(x = "Year", y = "Percentage of seats", fill = "Era")
Simulated dataset on student housing
Description
Each observation represents a simulated rent price for a student.
Usage
housing
Format
A data frame with 75 observations on the following variable.
- cost
 a numeric vector
Examples
housing
High School and Beyond survey
Description
Two hundred observations were randomly sampled from the High School and Beyond survey, a survey conducted on high school seniors by the National Center for Education Statistics.
Usage
hsb2
Format
A data frame with 200 observations and 11 variables.
- id
 Student ID.
- gender
 Student's gender, with levels female and male.
- race
 Student's race, with levels african american, asian, hispanic, and white.
- ses
 Socio economic status of student's family, with levels low, middle, and high.
- schtyp
 Type of school, with levels public and private.
- prog
 Type of program, with levels general, academic, and vocational.
- read
 Standardized reading score.
- write
 Standardized writing score.
- math
 Standardized math score.
- science
 Standardized science score.
- socst
 Standardized social studies score.
Source
UCLA Institute for Digital Research & Education - Statistical Consulting.
Examples
library(ggplot2)
ggplot(hsb2, aes(x = read - write, y = ses)) +
  geom_boxplot() +
  labs(
    x = "Difference between reading and writing scores",
    y = "Socio-economic status"
  )
Great Britain: husband and wife pairs
Description
The Great Britain Office of Population Census and Surveys once collected data on a random sample of 170 married couples in Britain, recording the age (in years) and heights of the husbands and wives.
Usage
husbands_wives
Format
A data frame with 199 observations on the following 8 variables.
- age_husband
 Age of husband.
- age_wife
 Age of wife.
- ht_husband
 Height of husband (mm).
- ht_wife
 Height of wife (mm).
- age_husb_at_marriage
 Age of husband at the time they married.
- age_wife_at_marriage
 Age of wife at the time they married.
- years_married
 Number of years married.
Source
Hand DJ. 1994. A handbook of small data sets. Chapman & Hall/CRC.
Examples
library(ggplot2)
ggplot(husbands_wives, aes(x = ht_husband, y = ht_wife)) +
  geom_point()
Poll on illegal workers in the US
Description
910 randomly sampled registered voters in Tampa, FL were asked about their political ideology and whether they thought workers who have illegally entered the US should be (i) allowed to keep their jobs and apply for US citizenship, (ii) allowed to keep their jobs as temporary guest workers but not allowed to apply for US citizenship, or (iii) lose their jobs and have to leave the country.
Usage
immigration
Format
A data frame with 910 observations on the following 2 variables.
- response
 a factor with levels Apply for citizenship, Guest worker, Leave the country, and Not sure
- political
 a factor with levels conservative, liberal, and moderate
Source
SurveyUSA, News Poll #18927, data collected Jan 27-29, 2012.
Examples
immigration
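A cross-tabulation shows how responses vary with political ideology. This is a minimal base R sketch, assuming the openintro package is installed.

```r
library(openintro)

# Counts of each response within each political ideology
tab <- table(immigration$political, immigration$response)
tab

# Proportion of each response within an ideology (each row sums to 1)
props <- prop.table(tab, margin = 1)
round(props, 3)
```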
Infant Mortality Rates, 2012
Description
This entry gives the number of deaths of infants under one year old in 2012 per 1,000 live births in the same year. This rate is often used as an indicator of the level of health in a country.
Usage
infmortrate
Format
A data frame with 222 observations on the following 2 variables.
- country
 Name of country.
- inf_mort_rate
 Infant mortality rate per 1,000 live births.
Details
The data is given in decreasing order of infant mortality rates. There are a few potential outliers.
Source
CIA World Factbook, https://www.cia.gov/the-world-factbook/field/infant-mortality-rate/country-comparison.
Examples
library(ggplot2)
ggplot(infmortrate, aes(x = inf_mort_rate)) +
  geom_histogram(binwidth = 10)
ggplot(infmortrate, aes(x = inf_mort_rate)) +
  geom_density()
iowa
Description
A data frame containing information about the 2016 US Presidential Election for the state of Iowa.
Usage
iowa
Format
A data frame with 1386 observations on the following 5 variables.
- office
 The office that the candidates were running for.
- candidate
 President/Vice President pairs who were running for office.
- party
 Political party of the candidate.
- county
 County in Iowa where the votes were cast.
- votes
 Number of votes received by the candidate.
Examples
library(ggplot2)
library(dplyr)
plot_data <- iowa |>
  filter(candidate != "Total") |>
  group_by(candidate) |>
  summarize(total_votes = sum(votes) / 1000)
ggplot(plot_data, aes(total_votes, candidate)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "2016 Presidential Election in Iowa",
    subtitle = "Popular vote",
    y = "",
    x = "Number of Votes (in thousands)"
  )
Facebook, Google, and LinkedIn IPO filings
Description
On Feb 1st, 2012, Facebook Inc. filed an S-1 form with the Securities and Exchange Commission as part of their initial public offering (IPO). This dataset includes the text of that document as well as text from the IPOs of two competing companies: Google and LinkedIn.
Usage
ipo
Format
The format is a list of three character vectors. Each vector contains the line-by-line text of the IPO Prospectus of Facebook, Google, and LinkedIn, respectively.
Details
Each of the three prospectuses is encoded in UTF-8 format and contains some non-word characters related to the layout of the original documents. For analysis on the words, it is recommended that the data be processed with packages such as tidytext. See examples below.
Source
All IPO prospectuses are available from the U.S. Securities and Exchange Commission: Facebook, Google, LinkedIn.
References
Zweig, J., 2020. Mark Zuckerberg: CEO For Life?. WSJ.
Examples
library(tidytext)
library(tibble)
library(dplyr)
library(ggplot2)
library(forcats)
# Analyzing Facebook IPO text
facebook <- tibble(text = ipo$facebook, company = "Facebook")
facebook |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = "word") |>
  count(word, sort = TRUE) |>
  slice_head(n = 20) |>
  ggplot(aes(y = fct_reorder(word, n), x = n, fill = n)) +
  geom_col() +
  labs(
    title = "Top 20 most common words in Facebook IPO",
    x = "Frequency",
    y = "Word"
  )
# Comparisons to Google and LinkedIn IPO texts
google <- tibble(text = ipo$google, company = "Google")
linkedin <- tibble(text = ipo$linkedin, company = "LinkedIn")
ipo_texts <- bind_rows(facebook, google, linkedin)
ipo_texts |>
  unnest_tokens(word, text) |>
  count(company, word, sort = TRUE) |>
  bind_tf_idf(word, company, n) |>
  arrange(desc(tf_idf)) |>
  group_by(company) |>
  slice_max(tf_idf, n = 15) |>
  ungroup() |>
  ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = company)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~company, ncol = 3, scales = "free") +
  labs(x = "tf-idf", y = NULL)
Length of songs on an iPod
Description
A simulated dataset on lengths of songs on an iPod.
Usage
ipod
Format
A data frame with 3000 observations on the following variable.
- song_length
 Length of song (in minutes).
Source
Simulated data.
Examples
library(ggplot2)
ggplot(ipod, aes(x = song_length)) +
  geom_histogram(binwidth = 0.5)
iran
Description
A data frame containing information about the 2009 Presidential Election in Iran. There were widespread claims of election fraud in this election both internationally and within Iran.
Usage
iran
Format
A data frame with 366 observations on the following 9 variables.
- province
 Iranian province where votes were cast.
- city
 City within province where votes were cast.
- ahmadinejad
 Number of votes received by Ahmadinejad.
- rezai
 Number of votes received by Rezai.
- karrubi
 Number of votes received by Karrubi.
- mousavi
 Number of votes received by Mousavi.
- total_votes_cast
 Total number of votes cast.
- voided_votes
 Number of votes that were not counted.
- legitimate_votes
 Number of votes that were counted.
Examples
library(dplyr)
library(ggplot2)
library(tidyr)
library(stringr)
plot_data <- iran |>
  summarize(
    ahmadinejad = sum(ahmadinejad) / 1000,
    rezai = sum(rezai) / 1000,
    karrubi = sum(karrubi) / 1000,
    mousavi = sum(mousavi) / 1000
  ) |>
  pivot_longer(
    cols = c(ahmadinejad, rezai, karrubi, mousavi),
    names_to = "candidate",
    values_to = "votes"
  ) |>
  mutate(candidate = str_to_title(candidate))
ggplot(plot_data, aes(votes, candidate)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "2009 Iranian Presidential Election",
    x = "Number of votes (in thousands)",
    y = ""
  )
Simulated juror dataset
Description
Simulated dataset of registered voters proportions and representation on juries.
Usage
jury
Format
A data frame with 275 observations on the following variable.
- race
 a factor with levels black, hispanic, other, and white
Examples
jury
Kobe Bryant basketball performance
Description
Data from the five games the Los Angeles Lakers played against the Orlando Magic in the 2009 NBA finals.
Usage
kobe_basket
Format
A data frame with 133 rows and 6 variables:
- vs
 A categorical vector, ORL if the Los Angeles Lakers played against Orlando
- game
 A numerical vector, game in the 2009 NBA finals
- quarter
 A categorical vector, quarter in the game, OT stands for overtime
- time
 A character vector, time at which Kobe took a shot
- description
 A character vector, description of the shot
- shot
 A categorical vector, H if the shot was a hit, M if the shot was a miss
Details
Each row represents a shot Kobe Bryant took during the five games of the 2009 NBA finals. Kobe Bryant's performance earned him the title of Most Valuable Player and many spectators commented on how he appeared to show a hot hand.
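The "hot hand" remark in the Details can be explored by looking at runs of consecutive hits. The sketch below assumes the openintro package is installed and uses base R's `rle()`. Note that a "run" here means a maximal block of consecutive hits, which is one possible definition of a streak and may differ from the definition used in the OpenIntro labs.

```r
library(openintro)

# Sequence of shot outcomes: "H" for hit, "M" for miss
shots <- as.character(kobe_basket$shot)

# Overall proportion of hits
hit_rate <- mean(shots == "H")
hit_rate

# Run-length encoding: lengths of consecutive identical outcomes
runs <- rle(shots)

# Lengths of the runs made up of hits only
hit_runs <- runs$lengths[runs$values == "H"]
table(hit_runs)
```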
lab_report
Description
Acts as a simplified template to common parameters passed to rmarkdown::html_document().
Usage
lab_report(
  highlight = "pygments",
  theme = "spacelab",
  toc = TRUE,
  toc_float = TRUE,
  code_download = TRUE,
  code_folding = "show"
)
Arguments
highlight | 
 Syntax highlighting style. Supported styles include
"default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn",
"haddock", and "textmate". Pass   | 
theme | 
 Visual theme ("default", "cerulean", "journal", "flatly",
"readable", "spacelab", "united", "cosmo", "lumen", "paper", "sandstone",
"simplex", or "yeti"). Pass   | 
toc | 
 
  | 
toc_float | 
 
  | 
code_download | 
 Embed the Rmd source code within the document and provide a link that can be used by readers to download the code.  | 
code_folding | 
 Enable document readers to toggle the display of R code
chunks. Specify   | 
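A minimal sketch of how the format might be used from R, assuming the openintro and rmarkdown packages (and pandoc) are installed. The file name `lab.Rmd` is hypothetical, so the render call is left commented out.

```r
library(openintro)

# Build the output format, overriding two of the defaults shown in Usage
fmt <- lab_report(toc_float = FALSE, code_folding = "hide")
class(fmt)

# Render a document with this format
# ("lab.Rmd" is a hypothetical file name; replace it with your own)
# rmarkdown::render("lab.Rmd", output_format = fmt)
```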
Are Emily and Greg More Employable Than Lakisha and Jamal?
Description
Original data from the experiment run by Bertrand and Mullainathan (2004).
Usage
labor_market_discrimination
Format
A tibble with 4870 observations of 63 variables.
- education
 Highest education, with levels of 0 = not reported; 1 = high school dropout; 2 = high school graduate; 3 = some college; 4 = college or more.
- n_jobs
 Number of jobs listed on resume.
- years_exp
 Number of years of work experience on the resume.
- honors
 Indicator variable for which 1 = resume mentions some honors.
- volunteer
 Indicator variable for which 1 = resume mentions some volunteering experience.
- military
 Indicator variable for which 1 = resume mentions some military experience.
- emp_holes
 Indicator variable for which 1 = resume mentions some employment holes.
- occup_specific
 1990 Census Occupation Code. See sources for a key.
- occup_broad
 Occupation broad with levels 1 = executives and managerial occupations, 2 = administrative supervisors, 3 = sales representatives, 4 = sales workers, 5 = secretaries and legal assistants, 6 = clerical occupations
- work_in_school
 Indicator variable for which 1 = resume mentions some work experience while at school
Indicator variable for which 1 = email address on applicant's resume.
- computer_skills
 Indicator variable for which 1 = resume mentions some computer skills.
- special_skills
 Indicator variable for which 1 = resume mentions some special skills.
- first_name
 Applicant's first name.
- sex
 Sex, with levels of 'f' = female; 'm' = male.
- race
 Race, with levels of 'b' = black; 'w' = white.
- h
 Indicator variable for which 1 = high quality resume.
- l
 Indicator variable for which 1 = low quality resume.
- call
 Indicator variable for which 1 = applicant was called back.
- city
 City, with levels of 'c' = Chicago; 'b' = Boston.
- kind
 Kind, with levels of 'a' = administrative; 's' = sales.
- ad_id
 Employment ad identifier.
- frac_black
 Fraction of blacks in applicant's zip.
- frac_white
 Fraction of whites in applicant's zip.
- l_med_hh_inc
 Log median household income in applicant's zip.
- frac_dropout
 Fraction of high-school dropouts in applicant's zip.
- frac_colp
 Fraction of college degree or more in applicant's zip
- l_inc
 Log per capita income in applicant's zip.
- col
 Indicator variable for which 1 = applicant has college degree or more.
- expminreq
 Minimum experience required, if any (in years when numeric).
- school_req
 Specific education requirement, if any. 'hsg' = high school graduate, 'somcol' = some college, 'colp' = four year degree or higher
- eoe
 Indicator variable for which 1 = ad mentions employer is 'Equal Opportunity Employer'.
- parent_sales
 Sales of parent company (in millions of US $).
- parent_emp
 Number of parent company employees.
- branch_sales
 Sales of branch (in millions of US $).
- branch_emp
 Number of branch employees.
- fed
 Indicator variable for which 1 = employer is a federal contractor.
- frac_black_emp_zip
 Fraction of blacks in employer's zipcode.
- frac_white_emp_zip
 Fraction of whites in employer's zipcode.
- l_med_hh_inc_emp_zip
 Log median household income in employer's zipcode.
- frac_dropout_emp_zip
 Fraction of high-school dropouts in employer's zipcode.
- frac_colp_emp_zip
 Fraction of college degree or more in employer's zipcode.
- l_inc_emp_zip
 Log per capita income in employer's zipcode.
- manager
 Indicator variable for which 1 = executives or managers wanted.
- supervisor
 Indicator variable for which 1 = administrative supervisors wanted.
- secretary
 Indicator variable for which 1 = secretaries or legal assistants wanted.
- off_support
 Indicator variable for which 1 = clerical workers wanted.
- sales_rep
 Indicator variable for which 1 = sales representative wanted.
- retail_sales
 Indicator variable for which 1 = retail sales worker wanted.
- req
 Indicator variable for which 1 = ad mentions any requirement for job.
- exp_req
 Indicator variable for which 1 = ad mentions some experience requirement.
- com_req
 Indicator variable for which 1 = ad mentions some communication skills requirement.
- educ_req
 Indicator variable for which 1 = ad mentions some educational requirement.
- comp_req
 Indicator variable for which 1 = ad mentions some computer skill requirement.
- org_req
 Indicator variable for which 1 = ad mentions some organizational skills requirement.
- manuf
 Indicator variable for which 1 = employer industry is manufacturing.
- trans_com
 Indicator variable for which 1 = employer industry is transport or communication.
- bank_real
 Indicator variable for which 1 = employer industry is finance, insurance or real estate.
- trade
 Indicator variable for which 1 = employer industry is wholesale or retail trade.
- bus_service
 Indicator variable for which 1 = employer industry is business or personal services.
- oth_service
 Indicator variable for which 1 = employer industry is health, education or social services.
- miss_ind
 Indicator variable for which 1 = employer industry is other or unknown.
- ownership
 Ownership status of employer, with levels of 'non-profit'; 'private'; 'public'
Details
From the summary: "We study race in the labor market by sending fictitious resumes to help-wanted ads in Boston and Chicago newspapers. To manipulate perceived race, resumes are randomly assigned African-American- or White-sounding names. White names receive 50 percent more callbacks for interviews. Callbacks are also more responsive to resume quality for White names than for African-American ones. The racial gap is uniform across occupation, industry, and employer size. We also find little evidence that employers are inferring social class from the names. Differential treatment by race still appears to be prominent in the U. S. labor market."
Source
Bertrand, Marianne, and Mullainathan, Sendhil. Replication data for: Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. Nashville, TN: American Economic Association [publisher], 2004. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2019-12-06. doi:10.3886/E116023V1.
Note: The description of the variables follows closely the labels provided in the original dataset, with small edits for clarity.
Examples
library(dplyr)
# Percent callback for typical White names and typical African-American names (table 1, p. 997)
labor_market_discrimination |>
  group_by(race) |>
  summarise(call_back = mean(call))
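The Details section quotes a 50 percent relative gap in callbacks. As a minimal sketch of how such a relative difference is computed from group callback rates, the code below uses a small made-up data frame (the `toy` object and its values are hypothetical and are not taken from the dataset). The same steps should carry over to `labor_market_discrimination`, assuming `call` is coded 0/1 as described above.

```r
# Hypothetical toy data: 10 resumes per group, call coded 0/1.
toy <- data.frame(
  race = rep(c("w", "b"), each = 10),
  call = c(
    1, 1, 1, 0, 0, 0, 0, 0, 0, 0, # "w": 3 of 10 called back
    1, 1, 0, 0, 0, 0, 0, 0, 0, 0  # "b": 2 of 10 called back
  )
)

# Callback rate within each group.
rates <- tapply(toy$call, toy$race, mean)

# Relative difference: how much higher the "w" rate is than the "b" rate,
# expressed as a fraction of the "b" rate.
relative_gap <- unname(rates["w"] / rates["b"] - 1)
relative_gap
```

A relative gap is a ratio of rates, so it can be large even when the absolute difference in percentage points is small.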
Gender, Socioeconomic Class, and Interview Invites
Description
Resumes were sent out to 316 top law firms in the United States, and there were two randomized characteristics of each resume. First, the gender associated with the resume was randomized by assigning a first name of either James or Julia. Second, the socioeconomic class of the candidate was randomly assigned and represented through five minor changes associated with personal interests and other minor details (e.g. an extracurricular activity of sailing team vs. track and field). The outcome variable was whether the candidate received an interview.
Usage
law_resume
Format
A data frame with 316 observations on the following 3 variables. Each row represents a resume sent to a top law firm for this experiment.
- class
 The resume represented irrelevant details suggesting either "low" or "high" socioeconomic class.
- gender
 The resume implied the candidate was either "male" or "female".
- outcome
 If the candidate received an invitation for an "interview" or "not".
Source
For a casual overview, see https://hbr.org/2016/12/research-how-subtle-class-cues-can-backfire-on-your-resume.
For the academic paper, see Tilcsik A, Rivera LA. 2016. Class Advantage, Commitment Penalty. The Gendered Effect of Social Class Signals in an Elite Labor Market. American Sociological Review 81:6 p1097-1131. doi:10.1177/0003122416668154.
Examples
tapply(law_resume$outcome == "interview", law_resume[, c("class", "gender")], mean)
m <- glm(I(outcome == "interview") ~ gender * class, data = law_resume, family = binomial)
summary(m)
predict(m, type = "response")
Lecture Delivery Method and Learning Outcomes
Description
Data was collected from 276 students in a university psychology course to determine the effect of lecture delivery method on learning. Students were presented a live lecture by the professor on one day and a pre-recorded lecture on a different topic by the same professor on a different day. Survey data was collected during the lectures to determine mind wandering, interest, and motivation. Students were also ultimately asked about the preferred lecture delivery method. Finally, students completed an assessment at the end of the lecture to determine memory recall.
Usage
lecture_learning
Format
A data frame with 552 rows and 8 variables.
- student
 Identification number of a specific student. Each identification number appears twice because the same student heard both lecture delivery methods.
- gender
 Gender of student. Recorded as a binary variable with levels Male and Female in the study.
- method
 Delivery method of lecture was either in-person (Live) or pre-recorded (Video).
- mindwander
 An indicator of distraction during the lecture. It is a proportion of six mind wandering probes during the lecture when a student answered yes that mind wandering had just occurred.
- memory
 An indicator of recall of information provided during the lecture. It is the proportion of correct answers in a six question assessment given at the end of the lecture presentation.
- interest
 A Likert scale that gauged student interest level concerning the lecture.
- motivation_both
 After experiencing both lecture delivery methods, students were asked during which method they were most motivated to remain attentive.
- motivation_single
 After a single lecture delivery experience, this Likert scale was used to gauge motivation to remain attentive during the lecture.
Source
Examples
library(dplyr)
library(ggplot2)
# Calculate the average memory test proportion by lecture delivery method
# and gender.
lecture_learning |>
  group_by(method, gender) |>
  summarize(average_memory = mean(memory), count = n(), .groups = "drop")
# Compare visually the differences in memory test proportions by delivery
# method and gender.
ggplot(lecture_learning, aes(x = method, y = memory, fill = gender)) +
  geom_boxplot() +
  theme_minimal() +
  labs(
    title = "Difference in memory test proportions",
    x = "Method",
    y = "Memory",
    fill = "Gender"
  )
# Use a paired t-test to determine whether memory test proportion score
# differed by delivery method. Note that paired t-tests are identical
# to one sample t-test on the difference between the Live and Video methods.
learning_diff <- lecture_learning |>
  tidyr::pivot_wider(id_cols = student, names_from = method, values_from = memory) |>
  mutate(time_diff = Live - Video)
t.test(time_diff ~ 1, data = learning_diff)
# Calculating the proportion of students who were most motivated to remain
# attentive in each delivery method.
lecture_learning |>
  count(motivation_both) |>
  mutate(proportion = n / sum(n))
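The comment above notes that a paired t-test is the same as a one-sample t-test on the differences. A small sketch of that equivalence in base R follows; the `live` and `video` vectors are hypothetical scores for eight students and are not drawn from `lecture_learning`.

```r
# Hypothetical memory scores for 8 students under each delivery method.
live  <- c(0.67, 0.50, 0.83, 0.33, 0.67, 1.00, 0.50, 0.67)
video <- c(0.50, 0.50, 0.67, 0.33, 0.83, 0.67, 0.33, 0.50)

# Paired t-test on the two sets of scores.
paired <- t.test(live, video, paired = TRUE)

# One-sample t-test on the within-student differences.
one_sample <- t.test(live - video)

# The test statistic, degrees of freedom, and p-value should agree.
all.equal(unname(paired$statistic), unname(one_sample$statistic))
all.equal(unname(paired$parameter), unname(one_sample$parameter))
all.equal(paired$p.value, one_sample$p.value)
```

Working with the differences directly makes it explicit that the pairing, and not the two group means separately, drives the test.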
Legalization of Marijuana Support in 2010 California Survey
Description
In a 2010 Survey USA poll, 70% of the 119 respondents between the ages of 18 and 34 said they would vote in the 2010 general election for Prop 19, which would change California law to legalize marijuana and allow it to be regulated and taxed.
Usage
leg_mari
Format
A data frame with 119 observations on the following variable.
- response
 One of two values: oppose or support.
Source
Survey USA, Election Poll #16804, data collected July 8-11, 2010.
Examples
table(leg_mari)
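As a rough sketch of how a sample proportion like the one in the Description can be turned into an interval estimate, the code below computes a normal-approximation 95% confidence interval. The vector `toy_response` is hypothetical (70 of 100 in support) and only stands in for `leg_mari$response`; it does not reproduce the survey counts.

```r
# Hypothetical responses: 70 "support" and 30 "oppose".
toy_response <- c(rep("support", 70), rep("oppose", 30))

n <- length(toy_response)
p_hat <- mean(toy_response == "support")

# Standard error of a sample proportion.
se <- sqrt(p_hat * (1 - p_hat) / n)

# Normal-approximation 95% confidence interval.
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se
ci
```

The normal approximation is usually considered reasonable when both the number of successes and the number of failures are at least 10, which holds for this toy vector.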
Population of Lego Sets for Sale between Jan. 1, 2018 and Sept. 11, 2020.
Description
Data about Lego sets for sale, based on a JSDSE article by Anna Peterson and Laura Ziegler. Data from their article were scraped from multiple sources, including brickset.com.
Usage
lego_population
Format
A data frame with 1304 rows and 14 variables.
- item_number
 Set Item number
- set_name
 Name of the set.
- theme
 Set theme: Duplo, City or Friends.
- pieces
 Number of pieces in the set.
- price
 Recommended retail price from LEGO.
- amazon_price
 Price of the set at Amazon.
- year
 Year that it was produced.
- ages
 LEGO's recommended ages of children for the set
- pages
 Pages in the instruction booklet.
- minifigures
 Number of LEGO people in the set; if unknown, "NA" was recorded.
- packaging
 Type of packaging: bag, box, etc.
- weight
 Weight of the set of LEGOS in pounds and kilograms.
- unique_pieces
 Number of pieces classified as unique in the instruction manual.
- size
 Size of the lego pieces: Large if safe for small children and Small for older children.
Source
Peterson, A. D., & Ziegler, L. (2021). Building a multiple linear regression model with LEGO brick data. Journal of Statistics and Data Science Education, 29(3),1-7. doi:10.1080/26939169.2021.1946450
BrickInstructions.com. (n.d.). Retrieved February 2, 2021 from
Brickset. (n.d.). BRICKSET: Your LEGO® set guide. Retrieved February 2, 2021 from
Examples
library(ggplot2)
library(dplyr)
lego_population |>
  filter(theme == "Friends" | theme == "City") |>
  ggplot(aes(x = pieces, y = amazon_price)) +
  geom_point(alpha = 0.3) +
  labs(
    x = "Pieces in the Set",
    y = "Amazon Price",
    title = "Amazon Price vs Number of Pieces in Lego Sets",
    subtitle = "Friends and City Themes"
  )
Sample of Lego Sets
Description
Data about Lego sets for sale, based on a JSDSE article by Anna Peterson and Laura Ziegler. Data from their article were scraped from multiple sources, including brickset.com.
Usage
lego_sample
Format
A data frame with 75 rows and 15 variables.
- item_number
 Set Item number
- set_name
 Name of the set.
- theme
 Set theme: Duplo, City or Friends.
- pieces
 Number of pieces in the set.
- price
 Recommended retail price from LEGO.
- amazon_price
 Price of the set at Amazon.
- year
 Year that it was produced.
- ages
 LEGO's recommended ages of children for the set
- pages
 Pages in the instruction booklet.
- minifigures
 Number of LEGO people in the set; if unknown, "NA" was recorded.
- packaging
 Type of packaging: bag, box, etc.
- weight
 Weight of the set of LEGOS in pounds and kilograms.
- unique_pieces
 Number of pieces classified as unique in the instruction manual.
- size
 Size of the lego pieces: Large if safe for small children and Small for older children.
Source
Peterson, A. D., & Ziegler, L. (2021). Building a multiple linear regression model with LEGO brick data. Journal of Statistics and Data Science Education, 29(3),1-7. doi:10.1080/26939169.2021.1946450
BrickInstructions.com. (n.d.). Retrieved February 2, 2021 from
Brickset. (n.d.). BRICKSET: Your LEGO® set guide. Retrieved February 2, 2021 from
Examples
library(ggplot2)
library(dplyr)
lego_sample |>
  filter(theme == "Friends" | theme == "City") |>
  ggplot(aes(x = pieces, y = amazon_price)) +
  geom_point(alpha = 0.3) +
  labs(
    x = "Pieces in the Set",
    y = "Amazon Price",
    title = "Amazon Price vs Number of Pieces in Lego Sets",
    subtitle = "Friends and City Themes"
  )
life_exp
Description
A data frame with 3142 rows and 4 columns. County level data for life expectancy and median income in the United States.
Usage
life_exp
Format
A data frame with 3142 observations on the following 4 variables.
- state
 Name of the state.
- county
 Name of the county.
- expectancy
 Life expectancy in the county.
- income
 Median income in the county, measured in US $.
Examples
library(ggplot2)
# Income V Expectancy
ggplot(life_exp, aes(x = income, y = expectancy)) +
  geom_point(color = openintro::IMSCOL["green", "full"], alpha = 0.2) +
  theme_minimal() +
  labs(
    title = "Is there a relationship between median income and life expectancy?",
    x = "Median income (US $)",
    y = "Life Expectancy (year)"
  )
Create simple regression plot with residual plot
Description
Create a simple regression plot with residual plot.
Usage
linResPlot(
  x,
  y,
  axes = FALSE,
  wBox = TRUE,
  wLine = TRUE,
  lCol = "#00000088",
  lty = 1,
  lwd = 1,
  main = "",
  xlab = "",
  ylab = "",
  marRes = NULL,
  col = fadeColor(4, "88"),
  pch = 20,
  cex = 1.5,
  yR = 0.1,
  ylim = NULL,
  subset = NULL,
  ...
)
Arguments
x | 
 Predictor variable.  | 
y | 
 Outcome variable.  | 
axes | 
 Whether to plot axis labels.  | 
wBox | 
 Whether to plot boxes around each plot.  | 
wLine | 
 Add a regression line.  | 
lCol | 
 Line color.  | 
lty | 
 Line type.  | 
lwd | 
 Line width.  | 
main | 
 Title for the top plot.  | 
xlab | 
 x-label.  | 
ylab | 
 y-label.  | 
marRes | 
 Margin for the residuals plot.  | 
col | 
 Color of the points.  | 
pch | 
 Plotting character of points.  | 
cex | 
 Size of points.  | 
yR | 
 An additional vertical stretch factor on the plot.  | 
ylim | 
 y-limits.  | 
subset | 
 Boolean vector, if wanting a subset of the data.  | 
... | 
 Additional arguments passed to both plots.  | 
See Also
Examples
# Currently seems broken for this example.
n <- 25
x <- runif(n)
y <- 5 * x + rnorm(n)
myMat <- rbind(matrix(1:2, 2))
myW <- 1
myH <- c(1, 0.45)
par(mar = c(0.35, 0.654, 0.35, 0.654))
layout(myMat, myW, myH)
linResPlot(x, y, col = COL[1, 2])
Field data on lizards observed in their natural habitat
Description
Data on where each lizard was observed and the level of sunlight. The data were collected on Sceloporus occidentalis (western fence lizards) by Stephen C. Adolph in 1983 (in desert and mountain sites) and by Dee Asbury in 2002-3 (in valley site).
Usage
lizard_habitat
Format
A data frame with 332 observations on the following 2 variables.
- site
 Site of lizard observation: desert, mountain, or valley.
- sunlight
 Sunlight level at time of observation: sun (lizard was observed perching in full sunlight), partial (lizard was observed perching with part of its body in the sun, part in the shade), shade (lizard was observed perching in the shade).
Source
Adolph, S. C. 1990. Influence of behavioral thermoregulation on microhabitat use by two Sceloporus lizards. Ecology 71: 315-327. Asbury, D.A., and S. C. Adolph. 2007. Behavioral plasticity in an ecological generalist: microhabitat use by western fence lizards. Evolutionary Ecology Research 9:801-815.
Examples
library(ggplot2)
# Frequencies
table(lizard_habitat)
# Stacked bar plots
ggplot(lizard_habitat, aes(y = site, fill = sunlight)) +
  geom_bar(position = "fill") +
  labs(x = "Proportion")
Lizard speeds
Description
Data on top speeds measured on a laboratory race track for two species of lizards: Western fence lizard (Sceloporus occidentalis) and Sagebrush lizard (Sceloporus graciosus).
Usage
lizard_run
Format
A data frame with 48 observations on the following 3 variables.
- top_speed
 Top speed of lizard, meters per second.
- common_name
 Common name: Western fence lizard and Sagebrush lizard.
- scientific_name
 Scientific name (Genus and species): Sceloporus occidentalis and Sceloporus graciosus.
Source
Adolph, S. C. 1987. Physiological and behavioral ecology of the lizards Sceloporus occidentalis and Sceloporus graciosus. Dissertation. University of Washington, Seattle, Washington, USA.
Examples
library(ggplot2)
library(dplyr)
# Top speed by species
ggplot(lizard_run, aes(x = top_speed, color = common_name, fill = common_name)) +
  geom_density(alpha = 0.5)
# Top speed summary statistics by species
lizard_run |>
  group_by(common_name) |>
  summarise(
    n    = n(),
    mean = mean(top_speed),
    sd   = sd(top_speed)
  )
Linear regression plot with residual plot
Description
Plot data, the linear model, and a residual plot simultaneously.
Usage
lmPlot(
  x,
  y,
  xAxis = 0,
  yAxis = 4,
  resAxis = 3,
  resSymm = TRUE,
  wBox = TRUE,
  wLine = TRUE,
  lCol = "#00000088",
  lty = 1,
  lwd = 1,
  xlab = "",
  ylab = "",
  marRes = NULL,
  col = "#22558888",
  pch = 20,
  cex = 1.5,
  xR = 0.02,
  yR = 0.1,
  xlim = NULL,
  ylim = NULL,
  subset = NULL,
  parCustom = FALSE,
  myHeight = c(1, 0.45),
  plots = c("both", "mainOnly", "resOnly"),
  highlight = NULL,
  hlCol = NULL,
  hlCex = 1.5,
  hlPch = 20,
  na.rm = TRUE,
  ...
)
Arguments
x | 
 The x coordinates of points in the plot.  | 
y | 
 The y coordinates of points in the plot.  | 
xAxis | 
 The maximum number of x axis labels.  | 
yAxis | 
 The maximum number of y axis labels.  | 
resAxis | 
 The maximum number of y axis labels in the residual plot.  | 
resSymm | 
 Boolean determining whether the range of the residual plot should be symmetric about zero.  | 
wBox | 
 Boolean determining whether a box should be added around each plot.  | 
wLine | 
 Boolean determining whether to add a regression line to the plot.  | 
lCol | 
 The color of the regression line to be added.  | 
lty | 
 The line type of the regression line to be added.  | 
lwd | 
 The line width of the regression line to be added.  | 
xlab | 
 A label for the x axis.  | 
ylab | 
 A label for the y axis  | 
marRes | 
 Margin specified for the residuals.  | 
col | 
 Color of points.  | 
pch | 
 Plotting character.  | 
cex | 
 Plotting character size.  | 
xR | 
 Scaling the limits of the x axis. Ignored if   | 
yR | 
 Scaling the limits of the y axis. Ignored if   | 
xlim | 
 Limits for the x axis.  | 
ylim | 
 Limits for the y axis.  | 
subset | 
 A subset of the data to be used for the linear model.  | 
parCustom | 
 If   | 
myHeight | 
 A numerical vector of length 2 representing the ratio of the primary plot to the residual plot, in height.  | 
plots | 
 Not currently utilized.  | 
highlight | 
 Numerical vector specifying particular points to highlight.  | 
hlCol | 
 Color of highlighted points.  | 
hlCex | 
 Size of highlighted points.  | 
hlPch | 
 Plotting characters of highlighted points.  | 
na.rm | 
 Remove cases with   | 
... | 
 Additional arguments to   | 
Author(s)
David Diez
See Also
Examples
lmPlot(satgpa$sat_sum, satgpa$fy_gpa)
lmPlot(gradestv$tv, gradestv$grades,
  xAxis = 4,
  xlab = "time watching TV", yR = 0.2, highlight = c(1, 15, 20)
)
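For readers who want to see the idea behind the two-panel layout, here is a minimal base-graphics sketch of a scatterplot with a fitted line stacked above a residual plot. It is not the implementation of `lmPlot()`; the data are hypothetical, and the only details borrowed from the Usage above are the panel height ratio `c(1, 0.45)` and the notion of residual limits that are symmetric about zero.

```r
# Hypothetical data for illustration.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)

fit <- lm(y ~ x)
res <- residuals(fit)

# Residual axis limits symmetric about zero.
res_lim <- c(-1, 1) * max(abs(res))

# Stack two panels with relative heights 1 and 0.45.
layout(matrix(1:2, nrow = 2), heights = c(1, 0.45))

# Top panel: data and fitted line (x axis suppressed).
op <- par(mar = c(1, 4, 1, 1))
plot(x, y, pch = 20, xaxt = "n", xlab = "")
abline(fit, col = "#00000088")

# Bottom panel: residuals against x with a reference line at zero.
par(mar = c(4, 4, 1, 1))
plot(x, res, pch = 20, ylim = res_lim, ylab = "residuals")
abline(h = 0, lty = 2)

# Restore the single-panel layout and the original margins.
layout(1)
par(op)
```

Sharing the x variable across both panels makes it easy to spot where along the predictor the linear fit over- or under-predicts.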
Loan data from Lending Club
Description
This dataset represents thousands of loans made through the Lending Club platform, which is a platform that allows individuals to lend to other individuals. Of course, not all loans are created equal. Someone who is essentially a sure bet to pay back a loan will have an easier time getting a loan with a low interest rate than someone who appears to be riskier. And for people who are very risky? They may not even get a loan offer, or they may not have accepted the loan offer due to a high interest rate. It is important to keep that last part in mind, since this dataset only represents loans actually made, i.e. do not mistake this data for loan applications!
Usage
loans_full_schema
Format
A data frame with 10,000 observations on the following 55 variables.
- emp_title
 Job title.
- emp_length
 Number of years in the job, rounded down. If longer than 10 years, then this is represented by the value 10.
- state
 Two-letter state code.
- homeownership
 The ownership status of the applicant's residence.
- annual_income
 Annual income.
- verified_income
 Type of verification of the applicant's income.
- debt_to_income
 Debt-to-income ratio.
- annual_income_joint
 If this is a joint application, then the annual income of the two parties applying.
- verification_income_joint
 Type of verification of the joint income.
- debt_to_income_joint
 Debt-to-income ratio for the two parties.
- delinq_2y
 Delinquencies on lines of credit in the last 2 years.
- months_since_last_delinq
 Months since the last delinquency.
- earliest_credit_line
 Year of the applicant's earliest line of credit
- inquiries_last_12m
 Inquiries into the applicant's credit during the last 12 months.
- total_credit_lines
 Total number of credit lines in this applicant's credit history.
- open_credit_lines
 Number of currently open lines of credit.
- total_credit_limit
 Total available credit, e.g. if only credit cards, then the total of all the credit limits. This excludes a mortgage.
- total_credit_utilized
 Total credit balance, excluding a mortgage.
- num_collections_last_12m
 Number of collections in the last 12 months. This excludes medical collections.
- num_historical_failed_to_pay
 The number of derogatory public records, which roughly means the number of times the applicant failed to pay.
- months_since_90d_late
 Months since the last time the applicant was 90 days late on a payment.
- current_accounts_delinq
 Number of accounts where the applicant is currently delinquent.
- total_collection_amount_ever
 The total amount that the applicant has had against them in collections.
- current_installment_accounts
 Number of installment accounts, which are (roughly) accounts with a fixed payment amount and period. A typical example might be a 36-month car loan.
- accounts_opened_24m
 Number of new lines of credit opened in the last 24 months.
- months_since_last_credit_inquiry
 Number of months since the last credit inquiry on this applicant.
- num_satisfactory_accounts
 Number of satisfactory accounts.
- num_accounts_120d_past_due
 Number of current accounts that are 120 days past due.
- num_accounts_30d_past_due
 Number of current accounts that are 30 days past due.
- num_active_debit_accounts
 Number of currently active bank cards.
- total_debit_limit
 Total of all bank card limits.
- num_total_cc_accounts
 Total number of credit card accounts in the applicant's history.
- num_open_cc_accounts
 Total number of currently open credit card accounts.
- num_cc_carrying_balance
 Number of credit cards that are carrying a balance.
- num_mort_accounts
 Number of mortgage accounts.
- account_never_delinq_percent
 Percent of all lines of credit where the applicant was never delinquent.
- tax_liens
 a numeric vector
- public_record_bankrupt
 Number of bankruptcies listed in the public record for this applicant.
- loan_purpose
 The category for the purpose of the loan.
- application_type
 The type of application: either individual or joint.
- loan_amount
 The amount of the loan the applicant received.
- term
 The number of months of the loan the applicant received.
- interest_rate
 Interest rate of the loan the applicant received.
- installment
 Monthly payment for the loan the applicant received.
- grade
 Grade associated with the loan.
- sub_grade
 Detailed grade associated with the loan.
- issue_month
 Month the loan was issued.
- loan_status
 Status of the loan.
- initial_listing_status
 Initial listing status of the loan. (I think this has to do with whether the lender provided the entire loan or if the loan is across multiple lenders.)
- disbursement_method
 Disbursement method of the loan.
- balance
 Current balance on the loan.
- paid_total
 Total that has been paid on the loan by the applicant.
- paid_principal
 The difference between the original loan amount and the current balance on the loan.
- paid_interest
 The amount of interest paid so far by the applicant.
- paid_late_fees
 Late fees paid by the applicant.
Source
This data comes from Lending Club (https://www.lendingclub.com/info/statistics.action), which provides a very large, open set of data on the people who received loans through their platform.
Examples
loans_full_schema
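Two of the variable descriptions above are effectively small formulas: `emp_length` is tenure rounded down and capped at 10, and `paid_principal` is the loan amount minus the current balance. A minimal sketch of both rules follows, using a made-up data frame. The object `toy_loans` and its column `years_in_job` are hypothetical and do not appear in `loans_full_schema`.

```r
# Hypothetical loans; years_in_job is an assumed raw tenure in years.
toy_loans <- data.frame(
  years_in_job = c(0.5, 3.9, 10, 27.2),
  loan_amount  = c(10000, 5000, 20000, 8000),
  balance      = c(7500, 0, 19000, 8000)
)

# emp_length: years in the job, rounded down and capped at 10.
toy_loans$emp_length <- pmin(floor(toy_loans$years_in_job), 10)

# paid_principal: original loan amount minus the current balance.
toy_loans$paid_principal <- toy_loans$loan_amount - toy_loans$balance

toy_loans
```

Because of the cap, a value of 10 in `emp_length` should be read as "10 or more years" in any analysis.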
London Borough Boundaries
Description
This dataset contains the coordinates of the boundaries of all 32 boroughs of the Greater London area.
Usage
london_boroughs
Format
A data frame with 45341 observations on the following 3 variables.
- borough
 Name of the borough.
- x
 The "easting" component of the coordinate, see details.
- y
 The "northing" component of the coordinate, see details.
Details
Map data was made available through the Ordnance Survey Open Data
initiative.  The data use the
National Grid coordinate system,
based upon eastings (x) and northings (y) instead of longitude and latitude.
The borough variable covers all 32 boroughs in Greater London:
Barking & Dagenham, Barnet, Bexley, Brent,
Bromley, Camden, Croydon, Ealing,
Enfield, Greenwich, Hackney, Hammersmith &
Fulham, Haringey, Harrow, Havering, Hillingdon,
Hounslow, Islington, Kensington & Chelsea,
Kingston, Lambeth, Lewisham, Merton,
Newham, Redbridge, Richmond, Southwark,
Sutton, Tower Hamlets, Waltham Forest,
Wandsworth, Westminster
Source
https://data.london.gov.uk/dataset/ordnance-survey-code-point
Contains Ordnance Survey data released under the Open Government Licence, OGL v2.
See Also
london_murders
Examples
library(dplyr)
library(ggplot2)
# Calculate number of murders by borough
london_murders_counts <- london_murders |>
  count(borough)
london_murders_counts
## Not run: 
# Add number of murders to geographic boundary data
london_boroughs_murders <- inner_join(london_boroughs, london_murders_counts, by = "borough")
# Map murders
ggplot(london_boroughs_murders) +
  geom_polygon(aes(x = x, y = y, group = borough, fill = n), colour = "white") +
  scale_fill_distiller(direction = 1) +
  labs(x = "Easting", y = "Northing", fill = "Number of murders")
## End(Not run)
London Murders, 2006-2011
Description
This dataset contains the victim name, age, and location of every murder recorded in the Greater London area by the Metropolitan Police from January 1, 2006 to September 7, 2011.
Usage
london_murders
Format
A data frame with 838 observations on the following 5 variables.
- forename
 First name(s) of the victim.
- age
 Age of the victim.
- date
 Date of the murder (YYYY-MM-DD).
- year
 Year of the murder.
- borough
 The London borough in which the murder took place. See the Details section for a list of all the boroughs.
Details
To visualize this dataset using a map, see the
london_boroughs dataset, which contains the eastings and
northings of polygons that define the boundaries of the 32 boroughs of
Greater London.
The borough variable covers all 32 boroughs in Greater London:
Barking & Dagenham, Barnet, Bexley, Brent,
Bromley, Camden, Croydon, Ealing,
Enfield, Greenwich, Hackney, Hammersmith &
Fulham, Haringey, Harrow, Havering, Hillingdon,
Hounslow, Islington, Kensington & Chelsea,
Kingston, Lambeth, Lewisham, Merton,
Newham, Redbridge, Richmond, Southwark,
Sutton, Tower Hamlets, Waltham Forest,
Wandsworth, Westminster
Source
https://www.theguardian.com/news/datablog/2011/oct/05/murder-london-list#data
References
Inspired by The Guardian Datablog.
Examples
library(dplyr)
library(ggplot2)
library(lubridate)
london_murders |>
  mutate(
    day_count = as.numeric(date - ymd("2006-01-01")),
    date_cut = cut(day_count, seq(0, 2160, 90))
  ) |>
  count(date_cut) |>
  ggplot(aes(x = date_cut, y = n)) +
  geom_col() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
  labs(x = "Date from 01/2006 - 09/2011", y = "Number of deaths per 90 days")
Output a message while inside a loop
Description
NOTE: utils::txtProgressBar() and utils::setTxtProgressBar() are better.
Output a message while inside a for loop to update the user on progress. This
function is useful in tracking progress when the number of iterations is large
or the procedures in each iteration take a long time.
Usage
loop(i, n = NULL, every = 1, extra = NULL)
Arguments
i | 
 The index value used in the loop.  | 
n | 
 The last entry in the loop.  | 
every | 
 The number of loops between messages.  | 
extra | 
 Additional information to print.  | 
Author(s)
David Diez
See Also
Examples
for (i in 1:160) {
  loop(i, 160, 20, paste("iter", i))
}
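Since the note above recommends `utils::txtProgressBar()` and `utils::setTxtProgressBar()`, here is a short sketch of the same 160-iteration loop using them. The running `total` is only a placeholder for whatever work each iteration would do.

```r
n <- 160

# Create a text progress bar that runs from 0 to n.
pb <- utils::txtProgressBar(min = 0, max = n, style = 3)

total <- 0
for (i in seq_len(n)) {
  total <- total + i # placeholder for the real work in each iteration
  utils::setTxtProgressBar(pb, i) # advance the bar to the current iteration
}

# Close the bar so the console moves to a new line.
close(pb)
```

With `style = 3` the bar also prints a percentage, which is handy when the total number of iterations is known in advance.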
Create a Line Segment Plot
Description
Create a simple plot showing a line segment.
Usage
lsegments(
  x = c(3, 7),
  l = "o",
  r = "c",
  ticks = TRUE,
  labs = 1,
  add = 0,
  ylim = c(-0.75, 0.25)
)
Arguments
x | 
 The endpoints of the interval. Values larger (smaller) than 999 (-999) will be interpreted as (negative) infinity.  | 
l | 
 Indicate whether the left end point should be open (  | 
r | 
 Indicate whether the right end point should be open (  | 
ticks | 
 Indicate whether to show tick marks (  | 
labs | 
 The position for the point labels. Set to   | 
add | 
 Indicate whether the line segment should be added to an existing
plot (  | 
ylim | 
 A vector of length 2 specifying the vertical plotting limits,
which may be useful for fine-tuning plots. The default is
  | 
Author(s)
David Diez
See Also
Examples
lsegments(c(2, 7), "o", "c", ylim = c(-0.3, 0.2))
lsegments(c(5, 7), "c", "c", ylim = c(-0.3, 0.2))
lsegments(c(4, 1000), "o", "o", ylim = c(-0.3, 0.2))
Influence of a Good Mood on Helpfulness
Description
This study investigated whether finding a coin influenced a person's likelihood of mailing a sealed but addressed letter that appeared to have been accidentally left in a conspicuous place. Several variables were collected during the experiment, including two randomized variables of whether there was a coin to be found and whether the letter already had a stamp on it.
Usage
mail_me
Format
A data frame with 42 observations on the following 4 variables.
- stamped
 a factor with levels no, yes
- found_coin
 a factor with levels coin, no_coin
- gender
 a factor with levels female, male
- mailed_letter
 a factor with levels no, yes
Details
The precise context was in a phone booth (this study is from the 1970s!), where a person who entered a phone booth would find a dime in the phone tray, which would be sufficient to pay for their phone call. There was also a letter next to the phone, which sometimes had a stamp on it.
Source
Levin PF, Isen AM. 1975. Studies on the Effect of Feeling Good on Helping. Sociometry 31(1), p141-147.
Examples
table(mail_me)
(x <- table(mail_me[, c("mailed_letter", "found_coin")]))
chisq.test(x)
(x <- table(mail_me[, c("mailed_letter", "stamped")]))
chisq.test(x)
m <- glm(mailed_letter ~ stamped + found_coin + gender,
  data = mail_me,
  family = binomial
)
summary(m)
Survey of Duke students and the area of their major
Description
Survey of 218 students, collecting information on their GPAs and their academic major.
Usage
major_survey
Format
A data frame with 218 observations on the following 2 variables.
- gpa
 Grade point average (GPA).
- major
 Area of academic major.
Examples
library(ggplot2)
ggplot(major_survey, aes(x = major, y = gpa)) +
  geom_boxplot()
Regression tube
Description
Produce a linear, quadratic, or nonparametric tube for regression data.
Usage
makeTube(
  x,
  y,
  Z = 2,
  R = 1,
  col = "#00000022",
  border = "#00000000",
  type = c("lin", "quad", "robust"),
  stDev = c("constant", "linear", "other"),
  length.out = 99,
  bw = "default",
  plotTube = TRUE,
  addLine = TRUE,
  ...
)
Arguments
x | 
 
  | 
y | 
 
  | 
Z | 
 Number of standard deviations out from the regression line to extend the tube.  | 
R | 
 Control of how far the tube extends to the left and right.  | 
col | 
 Fill color of the tube.  | 
border | 
 Border color of the tube.  | 
type | 
 The type of model fit to the data. Here   | 
stDev | 
 Choices are constant variance (  | 
length.out | 
 The number of observations used to build the regression model. This argument may be increased to increase the smoothing of a quadratic or nonparametric curve.  | 
bw | 
 Bandwidth used if   | 
plotTube | 
 Whether the tube should be plotted.  | 
addLine | 
 Whether the linear model should be plotted.  | 
... | 
 Additional arguments passed to the   | 
Value
X | 
 
  | 
Y | 
 
  | 
tubeX | 
 
  | 
tubeY | 
 
  | 
Author(s)
David Diez
See Also
Examples
# possum example
plot(possum$total_l, possum$head_l)
makeTube(possum$total_l, possum$head_l, 1)
makeTube(possum$total_l, possum$head_l, 2)
makeTube(possum$total_l, possum$head_l, 3)
# grades and TV example
plot(gradestv)
makeTube(gradestv$tv, gradestv$grades, 1.5)
plot(gradestv)
makeTube(gradestv$tv, gradestv$grades, 1.5, stDev = "o")
plot(gradestv)
makeTube(gradestv$tv, gradestv$grades, 1.5, type = "robust")
plot(gradestv)
makeTube(gradestv$tv, gradestv$grades, 1.5, type = "robust", stDev = "o")
# what can go wrong with a basic least squares model
# 1
x <- runif(100)
y <- 25 * x - 20 * x^2 + rnorm(length(x), sd = 1.5)
plot(x, y)
makeTube(x, y, type = "q")
# 2
x <- c(-0.6, -0.46, -0.091, runif(97))
y <- 25 * x + rnorm(length(x))
y[2] <- y[2] + 8
y[1] <- y[1] + 1
plot(x, y, ylim = range(y) + c(-10, 5))
makeTube(x, y)
# 3
x <- runif(100)
y <- 5 * x + rnorm(length(x), sd = x)
plot(x, y)
makeTube(x, y, stDev = "l", bw = 0.03)
Malaria Vaccine Trial
Description
Volunteer patients were randomized into one of two experiment groups where they would receive an experimental vaccine or a placebo. They were subsequently exposed to a drug-sensitive strain of malaria and observed to see whether they came down with an infection.
Usage
malaria
Format
A data frame with 20 observations on the following 2 variables.
- treatment
 Whether a person was given the experimental vaccine or a placebo.
- outcome
 Whether the person got an infection or no infection.
Details
In this study, volunteer patients were randomized into one of two experiment groups: 14 patients received an experimental vaccine and 6 patients received a placebo vaccine. Nineteen weeks later, all 20 patients were exposed to a drug-sensitive strain of the malaria parasite; a drug-sensitive strain was used for ethical reasons, so that any infections could be treated effectively.
Source
Lyke et al. 2017. PfSPZ vaccine induces strain-transcending T cells and durable protection against heterologous controlled human malaria infection. PNAS 114(10):2711-2716. doi:10.1073/pnas.1615324114.
Examples
library(dplyr)
# Calculate conditional probabilities of infection after vaccine/placebo
malaria |>
  count(treatment, outcome) |>
  group_by(treatment) |>
  mutate(prop = n / sum(n))
# Fisher's exact test
fisher.test(table(malaria))
Sample of 100 male heights
Description
Random sample based on Food Commodity Intake Database distribution
Usage
male_heights
Format
A data frame with 100 observations on the following variable.
- heights
 a numeric vector
References
What We Eat In America - Food Commodity Intake Database. Available at https://fcid.foodrisk.org/.
Examples
male_heights
Random sample of adult male heights
Description
This sample is based on data from the USDA Food Commodity Intake Database.
Usage
male_heights_fcid
Format
A data frame with 100 observations on the following variable.
- height_inch
 Height, in inches.
Source
Simulated based on data from USDA.
Examples
data(male_heights_fcid)
histPlot(male_heights_fcid$height_inch)
Sleep in Mammals
Description
This dataset includes data for 39 species of mammals distributed over 13 orders. The data were used for analyzing the relationship between constitutional and ecological factors and sleeping in mammals. Two qualitatively different sleep variables (dreaming and non dreaming) were recorded. Constitutional variables such as life span, body weight, brain weight and gestation time were evaluated. Ecological variables such as severity of predation, safety of sleeping place and overall danger were inferred from field observations in the literature.
Usage
mammals
Format
A data frame with 62 observations on the following 11 variables.
- species
 Species of mammals
- body_wt
 Total body weight of the mammal (in kg)
- brain_wt
 Brain weight of the mammal (in kg)
- non_dreaming
 Number of hours of non dreaming sleep
- dreaming
 Number of hours of dreaming sleep
- total_sleep
 Total number of hours of sleep
- life_span
 Life span (in years)
- gestation
 Gestation time (in days)
- predation
 An index of how likely the mammal is to be preyed upon. 1 = least likely to be preyed upon. 5 = most likely to be preyed upon.
- exposure
 An index of how exposed the mammal is during sleep. 1 = least exposed (e.g., sleeps in a well-protected den). 5 = most exposed.
- danger
 An index of how much danger the mammal faces from other animals. This index is based upon Predation and Exposure. 1 = least danger from other animals. 5 = most danger from other animals.
Source
http://www.statsci.org/data/general/sleep.txt
References
T. Allison and D. Cicchetti, "Sleep in mammals: ecological and constitutional correlates," Arch. Hydrobiol, vol. 75, p. 442, 1975.
Examples
library(ggplot2)
ggplot(mammals, aes(x = log(body_wt), y = log(brain_wt))) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Log of body weight", y = "Log of brain weight")
Experiment with Mammogram Randomized
Description
An experiment where 89,835 women were randomized to either get a mammogram or a non-mammogram breast screening. The response measured was whether they had died from breast cancer within 25 years.
Usage
mammogram
Format
A data frame with 89835 observations on the following 2 variables.
- treatment
 a factor with levels control and mammogram
- breast_cancer_death
 a factor with levels no and yes
Source
Miller AB. 2014. Twenty five year follow-up for breast cancer incidence and mortality of the Canadian National Breast Screening Study: randomised screening trial. BMJ 2014;348:g366.
Examples
table(mammogram)
chisq.test(table(mammogram))
manhattan
Description
A data frame containing data on apartment rentals in Manhattan.
Usage
manhattan
Format
A data frame with 20 observations on the following 1 variable.
- rent
 Monthly rent for a 1 bedroom apartment listed as "For rent by owner".
Examples
library(ggplot2)
ggplot(manhattan, aes(rent)) +
  geom_histogram(color = "white", binwidth = 300) +
  theme_minimal() +
  labs(
    title = "Rent in Manhattan",
    subtitle = "1 Bedroom Apartments",
    x = "Rent (in US$)",
    caption = "Source: Craigslist"
  )
New York City Marathon Times (outdated)
Description
Marathon times of male and female winners of the New York City Marathon 1970-1999. See nyc_marathon for a more updated dataset. We recommend not using this dataset since the data source has been taken off the web.
Usage
marathon
Format
A data frame with 60 observations on the following 3 variables.
- year
 Year
- gender
 Gender
- time
 Running time (in hours)
Source
Data source has been removed.
Examples
library(ggplot2)
ggplot(marathon, aes(x = time)) +
  geom_histogram(binwidth = 0.15)
ggplot(marathon, aes(y = time, x = gender)) +
  geom_boxplot()
Wii Mario Kart auctions from Ebay
Description
Auction data from Ebay for the game Mario Kart for the Nintendo Wii. This data was collected in early October 2009.
Usage
mariokart
Format
A data frame with 143 observations on the following 12 variables. All prices are in US dollars.
- id
 Auction ID assigned by Ebay.
- duration
 Auction length, in days.
- n_bids
 Number of bids.
- cond
 Game condition, either new or used.
- start_pr
 Start price of the auction.
- ship_pr
 Shipping price.
- total_pr
 Total price, which equals the auction price plus the shipping price.
- ship_sp
 Shipping speed or method.
- seller_rate
 The seller's rating on Ebay. This is the number of positive ratings minus the number of negative ratings for the seller.
- stock_photo
 Whether the auction feature photo was a stock photo or not. If the picture was used in many auctions, then it was called a stock photo.
- wheels
 Number of Wii wheels included in the auction. These are steering wheel attachments to make it seem as though you are actually driving in the game. When used with the controller, turning the wheel actually causes the character on screen to turn.
- title
 The title of the auctions.
Details
There are several interesting features in the data. First off, note that there are two outliers in the data. These serve as a nice example of what one should do when encountering an outlier: examine the data point and remove it only if there is a good reason. In these two cases, we can see from the auction titles that they included other items in their auctions besides the game, which justifies removing them from the dataset.
This dataset includes all auctions for a full week in October 2009. Auctions were included in the dataset if they satisfied a number of conditions. (1) They were included in a search for "wii mario kart" on ebay.com, (2) items were in the Video Games > Games > Nintendo Wii section of Ebay, (3) the listing was an auction and not exclusively a "Buy it Now" listing (sellers sometimes offer an optional higher price for a buyer to end bidding and win the auction immediately, which is an optional Buy it Now auction), (4) the item listed was the actual game, (5) the item was being sold from the US, (6) the item had at least one bidder, (7) there were no other items included in the auction with the exception of racing wheels, either generic or brand-name being acceptable, and (8) the auction did not end with a Buy It Now option.
Source
Ebay.
Examples
library(ggplot2)
library(broom)
library(dplyr)
# Identify outliers
ggplot(mariokart, aes(x = total_pr, y = cond)) +
  geom_boxplot()
# Replot without the outliers
mariokart |>
  filter(total_pr < 80) |>
  ggplot(aes(x = total_pr, y = cond)) +
  geom_boxplot()
# Fit multiple regression models
mariokart_no <- mariokart |> filter(total_pr < 80)
m1 <- lm(total_pr ~ cond + stock_photo + duration + wheels, data = mariokart_no)
tidy(m1)
m2 <- lm(total_pr ~ cond + stock_photo + wheels, data = mariokart_no)
tidy(m2)
m3 <- lm(total_pr ~ cond + wheels, data = mariokart_no)
tidy(m3)
# Fit diagnostics
aug_m3 <- augment(m3)
ggplot(aug_m3, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")
ggplot(aug_m3, aes(x = .fitted, y = abs(.resid))) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Absolute value of residuals")
ggplot(aug_m3, aes(x = 1:nrow(aug_m3), y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Order of data collection", y = "Residuals")
ggplot(aug_m3, aes(x = cond, y = .resid)) +
  geom_boxplot() +
  labs(x = "Condition", y = "Residuals")
ggplot(aug_m3, aes(x = wheels, y = .resid)) +
  geom_point() +
  labs(
    x = "Number of wheels", y = "Residuals",
    title = "Notice curvature"
  )
Marvel Cinematic Universe films
Description
A list of Marvel Cinematic Universe films through the Infinity saga. The Infinity saga is a 23-movie storyline spanning from Iron Man in 2008 to Endgame in 2019.
Usage
mcu_films
Format
A data frame with 23 rows and 7 variables.
- movie
 Title of the movie.
- length_hrs
 Length of the movie: hours portion.
- length_min
 Length of the movie: minutes portion.
- release_date
 Date the movie was released in the US.
- opening_weekend_us
 Box office totals for opening weekend in the US.
- gross_us
 All box office totals in US.
- gross_world
 All box office totals worldwide.
Details
Box office figures are not adjusted to a specific year. They are from the year the film was released.
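Because running time is split across length_hrs and length_min, a single duration needs both columns. A minimal sketch of the conversion, using made-up values rather than rows of mcu_films (the data frame films and the column length_total_min are hypothetical):

```r
# Made-up running times for illustration (not rows of mcu_films).
films <- data.frame(
  movie = c("Film A", "Film B"),
  length_hrs = c(2, 3),
  length_min = c(6, 1)
)

# Total running time in minutes: hours portion times 60 plus minutes portion
films$length_total_min <- 60 * films$length_hrs + films$length_min
films
```

The same expression should apply to the real columns, assuming both are stored as numbers.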
Source
Examples
library(ggplot2)
library(scales)
ggplot(mcu_films, aes(x = opening_weekend_us, y = gross_us)) +
  geom_point() +
  labs(
    title = "MCU Box Office Totals: Opening weekend vs. all-time",
    x = "Opening weekend totals (USD in millions)",
    y = "All-time totals (USD in millions)"
  ) +
  scale_x_continuous(labels = label_dollar(scale = 1 / 1000000)) +
  scale_y_continuous(labels = label_dollar(scale = 1 / 1000000))
President's party performance and unemployment rate
Description
Covers midterm elections.
Usage
midterms_house
Format
A data frame with 29 observations on the following 5 variables.
- year
 Year.
- potus
 The president in office.
- party
 President's party: Democrat or Republican.
- unemp
 Unemployment rate.
- house_change
 Change in House seats for the President's party.
Details
An older version of this data is at unemploy_pres.
Source
Wikipedia.
Examples
library(ggplot2)
ggplot(midterms_house, aes(x = unemp, y = house_change)) +
  geom_point()
Migraines and acupuncture
Description
Experiment involving acupuncture and sham acupuncture (as placebo) in the treatment of migraines.
Usage
migraine
Format
A data frame with 89 observations on the following 2 variables.
- group
 a factor with levels control and treatment
- pain_free
 a factor with levels no and yes
Source
G. Allais et al. Ear acupuncture in the treatment of migraine attacks: a randomized trial on the efficacy of appropriate versus inappropriate acupoints. In: Neurological Sci. 32.1 (2011), pp. 173-175.
Examples
migraine
US Military Demographics
Description
This dataset contains demographic information on every member of the US armed forces including gender, race, and rank.
Usage
military
Format
A data frame with 1,414,593 observations on the following 6 variables.
- grade
 The status of the service member as enlisted, officer, or warrant officer.
- branch
 The branch of the armed forces: air force, army, marine corps, navy.
- gender
 Whether the service member is female or male.
- race
 The race identified by the service member: ami/aln (american indian/alaskan native), asian, black, multi (multi-ethnic), p/i (pacific islander), unk (unknown), or white.
- hisp
 Whether a service member identifies with being hispanic (TRUE) or not (FALSE).
- rank
 The numeric rank of the service member (higher number indicates higher rank).
Details
The branches covered by this dataset include the Army, Navy, Air Force, and Marine Corps. Demographic information on the Coast Guard is contained in the original dataset but has not been included here.
Source
Data provided by the Department of Defense and made available at https://catalog.data.gov/dataset/personnel-trends-by-gender-race, retrieved 2012-02-20.
Examples
## Not run: 
library(dplyr)
library(ggplot2)
library(forcats)
# Proportion of females in military branches
military |>
  ggplot(aes(x = branch, fill = gender)) +
  geom_bar(position = "fill") +
  labs(
    x = "Branch", y = "Proportion", fill = "Gender",
    title = "Proportion of females in military branches"
  )
# Proportion of army officer females across ranks
military |>
  filter(
    grade == "officer",
    branch == "army"
  ) |>
  ggplot(aes(x = factor(rank), fill = fct_rev(gender))) +
  geom_bar(position = "fill") +
  labs(
    x = "Rank", y = "Proportion", fill = "Gender",
    title = "Proportion of army officer females across ranks"
  )
## End(Not run)
Salary data for Major League Baseball (2010)
Description
Salary data for Major League Baseball players in the year 2010.
Usage
mlb
Format
A data frame with 828 observations on the following 4 variables.
- player
 Player name
- team
 Team
- position
 Field position
- salary
 Salary (in $1000s)
Source
https://databases.usatoday.com/mlb-salaries/, retrieved 2011-02-23.
Examples
# _____ Basic Histogram _____ #
hist(mlb$salary / 1000,
  breaks = 15,
  main = "", xlab = "Salary (millions of dollars)", ylab = "",
  axes = FALSE,
  col = "#22558844"
)
axis(1, seq(0, 40, 10))
axis(2, c(0, 500))
axis(2, seq(100, 400, 100), rep("", 4), tcl = -0.2)
# _____ Histogram on Log Scale _____ #
hist(log(mlb$salary / 1000),
  breaks = 15,
  main = "", xlab = "log(Salary)", ylab = "",
  axes = FALSE, col = "#22558844"
)
axis(1) # , seq(0, 40, 10))
axis(2, seq(0, 300, 100))
# _____ Box plot of log(salary) against position _____ #
boxPlot(log(mlb$salary / 1000), mlb$position, horiz = TRUE, ylab = "")
Batter Statistics for 2018 Major League Baseball (MLB) Season
Description
Batter statistics for 2018 Major League Baseball season.
Usage
mlb_players_18
Format
A data frame with 1270 observations on the following 19 variables.
- name
 Player name
- team
 Team abbreviation
- position
 Position abbreviation: 1B = first base, 2B = second base, 3B = third base, C = catcher, CF = center field (outfield), DH = designated hitter, LF = left field (outfield), P = pitcher, RF = right field (outfield), SS = shortstop.
- games
 Number of games played.
- AB
 At bats.
- R
 Runs.
- H
 Hits.
- doubles
 Doubles.
- triples
 Triples.
- HR
 Home runs.
- RBI
 Runs batted in.
- walks
 Walks.
- strike_outs
 Strike outs.
- stolen_bases
 Stolen bases.
- caught_stealing_base
 Number of times caught stealing a base.
- AVG
 Batting average.
- OBP
 On-base percentage.
- SLG
 Slugging percentage.
- OPS
 On-base percentage plus slugging percentage.
Source
See Also
Examples
d <- subset(mlb_players_18, !position %in% c("P", "DH") & AB >= 100)
dim(d)
# _____ Per Position, No Further Grouping _____ #
plot(d$OBP ~ as.factor(d$position))
model <- lm(OBP ~ as.factor(position), d)
summary(model)
anova(model)
# _____ Simplified Analysis, Fewer Positions _____ #
pos <- list(
  c("LF", "CF", "RF"),
  c("1B", "2B", "3B", "SS"),
  "C"
)
POS <- c("OF", "IF", "C")
table(d$position)
# _____ On-Base Percentage Across Positions _____ #
out <- c()
gp <- c()
for (i in 1:length(pos)) {
  these <- which(d$position %in% pos[[i]])
  out <- c(out, d$OBP[these])
  gp <- c(gp, rep(POS[i], length(these)))
}
plot(out ~ as.factor(gp))
summary(lm(out ~ as.factor(gp)))
anova(lm(out ~ as.factor(gp)))
Major League Baseball Teams Data.
Description
A subset of data on Major League Baseball teams from Lahman's Baseball Database. The full dataset is available in the Lahman R package.
Usage
mlb_teams
Format
A data frame with 2784 rows and 41 variables.
- year
 Year of play.
- league_id
 League the team plays in with levels AL (American League) and NL (National League).
- division_id
 Division the team plays in with levels W (west), E (east) and C (central).
- rank
 Team's rank in their division at the end of the regular season.
- games_played
 Games played.
- home_games
 Games played at home.
- wins
 Number of games won.
- losses
 Number of games lost.
- division_winner
 Did the team win their division? Levels of Y (yes) and N (no).
- wild_card_winner
 Was the team a wild card winner? Levels of Y (yes) and N (no).
- league_winner
 Did the team win their league? Levels of Y (yes) and N (no).
- world_series_winner
 Did the team win the World Series? Levels of Y (yes) and N (no).
- runs_scored
 Number of runs scored during the season.
- at_bats
 Number of at bats during the season.
- hits
 Number of hits during the season. Includes singles, doubles, triples and homeruns.
- doubles
 Number of doubles hit.
- triples
 Number of triples hit.
- homeruns
 Homeruns by batters.
- walks
 Number of walks.
- strikeouts_by_batters
 Number of batters struck out.
- stolen_bases
 Number of stolen bases.
- caught_stealing
 Number of base runners caught stealing.
- batters_hit_by_pitch
 Number of batters hit by a pitch.
- sacrifice_flies
 Number of sacrifice flies.
- opponents_runs_scored
 Number of runs scored by opponents.
- earned_runs_allowed
 Number of earned runs allowed.
- earned_run_average
 Earned run average.
- complete_games
 Number of games where a single pitcher played the entire game.
- shutouts
 Number of shutouts.
- saves
 Number of saves.
- outs_pitches
 Number of outs pitched for the season (number of innings pitched times 3).
- hits_allowed
 Number of hits made by opponents.
- homeruns_allowed
 Number of homeruns hit by opponents.
- walks_allowed
 Number of opponents who were walked.
- strikeouts_by_pitchers
 Number of opponents who were struck out.
- errors
 Number of errors.
- double_plays
 Number of double plays.
- fielding_percentage
 Team's fielding percentage.
- team_name
 Full name of team.
- ball_park
 Home ballpark name.
- home_attendance
 Home attendance total.
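Since outs_pitches is described as innings pitched times 3, innings can be recovered by dividing by 3. The sketch below shows this on made-up seasons (not rows of mlb_teams); the win proportion column is an added convenience, not a variable in the dataset, and both new column names are hypothetical.

```r
# Made-up team seasons for illustration (not rows of mlb_teams).
teams <- data.frame(
  wins = c(95, 70),
  losses = c(67, 92),
  outs_pitches = c(4374, 4320)
)

# Innings pitched: outs divided by 3 (three outs per inning)
teams$innings_pitched <- teams$outs_pitches / 3

# Proportion of decided games won (assumed definition, not a dataset column)
teams$win_prop <- teams$wins / (teams$wins + teams$losses)
teams
```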
Source
Lahman's Baseball Database
Examples
library(dplyr)
# List the World Series winning teams for each year
mlb_teams |>
  filter(world_series_winner == "Y") |>
  select(year, team_name, ball_park)
# List the teams with their average number of wins and losses
mlb_teams |>
  group_by(team_name) |>
  summarize(mean_wins = mean(wins), mean_losses = mean(losses)) |>
  arrange(team_name)
Major League Baseball Player Hitting Statistics for 2010
Description
Major League Baseball Player Hitting Statistics for 2010.
Usage
mlbbat10
Format
A data frame with 1199 observations on the following 19 variables.
- name
 Player name
- team
 Team abbreviation
- position
 Player position
- game
 Number of games
- at_bat
 Number of at bats
- run
 Number of runs
- hit
 Number of hits
- double
 Number of doubles
- triple
 Number of triples
- home_run
 Number of home runs
- rbi
 Number of runs batted in
- total_base
 Total bases, computed as 3*HR + 2*3B + 1*2B + H
- walk
 Number of walks
- strike_out
 Number of strikeouts
- stolen_base
 Number of stolen bases
- caught_stealing
 Number of times caught stealing
- obp
 On base percentage
- slg
 Slugging percentage (total_base / at_bat)
- bat_avg
 Batting average
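The total_base and slg definitions can be checked by hand: every hit counts for one base, and doubles, triples, and home runs add one, two, and three extra bases respectively. A small worked example with a made-up season line (the data frame p is hypothetical, not a row of mlbbat10):

```r
# Made-up season line for illustration (not a row of mlbbat10).
p <- data.frame(hit = 150, double = 30, triple = 5, home_run = 20, at_bat = 500)

# One base per hit, plus the extra bases from doubles, triples, and home runs
total_base <- with(p, 3 * home_run + 2 * triple + 1 * double + hit)

# Slugging percentage: total bases per at bat
slg <- total_base / p$at_bat

total_base # 250
slg        # 0.5
```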
Source
https://www.mlb.com, retrieved 2011-04-22.
Examples
library(ggplot2)
library(dplyr)
library(scales)
mlbbat10_200 <- mlbbat10 |>
  filter(at_bat > 200)
# On-base percentage across positions
ggplot(mlbbat10_200, aes(x = position, y = obp, fill = position)) +
  geom_boxplot(show.legend = FALSE) +
  scale_y_continuous(labels = label_number(suffix = "%", accuracy = 0.01)) +
  labs(
    title = "On-base percentage across positions",
    y = "On-base percentage across positions",
    x = "Position"
  )
# Batting average across positions
ggplot(mlbbat10_200, aes(x = bat_avg, fill = position)) +
  geom_density(alpha = 0.5) +
  labs(
    title = "Batting average across positions",
    fill = NULL,
    y = "Density",
    x = "Batting average"
  )
# Mean number of home runs across positions
mlbbat10_200 |>
  group_by(position) |>
  summarise(mean_home_run = mean(home_run)) |>
  ggplot(aes(x = position, y = mean_home_run, fill = position)) +
  geom_col(show.legend = FALSE) +
  labs(
    title = "Mean number of home runs across positions",
    y = "Home runs",
    x = "Position"
  )
# Runs batted in across positions
ggplot(mlbbat10_200, aes(x = position, y = rbi, fill = position)) +
  geom_boxplot(show.legend = FALSE) +
  labs(
    title = "Runs batted in across positions",
    y = "Runs batted in",
    x = "Position"
  )
Minneapolis police use of force data.
Description
From Minneapolis, data from 2016 through August 2021
Usage
mn_police_use_of_force
Format
A data frame with 12925 rows and 13 variables.
- response_datetime
 DateTime of police response.
- problem
 Problem that required police response.
- is_911_call
 Whether response was initiated by call to 911.
- primary_offense
 Offense of subject.
- subject_injury
 Whether subject was injured Yes/No/null.
- force_type
 Type of police force used.
- force_type_action
 Detail of police force used.
- race
 Race of subject.
- sex
 Gender of subject.
- age
 Age of subject.
- type_resistance
 Resistance to police by subject.
- precinct
 Precinct where response occurred.
- neighborhood
 Neighborhood where response occurred.
Source
Examples
library(dplyr)
library(ggplot2)
# List percent of total for each race
mn_police_use_of_force |>
  count(race) |>
  mutate(percent = round(n / sum(n) * 100, 2)) |>
  arrange(desc(percent))
# Display use of force count by three races
race_sub <- c("Asian", "White", "Black")
ggplot(
  mn_police_use_of_force |> filter(race %in% race_sub),
  aes(force_type, ..count..)
) +
  geom_point(stat = "count", size = 4) +
  coord_flip() +
  facet_grid(race ~ .) +
  labs(
    x = "Force Type",
    y = "Number of Incidents"
  )
movies
Description
A dataset with information about movies released in 2003.
Usage
movies
Format
A data frame with 140 observations on the following 5 variables.
- movie
 Title of the movie.
- genre
 Genre of the movie.
- score
 Critics score of the movie on a 0 to 100 scale.
- rating
 MPAA rating of the film.
- box_office
 Millions of dollars earned at the box office in the US and Canada.
Source
Investigating Statistical Concepts, Applications and Methods
Examples
library(ggplot2)
ggplot(movies, aes(score, box_office, color = genre)) +
  geom_point() +
  theme_minimal() +
  labs(
    title = "Does a critic score predict box office earnings?",
    x = "Critic rating",
    y = "Box office earnings (millions US$)",
    color = "Genre"
  )
Medial temporal lobe (MTL) and other data for 35 participants
Description
The data are from a convenience sample of 25 women and 10 men who were middle-aged or older. The purpose of the study was to understand the relationship between sedentary behavior and thickness of the medial temporal lobe (MTL) in the brain.
Usage
mtl
Format
A data frame with 35 observations on the following 23 variables.
- subject
 ID for the individual.
- sex
 Gender, which takes values F (female) or M (male).
- ethnic
 Ethnicity, simplified to Caucasian and Other.
- educ
 Years of education.
- e4grp
 APOE-4 status, taking a value of E4 or Non-E4.
- age
 Age, in years.
- mmse
 Score from the Mini-Mental State Examination, which is a global cognition evaluation.
- ham_a
 Score on the Hamilton Rating Scale for anxiety.
- ham_d
 Score on the Hamilton Rating Scale for depression.
- dig_sym
 We (the authors of this R package) are unsure as to the meaning of this variable.
- delay_vp
 We (the authors of this R package) are unsure as to the meaning of this variable.
- bfr_selective_reminding_delayed
 We (the authors of this R package) are unsure as to the meaning of this variable.
- sitting
 Self-reported time sitting per day, averaged to the nearest hour.
- met_minwk
 Metabolic equivalent units score (activity level). A score of 0 means "no activity" while 3000 is considered "high activity".
- ipa_qgrp
 Classification of met_minwk into Low or High.
- aca1
 Thickness of the CA1 subregion of the MTL.
- aca23dg
 Thickness of the CA23DG subregion of the MTL.
- ae_cort
 Thickness of a subregion of the MTL.
- a_fusi_cort
 Thickness of the fusiform gyrus subregion of the MTL.
- a_ph_cort
 Thickness of the perirhinal cortex subregion of the MTL.
- a_pe_cort
 Thickness of the entorhinal cortex subregion of the MTL.
- asubic
 Thickness of the subiculum subregion of the MTL.
- total
 Total MTL thickness.
Source
Siddarth P, Burggren AC, Eyre HA, Small GW, Merrill DA. 2018. Sedentary behavior associated with reduced medial temporal lobe thickness in middle-aged and older adults. PLoS ONE 13(4): e0195549. doi:10.1371/journal.pone.0195549.
Thank you to Professor Silas Bergen of Winona State University for pointing us to this dataset!
References
A New York Times article references this study. https://www.nytimes.com/2018/04/19/opinion/standing-up-at-your-desk-could-make-you-smarter.html
Examples
library(ggplot2)
ggplot(mtl, aes(x = ipa_qgrp, y = met_minwk)) +
  geom_boxplot()
Data for 20 metropolitan areas
Description
Population, percent in poverty, percent unemployment, and murder rate.
Usage
murders
Format
A data frame with 20 metropolitan areas on the following 4 variables.
- population
 Population.
- perc_pov
 Percent in poverty.
- perc_unemp
 Percent unemployed.
- annual_murders_per_mil
 Number of murders per year per million people.
Source
We do not have provenance for these data and hence recommend not using them for analysis.
Examples
library(ggplot2)
ggplot(murders, aes(x = perc_pov, y = annual_murders_per_mil)) +
  geom_point() +
  labs(
    x = "Percent in poverty",
    y = "Number of murders per year per million people"
  )
Custom PDF function
Description
A similar function to pdf and png, except that different defaults are provided, including for the plotting parameters.
Usage
myPDF(
  fileName,
  width = 5,
  height = 3,
  mar = c(3.9, 3.9, 1, 1),
  mgp = c(2.8, 0.55, 0),
  las = 1,
  tcl = -0.3,
  ...
)
Arguments
fileName | 
 File name for the image to be output. The name should end in
  | 
width | 
 The width of the image file (inches). Default:   | 
height | 
 The height of the image file (inches). Default:   | 
mar | 
 Plotting margins. To change, input a numerical vector of length 4.  | 
mgp | 
 Margin graphing parameters. To change, input a numerical vector of length 3. The first argument specifies where x and y labels are placed; the second specifies where the axis labels are placed; and the third specifies how far to pull the entire axis from the plot.  | 
las | 
 Orientation of axis labels. Input   | 
tcl | 
 The tick mark length as a proportion of text height. The default
is   | 
... | 
 Additional arguments to   | 
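To make the idea of "pdf with different defaults" concrete, the sketch below wraps grDevices::pdf and graphics::par using the default values listed under Usage. It is a hypothetical stand-in named my_pdf_sketch, written for illustration; it is an assumption about the general approach and not the package's source code.

```r
# Hypothetical stand-in for illustration; not the source of myPDF.
my_pdf_sketch <- function(fileName, width = 5, height = 3,
                          mar = c(3.9, 3.9, 1, 1), mgp = c(2.8, 0.55, 0),
                          las = 1, tcl = -0.3, ...) {
  # Open the PDF device, then set plotting parameters on that device
  grDevices::pdf(fileName, width = width, height = height, ...)
  graphics::par(mar = mar, mgp = mgp, las = las, tcl = tcl)
  invisible(fileName)
}

out <- file.path(tempdir(), "sketch.pdf")
my_pdf_sketch(out)
hist(rnorm(100), main = "")
dev.off() # close the device so the file is written
file.exists(out)
```

As with pdf, nothing is written to disk until the device is closed with dev.off().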
Author(s)
David Diez
See Also
Examples
# save a plot to a PDF
# myPDF("myPlot.pdf")
histPlot(mariokart$total_pr)
# dev.off()
# save a plot to a PNG
# myPNG("myPlot.png")
histPlot(mariokart$total_pr)
# dev.off()
NBA Finals History
Description
This dataset contains information about the teams who played in the NBA Finals from 1950 - 2022.
Usage
nba_finals
Format
A data frame with 73 rows and 9 variables:
- year
 The year in which the Finals took place.
- winner
 The team who won the series.
- western_wins
 Number of series wins by the Western Conference Champions.
- eastern_wins
 Number of series wins by the Eastern Conference Champions.
- western_champions
 Team that won the Western Conference title and played in the Finals.
- eastern_champions
 Team that won the Eastern Conference title and played in the Finals.
- western_coach
 Coach of the Western Conference champions.
- eastern_coach
 Coach of the Eastern Conference champions.
- home_court
 Which conference held home court advantage for the series.
Source
Wikipedia: List of NBA Champions
Examples
library(dplyr)
library(ggplot2)
library(tidyr)
# Top 5 Appearing Coaches
nba_finals |>
  pivot_longer(
    cols = c("western_coach", "eastern_coach"),
    names_to = "conference", values_to = "coach"
  ) |>
  count(coach, sort = TRUE) |>
  slice_head(n = 5)
# Top 5 Winning Coaches
nba_finals |>
  mutate(
    winning_coach = case_when(
      western_wins == 4 ~ western_coach,
      eastern_wins == 4 ~ eastern_coach
    )
  ) |>
  count(winning_coach, sort = TRUE) |>
  slice_head(n = 5)
NBA Finals Team Summary
Description
A dataset with individual team summaries for the NBA Finals series from 1950 to 2022. To win the Finals, a team must win 4 games. The maximum number of games in a series is 7.
Usage
nba_finals_teams
Format
A data frame with 33 rows and 7 variables:
- team
 Team name.
- win
 Number of NBA Championships won.
- loss
 Number of NBA Championships lost.
- apps
 Number of NBA Finals appearances.
- pct
 Win percentage.
- years_won
 Years in which the team won a Championship.
- years_lost
 Years in which the team lost a Championship.
Details
Notes:
The Chicago Stags folded in 1950, the Washington Capitols in 1951 and the Baltimore Bullets in 1954.
This list uses current team names. For example, the Seattle SuperSonics are not on the list as that team moved and became the Oklahoma City Thunder.
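The summary columns are related in a simple way if one assumes that apps equals win plus loss and that pct is the share of appearances won; whether the package stores pct as a proportion or as a percentage is not stated here. A sketch on made-up records (not rows of nba_finals_teams):

```r
# Made-up Finals records for illustration (not rows of nba_finals_teams).
teams <- data.frame(
  team = c("Team A", "Team B"),
  win = c(3, 0),
  loss = c(1, 2)
)

teams$apps <- teams$win + teams$loss # assumed: appearances = wins + losses
teams$pct <- teams$win / teams$apps  # assumed: proportion of Finals won
teams
```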
Source
Examples
library(ggplot2)
library(dplyr)
library(openintro)
teams_with_apps <- nba_finals_teams |>
  filter(apps != 0)
ggplot(teams_with_apps, aes(x = win)) +
  geom_histogram(binwidth = 2) +
  labs(
    title = "Number of NBA Finals series wins",
    x = "Number of wins",
    y = "Number of teams"
  )
ggplot(teams_with_apps, aes(x = apps, y = win)) +
  geom_point(alpha = 0.3) +
  labs(
    title = "Can we predict how many NBA Championships a
team has based on the number of appearances?",
    x = "Number of NBA Finals appearances",
    y = "Number of NBA Finals series wins"
  )
NBA Player heights from 2008-9
Description
Heights of all NBA players from the 2008-9 season.
Usage
nba_heights
Format
A data frame with 435 observations (players) on the following 4 variables.
- last_name
 Last name.
- first_name
 First name.
- h_meters
 Height, in meters.
- h_in
 Height, in inches.
Source
Collected from NBA.
Examples
qqnorm(nba_heights$h_meters)
NBA Players for the 2018-2019 season
Description
Summary information from the NBA players for the 2018-2019 season.
Usage
nba_players_19
Format
A data frame with 494 observations on the following 7 variables.
- first_name
 First name.
- last_name
 Last name.
- team
 Team name
- team_abbr
 3-letter team abbreviation.
- position
 Player position.
- number
 Jersey number.
- height
 Height, in inches.
Source
Examples
hist(nba_players_19$height, 20)
table(nba_players_19$team)
North Carolina births, 1000 cases
Description
In 2004, the state of North Carolina released to the public a large dataset containing information on births recorded in this state. This dataset has been of interest to medical researchers who are studying the relation between habits and practices of expectant mothers and the birth of their children. This is a random sample of 1,000 cases from this dataset.
Usage
ncbirths
Format
A data frame with 1000 observations on the following 13 variables.
- fage
 Father's age in years.
- mage
 Mother's age in years.
- mature
 Maturity status of mother.
- weeks
 Length of pregnancy in weeks.
- premie
 Whether the birth was classified as premature (premie) or full-term.
- visits
 Number of hospital visits during pregnancy.
- gained
 Weight gained by mother during pregnancy in pounds.
- weight
 Weight of the baby at birth in pounds.
- lowbirthweight
 Whether baby was classified as low birthweight (low) or not (not low).
- gender
 Gender of the baby, female or male.
- habit
 Status of the mother as a nonsmoker or a smoker.
- marital
 Whether mother is married or not married at birth.
- whitemom
 Whether mom is white or not white.
See Also
We do not have ideal provenance for these data. For a better documented and more recent dataset on a similar topic with similar variables, see births14.
Examples
library(ggplot2)
ggplot(ncbirths, aes(x = habit, y = weight)) +
  geom_boxplot() +
  labs(x = "Smoking status of mother", y = "Birth weight of baby (in lbs)")
ggplot(ncbirths, aes(x = whitemom, y = visits)) +
  geom_boxplot() +
  labs(x = "Mother's race", y = "Number of doctor visits during pregnancy")
ggplot(ncbirths, aes(x = mature, y = gained)) +
  geom_boxplot() +
  labs(x = "Mother's age category", y = "Weight gained during pregnancy")
Normal distribution tails
Description
Produce a normal (or t) distribution and shaded tail.
Usage
normTail(
  m = 0,
  s = 1,
  L = NULL,
  U = NULL,
  M = NULL,
  df = 1000,
  curveColor = 1,
  border = 1,
  col = "#CCCCCC",
  xlim = NULL,
  ylim = NULL,
  xlab = "",
  ylab = "",
  digits = 2,
  axes = 1,
  detail = 999,
  xLab = c("number", "symbol"),
  cex.axis = 1,
  xAxisIncr = 1,
  add = FALSE,
  ...
)
Arguments
m | 
 Numerical value for the distribution mean.  | 
s | 
 Numerical value for the distribution standard deviation.  | 
L | 
 Numerical value representing the cutoff for a shaded lower tail.  | 
U | 
 Numerical value representing the cutoff for a shaded upper tail.  | 
M | 
 Numerical value representing the cutoff for a shaded central region.  | 
df | 
 Numerical value describing the degrees of freedom. Default is 1000.  | 
curveColor | 
 The color for the distribution curve.  | 
border | 
 The color for the border of the shaded area.  | 
col | 
 The color for filling the shaded area.  | 
xlim | 
 Limits for the x axis.  | 
ylim | 
 Limits for the y axis.  | 
xlab | 
 A title for the x axis.  | 
ylab | 
 A title for the y axis.  | 
digits | 
 The maximum number of digits past the decimal to use in axes values.  | 
axes | 
 A numeric value denoting whether to draw both axes (  | 
detail | 
 A number describing the number of points to use in drawing the normal curve. Smaller values correspond to a less smooth curve but reduced memory usage in the final file.  | 
xLab | 
 If   | 
cex.axis | 
 Numerical value controlling the size of the axis labels.  | 
xAxisIncr | 
 A number describing how often axis labels are placed,
scaled by standard deviations. This argument is ignored if   | 
add | 
 Boolean indicating whether to add this normal curve to the existing plot.  | 
... | 
 Additional arguments to   | 
Author(s)
David Diez
See Also
Examples
normTail(3, 2, 5)
normTail(3, 2, 1, xLab = "symbol")
normTail(3, 2, M = 1:2, xLab = "symbol", cex.axis = 0.8)
normTail(3, 2, U = 5, axes = FALSE)
normTail(L = -1, U = 2, M = c(0, 1), axes = 3, xAxisIncr = 2)
normTail(
  L = -1, U = 2, M = c(0, 1),
  xLab = "symbol", cex.axis = 0.8, xAxisIncr = 2
)
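The shaded regions drawn by normTail correspond to probabilities that can be computed with base R's pnorm(). The following is a minimal sketch, assuming the mean 3 and standard deviation 2 used in the examples above and reading M = 1:2 as the region between 1 and 2; the variable names are illustrative only.

```r
# Parameters matching the examples above: mean 3, standard deviation 2
m <- 3
s <- 2

# Area of a shaded lower tail with cutoff 5, P(X < 5)
lower_area <- pnorm(5, mean = m, sd = s)

# Area of a shaded upper tail with cutoff 5, P(X > 5)
upper_area <- pnorm(5, mean = m, sd = s, lower.tail = FALSE)

# Area of a central region between 1 and 2, P(1 < X < 2)
middle_area <- pnorm(2, mean = m, sd = s) - pnorm(1, mean = m, sd = s)

round(c(lower = lower_area, upper = upper_area, middle = middle_area), 4)
```

The lower and upper areas share the same cutoff, so they sum to 1.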
Nuclear Arms Reduction Survey
Description
A simple random sample of 1,028 US adults in March 2013 found that 56% support nuclear arms reduction.
Usage
nuclear_survey
Format
A data frame with 1028 observations on the following variable.
- arms_reduction
 Responses of favor or against.
Source
Gallup report: In U.S., 56 percent Favor U.S.-Russian Nuclear Arms Reductions. Available at https://news.gallup.com/poll/161198/favor-russian-nuclear-arms-reductions.aspx.
Examples
table(nuclear_survey)
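As a rough illustration of how the summary in the description can be used, the sketch below builds an approximate 95% confidence interval for the proportion in favor. It assumes the rounded 56% and the sample size of 1,028 quoted above rather than the exact counts in nuclear_survey, and it uses the usual normal approximation.

```r
# Values taken from the description above (56% is a rounded figure)
n <- 1028
p_hat <- 0.56

# Standard error of a sample proportion
se <- sqrt(p_hat * (1 - p_hat) / n)

# Approximate 95% confidence interval using the normal approximation
z_star <- qnorm(0.975)
ci <- p_hat + c(-1, 1) * z_star * se
round(ci, 3)
```

With the package loaded, the exact proportion could be obtained from the data instead, for example with prop.table(table(nuclear_survey)).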
nyc
Description
Zagat is a public survey where anyone can provide scores to a restaurant. The scores from the general public are then gathered to produce ratings. This dataset contains a list of 168 NYC restaurants and their Zagat Ratings.
Usage
nyc
Format
A data frame with 168 observations on the following 6 variables.
- restaurant
 Name of the restaurant.
- price
 Price of a meal for two, with drinks, in US $.
- food
 Zagat rating for food.
- decor
 Zagat rating for decor.
- service
 Zagat rating for service.
- east
 Indicator variable for location of the restaurant: 0 = west of 5th Avenue, 1 = east of 5th Avenue.
Details
For each category the scales are as follows:
- 0 - 9: poor to fair
- 10 - 15: fair to good
- 16 - 19: good to very good
- 20 - 25: very good to excellent
- 26 - 30: extraordinary to perfection
Examples
library(dplyr)
library(ggplot2)
location_labs <- c("West", "East")
names(location_labs) <- c(0, 1)
ggplot(nyc, mapping = aes(x = price, group = east, fill = east)) +
  geom_boxplot(alpha = 0.5) +
  facet_grid(east ~ ., labeller = labeller(east = location_labs)) +
  labs(
    title = "Is food more expensive east of 5th Avenue?",
    x = "Price (US$)"
  ) +
  guides(fill = "none") +
  theme_minimal() +
  theme(axis.text.y = element_blank())
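The rating scale described in Details can be turned into labeled categories with base R's cut(). This is only a sketch: the ratings below are made up for illustration (they are not rows of nyc), and a rating of exactly 25 is placed in the "very good to excellent" band.

```r
# Hypothetical ratings, made up for illustration (not taken from nyc)
ratings <- c(8, 14, 19, 23, 25, 28)

# Right-closed intervals: (-Inf, 9], (9, 15], (15, 19], (19, 25], (25, 30]
zagat_band <- cut(
  ratings,
  breaks = c(-Inf, 9, 15, 19, 25, 30),
  labels = c(
    "poor to fair",
    "fair to good",
    "good to very good",
    "very good to excellent",
    "extraordinary to perfection"
  )
)

data.frame(ratings, zagat_band)
```

The same call could be applied to nyc$food, nyc$decor, or nyc$service once the package is loaded.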
New York City Marathon Times
Description
Marathon times of runners in the Men and Women divisions of the New York City Marathon, 1970 - 2023.
Usage
nyc_marathon
Format
A data frame with 108 observations on the following 7 variables.
- year
 Year of marathon.
- name
 Name of winner.
- country
 Country of winner.
- time
 Running time (HH:MM:SS).
- time_hrs
 Running time (in hours).
- division
 Division: Men or Women.
- note
 Note about the race or the winning time.
Source
Wikipedia, List of winners of the New York City Marathon. Retrieved 6 November, 2023.
Examples
library(ggplot2)
ggplot(nyc_marathon, aes(x = year, y = time_hrs, color = division, shape = division)) +
  geom_point()
Flights data
Description
On-time data for a random sample of flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.
Usage
nycflights
Format
A tbl_df with 32,735 rows and 16 variables:
- year,month,day
 Date of departure.
- dep_time,arr_time
 Departure and arrival times, local tz.
- dep_delay,arr_delay
 Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
- hour,minute
 Time of departure broken into hour and minutes.
- carrier
 Two-letter carrier abbreviation. See airlines in the nycflights13 package for more information or google the airline code.
- tailnum
 Plane tail number.
- flight
 Flight number.
- origin,dest
 Origin and destination. See airports in the nycflights13 package for more information or google the airport code.
- air_time
 Amount of time spent in the air.
- distance
 Distance flown.
Source
Hadley Wickham (2014). nycflights13: Data about flights departing NYC in 2013. R package version 0.1.
Examples
library(dplyr)
# Longest departure delays
nycflights |>
  select(flight, origin, dest, dep_delay, arr_delay) |>
  arrange(desc(dep_delay))
# Longest arrival delays
nycflights |>
  select(flight, origin, dest, dep_delay, arr_delay) |>
  arrange(desc(arr_delay))
California poll on drilling off the California coast
Description
A 2010 survey asking a random sample of registered voters in California for their position on drilling for oil and natural gas off the coast of California.
Usage
offshore_drilling
Format
A data frame with 827 observations on the following 2 variables.
- position
 a factor with levels do not know, oppose, and support.
- college_grad
 a factor with levels no and yes.
Source
Survey USA, Election Poll #16804, data collected July 8-11, 2010.
Examples
offshore_drilling
OpenIntro colors
Description
A character string of full colors from IMSCOL[,1]
Usage
openintro_colors
Format
A named character string with 9 elements: "blue", "green", "pink", "yellow", "red", "black", "gray", "lgray
Examples
openintro_colors
openintro_colors["blue"]
Function to extract OpenIntro IMS colors as hex codes
Description
Uses full colors from IMSCOL
Usage
openintro_cols(...)
Arguments
... | 
 Character names of openintro_colors  | 
Examples
openintro_cols("blue")
openintro_cols("red")
Return function to interpolate an OpenIntro IMS color palette
Description
Not exported
Usage
openintro_pal(palette = "main", reverse = FALSE, ...)
Arguments
palette | 
 Character name of palette in openintro_palettes  | 
reverse | 
 Boolean indicating whether the palette should be reversed  | 
... | 
 Additional arguments to pass to   | 
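For readers unfamiliar with functions that return a palette interpolator, the sketch below shows the general pattern using grDevices::colorRampPalette(). It is an assumption that openintro_pal works along these lines; the helper make_pal and the hex codes are hypothetical and are not the package's source or its actual colors.

```r
# Hypothetical hex codes standing in for one entry of openintro_palettes
example_palette <- c("#569BBD", "#F05133", "#4C721D")

# Sketch of a palette-returning function (not the package's actual source)
make_pal <- function(colors, reverse = FALSE, ...) {
  if (reverse) colors <- rev(colors)
  # colorRampPalette() returns a function that takes the number of colors
  grDevices::colorRampPalette(colors, ...)
}

pal <- make_pal(example_palette)
pal(5) # five hex colors interpolated across the three input colors
```

The returned function can be called with any number of colors, which is what makes this pattern convenient for building plotting scales.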
OpenIntro palettes
Description
A list with OpenIntro color palettes
Usage
openintro_palettes
Format
A list with 8 color palettes: main, two, three, four, five, cool, hot, gray
Examples
openintro_palettes
openintro_palettes$main
openintro_palettes$three
openintro_palettes$cool
openintro_palettes$hot
Opportunity cost of purchases
Description
In a study on opportunity cost, 150 students were given the following statement: "Imagine that you have been saving some extra money on the side to make some purchases, and on your most recent visit to the video store you come across a special sale on a new video. This video is one with your favorite actor or actress, and your favorite type of movie (such as a comedy, drama, thriller, etc.). This particular video that you are considering is one you have been thinking about buying for a long time. It is available for a special sale price of $14.99. What would you do in this situation? Please circle one of the options below." Half of the students were given the following two options: (A) Buy this entertaining video. (B) Not buy this entertaining video. The other half were given the following two options (note the modified option B): (A) Buy this entertaining video. (B) Not buy this entertaining video. Keep the $14.99 for other purchases. The results of this study are in this dataset.
Usage
opportunity_cost
Format
A data frame with 150 observations on the following 2 variables.
- group
 a factor with levels control and treatment.
- decision
 a factor with levels buy video and not buy video.
Source
Frederick S, Novemsky N, Wang J, Dhar R, Nowlis S. 2009. Opportunity Cost Neglect. Journal of Consumer Research 36: 553-561.
Examples
library(ggplot2)
table(opportunity_cost)
ggplot(opportunity_cost, aes(y = group, fill = decision)) +
  geom_bar(position = "fill")
1986 Challenger disaster and O-rings
Description
On January 28, 1986, a routine launch was anticipated for the Challenger space shuttle. Seventy-three seconds into the flight, disaster happened: the shuttle broke apart, killing all seven crew members on board. An investigation into the cause of the disaster focused on a critical seal called an O-ring, and it is believed that damage to these O-rings during a shuttle launch may be related to the ambient temperature during the launch. The table below summarizes observational data on O-rings for 23 shuttle missions, where the mission order is based on the temperature at the time of the launch.
Usage
orings
Format
A data frame with 23 observations on the following 4 variables.
- mission
 Shuttle mission number.
- temperature
 Temperature, in Fahrenheit.
- damaged
 Number of damaged O-rings (out of 6).
- undamaged
 Number of undamaged O-rings (out of 6).
Source
https://archive.ics.uci.edu/dataset/92/challenger+usa+space+shuttle+o+ring
Examples
library(dplyr)
library(forcats)
library(tidyr)
library(broom)
# This is a wide data frame. You can convert it to a long
# data frame to predict probability of O-ring damage based
# on temperature using logistic regression.
orings_long <- orings |>
  pivot_longer(cols = c(damaged, undamaged), names_to = "outcome", values_to = "n") |>
  uncount(n) |>
  mutate(outcome = fct_relevel(outcome, "undamaged", "damaged"))
orings_mod <- glm(outcome ~ temperature, data = orings_long, family = "binomial")
tidy(orings_mod)
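The same kind of logistic regression can also be fit directly to a wide layout by giving glm() a two-column response of successes and failures, which avoids the reshaping step. The sketch below uses a small made-up data frame with the same column names as orings (the counts are hypothetical, not the real mission data) and base R only, and it compares the coefficients from the two layouts.

```r
# Hypothetical wide data with the same columns as orings (values made up)
toy <- data.frame(
  temperature = c(53, 57, 63, 70, 75, 81),
  damaged     = c(4, 2, 1, 1, 0, 0),
  undamaged   = c(2, 4, 5, 5, 6, 6)
)

# Wide layout: the response is a matrix of (damaged, undamaged) counts
mod_wide <- glm(cbind(damaged, undamaged) ~ temperature,
  data = toy, family = binomial
)

# Long layout: one row per O-ring, built here without tidyr
toy_long <- data.frame(
  temperature = rep(rep(toy$temperature, 2),
    times = c(toy$damaged, toy$undamaged)
  ),
  outcome = factor(
    rep(c("damaged", "undamaged"),
      times = c(sum(toy$damaged), sum(toy$undamaged))
    ),
    levels = c("undamaged", "damaged")
  )
)
mod_long <- glm(outcome ~ temperature, data = toy_long, family = binomial)

# Both layouts describe the same likelihood, so the estimates agree
rbind(wide = coef(mod_wide), long = coef(mod_long))

# Predicted probability that a single O-ring is damaged at 60 degrees
p_60 <- predict(mod_wide,
  newdata = data.frame(temperature = 60), type = "response"
)
```

The wide form is often more convenient when the data already arrive as counts per mission, while the long form matches the one-row-per-observation layout used by many modeling tools.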
Oscar winners, 1929 to 2018
Description
Best actor and actress Oscar winners from 1929 to 2018.
Usage
oscars
Format
A data frame with 182 observations on the following 11 variables.
- oscar_no
 Oscar ceremony number.
- oscar_yr
 Year the Oscar ceremony was held.
- award
 Best actress or Best actor.
- name
 Name of winning actor or actress.
- movie
 Name of movie actor or actress got the Oscar for.
- age
 Age at which the actor or actress won the Oscar.
- birth_pl
 US State where the actor or actress was born, country if foreign.
- birth_date
 Birth date of actor or actress.
- birth_mo
 Birth month of actor or actress.
- birth_d
 Birth day of actor or actress.
- birth_y
 Birth year of actor or actress.
Details
Although there have been only 84 Oscar ceremonies until 2012, there are 85 male winners and 85 female winners because ties happened on two occasions (1933 for the best actor and 1969 for the best actress).
Source
Journal of Statistics Education, http://jse.amstat.org/datasets/oscars.dat.txt, updated through 2019 using information from Oscars.org and Wikipedia.org.
Examples
library(ggplot2)
library(dplyr)
ggplot(oscars, aes(x = award, y = age)) +
  geom_boxplot()
ggplot(oscars, aes(x = factor(birth_mo))) +
  geom_bar()
oscars |>
  count(birth_pl, sort = TRUE)
Simulated datasets for different types of outliers
Description
Datasets for showing different types of outliers.
Usage
outliers
Format
A data frame with 50 observations on the following 5 variables.
- x
 a numeric vector
- y
 a numeric vector
- x_inf
 a numeric vector
- y_lev
 a numeric vector
- y_out
 a numeric vector
Examples
outliers
Race time for Olympic and Paralympic 1500m.
Description
Compiled gold medal times for the 1500m race in the Olympic Games and the Paralympic Games. The times given for contestants competing in the Paralympic Games are for athletes with different visual impairments; T11 indicates fully blind (with an option to race with a guide-runner) with T12 and T13 as lower levels of visual impairment.
Usage
paralympic_1500
Format
A data frame with 83 rows and 10 variables.
- year
 Year the games took place.
- city
 City of the games.
- country_of_games
 Country of the games.
- division
 Division: Men or Women.
- type
 Type.
- name
 Name of the athlete.
- country_of_athlete
 Country of athlete.
- time
 Time of gold medal race, in m:s.
- time_min
 Time of gold medal race, in decimal minutes (min + sec/60).
Source
https://www.paralympic.org/ and https://en.wikipedia.org/wiki/1500_metres_at_the_Olympics.
Examples
library(ggplot2)
library(dplyr)
paralympic_1500 |>
  mutate(
    sight_level = case_when(
      type == "T11" ~ "total impairment",
      type == "T12" ~ "some impairment",
      type == "T13" ~ "some impairment",
      type == "Olympic" ~ "no impairment"
    )
  ) |>
  filter(division == "Men", year > 1920) |>
  filter(type == "Olympic" | type == "T11") |>
  ggplot(aes(x = year, y = time_min, color = sight_level, shape = sight_level)) +
  geom_point() +
  scale_x_continuous(breaks = seq(1924, 2020, by = 8)) +
  labs(
    title = "Men's Olympic and Paralympic 1500m race times",
    x = "Year",
    y = "Time of Race (minutes)",
    color = "Sight level",
    shape = "Sight level"
  )
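The time_min column is described as min + sec/60. A minimal sketch of that conversion is shown below, assuming the m:s values are stored as text such as "3:30"; the helper name ms_to_minutes and the example times are hypothetical and are not part of the package.

```r
# Convert "m:s" strings to decimal minutes using min + sec / 60
ms_to_minutes <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(
    parts,
    function(p) as.numeric(p[1]) + as.numeric(p[2]) / 60,
    numeric(1)
  )
}

# Hypothetical race times for illustration (not rows of paralympic_1500)
ms_to_minutes(c("3:30", "4:15.6"))
```

Decimal minutes are easier to plot and model than m:s strings, which is why the examples above use time_min on the y axis.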
Guesses at the weight of Penelope (a cow)
Description
The data was collected by the Planet Money podcast to test a theory about crowd-sourcing. Penelope's actual weight was 1,355 pounds.
Usage
penelope
Format
A data frame with 17,184 observations on the following variable.
- weight
 Guesses of Penelope's weight, in pounds.
Source
Examples
library(ggplot2)
ggplot(penelope, aes(x = weight)) +
  geom_histogram(binwidth = 250)
summary(penelope$weight)
What's the best way to loosen a rusty bolt?
Description
The channel Project Farm on YouTube investigated penetrating oils and other options for loosening rusty bolts. Eight options were evaluated, including a control group, to determine which was most effective.
Usage
penetrating_oil
Format
A data frame with 30 observations on the following 2 variables.
- treatment
 The different treatments tried: none (control), Heat (via blow torch), Acetone/ATF, AeroKroil, Liquid Wrench, PB Blaster, Royal Purple, and WD-40.
- torque
 Torque required to loosen the rusty bolt, which was measured in foot-pounds.
Source
https://www.youtube.com/watch?v=xUEob2oAKVs
Examples
m <- lm(torque ~ treatment, data = penetrating_oil)
anova(m)
# There are 28 pairwise comparisons to be made.
xbar <- tapply(penetrating_oil$torque, penetrating_oil$treatment, mean)
n <- tapply(penetrating_oil$torque, penetrating_oil$treatment, length)
s <- summary(m)$sigma
df <- summary(m)$df[2] # residual degrees of freedom for the pooled t-tests
diff <- c()
se <- c()
k <- 0
N <- length(n)
K <- N * (N - 1) / 2
for (i in 1:(N - 1)) {
  for (j in (i + 1):N) {
    k <- k + 1
    diff[k] <- xbar[i] - xbar[j]
    se[k] <- s * sqrt(1 / n[i] + 1 / n[j])
    if (2 * K * pt(-abs(diff[k] / se[k]), df) < 0.05) {
      cat("0.05 - ", names(n)[c(i, j)], "\n")
    } else if (2 * K * pt(-abs(diff[k] / se[k]), df) < 0.1) {
      cat("0.1 - ", names(n)[c(i, j)], "\n")
    } else if (2 * K * pt(-abs(diff[k] / se[k]), df) < 0.2) {
      cat("0.2 - ", names(n)[c(i, j)], "\n")
    } else if (2 * K * pt(-abs(diff[k] / se[k]), df) < 0.3) {
      cat("0.3 - ", names(n)[c(i, j)], "\n")
    }
  }
}
# Smallest p-value using Bonferroni
min(2 * K * pt(-abs(diff / se), df))
# Better pairwise comparison method.
anova(m1 <- aov(torque ~ treatment, data = penetrating_oil))
TukeyHSD(m1)
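Base R's pairwise.t.test() performs Bonferroni-adjusted pairwise comparisons with a pooled standard deviation in a single call. The sketch below runs it on made-up data (three hypothetical treatments, not the real torque measurements) and checks one comparison by hand using the residual degrees of freedom of the pooled model.

```r
# Made-up torque values for three hypothetical treatments (not the real data)
set.seed(1)
toy <- data.frame(
  treatment = factor(rep(c("A", "B", "C"), each = 4)),
  torque = c(rnorm(4, 100, 10), rnorm(4, 110, 10), rnorm(4, 130, 10))
)

# Built-in version: pooled SD across groups, Bonferroni-adjusted p-values
pw <- pairwise.t.test(toy$torque, toy$treatment,
  p.adjust.method = "bonferroni", pool.sd = TRUE
)
pw$p.value

# Manual check for the A vs. B comparison
fit <- lm(torque ~ treatment, data = toy)
xbar <- tapply(toy$torque, toy$treatment, mean)
n <- tapply(toy$torque, toy$treatment, length)
s <- summary(fit)$sigma                  # pooled standard deviation
df_resid <- df.residual(fit)             # residual degrees of freedom
K <- choose(nlevels(toy$treatment), 2)   # number of pairwise comparisons
se_ab <- s * sqrt(1 / n[["A"]] + 1 / n[["B"]])
t_ab <- (xbar[["A"]] - xbar[["B"]]) / se_ab
p_ab <- min(1, K * 2 * pt(-abs(t_ab), df_resid))
```

Note that pairwise.t.test() caps adjusted p-values at 1, which the manual calculation reproduces with min().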
Penny Ages
Description
Sample of pennies and their ages. Taken in 2004.
Usage
penny_ages
Format
A data frame with 648 observations on the following 2 variables.
- year
 Penny's year.
- age
 Age as of 2004.
Examples
hist(penny_ages$year)
Pew Survey on Energy Sources in 2018
Description
US-based survey on support for expanding six different sources of energy, including solar, wind, offshore drilling, hydraulic fracturing ("fracking"), coal, and nuclear.
Usage
pew_energy_2018
Format
A list of 6 elements:
- solar_panel_farms
 List of responses on solar farms.
- wind_turbine_farms
 List of responses on wind turbine farms.
- offshore_drilling
 List of responses on offshore drilling.
- hydrolic_fracturing
 List of responses on hydraulic fracturing.
- coal_mining
 List of responses on coal mining.
- nuclear_power_plants
 List of responses on nuclear.
Details
We did not have access to individual responses in the original dataset, so we took the published percentages and backed out the breakdown.
Source
Examples
data(pew_energy_2018)
lapply(pew_energy_2018, head)
lapply(pew_energy_2018, length)
lapply(pew_energy_2018, table)
Prop <- function(x) {
  table(x) / length(x)
}
lapply(pew_energy_2018, Prop)
Photo classifications: fashion or not
Description
This is a simulated dataset for photo classifications based on a machine learning algorithm versus what the true classification is for those photos. While the data are not real, they resemble performance that would be reasonable to expect in a well-built classifier.
Usage
photo_classify
Format
A data frame with 1822 observations on the following 2 variables.
- mach_learn
 The prediction by the machine learning system as to whether the photo is about fashion or not.
- truth
 The actual classification of the photo by a team of humans.
Details
The hypothetical ML algorithm has a precision of 90%, meaning that of those photos it claims are fashion, about 90% are actually about fashion. The recall of the ML algorithm is about 64%, meaning that of the photos that are about fashion, it correctly predicts that they are about fashion about 64% of the time.
Source
The data are simulated / hypothetical.
Examples
data(photo_classify)
table(photo_classify)
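Precision is the share of photos predicted to be fashion that truly are fashion, and recall is the share of true fashion photos that the classifier catches. The sketch below computes both from a pair of label vectors. The helper precision_recall, the toy data, and the label strings are hypothetical; check the actual level names of photo_classify before applying it to the real data.

```r
# Precision and recall for one "positive" class
precision_recall <- function(predicted, truth, pred_positive, true_positive) {
  said_yes <- predicted == pred_positive
  is_yes <- truth == true_positive
  true_pos <- sum(said_yes & is_yes)
  c(
    precision = true_pos / sum(said_yes),
    recall = true_pos / sum(is_yes)
  )
}

# Toy data with made-up labels: 3 true positives, 1 false positive,
# 2 false negatives, and 4 true negatives
toy <- data.frame(
  mach_learn = rep(
    c("pred_fashion", "pred_fashion", "pred_not", "pred_not"),
    times = c(3, 1, 2, 4)
  ),
  truth = rep(c("fashion", "not", "fashion", "not"), times = c(3, 1, 2, 4))
)

precision_recall(toy$mach_learn, toy$truth, "pred_fashion", "fashion")
```

For the toy data, precision is 3 / 4 and recall is 3 / 5.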
Piracy and PIPA/SOPA
Description
This dataset contains observations on all 100 US Senators and 434 of the 435 US Congressional Representatives related to their support of anti-piracy legislation that was introduced at the end of 2011.
Usage
piracy
Format
A data frame with 534 observations on the following 8 variables.
- name
 Name of legislator.
- party
 Party affiliation as Democrat (D), Republican (R), or Independent (I).
- state
 Two letter state abbreviation.
- money_pro
 Amount of money in dollars contributed to the legislator's campaign in 2010 by groups generally thought to be supportive of PIPA/SOPA: movie and TV studios, record labels.
- money_con
 Amount of money in dollars contributed to the legislator's campaign in 2010 by groups generally thought to be opposed to PIPA/SOPA: computer and internet companies.
- years
 Number of years of service in Congress.
- stance
 Degree of support for PIPA/SOPA with levels Leaning No, No, Undecided, Unknown, and Yes.
- chamber
 Whether the legislator is a member of the house or senate.
Details
The Stop Online Piracy Act (SOPA) and the Protect Intellectual Property Act (PIPA) were two bills introduced in the US House of Representatives and the US Senate, respectively, to curtail copyright infringement. The bills were controversial because of concerns that they limited free speech rights. ProPublica, the independent and non-profit news organization, compiled this dataset to compare the stance of legislators towards the bills with the amount of campaign funds that they received from groups considered to be supportive of or in opposition to the legislation.
For more background on the legislation and the formulation of
money_pro and money_con, read the documentation on ProPublica,
linked below.
Source
https://projects.propublica.org/sopa The list may be slightly out of date since many politicians' perspectives on the legislation were in flux at the time of data collection.
Examples
library(dplyr)
library(ggplot2)
pipa <- filter(piracy, chamber == "senate")
pipa |>
  group_by(stance) |>
  summarise(money_pro_mean = mean(money_pro, na.rm = TRUE)) |>
  ggplot(aes(x = stance, y = money_pro_mean)) +
  geom_col() +
  labs(
    x = "Stance", y = "Average contribution, in $",
    title = "Average contribution to the legislator's campaign in 2010",
    subtitle = "by groups supportive of PIPA/SOPA (movie and TV studios, record labels)"
  )
ggplot(pipa, aes(x = stance, y = money_pro)) +
  geom_boxplot() +
  labs(
    x = "Stance", y = "Contribution, in $",
    title = "Contribution by groups supportive of PIPA/SOPA",
    subtitle = "Movie and TV studios, record labels"
  )
ggplot(pipa, aes(x = stance, y = money_con)) +
  geom_boxplot() +
  labs(
    x = "Stance", y = "Contribution, in $",
    title = "Contribution by groups opposed to PIPA/SOPA",
    subtitle = "Computer and internet companies"
  )
pipa |>
  filter(
    money_pro > 0,
    money_con > 0
  ) |>
  mutate(for_pipa = ifelse(stance == "Yes", "yes", "no")) |>
  ggplot(aes(x = money_pro, y = money_con, color = for_pipa)) +
  geom_point() +
  scale_color_manual(values = c("gray", "red")) +
  scale_y_log10() +
  scale_x_log10() +
  labs(
    x = "Contribution by pro-PIPA groups",
    y = "Contribution by anti-PIPA groups",
    color = "For PIPA"
  )
Table of Playing Cards in 52-Card Deck
Description
A table describing each of the 52 cards in a deck.
Usage
playing_cards
Format
A data frame with 52 observations on the following 3 variables.
- number
 The number or card type.
- suit
 Card suit, which takes one of four values: Club, Diamond, Heart, or Spade.
- face_card
 Whether the card counts as a face card.
Source
This extremely complex dataset was generated from scratch.
Examples
playing_cards <- data.frame(
  number = rep(c(2:10, "J", "Q", "K", "A"), 4),
  suit = rep(c("Spade", "Diamond", "Club", "Heart"), rep(13, 4))
)
playing_cards$face_card <-
  ifelse(playing_cards$number %in% c(2:10, "A"), "no", "yes")
Air quality for Durham, NC
Description
Daily air quality is measured by the air quality index (AQI) reported by the Environmental Protection Agency in 2011.
Usage
pm25_2011_durham
Format
A data frame with 449 observations on the following 20 variables.
- date
 Date
- aqs_site_id
 The numeric site ID.
- poc
 A numeric vector, the Parameter Occurrence Code.
- daily_mean_pm2_5_concentration
 A numeric vector with the average daily concentration of fine particulates, or particulate matter 2.5.
- units
 A character vector with value ug/m3 LC.
- daily_aqi_value
 A numeric vector with the daily air quality index.
- daily_obs_count
 A numeric vector.
- percent_complete
 A numeric vector.
- aqs_parameter_code
 A numeric vector.
- aqs_parameter_desc
 A factor with levels PM2.5 - Local Conditions and Acceptable PM2.5 AQI & Speciation Mass.
- cbsa_code
 A numeric vector.
- cbsa_name
 A character vector with value Durham, NC.
- state_code
 A numeric vector.
- state
 A character vector with value North Carolina.
- county_code
 A numeric vector.
- county
 A character vector with value Durham.
- site_latitude
 A numeric vector of the latitude.
- site_longitude
 A numeric vector of the longitude.
- csa_code
 A numeric vector.
- csa_name
 A factor with levels Raleigh-Durham-Cary, NC.
Source
US Environmental Protection Agency, AirData, 2011. http://www3.epa.gov/airdata/ad_data_daily.html
Examples
library(ggplot2)
ggplot(pm25_2011_durham, aes(x = date, y = daily_mean_pm2_5_concentration, group = 1)) +
  geom_line()
Air quality for Durham, NC
Description
Daily air quality is measured by the air quality index (AQI) reported by the Environmental Protection Agency in 2022.
Usage
pm25_2022_durham
Format
A data frame with 356 observations on the following 20 variables.
- date
 Date.
- aqs_site_id
 The numeric site ID.
- poc
 A numeric vector, the Parameter Occurrence Code.
- daily_mean_pm2_5_concentration
 A numeric vector with the average daily concentration of fine particulates, or particulate matter 2.5.
- units
 A character vector with value ug/m3 LC.
- daily_aqi_value
 A numeric vector with the daily air quality index.
- daily_obs_count
 A numeric vector.
- percent_complete
 A numeric vector.
- aqs_parameter_code
 A numeric vector.
- aqs_parameter_desc
 A factor with level PM2.5 - Local Conditions.
- cbsa_code
 A numeric vector.
- cbsa_name
 A character vector with value Durham-Chapel Hill, NC.
- state_code
 A numeric vector.
- state
 A character vector with value North Carolina.
- county_code
 A numeric vector.
- county
 A character vector with value Durham.
- site_latitude
 A numeric vector of the latitude.
- site_longitude
 A numeric vector of the longitude.
- site_name
 A character vector with value Durham Armory.
Source
US Environmental Protection Agency, AirData, 2022. http://www3.epa.gov/airdata/ad_data_daily.html
Examples
library(ggplot2)
ggplot(pm25_2022_durham, aes(x = date, y = daily_mean_pm2_5_concentration, group = 1)) +
  geom_line()
Poker winnings during 50 sessions
Description
Poker winnings (and losses) for 50 days by a professional poker player.
Usage
poker
Format
A data frame with 50 observations on the following variable.
- winnings
 Poker winnings and losses, in US dollars.
Source
Anonymity has been requested by the player.
Examples
library(ggplot2)
ggplot(poker, aes(x = winnings)) +
  geom_histogram(binwidth = 250)
Possums in Australia and New Guinea
Description
Data representing possums in Australia and New Guinea. This is a copy of the dataset by the same name in the DAAG package; however, the dataset included here has fewer variables.
Usage
possum
Format
A data frame with 104 observations on the following 8 variables.
- site
 The site number where the possum was trapped.
- pop
 Population, either Vic (Victoria) or other (New South Wales or Queensland).
- sex
 Gender, either m (male) or f (female).
- age
 Age.
- head_l
 Head length, in mm.
- skull_w
 Skull width, in mm.
- total_l
 Total length, in cm.
- tail_l
 Tail length, in cm.
Source
Lindenmayer, D. B., Viggers, K. L., Cunningham, R. B., and Donnelly, C. F. 1995. Morphological variation among populations of the mountain brushtail possum, Trichosurus caninus Ogilby (Phalangeridae: Marsupialia). Australian Journal of Zoology 43: 449-458.
Examples
library(ggplot2)
# Skull width vs. head length
ggplot(possum, aes(x = head_l, y = skull_w)) +
  geom_point()
# Total length vs. sex
ggplot(possum, aes(x = total_l, fill = sex)) +
  geom_density(alpha = 0.5)
US Poll on who it is better to raise taxes on
Description
A poll of 691 people, with party affiliation collected, asked whether they think it's better to raise taxes on the rich or raise taxes on the poor.
Usage
ppp_201503
Format
A data frame with 691 observations on the following 2 variables.
- party
 Political party affiliation.
- taxes
 Support for who to raise taxes on.
Source
Public Policy Polling, Americans on College Degrees, Classic Literature, the Seasons, and More, data collected Feb 20-22, 2015.
Examples
library(ggplot2)
ggplot(ppp_201503, aes(x = party, fill = taxes)) +
  geom_bar(position = "fill") +
  labs(x = "Party", y = "Proportion", fill = "Taxes")
Birth counts
Description
An updated version of the historical Arbuthnot dataset. Numbers of boys and girls born in the United States between 1940 and 2002.
Usage
present
Format
A data frame with 63 observations on the following 3 variables.
- year
 Year.
- boys
 Number of boys born.
- girls
 Number of girls born.
Source
Mathews, T. J., and Brady E. Hamilton. "Trend analysis of the sex ratio at birth in the United States." National vital statistics reports 53.20 (2005): 1-17.
Examples
library(ggplot2)
ggplot(present, mapping = aes(x = year, y = boys / girls)) +
  geom_line()
United States Presidential History
Description
Summary of the changes in the president and vice president for the United States of America.
Usage
president
Format
A data frame with 67 observations on the following 5 variables.
- potus
 President of the United States
- party
 Political party of the president
- start
 Start year
- end
 End year
- vpotus
 Vice President of the United States
Source
Presidents of the United States (table) – infoplease.com (visited: Nov 2nd, 2010)
https://www.infoplease.com/us/government/executive-branch/presidents and https://www.infoplease.com/us/government/executive-branch/vice-presidents
Examples
president
Prison isolation experiment
Description
Subjects from Central Prison in Raleigh, NC, volunteered for an experiment involving an "isolation" experience. The goal of the experiment was to find a treatment that reduces subjects' psychopathic deviant T scores. This score measures a person's need for control or their rebellion against control, and it is part of a commonly used mental health test called the Minnesota Multiphasic Personality Inventory (MMPI) test.
Usage
prison
Format
A data frame with 14 observations on the following 6 variables.
- pre_trt1
 Pre-treatment 1.
- post_trt1
 Post-treatment 1.
- pre_trt2
 Pre-treatment 2.
- post_trt2
 Post-treatment 2.
- pre_trt3
 Pre-treatment 3.
- post_trt3
 Post-treatment 3.
Source
https://stat.duke.edu/datasets/prison-isolation
Examples
prison
User reported fuel efficiency for 2017 Toyota Prius Prime
Description
Fueleconomy.gov, the official US government source for fuel economy information, allows users to share gas mileage information on their vehicles. These data come from 19 users sharing gas mileage on their 2017 Toyota Prius Prime. Note that these data are user estimates, and since the source data cannot be verified, the accuracy of these estimates is not guaranteed.
Usage
prius_mpg
Format
A data frame with 19 observations on the following 10 variables.
- average_mpg
 Average mileage as estimated by the user.
- state
 US State the user lives in.
- stop_and_go
 Proportion of stop and go driving.
- highway
 Proportion of highway driving.
- last_updated
 Date estimate was last updated.
Source
Fueleconomy.gov, https://www.fueleconomy.gov/mpg/MPG.do?action=mpgData&vehicleID=38531&browser=true&details=on, retrieved 2019-04-14.
Examples
library(ggplot2)
library(dplyr)
ggplot(prius_mpg, aes(x = average_mpg)) +
  geom_histogram(binwidth = 25)
Generate simulated QQ plots
Description
Create a 3 x 3 grid of quantile-quantile plots, the first of which corresponds to the input data. The other eight plots arise from simulating random normal data with the same mean, standard deviation, and length as the data. Comparing the observed QQ plot against these known-normal QQ plots helps assess normality.
Usage
qqnormsim(sample, data)
Arguments
sample | 
 the variable to be plotted.  | 
data | 
 data frame to use.  | 
Value
A 3 x 3 grid of qqplots.
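The idea described above can be sketched in base R. This is only an illustration of the technique, not the package's implementation of qqnormsim; the observed sample `obs` is hypothetical simulated data and the object names are assumptions.

```r
# Base-R sketch of the idea behind qqnormsim(); not the package's own code.
set.seed(1)
obs <- rexp(50) # hypothetical observed sample (right-skewed on purpose)

# Eight normal samples with the same mean, sd, and length as the data.
sims <- replicate(
  8,
  rnorm(length(obs), mean = mean(obs), sd = sd(obs)),
  simplify = FALSE
)
panels <- c(list(obs), sims)

# Draw the 3 x 3 grid: observed data first, then the simulations.
old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
for (i in seq_along(panels)) {
  qqnorm(panels[[i]], main = if (i == 1) "Observed" else paste("Simulated", i - 1))
  qqline(panels[[i]])
}
par(old_par)
```

If the observed panel looks no more curved than the simulated ones, the normal model is plausible; here the skewed sample should stand out.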
Yahoo! News Race and Justice poll results
Description
Results from a Yahoo! News poll conducted by YouGov on May 29-31, 2020. In total 1060 U.S. adults were asked a series of questions regarding race and justice in the wake of the killing of George Floyd by a police officer. Results in this dataset are percentages for the question, "Do you think Blacks and Whites receive equal treatment from the police?" For this particular question there were 1059 respondents.
Usage
race_justice
Format
A data frame with 1,059 rows and 2 variables.
- race_eth
 Race/ethnicity of respondent, with levels White, Black, Hispanic, and Other.
- response
 Response to the question "Do you think Black and White people receive equal treatment from the police?", with levels Yes, No, and Not sure.
Source
Yahoo! News Race and Justice - May 31, 2020.
Examples
library(ggplot2)
library(dplyr)
# Conditional probabilities of response for each race/ethnicity
race_justice |>
  count(race_eth, response) |>
  group_by(race_eth) |>
  mutate(prop = n / sum(n))
# Stacked bar plot of counts
ggplot(race_justice, aes(x = race_eth, fill = response)) +
  geom_bar() +
  labs(
    x = "Race / ethnicity",
    y = "Count",
    title = "Do you think Black and White people receive
equal treatment from the police?",
    fill = "Response"
  )
# Stacked bar plot of proportions
ggplot(race_justice, aes(x = race_eth, fill = response)) +
  geom_bar(position = "fill") +
  labs(
    x = "Race / ethnicity",
    y = "Proportion",
    title = "Do you think Black and White people receive
equal treatment from the police?",
    fill = "Response"
  )
Reddit Survey on Financial Independence.
Description
A reduced set of the official results of the 2020 FI Survey from Reddit (r/financialindependence). Only responses that represent the respondent (not other contributors in the household) are listed. Does not include retired individuals. As instructed, respondents gave dollar values in their native currency.
Usage
reddit_finance
Format
A data frame with 1998 rows and 65 variables.
- num_incomes
 How many individuals contribute to your household income?
- pan_inc_chg
 As a result of the pandemic, did your earned income increase, decrease, or remain the same?
- pan_inc_chg_pct
 By how much did your earned income change?
- pan_exp_chg
 As a result of the pandemic, did your expenses increase, decrease, or remain the same?
- pan_exp_chg_pct
 By how much did your expenses change?
- pan_fi_chg
 As a result of the pandemic, did your FI (financially independent) number...
- pan_ret_date_chg
 As a result of the pandemic, did your planned RE (retirement) date...
- pan_financial_impact
 Overall, how would you characterize the pandemic's impact on your finances?
- political
 With which political party do you most closely identify? You do not need to be registered with a party to select it, answer based on your personal views.
- race_eth
 What is your race/ethnicity? Select all that apply.
- gender
 What is your gender?
- age
 What is your age?
- edu
 What is the highest level of education you have completed?
- rel_status
 What is your relationship status?
- children
 Do you have children?
- country
 What country are you in?
- fin_indy
 Are you financially independent? Meaning you do not need to work for money, regardless of whether you work for money.
- fin_indy_num
 At what amount invested will you consider yourself Financially Independent? (What is your FI number?)
- fin_indy_pct
 What percent FI are you? (What percent of your FI number do you currently have?)
- retire_invst_num
 At what amount invested do you intend to retire? (What is your RE number)
- tgt_sf_wthdrw_rt
 What is your target safe withdrawal rate? (If your answer is 3.5%, enter it as 3.5)
- max_retire_sup
 How much annual income do you expect to have from the sources you selected in question T5 at the point where you are utilizing all of them (or a majority if you do not intend to use all at the same time)? Enter your answer as a dollar amount.
- retire_exp
 How much money (from your savings and other sources) do you intend to spend each year once you are retired? Enter your answer as a dollar amount.
- whn_fin_indy_num
 At what amount invested did you consider yourself Financially Independent? (AKA what was your "FI number")
- fin_indy_lvl
 Which of the following would you have considered yourself at the time you reached Financial Independence:
- retire_age
 At what age do you intend to retire?
- stp_whn_fin_indy
 Do you intend to stop working for money when you reach financial independence?
- industry
 Which of the following best describes the industry in which you currently or most recently work(ed)?
- employer
 Which of the following best describes your current or most recent employer?
- role
 Which of the following best describes your current or most recent job role?
- ft_status
 What is your current employment status? - Full Time
- pt_status
 What is your current employment status? - Part Time, Regular
- gig_status
 What is your current employment status? -Side Gig, Intermittent
- ne_status
 What is your current employment status? -Not Employed
- edu_status
 What is your current educational status?
- housing
 What is your current housing situation?
- home_value
 Primary residence value.
- brokerage_accts_tax
 Brokerage accounts (Taxable).
- retirement_accts_tax
 Retirement accounts (Tax Advantaged).
- cash
 Cash / cash equivalents (Savings, Checking, C.D.s, Money Market).
- invst_accts
 Dedicated Savings/Investment Accounts (Healthcare, Education).
- spec_crypto
 Speculation (Crypto, P2P Lending, Gold, etc.).
- invst_prop_bus_own
 investment properties / owned business(es).
- other_val
 Other assets.
- student_loans
 Outstanding student loans.
- mortgage
 Outstanding mortgage / HELOC.
- auto_loan
 Outstanding auto loans.
- credit_personal_loan
 Outstanding credit cards / personal loans.
- medical_debt
 Outstanding medical debt.
- invst_prop_bus_own_debt
 Debt from investment properties / owned business.
- other_debt
 Debt from other sources.
- 2020_gross_inc
 What was your 2020 gross (pre-tax, pre-deductions) annual household income?
- 2020_housing_exp
 Housing expenses(rent, mortgage, insurance, taxes, upkeep).
- 2020_utilities_exp
 Utilities expenses(phone, internet, gas, electric, water, sewer).
- 2020_transp_exp
 Transportation expenses(car payment, bus / subway tickets, gas, insurance, maintenance).
- 2020_necessities_exp
 Necessities expenses(Groceries, Clothing, Personal Care, Household Supplies).
- 2020_lux_exp
 Luxury expenses (Restaurants/Dining, Entertainment, Hobbies, Travel, Pets, Gifts).
- 2020_child_exp
 Children expenses(child care, soccer team, etc.).
- 2020_debt_repay
 Debt repayment (excluding mortgage/auto).
- 2020_invst_save
 Investments / savings.
- 2020_charity
 Charity / Tithing.
- 2020_healthcare_exp
 Healthcare expenses(direct costs, co-pays, insurance you pay).
- 2020_taxes
 Taxes (the sum of all taxes paid, including amounts deducted from paychecks).
- 2020_edu_exp
 Education expenses.
- 2020_other_exp
 Other expenses.
Source
Reddit Official 2020 FI Survey Results, https://www.reddit.com/r/financialindependence/comments/m1q8ia/official_2020_fi_survey_results.
Examples
library(ggplot2)
# Bar plot of Expected Retirement Age (age brackets).
ggplot(reddit_finance, aes(retire_age)) +
  geom_bar(na.rm = TRUE) +
  labs(
    title = "At what age do you expect to retire?",
    x = "Age Bracket",
    y = "Number of Respondents"
  )
# Histogram of Dollar Amount at Which FI was reached.
ggplot(reddit_finance, aes(whn_fin_indy_num)) +
  geom_histogram(na.rm = TRUE, bins = 20) +
  labs(
    title = "At what amount invested did you consider\nyourself Financially Independent?",
    x = "Dollar Amount (in local currency)",
    y = "Number of Respondents"
  )
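Several columns listed above have names that start with a digit (for example 2020_gross_inc). Such names are not syntactic in R, so they need backticks or string indexing. The sketch below uses a small hypothetical data frame `toy` with made-up values rather than reddit_finance itself; the derived column `housing_share` is also an assumption for illustration.

```r
# Hypothetical toy data that mimics two reddit_finance column names.
toy <- data.frame(
  `2020_gross_inc` = c(85000, 120000, 64000),
  `2020_housing_exp` = c(18000, 30000, 14400),
  check.names = FALSE # keep the names exactly as written
)

# Backticks are required because the names start with a digit.
toy$housing_share <- toy$`2020_housing_exp` / toy$`2020_gross_inc`

# String indexing works without backticks.
toy[["2020_gross_inc"]]
toy
```

The same backtick quoting applies inside aes() or dplyr verbs when referring to these columns.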
Simulated data for regression
Description
Simulated data for regression
Usage
res_demo_1
Format
A data frame with 100 observations on the following 3 variables.
- x
 a numeric vector
- y_lin
 a numeric vector
- y_fan_back
 a numeric vector
Examples
res_demo_1
Simulated data for regression
Description
Simulated data for regression
Usage
res_demo_2
Format
A data frame with 300 observations on the following 3 variables.
- x
 a numeric vector
- y_fan
 a numeric vector
- y_log
 a numeric vector
Examples
res_demo_2
Which resume attributes drive job callbacks?
Description
This experimental data comes from a study that sought to understand the influence of race and gender on job application callback rates. The study monitored job postings in Boston and Chicago for several months during 2001 and 2002 and used this to build up a set of test cases. Over this time period, the researchers randomly generated the attributes of each resume sent to a job posting, such as years of experience and education details, to create a realistic-looking resume. They then randomly assigned a name to the resume that would communicate the applicant's gender and race. The first names chosen for the study were selected so that the names would predominantly be recognized as belonging to black or white individuals. For example, Lakisha was a name that their survey indicated would be interpreted as a black woman, while Greg was a name that would generally be interpreted to be associated with a white male.
Usage
resume
Format
A data frame with 4870 observations, representing 4870 resumes, over
30 different variables that describe the job details, the outcome
(received_callback), and attributes of the resume. 
- job_ad_id
 Unique ID associated with the advertisement.
- job_city
 City where the job was located.
- job_industry
 Industry of the job.
- job_type
 Type of role.
- job_fed_contractor
 Indicator for if the employer is a federal contractor.
- job_equal_opp_employer
 Indicator for if the employer is an Equal Opportunity Employer.
- job_ownership
 The type of company, e.g. a nonprofit or a private company.
- job_req_any
 Indicator for if any job requirements are listed. If so, the other job_req_* fields give more detail.
- job_req_communication
 Indicator for if communication skills are required.
- job_req_education
 Indicator for if some level of education is required.
- job_req_min_experience
 Amount of experience required.
- job_req_computer
 Indicator for if computer skills are required.
- job_req_organization
 Indicator for if organization skills are required.
- job_req_school
 Level of education required.
- received_callback
 Indicator for if there was a callback from the job posting for the person listed on this resume.
- firstname
 The first name used on the resume.
- race
 Inferred race associated with the first name on the resume.
- gender
 Inferred gender associated with the first name on the resume.
- years_college
 Years of college education listed on the resume.
- college_degree
 Indicator for if the resume listed a college degree.
- honors
 Indicator for if the resume listed that the candidate has been awarded some honors.
- worked_during_school
 Indicator for if the resume listed working while in school.
- years_experience
 Years of experience listed on the resume.
- computer_skills
 Indicator for if computer skills were listed on the resume. These skills were adapted for listings, though the skills were assigned independently of other details on the resume.
- special_skills
 Indicator for if any special skills were listed on the resume.
- volunteer
 Indicator for if volunteering was listed on the resume.
- military
 Indicator for if military experience was listed on the resume.
- employment_holes
 Indicator for if there were holes in the person's employment history.
- has_email_address
 Indicator for if the resume lists an email address.
- resume_quality
 Each resume was generally classified as either lower or higher quality.
Details
Because this is an experiment, where the race and gender attributes are being randomly assigned to the resumes, we can conclude that any statistically significant difference in callback rates is causally linked to these attributes.
Do you think it's reasonable to make a causal conclusion? You may have some healthy skepticism. However, do take care to appreciate that this was an experiment: the first name (and so the inferred race and gender) was randomly assigned to the resumes, and the quality and attributes of a resume were assigned independently of the race and gender. This means that any effects we observe are in fact causal, and the effects related to race are both statistically significant and very large: white applicants had about a 50\
Do you still have doubts lingering in the back of your mind about the validity of this study? Maybe a counterargument about why the standard conclusions from this study may not apply? The article summarizing the results was exceptionally well-written, and it addresses many potential concerns about the study's approach. So if you're feeling skeptical about the conclusions, please find the link below and explore!
Source
Bertrand M, Mullainathan S. 2004. "Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination". The American Economic Review 94:4 (991-1013). doi:10.3386/w9873.
Examples
head(resume, 5)
# Some checks to confirm balance between race and
# other attributes of a resume. There should be
# some minor differences due to randomness, but
# each variable should be (and is) generally
# well-balanced.
table(resume$race, resume$years_college)
table(resume$race, resume$college_degree)
table(resume$race, resume$honors)
table(resume$race, resume$worked_during_school)
table(resume$race, resume$years_experience)
table(resume$race, resume$computer_skills)
table(resume$race, resume$special_skills)
table(resume$race, resume$volunteer)
table(resume$race, resume$military)
table(resume$race, resume$employment_holes)
table(resume$race, resume$has_email_address)
table(resume$race, resume$resume_quality)
# Regarding the callback outcome for race,
# we observe a very large difference.
tapply(
  resume$received_callback,
  resume[c("race", "gender")],
  mean
)
# Natural question: is this statistically significant?
# A proper analysis would take into account the
# paired nature of the data. For each ad, let's
# compute the following statistic:
#     <callback rate for white candidates>
#     - <callback rate for black candidates>
# First construct the callbacks for white and
# black candidates by ad ID:
table(resume$race)
cb_white <- with(
  subset(resume, race == "white"),
  tapply(received_callback, job_ad_id, mean)
)
cb_black <- with(
  subset(resume, race == "black"),
  tapply(received_callback, job_ad_id, mean)
)
# Next, compute the differences, where the
# names(cb_white) part ensures we matched up the
# job ad IDs.
diff <- cb_white - cb_black[names(cb_white)]
# Finally, we can apply a t-test on the differences:
t.test(diff)
# There is very strong evidence of an effect.
# Here's a similar check with gender. There are
# more female-inferred candidates used on the resumes.
table(resume$gender)
cb_male <- with(
  subset(resume, gender == "m"),
  tapply(received_callback, job_ad_id, mean)
)
cb_female <- with(
  subset(resume, gender == "f"),
  tapply(received_callback, job_ad_id, mean)
)
diff <- cb_female - cb_male[names(cb_female)]
# t.test() drops missing differences, which limits the
# test to jobs where both a male and female resume were sent.
t.test(diff)
# There is no statistically significant difference.
# Was that the best analysis? Absolutely not!
# However, the analysis was unbiased. To get more
# precision on the estimates, we could build a
# multivariate model that includes many characteristics
# of the resumes sent, e.g. years of experience.
# Since those other characteristics were assigned
# independently of the race characteristics, this
# means the race finding will almost certainly
# hold. However, it is possible that we'll find
# more interesting results with the gender investigation.
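The matching-by-ad step in the examples above can be seen on a tiny scale. The sketch below uses a hypothetical seven-row data frame `toy` (made-up values, not the resume data) in which one ad received only a white-inferred resume, so its difference is missing.

```r
# Hypothetical toy data; ad 4 has no black-inferred resume.
toy <- data.frame(
  job_ad_id = c(1, 1, 2, 2, 3, 3, 4),
  race = c("white", "black", "white", "black", "white", "black", "white"),
  received_callback = c(1, 0, 0, 0, 1, 1, 1)
)

# Callback rate per ad, separately for each inferred race.
cb_white <- with(
  subset(toy, race == "white"),
  tapply(received_callback, job_ad_id, mean)
)
cb_black <- with(
  subset(toy, race == "black"),
  tapply(received_callback, job_ad_id, mean)
)

# Indexing by names(cb_white) aligns the ads; unmatched ads become NA.
d <- cb_white - cb_black[names(cb_white)]
d

# t.test() ignores the NA, so only the three complete pairs are used.
res <- t.test(d)
res$parameter # degrees of freedom = number of complete pairs - 1
```

With so few pairs the test itself is not meaningful; the point is only how the per-ad differences are aligned and how missing pairs are handled.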
Sample Responses to Two Public Health Questions
Description
Public health has improved and evolved, but has the public's knowledge changed with it? This dataset explores sample responses for two survey questions posed by Hans Rosling during lectures to a wide array of well-educated audiences.
Usage
rosling_responses
Format
A data frame with 278 rows and 3 variables:
- question
 ID for the question being posed.
- response
 Noting whether the response was correct or incorrect.
- prob_random_correct
 The probability the person would have guessed the answer correctly if they were guessing completely randomly.
Source
The samples we describe are plausible based on the exact rates observed in larger samples. For more info on the actual rates observed, visit https://www.gapminder.org.
Another relevant reference is a book by Hans Rosling, Anna Rosling Ronnlund, and Ola Rosling called Factfulness.
Examples
frac_correct <- tapply(
  rosling_responses$response == "correct",
  rosling_responses$question,
  mean
)
frac_correct
n <- table(rosling_responses$question)
n
expected <- tapply(
  rosling_responses$prob_random_correct,
  rosling_responses$question,
  mean
)
# Construct confidence intervals.
se <- sqrt(frac_correct * (1 - frac_correct) / n)
# Lower bounds.
frac_correct - 1.96 * se
# Upper bounds.
frac_correct + 1.96 * se
# Construct Z-scores and one-sided (lower-tail) p-values,
# using a t-distribution with n - 1 degrees of freedom.
z <- (frac_correct - expected) / se
pt(z, df = n - 1)
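As a point of comparison for the calculation above, a proportion can also be tested against random guessing with an exact binomial test or a normal approximation. The sketch below is one possible alternative, not the package's method; the counts are hypothetical and not taken from rosling_responses.

```r
# Hypothetical counts, for illustration only.
n_asked <- 200    # number of respondents
n_correct <- 30   # number of correct answers
p_guess <- 1 / 3  # chance of a correct random guess among three options

# Exact one-sided test: is the correct rate below the guessing rate?
res <- binom.test(n_correct, n_asked, p = p_guess, alternative = "less")
res$p.value

# Normal approximation, using the standard error under the null.
se_null <- sqrt(p_guess * (1 - p_guess) / n_asked)
z <- (n_correct / n_asked - p_guess) / se_null
pnorm(z)
```

Using the null value p_guess in the standard error is a common choice for hypothesis tests, whereas the observed proportion is the usual choice for confidence intervals.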
Russians' Opinions on US Election Influence in 2016
Description
Survey of Russian citizens on whether they believed their government tried to influence the 2016 US election. The survey was taken in Spring 2018 by Pew Research.
Usage
russian_influence_on_us_election_2016
Format
A data frame with 506 observations on the following variable.
- influence_2016
 Response of the Russian survey participant to the question of whether their government tried to influence the 2016 election in the United States.
Details
The actual sample size was 1000. However, the original data were not from a simple random sample; after accounting for the design, the equivalent sample size was 506, which was what was used for the dataset here to keep things simpler for intro stat analyses.
Examples
table(russian_influence_on_us_election_2016)
Sustainability and Economic Indicators for South Africa.
Description
Yearly data for South Africa on access to electricity, GDP, GNI, CO2 emissions, and business startup costs.
Usage
sa_gdp_elec
Format
A data frame with 16 rows and 7 variables.
- year
 Year data collected.
- access_elec
 Access to electricity as a percentage of the population.
- startup
 Cost of business startup procedures as a percent of GNI.
- co2
 CO2 emission in kt (kiloton).
- gdp
 GDP per capita, PPP in constant 2017 international dollars.
- gni
 GNI per capita, PPP in constant 2017 international dollars.
- co2_kg_ppp
 CO2 emissions in kg per 2017 PPP dollar of GDP.
Examples
library(ggplot2)
ggplot(sa_gdp_elec, aes(year, access_elec)) +
  geom_point(alpha = 0.3) +
  labs(
    x = "Year",
    y = "Percent of Population",
    title = "Access to Electricity in South Africa 2003 - 2018"
  )
Salinity in Bimini Lagoon, Bahamas
Description
Data collected at three different water masses in the Bimini Lagoon, Bahamas.
Usage
salinity
Format
A data frame with 30 rows and 2 variables.
- site_number
 Location where measurements were taken.
- salinity_ppt
 Salinity value in parts per thousand.
Source
Till, R. (1974) Statistical Methods for the Earth Scientist: An Introduction. London: Macmillan, 104.
Examples
library(ggplot2)
library(broom)
ggplot(salinity, aes(x = salinity_ppt)) +
  geom_dotplot() +
  facet_wrap(~site_number, ncol = 1)
tidy(aov(salinity_ppt ~ site_number, data = salinity))
Simulated data for SAT score improvement
Description
Fake data for score improvements from students who took a course from an SAT score improvement company.
Usage
sat_improve
Format
A data frame with 30 observations on the following variable.
- sat_improve
 a numeric vector
Examples
sat_improve
SAT and GPA data
Description
SAT and GPA data for 1000 students at an unnamed college.
Usage
satgpa
Format
A data frame with 1000 observations on the following 6 variables.
- sex
 Gender of the student.
- sat_v
 Verbal SAT percentile.
- sat_m
 Math SAT percentile.
- sat_sum
 Total of verbal and math SAT percentiles.
- hs_gpa
 High school grade point average.
- fy_gpa
 First year (college) grade point average.
Source
Educational Testing Service originally collected the data.
References
https://chance.dartmouth.edu/course/Syllabi/Princeton96/ETSValidation.html
Examples
library(ggplot2)
library(broom)
# Verbal scores
ggplot(satgpa, aes(x = sat_v, y = fy_gpa)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    x = "Verbal SAT percentile",
    y = "First year (college) grade point average"
  )
mod <- lm(fy_gpa ~ sat_v, data = satgpa)
tidy(mod)
# Math scores
ggplot(satgpa, aes(x = sat_m, y = fy_gpa)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    x = "Math SAT percentile",
    y = "First year (college) grade point average"
  )
mod <- lm(fy_gpa ~ sat_m, data = satgpa)
tidy(mod)
Color scale constructor for OpenIntro IMS colors
Description
Color scale constructor for OpenIntro IMS colors
Usage
scale_color_openintro(palette = "main", discrete = TRUE, reverse = FALSE, ...)
Arguments
palette | 
 Character name of palette in openintro_palettes  | 
discrete | 
 Boolean indicating whether color aesthetic is discrete or not  | 
reverse | 
 Boolean indicating whether the palette should be reversed  | 
... | 
 Additional arguments passed to   | 
Examples
library(ggplot2)
# Categorical variable with three levels
ggplot(evals, aes(
  x = bty_avg, y = score,
  color = rank, shape = rank
)) +
  geom_jitter(size = 2, alpha = 0.6) +
  scale_color_openintro("three")
# Categorical variable with two levels
ggplot(evals, aes(
  x = bty_avg, y = score,
  color = language, shape = language
)) +
  geom_jitter(size = 2, alpha = 0.6) +
  scale_color_openintro("two")
# Continuous variable
# Generates a palette, but not recommended
ggplot(evals, aes(
  x = bty_avg, y = score,
  color = score
)) +
  geom_jitter(size = 2, alpha = 0.8) +
  scale_color_openintro(discrete = FALSE)
# For continuous palettes
# use scale_color_gradient instead
ggplot(evals, aes(
  x = bty_avg, y = score,
  color = score
)) +
  geom_jitter(size = 2) +
  scale_color_gradient(low = IMSCOL["blue", "full"], high = IMSCOL["blue", "f6"])
ggplot(evals, aes(
  x = bty_avg, y = score,
  color = cls_perc_eval
)) +
  geom_jitter(size = 2) +
  scale_color_gradient(low = COL["red", "full"], high = COL["red", "f8"])
Fill scale constructor for OpenIntro IMS colors
Description
Fill scale constructor for OpenIntro IMS colors
Usage
scale_fill_openintro(palette = "main", discrete = TRUE, reverse = FALSE, ...)
Arguments
palette | 
 Character name of palette in openintro_palettes  | 
discrete | 
 Boolean indicating whether color aesthetic is discrete or not  | 
reverse | 
 Boolean indicating whether the palette should be reversed  | 
... | 
 Additional arguments passed to   | 
Examples
library(ggplot2)
library(dplyr)
# Categorical variable with two levels
ggplot(evals, aes(x = ethnicity, fill = ethnicity)) +
  geom_bar() +
  scale_fill_openintro("two")
# Categorical variable with three levels
ggplot(evals, aes(x = rank, fill = rank)) +
  geom_bar() +
  scale_fill_openintro("three")
# Categorical (ordered) variable with many levels
# Generates a palette, but may not be the best palette
# in terms of color-blind and grayscale friendliness
ggplot(diamonds, aes(x = clarity, fill = clarity)) +
  geom_bar() +
  scale_fill_openintro()
# For continuous palettes
# use scale_color_gradient instead
ggplot(evals, aes(
  x = bty_avg, y = score,
  color = score
)) +
  geom_jitter(size = 2) +
  scale_color_gradient(low = IMSCOL["blue", "full"], high = IMSCOL["blue", "f6"])
ggplot(evals, aes(
  x = bty_avg, y = score,
  color = cls_perc_eval
)) +
  geom_jitter(size = 2) +
  scale_color_gradient(low = IMSCOL["green", "full"], high = IMSCOL["green", "f6"])
Public Opinion with SCOTUS ruling on American Healthcare Act
Description
On June 28, 2012 the U.S. Supreme Court upheld the much debated 2010 healthcare law, declaring it constitutional. A Gallup poll released the day after this decision indicates that 46% of 1,012 Americans agree with this decision.
Usage
scotus_healthcare
Format
A data frame with 1012 observations on the following variable.
- response
 Response values reported are agree and other.
Source
Gallup, Americans Issue Split Decision on Healthcare Ruling, retrieved 2012-06-28.
Examples
table(scotus_healthcare)
Names of pets in Seattle
Description
Names of registered pets in Seattle, WA, between 2003 and 2018, provided by the city's Open Data Portal.
Usage
seattlepets
Format
A data frame with 52,519 rows and 7 variables:
- license_issue_date
 Date the animal was registered with Seattle
- license_number
 Unique license number
- animal_name
 Animal's name
- species
 Animal's species (dog, cat, goat, etc.)
- primary_breed
 Primary breed of the animal
- secondary_breed
 Secondary breed if mixed
- zip_code
 Zip code animal is registered in
Source
These data come from Seattle's Open Data Portal, https://data.seattle.gov/Community/Seattle-Pet-Licenses/jguv-t9rb
Bank manager recommendations based on sex
Description
Study from the 1970s about whether sex influences promotion recommendations.
Usage
sex_discrimination
Format
A data frame with 48 observations on the following 2 variables.
- sex
 a factor with levels female and male
- decision
 a factor with levels not promoted and promoted
Source
Rosen B and Jerdee T. 1974. Influence of sex role stereotypes on personnel decisions. Journal of Applied Psychology 59(1):9-14.
Examples
library(ggplot2)
table(sex_discrimination)
ggplot(sex_discrimination, aes(y = sex, fill = decision)) +
  geom_bar(position = "fill")
Simpson's Paradox: Covid
Description
A dataset on Delta Variant Covid-19 cases in the UK. This dataset gives a great example of Simpson's Paradox. When aggregating results without regard to age group, the death rate for vaccinated individuals is higher – but they have a much higher risk population. Once we look at populations with more comparable risks (breakout age groups), we see that the vaccinated group tends to be lower risk in each risk-bucketed group and that many of the higher risk patients had gotten vaccinated. The dataset was brought to OpenIntro's attention by Matthew T. Brenneman of Embry-Riddle Aeronautical University. Note: some totals in the original source differ as there were some cases that did not have ages associated with them.
Usage
simpsons_paradox_covid
Format
A data frame with 286,166 rows and 3 variables:
- age_group
 Age of the person. Levels: under 50, 50 +.
- vaccine_status
 Vaccination status of the person. Note: the vaccinated group includes those who were only partially vaccinated. Levels: vaccinated, unvaccinated.
- outcome
 Did the person die from the Delta variant? Levels: death and survived.
Source
Public Health England: Technical briefing 20
Examples
library(dplyr)
library(scales)
# Calculate the mortality rate for all cases by vaccination status
simpsons_paradox_covid |>
  group_by(vaccine_status, outcome) |>
  summarize(count = n()) |>
  ungroup() |>
  group_by(vaccine_status) |>
  mutate(total = sum(count)) |>
  filter(outcome == "death") |>
  select(c(vaccine_status, count, total)) |>
  mutate(mortality_rate = label_percent(accuracy = 0.01)(round(count / total, 4))) |>
  select(-c(count, total))
# Calculate mortality rate by age group and vaccination status
simpsons_paradox_covid |>
  group_by(age_group, vaccine_status, outcome) |>
  summarize(count = n()) |>
  ungroup() |>
  group_by(age_group, vaccine_status) |>
  mutate(total = sum(count)) |>
  filter(outcome == "death") |>
  select(c(age_group, vaccine_status, count, total)) |>
  mutate(mortality_rate = label_percent(accuracy = 0.01)(round(count / total, 4))) |>
  select(-c(count, total))
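The reversal described in this entry can be reproduced with a few made-up counts. The sketch below is a base-R illustration with hypothetical numbers (not the Public Health England figures) chosen so that the vaccinated group has the lower death rate within each age group but the higher rate once age is ignored.

```r
# Hypothetical counts, chosen only to illustrate Simpson's Paradox.
toy <- data.frame(
  age_group = c("under 50", "under 50", "50 +", "50 +"),
  vaccine_status = c("vaccinated", "unvaccinated", "vaccinated", "unvaccinated"),
  deaths = c(10, 40, 400, 60),
  cases = c(10000, 20000, 20000, 1000)
)

# Mortality rate within each age group: vaccinated is lower in both.
toy$rate <- toy$deaths / toy$cases
toy

# Mortality rate ignoring age: vaccinated now looks higher, because
# most vaccinated cases fall in the higher-risk 50 + group.
agg <- aggregate(cbind(deaths, cases) ~ vaccine_status, data = toy, FUN = sum)
agg$rate <- agg$deaths / agg$cases
agg
```

The aggregate comparison mixes two groups with very different baseline risk, which is why the stratified rates are the more meaningful comparison.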
Simulated datasets, not necessarily drawn from a normal distribution.
Description
Data were simulated in R, and some of the simulations do not represent data from actual normal distributions.
Usage
simulated_dist
Format
A list with the following elements.
- d1
 Dataset of 100 observations.
- d2
 Dataset of 50 observations.
- d3
 Numeric dataset of 500 observations.
- d4
 Dataset of 15 observations.
- d5
 Numeric dataset of 25 observations.
- d6
 Dataset of 50 observations.
Examples
data(simulated_dist)
lapply(simulated_dist, qqnorm)
Simulated datasets, drawn from a normal distribution.
Description
Data were simulated using rnorm.
Usage
simulated_normal
Format
A list with the following 3 elements.
- n40
 40 observations from a standard normal distribution.
- n100
 100 observations from a standard normal distribution.
- n400
 400 observations from a standard normal distribution.
Examples
data(simulated_normal)
lapply(simulated_normal, qqnorm)
Simulated data for sample scatterplots
Description
Fake data.
Usage
simulated_scatter
Format
A data frame with 500 observations on the following 3 variables.
- group
 Group, representing data for a specific plot.
- x
 x-value.
- y
 y-value.
Examples
library(ggplot2)
ggplot(simulated_scatter, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~group)
Sinusitis and antibiotic experiment
Description
Researchers studying the effect of antibiotic treatment for acute sinusitis randomly assigned participants to one of two groups: treatment or control.
Usage
sinusitis
Format
A data frame with 166 observations on the following 2 variables.
- group
 a factor with levels control and treatment
- self_reported_improvement
 a factor with levels no and yes
Source
J.M. Garbutt et al. Amoxicillin for Acute Rhinosinusitis: A Randomized Controlled Trial. In: JAMA: The Journal of the American Medical Association 307.7 (2012), pp. 685-692.
Examples
sinusitis
Survey on sleep deprivation and transportation workers
Description
The National Sleep Foundation conducted a survey on the sleep habits of randomly sampled transportation workers and a control sample of non-transportation workers.
Usage
sleep_deprivation
Format
A data frame with 1087 observations on the following 2 variables.
- sleep
 a factor with levels <6, 6-8, and >8
- profession
 a factor with levels bus / taxi / limo drivers, control, pilots, train operators, truck drivers
Source
National Sleep Foundation, 2012 Sleep in America Poll: Transportation Workers' Sleep, 2012. https://www.sleepfoundation.org/professionals/sleep-americar-polls/2012-sleep-america-poll-transportation-workers-sleep
Examples
sleep_deprivation
Smallpox vaccine results
Description
A sample of 6,224 individuals from the year 1721 who were exposed to smallpox in Boston. Some of them had received a vaccine (inoculated) while others had not. Doctors at the time believed that inoculation, which involves exposing a person to the disease in a controlled form, could reduce the likelihood of death.
Usage
smallpox
Format
A data frame with 6224 observations on the following 2 variables.
- result
 Whether the person died or lived.
- inoculated
 Whether the person was inoculated.
Source
Fenner F. 1988. Smallpox and Its Eradication (History of International Public Health, No. 6). Geneva: World Health Organization. ISBN 92-4-156110-6.
Examples
data(smallpox)
table(smallpox)
UK Smoking Data
Description
Survey data on smoking habits from the UK. The dataset can be used for analyzing the demographic characteristics of smokers and types of tobacco consumed.
Usage
smoking
Format
A data frame with 1691 observations on the following 12 variables.
- gender
 Gender with levels Female and Male.
- age
 Age.
- marital_status
 Marital status with levels Divorced, Married, Separated, Single and Widowed.
- highest_qualification
 Highest education level with levels A Levels, Degree, GCSE/CSE, GCSE/O Level, Higher/Sub Degree, No Qualification, ONC/BTEC and Other/Sub Degree.
- nationality
 Nationality with levels British, English, Irish, Scottish, Welsh, Other, Refused and Unknown.
- ethnicity
 Ethnicity with levels Asian, Black, Chinese, Mixed, White, Refused and Unknown.
- gross_income
 Gross income with levels Under 2,600; 2,600 to 5,200; 5,200 to 10,400; 10,400 to 15,600; 15,600 to 20,800; 20,800 to 28,600; 28,600 to 36,400; Above 36,400; Refused; and Unknown.
- region
 Region with levels London, Midlands & East Anglia, Scotland, South East, South West, The North and Wales.
- smoke
 Smoking status with levels No and Yes.
- amt_weekends
 Number of cigarettes smoked per day on weekends.
- amt_weekdays
 Number of cigarettes smoked per day on weekdays.
- type
 Type of cigarettes smoked with levels Packets, Hand-Rolled, Both/Mainly Packets and Both/Mainly Hand-Rolled.
Source
National STEM Centre, Large Datasets from stats4schools, https://www.stem.org.uk/resources/elibrary/resource/28452/large-datasets-stats4schools.
Examples
library(ggplot2)
ggplot(smoking, aes(x = amt_weekends)) +
  geom_histogram(binwidth = 5)
ggplot(smoking, aes(x = amt_weekdays)) +
  geom_histogram(binwidth = 5)
ggplot(smoking, aes(x = gender, fill = smoke)) +
  geom_bar(position = "fill")
ggplot(smoking, aes(x = marital_status, fill = smoke)) +
  geom_bar(position = "fill")
Snowfall at Paradise, Mt. Rainier National Park
Description
Annual snowfall data for Paradise, Mt. Rainier National Park. To include a full winter season, snowfall is recorded from July 1 to June 30. Data from 1943-1946 are not available due to road closure during World War II. Records are also unavailable for 1948-1954.
Usage
snowfall
Format
A data frame with 100 rows and 3 variables.
- year_start
 The year snowfall measurement began on July 1.
- year_end
 The year snowfall measurement ended on June 30.
- total_snow
 Snowfall measured in inches.
Source
Examples
library(ggplot2)
ggplot(snowfall, aes(x = total_snow)) +
  geom_histogram(binwidth = 50) +
  labs(
    title = "Annual Snowfall",
    subtitle = "Paradise, Mt. Rainier National Park",
    x = "Snowfall (in.)",
    y = "Number of Years",
    caption = "Source: National Parks Services"
  )
ggplot(snowfall, aes(x = year_start, y = total_snow, group = 1)) +
  geom_line() +
  labs(
    title = "Annual Snowfall",
    subtitle = "Paradise, Mt. Rainier National Park",
    y = "Snowfall (in.)",
    x = "Year",
    caption = "Source: National Parks Services"
  )
Social experiment
Description
A "social experiment" conducted by a TV program questioned what people do when they see a very obviously bruised woman getting picked on by her boyfriend. On two different occasions at the same restaurant, the same couple was depicted. In one scenario the woman was dressed "provocatively" and in the other scenario the woman was dressed "conservatively". The table below shows how many restaurant diners were present under each scenario, and whether or not they intervened.
Usage
socialexp
Format
A data frame with 45 observations on the following 2 variables.
- intervene
 Whether other diners intervened or not.
- scenario
 How the woman was dressed.
Examples
table(socialexp)
soda
Description
A randomly generated dataset of soda preference (cola or orange) based on location.
Usage
soda
Format
A data frame with 60 observations on the following 2 variables.
- drink
 Soda preference, cola or orange.
- location
 Is the person from the West coast or East coast?
Examples
library(dplyr)
soda |>
  count(location, drink)
Energy Output From Two Solar Arrays in San Francisco
Description
The data provide the energy output for several months from two roof-top solar arrays in San Francisco. This city is known for having highly variable weather, so while these two arrays are only about 1 mile apart from each other, the Inner Sunset location tends to have more fog.
Usage
solar
Format
A data frame with 284 observations on the following 3 variables. Each row represents a single day for one of the arrays.
- location
 Location for the array.
- date
 Date.
- kwh
 Number of kWh
Details
The Haight-Ashbury array is a 10.4 kWh array, while the Inner Sunset array is a 2.8 kWh array. The unit kWh stands for kilowatt-hours, the unit of energy typically used for electricity bills. The cost per kWh in San Francisco was about $0.25 in 2016.
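The conversion from energy output to an approximate dollar value can be sketched as follows. The daily outputs below are hypothetical values made up for illustration (they are not taken from solar), and the rate is the approximate 2016 figure quoted above.

```r
# Hypothetical daily outputs in kWh (illustrative only, not from `solar`)
kwh <- c(12.4, 30.1, 25.7, 8.9)

# Approximate 2016 San Francisco price per kWh, as quoted in Details
price_per_kwh <- 0.25

# Approximate dollar value of each day's output and of the whole period
value <- kwh * price_per_kwh
total_value <- sum(value)
total_value
```

The same multiplication could be applied to the kwh column of solar, for example within each location, to compare the two arrays in dollar terms.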
Source
These data were provided by Larry Rosenfeld, a resident in San Francisco.
Examples
solar.is <- subset(solar, location == "Inner_Sunset")
solar.ha <- subset(solar, location == "Haight_Ashbury")
plot(solar.is$date, solar.is$kwh, type = "l", ylim = c(0, max(solar$kwh)))
lines(solar.ha$date, solar.ha$kwh, col = 4)
d <- merge(solar.ha, solar.is, by = "date")
plot(d$date, d$kwh.x / d$kwh.y, type = "l")
SOWC Child Mortality Data.
Description
Child mortality data from UNICEF's State of the World's Children 2019 Statistical Tables.
Usage
sowc_child_mortality
Format
A data frame with 195 rows and 19 variables.
- countries_and_areas
 Country or area name.
- under5_mortality_1990
 Under-5 mortality rate (deaths per 1,000 live births) in 1990.
- under5_mortality_2000
 Under-5 mortality rate (deaths per 1,000 live births) in 2000.
- under5_mortality_2018
 Under-5 mortality rate (deaths per 1,000 live births) in 2018.
- under5_reduction
 Annual rate of reduction in under-5 mortality rate (%), 2000–2018.
- under5_mortality_2018_male
 Under-5 mortality rate male (deaths per 1,000 live births) 2018.
- under5_mortality_2018_female
 Under-5 mortality rate female (deaths per 1,000 live births) 2018.
- infant_mortality_1990
 Infant mortality rate (deaths per 1,000 live births) 1990
- infant_mortality_2018
 Infant mortality rate (deaths per 1,000 live births) 2018
- neonatal_mortality_1990
 Neonatal mortality rate (deaths per 1,000 live births) 1990.
- neonatal_mortality_2000
 Neonatal mortality rate (deaths per 1,000 live births) 2000.
- neonatal_mortality_2018
 Neonatal mortality rate (deaths per 1,000 live births) 2018.
- prob_dying_age5to14_1990
 Probability of dying among children aged 5–14 (deaths per 1,000 children aged 5) 1990.
- prob_dying_age5to14_2018
 Probability of dying among children aged 5–14 (deaths per 1,000 children aged 5) 2018.
- under5_deaths_2018
 Annual number of under-5 deaths (thousands) 2018.
- neonatal_deaths_2018
 Annual number of neonatal deaths (thousands) 2018.
- neonatal_deaths_percent_under5
 Neonatal deaths as proportion of all under-5 deaths (%) 2018.
- age5to14_deaths_2018
 Number of deaths among children aged 5–14 (thousands) 2018.
Source
United Nations Children's Emergency Fund (UNICEF)
Examples
library(dplyr)
library(ggplot2)
# List countries and areas whose children aged 5 to 14 have a higher probability of dying in
# 2018 than they did in 1990
sowc_child_mortality |>
  mutate(decrease_prob_dying = prob_dying_age5to14_1990 - prob_dying_age5to14_2018) |>
  select(countries_and_areas, decrease_prob_dying) |>
  filter(decrease_prob_dying < 0) |>
  arrange(decrease_prob_dying)
# List countries and areas and their relative rank for neonatal mortality in 2018
sowc_child_mortality |>
  mutate(rank = round(rank(-neonatal_mortality_2018))) |>
  select(countries_and_areas, rank, neonatal_mortality_2018) |>
  arrange(rank)
SOWC Demographics Data.
Description
Demographic data from UNICEF's State of the World's Children 2019 Statistical Tables.
Usage
sowc_demographics
Format
A data frame with 202 rows and 18 variables.
- countries_and_areas
 Country or area name.
- total_pop_2018
 Population in 2018 in thousands.
- under18_pop_2018
 Population under age 18 in 2018 in thousands.
- under5_pop_2018
 Population under age 5 in 2018 in thousands.
- pop_growth_rate_2018
 Rate at which population is growing in 2018.
- pop_growth_rate_2030
 Rate at which population is estimated to grow in 2030.
- births_2018
 Number of births in 2018 in thousands.
- fertility_2018
 Number of live births per woman in 2018. A total fertility level of 2.1 is called replacement level and represents a level at which the population would remain the same size.
- life_expectancy_1970
 Life expectancy at birth in 1970.
- life_expectancy_2000
 Life expectancy at birth in 2000.
- life_expectancy_2018
 Life expectancy at birth in 2018.
- dependency_ratio_total
 The ratio of the not-working-age population to the working-age population of 15 - 64 years.
- dependency_ratio_child
 The ratio of the under 15 population to the working-age population of 15 - 64 years.
- dependency_ratio_oldage
 The ratio of the over 64 population to the working-age population of 15 - 64 years.
- percent_urban_2018
 Percent of population living in urban areas.
- pop_urban_growth_rate_2018
 Annual urban population growth rate from 2000 to 2018.
- pop_urban_growth_rate_2030
 Estimated annual urban population growth rate from 2018 to 2030.
- migration_rate
 Net migration rate per 1000 population from 2015 to 2020.
Source
United Nations Children's Emergency Fund (UNICEF)
Examples
library(dplyr)
library(ggplot2)
# List countries and areas' life expectancy, ordered by rank of life expectancy in 2018
sowc_demographics |>
  mutate(life_expectancy_change = life_expectancy_2018 - life_expectancy_1970) |>
  mutate(rank_life_expectancy = round(rank(-life_expectancy_2018), 0)) |>
  select(
    countries_and_areas, rank_life_expectancy, life_expectancy_2018,
    life_expectancy_change
  ) |>
  arrange(rank_life_expectancy)
# List countries and areas' migration rate and population, ordered by rank of migration rate
sowc_demographics |>
  mutate(rank = round(rank(migration_rate))) |>
  mutate(population_millions = total_pop_2018 / 1000) |>
  select(countries_and_areas, rank, migration_rate, population_millions) |>
  arrange(rank)
# Scatterplot of life expectancy in 1970 v 2018, with points sized by 2018 population
ggplot(sowc_demographics, aes(life_expectancy_1970, life_expectancy_2018, size = total_pop_2018)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Life Expectancy",
    subtitle = "1970 v. 2018",
    x = "Life Expectancy in 1970",
    y = "Life Expectancy in 2018",
    size = "2018 Total Population"
  )
SOWC Maternal and Newborn Health Data.
Description
Data from UNICEF's State of the World's Children 2019 Statistical Tables.
Usage
sowc_maternal_newborn
Format
A data frame with 202 rows and 18 variables.
- countries_and_areas
 Country or area name.
- life_expectancy_female
 Life expectancy: female in 2018.
- family_planning_1549
 Demand for family planning satisfied with modern methods (%) 2013–2018 Women aged 15 to 49.
- family_planning_1519
 Demand for family planning satisfied with modern methods (%) 2013–2018 Women aged 15 to 19.
- adolescent_birth_rate
 Adolescent birth rate 2013 to 2018.
- births_age_18
 Births by age 18 (%) 2013 to 2018.
- antenatal_care_1
 Antenatal care (%) 2013 to 2018 At least one visit.
- antenatal_care_4_1549
 Antenatal care (%) 2013 to 2018 At least four visits Women aged 15 to 49.
- antenatal_care_4_1519
 Antenatal care (%) 2013 to 2018 At least four visits Women aged 15 to 19.
- delivery_care_attendant_1549
 Delivery care (%) 2013 to 2018 Skilled birth attendant Women aged 15 to 49.
- delivery_care_attendant_1519
 Delivery care (%) 2013 to 2018 Skilled birth attendant Women aged 15 to 19.
- delivery_care_institutional
 Delivery care (%) 2013 to 2018 Institutional delivery.
- c_section
 Delivery care (%) 2013–2018 C-section.
- postnatal_health_newborns
 Postnatal health check(%) 2013 to 2018 For newborns.
- postnatal_health_mothers
 Postnatal health check(%) 2013 to 2018 For mothers.
- maternal_deaths_2017
 Maternal mortality 2017 Number of maternal deaths.
- maternal_mortality_ratio_2017
 Maternal mortality 2017 Maternal Mortality Ratio.
- risk_maternal_death_2017
 Maternal mortality 2017 Lifetime risk of maternal death (1 in X).
Source
United Nations Children's Emergency Fund (UNICEF)
Examples
library(dplyr)
library(ggplot2)
# List countries and lifetime risk of maternal death (1 in X), ranked
sowc_maternal_newborn |>
  mutate(rank = round(rank(risk_maternal_death_2017), 0)) |>
  select(countries_and_areas, rank, risk_maternal_death_2017) |>
  arrange(rank)
# Graph scatterplot of Maternal Mortality Ratio 2017 and Antenatal Care 4+ Visits %
sowc_maternal_newborn |>
  select(antenatal_care_4_1549, maternal_mortality_ratio_2017) |>
  remove_missing(na.rm = TRUE) |>
  ggplot(aes(antenatal_care_4_1549, maternal_mortality_ratio_2017)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Antenatal Care and Mortality",
    x = "Antenatal Care 4+ visits %",
    y = "Maternal Mortality Ratio"
  )
Financial information for 50 S&P 500 companies
Description
Fifty companies were randomly sampled from the 500 companies in the S&P 500, and their financial information was collected on March 8, 2012.
Usage
sp500
Format
A data frame with 50 observations on the following 12 variables.
- market_cap
 Total value of all company shares, in millions of dollars.
- stock
 The name of the stock (e.g. AAPL for Apple).
- ent_value
 Enterprise value, which is an alternative to market cap that also accounts for things like cash and debt, in millions of dollars.
- trail_pe
 The market cap divided by the earnings (profits) over the last year.
- forward_pe
 The market cap divided by the forecasted earnings (profits) over the next year.
- ev_over_rev
 Enterprise value divided by the company's revenue.
- profit_margin
 Percent of earnings that are profits.
- revenue
 Revenue, in millions of dollars.
- growth
 Quarterly revenue growth (year over year), in millions of dollars.
- earn_before
 Earnings before interest, taxes, depreciation, and amortization, in millions of dollars.
- cash
 Total cash, in millions of dollars.
- debt
 Total debt, in millions of dollars.
Source
Yahoo! Finance, retrieved 2012-03-08.
Examples
library(ggplot2)
ggplot(sp500, aes(x = ent_value, y = earn_before)) +
  geom_point() +
  labs(x = "Enterprise value", y = "Earnings")
ggplot(sp500, aes(x = ev_over_rev, y = forward_pe)) +
  geom_point() +
  labs(
    x = "Enterprise value / revenue",
    y = "Market cap / forecasted earnings"
  )
ggplot(sp500, aes(x = ent_value, y = earn_before)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Enterprise value", y = "Earnings")
ggplot(sp500, aes(x = ev_over_rev, y = forward_pe)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    x = "Enterprise value / revenue, logged",
    y = "Market cap / forecasted earnings, logged"
  )
Daily observations for the S&P 500
Description
Data runs from 1950 to near the end of 2018.
Usage
sp500_1950_2018
Format
A data frame with 17346 observations on the following 7 variables.
- Date
 Date of the form "YYYY-MM-DD".
- Open
 Opening price.
- High
 Highest price of the day.
- Low
 Lowest price of the day.
- Close
 Closing price of the day.
- Adj.Close
 Adjusted price at close after accounting for dividends paid out.
- Volume
 Trading volume.
Source
Yahoo! Finance
Examples
data(sp500_1950_2018)
sp500.ten.years <- subset(
  sp500_1950_2018,
  "2009-01-01" <= as.Date(Date) & as.Date(Date) <= "2018-12-31"
)
d <- diff(sp500.ten.years$Adj.Close)
mean(d > 0)
S&P 500 stock data
Description
Daily stock returns from the S&P500 for 1990-2011 can be used to assess whether stock activity each day is independent of the stock's behavior on previous days. We label each day as Up (U) or Down (D) depending on whether the market was up or down that day. For example, consider the following changes in price, their new labels of up and down, and then the number of days that must be observed before each Up day.
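The labeling and counting described above can be sketched in base R. The price changes below are hypothetical, and the counting convention (the Up day itself is included in the count, and counts of seven or more are pooled as 7+) is an assumption chosen to match the levels listed under Format, not a confirmed description of how sp500_seq was built.

```r
# Hypothetical daily price changes (made up for illustration)
change <- c(2.52, -5.45, 1.03, -3.21, -0.42, 4.10, 0.77, -1.18, 2.94)

# Label each day as Up (U) or Down (D)
label <- ifelse(change > 0, "U", "D")

# Positions of the Up days, then the number of days observed
# until each Up day, counting from the day after the previous Up day
up_days <- which(label == "U")
wait <- diff(c(0, up_days))

# Pool long waits into a single "7+" category (assumed convention)
wait_binned <- factor(
  ifelse(wait >= 7, "7+", as.character(wait)),
  levels = c(as.character(1:6), "7+")
)
table(wait_binned)
```

If days were independent, these waiting times would be expected to follow a geometric distribution, which is one way the data can be used to assess independence.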
Usage
sp500_seq
Format
A data frame with 2948 observations on the following variable.
- race
 a factor with levels 1, 2, 3, 4, 5, 6, and 7+
Source
Examples
sp500_seq
Speed, gender, and height of 1325 students
Description
A survey of 1,325 UCLA students asked about their height, the fastest speed they have ever driven, and their gender.
Usage
speed_gender_height
Format
A data frame with 1325 observations on the following 3 variables.
- speed
 a numeric vector
- gender
 a factor with levels female and male
- height
 a numeric vector
Examples
speed_gender_height
SSD read and write speeds
Description
User-submitted data on 1 TB solid state drives (SSDs).
Usage
ssd_speed
Format
A data frame with 54 rows and 7 variables.
- brand
 Brand name of the drive.
- model
 Model name of the drive.
- samples
 Number of user submitted benchmarks.
- form_factor
 Physical form of the drive with levels 2.5, m.2, and mSATA.
- nvme
 1 if the drive uses the NVMe protocol, 0 if it does not.
- read
 Average read speed from user benchmarks in MB/s.
- write
 Average write speed from user benchmarks in MB/s.
Source
UserBenchmark, retrieved September 1, 2020.
Examples
library(ggplot2)
library(dplyr)
ssd_speed |>
  count(form_factor)
ssd_speed |>
  filter(form_factor != "mSATA") |>
  ggplot(aes(x = read, y = write, color = form_factor)) +
  geom_point() +
  labs(
    title = "Average read vs. write speed of SSDs",
    x = "Read speed (MB/s)",
    y = "Write speed (MB/s)"
  ) +
  facet_wrap(~form_factor, ncol = 1, scales = "free") +
  guides(color = "none")
Starbucks nutrition
Description
Nutrition facts for several Starbucks food items
Usage
starbucks
Format
A data frame with 77 observations on the following 7 variables.
- item
 Food item.
- calories
 Calories.
- fat
 a numeric vector
- carb
 a numeric vector
- fiber
 a numeric vector
- protein
 a numeric vector
- type
 a factor with levels bakery, bistro box, hot breakfast, parfait, petite, salad, and sandwich
Source
https://www.starbucks.com/menu, retrieved 2011-03-10.
Examples
starbucks
Final exam scores for twenty students
Description
Scores range from 57 to 94.
Usage
stats_scores
Format
A data frame with 20 observations on the following variable.
- scores
 a numeric vector
Examples
stats_scores
Embryonic stem cells to treat heart attack (in sheep)
Description
Does treatment using embryonic stem cells (ESCs) help improve heart function following a heart attack? Each sheep in the study was randomly assigned to the ESC or control group, and the change in their hearts' pumping capacity was measured in the study. A positive value corresponds to increased pumping capacity, which generally suggests a stronger recovery.
Usage
stem_cell
Format
A data frame with 18 observations on the following 3 variables.
- trmt
 a factor with levels ctrl and esc
- before
 a numeric vector
- after
 a numeric vector
Source
doi:10.1016/S0140-6736(05)67380-1
Examples
stem_cell
Stents for the treatment of stroke
Description
An experiment that studies effectiveness of stents in treating patients at
risk of stroke with some unexpected results. stent30 represents
the results 30 days after stroke and stent365 represents the results
365 days after stroke.
Usage
stent30
Format
A data frame with 451 observations on the following 2 variables.
- group
 a factor with levels control and treatment
- outcome
 a factor with levels no event and stroke
Source
Chimowitz MI, Lynn MJ, Derdeyn CP, et al. 2011. Stenting versus Aggressive Medical Therapy for Intracranial Arterial Stenosis. New England Journal of Medicine 365:993-1003. doi:10.1056/NEJMoa1105335. NY Times article reporting on the study: https://www.nytimes.com/2011/09/08/health/research/08stent.html.
Examples
# 30-day results
table(stent30)
# 365-day results
table(stent365)
Monthly Returns for a few stocks
Description
Monthly return data for a few stocks, which covers stock prices from November 2015 through October 2018.
Usage
stocks_18
Format
A data frame with 36 observations on the following 4 variables.
- date
 First day of the month corresponding to the returns.
- goog
 Google stock price change.
- cat
 Caterpillar stock price change.
- xom
 Exxon Mobil stock price change.
Source
Yahoo! Finance, direct download.
Examples
d <- stocks_18
dim(d)
apply(d[, 2:3], 2, mean)
apply(d[, 2:3], 2, sd)
Community college housing (simulated data, 2015)
Description
These simulated data are intended to represent the housing prices paid by students at a college.
Usage
student_housing
Format
A data frame with 175 observations on the following variable.
- price
 Monthly housing price, simulated.
Examples
set.seed(5)
generate_student_housing <- data.frame(
  price = round(rnorm(175, 515, 65) + exp(rnorm(175, 4.2, 1)))
)
hist(student_housing$price, 20)
t.test(student_housing$price)
mean(student_housing$price)
sd(student_housing$price)
identical(student_housing, generate_student_housing)
Sleep for 110 students (simulated)
Description
A simulated dataset for how much 110 college students each slept in a single night.
Usage
student_sleep
Format
A data frame with 110 observations on the following variable.
- hours
 Number of hours slept by this student (simulated).
Source
Simulated data.
Examples
set.seed(2)
x <- exp(c(
  rnorm(100, log(7.5), 0.15),
  rnorm(10, log(10), 0.196)
))
x <- round(x - mean(x) + 7.42, 2)
identical(x, student_sleep$hours)
Treating heart attacks
Description
Experiment data for studying the efficacy of treating patients who have had a heart attack with Sulphinpyrazone.
Usage
sulphinpyrazone
Format
A data frame with 1475 observations on the following 2 variables.
- group
 a factor with levels control and treatment
- outcome
 a factor with levels died and lived
Source
Anturane Reinfarction Trial Research Group. 1980. Sulfinpyrazone in the prevention of sudden death after myocardial infarction. New England Journal of Medicine 302(5):250-256.
Examples
sulphinpyrazone
Supreme Court approval rating
Description
Summary of a random survey of 976 people.
Usage
supreme_court
Format
A data frame with 976 observations on the following variable.
- answer
 a factor with levels approve and not
Source
Examples
supreme_court
Teacher Salaries in St. Louis, Michigan
Description
This dataset contains teacher salaries from 2009-2010 for 71 teachers employed by the St. Louis Public School in Michigan, as well as several covariates.
Usage
teacher
Format
A data frame with 71 observations on the following 8 variables.
- id
 Identification code for each teacher, assigned randomly.
- degree
 Highest educational degree attained: BA (bachelor's degree) or MA (master's degree).
- fte
 Full-time enrollment status: full-time 1 or part-time 0.5.
- years
 Number of years employed by the school district.
- base
 Base annual salary, in dollars.
- fica
 Amount paid into Social Security and Medicare per year through the Federal Insurance Contribution Act (FICA), in dollars.
- retirement
 Amount paid into the retirement fund of the teacher per year, in dollars.
- total
 Total annual salary of the teacher, resulting from the sum of base salary + fica + retirement, in dollars.
Source
Originally posted on SODA Developers (dev.socrata.com/data), removed in 2020.
Examples
library(ggplot2)
# Salary and education level
ggplot(teacher, aes(x = degree, y = base)) +
  geom_boxplot() +
  labs(
    x = "Highest educational degree attained",
    y = "Base annual salary, in $",
    color = "Degree",
    title = "Salary and education level"
  )
# Salary and years of employment
ggplot(teacher, aes(x = years, y = base, color = degree)) +
  geom_point() +
  labs(
    x = "Number of years employed by the school district",
    y = "Base annual salary, in $",
    color = "Degree",
    title = "Salary and years of employment"
  )
Textbook data for UCLA Bookstore and Amazon
Description
A random sample was taken of nearly 10% of UCLA courses. The most expensive textbook for each course was identified, and its new prices at the UCLA Bookstore and on Amazon.com were recorded.
Usage
textbooks
Format
A data frame with 73 observations on the following 7 variables.
- dept_abbr
 Course department (abbreviated).
- course
 Course number.
- isbn
 Book ISBN.
- ucla_new
 New price at the UCLA Bookstore.
- amaz_new
 New price on Amazon.com.
- more
 Whether additional books were required for the course (Y means "yes, additional books were required").
- diff
 The UCLA Bookstore price minus the Amazon.com price for each book.
Details
The sample represents only courses where textbooks were listed online
through UCLA Bookstore's website. The most expensive textbook was selected
based on the UCLA Bookstore price, which may introduce bias into the data; for
this reason, it may be beneficial to analyze only the data where more
is "N".
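Because diff is the UCLA Bookstore price minus the Amazon.com price for the same book, a one-sample t-test on the differences is equivalent to a paired t-test on the two price columns. The sketch below illustrates this with hypothetical prices (the values are made up and are not rows of textbooks); only the column names ucla_new and amaz_new come from the Format section.

```r
# Hypothetical new prices for five books (illustrative values only)
ucla_new <- c(27.67, 40.59, 31.68, 16.00, 18.95)
amaz_new <- c(27.95, 31.14, 32.00, 11.52, 14.21)

# Price difference per book, defined as in the `diff` variable
price_diff <- ucla_new - amaz_new

# One-sample t-test on the differences ...
one_sample <- t.test(price_diff)

# ... gives the same statistic and p-value as a paired t-test
paired <- t.test(ucla_new, amaz_new, paired = TRUE)

all.equal(unname(one_sample$statistic), unname(paired$statistic))
all.equal(one_sample$p.value, paired$p.value)
```

Restricting the real data to rows where more is "N" before testing, as suggested above, would only change which rows enter the computation.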
Source
Collected by David Diez.
Examples
library(ggplot2)
ggplot(textbooks, aes(x = diff)) +
  geom_histogram(binwidth = 5)
t.test(textbooks$diff)
Thanksgiving spending, simulated based on Gallup poll.
Description
This entry gives simulated spending data for Americans during Thanksgiving in 2009 based on findings of a Gallup poll.
Usage
thanksgiving_spend
Format
A data frame with 436 observations on the following 1 variable.
- spending
 Amount of spending, in US dollars.
Examples
library(ggplot2)
ggplot(thanksgiving_spend, aes(x = spending)) +
  geom_histogram(binwidth = 20)
Tip data
Description
A simulated dataset of tips over a few weeks on a couple days per week. Each tip is associated with a single group, which may include several bills and tables (i.e. groups paid in one lump sum in simulations).
Usage
tips
Format
A data frame with 95 observations on the following 5 variables.
- week
 Week number.
- day
 Day, either Friday or Tuesday.
- n_peop
 Number of people associated with the group.
- bill
 Total bill for the group.
- tip
 Total tip from the group.
Details
This dataset was built using simulations of tables, then bills, then tips based on the bills. Large groups were assumed to only pay the gratuity, which is evident in the data. Tips were set to be plausible round values; they were often (but not always) rounded to dollars, quarters, etc.
Source
Simulated dataset.
Examples
library(ggplot2)
ggplot(tips, aes(x = day, y = tip)) +
  geom_boxplot()
ggplot(tips, aes(x = tip, fill = factor(week))) +
  geom_density(alpha = 0.5) +
  labs(x = "Tip", y = "Density", fill = "Week")
ggplot(tips, aes(x = tip)) +
  geom_dotplot()
ggplot(tips, aes(x = tip, fill = factor(day))) +
  geom_density(alpha = 0.5) +
  labs(x = "Tip", y = "Density", fill = "Day")
Simulated polling dataset
Description
Simulated data for a fake political candidate.
Usage
toohey
Format
A data frame with 500 observations on the following variable.
- vote_for
 a factor with levels no and yes
Examples
toohey
Turkey tourism
Description
Summary of tourism in Turkey.
Usage
tourism
Format
A data frame with 47 observations on the following 3 variables.
- year
 a numeric vector
- visitor_count_tho
 a numeric vector
- tourist_spending
 a numeric vector
Source
Association of Turkish Travel Agencies, Foreign Visitors Figure & Tourist Spendings By Years. http://www.tursab.org.tr/en/statistics/foreign-visitors-figure-tourist-spendings-by-years_1083.html
Examples
tourism
Simulated dataset for ANOVA
Description
Simulated dataset for building intuition about the ideas that ANOVA is based on.
Usage
toy_anova
Format
A data frame with 70 observations on the following 3 variables.
- group
 a factor with levels I, II, and III
- outcome
 a numeric vector
Examples
toy_anova
Transplant consultant success rate (fake data)
Description
Summarizing whether there was or was not a complication for 62 patients who used a particular medical consultant.
Usage
transplant
Format
A data frame with 62 observations on the following variable.
- outcome
 a factor with levels complications and okay
Examples
transplant
Construct tree diagrams
Description
Construct beautiful tree diagrams
Usage
treeDiag(
  main,
  p1,
  p2,
  out1 = c("Yes", "No"),
  out2 = c("Yes", "No"),
  textwd = 0.15,
  solwd = 0.2,
  SBS = c(TRUE, TRUE),
  showSol = TRUE,
  solSub = NULL,
  digits = 4,
  textadj = 0.015,
  cex.main = 1.3,
  col.main = "#999999",
  showWork = FALSE
)
Arguments
main | 
 Character vector with two variable names, descriptions, or questions  | 
p1 | 
 Vector of probabilities for the primary branches  | 
p2 | 
 List for the secondary branches, where each list item should be a numerical vector of probabilities corresponding to one of the primary branches  | 
out1 | 
 Character vector of the outcomes corresponding to the primary branches  | 
out2 | 
 Character vector of the outcomes corresponding to the secondary branches  | 
textwd | 
 The width provided for text with a default of 0.15  | 
solwd | 
 The width provided for the solution with a default of 0.2  | 
SBS | 
 A boolean vector indicating whether to place text and probability side-by-side for the primary and secondary branches  | 
showSol | 
 Boolean indicating whether to show the solution in the tree diagram  | 
solSub | 
 An optional list of vectors corresponding to p2  | 
digits | 
 The number of digits to show in the solution  | 
textadj | 
 Vertical adjustment of text  | 
cex.main | 
 Size of main  | 
col.main | 
 Color of main  | 
showWork | 
 Whether work should be shown for the solutions  | 
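As a rough sketch of what a tree diagram encodes, the joint probability at the end of each path is the product of the probabilities along its branches. The base R code below computes those products for a two-branch tree; it assumes this is the quantity shown as the solution, and it does not call treeDiag or reproduce its internals.

```r
# Primary and secondary branch probabilities
p1 <- c(0.8, 0.2)
p2 <- list(c(0.97, 0.03), c(0.15, 0.85))
out1 <- c("Yes", "No")
out2 <- c("Yes", "No")

# Multiply each primary probability by its vector of secondary probabilities
joint <- unlist(Map(function(p, q) p * q, p1, p2))

# Label each path as "primary outcome & secondary outcome"
names(joint) <- paste(
  rep(out1, each = length(out2)),
  rep(out2, times = length(out1)),
  sep = " & "
)
joint

# The joint probabilities over all paths should sum to 1
sum(joint)
```

A check like sum(joint) is a quick way to catch probability vectors that do not add up to 1 before drawing the diagram.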
Author(s)
David Diez, Christopher Barr
Examples
treeDiag(
  c("Flight on time?", "Luggage on time?"),
  c(0.8, 0.2), list(c(0.97, 0.03), c(0.15, 0.85))
)
treeDiag(c("Breakfast?", "Go to class"), c(.4, .6),
  list(c(0.4, 0.36, 0.24), c(0.6, 0.3, 0.1)), c("Yes", "No"),
  c("Statistics", "English", "Sociology"),
  showWork = TRUE
)
treeDiag(
  c("Breakfast?", "Go to class"), c(0.4, 0.11, 0.49),
  list(c(0.4, 0.36, 0.24), c(0.6, 0.3, 0.1), c(0.1, 0.4, 0.5)),
  c("one", "two", "three"), c("Statistics", "English", "Sociology")
)
treeDiag(c("Dow Jones rise?", "NASDAQ rise?"),
  c(0.53, 0.47), list(c(0.75, 0.25), c(0.72, 0.28)),
  solSub = list(c("(a)", "(b)"), c("(c)", "(d)")), solwd = 0.08
)
twins
Description
A data frame containing data collected in the mid-20th century by Cyril Burt from a study that tracked down identical twins who were separated at birth: one child was raised in the home of their biological parents and the other in a foster home. In an attempt to answer the question of whether intelligence is the result of nature or nurture, both children were given IQ tests.
Usage
twins
Format
A data frame with 27 observations on the following 2 variables.
- foster
 IQ score of the twin raised by foster parents.
- biological
 IQ score of the twin raised by biological parents.
Examples
library(ggplot2)
library(dplyr)
library(tidyr)
plot_data <- twins |>
  pivot_longer(cols = c(foster, biological), names_to = "twin", values_to = "iq")
ggplot(plot_data, aes(iq, fill = twin)) +
  geom_histogram(color = "white", binwidth = 5) +
  facet_wrap(~twin) +
  theme_minimal() +
  labs(
    title = "IQ of identical twins",
    subtitle = "Separated at birth",
    x = "IQ",
    y = "Count",
    fill = ""
  )
UCLA courses in Fall 2018
Description
List of all courses at UCLA during Fall 2018.
Usage
ucla_f18
Format
A data frame with 3950 observations on the following 14 variables.
- year
 Year the course was offered
- term
 Term the course was offered
- subject
 Subject
- subject_abbr
 Subject abbreviation, if any
- course
 Course name
- course_num
 Course number, complete
- course_numeric
 Course number, numeric only
- seminar
 Boolean for if this is a seminar course
- ind_study
 Boolean for if this is some form of independent study
- apprenticeship
 Boolean for if this is an apprenticeship
- internship
 Boolean for if this is an internship
- honors_contracts
 Boolean for if this is an honors contracts course
- laboratory
 Boolean for if this is a lab
- special_topic
 Boolean for if this is any of the special types of courses listed
Source
https://sa.ucla.edu/ro/public/soc, retrieved 2018-11-22.
Examples
nrow(ucla_f18)
table(ucla_f18$special_topic)
subset(ucla_f18, is.na(course_numeric))
table(subset(ucla_f18, !special_topic)$course_numeric < 100)
elig_courses <-
  subset(ucla_f18, !special_topic & course_numeric < 100)
set.seed(1)
ucla_textbooks_f18 <-
  elig_courses[sample(nrow(elig_courses), 100), ]
tmp <- order(
  ucla_textbooks_f18$subject,
  ucla_textbooks_f18$course_numeric
)
ucla_textbooks_f18 <- ucla_textbooks_f18[tmp, ]
rownames(ucla_textbooks_f18) <- NULL
head(ucla_textbooks_f18)
Sample of UCLA course textbooks for Fall 2018
Description
A sample of courses offered at UCLA in Fall 2018 was collected, and the corresponding textbook prices were collected from the UCLA bookstore and also from Amazon.
Usage
ucla_textbooks_f18
Format
A data frame with 201 observations on the following 20 variables.
- year
 Year the course was offered
- term
 Term the course was offered
- subject
 Subject
- subject_abbr
 Subject abbreviation, if any
- course
 Course name
- course_num
 Course number, complete
- course_numeric
 Course number, numeric only
- seminar
 Boolean for if this is a seminar course.
- ind_study
 Boolean for if this is some form of independent study
- apprenticeship
 Boolean for if this is an apprenticeship
- internship
 Boolean for if this is an internship
- honors_contracts
 Boolean for if this is an honors contracts course
- laboratory
 Boolean for if this is a lab
- special_topic
 Boolean for if this is any of the special types of courses listed
- textbook_isbn
 Textbook ISBN
- bookstore_new
 New price at the UCLA bookstore
- bookstore_used
 Used price at the UCLA bookstore
- amazon_new
 New price sold by Amazon
- amazon_used
 Used price sold by Amazon
- notes
 Any relevant notes
Details
A past dataset was collected from UCLA courses in Spring 2010, and Amazon prices at that time were found to be almost uniformly lower than those of the UCLA bookstore. Now in 2018, the UCLA bookstore is about even with Amazon on the vast majority of titles, and there is no statistically significant difference in the sample data.
When a course required more than one book, the most expensive required book was generally the one used.
We advocate for using raw amount differences instead of percent differences because a 20% price difference on low-priced books would balance numerically (but not in a practical sense) a moderate but important price difference on more expensive books. So while this tends to result in a bit less sensitivity in detecting some effect, we believe the absolute difference compares prices in a more meaningful way.
Used prices include the shipping cost but do not include tax. The used prices are a more nuanced comparison, since these are all 3rd party sellers. Amazon is often more a marketplace than a retail site at this point, and many people buy from 3rd party sellers on Amazon now without realizing it. The relationship Amazon has with 3rd party sellers is also challenging. Given the frequently changing dynamics in this space, we don't think any analysis here will be very reliable for long term insights since products from these sellers change frequently in quantity and price. For this reason, we focus only on new books sold directly by Amazon in our comparison. In a future round of data collection, it may be interesting to explore whether the dynamics have changed in the used market.
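To illustrate the point about raw versus percent differences, here is a minimal sketch using made-up prices. The two books below are hypothetical and are not taken from ucla_textbooks_f18; the percent difference is computed relative to the Amazon price, which is an assumption made only for this illustration.

```r
# Hypothetical prices in dollars; not taken from ucla_textbooks_f18.
prices <- data.frame(
  book = c("low-priced book", "expensive book"),
  bookstore_new = c(8, 240),
  amazon_new = c(10, 200)
)

# Raw difference in dollars, and percent difference relative to the Amazon price.
prices$diff <- prices$bookstore_new - prices$amazon_new
prices$pct_diff <- 100 * prices$diff / prices$amazon_new

prices
mean(prices$pct_diff) # the -20% and +20% differences cancel out
mean(prices$diff)     # the raw differences do not: (-2 + 40) / 2
```

In this toy case the percent differences average to zero even though the bookstore costs $38 more in total, which is the kind of numerical balancing the raw differences avoid.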
Source
https://sa.ucla.edu/ro/public/soc
Examples
library(ggplot2)
library(dplyr)
ggplot(ucla_textbooks_f18, aes(x = bookstore_new, y = amazon_new)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "orange") +
  labs(
    x = "UCLA Bookstore price", y = "Amazon price",
    title = "Amazon vs. UCLA Bookstore prices of new textbooks",
    subtitle = "Orange line represents y = x"
  )
# The following outliers were double checked for accuracy
ucla_textbooks_f18_with_diff <- ucla_textbooks_f18 |>
  mutate(diff = bookstore_new - amazon_new)
ucla_textbooks_f18_with_diff |>
  filter(diff > 20 | diff < -20)
# Distribution of price differences
ggplot(ucla_textbooks_f18_with_diff, aes(x = diff)) +
  geom_histogram(binwidth = 5)
# t-test of price differences
t.test(ucla_textbooks_f18_with_diff$diff)
United Kingdom Demographic Data
Description
This dataset comes from the Guardian's Data Blog and includes five financial demographic variables.
Usage
ukdemo
Format
A data frame with 12 observations on the following 6 variables.
- region
 Region in the United Kingdom
- debt
 Average regional debt, not including mortgages, in pounds
- unemployment
 Percent unemployment
- house
 Average house price, in pounds
- pay
 Average hourly pay, in pounds
- rpi
 Retail price index, standardized to 100 for the entire UK; lower index scores correspond to lower prices
Source
The data was described in the Guardian Data Blog: https://www.theguardian.com/news/datablog/interactive/2011/oct/27/debt-money-expert-facts, retrieved 2011-11-01.
References
Guardian Data Blog
Examples
library(ggplot2)
ggplot(ukdemo, aes(x = pay, y = rpi)) +
  geom_point() +
  labs(x = "Average hourly pay", y = "Retail price index")
Annual unemployment since 1890
Description
A compilation of two datasets that provides an estimate of unemployment from 1890 to 2010.
Usage
unempl
Format
A data frame with 121 observations on the following 3 variables.
- year
 Year
- unemp
 Unemployment rate, in percent
- us_data
 1 if from the Bureau of Labor Statistics, 0 otherwise
Source
The data are from Wikipedia at the following URL accessed on November 1st, 2010:
https://en.wikipedia.org/wiki/File:US_Unemployment_1890-2009.gif
Below is a direct quote from Wikipedia describing the sources of the data:
Own work by Peace01234 Complete raw data are on Peace01234. 1930-2009 data are from Bureau of Labor Statistics (BLS), Employment status of the civilian noninstitutional population, 1940 to date retrieved on March 6, 2009 and February 12, 2010 from the BLS' FTP server. Data prior to 1948 are for persons age 14 and over. Data beginning in 1948 are for persons age 16 and over. See also "Historical Comparability" under the Household Data section of the Explanatory Notes at https://www.bls.gov/cps/eetech_methods.pdf. 1890-1930 data are from Christina Romer (1986). "Spurious Volatility in Historical Unemployment Data", The Journal of Political Economy, 94(1): 1-37. 1930-1940 data are from Robert M. Coen (1973). "Labor Force and Unemployment in the 1920's and 1930's: A Re-Examination Based on Postwar Experience", The Review of Economics and Statistics, 55(1): 46-55. Unemployment data was only surveyed once each decade until 1940 when yearly surveys were begun. The yearly data estimates before 1940 are based on the decade surveys combined with other relevant surveys that were collected during those years. The methods are described in detail by Coen and Romer.
Examples
# =====> Time Series Plot of Data <=====#
COL <- c("#DDEEBB", "#EEDDBB", "#BBDDEE", "#FFD5DD", "#FFC5CC")
plot(unempl$year, unempl$unemp, type = "n")
rect(0, -50, 3000, 100, col = "#E2E2E2")
rect(1914.5, -1000, 1918.9, 1000, col = COL[1], border = "#E2E2E2")
rect(1929, -1000, 1939, 1000, col = COL[2], border = "#E2E2E2")
rect(1939.7, -1000, 1945.6, 1000, col = COL[3], border = "#E2E2E2")
rect(1955.8, -1000, 1965.3, 1000, col = COL[4], border = "#E2E2E2")
rect(1965.3, -1000, 1975.4, 1000, col = COL[5], border = "#E2E2E2")
abline(h = seq(0, 50, 5), col = "#F8F8F8", lwd = 2)
abline(v = seq(1900, 2000, 20), col = "#FFFFFF", lwd = 1.3)
lines(unempl$year, unempl$unemp)
points(unempl$year, unempl$unemp, pch = 20)
legend("topright",
  fill = COL,
  c(
    "World War I", "Great Depression", "World War II",
    "Vietnam War Start", "Vietnam War Escalated"
  ),
  bg = "#FFFFFF", border = "#FFFFFF"
)
President's party performance and unemployment rate
Description
Covers midterm elections.
Usage
unemploy_pres
Format
A data frame with 29 observations on the following 5 variables.
- year
 Year.
- potus
 The president in office.
- party
 President's party.
- unemp
 Unemployment rate.
- change
 Change in House seats for the president's party.
Source
Wikipedia.
Examples
unemploy_pres
US temperatures in 1950 and 2022
Description
A set of monitoring locations was taken from NOAA data for 1950 and 2022, chosen so that the locations are spread roughly evenly across the continental US (the observations do not represent a random sample of geographical locations).
Usage
us_temperature
Format
A data frame with 18759 observations on the following 9 variables.
- location
 Location of the NOAA weather station.
- station
 Formal ID of the NOAA weather station.
- latitude
 Latitude of the NOAA weather station.
- longitude
 Longitude of the NOAA weather station.
- elevation
 Elevation of the NOAA weather station.
- date
 Date the measurement was taken (Y-m-d).
- tmax
 Maximum daily temperature (Fahrenheit).
- tmin
 Minimum daily temperature (Fahrenheit).
- year
 Year of the measurement.
Details
Please keep in mind that the data represent two annual snapshots, and a complete analysis would consider more than two years of data and a random or more complete sampling of weather stations across the US.
Source
NOAA Climate Data Online. Retrieved 23 September, 2023.
Examples
library(dplyr)
library(ggplot2)
library(maps)
summarized_temp <- us_temperature |>
  group_by(station, year, latitude, longitude) |>
  summarize(tmax_med = median(tmax, na.rm = TRUE)) |>
  mutate(plot_shift = ifelse(year == "1950", 0, 1)) |>
  mutate(year = as.factor(year))
usa <- map_data("state")
ggplot(data = usa, aes(x = long, y = lat)) +
  geom_polygon(aes(group = group), color = "black", fill = "white") +
  geom_point(
    data = summarized_temp,
    aes(
      x = longitude + plot_shift, y = latitude,
      color = tmax_med, shape = year
    )
  ) +
  scale_color_gradient(high = IMSCOL["red", 1], low = IMSCOL["yellow", 1]) +
  ggtitle("Median of the daily high temp, 1950 & 2022") +
  labs(
    x = "longitude",
    y = "latitude",
    color = "median high temp"
  ) +
  guides(shape = guide_legend(override.aes = list(color = "black")))
ucb_admit
Description
Data from a study carried out by the Graduate Division of the University of California, Berkeley in the early 1970s to evaluate whether there was a sex bias in graduate admissions.
Usage
ucb_admit
Format
A data frame with 4526 observations on the following 3 variables.
- admit
 Was the applicant admitted to the university?
- gender
 Whether the applicant identified as male or female.
- dept
 What department did the applicant apply to, noted as A through F for confidentiality.
Examples
library(ggplot2)
library(dplyr)
plot_data <- ucb_admit |>
  count(dept, gender, admit)
ggplot(plot_data, aes(dept, n, fill = gender)) +
  geom_col(position = "dodge") +
  facet_wrap(~admit) +
  theme_minimal() +
  labs(
    title = "Does gender discrimination play a role in college admittance?",
    x = "Department",
    y = "Number of Students",
    fill = "Gender",
    caption = "Source: UC Berkeley, 1970's"
  )
Time Between Gondola Cars at Sterling Winery
Description
These are the times between gondola cars at Sterling Winery. The main take-away: there are 7 cars, as evidenced by the somewhat regular increase in the gap after every 7th car. The reason the times are slightly non-constant is that the gondolas come off the tracks, so times will change a little between each period.
Usage
winery_cars
Format
A data frame with 52 observations on the following 2 variables.
- obs_number
 The observation number, e.g. observation 3 was immediately preceded by observation 2.
- time_until_next
 Time until this gondola car arrived since the last car had left.
Details
Important context: the line was long enough that people were leaving the winery.
So why is this data valuable? It indicates that the winery should add one more car, since a lot of time is wasted after every 7th car. By adding another car, fewer visitors are likely to be turned away, resulting in increased revenue.
Source
In-person data collection by David Diez (OpenIntro) on 2013-07-04.
Examples
winery_cars$car_number <- rep(1:7, 10)[1:nrow(winery_cars)]
col <- COL[ifelse(winery_cars$car_number == 3, 4, 1)]
plot(winery_cars[, c("obs_number", "time_until_next")],
  col = col, pch = 19
)
plot(winery_cars$car_number, winery_cars$time_until_next,
  col = fadeColor(col, "88"), pch = 19
)
World Population Data
Description
Population by country from 1960 to 2020, from the World Bank.
Usage
world_pop
Format
A data frame with 216 rows and 62 variables.
- country
 Name of country.
- year_1960
 population in 1960.
- year_1961
 population in 1961.
- year_1962
 population in 1962.
- year_1963
 population in 1963.
- year_1964
 population in 1964.
- year_1965
 population in 1965.
- year_1966
 population in 1966.
- year_1967
 population in 1967.
- year_1968
 population in 1968.
- year_1969
 population in 1969.
- year_1970
 population in 1970.
- year_1971
 population in 1971.
- year_1972
 population in 1972.
- year_1973
 population in 1973.
- year_1974
 population in 1974.
- year_1975
 population in 1975.
- year_1976
 population in 1976.
- year_1977
 population in 1977.
- year_1978
 population in 1978.
- year_1979
 population in 1979.
- year_1980
 population in 1980.
- year_1981
 population in 1981.
- year_1982
 population in 1982.
- year_1983
 population in 1983.
- year_1984
 population in 1984.
- year_1985
 population in 1985.
- year_1986
 population in 1986.
- year_1987
 population in 1987.
- year_1988
 population in 1988.
- year_1989
 population in 1989.
- year_1990
 population in 1990.
- year_1991
 population in 1991.
- year_1992
 population in 1992.
- year_1993
 population in 1993.
- year_1994
 population in 1994.
- year_1995
 population in 1995.
- year_1996
 population in 1996.
- year_1997
 population in 1997.
- year_1998
 population in 1998.
- year_1999
 population in 1999.
- year_2000
 population in 2000.
- year_2001
 population in 2001.
- year_2002
 population in 2002.
- year_2003
 population in 2003.
- year_2004
 population in 2004.
- year_2005
 population in 2005.
- year_2006
 population in 2006.
- year_2007
 population in 2007.
- year_2008
 population in 2008.
- year_2009
 population in 2009.
- year_2010
 population in 2010.
- year_2011
 population in 2011.
- year_2012
 population in 2012.
- year_2013
 population in 2013.
- year_2014
 population in 2014.
- year_2015
 population in 2015.
- year_2016
 population in 2016.
- year_2017
 population in 2017.
- year_2018
 population in 2018.
- year_2019
 population in 2019.
- year_2020
 population in 2020.
Examples
library(dplyr)
library(ggplot2)
library(tidyr)
# List percentage of population change from 1960 to 2020
world_pop |>
  mutate(percent_change = round((year_2020 - year_1960) / year_1960 * 100, 2)) |>
  mutate(rank_pop_change = round(rank(-percent_change), 0)) |>
  select(rank_pop_change, country, percent_change) |>
  arrange(rank_pop_change)
# Graph population in millions by decade for specified countries
world_pop |>
  select(
    country, year_1960, year_1970, year_1980, year_1990,
    year_2000, year_2010, year_2020
  ) |>
  filter(country %in% c("China", "India", "United States")) |>
  pivot_longer(
    cols = c(year_1960, year_1970, year_1980, year_1990, year_2000, year_2010, year_2020),
    names_to = "year",
    values_to = "population"
  ) |>
  mutate(
    year = as.numeric(gsub("year_", "", year)),
    population = population / 1e6 # convert to millions to match the axis label
  ) |>
  ggplot(aes(year, population, color = country)) +
  geom_point() +
  geom_smooth(method = "loess", formula = "y ~ x") +
  labs(
    title = "Population",
    subtitle = "by Decade",
    x = "Year",
    y = "Population (in millions)",
    color = "Country"
  )
Create a CSV variant of .rda files
Description
The function should be run with a path to a package directory.
It will then look through the data directory of the package,
and for all datasets that are data frames, create CSV variants
in a data-csv directory.
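The steps described above can be sketched in base R. This is only an illustration of the idea, not the code behind write_pkg_data(): the helper name rda_to_csv_sketch() and its arguments are hypothetical, it takes the data directory itself rather than a package path, and it assumes each dataset is stored as an .rda file.

```r
# Hypothetical helper; an illustration only, not the code behind write_pkg_data().
rda_to_csv_sketch <- function(data_dir, out_dir = "data-csv", overwrite = FALSE) {
  # Create the output directory (recursively) if it does not exist yet.
  if (!dir.exists(out_dir)) {
    dir.create(out_dir, recursive = TRUE)
  }
  rda_files <- list.files(data_dir, pattern = "\\.rda$", full.names = TRUE)
  written <- character(0)
  for (rda_file in rda_files) {
    # Load into a fresh environment so the global workspace is untouched.
    env <- new.env()
    for (obj_name in load(rda_file, envir = env)) {
      obj <- get(obj_name, envir = env)
      if (!is.data.frame(obj)) next # only data frames are exported
      csv_file <- file.path(out_dir, paste0(obj_name, ".csv"))
      if (file.exists(csv_file) && !overwrite) next # keep existing files
      utils::write.csv(obj, csv_file, row.names = FALSE)
      written <- c(written, csv_file)
    }
  }
  invisible(written)
}

# Usage with a throwaway directory holding one data frame and one plain vector.
data_dir <- file.path(tempdir(), "sketch-data")
dir.create(data_dir, showWarnings = FALSE)
toy_df <- data.frame(x = 1:3, y = c("a", "b", "c"))
toy_vec <- 1:10
save(toy_df, file = file.path(data_dir, "toy_df.rda"))
save(toy_vec, file = file.path(data_dir, "toy_vec.rda"))
out_dir <- file.path(tempdir(), "sketch-data-csv")
rda_to_csv_sketch(data_dir, out_dir)
list.files(out_dir)
```

Only toy_df is written, since toy_vec is not a data frame. Loading each file into its own environment keeps the sketch from overwriting objects in the caller's workspace.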
Usage
write_pkg_data(
  pkg,
  dir = paste0("data-", out_type),
  overwrite = FALSE,
  out_type = c("csv", "tab", "R")
)
Arguments
pkg | 
 The R package where we'd like to generate CSVs of any data frames.  | 
dir | 
 A character string representing the path to the folder where the output files should be written. If no such directory exists, one will be created (recursively).  | 
overwrite | 
 Boolean to indicate whether to overwrite any existing files that have conflicting names in the directory specified.  | 
out_type | 
 Format for the type of output: "csv", "tab", or "R".  | 
Examples
## Not run: 
write_pkg_data("openintro")
list.files("data-csv")
## End(Not run)
Exxon Mobil stock data
Description
Monthly data covering 2006 through early 2014.
Usage
xom
Format
A data frame with 98 observations on the following 7 variables.
- date
 Date.
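- open
 Opening price (a numeric vector).
- high
 Highest price in the period (a numeric vector).
- low
 Lowest price in the period (a numeric vector).
- close
 Closing price (a numeric vector).
- volume
 Trading volume (a numeric vector).
- adj_close
 Adjusted closing price (a numeric vector).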
Source
Yahoo! Finance.
Examples
xom
Contagiousness of yawning
Description
An experiment conducted by the MythBusters, a science entertainment TV program on the Discovery Channel, tested if a person can be subconsciously influenced into yawning if another person near them yawns. 50 people were randomly assigned to two groups: 34 to a group where a person near them yawned (treatment) and 16 to a group where there wasn't a person yawning near them (control).
Usage
yawn
Format
A data frame with 50 observations on the following 2 variables.
- result
 a factor with levels not yawn and yawn
- group
 a factor with levels ctrl and trmt
Source
MythBusters, Season 3, Episode 28.
Examples
yawn
Youth Risk Behavior Surveillance System (YRBSS)
Description
Select variables from YRBSS.
Usage
yrbss
Format
A data frame with 13583 observations on the following 13 variables.
- age
 Age, in years.
- gender
 Gender.
- grade
 School grade.
- hispanic
 Hispanic or not.
- race
 Race / ethnicity.
- height
 Height, in meters (3.28 feet per meter).
- weight
 Weight, in kilograms (2.2 pounds per kilogram).
- helmet_12m
 How often did you wear a helmet when biking in the last 12 months?
- text_while_driving_30d
 How many days did you text while driving in the last 30 days?
- physically_active_7d
 How many days were you physically active for 60+ minutes in the last 7 days?
- hours_tv_per_school_day
 How many hours of TV do you typically watch on a school night?
- strength_training_7d
 How many days did you do strength training (e.g. lift weights) in the last 7 days?
- school_night_hours_sleep
 How many hours of sleep do you typically get on a school night?
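The conversion factors quoted for height and weight can be applied directly if imperial units are preferred. The sketch below uses made-up measurements rather than rows from yrbss, and the object names are hypothetical.

```r
# Conversion factors quoted in the height and weight descriptions.
feet_per_meter <- 3.28
pounds_per_kilogram <- 2.2

# Hypothetical measurements, not rows from yrbss.
height_m <- c(1.50, 1.75)
weight_kg <- c(50, 80)

height_ft <- height_m * feet_per_meter
weight_lb <- weight_kg * pounds_per_kilogram

data.frame(height_m, height_ft, weight_kg, weight_lb)
```

With the package attached, the same arithmetic applies to yrbss$height and yrbss$weight.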
Source
CDC's Youth Risk Behavior Surveillance System (YRBSS)
Examples
table(yrbss$physically_active_7d)
Sample of Youth Risk Behavior Surveillance System (YRBSS)
Description
A sample of the yrbss dataset.
Usage
yrbss_samp
Format
A data frame with 100 observations on the following 13 variables.
- age
 Age, in years.
- gender
 Gender.
- grade
 School grade.
- hispanic
 Hispanic or not.
- race
 Race / ethnicity.
- height
 Height, in meters (3.28 feet per meter).
- weight
 Weight, in kilograms (2.2 pounds per kilogram).
- helmet_12m
 How often did you wear a helmet when biking in the last 12 months?
- text_while_driving_30d
 How many days did you text while driving in the last 30 days?
- physically_active_7d
 How many days were you physically active for 60+ minutes in the last 7 days?
- hours_tv_per_school_day
 How many hours of TV do you typically watch on a school night?
- strength_training_7d
 How many days did you do strength training (e.g. lift weights) in the last 7 days?
- school_night_hours_sleep
 How many hours of sleep do you typically get on a school night?
Source
CDC's Youth Risk Behavior Surveillance System (YRBSS)
Examples
table(yrbss_samp$physically_active_7d)