First, load the tidyverse to get access to dplyr.
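The call that produced the startup message below is not shown in this write-up. It was presumably the standard attach call, sketched here as an assumption:

```r
# Attach the core tidyverse packages (dplyr, tidyr, ggplot2, and others).
# Assumption: this is the call behind the startup message shown below.
library(tidyverse)
```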
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
In this example, we will use the iris data set built into R.
Create a new data set called iris1 that contains only the species virginica and versicolor, with sepal lengths greater than 6 cm and sepal widths greater than 2.5 cm. Note the number of observations in the data set.
iris1 <- iris %>%
  filter((Species == 'virginica' | Species == 'versicolor')
         & Sepal.Length > 6
         & Sepal.Width > 2.5)
iris1
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 7.0 3.2 4.7 1.4 versicolor
## 2 6.4 3.2 4.5 1.5 versicolor
## 3 6.9 3.1 4.9 1.5 versicolor
## 4 6.5 2.8 4.6 1.5 versicolor
## 5 6.3 3.3 4.7 1.6 versicolor
## 6 6.6 2.9 4.6 1.3 versicolor
## 7 6.1 2.9 4.7 1.4 versicolor
## 8 6.7 3.1 4.4 1.4 versicolor
## 9 6.1 2.8 4.0 1.3 versicolor
## 10 6.1 2.8 4.7 1.2 versicolor
## 11 6.4 2.9 4.3 1.3 versicolor
## 12 6.6 3.0 4.4 1.4 versicolor
## 13 6.8 2.8 4.8 1.4 versicolor
## 14 6.7 3.0 5.0 1.7 versicolor
## 15 6.7 3.1 4.7 1.5 versicolor
## 16 6.1 3.0 4.6 1.4 versicolor
## 17 6.2 2.9 4.3 1.3 versicolor
## 18 6.3 3.3 6.0 2.5 virginica
## 19 7.1 3.0 5.9 2.1 virginica
## 20 6.3 2.9 5.6 1.8 virginica
## 21 6.5 3.0 5.8 2.2 virginica
## 22 7.6 3.0 6.6 2.1 virginica
## 23 7.3 2.9 6.3 1.8 virginica
## 24 7.2 3.6 6.1 2.5 virginica
## 25 6.5 3.2 5.1 2.0 virginica
## 26 6.4 2.7 5.3 1.9 virginica
## 27 6.8 3.0 5.5 2.1 virginica
## 28 6.4 3.2 5.3 2.3 virginica
## 29 6.5 3.0 5.5 1.8 virginica
## 30 7.7 3.8 6.7 2.2 virginica
## 31 7.7 2.6 6.9 2.3 virginica
## 32 6.9 3.2 5.7 2.3 virginica
## 33 7.7 2.8 6.7 2.0 virginica
## 34 6.3 2.7 4.9 1.8 virginica
## 35 6.7 3.3 5.7 2.1 virginica
## 36 7.2 3.2 6.0 1.8 virginica
## 37 6.2 2.8 4.8 1.8 virginica
## 38 6.1 3.0 4.9 1.8 virginica
## 39 6.4 2.8 5.6 2.1 virginica
## 40 7.2 3.0 5.8 1.6 virginica
## 41 7.4 2.8 6.1 1.9 virginica
## 42 7.9 3.8 6.4 2.0 virginica
## 43 6.4 2.8 5.6 2.2 virginica
## 44 6.3 2.8 5.1 1.5 virginica
## 45 6.1 2.6 5.6 1.4 virginica
## 46 7.7 3.0 6.1 2.3 virginica
## 47 6.3 3.4 5.6 2.4 virginica
## 48 6.4 3.1 5.5 1.8 virginica
## 49 6.9 3.1 5.4 2.1 virginica
## 50 6.7 3.1 5.6 2.4 virginica
## 51 6.9 3.1 5.1 2.3 virginica
## 52 6.8 3.2 5.9 2.3 virginica
## 53 6.7 3.3 5.7 2.5 virginica
## 54 6.7 3.0 5.2 2.3 virginica
## 55 6.5 3.0 5.2 2.0 virginica
## 56 6.2 3.4 5.4 2.3 virginica
Note that iris1 contains 56 observations of 5 variables.
Now create a data set, iris2, from iris1 that contains only Species, Sepal.Length, and Sepal.Width.
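The chunk for this step is missing from the write-up. A minimal sketch that would give the same three columns in the same order, assuming `select()` and `glimpse()` were used (iris1 is rebuilt here only so the chunk runs on its own):

```r
library(dplyr)

# Rebuild iris1 with the same filter as above so this chunk is self-contained
iris1 <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6,
         Sepal.Width > 2.5)

# Keep only the three requested columns, in the order they are listed
iris2 <- iris1 %>%
  select(Species, Sepal.Length, Sepal.Width)

glimpse(iris2)
```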
## Rows: 56
## Columns: 3
## $ Species <fct> versicolor, versicolor, versicolor, versicolor, versicolo…
## $ Sepal.Length <dbl> 7.0, 6.4, 6.9, 6.5, 6.3, 6.6, 6.1, 6.7, 6.1, 6.1, 6.4, 6.…
## $ Sepal.Width <dbl> 3.2, 3.2, 3.1, 2.8, 3.3, 2.9, 2.9, 3.1, 2.8, 2.8, 2.9, 3.…
Create iris3 from iris2, ordering the observations from largest to smallest sepal length. Show the first 6 rows of this data set.
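The code for this step is also not shown. One way to get this ordering, assuming `arrange()` with `desc()` followed by `head()` (the upstream steps are repeated in one pipeline so the chunk runs on its own):

```r
library(dplyr)

# Rebuild iris2 from the built-in iris data (same steps as earlier)
iris2 <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6,
         Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width)

# Sort by sepal length, largest first; ties keep their original order
iris3 <- iris2 %>%
  arrange(desc(Sepal.Length))

# head() returns the first 6 rows by default
head(iris3)
```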
## Species Sepal.Length Sepal.Width
## 1 virginica 7.9 3.8
## 2 virginica 7.7 3.8
## 3 virginica 7.7 2.6
## 4 virginica 7.7 2.8
## 5 virginica 7.7 3.0
## 6 virginica 7.6 3.0
Create an iris4 data frame from iris3 that adds a column with the sepal area (length * width) for each observation. How many observations and variables are in the data set?
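The chunk that builds iris4 is missing here as well. A sketch assuming `mutate()` was used and the new column is named Sepal.Area, as in the printed output (earlier steps are repeated so the chunk runs on its own):

```r
library(dplyr)

# Rebuild iris3 from the built-in iris data (same steps as earlier)
iris3 <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6,
         Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width) %>%
  arrange(desc(Sepal.Length))

# Add the sepal area as length times width
iris4 <- iris3 %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width)

# dim() reports the number of observations (rows) and variables (columns)
dim(iris4)
iris4
```

The printed data frame lists 56 rows and 4 columns, which answers the question.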
## Species Sepal.Length Sepal.Width Sepal.Area
## 1 virginica 7.9 3.8 30.02
## 2 virginica 7.7 3.8 29.26
## 3 virginica 7.7 2.6 20.02
## 4 virginica 7.7 2.8 21.56
## 5 virginica 7.7 3.0 23.10
## 6 virginica 7.6 3.0 22.80
## 7 virginica 7.4 2.8 20.72
## 8 virginica 7.3 2.9 21.17
## 9 virginica 7.2 3.6 25.92
## 10 virginica 7.2 3.2 23.04
## 11 virginica 7.2 3.0 21.60
## 12 virginica 7.1 3.0 21.30
## 13 versicolor 7.0 3.2 22.40
## 14 versicolor 6.9 3.1 21.39
## 15 virginica 6.9 3.2 22.08
## 16 virginica 6.9 3.1 21.39
## 17 virginica 6.9 3.1 21.39
## 18 versicolor 6.8 2.8 19.04
## 19 virginica 6.8 3.0 20.40
## 20 virginica 6.8 3.2 21.76
## 21 versicolor 6.7 3.1 20.77
## 22 versicolor 6.7 3.0 20.10
## 23 versicolor 6.7 3.1 20.77
## 24 virginica 6.7 3.3 22.11
## 25 virginica 6.7 3.1 20.77
## 26 virginica 6.7 3.3 22.11
## 27 virginica 6.7 3.0 20.10
## 28 versicolor 6.6 2.9 19.14
## 29 versicolor 6.6 3.0 19.80
## 30 versicolor 6.5 2.8 18.20
## 31 virginica 6.5 3.0 19.50
## 32 virginica 6.5 3.2 20.80
## 33 virginica 6.5 3.0 19.50
## 34 virginica 6.5 3.0 19.50
## 35 versicolor 6.4 3.2 20.48
## 36 versicolor 6.4 2.9 18.56
## 37 virginica 6.4 2.7 17.28
## 38 virginica 6.4 3.2 20.48
## 39 virginica 6.4 2.8 17.92
## 40 virginica 6.4 2.8 17.92
## 41 virginica 6.4 3.1 19.84
## 42 versicolor 6.3 3.3 20.79
## 43 virginica 6.3 3.3 20.79
## 44 virginica 6.3 2.9 18.27
## 45 virginica 6.3 2.7 17.01
## 46 virginica 6.3 2.8 17.64
## 47 virginica 6.3 3.4 21.42
## 48 versicolor 6.2 2.9 17.98
## 49 virginica 6.2 2.8 17.36
## 50 virginica 6.2 3.4 21.08
## 51 versicolor 6.1 2.9 17.69
## 52 versicolor 6.1 2.8 17.08
## 53 versicolor 6.1 2.8 17.08
## 54 versicolor 6.1 3.0 18.30
## 55 virginica 6.1 3.0 18.30
## 56 virginica 6.1 2.6 15.86
Create iris5, which calculates the average sepal length, the average sepal width, and the sample size of the entire iris4 data frame, and print iris5.
iris5 <- iris4 %>%
  summarize(Avg.Sepal.Length = mean(Sepal.Length, na.rm = TRUE),
            Avg.Sepal.Width = mean(Sepal.Width, na.rm = TRUE),
            Sample.Size = n())
glimpse(iris5)
## Rows: 1
## Columns: 3
## $ Avg.Sepal.Length <dbl> 6.698214
## $ Avg.Sepal.Width <dbl> 3.041071
## $ Sample.Size <int> 56
Finally, create iris6, which calculates the average sepal length, the average sepal width, and the sample size for each species in the iris4 data frame, and print iris6.
iris6 <- iris4 %>%
  group_by(Species) %>%
  summarize(Avg.Sepal.Length = mean(Sepal.Length, na.rm = TRUE),
            Avg.Sepal.Width = mean(Sepal.Width, na.rm = TRUE),
            Sample.Size = n())
glimpse(iris6)
## Rows: 2
## Columns: 4
## $ Species <fct> versicolor, virginica
## $ Avg.Sepal.Length <dbl> 6.482353, 6.792308
## $ Avg.Sepal.Width <dbl> 2.988235, 3.064103
## $ Sample.Size <int> 17, 39
Create a ‘longer’ data frame from the original iris data set with three columns named “Species”, “Measure”, and “Value”. The column “Species” retains the species names from the data set. The column “Measure” records whether the value corresponds to Sepal.Length, Sepal.Width, Petal.Length, or Petal.Width, and the column “Value” holds the numerical values of those measurements.
iris_long <- iris %>%
  pivot_longer(cols = Sepal.Length:Petal.Width,
               names_to = "Measure",
               values_to = "Value")
glimpse(iris_long)
## Rows: 600
## Columns: 3
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa…
## $ Measure <chr> "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", …
## $ Value <dbl> 5.1, 3.5, 1.4, 0.2, 4.9, 3.0, 1.4, 0.2, 4.7, 3.2, 1.3, 0.2, 4.…
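The 600 rows follow from the reshape: the original iris data has 150 flowers, and each contributes one row per measurement (150 * 4 = 600). A quick sanity check, sketched with `count()`, which is not part of the original exercise:

```r
library(dplyr)
library(tidyr)

# Same reshape as above, repeated so this chunk runs on its own
iris_long <- iris %>%
  pivot_longer(cols = Sepal.Length:Petal.Width,
               names_to = "Measure",
               values_to = "Value")

# Each of the four measurements should appear once per flower
iris_long %>%
  count(Measure)
```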