Learning goals

After completing these tasks, you'll be able to convert raw data into tidy format. You will also be able to make publication quality plots using ggplot2. New concepts: tidy data, grammar of graphics, aesthetics. New packages: tidyverse (specifically tidyr, dplyr, ggplot2), patchwork

Exercises

3.1 Tidy data

Demo 3: Tidy data (talk) Note: The whole data wrangling section of this "course" relies on concepts in this video so make sure you watch this or some other video explaining tidy data (with respect to the tidyverse family of packages in R)

Demo 3: Tidy data (demo)

Please note that it's not always easy to decide how to get your data from its current state to tidy! If you feel a little overwhelmed with your own data set and how to get it to where it needs to be, you can book a one-on-one meeting with me for Monday 17th May (see Moodle for how). We can go over your specific project and I can give you tips on how to proceed! In the mean time, you can complete these tasks with the example data set.

Loading and inspecting data

If your own data set is already clean, you can also practice these steps with the exercise_data_messy.csv which is available on Moodle. On the other hand, if you have raw data which needs cleaning, you can also do this on your own data set. The solutions are shown on the exercise data. You'll need to have the package tidyverse installed and loaded to work these exercises. See a handy cheat sheet for many of the commands we're using today here

Q3.1.1 Load in your data using read_csv.

Tip: If your csv uses semicolon (;) instead of comma as a separator, you should use read_csv2 instead.

Wrangling data

Note: A more recent version of tidyverse includes functions pivot_wider and pivot_longer, which are recommended for use instead of spread and gather, respectively. You can choose to use either! The syntax is slightly different though, so if in the demo we ran

data_long <- data %>% gather(Partner:M_Stranger, key='person', value='emotional_bond')

to do the same with pivot_longer, we would use

data_long <- data %>% pivot_longer(cols=Partner:M_Stranger, names_to='person', values_to='emotional_bond')

The end result is the same in both cases (except pivot_longer sorts the data by subid and gather sorts the data by person). For more information, see the function help or see here

Q3.1.2 What needs to get done to make your data tidy? For example, do you need to go from long to wide (spread or pivot_wider) or wide to long (gather or pivot_longer) or join data from different files (full_join)?

Q3.1.3 Try to do that! In the example data set, I would argue that it makes most sense to have the data in a wide format (i.e. each food type has its own column). See if you can get it to there!

Manipulating data

Demo 3: Manipulating data (talk)

Demo 3: Manipulating data (demo)

Q3.1.4 Inspect your data as we learned last time. How many missing values? Are there any values that might be wrongly coded NA's?

Tip: Look for negative values for variables which cannot be negative, like age, or much higher values than what is feasible, like BMI over 100. You can also try plotting your variables to observe very clear outliers.

Q3.1.5 Fix any wrongly coded NA's. If you're changing a single value, you can use data %>% mutate(<columnname> = na_if(<columnname>, <valuetoreplace>)). If you're changing multiple values, you can use data %>% mutate(<columnname> = ifelse(<columnname>[logical expression, e.g. >100], NA, <columnname>))

Q3.1.6 Make sure all of your factors are factors. If not, transform them with mutate

Q3.1.7 See if you want to rename some columns with rename to make the names shorter (and therefore quicker to type). In the example data set, the first five columns could benefit from a shorter name.

Q3.1.8 Create a new column by combining values from existing columns with mutate. If you're working on the example data set, you can e.g. sum values for different types of bread into a new column called 'Bread'.

Q3.1.9 Use group_by to group your data according to some factor(s) and then use tally() (doesn't need any arguments if you use %>% to give it your data) to count occurrences in those groups.

3.2 Better plotting

Demo 3: Plotting with ggplot2 (talk)

Demo 3: Plotting with ggplot2 (demo)

Let's remake the bar chart from last set of exercises, but this time with ggplot2.

Q3.2.1 Start by making a basic histogram

ggplot(data, aes(x=Freshveg)) +
  geom_histogram()

Q3.2.2 Make sure you understand the different components of this function call. What does "aes" refer to? What happens if you remove the "+" at the end of the first row?

Q3.2.3 Ask the visuals to show boys and girls separately by adding fill=sex (assuming you renamed co_Gender to sex in Q3.1.5).

Tip: the value you give fill has to be a factor - if you did not mutate your factors, you should do that now

Q3.2.4 Oops, looks like ggplot will by default stack the bars and not show them side by side. You can solve this by adding position="identity" as an argument for geom_histogram.

Q3.2.5 In some cases you can now only see the colour which is plotted on top. To make the bars slightly see-through, add alpha=0.5 in the call for geom_histogram. You can also try to give other values between 0 and 1 to alpha. Will larger numbers make it more see-through or more solid?

Q3.2.6 To get the plot to show probability density instead of number of responses in a bin, add aes(y=..density..) to geom_histogram.

Q3.2.7 Add title to the plot and axis labels with labs(title="Consumption of fresh vegetables", x="Times per week", y = "Density")

Tip: Remember that with ggplot, adding new elements happens by adding a + at the end of your previously last line and then adding the new function call below

Q3.2.8 Now we get to the fun part! You can change the look of the plot by choosing a theme. Look at the themes here https://ggplot2.tidyverse.org/reference/ggtheme.html#examples and apply one of them to your plot.

Q3.2.9 We can easily add other elements to the plot. For example, adding geom_density(alpha=0.2) as a new function will overlay probability density function for our data. To superimpose a normal distribution curve, we can add stat_function(fun = dnorm, n = 101, args = list(mean = mean(data$Freshveg, na.rm=T), sd = sd(data$Freshveg, na.rm=T)), colour='red')

Q3.2.10 Now save the plot.

Tip: You have three different ways to do the saving. From easiest to most difficult, they are

  1. You can also save plot via the graphical interface with "Export" (plots tab, lower right hand side). However, I recommend you try to minimise the number of point-and-click operations you do with R.
  2. When you use ggplot2 to create your plots, you can use ggsave(<filename>) after making your plot to save it. NB: This might not work if you use base R or some other package to make your plot.
  3. You can also use pdf() and dev.off() as you did before. If you do this, you can either put your pdf() and dev.off() calls around the whole thing, or you can first assign the whole plot call to a variable (so the first line starts with p1 <- ggplot etc.).

Now make a figure with your own data, if you have some!

Q3.2.11 Start by creating a simple scatter plot with geom_point. If you want to add a bit of visual jitter (e.g. your data has a lot of similar values), you can use geom_jitter instead.

Q3.2.12 Add a trend line with geom_smooth.

Q3.2.13 Fine tune your plot: add more legible axis names if needed, pick your theme, change colours and and shapes of the dots. See this cheat sheet for help with ggplot!

Let's plot multiple plots in the same fig

Q3.2.14 Install and source package "patchwork"

Q3.2.15 Assign the barchart we did to p1 and the new plot you made to p2

Q3.2.16 Run p1 + p2

Q3.2.17 Run p1 / p2

Q3.2.18 Which one do you prefer? How would you add a third plot?

Bonus:

Think of a plot type you would like to do with your own data. Can you figure out how to do that in ggplot2? If you have made a plot from your own data with ggplot2, please share it on the course forum to show off and to motivate others!

Tip: Google (especially the image search) is your best friend in this! You can google ggplot2 (e.g. ggplot2 violin plot) to find examples of how to do that type of plot in ggplot2.

 

Created by Juulia Suvilehto
juulia.suvilehto@iki.fi

Creative Commons License