After completing these tasks, you'll be able to convert raw data into tidy format. You will also be able to make publication quality plots using ggplot2. New concepts: tidy data, grammar of graphics, aesthetics. New packages: tidyverse (specifically tidyr, dplyr, ggplot2), patchwork
Demo 3: Tidy data (talk) Note: The whole data wrangling section of this "course" relies on concepts in this video so make sure you watch this or some other video explaining tidy data (with respect to the tidyverse family of packages in R)
Please note that it's not always easy to decide how to get your data from its current state to tidy! If you feel a little overwhelmed with your own data set and how to get it to where it needs to be, you can book a one-on-one meeting with me for Monday 17th May (see Moodle for how). We can go over your specific project and I can give you tips on how to proceed! In the mean time, you can complete these tasks with the example data set.
If your own data set is already clean, you can also practice these steps with the exercise_data_messy.csv which is available on Moodle. On the other hand, if you have raw data which needs cleaning, you can also do this on your own data set. The solutions are shown on the exercise data. You'll need to have the package tidyverse installed and loaded to work these exercises. See a handy cheat sheet for many of the commands we're using today here
Q3.1.1 Load in your data using read_csv
.
Tip: If your csv uses semicolon (;) instead of comma as a separator, you should use read_csv2 instead.
Note: A more recent version of tidyverse includes functions pivot_wider
and pivot_longer
, which are recommended for use instead of spread
and gather
, respectively. You can choose to use either! The syntax is slightly different though, so if in the demo we ran
data_long <- data %>% gather(Partner:M_Stranger, key='person', value='emotional_bond')
to do the same with pivot_longer, we would use
data_long <- data %>% pivot_longer(cols=Partner:M_Stranger, names_to='person', values_to='emotional_bond')
The end result is the same in both cases (except pivot_longer sorts the data by subid and gather sorts the data by person). For more information, see the function help or see here
Q3.1.2 What needs to get done to make your data tidy? For example, do you need to go from long to wide (spread
or pivot_wider
) or wide to long (gather
or pivot_longer
) or join data from different files (full_join
)?
Q3.1.3 Try to do that! In the example data set, I would argue that it makes most sense to have the data in a wide format (i.e. each food type has its own column). See if you can get it to there!
Demo 3: Manipulating data (talk)
Demo 3: Manipulating data (demo)
Q3.1.4 Inspect your data as we learned last time. How many missing values? Are there any values that might be wrongly coded NA's?
Tip: Look for negative values for variables which cannot be negative, like age, or much higher values than what is feasible, like BMI over 100. You can also try plotting your variables to observe very clear outliers.
Q3.1.5 Fix any wrongly coded NA's. If you're changing a single value, you can use data %>% mutate(<columnname> = na_if(<columnname>, <valuetoreplace>))
. If you're changing multiple values, you can use data %>% mutate(<columnname> = ifelse(<columnname>[logical expression, e.g. >100], NA, <columnname>))
Q3.1.6 Make sure all of your factors are factors. If not, transform them with mutate
Q3.1.7 See if you want to rename some columns with rename
to make the names shorter (and therefore quicker to type). In the example data set, the first five columns could benefit from a shorter name.
Q3.1.8 Create a new column by combining values from existing columns with mutate
. If you're working on the example data set, you can e.g. sum values for different types of bread into a new column called 'Bread'.
Q3.1.9 Use group_by
to group your data according to some factor(s) and then use tally()
(doesn't need any arguments if you use %>% to give it your data) to count occurrences in those groups.
Demo 3: Plotting with ggplot2 (talk)
Demo 3: Plotting with ggplot2 (demo)
Q3.2.1 Start by making a basic histogram
ggplot(data, aes(x=Freshveg)) +
geom_histogram()
Q3.2.2 Make sure you understand the different components of this function call. What does "aes" refer to? What happens if you remove the "+" at the end of the first row?
Q3.2.3 Ask the visuals to show boys and girls separately by adding fill=sex
(assuming you renamed co_Gender to sex in Q3.1.5).
Tip: the value you give fill has to be a factor - if you did not mutate your factors, you should do that now
Q3.2.4 Oops, looks like ggplot will by default stack the bars and not show them side by side. You can solve this by adding position="identity"
as an argument for geom_histogram
.
Q3.2.5 In some cases you can now only see the colour which is plotted on top. To make the bars slightly see-through, add alpha=0.5
in the call for geom_histogram
. You can also try to give other values between 0 and 1 to alpha. Will larger numbers make it more see-through or more solid?
Q3.2.6 To get the plot to show probability density instead of number of responses in a bin, add aes(y=..density..)
to geom_histogram.
Q3.2.7 Add title to the plot and axis labels with labs(title="Consumption of fresh vegetables", x="Times per week", y = "Density")
Tip: Remember that with ggplot, adding new elements happens by adding a + at the end of your previously last line and then adding the new function call below
Q3.2.8 Now we get to the fun part! You can change the look of the plot by choosing a theme. Look at the themes here https://ggplot2.tidyverse.org/reference/ggtheme.html#examples and apply one of them to your plot.
Q3.2.9 We can easily add other elements to the plot. For example, adding geom_density(alpha=0.2)
as a new function will overlay probability density function for our data. To superimpose a normal distribution curve, we can add stat_function(fun = dnorm, n = 101, args = list(mean = mean(data$Freshveg, na.rm=T), sd = sd(data$Freshveg, na.rm=T)), colour='red')
Q3.2.10 Now save the plot.
Tip: You have three different ways to do the saving. From easiest to most difficult, they are
ggsave(<filename>)
after making your plot to save it. NB: This might not work if you use base R or some other package to make your plot.pdf()
and dev.off()
as you did before. If you do this, you can either put your pdf()
and dev.off()
calls around the whole thing, or you can first assign the whole plot call to a variable (so the first line starts with p1 <- ggplot
etc.).Q3.2.11 Start by creating a simple scatter plot with geom_point
. If you want to add a bit of visual jitter (e.g. your data has a lot of similar values), you can use geom_jitter
instead.
Q3.2.12 Add a trend line with geom_smooth
.
Q3.2.13 Fine tune your plot: add more legible axis names if needed, pick your theme, change colours and and shapes of the dots. See this cheat sheet for help with ggplot!
Q3.2.14 Install and source package "patchwork"
Q3.2.15 Assign the barchart we did to p1 and the new plot you made to p2
Q3.2.16 Run p1 + p2
Q3.2.17 Run p1 / p2
Q3.2.18 Which one do you prefer? How would you add a third plot?
Think of a plot type you would like to do with your own data. Can you figure out how to do that in ggplot2? If you have made a plot from your own data with ggplot2, please share it on the course forum to show off and to motivate others!
Tip: Google (especially the image search) is your best friend in this! You can google ggplot2
Created by Juulia Suvilehto
juulia.suvilehto@iki.fi
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License