Overview:
1. Installing dplyr package – install.packages('dplyr')
2. Main dplyr verbs – select, filter, mutate, group_by, summarise, and arrange.
3. Using select function – df %>% select()

A researcher’s life is easier when they can wrangle data quickly. Typically, a researcher checks the data for outliers, normality, and missing entries. The dplyr package is a convenient tool for dealing with such anomalies. It offers many functions, known as verbs, but six of them cover most data wrangling tasks.

Installing dplyr package:

To install the dplyr package, run the install.packages('dplyr') command. Alternatively, you can install the tidyverse package, which includes dplyr, ggplot2, stringr, and many more packages.

dplyr verbs:

1. select() – helps in selecting columns
2. mutate() – helps in creating and modifying columns
3. filter() – helps in selecting observations or information row wise
4. group_by() – helps in grouping the data
5. summarise() – helps in creating a single value (e.g., average)
6. arrange() – helps in sorting the rows in ascending or descending order
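
To make the list above concrete, here is a minimal sketch that calls each verb once. The scores data frame and its column names are made up for illustration and are not part of the original example.

```r
library(dplyr)

# Hypothetical example data (an assumption for illustration only)
scores <- data.frame(
  student = c("A", "B", "C", "D"),
  group   = c("x", "x", "y", "y"),
  marks   = c(50, 70, 60, 90)
)

picked   <- select(scores, student, marks)        # keep two columns
added    <- mutate(scores, percent = marks / 100) # create a new column
passed   <- filter(scores, marks >= 60)           # keep rows that meet a condition
by_group <- summarise(group_by(scores, group),    # one mean per group
                      mean_marks = mean(marks))
sorted   <- arrange(scores, desc(marks))          # sort rows, highest marks first
```

Each call returns a new data frame and leaves scores unchanged, so the results can be inspected one at a time.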

In this blog post, let’s learn how to use the select() function with some practical examples. As noted above, select() picks columns; it can also rename them as it selects.

Selecting columns:
Imagine you have a data frame named df with column names col1, col2, col3, col4, col4_1, col4_2, col4_3, and col5. If you only need col1 and col2 for the data analysis, you can use the select() function. Keep in mind that the dplyr package must be loaded first.

As an aside, the dplyr package also lets data analysts chain operations. Suppose you wish to select columns, then group the data, and then extract the mean values for each group. You do not have to perform these steps separately; you can chain them with the pipe operator, written as %>%. The pipe comes from the magrittr package and is made available when you load dplyr.
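
The chain described above (select, then group, then take the mean) might look like the sketch below. The measurements data frame and its columns are hypothetical and only serve to show the pipe in action.

```r
library(dplyr)

# Hypothetical data (an assumption for illustration only)
measurements <- data.frame(
  site     = c("north", "north", "south", "south"),
  observer = c("p", "q", "p", "q"),
  value    = c(10, 20, 30, 50)
)

site_means <- measurements %>%
  select(site, value) %>%             # step 1: keep only the needed columns
  group_by(site) %>%                  # step 2: group the rows by site
  summarise(mean_value = mean(value)) # step 3: one mean per group

site_means
```

Each pipe passes the result of the step on its left to the function on its right, so the code reads in the same order as the steps are described.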

# Step 1: Load the dplyr package:

library(dplyr) 

# For my data analysis tasks, I load the tidyverse package instead of loading dplyr, ggplot2 and others separately.

# Step 2: Creating a dummy data set based on the example:

df <- data.frame(col1 = 1:5, col2 = 6:10, col3 = 11:15, col4 = 16:20, col4_1 = 21:25, col4_2 = 26:30, col4_3 = 31:35, col5 = 36:40)

# Let's select col1 and col2

# Syntax: select(yourdataframe, column_names)

select(df, col1, col2)

# Using the dplyr pipe -  %>% 

# Syntax: yourdataframe %>% select(column_names)

df %>% select(col1, col2) 

# Using select function to rename columns

# Let's rename col1 to column_1 and col2 to column_2
# Please be aware that when you select and rename, only the columns that you specify within the select function are kept for later use
# If you wish to only rename columns - use rename() function

# Syntax: df %>% select( new_column_name = old_column_name)

df %>% select(column_1 = col1, column_2 = col2)

# Renaming column with rename() function

# Syntax: rename( new_column_name = old_column_name )

df %>% rename(column_3 = col3) # Note: Here we are not selecting columns; we are only renaming col3. Hence, all the columns are kept.
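
To see the difference between the two approaches side by side, here is a small sketch that compares the column names after each call. It uses a shortened, three-column version of df so that it can run on its own.

```r
library(dplyr)

# Shortened version of the example data frame
df <- data.frame(col1 = 1:5, col2 = 6:10, col3 = 11:15)

renamed_with_select <- df %>% select(column_1 = col1)
renamed_with_rename <- df %>% rename(column_1 = col1)

names(renamed_with_select) # "column_1"
names(renamed_with_rename) # "column_1" "col2" "col3"
```

select() keeps only what you name, while rename() keeps every column and changes just the names you ask for.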

Selecting multiple columns:
In the previous demonstration, I showed how to select col1 and col2 by name. The same task can be done using the positions (indexes) of the columns. For example, col1 and col2 are in positions 1 and 2. Hence, one can simply use 1:2 within the select function to select col1 and col2.

# Selecting series of columns using index:
# Task: select the first four columns of the data frame

df %>% select(1:4) 

OUTPUT:

  col1 col2 col3 col4
1    1    6   11   16
2    2    7   12   17
3    3    8   13   18
4    4    9   14   19
5    5   10   15   20

# Selecting columns of interest that do not form a consecutive range:
# Imagine you wish to select the first, third and fifth columns
# Then you can make use of the concatenate function, c()

df %>% select(c(1,3,5))

OUTPUT:

  col1 col3 col4_1
1    1   11     21
2    2   12     22
3    3   13     23
4    4   14     24
5    5   15     25
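
One thing worth keeping in mind with positions: they refer to wherever a column currently sits, so the same index can pick a different column once the order changes. The sketch below illustrates this with a five-column version of df; the reordering step is my own addition for illustration.

```r
library(dplyr)

# First five columns of the example data frame
df <- data.frame(col1 = 1:5, col2 = 6:10, col3 = 11:15,
                 col4 = 16:20, col4_1 = 21:25)

by_position <- df %>% select(c(1, 3, 5))
by_name     <- df %>% select(col1, col3, col4_1)
names(by_position) # "col1" "col3" "col4_1"
names(by_name)     # "col1" "col3" "col4_1"

# Reverse the column order, then use the same positions again
reordered <- df %>% select(5:1)
names(reordered %>% select(c(1, 3, 5))) # "col4_1" "col3" "col1"
```

Selecting by name is therefore the safer choice when the column order of your data might change.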

Deselecting columns:
Now we know we can select columns using the select function. Imagine you are asked to remove col4_1 from the data frame. What should we do here? We can use the negation operator ( ! ) in the select function to drop columns.

# Drop one column:
# Solution to remove col4_1 

df %>% select(!col4_1)

OUTPUT:

  col1 col2 col3 col4 col4_2 col4_3 col5
1    1    6   11   16     26     31   36
2    2    7   12   17     27     32   37
3    3    8   13   18     28     33   38
4    4    9   14   19     29     34   39
5    5   10   15   20     30     35   40

# Drop multiple columns
# Let's remove col1, col3, col4_1, col4_2, col4_3

df %>% select(!c(col1, col3, col4_1, col4_2, col4_3))

OUTPUT:

  col2 col4 col5
1    6   16   36
2    7   17   37
3    8   18   38
4    9   19   39
5   10   20   40
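
You may also come across the minus sign ( - ) used for dropping columns in older code. The sketch below, on a shortened df, suggests that both forms give the same result for the cases shown here. As far as I know, the ! form needs dplyr 1.0.0 or later, so treat that version number as an assumption and check your installed version.

```r
library(dplyr)

# Shortened version of the example data frame
df <- data.frame(col1 = 1:5, col2 = 6:10, col3 = 11:15, col4 = 16:20)

# Drop one column
names(df %>% select(!col3)) # "col1" "col2" "col4"
names(df %>% select(-col3)) # "col1" "col2" "col4"

# Drop several columns by wrapping them in c()
dropped_bang  <- df %>% select(!c(col1, col3))
dropped_minus <- df %>% select(-c(col1, col3))
names(dropped_bang)  # "col2" "col4"
names(dropped_minus) # "col2" "col4"
```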

I hope you are now familiar with the select() function. In the next blog post, I shall focus on the selection helper functions and their uses.
