R series – 5: Subsetting and modifying data

Overview:
1. Introduction to subsetting – use square brackets [] to subset data in R
2. Subsetting a vector, matrix, data frame, and list.
3. Modifying data is easier when you know how to subset data.

In the previous post, I showed how to create a vector, matrix, data frame, and list. In this post, I show how to access and modify a piece of information in those objects.

A data analyst constantly needs to access a piece of information from a data set. Accessing some information (namely elements, columns, or rows) from a data set is called subsetting in R. Round brackets () are used when calling a function and passing values to it, whereas square brackets [] are used to access information stored in an object.
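As a small illustration of the two kinds of brackets, here is a minimal sketch. The vector name scores is made up for this example only.

```r
# Round brackets call a function: c() builds a vector, length() counts its values
scores <- c(12, 7, 25)   # 'scores' is a hypothetical example vector
n_scores <- length(scores)

# Square brackets subset: fetch the value stored at position 2
second_score <- scores[2]

print(n_scores)
print(second_score)
```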

I shall explain how to subset a vector, matrix, data frame, and list, and skip the subsetting of arrays. In general, researchers in fields such as marketing, psychology, and economics work mostly with data frames.

Why subsetting?
As explained earlier, a user may need only a few values of interest from a vector. Sometimes, they may also wish to delete a few values that they are not interested in. Subsetting serves both purposes. The example below illustrates the idea.

# General guidelines to subsetting

# Subsetting to select a few values

my_vector <- c(34, 50, 40, 80)

# Imagine the user wishes to extract only the 2nd and 3rd values of the vector.
# For that purpose, they can make use of the index of the vector.
# Index refers to the position of the values.
# The index in R starts from 1. In Python, it starts at 0. 
# Index of value 34 in my_vector is 1. 
# Index of value 80 in my_vector is 4.
# When you wish to extract multiple values, use the concatenate function - c().

my_vector[c(2,3)]

OUTPUT:

[1] 50 40

# Subsetting to remove/drop values - negative indexing

my_vector[-1]

OUTPUT:

[1] 50 40 80

# Subsetting with a logical condition - logical subsetting

my_vector[my_vector > 40]

OUTPUT:
[1] 50 80

# Modifying values by subsetting

# Let's replace the value at index 1 with 99, that is, change 34 to 99.

my_vector[1] <- 99

my_vector

OUTPUT:

[1] 99 50 40 80
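The same idea extends to several values at once. Here is a minimal sketch, using a hypothetical vector called marks that is not part of the example above.

```r
marks <- c(34, 50, 40, 80)   # hypothetical example vector

# Replace the 2nd and 3rd values in one step
marks[c(2, 3)] <- c(55, 45)

# Replace every value above 50 with 50 (logical subsetting on the left-hand side)
marks[marks > 50] <- 50

print(marks)
```

Any kind of subsetting shown so far (by position, negative, or logical) can appear on the left-hand side of the assignment.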

Now, we know why we use subsetting. Let’s try subsetting a vector, matrix, data frame, and list.

Subsetting a vector:

## Let's create a vector using the seq() function:

my_newvector <- seq(1, 20, by=2)

my_newvector 

OUTPUT:
 [1]  1  3  5  7  9 11 13 15 17 19

## Access the 5th and 7th values of the vector

my_newvector[c(5,7)]

OUTPUT:
[1]  9 13

## Show all the values except the 3rd and 8th values of the vector

my_newvector[-c(3,8)]

OUTPUT:
[1]  1  3  7  9 11 13 17 19

## Using logical subsetting to find values more than 10:

my_newvector[my_newvector > 10]

OUTPUT:
[1] 11 13 15 17 19

Subsetting a matrix:
The matrix consists of two dimensions (rows and columns). To access it, we have to supply the row and column information within the square brackets.

## Let's create a simple 3 x 3 matrix:

set.seed(100) 
# set.seed() makes the example reproducible.
# The example below uses the sample() function.
# sample() draws values at random from a vector, so the result would otherwise vary on every run.
# To replicate the results across different computers, set the seed with set.seed().
# The seed can be any number, but it must be the same on all the computers.

my_matrix <- matrix(sample(200:400, size=9), nrow=3, byrow=FALSE)

# The sample() function randomly draws values from the integers 200 to 400.
# The size argument of sample() is the number of values you wish to draw.

my_matrix

OUTPUT:

     [,1] [,2] [,3]
[1,]  301  397  269
[2,]  311  203  297
[3,]  350  254  334

## Now the subsetting gets a little tricky!
# As the matrix has rows and columns, we need to supply two indices within the square brackets.
# For example, to extract the value in the first row and first column, use my_matrix[1,1].
# In general, the call looks like this: my_matrix[row_index, column_index]

my_matrix[2,3] # this code will fetch the second row, third column value. 

OUTPUT:
[1] 297

# Imagine you wish to extract all the values of the 2nd row.

my_matrix[2,] # leave the column information empty

OUTPUT:
[1] 311 203 297


# TASK: It is your turn now. Try to fetch all the values of the third column.

# Negative indexing: display all values except those in the 2nd and 3rd rows and the 3rd column

my_matrix[-c(2,3), -3]

OUTPUT:
[1] 301 397
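Notice that the result above is printed as a plain vector rather than a matrix. By default, [ drops any dimension of length one, and the drop = FALSE argument keeps it. A minimal sketch on a small hand-built matrix (m is a hypothetical name used only here):

```r
m <- matrix(1:9, nrow = 3)   # hypothetical matrix, filled column by column

# Default behaviour: a single row comes back as a plain vector
row_as_vector <- m[1, -3]

# drop = FALSE keeps the result as a 1 x 2 matrix
row_as_matrix <- m[1, -3, drop = FALSE]

print(is.matrix(row_as_vector))
print(dim(row_as_matrix))
```

This matters mostly when later code expects a matrix, for example when calling nrow() or ncol() on the result.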

# Logical subsetting: 

# Fetch all the values greater than 300

my_matrix[my_matrix > 300]

OUTPUT:
[1] 301 311 350 397 334

# Modifying the values:
# Let's change the value in the third row, first column to 900

my_matrix[3,1] <- 900

my_matrix

OUTPUT:
     [,1] [,2] [,3]
[1,]  301  397  269
[2,]  311  203  297
[3,]  900  254  334

Subsetting a data frame:
Data frame subsetting works similarly to matrix subsetting, since a data frame is also two-dimensional, but it can hold multiple data types.

# Let's create a data frame with three columns and ten rows

set.seed(100)
my_df <- data.frame(age = sample(20:80, size=10),
                    daily_wage = sample(500:2000, size=10),
                    daily_expense = sample(300:1500, size=10))

my_df

OUTPUT:

   age daily_wage daily_expense
1   26       1781          1270
2   74       1958           927
3   62        823           904
4   75       1591          1481
5   37       1009          1032
6   31       1447          1410
7   54        787           522
8   70       1864          1031
9   27        846           550
10  76       1690           801

# If you wish to only fetch the age column:

my_df[ ,1] # or simply use the code below

my_df$age # the $ operator comes in handy when you work with data frames and lists, provided the columns and objects are named.

OUTPUT:
[1] 26 74 62 75 37 31 54 70 27 76
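Columns can also be selected by name inside the square brackets. Here is a minimal sketch on a small hand-typed data frame (people is a hypothetical name, so the values do not depend on the random seed):

```r
people <- data.frame(age = c(26, 74, 62),
                     daily_wage = c(1781, 1958, 823))   # hypothetical data

by_position <- people[, 1]       # column selected by its position
by_dollar   <- people$age        # column selected with the $ operator
by_name     <- people[, "age"]   # column selected by name inside []

# Selecting several columns by name returns a data frame
two_columns <- people[, c("age", "daily_wage")]

print(by_name)
print(two_columns)
```

Selecting by name is less fragile than selecting by position when the column order might change.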

# Negative indexing:

# Let's remove the first five rows of the data frame.

my_df[-c(1:5), ] # similar to matrix subsetting: leaving the column index empty selects all the columns

OUTPUT:
   age daily_wage daily_expense
6   31       1447          1410
7   54        787           522
8   70       1864          1031
9   27        846           550
10  76       1690           801

# Logical subsetting:

# Fetch the rows where age is more than 50 and daily expense is more than 1000

my_df[my_df$age > 50 & my_df$daily_expense > 1000, ] # here we are using & operator to indicate that we want those two conditions to be met

OUTPUT:
  age daily_wage daily_expense
4  75       1591          1481
8  70       1864          1031
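To see what happens inside the square brackets, it can help to store the condition first. A minimal sketch with hand-typed values (toy_df and keep are hypothetical names used only for this illustration):

```r
toy_df <- data.frame(age = c(26, 75, 54, 70),
                     daily_expense = c(1270, 1481, 522, 1031))   # hypothetical data

# Each comparison returns one TRUE/FALSE per row; & combines them element by element
keep <- toy_df$age > 50 & toy_df$daily_expense > 1000
print(keep)

# Rows where keep is TRUE are returned; the empty column slot keeps all columns
selected <- toy_df[keep, ]
print(selected)
```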

# Modifying the values:

# Let's change the age value that is equal to 54 to 87

my_df[my_df$age == 54, 1] <- 87

# If you leave the column index empty (for example: my_df[my_df$age == 54, ] <- 87), every value in the matching row will be replaced with 87

my_df

OUTPUT:

   age daily_wage daily_expense
1   26       1781          1270
2   74       1958           927
3   62        823           904
4   75       1591          1481
5   37       1009          1032
6   31       1447          1410
7   87        787           522
8   70       1864          1031
9   27        846           550
10  76       1690           801
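The difference between supplying and omitting the column index can be checked on a small copy. This is a sketch with hand-typed values (toy_df, fixed, and overwritten are hypothetical names):

```r
toy_df <- data.frame(age = c(26, 54, 62),
                     daily_wage = c(1781, 787, 823))   # hypothetical data

# Column index supplied: only the age cell of the matching row changes
fixed <- toy_df
fixed[fixed$age == 54, "age"] <- 87

# Column index left empty: every column of the matching row becomes 87
overwritten <- toy_df
overwritten[overwritten$age == 54, ] <- 87

print(fixed)
print(overwritten)
```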

Subsetting a list:
A list can hold any form of data. Hence, let’s save the vector, matrix, and data frame we created in a list. Single square brackets [ applied to a list return a smaller list, so to access an object stored in a list we use double square brackets [[ instead. Adding single square brackets after the double square brackets then lets us fetch information from within that object.
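The difference between [ and [[ on a list is easiest to see by checking the class of the result. A minimal sketch on a tiny hand-built list (toy_list and its element names are hypothetical):

```r
toy_list <- list(numbers = c(34, 50, 40, 80),
                 letters3 = c("a", "b", "c"))   # hypothetical list

single <- toy_list[1]     # single brackets: still a list, holding one element
double <- toy_list[[1]]   # double brackets: the numeric vector itself

print(class(single))
print(class(double))

# A single bracket after the double bracket subsets inside the extracted object
third_value <- toy_list[[1]][3]
print(third_value)
```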

# Create a list with all objects we created until now.

set.seed(150)
my_list <- list(my_vector = c(34, 50, 40, 80), 
                my_matrix = matrix(sample(200:400, size=9), nrow=3, byrow=FALSE),
                my_df = data.frame(age = sample(20:80, size=10),
                                   daily_wage = sample(500:2000, size=10),
                                   daily_expense = sample(300:1500, size=10)))

# Note: the values in the objects (except my_vector) differ from the previous outputs because they are created with a different seed.
# Printing my_list

my_list

OUTPUT:

$my_vector
[1] 34 50 40 80

$my_matrix
     [,1] [,2] [,3]
[1,]  234  330  380
[2,]  346  219  214
[3,]  344  310  334

$my_df
   age daily_wage daily_expense
1   40       1920          1456
2   27       1386           312
3   30       1050           785
4   48       1484          1123
5   33       1034          1190
6   80        593          1421
7   71       1689           957
8   37       1147          1118
9   46        881           506
10  32        635           722

# To access my_vector from my_list, one can use either [[ ]] or the $ operator

my_list[[1]]

# or 

my_list$my_vector

# both produce the same result

OUTPUT: 
[1] 34 50 40 80

# Selecting values or elements within the list:

# Let's access the 5th row from my_df

my_list[[3]][5,] # alternatively one can use my_list$my_df[5,] - this is simple to use

OUTPUT:
  age daily_wage daily_expense
5  33       1034          1190

# Modifying a value of an object within the list:

# Let's change the 3rd row, 3rd column value of my_df to 5000

my_list$my_df[3,3] <- 5000 # or my_list[[3]][3,3] <- 5000

# Print my_df from my_list

my_list$my_df

OUTPUT:
   age daily_wage daily_expense
1   40       1920          1456
2   27       1386           312
3   30       1050          5000 # my_df with modified value
4   48       1484          1123
5   33       1034          1190
6   80        593          1421
7   71       1689           957
8   37       1147          1118
9   46        881           506
10  32        635           722

In the next blog post, I shall focus on reading CSV and Excel files.
