Chapter 3 Programming in R
Optional reading: R Cookbook Ch 15
3.1 Writing functions in R
When you have to copy and paste some code more than 2 times, you should consider writing a function
Writing a function can simplify your code and isolate the main part of your program
General format of a function
Example: Let’s try to write a function to compute the sample variance.
my_var <- function(x){
mean_x <- mean(x)
n <- length(x)
return(sum((x - mean_x)^2) / (n - 1))
}
y <- 1:9
my_var(y)
## [1] 7.5
var(y) # compared with the bulit-in function
## [1] 7.5
x <- rnorm(1000, mean = 0, sd = 2)
my_var(x)
## [1] 3.899088
var(x) # why the result is not equal to 4?
## [1] 3.899088
- We can also write
- The variable
x
is the argument to be passed into the function. The variablesmean_x
andn
are local variables whose scope is within this function.
y <- 2
f <- function(x) {
y <- x
x <- 4
y
}
y
## [1] 2
f(3) # output the value f(3)
## [1] 3
y # y is unchanged, y is defined in the global environment
## [1] 2
x
## Error in eval(expr, envir, enclos): object 'x' not found
We shall write code using proper indentation (easier to read and debug)
# with indentation (use this one)
my_var <- function(x) {
mean_x <- mean(x)
n <- length(x)
return(sum((x - mean_x)^2) / (n - 1))
}
# no indentation
my_var <- function(x) {
mean_x <- mean(x)
n <- length(x)
return(sum((x - mean_x)^2) / (n - 1))
}
The number of arguments passed to a function can be more than one
Example: Write a function to compute the pooled sample standard deviation of two independent samples \(x_1,\ldots,x_n\) and \(y_1,\ldots,y_m\) of sizes \(n\) and \(m\). Recall that the pooled sample standard deviation is defined as: \[ S_p := \sqrt{\frac{(n-1)S^2_X + (m-1)S^2_Y}{m+n-2}}, \] where \(S^2_X\) and \(S^2_Y\) are the sample variances of \(x_1,\ldots,x_n\) and \(y_1,\ldots,y_m\), respectively.
pooled_sd <- function(x, y) {
n <- length(x)
m <- length(y)
return(sqrt(((n - 1) * var(x) + (m - 1) * var(y)) / (m + n - 2)))
}
Remark: if the final statement will output something, it will be the output of the function. You can also use return()
as above. That is, pooled_sd
and pooled_sd2
are exactly the same.
pooled_sd2 <- function(x, y) {
n <- length(x)
m <- length(y)
sqrt(((n - 1) * var(x) + (m - 1) * var(y)) / (m + n - 2))
}
You can return more than one value in a function
my_var_sd <- function(x) {
mean_x <- mean(x)
n <- length(x)
my_var <- sum((x - mean_x)^2) / (n - 1)
return(c(my_var, sqrt(my_var)))
}
You may also return a list
my_var_sd <- function(x) {
mean_x <- mean(x)
n <- length(x)
my_var <- sum((x - mean_x)^2) / (n - 1)
output <- list(var = my_var, sd = sqrt(my_var))
return(output)
}
Example: write a function called my_summary
that will output a list with elements being equal to the mean, sd, median, min and max of a given vector.
my_summary <- function(x) {
output <- list(mean = mean(x), sd = sd(x), median = median(x),
min = min(x), max = max(x))
return(output)
}
my_summary(1:10)
## $mean
## [1] 5.5
##
## $sd
## [1] 3.02765
##
## $median
## [1] 5.5
##
## $min
## [1] 1
##
## $max
## [1] 10
Define a function with default value
my_power <- function(x, p = 2) {
return(x^p)
}
my_power(3) # by default, p = 2, will compute 3^2
## [1] 9
my_power(3, 3) # will compute 3^3
## [1] 27
Examples with matrix input
Write a function called matrix_times_vector
to compute \(X Y\), where \(X\) is a matrix and \(Y\) is a vector. The output should be a vector.
matrix_times_vector = function(X, Y) {
as.vector(X %*% Y)
}
# e.g.
X <- matrix(1:12, 3, 4)
Y <- 1:4
X
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
Y
## [1] 1 2 3 4
matrix_times_vector(X, Y)
## [1] 70 80 90
Note:
X %*% Y
will return a matrix. We can useas.vector
to change it into a vector.- It is common to see the error
non-conformable arguments
. This is because the dimensions of your matrices/vectors do not match.
- If you have a \(n\times p\) matrix \(A\) and \(m \times q\) matrix \(B\), you can do the matrix multiplication \(AB\) only if \(p = m\). In R, if this is not the case, there will be an error.
- Similarly, if you have a vector \(d\) of length \(m\). You can do the matrix multiplication \(A d\) only if \(p = m\).
Write a function called matrix_times_vector2
to compute \(X Y\), where \(X\) is a matrix and \(Y\) is a vector. The output should be a vector. However, you should check if the dimensions of the inputs are appropriate before you perform the calculation. Display an error message The dimensions do not match
if this is not the case.
matrix_times_vector2 = function(X, Y) {
p <- ncol(X)
m <- length(Y)
if (p == m) {
return(as.vector(X %*% Y))
} else {
cat("The dimensions do not match")
}
}
# e.g.
X <- matrix(1:12, 4, 3)
Y <- 1:4
matrix_times_vector2(X, Y)
## The dimensions do not match
X <- matrix(1:12, 3, 4)
Y <- 1:4
matrix_times_vector2(X, Y)
## [1] 70 80 90
3.1.1 Argument Matching
Two ways of calling R function with arguments:
- by position of the argument
- by name of the argument
Example:
set.seed(1)
rnorm(5, 1, 2) # generate 5 rv from N(0, 2^2)
## [1] -0.2529076 1.3672866 -0.6712572 4.1905616
## [5] 1.6590155
set.seed(1)
rnorm(n = 5, mean = 1, sd = 2) # this is the same as above
## [1] -0.2529076 1.3672866 -0.6712572 4.1905616
## [5] 1.6590155
When you call the function by specifiying the name of the argument, the order does not matter:
set.seed(1)
rnorm(mean = 1, n = 5, sd = 2)
## [1] -0.2529076 1.3672866 -0.6712572 4.1905616
## [5] 1.6590155
But of course, one should not change the order generally.
3.2 Control Flow
3.2.1 for loop
You can use a for loop when you know how many times you will loop.
Syntax:
Examples:
for (i in 1:5) { # note: you do not have to define i beforehand
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
for (i in c(1, 3, 6)) {
print(i)
}
## [1] 1
## [1] 3
## [1] 6
Example: write a function with a for loop to produce a conversion table of temperature from Fahrenheit (from \(0\) to \(200\) with increment \(20\)) to their Celsius equivalent.
conv_table <- function(low, up, step) {
f_temp <- seq(low, up, step)
n <- length(f_temp)
c_temp <- rep(0, n)
for (i in 1:n) {
c_temp[i] <- (5 / 9) * (f_temp[i] - 32)
}
# alternatively, we can use the vectorized operation
# c_temp <- (5 / 9) * (f_temp - 32)
cbind(f_temp, c_temp)
}
# test your function
conv_table(0, 200, 20)
## f_temp c_temp
## [1,] 0 -17.777778
## [2,] 20 -6.666667
## [3,] 40 4.444444
## [4,] 60 15.555556
## [5,] 80 26.666667
## [6,] 100 37.777778
## [7,] 120 48.888889
## [8,] 140 60.000000
## [9,] 160 71.111111
## [10,] 180 82.222222
## [11,] 200 93.333333
3.2.2 nested for loop
for loops can be nested inside of each other
Examples
- Write R code to find \(\sum^{10}_{i=1} \sum^4_{j=1} \frac{i^2}{(i+j)^2}\).
- Write R code to find \(\sum^{10}_{i=1} \sum^i_{j=1} \frac{i^2}{(i+j)^3}\).
3.2.3 while loop
You can use a while loop if you want to loop until a specific condition is met. For example, when you minimize a function numerically using some iterative algorithm, you may want to stop when the objective value does not change much. You may not know how many loops are required in advance so that a while loop may be better than a for loop in this application.
Syntax:
A simple example:
i <- 1
while (i < 6) {
print(i)
i <- i + 1
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
# What happen if you change "i < 6" to "i <= 6"? Ans: will print 1 to 6
# What happen if you change "i < 6" to "i <= 5"? Ans: outputs are the same
Another example:
3.2.4 if (cond)
Syntax:
Example: write a function that outputs “positive” if a positive number is entered.
3.2.5 if (cond) else expr
Syntax
if (condition) {
statement1 # do this if condition is TRUE
} else {
statement2 # do this if condition is FALSE
}
Example:
# write my own absolute value function
my_abs <- function(x) {
if (x>=0) {
return(x)
} else {
return(-x)
}
}
my_abs(-2)
## [1] 2
my_abs(3)
## [1] 3
my_abs(0)
## [1] 0
Error-handling in a function:
3.2.6 If else ladder
Syntax
# Example
if (condition1) {
statement1
} else if (condition2) {
statement2
} else if (condition1) {
statement3
}
Example:
score_to_grade = function(x) {
if (x>=90) {
cat("A+")
} else if (x >= 85) {
cat("A")
} else if (x >= 80) {
cat("A-")
} else {
cat("B+ or below")
}
}
# after you write the function, you should check each case carefully
score_to_grade(92)
## A+
score_to_grade(88)
## A
score_to_grade(83)
## A-
score_to_grade(78)
## B+ or below
3.2.7 switch
Suppose you wish to write a function to generate random variables with two options: standard normal and uniform.
If we use if else, then
rdist <- function(n, dist) {
if (dist == "norm") {
rnorm(n)
} else if (dist == "unif") {
runif(n)
}
}
Using switch
:
3.2.8 next
, break
break
breaks out of a for, while or repeat loop; control is transferred to the first statement outside the inner-most loop.
next
halts the processing of the current iteration and advances the looping index.
Both break and next apply only to the innermost of nested loops.
Example of break
:
for (j in 1:3) {
for (i in 1:5) {
if (i <= 3) {
next
}
print(c(i, j))
}
}
## [1] 4 1
## [1] 5 1
## [1] 4 2
## [1] 5 2
## [1] 4 3
## [1] 5 3
Example of next
:
3.3 Loop functions
3.3.1 apply()
Apply a function over the margins of an array/ matrix
Basic usage: apply(X, MARGIN, FUN)
Margin
: For a matrix, 1
indicates rows, 2
indicates columns
FUN
: function to be applied
Examples
X <- matrix(runif(20), nrow = 4, ncol = 5)
# row sum
apply(X, 1, sum)
## [1] 2.020915 1.959644 3.476258 2.314615
# column sum
apply(X, 2, sum)
## [1] 1.453658 2.977065 2.304328 1.430564 1.605818
# row mean
apply(X, 1, mean)
## [1] 0.4041831 0.3919289 0.6952516 0.4629231
# column mean
apply(X, 2, mean)
## [1] 0.3634145 0.7442663 0.5760820 0.3576409 0.4014545
For the special cases of finding row/column sums and means of matrices, there are specific functions:
X <- matrix(runif(20), nrow = 4, ncol = 5)
rowSums(X)
## [1] 2.839428 2.709573 3.370761 2.423309
rowMeans(X)
## [1] 0.5678857 0.5419147 0.6741522 0.4846618
colSums(X)
## [1] 1.761405 2.398024 2.602992 2.655045 1.925607
colMeans(X)
## [1] 0.4403512 0.5995059 0.6507480 0.6637612 0.4814016
They are faster and their names are easier to understand when reading the code.
3.3.2 lapply()
lapply()
: Apply a function over a list of vector
It will return a list of the same length as your input.
Examples:
set.seed(1)
n <- 1:5
lapply(n, rnorm) # the result is a list of 5 elements
## [[1]]
## [1] -0.6264538
##
## [[2]]
## [1] 0.1836433 -0.8356286
##
## [[3]]
## [1] 1.5952808 0.3295078 -0.8204684
##
## [[4]]
## [1] 0.4874291 0.7383247 0.5757814 -0.3053884
##
## [[5]]
## [1] 1.5117812 0.3898432 -0.6212406 -2.2146999
## [5] 1.1249309
The following for loop will give the same result
set.seed(1)
output <- list()
for (i in 1:5) {
output[[i]] <- rnorm(i)
}
output
## [[1]]
## [1] -0.6264538
##
## [[2]]
## [1] 0.1836433 -0.8356286
##
## [[3]]
## [1] 1.5952808 0.3295078 -0.8204684
##
## [[4]]
## [1] 0.4874291 0.7383247 0.5757814 -0.3053884
##
## [[5]]
## [1] 1.5117812 0.3898432 -0.6212406 -2.2146999
## [5] 1.1249309
3.4 Automatically Reindent Code
To indent a block of code, highlight the text in RStudio, then press Ctrl+i (Windows or Linux) or press Cmd+i (Mac).
Poor indentation, difficult to read
for (i in 1:5) {
if (i >= 3) {
print(i * 2)
} else {
print(i * 3)
}
}
## [1] 3
## [1] 6
## [1] 6
## [1] 8
## [1] 10
Highlight the block of code, press Ctrl+i or Cmd+i
3.5 Speed Consideration
While the computing power is getting stronger and stronger, we should still write code that runs efficiently.
# suppose we want to simulate 200,000 normal random variables
n <- 200000
x <- rep(0, n) # create a vector for storage of the values
initial_time <- proc.time()
for (i in 1:n) {
x[i] <- rnorm(1)
}
proc.time() - initial_time
## user system elapsed
## 0.33 0.03 0.44
n <- 200000
x <- rep(0, n) # create a vector for storage of the values
# Alternatively
system.time({
for (i in 1:n) {
x[i] <- rnorm(1)
}
})
## user system elapsed
## 0.39 0.02 0.39
The ‘user time’ is the CPU time charged for the execution of user instructions of the calling process. The ‘system time’ is the CPU time charged for execution by the system on behalf of the calling process.
A much more efficient way for the same task is to use
Another example:
set.seed(1)
x <- rnorm(2e6)
y <- rnorm(2e6)
v <- rep(0, 2e6)
system.time({
for (i in 1:length(x)){
v[i] <- x[i] + y[i]
}
})
## user system elapsed
## 0.09 0.00 0.09
system.time(v <- x + y)
## user system elapsed
## 0 0 0
The general rule is to use vectorized operations whenever possible and to avoid using for loops. We use a for loop when the code is not time-consuming or when the code is hard to write without using a for loop. A more advanced option is to combine C++ with R using the package rcpp
. That is, we can write the most time-consuming part of the R code in C++, which could run many times faster (will not be discussed in this course).
See also http://www.noamross.net/archives/2014-04-16-vectorization-in-r-why/
3.6 Another Simulation Example
A simple model on the stock return assumes that (i)
\[r_{t+1} := \log \frac{P_{t+1}}{P_t} \sim N(\mu,\sigma^2), \]
where \(r_{t+1}\) is the log-return at Day \(t+1\), \(P_t\) is the stock price at the end of Day \(t\); (ii) \(r_1,r_2,\ldots\) are iid. Simple algebra shows that \[P_{t+1} = P_t e^{r_{t+1}}.\] Applying the above equation repeatedly, we have
\[P_{t+1} = P_{t-1}e^{r_{t+1}} e^{r_t} = \ldots = P_0 e^{r_{t+1}} e^{r_t} \cdots e^{r_1} = P_0 e^{ \sum^{t+1}_{i=1} r_i}.\]
Suppose that the current price of a certain stock \(P_0\) is \(100\), \(\mu = 0.0002\) and \(\sigma = 0.015\). Using simulation, estimate the probability that the price is below $95 at the close of at least one of the next 30 trading days.
3.7 Additional Exercises
- Write a function that takes two numbers as arguments and returns the sum of the numbers.
- Write a function that takes a vector as an argument and returns the mean of the elements of the vector.
my_mean <- function(x) {
# let's use a for loop to compute the mean
n <- length(x)
output <- 0
for (i in 1:n) {
output <- output + x[i]
}
return(output / n)
}
my_mean(c(1, 3, 11))
## [1] 5
- Write a function that takes a matrix as an argument and returns the sum of the diagonal elements.
sum_diag <- function(A) {
return(sum(diag(A)))
}
C <- matrix(1:81, nrow = 9, ncol = 9)
sum_diag(C)
## [1] 369
- Write a for loop that iterates from 1 to 10 and prints the square of each number.
for (i in 1:10) {
print(i^2)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
## [1] 36
## [1] 49
## [1] 64
## [1] 81
## [1] 100
- Write a for loop that iterates through a vector of names and prints each name.
v_names <- c("A", "BC", "EO", "QP")
for (i in 1:length(v_names)) {
print(v_names[i])
}
## [1] "A"
## [1] "BC"
## [1] "EO"
## [1] "QP"
# alternatively
for (i in v_names) {
print(i)
}
## [1] "A"
## [1] "BC"
## [1] "EO"
## [1] "QP"
- Write a for loop that iterates through a vector of numbers and prints only the even numbers.
x <- c(1, 5, 3, 4, 2, 10)
for (i in 1:length(x)) {
if (x[i] %% 2 == 0) {
print(x[i])
}
}
## [1] 4
## [1] 2
## [1] 10
- Sum the even numbers in a vector
x <- c(1, 5, 3, 4, 2, 10)
x %% 2 == 0 # vectorized operation
## [1] FALSE FALSE FALSE TRUE TRUE TRUE
sum(x[x %% 2 == 0])
## [1] 16
# alternative way of doing the same task
output <- 0
for (i in 1:length(x)) {
if (x[i] %% 2 == 0) {
output <- output + x[i]
}
}
output
## [1] 16
- Generate some random integers from 1 to 1 million
sample(1:1e6, size = 10)
## [1] 417176 357954 20243 436290 659809 742377 660262
## [8] 861097 500307 900017
- Write an if statement that checks if a variable
x
is equal to 5, and if so print “x is equal to 5”.
- Write an if-else statement that checks if a variable y is greater than 10 and if so, print “y is greater than 10”; otherwise, print “y is less than or equal to 10”.
y <- 10
if (y > 10) {
print("y is greater than 10")
} else {
print("y is less than or equal to 10")
}
## [1] "y is less than or equal to 10"
- Write a nested if-else statement that checks if a variable z is positive, and if so, check if it is even or odd and prints “z is positive and even”. Print “z is not positive” if z is not positive.
z <- -2
if (z > 0) {
if (z %% 2 == 0) {
print("z is positive and even")
}
} else {
print("z is not positive")
}
## [1] "z is not positive"
- Another example
check_pos_even <- function(x) {
if (x > 0) {
pos_neg <- "+ve"
if (x %% 2 == 0) {
even_odd <- "even"
} else {
even_odd <- "odd"
}
} else {
pos_neg <- "-ve"
if (x %% 2 == 0) {
even_odd <- "even"
} else {
even_odd <- "odd"
}
}
print(paste0(pos_neg, " and ", even_odd))
}
check_pos_even(1)
## [1] "+ve and odd"
- Person \(A\) generates a random variable \(X \sim N(2, 1)\) and Person \(B\) generates a random variable \(Z \sim Unif(0, 4)\). If \(X < Z\), Person \(A\) will discard \(X\) and generate another random variable \(Y \sim Exp(0.5)\). Find the probability that the number generated by \(A\) is greater than that by \(B\).