Chapter 3 Programming in R

Optional reading: R Cookbook Ch 15

3.1 Writing functions in R

  • When you have to copy and paste some code more than 2 times, you should consider writing a function

  • Writing a function can simplify your code and isolate the main part of your program

General format of a function

function_name <- function(argument1, argument2) {
  statements
}

Example: Let’s try to write a function to compute the sample variance.

my_var <- function(x){
  mean_x <- mean(x)
  n <- length(x)
  return(sum((x - mean_x)^2) / (n - 1))
}

y <- 1:9

my_var(y)
## [1] 7.5

var(y) # compared with the bulit-in function
## [1] 7.5

x <- rnorm(1000, mean = 0, sd = 2)
my_var(x)
## [1] 3.899088
var(x) # why the result is not equal to 4?
## [1] 3.899088
  • We can also write
my_var2 <- function(x) {sum((x - mean(x))^2) / (length(x) - 1)}
  • The variable x is the argument to be passed into the function. The variables mean_x and n are local variables whose scope is within this function.
y <- 2

f <- function(x) {
  y <- x
  x <- 4
  y
}

y
## [1] 2

f(3) # output the value f(3)
## [1] 3

y # y is unchanged, y is defined in the global environment
## [1] 2

x
## Error in eval(expr, envir, enclos): object 'x' not found

We shall write code using proper indentation (easier to read and debug)

# with indentation (use this one)
my_var <- function(x) {
  mean_x <- mean(x)
  n <- length(x)
  return(sum((x - mean_x)^2) / (n - 1))
}

# no indentation
my_var <- function(x) {
mean_x <- mean(x)
n <- length(x)
return(sum((x - mean_x)^2) / (n - 1))
}

The number of arguments passed to a function can be more than one

Example: Write a function to compute the pooled sample standard deviation of two independent samples \(x_1,\ldots,x_n\) and \(y_1,\ldots,y_m\) of sizes \(n\) and \(m\). Recall that the pooled sample standard deviation is defined as: \[ S_p := \sqrt{\frac{(n-1)S^2_X + (m-1)S^2_Y}{m+n-2}}, \] where \(S^2_X\) and \(S^2_Y\) are the sample variances of \(x_1,\ldots,x_n\) and \(y_1,\ldots,y_m\), respectively.

pooled_sd <- function(x, y) {
  n <- length(x)
  m <- length(y)
  return(sqrt(((n - 1) * var(x) + (m - 1) * var(y)) / (m + n - 2)))
}

Remark: if the final statement will output something, it will be the output of the function. You can also use return() as above. That is, pooled_sd and pooled_sd2 are exactly the same.

pooled_sd2 <- function(x, y) {
  n <- length(x)
  m <- length(y)
  sqrt(((n - 1) * var(x) + (m - 1) * var(y)) / (m + n - 2))
}

You can return more than one value in a function

my_var_sd <- function(x) {
  mean_x <- mean(x)
  n <- length(x)
  my_var <- sum((x - mean_x)^2) / (n - 1)
  return(c(my_var, sqrt(my_var)))
}

You may also return a list

my_var_sd <- function(x) {
  mean_x <- mean(x)
  n <- length(x)
  my_var <- sum((x - mean_x)^2) / (n - 1)
  output <- list(var = my_var, sd = sqrt(my_var))
  return(output)
}

Example: write a function called my_summary that will output a list with elements being equal to the mean, sd, median, min and max of a given vector.

my_summary <- function(x) {
  output <- list(mean = mean(x), sd = sd(x), median = median(x), 
                 min = min(x), max = max(x))
  return(output)  
}

my_summary(1:10)
## $mean
## [1] 5.5
## 
## $sd
## [1] 3.02765
## 
## $median
## [1] 5.5
## 
## $min
## [1] 1
## 
## $max
## [1] 10

Define a function with default value

my_power <- function(x, p = 2) {
 return(x^p)
}

my_power(3) # by default, p = 2, will compute 3^2
## [1] 9

my_power(3, 3) # will compute 3^3
## [1] 27

Examples with matrix input

Write a function called matrix_times_vector to compute \(X Y\), where \(X\) is a matrix and \(Y\) is a vector. The output should be a vector.

matrix_times_vector = function(X, Y) {
  as.vector(X %*% Y)
}

# e.g.
X <- matrix(1:12, 3, 4)
Y <- 1:4
X
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
Y
## [1] 1 2 3 4
matrix_times_vector(X, Y)
## [1] 70 80 90

Note:

  1. X %*% Y will return a matrix. We can use as.vector to change it into a vector.
  2. It is common to see the error non-conformable arguments. This is because the dimensions of your matrices/vectors do not match.
  • If you have a \(n\times p\) matrix \(A\) and \(m \times q\) matrix \(B\), you can do the matrix multiplication \(AB\) only if \(p = m\). In R, if this is not the case, there will be an error.
  • Similarly, if you have a vector \(d\) of length \(m\). You can do the matrix multiplication \(A d\) only if \(p = m\).

Write a function called matrix_times_vector2 to compute \(X Y\), where \(X\) is a matrix and \(Y\) is a vector. The output should be a vector. However, you should check if the dimensions of the inputs are appropriate before you perform the calculation. Display an error message The dimensions do not match if this is not the case.

matrix_times_vector2 = function(X, Y) {
  p <- ncol(X)
  m <- length(Y)
  if (p == m) {
    return(as.vector(X %*% Y))   
  } else {
    cat("The dimensions do not match")
  }
}

# e.g.
X <- matrix(1:12, 4, 3)
Y <- 1:4
matrix_times_vector2(X, Y)
## The dimensions do not match

X <- matrix(1:12, 3, 4)
Y <- 1:4
matrix_times_vector2(X, Y)
## [1] 70 80 90

3.1.1 Argument Matching

Two ways of calling R function with arguments:

  1. by position of the argument
  2. by name of the argument

Example:

set.seed(1)
rnorm(5, 1, 2) # generate 5 rv from N(0, 2^2)
## [1] -0.2529076  1.3672866 -0.6712572  4.1905616
## [5]  1.6590155

set.seed(1)
rnorm(n = 5, mean = 1, sd = 2) # this is the same as above
## [1] -0.2529076  1.3672866 -0.6712572  4.1905616
## [5]  1.6590155

When you call the function by specifiying the name of the argument, the order does not matter:

set.seed(1)
rnorm(mean = 1, n = 5, sd = 2)
## [1] -0.2529076  1.3672866 -0.6712572  4.1905616
## [5]  1.6590155

But of course, one should not change the order generally.

3.2 Control Flow

3.2.1 for loop

You can use a for loop when you know how many times you will loop.

Syntax:

for (var in sequence) {
  statement # do this statement for each value of i
}

Examples:

for (i in 1:5) { # note: you do not have to define i beforehand
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

for (i in c(1, 3, 6)) {
  print(i)
}
## [1] 1
## [1] 3
## [1] 6

Example: write a function with a for loop to produce a conversion table of temperature from Fahrenheit (from \(0\) to \(200\) with increment \(20\)) to their Celsius equivalent.

conv_table <- function(low, up, step) {
  f_temp <- seq(low, up, step) 
  n <- length(f_temp)
  c_temp <- rep(0, n)
  for (i in 1:n) {
    c_temp[i] <- (5 / 9) * (f_temp[i] - 32)
  }
  # alternatively, we can use the vectorized operation
  # c_temp <- (5 / 9) * (f_temp - 32)
  cbind(f_temp, c_temp)
}

# test your function
conv_table(0, 200, 20)
##       f_temp     c_temp
##  [1,]      0 -17.777778
##  [2,]     20  -6.666667
##  [3,]     40   4.444444
##  [4,]     60  15.555556
##  [5,]     80  26.666667
##  [6,]    100  37.777778
##  [7,]    120  48.888889
##  [8,]    140  60.000000
##  [9,]    160  71.111111
## [10,]    180  82.222222
## [11,]    200  93.333333

3.2.2 nested for loop

for loops can be nested inside of each other

Examples

  1. Write R code to find \(\sum^{10}_{i=1} \sum^4_{j=1} \frac{i^2}{(i+j)^2}\).
sum <- 0
for (i in 1:10) {
  for (j in 1:4) {
    sum <- sum + i^2 / (i + j)^2
  }
}
sum
## [1] 18.26491
  1. Write R code to find \(\sum^{10}_{i=1} \sum^i_{j=1} \frac{i^2}{(i+j)^3}\).
sum <- 0
for (i in 1:10) {
  for (j in 1:i) {
    sum <- sum + i^2 / (i + j)^3
  }
}
sum
## [1] 2.779252

3.2.3 while loop

You can use a while loop if you want to loop until a specific condition is met. For example, when you minimize a function numerically using some iterative algorithm, you may want to stop when the objective value does not change much. You may not know how many loops are required in advance so that a while loop may be better than a for loop in this application.

Syntax:

while (condition) {
  statement # while the condition is TRUE, do this
}

A simple example:

i <- 1
while (i < 6) {
  print(i)
  i <- i + 1
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

# What happen if you change "i < 6" to "i <= 6"? Ans: will print 1 to 6

# What happen if you change "i < 6" to "i <= 5"? Ans: outputs are the same

Another example:

# find the smallest n such that 1^2+ 2^2+ ... + n^2 > 65
sum <- 0
i <- 0
while (sum < 65) {
  i <- i + 1
  sum <- sum + i^2
  print(c(i, sum))  
}
## [1] 1 1
## [1] 2 5
## [1]  3 14
## [1]  4 30
## [1]  5 55
## [1]  6 91
i # 6 
## [1] 6

3.2.4 if (cond)

Syntax:

if (condition) {
  statement # do this if the condition is TRUE
}

Example: write a function that outputs “positive” if a positive number is entered.

# check if a number if positive
my_pos <- function(x) {
  if (x > 0) { 
    print("positive")  
  }
}

my_pos(-2)

my_pos(2)
## [1] "positive"

3.2.5 if (cond) else expr

Syntax

if (condition) {
  statement1 # do this if condition is TRUE
  } else {
  statement2 # do this if condition is FALSE
}

Example:

# write my own absolute value function
my_abs <- function(x) {
 if (x>=0) {
   return(x)
 } else {
   return(-x)
 }
}

my_abs(-2)
## [1] 2

my_abs(3)
## [1] 3

my_abs(0)
## [1] 0

Error-handling in a function:

my_sqrt = function(x) {
  if (x >= 0) {
    print(sqrt(x))             # do this if x >= 0
  } else {
    cat("Error: this is a negative number!")   # do this otherwise
  }
}
my_sqrt(-2)
## Error: this is a negative number!

3.2.6 If else ladder

Syntax

# Example
if (condition1) {
  statement1
} else if (condition2) {
  statement2
} else if (condition1) {
  statement3
} 

Example:

score_to_grade = function(x) {
  if (x>=90) {
    cat("A+")
  } else if (x >= 85) {
    cat("A")
  } else if (x >= 80) {
    cat("A-")
  } else {
    cat("B+ or below")
  }
}

# after you write the function, you should check each case carefully
score_to_grade(92)
## A+
score_to_grade(88)
## A
score_to_grade(83)
## A-
score_to_grade(78)
## B+ or below

3.2.7 switch

Suppose you wish to write a function to generate random variables with two options: standard normal and uniform.

If we use if else, then

rdist <- function(n, dist) {
  if (dist == "norm") {
    rnorm(n)
  } else if (dist == "unif") {
    runif(n)
  }
}

Using switch:

rdist_switch <- function(n, dist) {
  switch(dist, norm = rnorm(n), unif = runif(n))
}

3.2.8 next, break

break breaks out of a for, while or repeat loop; control is transferred to the first statement outside the inner-most loop.

next halts the processing of the current iteration and advances the looping index.

Both break and next apply only to the innermost of nested loops.

Example of break:

for (j in 1:3) {
  for (i in 1:5) {
    if (i <= 3) {
      next
    }
    print(c(i, j))
  }
}
## [1] 4 1
## [1] 5 1
## [1] 4 2
## [1] 5 2
## [1] 4 3
## [1] 5 3

Example of next:

for (j in 1:3) {
  for (i in 1:5) {
    if (i >= 3) {
      break
    }
  }
  print(c(i, j))
}
## [1] 3 1
## [1] 3 2
## [1] 3 3

3.3 Loop functions

3.3.1 apply()

Apply a function over the margins of an array/ matrix

Basic usage: apply(X, MARGIN, FUN)

Margin: For a matrix, 1 indicates rows, 2 indicates columns

FUN: function to be applied

Examples

X <- matrix(runif(20), nrow = 4, ncol = 5)

# row sum
apply(X, 1, sum)
## [1] 2.020915 1.959644 3.476258 2.314615

# column sum
apply(X, 2, sum)
## [1] 1.453658 2.977065 2.304328 1.430564 1.605818

# row mean
apply(X, 1, mean)
## [1] 0.4041831 0.3919289 0.6952516 0.4629231

# column mean
apply(X, 2, mean)
## [1] 0.3634145 0.7442663 0.5760820 0.3576409 0.4014545

For the special cases of finding row/column sums and means of matrices, there are specific functions:

X <- matrix(runif(20), nrow = 4, ncol = 5)
rowSums(X)
## [1] 2.839428 2.709573 3.370761 2.423309
rowMeans(X)
## [1] 0.5678857 0.5419147 0.6741522 0.4846618
colSums(X)
## [1] 1.761405 2.398024 2.602992 2.655045 1.925607
colMeans(X)
## [1] 0.4403512 0.5995059 0.6507480 0.6637612 0.4814016

They are faster and their names are easier to understand when reading the code.

3.3.2 lapply()

lapply(): Apply a function over a list of vector

It will return a list of the same length as your input.

Examples:

set.seed(1)
n <- 1:5
lapply(n, rnorm) # the result is a list of 5 elements
## [[1]]
## [1] -0.6264538
## 
## [[2]]
## [1]  0.1836433 -0.8356286
## 
## [[3]]
## [1]  1.5952808  0.3295078 -0.8204684
## 
## [[4]]
## [1]  0.4874291  0.7383247  0.5757814 -0.3053884
## 
## [[5]]
## [1]  1.5117812  0.3898432 -0.6212406 -2.2146999
## [5]  1.1249309

The following for loop will give the same result

set.seed(1)
output <- list()
for (i in 1:5) {
  output[[i]] <- rnorm(i)
}
output
## [[1]]
## [1] -0.6264538
## 
## [[2]]
## [1]  0.1836433 -0.8356286
## 
## [[3]]
## [1]  1.5952808  0.3295078 -0.8204684
## 
## [[4]]
## [1]  0.4874291  0.7383247  0.5757814 -0.3053884
## 
## [[5]]
## [1]  1.5117812  0.3898432 -0.6212406 -2.2146999
## [5]  1.1249309

3.4 Automatically Reindent Code

To indent a block of code, highlight the text in RStudio, then press Ctrl+i (Windows or Linux) or press Cmd+i (Mac).

Poor indentation, difficult to read

for (i in 1:5) {
if (i >= 3) {
      print(i * 2)
          } else {
  print(i * 3)
}
      }
## [1] 3
## [1] 6
## [1] 6
## [1] 8
## [1] 10

Highlight the block of code, press Ctrl+i or Cmd+i

for (i in 1:5) {
  if (i >= 3) {
    print(i * 2)
  } else {
    print(i * 3)
  }
}
## [1] 3
## [1] 6
## [1] 6
## [1] 8
## [1] 10

3.5 Speed Consideration

While the computing power is getting stronger and stronger, we should still write code that runs efficiently.

# suppose we want to simulate 200,000 normal random variables
n <- 200000 
x <- rep(0, n) # create a vector for storage of the values
initial_time <- proc.time()
for (i in 1:n) {
  x[i] <- rnorm(1)
}
proc.time() - initial_time
##    user  system elapsed 
##    0.33    0.03    0.44
n <- 200000 
x <- rep(0, n) # create a vector for storage of the values
# Alternatively
system.time({
  for (i in 1:n) {
    x[i] <- rnorm(1)
  }
})
##    user  system elapsed 
##    0.39    0.02    0.39

The ‘user time’ is the CPU time charged for the execution of user instructions of the calling process. The ‘system time’ is the CPU time charged for execution by the system on behalf of the calling process.

A much more efficient way for the same task is to use

system.time({
  n <- 200000
  x <- rnorm(n)
})
##    user  system elapsed 
##    0.01    0.00    0.00

Another example:

set.seed(1)
x <- rnorm(2e6)
y <- rnorm(2e6)
v <- rep(0, 2e6)

system.time({
  for (i in 1:length(x)){ 
    v[i] <- x[i] + y[i]
  }
})
##    user  system elapsed 
##    0.09    0.00    0.09

system.time(v <- x + y)
##    user  system elapsed 
##       0       0       0

The general rule is to use vectorized operations whenever possible and to avoid using for loops. We use a for loop when the code is not time-consuming or when the code is hard to write without using a for loop. A more advanced option is to combine C++ with R using the package rcpp. That is, we can write the most time-consuming part of the R code in C++, which could run many times faster (will not be discussed in this course).

See also http://www.noamross.net/archives/2014-04-16-vectorization-in-r-why/

3.6 Another Simulation Example

A simple model on the stock return assumes that (i)

\[r_{t+1} := \log \frac{P_{t+1}}{P_t} \sim N(\mu,\sigma^2), \]

where \(r_{t+1}\) is the log-return at Day \(t+1\), \(P_t\) is the stock price at the end of Day \(t\); (ii) \(r_1,r_2,\ldots\) are iid. Simple algebra shows that \[P_{t+1} = P_t e^{r_{t+1}}.\] Applying the above equation repeatedly, we have

\[P_{t+1} = P_{t-1}e^{r_{t+1}} e^{r_t} = \ldots = P_0 e^{r_{t+1}} e^{r_t} \cdots e^{r_1} = P_0 e^{ \sum^{t+1}_{i=1} r_i}.\]

Suppose that the current price of a certain stock \(P_0\) is \(100\), \(\mu = 0.0002\) and \(\sigma = 0.015\). Using simulation, estimate the probability that the price is below $95 at the close of at least one of the next 30 trading days.

no_sim <- 10000 # number of simulation
below <- rep(0, no_sim)
for (i in 1:no_sim) {
  price <- 100 * exp(cumsum(rnorm(30, mean = 0.0002, sd = 0.015)))
  below[i] <- min(price) < 95
}
mean(below)
## [1] 0.4384

3.7 Additional Exercises

  1. Write a function that takes two numbers as arguments and returns the sum of the numbers.
my_sum <- function(x, y) {
  return(x + y)
}
my_sum(2, 3)
## [1] 5
  1. Write a function that takes a vector as an argument and returns the mean of the elements of the vector.
my_mean <- function(x) {
  # let's use a for loop to compute the mean
  n <- length(x)
  output <- 0
  for (i in 1:n) {
    output <- output + x[i]  
  }
  return(output / n)
}
my_mean(c(1, 3, 11))
## [1] 5
  1. Write a function that takes a matrix as an argument and returns the sum of the diagonal elements.
sum_diag <- function(A) {
  return(sum(diag(A)))
}
C <- matrix(1:81, nrow = 9, ncol = 9)
sum_diag(C)
## [1] 369
  1. Write a for loop that iterates from 1 to 10 and prints the square of each number.
for (i in 1:10) {
  print(i^2)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
## [1] 36
## [1] 49
## [1] 64
## [1] 81
## [1] 100
  1. Write a for loop that iterates through a vector of names and prints each name.
v_names <- c("A", "BC", "EO", "QP")
for (i in 1:length(v_names)) {
  print(v_names[i])
}
## [1] "A"
## [1] "BC"
## [1] "EO"
## [1] "QP"

# alternatively
for (i in v_names) {
  print(i)
}
## [1] "A"
## [1] "BC"
## [1] "EO"
## [1] "QP"
  1. Write a for loop that iterates through a vector of numbers and prints only the even numbers.
x <- c(1, 5, 3, 4, 2, 10)

for (i in 1:length(x)) {
  if (x[i] %% 2 == 0) {
    print(x[i])
  }
}
## [1] 4
## [1] 2
## [1] 10
  1. Sum the even numbers in a vector
x <- c(1, 5, 3, 4, 2, 10)
x %% 2 == 0 # vectorized operation
## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE
sum(x[x %% 2 == 0])
## [1] 16

# alternative way of doing the same task
output <- 0
for (i in 1:length(x)) {
  if (x[i] %% 2 == 0) {
    output <- output + x[i]
  }
}
output
## [1] 16
  1. Generate some random integers from 1 to 1 million
sample(1:1e6, size = 10)
##  [1] 417176 357954  20243 436290 659809 742377 660262
##  [8] 861097 500307 900017
  1. Write an if statement that checks if a variable x is equal to 5, and if so print “x is equal to 5”.
x <- 7
if (x == 5) {
  print("x is equal to 5")
}
  1. Write an if-else statement that checks if a variable y is greater than 10 and if so, print “y is greater than 10”; otherwise, print “y is less than or equal to 10”.
y <- 10
if (y > 10) {
  print("y is greater than 10")
} else {
  print("y is less than or equal to 10")
}
## [1] "y is less than or equal to 10"
  1. Write a nested if-else statement that checks if a variable z is positive, and if so, check if it is even or odd and prints “z is positive and even”. Print “z is not positive” if z is not positive.
z <- -2
if (z > 0) {
  if (z %% 2 == 0) {
    print("z is positive and even")
  }
} else {
  print("z is not positive")
}
## [1] "z is not positive"
  1. Another example
check_pos_even <- function(x) {
  if (x > 0) {
    pos_neg <- "+ve"
    if (x %% 2 == 0) {
      even_odd <- "even"
    } else {
      even_odd <- "odd"
    }
  } else {
    pos_neg <- "-ve"
    if (x %% 2 == 0) {
      even_odd <- "even"
    } else {
      even_odd <- "odd"
    }
  }
  print(paste0(pos_neg, " and ", even_odd))
}
check_pos_even(1)
## [1] "+ve and odd"
  1. Person \(A\) generates a random variable \(X \sim N(2, 1)\) and Person \(B\) generates a random variable \(Z \sim Unif(0, 4)\). If \(X < Z\), Person \(A\) will discard \(X\) and generate another random variable \(Y \sim Exp(0.5)\). Find the probability that the number generated by \(A\) is greater than that by \(B\).
n <- 100000
greater <- rep(0, n)
for (i in 1:n) {
  X <- rnorm(1, 2, 1)
  Z <- runif(1, 0, 4)
  if (X< Z) {
    Y <- rexp(1, rate = 0.5)
    greater[i] <- Y > Z
  } else {
    greater[i] <- 1 # 1 means A's no > B's no
  }
}

mean(greater)
## [1] 0.63753