Chapter 1 Introduction

1.1 What is R and RStudio?

R

  • R is a programming language and environment for statistical computing, analysis, and graphics.

  • R is an interpreted language (individual language expressions are read and then executed immediately as soon as the command is entered)

  • To download R, go to https://cloud.r-project.org/

RStudio is an integrated development environment (IDE) for R programming

1.2 Why R?

  • It’s free, open source, and available on every major platform. As a result, if you do your analysis in R, anyone can easily reproduce it.

  • A massive set of packages for statistical modelling, maching learning, visualization, and importing and manipulating data.

  • Powerful tools for communicating your results. RMarkdown makes it easy to turn your results into HTML files, PDFs, Word documents, PowerPoint presentations, and more. Shiny allows you to make beautiful interactive apps without any knowledge of HTML or javascript.

  • RStudio provides an integrated development environment.

  • Cutting edge tools. Researchers in statistics and machine learning will often publish an R package to accompany their articles. This means you have access to the latest statistical techniques and implementations.

  • The ease with which R can connect to high-performance programming languages like C, Fortran, and C++.

BUT, R is not perfect. One of the challenges is that most R users are not programmers. This means, for example, that much of the R code you will see in the wild is written in haste to solve a pressing problem. As a result, code is not very elegant, fast, or easy to understand. Most users do not revise their code to address these shortcomings. R is also not a particularly fast programming languages, and poorly written R code can be terribly slow.

1.3 What will you learn in this course?

Note: we do not assume you know R or any programming language before.

1.3.1 R and R as a programming language

  • operators
  • control flow (if..else.., for loop)
  • defining a function

1.3.2 Data Wrangling

Data wrangling = the process of tidying and transforming the data

1.3.3 Data Visualization

Graphs are powerful to illustrate features of the data. You will learn how to create some basic plots as well as using the package ggplot2 to create more elegant plots.

Consider a dataset about cars.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.3
mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans      drv     cty   hwy fl    class  
##    <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>  
##  1 audi         a4           1.8  1999     4 auto(l5)   f        18    29 p     compact
##  2 audi         a4           1.8  1999     4 manual(m5) f        21    29 p     compact
##  3 audi         a4           2    2008     4 manual(m6) f        20    31 p     compact
##  4 audi         a4           2    2008     4 auto(av)   f        21    30 p     compact
##  5 audi         a4           2.8  1999     6 auto(l5)   f        16    26 p     compact
##  6 audi         a4           2.8  1999     6 manual(m5) f        18    26 p     compact
##  7 audi         a4           3.1  2008     6 auto(av)   f        18    27 p     compact
##  8 audi         a4 quattro   1.8  1999     4 manual(m5) 4        18    26 p     compact
##  9 audi         a4 quattro   1.8  1999     4 auto(l5)   4        16    25 p     compact
## 10 audi         a4 quattro   2    2008     4 manual(m6) 4        20    28 p     compact
## # ℹ 224 more rows

Among the variables in mpg are:

displ, a car’s engine size, in litres.

hwy, a car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.

Scatterplot

Scatterplot, points are labeled with colors according to the class variable

Scatterplots

Line Chart

Bar chart

Another Bar Chart

Boxplot

Histogram

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

1.3.4 Statistical Inference

Many problems in different domains can be formulated into hypothesis testing problems.

  • Are university graduates more likely to vote for Candidate A?

  • Is a treatment effective in reducing weights?

  • Is a drug effective in reducing mortality rate?

We want to answer these questions that take into account of the intrinsic variability. Formally, we can perform hypothesis testing and compute the confidence intervals.

These are what you learned in STAT 269. It is ok if you haven’t taken the STAT 269. The topics will be briefly reviewed.

We will focus on the applications using R.

1.3.5 Machine Learning

We will illustrate some machine learning methods using real datasets. For example,

  • Diagnoising breast cancer with the k-NN algorithm

  • Employ Naive Bayes to build an SMS junk message filter (text data)

A wordcloud of text data

  • Use neural network to predict the compressive strength of concrete

1.3.6 Some Numerical Methods

  • Monte Carlo simulation (estimate probabilities, expectations, integrals)

  • numerical optimizaiton methods (e.g. maximizing a multi-parameter likelihood function using optim)

1.3.7 Lastly

It is important to communicate your results to other after performing the data analysis. Therefore, you will do a project with presentation and report.

1.4 Let’s Get Started

The best way to learn R is to get started immediately and try the code by yourselves. We will not discuss every topic in detail at the beginning, which is not interesting and unnecessary. We shall revisit the topics when we need additional knowledge.

Simple arithmetic expression

# can be used a simple calculator
3+5
## [1] 8

4*2
## [1] 8

10/2
## [1] 5

Comment a code: use the hash mark #

# this is a comment, R will not run the code behine #

Function for ‘combining’

c(4, 2, 3) # "c" is to "combine" the numbers
## [1] 4 2 3

Assignment (<- is the assignment operator like = in many other programming languages)

y <- c(4, 2, 3) # create a vector called y with elements 4, 2, 3

c(1, 3, 5) -> v # c(1,3,5) is assigned to v

Output

y
## [1] 4 2 3

v
## [1] 1 3 5

R is case-sensitive. When you type Y, you will see an error message: object ‘Y’ not found

Y
## Error in eval(expr, envir, enclos): object 'Y' not found

1.5 R Data Structures

Reading: ML with R Ch2

Most frequently used data structures in R: vectors, factors, lists, arrays, matrices, data frames

1.5.1 Vectors

Vector

  • fundamental R data structure
  • stores an ordered set of values called elements
  • elements must be of the same type
  • Type: integer, double, character, logical

Integer, double, logical, character vectors

x <- 1:2 # integer vector, we use a:b to form the sequence of integers from a to b

typeof(x) # type of the vector
## [1] "integer"

x <- c(1.1, 1.2) # double vector

typeof(x)
## [1] "double"

length(x) # length of the vector x
## [1] 2

x < 2 # logical (TRUE/FALSE)
## [1] TRUE TRUE

p <- c(TRUE, FALSE)

subject_name <- c("John", "Jane", "Steve") # character vector

Combine two vectors

y <- c(2, 4, 6)

c(x, y) # note that we created x above
## [1] 1.1 1.2 2.0 4.0 6.0

c(y, subject_name) # 2, 4, 6 become characters "2", "4", "6"
## [1] "2"     "4"     "6"     "John"  "Jane"  "Steve"

Assessing elements in the vectors

y <- c(2, 4, 6)
y[2] # second element
## [1] 4
y[3] # third element
## [1] 6

1.5.2 Factors

A factor is a special type of vector that is solely used for representing categorical (male, female/group 1, group 2, group 3) or ordinal (cold, warm, hot/ low, medium, high) variables.

Reasons for using factor

  • the category labels are stored only once. E.g., rather than storing MALE, MALE, MALE, FEMALE, the computer may store 1,1,1,2(save memory)

  • many machine learning algorithms treat categorical/ordinal and numeric features differently and may require the input as a factor

Create a factor

gender <- factor(c("MALE", "MALE", "FEMALE", "MALE")) 

gender
## [1] MALE   MALE   FEMALE MALE  
## Levels: FEMALE MALE

# compared with 
c("MALE", "MALE", "FEMALE", "MALE")
## [1] "MALE"   "MALE"   "FEMALE" "MALE"

1.5.3 Matrix

Matrix

  • a collection of numbers in a rectangular form

  • A matrix with dimension n by m means the matrix has n rows and m columns.

Create Matrix: To create a \(3\times 4\) matrix with elements 1:12 filled in column-wise

A <- matrix(1:12, nrow = 3, ncol = 4) # note that we use = instead of <- 
A
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

Dimension, number of rows, number of columns of a matrix

# again R is case-sensitive, a and A are different
dim(A) # to find the dimension of A
## [1] 3 4

nrow(A) # to find the number of row in A
## [1] 3

ncol(A) # to find the number of column in A
## [1] 4

By default, the matrix is filled column-wise. You can change to row-wise by adding byrow = TRUE

B <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE) 
B
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12

Select rows, columns, submatrix, element

A[1, 2] # select the element in the 1st row and 2nd column
## [1] 4

A[2, ] # select 2nd row
## [1]  2  5  8 11

A[, 3] # select 3rd column
## [1] 7 8 9

A[1:2, 3:4] # select a submatrix
##      [,1] [,2]
## [1,]    7   10
## [2,]    8   11

Try:

  • A[c(1, 2), c(1, 3, 4)]
  • A[-1, ]
  • A[, -2]

Combine Two Matrices

cbind(A, B) # combine column-wise
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    4    7   10    1    2    3    4
## [2,]    2    5    8   11    5    6    7    8
## [3,]    3    6    9   12    9   10   11   12

rbind(A, B) # combine row-wise
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
## [4,]    1    2    3    4
## [5,]    5    6    7    8
## [6,]    9   10   11   12

Try:

  • rbind(B, A)

Transpose

x <- c(1, 2, 3)

t(x) # transpose
##      [,1] [,2] [,3]
## [1,]    1    2    3

Q <- matrix(1:4, 2, 2)

Q
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

t(Q) # transpose
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

Matrix Addition

A <- matrix(1:6, nrow = 2, ncol = 3)
B <- matrix(2:7, nrow = 2, ncol = 3)

A + B
##      [,1] [,2] [,3]
## [1,]    3    7   11
## [2,]    5    9   13

A - B
##      [,1] [,2] [,3]
## [1,]   -1   -1   -1
## [2,]   -1   -1   -1

A + 2
##      [,1] [,2] [,3]
## [1,]    3    5    7
## [2,]    4    6    8

c <- c(1, 2)

A + c 
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    4    6    8

Elementwise Product

A <- matrix(1:6, nrow = 2, ncol = 3)
B <- matrix(1:2, nrow = 2, ncol = 3)

A * B
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    4    8   12

c <- 2

A * c
##      [,1] [,2] [,3]
## [1,]    2    6   10
## [2,]    4    8   12

c <- c(10, 100)
A * c
##      [,1] [,2] [,3]
## [1,]   10   30   50
## [2,]  200  400  600

c <- c(10, 100, 1000)
A * c  # do you notice the pattern?
##      [,1] [,2] [,3]
## [1,]   10 3000  500
## [2,]  200   40 6000

Matrix Multiplication

A <- matrix(1:12, nrow = 3, ncol = 4) # 3x4 matrix

t(A) # 4x3 matrix
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
## [4,]   10   11   12

t(A) %*% A #3x3 matrix, %*% = matrix multiplication
##      [,1] [,2] [,3] [,4]
## [1,]   14   32   50   68
## [2,]   32   77  122  167
## [3,]   50  122  194  266
## [4,]   68  167  266  365

B <- matrix(1:9, nrow = 3, ncol = 3)

B %*% A
##      [,1] [,2] [,3] [,4]
## [1,]   30   66  102  138
## [2,]   36   81  126  171
## [3,]   42   96  150  204

A %*% B # error, non-conformable arguments
## Error in A %*% B: non-conformable arguments

Diagonal Matrix

diag(1:4) # diagonal matrix with diagonal elements being 1:4
##      [,1] [,2] [,3] [,4]
## [1,]    1    0    0    0
## [2,]    0    2    0    0
## [3,]    0    0    3    0
## [4,]    0    0    0    4

A <- matrix(1:9, 3, 3)

A
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

diag(A) # find the diagonal of A
## [1] 1 5 9

How to create an identity matrix in R?

Inverse

The inverse of a \(n \times n\) matrix \(A\), denoted by \(A^{-1}\), is a \(n \times n\) matrix such that \(AA^{-1} = A^{-1} A = I_n\), where \(I_n\) is the \(n\times n\) identity matrix.

To find the inverse of \(A\) in R: solve

A <- matrix(c(1, 0, 0, 3), 2, 2)
solve(A)
##      [,1]      [,2]
## [1,]    1 0.0000000
## [2,]    0 0.3333333

Some Statistical Applications

I will mention a few connections of matrices with statistics.

  • A dataset is naturally a matrix. Suppose that you have \(n\) people. You collected their health information: blood pressure, height, weight, age, whether they smoke (1 if yes, 0 if no), whether they drink (1/0), etc.

  • Linear regression: we observe \((x,y)\), where \(x\) is a vector of covariates and \(y\) is your response. For example, \(y\) is the blood pressure, \(x\) is the collection of other health information. The linear regression model assumes that \[y = \beta_0 + x^T \beta_1 + \varepsilon,\] where \(\varepsilon\) is the error term. Our goal is to estimate \(\beta:=(\beta_0, \beta_1)\). Let \(X\) be the design matrix. That is \(X\) is a \(n \times p\) matrix where \(n\) is the number of observation, \(p-1\) is the number of covariates. The least squares solution for \(\beta\) is \[\hat{\beta} = (X^T X)^{-1}X^TY.\] We will revisit linear regression later (I know some of you may not know linear regression).

  • Correlation matrix, Covariance matrix. Let \(X\) be a random vector (column vector). The covariance matrix is defined as \[\Sigma := E[(X-E(X))(X-E(X))^T].\]

1.5.4 Lists

  • store an ordered set of elements like a vector

  • can store different R data types (unlike a vector)

# let's create some vectors (of different types)

subject_name <- c("John", "Jane", "Steve") 

# at this point, you should notice that meaningful names should be used for the variables

temperature <- c(98.1, 98.6, 101.4)

flu_status <- c(FALSE, FALSE, TRUE) 

# notice how we use _ to separate two words 
# this is one of the styles in coding, you should be consistent with your style

data <- list(fullname = subject_name, temperature = temperature, flu_status = flu_status) 

# you may wonder what is the meaning of temperature = temperature
# in "fullname = subject_name"
# on the left of = is the name of the 1st element of your list
# on the right of = is the name of the variable that you want to
# assign the value to the 1st element

data 
## $fullname
## [1] "John"  "Jane"  "Steve"
## 
## $temperature
## [1]  98.1  98.6 101.4
## 
## $flu_status
## [1] FALSE FALSE  TRUE

To assess the element of a list:

data$flu_status
## [1] FALSE FALSE  TRUE

data[c("temperature", "flu_status")]
## $temperature
## [1]  98.1  98.6 101.4
## 
## $flu_status
## [1] FALSE FALSE  TRUE

data[2:3] # if you don't have the names
## $temperature
## [1]  98.1  98.6 101.4
## 
## $flu_status
## [1] FALSE FALSE  TRUE

1.5.5 Data frames

Data frame can be understood as a list of vectors, each having exactly the same number of values, arranged in a structure like a spreadsheet or database

gender <- c("MALE", "FEMALE", "MALE")
blood <- c("O", "AB", "A")

pt_data <- data.frame(subject_name, temperature, flu_status, gender, blood)

pt_data
##   subject_name temperature flu_status gender blood
## 1         John        98.1      FALSE   MALE     O
## 2         Jane        98.6      FALSE FEMALE    AB
## 3        Steve       101.4       TRUE   MALE     A
colnames(pt_data)
## [1] "subject_name" "temperature"  "flu_status"   "gender"       "blood"

pt_data$subject_name
## [1] "John"  "Jane"  "Steve"

pt_data[c("temperature", "flu_status")]
##   temperature flu_status
## 1        98.1      FALSE
## 2        98.6      FALSE
## 3       101.4       TRUE

pt_data[1, ] # like a matrix
##   subject_name temperature flu_status gender blood
## 1         John        98.1      FALSE   MALE     O

pt_data[, 2:3] 
##   temperature flu_status
## 1        98.1      FALSE
## 2        98.6      FALSE
## 3       101.4       TRUE

Create a new column

pt_data$temp_c <- (pt_data$temperature - 32) * 5 / 9

pt_data
##   subject_name temperature flu_status gender blood   temp_c
## 1         John        98.1      FALSE   MALE     O 36.72222
## 2         Jane        98.6      FALSE FEMALE    AB 37.00000
## 3        Steve       101.4       TRUE   MALE     A 38.55556

pt_data$fever <- (pt_data$temp_c > 37.6)

pt_data
##   subject_name temperature flu_status gender blood   temp_c fever
## 1         John        98.1      FALSE   MALE     O 36.72222 FALSE
## 2         Jane        98.6      FALSE FEMALE    AB 37.00000 FALSE
## 3        Steve       101.4       TRUE   MALE     A 38.55556  TRUE

1.6 Operators

Priority Operator Meaning
1 $ component selection
2 [ [[ subscripts, elements
3 ^ (caret) exponentiation
4 - unary minus
5 : sequence operator
6 %% %/% %*% modulus, integer divide, matrix multiply
7 * / multiply, divide
8 + - add, subtract
9 < > <= >= == != comparison
10 ! not
11 & | && || logical and, logical or
12 <- -> = assignments
# $ for list, data frame, etc
subject_name <- c("John", "Jane", "Steve") 
temperature <- c(98.1, 98.6, 101.4)
flu_status <- c(FALSE, FALSE, TRUE) 

data <- list(fullname = subject_name, temperature = temperature, flu_status = flu_status) 

data$fullname
## [1] "John"  "Jane"  "Steve"

pt_data <- data.frame(subject_name, temperature, flu_status)

pt_data$temperature
## [1]  98.1  98.6 101.4
# [ ], [[]]
x <- c(1, 5, 7)
x[2]
## [1] 5

data[[1]]
## [1] "John"  "Jane"  "Steve"
# x^r = x to the power of r
x <- 2
x^4 # 16
## [1] 16
# modulus
7 %% 2 # 7 divided by 2 equals 3 but it remains 1, modulus = reminder
## [1] 1

10 %% 3
## [1] 1

20 %% 2
## [1] 0

# integer division
7 %/% 2
## [1] 3

20 %/% 3
## [1] 6

Comparison

# <, >, <=, >=, ==, !=
x <- 2
x > 3
## [1] FALSE

x < 4
## [1] TRUE

x <- c(1, 5, 7)
x < 3 # compare each element with 3
## [1]  TRUE FALSE FALSE

x >= 5
## [1] FALSE  TRUE  TRUE

x == 5 # if x is equal to 5, not x = 5
## [1] FALSE  TRUE FALSE

x != 5 # if x is not equal to 5
## [1]  TRUE FALSE  TRUE
x <- TRUE

!x # not x 
## [1] FALSE

x <- 2

x <= 2
## [1] TRUE

!(x <= 2)
## [1] FALSE

& and && indicate logical AND. The shorter form performs elementwise comparisons. The longer form examine only the first element of each vector.

x <- c(TRUE, FALSE, TRUE)
y <- c(FALSE, FALSE, TRUE)

x & y
## [1] FALSE FALSE  TRUE

x && y
## Warning in x && y: 'length(x) = 3 > 1' in coercion to 'logical(1)'

## Warning in x && y: 'length(x) = 3 > 1' in coercion to 'logical(1)'
## [1] FALSE

z <- c(TRUE)

x && z
## Warning in x && z: 'length(x) = 3 > 1' in coercion to 'logical(1)'
## [1] TRUE

| and || indicate logical OR. The shorter form performs elementwise comparisons. The longer form examine only the first element of each vector.

x <- c(TRUE, FALSE, TRUE)
y <- c(FALSE, FALSE, TRUE)

x | y
## [1]  TRUE FALSE  TRUE

x || y
## Warning in x || y: 'length(x) = 3 > 1' in coercion to 'logical(1)'
## [1] TRUE

z <- c(TRUE)

x || z
## Warning in x || z: 'length(x) = 3 > 1' in coercion to 'logical(1)'
## [1] TRUE

Assignment

# these assignments are the same, it is recommended to use <-
a <- 2
a
## [1] 2
2 -> b
b
## [1] 2
c = 2
c
## [1] 2

Do !(x > 1) & (x < 4) and !((x > 1) & (x < 4)) give different results?

Example Let v be a vector of integers. Write a one-line R code to compute the product of all the even integers in v.

To illustrate how to solve the question step by step:

v <- -10:10 # begin writing your code by setting some integers
v %% 2 # find the remainder, if the remainder is 0, it is an even number
##  [1] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
v %% 2 == 0  # check which elements is 0
##  [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
v[v %% 2 == 0] # select the elements which are "TRUE"
##  [1] -10  -8  -6  -4  -2   0   2   4   6   8  10
prod(v[v %% 2 == 0]) # find the product
## [1] 0

# the result is 0. v may not be a good example to check if the code is correct
# let's change to some other vector
v <- 2:8
prod(v[v %% 2 == 0])
## [1] 384

2 * 4 * 6 * 8 #check 
## [1] 384

# now, the final answer is prod(v[v %% 2 == 0])

1.6.1 Vectorized Operators

An important property of many of the operators is that they are “vectorized”. This means that the operation will be performed elementwise.

x <- c(1, 2, 3)
y <- c(5, 6, 7)

x + y
## [1]  6  8 10

x * y
## [1]  5 12 21

2 * x # you do not need to use c(2,2,2)*x
## [1] 2 4 6

y / 2 # you do not need to use y/c(2,2,2)
## [1] 2.5 3.0 3.5

A <- matrix(1:9, nrow = 3, ncol = 3)
B <- matrix(1:9, nrow = 3, ncol = 3)

A + B
##      [,1] [,2] [,3]
## [1,]    2    8   14
## [2,]    4   10   16
## [3,]    6   12   18

A * B # this is not matrix multiplication, but elementiwse multiplication
##      [,1] [,2] [,3]
## [1,]    1   16   49
## [2,]    4   25   64
## [3,]    9   36   81

x <- c(1, 3, 5)
y <- c(2, 2, 9)
x < y
## [1]  TRUE FALSE  TRUE

Another example:

X <- c(1, 1, 5)
Y <- c(5, 5, 1)
X > Y # FALSE FALSE TRUE
## [1] FALSE FALSE  TRUE
sum(X > Y) # TRUE = 1, FALSE = 0
## [1] 1
mean(X > Y) # 1/3
## [1] 0.3333333

The code mean(X > Y) is computing \[\begin{equation*} \frac{1}{3}( I(X_1 > Y_1) + I(X_2 > Y_2) + I(X_3 > Y_3)), \end{equation*}\] where \(I(\cdot)\) is the indicator function, that is, \(I(X_1 > Y_1) = 1\) if \(X_1 > Y_1\) and \(I(X_1 > Y_1) = 0\) if \(X_1 \leq Y_1\).

Example: if A is matrix and x is a vector, you can find the sums of the elements in A and x by sum(A) and sum(x) respectively.

x <- 1:10
sum(x)
## [1] 55

A <- matrix(1:10, 5, 2)
sum(A)
## [1] 55

Example: Find \(\sum^{5}_{x=1}\sum^{4}_{y=1} \frac{x}{x+y}\) without any loops.

(x <- matrix(1:5, nrow = 4, ncol = 5, byrow = TRUE)) # define and display at the same time using (...)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    1    2    3    4    5
## [3,]    1    2    3    4    5
## [4,]    1    2    3    4    5
(y <- matrix(1:4, nrow = 4, ncol = 5, byrow = FALSE))
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    1    1    1
## [2,]    2    2    2    2    2
## [3,]    3    3    3    3    3
## [4,]    4    4    4    4    4
x / (x + y) # vectorized operation
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 0.5000000 0.6666667 0.7500000 0.8000000 0.8333333
## [2,] 0.3333333 0.5000000 0.6000000 0.6666667 0.7142857
## [3,] 0.2500000 0.4000000 0.5000000 0.5714286 0.6250000
## [4,] 0.2000000 0.3333333 0.4285714 0.5000000 0.5555556
sum(x / (x + y))
## [1] 10.72817

1.7 Built-in Functions

Common mathematical functions sqrt, abs, sin, cos, log, exp.

To get help on the usage of a function. Use ?. For example, if you want to know more about log. Type ?log in the console. You will then see that by default, log computes the natrual logarithms.

Other useful functions

Name Operations
ceiling smallest integer greater than or equal to element
floor largest integer less than or equal to element
trunc ignore the decimal part
round round up for positive and round down for negative
sort sort the vector in ascending or descending order
sum, prod sum and produce of a vector
cumsum, cumprod cumulative sum and product
min, max return the smallest and largest values
range return a vector of length 2 containing the min and max
mean return the sample mean of a vector
var return the sample variance of a vector
sd return the sample standard deviation of a vector
seq generate a sequence of number
rep replicate elements in a vector

Note:

If you have data \(x_1,\ldots,x_n\), the sample variance is defined as \[ S^2_n := \frac{1}{n-1} \sum^n_{i=1}(x_i-\overline{x}_n)^2. \] Note that we divide the sum by \(n-1\) but not \(n\). The sample standard deviation is the square root of the sample variance.

x <- 1:5

y <- sqrt(x)

y
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068

ceiling(y)
## [1] 1 2 2 2 3

sum(x)
## [1] 15

prod(x)
## [1] 120

cumsum(x)
## [1]  1  3  6 10 15

cumprod(x)
## [1]   1   2   6  24 120

min(x)
## [1] 1

max(x)
## [1] 5

range(x)
## [1] 1 5

mean(x)
## [1] 3

var(x)
## [1] 2.5

rep(0, 10) # create a vector of length 10 with all elements being 0
##  [1] 0 0 0 0 0 0 0 0 0 0

rep(1, 10) # create a vector of length 10 with all elements being 1
##  [1] 1 1 1 1 1 1 1 1 1 1

1.7.1 sort()

x <- c(1, 5, 3, 10)

sort(x) # default = ascending order
## [1]  1  3  5 10

sort(x, decreasing = TRUE) # descending order
## [1] 10  5  3  1

1.7.2 seq()

This is an example of function with more than one argument.

# seq(from, to)
seq(1:5)
## [1] 1 2 3 4 5

seq(from = 1, to = 5)
## [1] 1 2 3 4 5

# seq(from, to, by)
seq(1, 5, 2)
## [1] 1 3 5

seq(from = 1, to =5, by = 2)
## [1] 1 3 5

# seq(from, to, length)
seq(0, 10, length = 21)
##  [1]  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0  8.5  9.0  9.5 10.0

seq(from = 0, to = 10, length = 21)
##  [1]  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0  8.5  9.0  9.5 10.0

1.7.3 rep()

# rep(data, times), try ?rep
rep(0, 10)
##  [1] 0 0 0 0 0 0 0 0 0 0

rep(c(1, 2, 3), 3) 
## [1] 1 2 3 1 2 3 1 2 3

1.7.4 pmax, pmin

x <- c(1, 3, 5)
y <- c(2, 4, 4)

max(x, y) # maximum of x and y
## [1] 5

min(x, y) # minimum of x and y
## [1] 1

pmax(x, y) # elementwise comparison
## [1] 2 4 5

pmin(x, y) # elementwise comparison
## [1] 1 3 4

1.8 Some Useful RStudio Shortcuts

See also https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts

  • Ctrl + 1: Move focus to the Source Editor (when you are in the Console)
  • Ctrl + 2: Move focus to the Console (when you are in the source window)
  • \(\uparrow\) (the up arrow key on the keyboard): go to the previous command (in the console)
  • \(\downarrow\) (the down arrow key on the keyboard): go to the next command (in the console)
  • Esc: Delete the current command/ Interrupt currently executing command
  • Ctrl + Tab: go to the next tab

1.9 Exercises

To test your understanding, try to evaluate the following code by hand and then check with the output from R.

x <- (10:1)[c(-1, -4)]
x <- x^2
x[5] # what do you expect to see?
a<-c(1:3, 6, 3 + 3, "sta", 3 > 5)
a #?
typeof(a) #?
length(a) #?
x <- rep(1:6, rep(1:3, 2))
> x %% 2 == 0 #?
> x[x %% 2 == 0] #?
> round(-3.7) #?
> trunc(-3.7) #?
> floor(-3.7) #?
> ceiling(-3.7) #?
> round(3.8) #?
> trunc(3.8) #?
> floor(3.8) #?
> ceiling(3.8) #?
> x<-c(4, 3, 8, 7, 5, 6, 2, 1)
> sort(x) #?
> order(x) #?
> sum(x) + prod(x) #?
> cumsum(x) + cumprod(x) #?
> max(x) + min(x) #?

1.10 Comments to Exercises

These are comments but not answers but you can get the answers immediately by running the code.

x <- (10:1)[c(-1, -4)]
# 10:1 will give you a vector with elements 10, 9, 8,...,1
10:1
##  [1] 10  9  8  7  6  5  4  3  2  1
# [c(-1, -4)] will remove the 1st and 4th elements in the vector
# therefore 10 and 7 will be removed from 10:1 given
x
## [1] 9 8 6 5 4 3 2 1

x <- x^2
# ^2 will square each of the elements in the vector 

x[5] # 36
## [1] 16
a<-c(1:3, 6, 3 + 3, "sta", 3 > 5)
a # when you combine numeric, characters and logical values, the results become character
## [1] "1"     "2"     "3"     "6"     "6"     "sta"   "FALSE"
typeof(a) # character
## [1] "character"
length(a) # 
## [1] 7
x <- rep(1:6, rep(1:3, 2))
# we first evaluate rep(1:3, 2), which is 
# 1 2 3 1 2 3
# rep then replicates the elements by the corresponding number of times
# 1 is repeated 1 time
# 2 is repeated 2 times
# 3 is repeated 3 times
# 4 is repeated 1 time
# 5 is repeated 2 times
# 6 is repeated 3 times

x%%2 == 0 # check if the modulus is 0 when x is divided by 2
##  [1] FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
# this is equivalent to ask if the elements are even

x[x%%2 == 0] # find out all the even elements
## [1] 2 2 4 6 6 6
# if you need to use these functions
# you may try with a positive number and a negative number
# to see if the results are what you want

round(-3.7)  # try round(-3.5), round(-3.4)  
## [1] -4
trunc(-3.7) 
## [1] -3
floor(-3.7) 
## [1] -4
ceiling(-3.7) 
## [1] -3

round(3.8) 
## [1] 4
trunc(3.8) 
## [1] 3
floor(3.8) 
## [1] 3
ceiling(3.8)
## [1] 4
x<-c(4, 3, 8, 7, 5, 6, 2, 1)
sort(x) # asecending order
## [1] 1 2 3 4 5 6 7 8
order(x) # from the result, can you guess what it does?
## [1] 8 7 2 1 5 6 4 3
# order(x)  returns a permutation which rearranges its first argument into ascending or descending order
# that is, x[order(x)] = sort(x)

sum(x) + prod(x) 
## [1] 40356
cumsum(x) + cumprod(x) 
## [1]     8    19   111   694  3387 20193 40355 40356
max(x) + min(x) 
## [1] 9