Chapter 2 Probability and Simulation
Optional Reading: R Cookbook Ch8
2.1 Probability Distributions
Using the normal distribution as an example:
Function | Purpose |
---|---|
dnorm | Normal density |
pnorm | Normal CDF |
qnorm | Normal quantile function |
rnorm | Normal random variables |
Examples
Density of \(N(2, 3^2)\) at \(5\).
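For example, using dnorm:
dnorm(5, mean = 2, sd = 3)
## [1] 0.08065691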
\(P(X \leq 3)\), where \(X \sim N(2, 3^2)\)
pnorm(3, mean = 2, sd = 3)
## [1] 0.6305587
# "mean =" and "sd =" are optional
pnorm(3, 2, 3)
## [1] 0.6305587
Generate 10 random variables, each following \(N(3, 4^2)\).
rnorm(10, 3, 4)
## [1] 6.134238 3.881177 10.142106 2.986068 5.721259 1.544929 6.376358 -5.683805 5.387371 9.742924
95th percentile of \(N(0, 1)\): find \(q\) such that \(P(Z \leq q) = 0.95\).
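qnorm gives this quantile directly:
qnorm(0.95)
## [1] 1.644854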
Plotting the normal density
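One way to draw the \(N(2, 3^2)\) density from the examples above (a sketch; the plotting range of roughly mean \(\pm 3\) sd is a choice):
x <- seq(-7, 11, by = 0.1)
plot(x, dnorm(x, mean = 2, sd = 3), type = "l")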
2.1.1 Common Distributions
Discrete distribution | R name | Parameters |
---|---|---|
Binomial | binom | n = number of trials; p = probability of success for one trial |
Geometric | geom | p = probability of success for one trial |
Negative binomial (NegBinomial) | nbinom | size = number of successful trials; either prob = probability of a successful trial or mu = mean |
Poisson | pois | lambda = mean |
Continuous distribution | R name | Parameters |
---|---|---|
Beta | beta | shape1; shape2 |
Cauchy | cauchy | location; scale |
Chi-squared (Chisquare) | chisq | df = degrees of freedom |
Exponential | exp | rate |
F | f | df1 and df2 = degrees of freedom |
Gamma | gamma | shape; either rate or scale |
Log-normal (Lognormal) | lnorm | meanlog = mean on logarithmic scale; sdlog = standard deviation on logarithmic scale |
Logistic | logis | location; scale |
Normal | norm | mean; sd = standard deviation |
Student’s t (TDist) | t | df = degrees of freedom |
Uniform | unif | min = lower limit; max = upper limit |
To get help on the distributions:
?dnorm
?dbeta
?dcauchy
# the help pages for the following distributions are filed under different names
?TDist
?Chisquare
?Lognormal
Examples (using the binomial distribution)
dbinom(2, 10, 0.6) # p_X(2), p_X is the pmf of X, X ~ Bin(n=10, p=0.6)
## [1] 0.01061683
pbinom(2, 10, 0.6) # F_X(2), F_X is the CDF of X, X ~ Bin(n=10, p=0.6)
## [1] 0.01229455
qbinom(0.5, 10, 0.6) # 50th percentile of X
## [1] 6
rbinom(4, 10, 0.6) # generate 4 random variables from Bin(n=10, p=0.6)
## [1] 7 6 4 6
x <- 0:10
plot(x, dbinom(x, 10, 0.6), type = "h") # "h" for histogram-like vertical lines
2.1.2 Exercises
- The average number of trucks arriving on any one day at a truck depot in a certain city is known to be 12. Assuming the number of trucks arriving on any one day has a Poisson distribution, what is the probability that on a given day fewer than 9 (strictly less than 9) trucks will arrive at this depot?
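A possible solution: "fewer than 9" means at most 8, so we need the Poisson CDF at 8.
ppois(8, lambda = 12)
## [1] 0.1550278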
- Let \(Z \sim N(0, 1)\). Find \(c\) such that
- \(P(Z \leq c) = 0.1151\)
- \(P(1\leq Z \leq c) = 0.1525\)
- \(P(-c \leq Z \leq c) = 0.8164\).
# P(0 <= Z <= c) = 0.8164/2
# P(Z <= c) = 0.8164/2 + 0.5
c <- qnorm(0.8164 / 2 + 0.5)
# test our answer
pnorm(c) - pnorm(-c)
## [1] 0.8164
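The first two parts can be solved with qnorm directly (a sketch):
qnorm(0.1151) # (a): c such that P(Z <= c) = 0.1151
qnorm(0.1525 + pnorm(1)) # (b): P(Z <= c) = 0.1525 + P(Z <= 1)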
- Plot the density of a chi-squared distribution with degrees of freedom \(4\), from \(x=0\) to \(x=10\).
- Find the 95th percentile of this distribution.
# note that a chi-squared r.v. is nonnegative
x <- seq(0, 10, by = 0.1)
plot(x, dchisq(x, df = 4), type = "l")
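The 95th percentile is given by the quantile function:
qchisq(0.95, df = 4)
## [1] 9.487729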
- Simulate \(10\) Bernoulli random variables with parameter \(0.6\).
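A Bernoulli random variable with parameter \(p\) is a binomial random variable with n = 1, so one possible solution is:
rbinom(10, size = 1, prob = 0.6)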
- Plot the Poisson pmf with parameter \(2\) from \(x = 0\) to \(x = 10\).
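A sketch, analogous to the binomial pmf plot above:
x <- 0:10
plot(x, dpois(x, lambda = 2), type = "h")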
- Draw a plot to illustrate that the 97.5th percentile of the t distribution will be getting closer to that of the standard normal distribution when the degrees of freedom increases.
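One possible plot (a sketch; the range of degrees of freedom is a choice): draw the 97.5th percentile of the t distribution against the degrees of freedom, with the corresponding standard normal percentile as a horizontal reference line.
df <- 1:50
plot(df, qt(0.975, df), type = "l")
abline(h = qnorm(0.975), lty = 2) # 97.5th percentile of N(0, 1)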
2.2 Simulation
We have already seen how to use functions like runif, rnorm, and rbinom to generate random variables. R actually generates a pseudo-random number sequence: a deterministic sequence of numbers that approximates the properties of random numbers.
The pseudo-random number sequence will be the same if it is initialized with the same seed, which can be used to reproduce the same simulation results or to debug.
# every time you run the first two lines, you get the same result
set.seed(1)
runif(5)
## [1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
# every time you run the following code, you get a different result
runif(5)
## [1] 0.89838968 0.94467527 0.66079779 0.62911404 0.06178627
Sampling from discrete distributions
Usage of sample:
sample(x, size, replace = FALSE, prob = NULL)
See also ?sample.
sample(10) # random permutation of integers from 1 to 10
## [1] 3 1 5 8 2 6 10 9 4 7
sample(10, replace = T) # sample with replacement
## [1] 5 9 9 5 5 2 10 9 1 4
sample(c(1, 3, 5), 5, replace = T)
## [1] 5 3 3 3 3
# simulate 20 random variables from a discrete distribution
sample(c(-1,0,1), size = 20, prob = c(0.25, 0.5, 0.25), replace = T)
## [1] -1 1 1 -1 0 0 1 1 0 -1 0 0 0 0 0 1 1 0 -1 0
Example: Suppose you have a fair coin and play the following game: you flip the coin, winning $1 if the result is heads and losing $1 if it is tails. You play the game 100 times and are interested in the cumulative profit.
set.seed(1) # R actually generates pseudo random numbers
# setting the seed ensures that each time you will get the same result
# for illustration, code debugging, reproducibility
profit <- sample(c(-1, 1), size = 100, replace = T)
plot(cumsum(profit), type = "l")
Example: You have two dice \(A\) and \(B\). Die \(A\) has \(6\) sides with numbers \(1,2,3,4,5,6\), and the corresponding probabilities of these values are \(0.1,0.1,0.1,0.1,0.1,0.5\). Die \(B\) has \(4\) sides with numbers \(1,2,3,7\), and the corresponding probabilities are \(0.3,0.2,0.3,0.2\). You roll the two dice independently. Estimate \(P(X > Y)\) using simulation, where \(X\) is the result from die \(A\) and \(Y\) is the result from die \(B\).
n <- 10000 # number of simulations
X <- sample(1:6, size = n, replace = TRUE, prob = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.5))
Y <- sample(c(1, 2, 3, 7), size = n, replace = TRUE, prob = c(0.3, 0.2, 0.3, 0.2))
mean(X > Y)
## [1] 0.6408
Why does the sample mean approximate the required probability? Recall the strong law of large numbers (SLLN). Let \(X_1,\ldots,X_n\) be independent and identically distributed random variables with mean \(\mu:=E(X_1)\). Let \(\overline{X}_n:= \frac{1}{n} \sum^n_{i=1}X_i\). Then
\[\overline{X}_n \stackrel{a.s.}{\rightarrow} \mu.\]
Note: a.s. means almost surely; the above convergence means \(P(\lim_{n\rightarrow \infty} \overline{X}_n = \mu) = 1\). Since the expectation of an indicator random variable is the probability that the corresponding event happens,
\[ P(X>Y) = E(I(X>Y)).\]
To apply SLLN, we just need to recognize the underlying random variable is \(I(X>Y)\). Then, with probability \(1\), the sample mean
\[ \frac{1}{n} \sum^n_{i=1} I(X_i > Y_i) \rightarrow P(X>Y).\]
The quantity on the LHS is what we compute in mean(X > Y) (note that we are using vectorized comparison).
Another Simulation Example
A simple model for the stock return assumes that (i)
\[r_{t+1} := \log \frac{P_{t+1}}{P_t} \sim N(\mu,\sigma^2), \]
where \(r_{t+1}\) is the log-return on Day \(t+1\) and \(P_t\) is the stock price at the end of Day \(t\); and (ii) \(r_1,r_2,\ldots\) are iid. Simple algebra shows that \[P_{t+1} = P_t e^{r_{t+1}}.\] Applying the above equation repeatedly, we have
\[P_{t+1} = P_{t-1}e^{r_{t+1}} e^{r_t} = \ldots = P_0 e^{r_{t+1}} e^{r_t} \cdots e^{r_1} = P_0 e^{ \sum^{t+1}_{i=1} r_i}.\]
Suppose that the current price of a certain stock \(P_0\) is \(100\), \(\mu = 0.0002\) and \(\sigma = 0.015\). Using simulation, estimate the probability that the price is below $95 at the close of at least one of the next 30 trading days.
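One possible simulation (a sketch; the number of replications n is a choice):
n <- 10000 # number of simulated 30-day price paths
below <- rep(0, n)
for (i in 1:n) {
  r <- rnorm(30, mean = 0.0002, sd = 0.015) # log-returns of the next 30 days
  prices <- 100 * exp(cumsum(r)) # P_1, ..., P_30
  below[i] <- any(prices < 95) # below $95 on at least one day
}
mean(below) # estimated probability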
2.3 Additional Exercises
Ex 1
Let \(X \sim N(mean = 2, sd = 1)\), \(Y \sim Exp(rate = 2)\), \(Z \sim Unif(0, 4)\) (continuous uniform distribution on [0,4]). Suppose \(X, Y, Z\) are independent. Estimate \(P(\max(X,Y)>Z)\).
Recall the difference between pmax and max:
# always try with simple examples when you test the usage of functions
x <- c(1, 2, 3)
y <- c(0, 5, 10)
max(x, y)
## [1] 10
pmax(x, y)
## [1] 1 5 10
Answer:
n <- 100000
x <- rnorm(n, 2, 1)
y <- rexp(n, 2)
z <- runif(n, 0, 4)
mean(pmax(x, y) > z)
## [1] 0.51278
Ex 2
Let \(X \sim N(mean = 2, sd = 1)\), \(Y \sim Exp(rate = 2)\), \(Z \sim Unif(0, 4)\) (continuous uniform distribution on [0,4]). Suppose that \(X, Y, Z\) are independent. Estimate \(P(\min(X,Y)>Z)\).
n <- 100000
x <- rnorm(n, 2, 1)
y <- rexp(n, 2)
z <- runif(n, 0, 4)
mean(pmin(x, y) > z) # what is the difference between pmin and min?
## [1] 0.11386
Ex 3
Let \(X \sim N(mean = 2, sd = 1)\), \(Y \sim Exp(rate = 2)\), \(Z \sim Unif(0, 4)\) (continuous uniform distribution on [0,4]). Suppose that \(X, Y, Z\) are independent. Estimate \(P(X^2 Y>Z)\).
n <- 100000
x <- rnorm(n, 2, 1)
y <- rexp(n, 2)
z <- runif(n, 0, 4)
mean(x ^ 2 * y > z)
## [1] 0.40801
Ex 4
Person \(A\) generates a random variable \(X \sim N(2, 1)\) and Person \(B\) generates a random variable \(Z \sim Unif(0, 4)\). If \(X < Z\), Person \(A\) will discard \(X\) and generate another random variable \(Y \sim Exp(0.5)\). Find the probability that the number generated by \(A\) is greater than that by \(B\).
n <- 100000
greater <- rep(0, n)
for (i in 1:n) {
  X <- rnorm(1, 2, 1)
  Z <- runif(1, 0, 4)
  if (X < Z) {
    Y <- rexp(1, rate = 0.5)
    greater[i] <- Y > Z
  } else {
    greater[i] <- 1 # 1 means A's number is greater than B's
  }
}
mean(greater)
## [1] 0.63775
Remark: you may find that the following code gives you almost the same answer. Why?
n <- 100000
x <- rnorm(n, 2, 1)
y <- rexp(n, 0.5)
z <- runif(n, 0, 4)
mean(pmax(x, y) > z)
## [1] 0.63977
We will see additional simulation examples after we discuss some programming in R.
There are many important topics that we will not discuss:
- algorithms for simulating random variables
  - inverse transform
  - acceptance-rejection
  - Markov Chain Monte Carlo
- methods to reduce variance in simulation
  - control variates
  - antithetic variates
  - importance sampling