Chapter 2 Probability and Simulation
Optional Reading: R Cookbook Ch8
2.1 Probability Distributions
Using the normal distribution as an example:
Function | Purpose |
---|---|
dnorm | Normal density |
pnorm | Normal CDF |
qnorm | Normal quantile function |
rnorm | Normal random variables |
Examples
Density of N(2, 3²) at 5.
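The density can be evaluated with dnorm; a minimal sketch:

```r
# density of N(2, 3^2) at x = 5
dnorm(5, mean = 2, sd = 3)
## [1] 0.08065691
```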
P(X ≤ 3), where X ∼ N(2, 3²).
pnorm(3, mean = 2, sd = 3)
## [1] 0.6305587
# "mean =" and "sd =" are optional
pnorm(3, 2, 3)
## [1] 0.6305587
Generate 10 random variables, each following N(3, 4²).
rnorm(10, 3, 4)
## [1] 3.0697823 5.7833862 -0.3445385 1.7289238
## [5] 5.2347737 -2.6807430 2.6884267 4.2905353
## [9] 3.2931281 2.9978411
95th percentile of N(0,1): find q such that P(Z≤q)=0.95.
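One way to compute this with qnorm:

```r
qnorm(0.95) # 95th percentile of N(0, 1)
## [1] 1.644854
```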
Plotting the normal density
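A sketch of one way to plot the standard normal density, over an assumed range of -4 to 4:

```r
x <- seq(-4, 4, by = 0.1)
plot(x, dnorm(x), type = "l") # "l" for a line plot
```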
2.1.1 Common Distributions
Discrete distribution | R name | Parameters |
---|---|---|
Binomial | binom | n = number of trials; p = probability of success for one trial |
Geometric | geom | p = probability of success for one trial |
Negative binomial (NegBinomial) | nbinom | size = number of successful trials; either prob = probability of successful trial or mu = mean |
Poisson | pois | lambda = mean |
Continuous distribution | R name | Parameters |
---|---|---|
Beta | beta | shape1; shape2 |
Cauchy | cauchy | location; scale |
Chi-squared (Chisquare) | chisq | df = degrees of freedom |
Exponential | exp | rate |
F | f | df1 and df2 = degrees of freedom |
Gamma | gamma | shape; either rate or scale |
Log-normal (Lognormal) | lnorm | meanlog = mean on logarithmic scale; sdlog = standard deviation on logarithmic scale |
Logistic | logis | location; scale |
Normal | norm | mean; sd = standard deviation |
Student's t (TDist) | t | df = degrees of freedom |
Uniform | unif | min = lower limit; max = upper limit |
To get help on the distributions:
?dnorm
?dbeta
?dcauchy
# the help pages for the following distributions use different names
?TDist
?Chisquare
?Lognormal
Examples (Using Binomial as an Example)
dbinom(2, 10, 0.6) # p_X(2), p_X is the pmf of X, X ~ Bin(n=10, p=0.6)
## [1] 0.01061683
pbinom(2, 10, 0.6) # F_X(2), F_X is the CDF of X, X ~ Bin(n=10, p=0.6)
## [1] 0.01229455
qbinom(0.5, 10, 0.6) # 50th percentile of X
## [1] 6
rbinom(4, 10, 0.6) # generate 4 random variables from Bin(n=10, p=0.6)
## [1] 8 4 9 5
x <- 0:10
plot(x, dbinom(x, 10, 0.6), type = "h") # "h" for histogram like vertical lines
2.1.2 Exercises
- The average number of trucks arriving on any one day at a truck depot in a certain city is known to be 12. Assuming the number of trucks arriving on any one day has a Poisson distribution, what is the probability that on a given day fewer than 9 (strictly less than 9) trucks will arrive at this depot?
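One possible solution using ppois: since the number of trucks is integer-valued, P(X < 9) = P(X ≤ 8).

```r
# X ~ Poisson(lambda = 12); P(X < 9) = P(X <= 8)
ppois(8, lambda = 12) # about 0.155
```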
- Let Z∼N(0,1). Find c such that
- P(Z≤c)=0.1151
- P(1≤Z≤c)=0.1525
- P(−c≤Z≤c)=0.8164.
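For the first two parts, one possible approach:

```r
# (a) P(Z <= c) = 0.1151
qnorm(0.1151) # about -1.2
# (b) P(1 <= Z <= c) = 0.1525, so P(Z <= c) = P(Z <= 1) + 0.1525
qnorm(pnorm(1) + 0.1525) # about 2.5
```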
# P(0 <= Z <= c) = 0.8164/2
# P(Z <= c) = 0.8164/2 + 0.5
c <- qnorm(0.8164 / 2 + 0.5)
# test our answer
pnorm(c) - pnorm(-c)
## [1] 0.8164
- Plot the density of a chi-squared distribution with degrees of freedom 4, from x=0 to x=10.
- Find the 95th percentile of this distribution.
# note that a chi-squared r.v. is nonnegative
x <- seq(0, 10, by = 0.1)
plot(x, dchisq(x, df = 4), type = "l")
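The 95th percentile can then be obtained with qchisq:

```r
qchisq(0.95, df = 4) # 95th percentile of chi-squared with 4 df
## [1] 9.487729
```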
- Simulate 10 Bernoulli random variables with parameter 0.6.
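A Bernoulli(p) random variable is a binomial with one trial, so one sketch:

```r
rbinom(10, size = 1, prob = 0.6) # 10 Bernoulli(0.6) draws, each 0 or 1
```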
- Plot the Poisson pmf with parameter 2 from x=0 to x=10.
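One way to draw the pmf, using type = "h" as in the binomial example:

```r
x <- 0:10
plot(x, dpois(x, lambda = 2), type = "h")
```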
- Draw a plot to illustrate that the 97.5th percentile of the t distribution will be getting closer to that of the standard normal distribution when the degrees of freedom increases.
x <- 10:200
plot(x, qt(0.975, df = x), type = "l", ylim = c(1.9,2.3))
# add a horizontal line with value at qnorm(0.975)
# lty = 2 for dashed line, check ?par
abline(h = qnorm(0.975), lty = 2)
2.2 Simulation
We have already seen how to use functions like runif, rnorm, and rbinom to generate random variables. R actually generates a pseudo-random number sequence: a deterministic sequence of numbers that approximates the properties of random numbers.
The pseudo-random number sequence will be the same if it is initialized with the same seed, which can be used to reproduce the same simulation results or to debug.
# every time you run the first two lines, you get the same result
set.seed(1)
runif(5)
## [1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
# every time you run the following code, you get a different result
runif(5)
## [1] 0.89838968 0.94467527 0.66079779 0.62911404
## [5] 0.06178627
Sampling from discrete distributions
Usage of sample:
sample(x, size, replace = FALSE, prob = NULL)
See also ?sample.
sample(10) # random permutation of integers from 1 to 10
## [1] 3 1 5 8 2 6 10 9 4 7
sample(10, replace = T) # sample with replacement
## [1] 5 9 9 5 5 2 10 9 1 4
sample(c(1, 3, 5), 5, replace = T)
## [1] 5 3 3 3 3
# simulate 20 random variables from a discrete distribution
sample(c(-1,0,1), size = 20, prob = c(0.25, 0.5, 0.25), replace = T)
## [1] -1 1 1 -1 0 0 1 1 0 -1 0 0 0 0 0 1
## [17] 1 0 -1 0
Example: Suppose we have a fair coin and we play a game. We flip the coin; we win $1 if the result is heads and lose $1 if it is tails. We play the game 100 times and are interested in the cumulative profit.
set.seed(1) # R actually generates pseudo random numbers
# setting the seed ensures that each time you will get the same result
# for illustration, code debugging, reproducibility
profit <- sample(c(-1, 1), size = 100, replace = T)
plot(cumsum(profit), type = "l")
Example: You have two dice A and B. For die A, there are 6 sides with numbers 1,2,3,4,5,6 and the corresponding probability of getting these values are 0.1,0.1,0.1,0.1,0.1,0.5. For die B, there are 4 sides with numbers 1,2,3,7 and the corresponding probability of getting these values are 0.3,0.2,0.3,0.2. You roll the two dice independently. Estimate P(X>Y) using simulation, where X is the result from die A and Y is the result from die B.
n <- 10000 # number of simulations
X <- sample(1:6, size = n, replace = TRUE, prob = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.5))
Y <- sample(c(1, 2, 3, 7), size = n, replace = TRUE, prob = c(0.3, 0.2, 0.3, 0.2))
mean(X > Y)
## [1] 0.6408
Why does the sample mean approximate the required probability? Recall the strong law of large numbers (SLLN). Let X_1, \dots, X_n be independent and identically distributed random variables with mean \mu := E(X). Let \bar{X}_n := \frac{1}{n} \sum^n_{i=1} X_i. Then
\bar{X}_n \overset{a.s.}{\rightarrow} \mu.
Note: a.s. means almost surely. The above convergence means P(\lim_{n \rightarrow \infty} \bar{X}_n = \mu) = 1. Note that the expectation of an indicator random variable is the probability that the corresponding event will happen:
P(X>Y) = E(I(X>Y)).
To apply SLLN, we just need to recognize the underlying random variable is I(X>Y). Then, with probability 1, the sample mean
\frac{1}{n} \sum^n_{i=1} I(X_i > Y_i) \rightarrow P(X>Y).
The quantity on the LHS is what we compute in mean(X > Y) (note that we are using vectorized comparison).
2.3 Additional Exercises
Ex 1
Let X \sim N(mean = 2, sd = 1), Y \sim Exp(rate = 2), Z \sim Unif(0, 4) (continuous uniform distribution on [0,4]). Suppose X, Y, Z are independent. Estimate P(\max(X,Y)>Z).
Recall the difference between pmax and max:
# always try with simple examples when you test the usage of functions
x <- c(1, 2, 3)
y <- c(0, 5, 10)
max(x, y)
## [1] 10
pmax(x, y)
## [1] 1 5 10
Answer:
n <- 100000
x <- rnorm(n, 2, 1)
y <- rexp(n, 2)
z <- runif(n, 0, 4)
mean(pmax(x, y) > z)
## [1] 0.51129
Ex 2
Let X \sim N(mean = 2, sd = 1), Y \sim Exp(rate = 2), Z \sim Unif(0, 4) (continuous uniform distribution on [0,4]). Suppose that X, Y, Z are independent. Estimate P(\min(X,Y)>Z).
n <- 100000
x <- rnorm(n, 2, 1)
y <- rexp(n, 2)
z <- runif(n, 0, 4)
mean(pmin(x, y) > z) # what is the difference between pmin and min?
## [1] 0.11283
Ex 3
Let X \sim N(mean = 2, sd = 1), Y \sim Exp(rate = 2), Z \sim Unif(0, 4) (continuous uniform distribution on [0,4]). Suppose that X, Y, Z are independent. Estimate P(X^2 Y>Z).
n <- 100000
x <- rnorm(n, 2, 1)
y <- rexp(n, 2)
z <- runif(n, 0, 4)
mean(x ^ 2 * y > z)
## [1] 0.4084
Ex 4
Person A generates a random variable X \sim N(2, 1) and Person B generates a random variable Z \sim Unif(0, 4). If X < Z, Person A will discard X and generate another random variable Y \sim Exp(0.5). Find the probability that the number generated by A is greater than that by B.
n <- 100000
x <- rnorm(n, 2, 1)
y <- rexp(n, 0.5)
z <- runif(n, 0, 4)
# A's number exceeds Z iff X > Z, or X < Z and Y > Z, i.e. max(X, Y) > Z
mean(pmax(x, y) > z)
## [1] 0.63853
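Equivalently, A's final number can be simulated directly with ifelse, mirroring the discard step; this is one possible sketch, not the only approach:

```r
n <- 100000
x <- rnorm(n, 2, 1)      # A's first draw, X ~ N(2, 1)
z <- runif(n, 0, 4)      # B's number, Z ~ Unif(0, 4)
y <- rexp(n, 0.5)        # A's second draw, Y ~ Exp(0.5), used only when X < Z
a <- ifelse(x < z, y, x) # A's final number: X is discarded whenever X < Z
mean(a > z)              # should be close to the pmax-based estimate above
```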
We will see additional simulation examples after we talk about some programming in R.
There are many important topics that we will not discuss:
- algorithms for simulating random variables
- inverse transform
- acceptance-rejection
- Markov Chain Monte Carlo
- methods to reduce variance in simulation
- control variates
- antithetic variates
- importance sampling