Chapter 21 Using C++ in R with Rcpp

Sometimes R code is not fast enough no matter how hard you try
In that case, you can consider writing that slow part of your program in C++
With the package Rcpp, you can connect C++ with R easily
A typical bottleneck that C++ can address is loops that cannot be (easily) vectorized because subsequent iterations depend on previous ones (e.g., Gibbs sampling when one perform Bayesian inference)

Basic references: https://adv-r.hadley.nz/rcpp.html (this notes is almost completely following this link)

More examples: https://gallery.rcpp.org/

Useful Reference: https://teuder.github.io/rcpp4everyone_en/

Installation:

On Windows, install Rtools (this is not a R package).
On Mac, install Xcode from the app store.

Then, install Rcpp using install.packages("Rcpp") in R.

21.1 Basic Examples

library(Rcpp)

Consider this R function that add \(3\) numbers:

add_R <- function(x, y, z) {
  x + y + z
}
add_R(1, 2, 3)
## [1] 6

Let’s write the C++ equivalent:

# need to use ' '  or " " inside cppFunction()
cppFunction(
'int add_C(int x, int y, int z) {
  int sum = x + y + z;
  return sum;
}'
)

When you run this code, Rcpp will compile the C++ code and construct an R function that connects to the compiled C++ function.

# add works like a regular R function
add_C
## function (x, y, z) 
## .Call(<pointer: 0x00007ff9bb9e1610>, x, y, z)

add_C(1, 2, 3)
## [1] 6

Important differences for writing functions between R and C++:

	R	C++
Declare output type	no	yes
Declare input type	no	yes
Require explicit return statement	no	yes
Statements end with;	no	yes

The common data types are:

Type	Vector	Scalar	Matrix
numeric	`NumericVector`	`double`	`NumericMatrix`
integer	`IntegerVector`	`int`	`IntegerMatrix`
character	`CharacterVector`	`String`	`CharacterMatrix`
logical	`LogicalVector`	`bool`	`LogicalMatrix`

21.2 Scalar Input, Scalar Output

sign_R <- function(x) {
  if (x > 0) {
    1
  } else if (x == 0) {
    0
  } else {
    -1
  }
}

cppFunction(
'int sign_C(int x) {
  if (x > 0) {
    return 1;
  } else if (x == 0) {
    return 0;
  } else {
    return -1;
  }
}'
)

if syntax is identical

21.3 Vector Input, Scalar Ouptut

sum_R <- function(x) {
  total <- 0
  n <- length(x)
  for (i in 1:n) {
    total <- total + x[i]
  }
  total
}

cppFunction(
'double sum_C(NumericVector x) {
  int n = x.size();
  double total = 0;
  for (int i = 0; i < n; ++i) {
    total = total + x[i];
  }
  return total;
}'
)

Need to declare the type of each variable (n, total, i)
for loop syntax is different.
In C++, need to use =, not <-
Can also use total += x[i]
To find the length of a vector in C++, use .size()
IMPORTANT: in C++, vector indices start at \(0\). If your vector has \(n\) elements, the last position is at \(n-1\).

Comparing performance using mark() from the package bench

library(bench)
x <- runif(1000)
mark(
  sum(x),
  sum_R(x),
  sum_C(x)
)
## # A tibble: 3 × 6
##   expression      min   median `itr/sec` mem_alloc
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>
## 1 sum(x)        1.3µs    1.4µs   516137.        0B
## 2 sum_R(x)     25.4µs   27.6µs    35059.   23.21KB
## 3 sum_C(x)      1.7µs    1.9µs   420129.    2.49KB
## # ℹ 1 more variable: `gc/sec` <dbl>

You can see sum_R is much slower than the built-in sum (which is highly optimized).

sum_C is much faster than sum_R

21.4 Vector Input, Vector Output

Consider a function that computes \[\begin{equation*} \sqrt{(y_i - x)^2}, \quad i=1,\ldots,n. \end{equation*}\]

# vectorized R function
pdist_R <- function(x, y) {
  sqrt((y - x) ^ 2)
}

# R function with for loop
pdist_R_for <- function(x, y) {
  n <- length(y)
  out <- rep(0, n)
  for (i in 1:n) {
    out[i] = sqrt((y[i] - x)^2)
  }
  out
}

cppFunction(
'NumericVector pdist_C(double x, NumericVector y) {
  int n = y.size();
  NumericVector out(n);

  for(int i = 0; i < n; ++i) {
    out[i] = sqrt(pow(y[i] - x, 2.0));
  }
  return out;
}'
)

To create a numeric vector called out with length \(n\), use NumericVector out(n)
C++ uses pow(), not ^, for exponentiation

Compare the performance:

y <- runif(1e6)
mark(
  pdist_R(0.5, y),
  pdist_R_for(0.5, y),
  pdist_C(0.5, y)
)
## # A tibble: 3 × 6
##   expression           min  median `itr/sec` mem_alloc
##   <bch:expr>       <bch:t> <bch:t>     <dbl> <bch:byt>
## 1 pdist_R(0.5, y)   6.41ms  6.68ms     128.     7.63MB
## 2 pdist_R_for(0.5… 71.86ms  74.1ms      11.8    7.67MB
## 3 pdist_C(0.5, y)    3.5ms  3.81ms     253.     7.63MB
## # ℹ 1 more variable: `gc/sec` <dbl>

The C++ version is faster than the vectorized R function in this case.

In general, I recommend if you can use a vectorized R function, you do not have to write the C++ version (at least not the first part to be considered when you are optimizing your code).

21.5 Returning a list

Return a list with two elements with names sum and diff:

cppFunction('
List sum_diff_C(double x, double y) {
  return List::create(Named("sum") = x + y, Named("diff") = x - y);
}
')

sum_diff_C(2, 3)
## $sum
## [1] 5
## 
## $diff
## [1] -1

21.6 Using `sourceCpp`

We have illustrated how to write simple C++ functions with cppFunction(). For real problems, we have to break down our program in various pieces and we should use stand-alone C++ files.

For example, create a file called my_Rcpp.cpp that contains the following code:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double mean_C(NumericVector x) {
  int n = x.size();
  double total = 0;

  for(int i = 0; i < n; ++i) {
    total = total + x[i];
  }
  return total / n;
}

In R, use sourceCpp() to compile your my_Rcpp.cpp file.

sourceCpp("my_Rcpp.cpp")

Compare the performance:

x <- runif(1e5)
mark(
  mean(x),
  mean_C(x)
)
## # A tibble: 2 × 6
##   expression      min   median `itr/sec` mem_alloc
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>
## 1 mean(x)     237.6µs  238.9µs     3977.        0B
## 2 mean_C(x)    78.1µs   78.4µs    12132.    2.49KB
## # ℹ 1 more variable: `gc/sec` <dbl>

21.7 Example: Gibbs Sampler

R code:

gibbs_R <- function(N, thin) {
  mat <- matrix(nrow = N, ncol = 2)
  x <- 0
  y <- 0

  for (i in 1:N) {
    for (j in 1:thin) {
      x <- rgamma(1, 3, y * y + 4)
      y <- rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))
    }
    mat[i, ] <- c(x, y)
  }
  mat
}

C++:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix gibbs_C(int N, int thin) {
  NumericMatrix mat(N, 2);
  double x = 0;
  double y = 0;

  for(int i = 0; i < N; i++) {
    for(int j = 0; j < thin; j++) {
      x = rgamma(1, 3, 1 / (y * y + 4))[0];
      y = rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))[0];
    }
    mat(i, 0) = x;
    mat(i, 1) = y;
  }
  return(mat);
}

Use ( instead of [ to index into the matrix
[0] is to convert the vector output into scalar

C++ version is about 20 times faster:

mark(
  gibbs_R(100, 10),
  gibbs_C(100, 10),
  check = FALSE
)
## # A tibble: 2 × 6
##   expression          min   median `itr/sec` mem_alloc
##   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>
## 1 gibbs_R(100, …   2.81ms   3.78ms      244.    4.94MB
## 2 gibbs_C(100, …    224µs 269.75µs     3612.     4.1KB
## # ℹ 1 more variable: `gc/sec` <dbl>

21.8 More Details

Comment in C++ is //.

Vector

Access an individual element of a vector using [] or ().

R	Rcpp
`x <- rep(0, 5)`	`NumericVector x(5);`
`x <- rep(1, 5)`	`NumericVector x(5, 1);`
`x <- c(1, 2, 3)`	`NumericVector x = {1, 2, 3};`
`x[2]`	`x[1]` or `x(1)`

Elementwise arithmetic operations are possible for NumericVector

For example:

cppFunction('
NumericVector vector_sum(NumericVector x, NumericVector y) {
  NumericVector sum = x + y;
  return sum;
}
')

x <- 1:5
y <- 2:6
vector_sum(x, y)
## [1]  3  5  7  9 11

For comparison operators, we need to use logical vectors with LogicalVector.