Random sampling from a fixed set is used in many areas of statistical computing. The performance of this operation can be critical, especially when the sampled set is large. The fast RNGs provided in this package make very fast sampling possible when combined with suitably fast algorithms.

By combining fast RNGs with a fast methods for creating integers in a range one gets good performance for sampling with replacement:

```
library(dqrng)
m <- 1e6
n <- 1e4
bm <- bench::mark(sample.int(m, n, replace = TRUE),
sample.int(1e4*m, n, replace = TRUE),
dqsample.int(m, n, replace = TRUE),
dqsample.int(1e4*m, n, replace = TRUE),
check = FALSE)
```

expression | min | median | itr/sec | mem_alloc |
---|---|---|---|---|

sample.int(m, n, replace = TRUE) | 433.3µs | 475µs | 1999.342 | 49.1KB |

sample.int(10000 * m, n, replace = TRUE) | 846µs | 900.4µs | 1072.688 | 80.7KB |

dqsample.int(m, n, replace = TRUE) | 47.6µs | 50µs | 17170.274 | 58.2KB |

dqsample.int(10000 * m, n, replace = TRUE) | 53.3µs | 57.5µs | 14311.590 | 82.9KB |

Note that sampling from `10^10`

integers triggers “long-vector
support” in R.

When sampling *without* replacement one has to consider an
appropriate algorithm for making sure that no entry is repeated. When
more than 50% of the population are sampled, dqrng shuffles an
appropriate part of the full list and returns that. The algorithm used
in R is similar but dqrng has the edge with respect to performance:

```
library(dqrng)
m <- 1e6
n <- 6e5
bm <- bench::mark(sample.int(m, n),
dqsample.int(m, n),
check = FALSE, min_iterations = 50)
```

expression | min | median | itr/sec | mem_alloc |
---|---|---|---|---|

sample.int(m, n) | 55.96ms | 63.7ms | 15.45168 | 6.11MB |

dqsample.int(m, n) | 8.16ms | 10.3ms | 90.04322 | 6.1MB |

For lower sampling ratios a set based rejection sampling algorithm is used by dqrng. In principle, R can make use of a similar algorithm based on a hashset. However, it is only used for larger input vectors even though it is faster than the default method. The algorithm in dqrng, which is based on a bitset, is even faster, though:

```
library(dqrng)
m <- 1e6
n <- 1e4
bm <- bench::mark(sample.int(m, n),
sample.int(m, n, useHash = TRUE),
dqsample.int(m, n),
check = FALSE)
```

expression | min | median | itr/sec | mem_alloc |
---|---|---|---|---|

sample.int(m, n) | 1.19ms | 1.94ms | 433.0223 | 3.85MB |

sample.int(m, n, useHash = TRUE) | 590.9µs | 657.81µs | 1427.8970 | 169.65KB |

dqsample.int(m, n) | 99.66µs | 102.62µs | 8678.4395 | 39.11KB |

As one decreases the sampling rate even more, dqrng switches to a hashset based rejection sampling. Both hashset based methods have similar performance and are much faster than R’s default method.

```
library(dqrng)
m <- 1e6
n <- 1e2
bm <- bench::mark(sample.int(m, n),
sample.int(m, n, useHash = TRUE),
dqsample.int(m, n),
check = FALSE)
```

expression | min | median | itr/sec | mem_alloc |
---|---|---|---|---|

sample.int(m, n) | 478.87µs | 692.85µs | 799.9916 | 3.82MB |

sample.int(m, n, useHash = TRUE) | 14.64µs | 16.57µs | 56302.7572 | 3.98KB |

dqsample.int(m, n) | 4.76µs | 5.42µs | 158618.5998 | 448B |

For larger sampling ranges R uses the hashset by default, though
`dqsample.int`

is still faster:

```
library(dqrng)
m <- 1e10
n <- 1e5
bm <- bench::mark(sample.int(m, n),
dqsample.int(m, n),
check = FALSE)
```

expression | min | median | itr/sec | mem_alloc |
---|---|---|---|---|

sample.int(m, n) | 12.45ms | 16.43ms | 58.97732 | 1.76MB |

dqsample.int(m, n) | 1.98ms | 3.04ms | 302.74952 | 781.3KB |

The following methods are used for sampling without replacement. The algorithms are presented in R-like pseudo code, even though the real implementation is in C++. For sampling rates above 50%, a partial Fisher-Yates shuffle is used:

```
no_replace_shuffle <- function(m, n) {
tmp <- seq_len(m)
for (i in seq_len(n))
swap(tmp[i], tmp[i + random_int(m-i)])
tmp[1:n]
}
```

where `random_int(m-i)`

returns a random integer in
`[0, m-i]`

. Since the full population is kept in memory, this
method is only suitable for high selection rates. One could expect that
reservoir
sampling should work well for lower selection rates. However, in my
tests set based algorithms were faster:

```
no_replace_set <- function(m, n) {
result <- vector(mode = "...", length = n) # integer or numeric
elems <- new(set, m, n) # set object for storing n objects out of m possible values
for (i in seq_len(n))
while (TRUE) {
v = random_int(m)
if (elems.insert(v)) {
result[i] = v
break
}
}
result
}
```

Here `elems.insert(v)`

returns `TRUE`

if the
insert was successful, i.e. `v`

was not in `elems`

before, and `FALSE`

otherwise. There are different strategies
for implementing such a set. For intermediate sampling rates (currently
between 0.1% and 50%) dqrng uses a bitset, i.e. a vector of
`m`

bits each representing one of the possible values. For
lower sampling rates the memory usage of this algorithm is to expensive,
which is why a hashset^{1} is used, since there the used memory scales
with `n`

and not with `m`

. One could expect that
Robert Floyd’s
sampling algorithm would be superior, but this was not the case in
my tests, probably because it requires a final shuffling of the result
to get a random *permutation* instead of a random
*combination*.

For the specialists: Open addressing with a power-of-two size between 1.5 and 3 times

`n`

, identity hash function for the stored integers and quadratic probing.↩︎