Nassim Nicholas Taleb recently tweeted out an interesting problem. I didn't know how to solve it mathematically, so I wanted to build a simulation in R.

Here's the problem:

PROBABILITY QUIZ: How randomness should almost never look random.

— NassimNicholasTaleb (@nntaleb) November 2, 2017

(spurious cancer cluster in Fooled by Randomness)@CutTheKnotMath pic.twitter.com/1Svrn0IB69

I like to visualize things, so I decided to simulate throwing 8 darts on a board first, then simulate the dart throwing model many times using the Monte Carlo method.

# Simulating dart throws

This uses the `raster`

and `tidyverse`

packages^{[1]}. The `raster`

package simulate throwing darts, while the `tidyverse`

gives us flexible ways to filter the data we're generating.

We need to do four main things:

- Create a 4x4 grid that will function as our dart board
- Generate 8 random points on the grid that function as our dart throws
- Count the number of darts in each section of our dart board
- Visualize the results

## Constructing a dart board

To build the dart board, we can use the raster package:

```
# create a raster with 4x4 grids
board <- raster(xmn=0, ymn=0, xmx=1, ymx=1, res=0.25)
```

## Generating dart throws

We need to generate an *extent* from our dart board raster. This forms a bounding box around our dart board raster, so we can calculate coordinates within the extent that map to the raster.

To generate a spatial sample, we pass in our dart board extent^{[2]}, the number of times we want to throw darts, and our method for throwing the darts, which is `random`

.

```
# throw our darts on the dart board
throws <- spsample(as(extent(board), 'SpatialPolygons'), 8, 'random')
```

## Counting darts

The raster package has a function called `rasterize`

that will tablulate how many times each grid was hit by a dart.^{[3]}

```
# calculate which grid each of our darts hit
board <- rasterize(throws, board, fun='count', background = 0)
```

## Visualizing the dart board

Let's take a look at our dart board now:

```
# plot the board
plot(board)
# plot the points
points(throws, pch=20)
```

In this exampe, none of our grids were hit by 3 or more darts.

# Simulate our dart board model many times

Now that we can visualize hitting one dart board with 8 darts, we want to determine the likelihood of one of our grids getting hit by 3 or more darts, per Taleb's challenge.

One simulation, shown above, gave us a little information about this probability. Using the Monte Carlo simulation method, we can run many simulations rapidly, then compare the results to get a much better idea of this probability. While we'll never *prove* the correct answer with this method, it will give us a practical answer very quickly.^{[4]}

## Turn our earlier work into a function

To create a Monte Carlo simulation, we're going to turn our earlier R snippets into a function that returns `TRUE`

or `FALSE`

, depending on if the dart board contains any grids with 3 or more darts:

```
throwDarts <- function(raster, n_throws) {
# throw our darts on the dart board
throws <- spsample(as(extent(board), 'SpatialPolygons'), n_throws, 'random')
# calculate which grid each of our darts hit
board <- rasterize(throws, board, fun='count', background = 0)
# save board to a table
board_table <- tbl_df(data.frame(coordinates(board), count=board[]))
# return TRUE if there are 3 or more darts in any grid, else return FALSE
ifelse(nrow(board_table %>% filter(count >= 3)) > 0, return(TRUE), return(FALSE))
}
```

The primary changes in this function are:

- We're saving the results of the board after counting the darts in each grid to a table, instead of plotting them in R.
- We're evaluating whether or not there are 3 or more darts in any grid within the function and returning
`TRUE`

or`FALSE`

depending on the results. - The parameter
`raster`

needs to be passed in (we'll use our`board`

raster from earlier), as well as the number of throws,`n_throws`

.

We can run this function with `throwDarts(board, 8)`

.

## Building the Monte Carlo simulator

Now that we have the `throwDarts`

function, all we need to do is run it many times and calculate the frequency of cases where it returns `TRUE`

divided by the number of simulations. This will return an estimated average for how frequently 3 or more darts will hit the same grid during a simulation.

Since we already did the heavy lifting, the Monte Carlo simulator is easy to build:

```
dartSims <- function(raster, n_throws, n_sims) {
# save an empty vector to store sim results
boardEval <- c()
# for loop to run throwDarts sims
for (i in seq(n_sims)) {
boardEval <- c(boardEval,
throwDarts(raster,
n_throws))
}
# calculates the frequency of boards with 3 or more darts in a single grid
hitRate <- sum(boardEval)/length(boardEval)
# returns this value when the function is evaluated
return(hitRate)
}
```

The bulk of the work is done in the `for loop`

, which runs through our `throwDarts`

function as many times as we tell it to, storing the results in the `boardEval`

vector.

Since `TRUE`

values equal 1 and `FALSE`

values equal 0 in R, we can write a quick one-liner to add up all the boards that have `TRUE`

values, then divide that by the number of boards we simulated.

This gives us a frequency for how often boards have a grid with 3 or more darts!

## Final calculation

Running `dartSims(board, 8, 100)`

will simulate throwing 8 darts at our dart board 100 times, returning the frequency of grids with 3 or more darts.

However... this value isn't likely the best answer! Running more simulations increases the likelihood of attaining a closer approximation of the true probability.

This is where judgement needs to come into play, as increasing the number of simulations increases the time and computing power required. There are diminishing returns past a certain point, too, so starting smaller and increasing until things start to stabilize is a good approach.

`dartSims(board, 8, 100000)`

gives a good approximation given these tradeoffs with a value of `0.16912`

^{[5]}. Taleb tweeted out that the solution is `0.169057`

, so this is a fantastic estimation for our purposes.

# Recap

We covered a lot of ground, first building a visualization of throwing darts at a virtual dart board before abstracting the dart throwing into a function that we could use in a Monte Carlo simulation. While we didn't answer Taleb's question definitively, we did calculate a practical answer quickly.

The Monte Carlo method is portable to similar problems, too. We sneakily implemented a technique that will work for geospatial problems, so keep an eye out for future write-ups along those lines (no pun intended).

You can get the complete R script here.

One of the best things about R is it's ecosystem. There's a package for almost any statistical or visualization task you can think of. ↩︎

You may notice the

`SpatialPolygon`

parameter, too. This is telling the sampling function that the dart board contains distinct grids. ↩︎We're using

`fun='count'`

because we want the counts of darts hitting each grid.`background=0`

tells rasterize to set grid values to 0 for the lonely grids that don't get hit by a dart. ↩︎As a quick philosophical aside, a practical, mostly correct answer now is good enough for me in most situations. ↩︎

Running

`set.seed(2001)`

just before executing`dartSims(board, 8, 10000)`

allows you to reproduce this result. R's random functions aren't really random, rather it uses a pseudorandom number generator behind the scenes.`set.seed`

allows us to tap into this and reproduce results by resetting to a starting point for a random number generation function. ↩︎