Why does Monte Carlo work with pseudo-random numbers?

The Monte Carlo (MC) method is a useful procedure for approximating integrals. It is based on the Strong Law of Large Numbers (SLLN), which I state in a form suggestive of Monte Carlo:

Theorem: Let \(\{X_n\}_{n\in \mathbb{N}}\) be an independent and identically distributed sequence of random variables on some probability space \((\Omega,\mathcal{F},\mu)\). Let \(g\) be a measurable map such that \(g(X_1)\in L^1(\mu)\). Then \[\frac{1}{n}\sum_{i=1}^n g(X_i) \rightarrow \mathbb{E}[g(X_1)],\qquad \mu\text{-a.s.}\]

Assume that we want to approximate the integral \(\int_0^1 x^2\,dx\) using Monte Carlo. We first choose the length of the random sequence, say 10,000. Then we simulate that many independent uniform\((0,1)\) random variables and take the mean of their squares. This can be done in R with only two lines of code:

x <- runif(1e4)
mean(x^2)
# [1] 0.335881943442855

We can see that the result is close to the true value \(1/3\). Now, the random numbers produced by R (and by programming languages in general) are not really random: they form a deterministic sequence that depends on an initial parameter called the seed. Hence the question: if the simulated numbers are not really random (whatever that means), why does the Monte Carlo method converge to the desired value?
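
Indeed, this determinism is easy to observe in R: resetting the seed reproduces exactly the same "random" numbers.

set.seed(1)
a <- runif(5)
set.seed(1)     # reset the generator to the same initial state
b <- runif(5)
identical(a, b) # TRUE: the output is a deterministic function of the seed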

It turns out that the convergence is due to results from dynamical systems, not to the SLLN. In what follows I outline how this works, starting with some definitions.

Definition: Let \((\Omega,\mathcal{F},\mu)\) be a probability space. A measurable transformation \(\tau:\Omega \rightarrow \Omega\) is said to preserve \(\mu\) if \(\mu(\tau^{-1}(A))=\mu(A)\) for all \(A\in\mathcal{F}\).

Definition: Let \((\Omega,\mathcal{F},\mu)\) be a probability space and let \(\tau:\Omega \rightarrow \Omega\) preserve \(\mu\). The quadruple \((\Omega,\mathcal{F},\mu,\tau)\) is called a dynamical system.

Definition: A measure preserving transformation \(\tau: (\Omega,\mathcal{F},\mu)\rightarrow (\Omega,\mathcal{F},\mu)\) is called ergodic if for any \(A\in\mathcal{F}\) with \(\tau^{-1}(A)=A\), either \(\mu(A)=0\) or \(\mu(A^c)=0\).
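
Example: Take \(\Omega=[0,1)\) with the Borel \(\sigma\)-algebra and Lebesgue measure \(\lambda\), and let \(\tau(x)=x+\alpha \mod 1\) for a fixed \(\alpha\in(0,1)\). Since \(\lambda\) is translation invariant, \(\tau\) preserves \(\lambda\); moreover, \(\tau\) is ergodic if and only if \(\alpha\) is irrational.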

In dynamical systems one studies the properties of sequences obtained by successive applications of a measure-preserving map \(\tau\), i.e., sequences like \[x_0,\tau(x_0),(\tau\circ\tau)(x_0),\ldots,\tau^n(x_0),\ldots\] where \(\tau^n\) denotes the composition of \(\tau\) with itself \(n\) times and \(x_0\) is some initial value.
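
In R, such an orbit can be computed with a short helper function (a sketch; tau and x0 here are placeholders for an arbitrary map and initial value):

orbit <- function(tau, x0, n) {
  x <- numeric(n)
  x[1] <- x0
  for (i in 2:n) x[i] <- tau(x[i - 1])  # x[i] holds tau^(i-1)(x0)
  x
}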

The following theorem is a corollary of Birkhoff’s ergodic theorem. It is the main theorem of this discussion.

Theorem: Let \((\Omega,\mathcal{F},\mu,\tau)\) be a dynamical system with \(\tau\) ergodic, and let \(g\) be a map in \(L^1(\mu)\). Then for \(\mu\)-almost every \(x_0\) \[\frac{1}{n}\sum_{i=1}^n g(\tau^i(x_0))\rightarrow \int g \, d\mu.\]
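
To see the theorem in action, take the rotation example above with \(\alpha=\sqrt{2}\) (irrational, hence ergodic; for rotations the convergence in fact holds for every \(x_0\), not just almost every one) and \(g(x)=x^2\), so that \(\int g \, d\lambda = 1/3\). Using the orbit helper sketched above:

tau <- function(x) (x + sqrt(2)) %% 1  # ergodic rotation of [0,1)
x <- orbit(tau, x0 = 0.5, n = 1e5)     # one long orbit
mean(x^2)                              # Birkhoff average, close to 1/3

Of course, this rotation is not the generator R actually uses (R's default, the Mersenne Twister, is a much more elaborate deterministic map), but the idea is the same.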

So, the pseudo-random number generators in a computer use an ergodic dynamical system to simulate randomness. They focus on simulating i.i.d. sequences of uniform\((0,1)\) random variables; \(L^1\) transformations of these sequences are then used to obtain simulations from other distributions. The previous theorem shows why this works. One example should make the general idea clear:

Suppose we want to approximate the second moment of an exponential distribution with density \(f(x)=e^{-x}\) using Monte Carlo. The computer already has a uniform\((0,1)\) generator. To each generated value \(u\) we apply the inverse cdf of the exponential, which gives \(-\log(1-u)\) (here \(\log\) denotes the natural logarithm). Then we square these numbers and average them. In other words, we compute \[\frac{1}{n}\sum_{i=1}^n(\log(1-u_i))^2,\]

where \(\{u_i\}_{i=1}^n\) is our generated sequence. By the ergodic theorem, this converges to \(\int_0^1(\log(1-u))^2\,du\). Making the change of variable \(x=-\log(1-u)\), so that \(u=1-e^{-x}\) and \(du=e^{-x}\,dx\), we get \[\int_0^1(\log(1-u))^2\,du = \int_0^{\infty}x^2e^{-x}\,dx,\] which is the integral we wanted to approximate.
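
As a quick numerical check in R, in the same spirit as the first example (the exact value is \(\int_0^{\infty}x^2e^{-x}\,dx = \Gamma(3) = 2\)):

u <- runif(1e4)
mean((-log(1 - u))^2)  # close to the exact value 2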

In another post I will give a proof of Birkhoff's ergodic theorem using Markov chains.
