Problems

2002 Paper 2 Q12

D: 1600.0 B: 1500.6

probability approximating binomial to normal binomial distribution normal approximation continuity correction nested binomial cumulative distribution function critical evaluation

On \(K\) consecutive days each of \(L\) identical coins is thrown \(M\) times. For each coin, the probability of throwing a head in any one throw is \(p\) (where \(0 < p < 1\)). Show that the probability that on exactly \(k\) of these days more than \(l\) of the coins will each produce fewer than \(m\) heads can be approximated by \[ {K \choose k}q^k(1-q)^{K-k}, \] where \[ q=\Phi\left( \frac{2h-2l-1}{2\sqrt{h} }\right), \ \ \ \ \ \ h=L\Phi\left( \frac{2m-1-2Mp}{2\sqrt{ Mp(1-p)}}\right) \] and \(\Phi(\cdot)\) is the cumulative distribution function of a standard normal variate. Would you expect this approximation to be accurate in the case \(K=7\), \(k=2\), \(L=500\), \(l=4\), \(M=100\), \(m=48\) and \(p=0.6\;\)?

Solution: Let \(H_i\) be the number of heads the \(i\)th coin throws on a given day. Then \(H_i \sim B(M,p)\), and the probability that a given coin produces fewer than \(m\) heads is \(p_h = \P(H_i < m)\). Let \(C\) be the number of coins producing fewer than \(m\) heads on that day; then \(C \sim B(L, p_h)\). The probability that more than \(l\) of the coins produce fewer than \(m\) heads is therefore \(\P(C > l)\). Finally, since the days are independent, the probability that on exactly \(k\) days more than \(l\) of the coins will produce fewer than \(m\) heads is: \[ \binom{K}{k} \cdot \P(C > l)^k \cdot (1-\P(C > l))^{K-k} \]

Let's start by assuming that all our binomials can be approximated by normal distributions. \(B(M,p) \approx N(Mp, Mp(1-p))\) and so, applying a continuity correction: \begin{align*} p_h &= \P(H_i < m) \\ &\approx \P( \sqrt{Mp(1-p)}Z+Mp < m-\frac12) \\ &= \P \l Z < \frac{2m-2Mp-1}{2\sqrt{Mp(1-p)}} \r \\ &= \Phi\l\frac{2m-2Mp-1}{2\sqrt{Mp(1-p)}} \r \end{align*}

With \(h\) as defined in the question, \(B(L, p_h) \approx B \l L, \P \l Z < \frac{2m-2Mp-1}{2\sqrt{Mp(1-p)}} \r\r = B(L, \frac{h}{L}) \approx N(h, \frac{h(L-h)}{L})\). Therefore \begin{align*} \P(C > l) &= 1-\P(C \leq l) \\ &\approx 1- \P \l \sqrt{\frac{h(L-h)}{L}} Z + h \leq l+\frac12 \r \\ &= 1 - \P \l Z \leq \frac{2l-2h+1}{2\sqrt{\frac{h(L-h)}{L}}}\r \\ &= 1- \Phi\l \frac{2l-2h+1}{2\sqrt{\frac{h(L-h)}{L}}} \r \\ &= \Phi\l \frac{2h-2l-1}{2\sqrt{\frac{h(L-h)}{L}}} \r \end{align*} If we can approximate \(\sqrt{1-\frac{h}{L}}\) by \(1\) (that is, if \(h\) is much smaller than \(L\)), then we obtain the approximation in the question.

Alternatively, \(B(L, \frac{h}{L}) \approx Po(h)\) and \(Po(h) \approx N(h,h)\), so we obtain: \begin{align*} \P(C > l) &= 1-\P(C \leq l) \\ &\approx 1 - \P(\sqrt{h} Z +h \leq l + \frac12) \\ &= 1 - \P \l Z \leq \frac{2l-2h+1}{2\sqrt{h}} \r \\ &= \Phi \l \frac{2h - 2l -1}{2\sqrt{h}}\r \end{align*} as required. [I think this is what the examiners expected.]
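
As a quick numerical check, the approximation derived above can be evaluated directly. The following is a minimal Python sketch using only the standard library; the function names `Phi` and `approx_probability` are illustrative choices, not part of the question, and the parameter values are those given in the question.

```python
from math import comb, erf, sqrt


def Phi(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def approx_probability(K, k, L, l, M, m, p):
    """Evaluate the approximation from the question.

    Returns the tuple (h, q, answer), where answer = C(K, k) q^k (1 - q)^(K - k).
    """
    # h = L * Phi((2m - 1 - 2Mp) / (2 sqrt(Mp(1 - p))))
    h = L * Phi((2 * m - 1 - 2 * M * p) / (2 * sqrt(M * p * (1 - p))))
    # q = Phi((2h - 2l - 1) / (2 sqrt(h)))
    q = Phi((2 * h - 2 * l - 1) / (2 * sqrt(h)))
    answer = comb(K, k) * q**k * (1 - q) ** (K - k)
    return h, q, answer


h, q, answer = approx_probability(K=7, k=2, L=500, l=4, M=100, m=48, p=0.6)
print(f"h = {h:.4f}, q = {q:.6f}, answer = {answer:.6f}")
```

This only evaluates the formula; it says nothing by itself about how accurate the approximation is.
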
Considering the case \(K=7\), \(k=2\), \(L=500\), \(l=4\), \(M=100\), \(m=48\) and \(p=0.6\), the first normal approximation relies on \(Mp\) and \(M(1-p)\) being large. They are \(60\) and \(40\) respectively, so this is likely a good approximation. The first approximation finds that \begin{align*} h &= 500 \cdot \Phi \l \frac{2 \cdot 48 - 2 \cdot 60 - 1}{2\sqrt{24}} \r \\ &= 500 \cdot \Phi \l \frac{-25}{2 \sqrt{24}} \r \\ &\approx 500 \cdot \Phi (-2.55) \\ &\approx 500 \cdot 0.0054 \\ &\approx 2.7 \end{align*} The second normal approximation will be good if \(L \cdot \frac{h}{L} = h \approx 2.7\) is large, but this is quite small. Therefore, we shouldn't expect this to be a good approximation. Moreover, although the first approximation is good in absolute terms, \(m = 48\) is about \(2.5\) standard deviations below the mean, so we might expect its percentage error to be large.

[Alternatively, using what I expect is the desired approach.] The approximation \(B(L, \frac{h}{L}) \approx Po(h)\) is acceptable since \(L > 50\) and \(h < 5\). The approximation \(Po(h) \approx N(h,h)\) is not acceptable since \(h\) is small (in particular \(h < 15\)).

Finally, we can compute all these values exactly using a modern calculator. \begin{array}{l|cc} & \text{correct} & \text{approx} \\ \hline p_h & 0.005760\ldots & 0.005362\ldots \\ \P(C > l) & 0.164522\ldots & 0.133319\ldots \\ \text{ans} & 0.231389\ldots & 0.182516\ldots \end{array} We can also see how the errors propagate, by doing the calculations assuming the previous steps are correct, and also including the Poisson step.
\begin{array}{lccc} & \text{correct} & \text{approx} & \text{using approx } p_h \\ \hline p_h & 0.005760\ldots & 0.005362\ldots & - \\ \P(C > l)\quad [Po(h)] & 0.164522\ldots & 0.165044\ldots & 0.134293\ldots \\ \P(C > l)\quad [N(h,h)] & 0.164522\ldots & 0.169953\ldots & 0.133319\ldots \\ \P(C > l)\quad [N(h,h(1-\frac{h}{L}))] & 0.164522\ldots & 0.169255\ldots & 0.132677\ldots \\ \text{ans} & 0.231389\ldots & 0.231389\ldots \end{array}

By doing this, we discover that the largest error actually comes not from the second approximation but from the small absolute (but large relative) error in the first approximation. This is, in fact, a coincidence; we can observe it by investigating the specific values being used. The first approximation looks as follows:

You might not be able to tell, but there are actually two plots on this chart. However, let's zoom in on the area we are worried about:

We can see there are small differences, which could be large in percentage terms (as we found when we computed them directly).
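
The size of these differences can also be checked numerically rather than read off a plot. The sketch below, in Python with only the standard library, compares the exact value of \(\P(H_i < m)\) with its continuity-corrected normal approximation and reports both the absolute and the relative error. The helper names `binom_cdf` and `normal_cdf` are illustrative, not taken from the solution.

```python
from math import comb, erf, sqrt


def binom_cdf(n, p, x):
    """P(X <= x) for X ~ B(n, p), summed term by term."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(x + 1))


def normal_cdf(x, mean, var):
    """P(Y <= x) for Y ~ N(mean, var)."""
    return 0.5 * (1.0 + erf((x - mean) / sqrt(2.0 * var)))


M, p, m = 100, 0.6, 48

# Exact tail: P(H < m) = P(H <= m - 1) with H ~ B(M, p).
exact = binom_cdf(M, p, m - 1)
# Normal approximation with continuity correction: P(N(Mp, Mp(1-p)) < m - 1/2).
approx = normal_cdf(m - 0.5, M * p, M * p * (1 - p))

abs_err = approx - exact
rel_err = abs_err / exact
print(f"exact = {exact:.6f}, approx = {approx:.6f}")
print(f"absolute error = {abs_err:.6f}, relative error = {rel_err:.1%}")
```

The absolute error is tiny, but because the probability itself is so small the relative error is several percent, and it is this relative error that feeds into \(h\).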

First, we can immediately see that if we just look at the distributions \(B(L, p_h)\) and \(B(L, p_{h_\text{approx}})\), we get quite different results, even before we do any further approximations.
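
To put numbers on this, one can compute \(\P(C > l)\) exactly under both values of the success probability, with no second approximation involved. This is a minimal Python sketch under that assumption; the variable names (`p_exact`, `p_approx`, `tail_exact`, `tail_approx`) are illustrative.

```python
from math import comb, erf, sqrt


def binom_cdf(n, p, x):
    """P(X <= x) for X ~ B(n, p), summed term by term."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(x + 1))


def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


K, k, L, l, M, m, p = 7, 2, 500, 4, 100, 48, 0.6

# Per-coin probability of fewer than m heads: exact, and normal approximation.
p_exact = binom_cdf(M, p, m - 1)
p_approx = Phi((2 * m - 1 - 2 * M * p) / (2 * sqrt(M * p * (1 - p))))

# P(C > l) computed exactly from the binomial in both cases.
tail_exact = 1 - binom_cdf(L, p_exact, l)
tail_approx = 1 - binom_cdf(L, p_approx, l)

# Exact final answer: exactly k of the K days have more than l such coins.
answer_exact = comb(K, k) * tail_exact**k * (1 - tail_exact) ** (K - k)

print(f"P(C > l) with exact p_h:  {tail_exact:.6f}")
print(f"P(C > l) with approx p_h: {tail_approx:.6f}")
print(f"exact final answer:       {answer_exact:.6f}")
```

Any gap between the two tail probabilities here is due entirely to the first approximation, since both are computed from an exact binomial.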

If we plot the probability distribution of \(B(L, p_h)\) vs \(N(Lp_h, Lp_h(1-p_h))\) we find that it is not a great approximation.

However, the CDF happens to be a very good approximation *just* for the value we care about. Very lucky, but not possible for someone sitting STEP to know at the time!
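
One way to see this without a plot is to tabulate the exact CDF of \(B(L, p_h)\) against the continuity-corrected CDF of \(N(Lp_h, Lp_h(1-p_h))\) at several values, and compare the error at \(l = 4\) with its neighbours. The following Python sketch does this using the exact \(p_h\); the range of values tabulated and the dictionary `errors` are illustrative choices.

```python
from math import comb, erf, sqrt


def binom_cdf(n, p, x):
    """P(X <= x) for X ~ B(n, p), summed term by term."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(x + 1))


def normal_cdf(x, mean, var):
    """P(Y <= x) for Y ~ N(mean, var)."""
    return 0.5 * (1.0 + erf((x - mean) / sqrt(2.0 * var)))


L, M, m, p = 500, 100, 48, 0.6

p_h = binom_cdf(M, p, m - 1)          # exact P(H < m)
mean, var = L * p_h, L * p_h * (1 - p_h)

errors = {}
for c in range(0, 9):
    exact = binom_cdf(L, p_h, c)              # exact P(C <= c)
    approx = normal_cdf(c + 0.5, mean, var)   # normal CDF with continuity correction
    errors[c] = approx - exact
    print(f"c = {c}: exact = {exact:.6f}, normal = {approx:.6f}, error = {errors[c]:+.6f}")
```

The error changes sign as \(c\) increases, so it has to pass close to zero somewhere; in this case that happens near \(c = 4\).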
