The statistics of problems and disasters
Sometimes, after a disaster, people say "Why didn't they build it better?", meaning that a levée should have been higher, a dam should have been stronger, and so on.
The easy answer is to build everything stronger and bigger than we think necessary. The hard part is answering the question "By how much?". The structure will cost more, and take longer to build. Will we ever know by how much we have over-built it? No. Only if there is a problem will we have an answer, just as the only sure diagnosis is a post-mortem.
What can we do to make sure that we build the best structure that will do a job adequately and safely? We need facts. Before we can build a levée or barrier against floods, we need to know how high the water will rise. The problem is that although we may have records of where the water has been, we have to use these to decide what may happen in the future. How to do this is not obvious. We could simply decide to take the highest recorded value and build to that. We could add a bit extra for safety, but how much?
Properties of distributions
When we look at such records we see a problem; the levels of the water are not all equally likely: they are spread out according to some distribution law that we don't know in advance, unless we have a theoretical calculation that is known to work. This is true for a great many physical variables. What we can do is to take all the existing records and draw a histogram of the values. The result might look like the simulation below.
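Such a simulation can be sketched in a few lines. The exponential shape and the mean level are assumptions chosen for illustration, and the random seed is fixed only so the run is repeatable:

```python
import random

random.seed(1)

# Simulate 100000 "water level" records drawn from an assumed
# exponential distribution with mean level b = 1.0 (illustrative only).
b = 1.0
levels = [random.expovariate(1.0 / b) for _ in range(100_000)]

# Build a histogram by counting values in bins of width 0.5.
bin_width = 0.5
counts = {}
for v in levels:
    k = int(v / bin_width)
    counts[k] = counts.get(k, 0) + 1

# Print the first few bins; the counts fall away steadily.
for k in sorted(counts)[:8]:
    print(f"{k * bin_width:4.1f} to {(k + 1) * bin_width:4.1f}: {counts[k]}")
```

The point is not the particular numbers but the shape: most values are small, and the counts thin out towards the high end, exactly where the dangerous values live.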
The recorded water level runs from zero at the left to some maximum value at the right. Although it is interesting to know what the distribution is like, especially if we have a theory that we wish to test, the vital area is around the maximum values, because those are the ones that cause the trouble. It is not often that we can see a physical record of a distribution, but an aerial view of a busy airport reveals a black streak at each end of the runway, the density of which gives a good idea of the distribution of landing spots. In that example, the vital thing is that the landing spot must not be too far along for the plane to stop within the runway, and not so far back that it lands before reaching the runway. In the case of water, the banks of rivers, and the beaches and cliffs of the sea, often show the maximum water levels, but they don't give any idea of the distribution of levels.
The problem with our graph is that it doesn't reveal much about the top end. A solution to this is to use a logarithmic vertical scale. In such a scale, equal vertical distances don't mean 1, 2, 3, etc - they mean 1, 10, 100, 1000, etc (or 2, 4, 8, 16, . . ., or 3, 9, 27, 81 . . .). The next two simulations show examples of logarithmic histograms, using the same distribution function as before.
The thumbnails look similar, but the actual histograms show that one set has a much higher maximum value than the other, though both include the same number of values. This is very worrying for our hopes of prediction. We also see greater effects of randomness at the right hand end where the data are sparse, than at the left where they are plentiful. The logarithmic graph gives a very good rendering of statistical fluctuations.
The exponential distribution
The histograms look rather straight at the left side. This is because the simulation used an exponential distribution, of the form dn/dv = A exp(-v/b), where dn/dv is the density of events in the space of the variable v, A is a constant, and b is the mean value in v-space. The exponential function exp(x), also written e^x, is the inverse of the natural logarithm ln(x). Transforming graphs to produce straight lines is very useful, because any change of behaviour is easier to see than with other shapes. It is also easier to fit straight lines to data than to fit more complicated functions.
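The straight-line behaviour is easy to check numerically. In this sketch the mean b = 2 and the sample size are arbitrary choices; on a log scale, the count in each unit-width bin should fall by about 1/b per bin:

```python
import random
import math

random.seed(2)

b = 2.0  # assumed mean of the exponential, illustrative
values = [random.expovariate(1.0 / b) for _ in range(200_000)]

# Histogram with unit-width bins.
counts = {}
for v in values:
    counts[int(v)] = counts.get(int(v), 0) + 1

# On a log scale an exponential gives a straight line:
# log(count) should fall by about 1/b per bin.
logs = [math.log(counts[k]) for k in range(6)]
drops = [logs[k] - logs[k + 1] for k in range(5)]
print(drops)  # each drop should be close to 1/b = 0.5
```

The drops are equal within statistical fluctuations, which is just the straight line seen in the logarithmic histograms.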
To investigate further, a great many sets of data, all of the same size, were generated, and the mean and maximum of each set were calculated. So if a batch contained 1000 sets of 10000 values each, there would be 1000 mean values and 1000 maximum values. For each batch, all the means and maxima were recorded in histograms - blue for the means and red for the maxima. Three examples are given below. For each set, one example of the exponential distribution is shown in grey. Note that in the following seven diagrams, the grey exponential histogram is on a log scale, while the blue and red histograms are linear.
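That batch procedure can be sketched as follows. The exponential mean, the number of sets per batch and the set sizes are all illustrative choices, and the seed is fixed for repeatability:

```python
import random
import statistics

random.seed(3)

def batch_stats(n_sets, set_size, mean=1.0):
    """Draw n_sets exponential samples of size set_size; return the
    list of set means and the list of set maxima."""
    means, maxima = [], []
    for _ in range(n_sets):
        data = [random.expovariate(1.0 / mean) for _ in range(set_size)]
        means.append(statistics.fmean(data))
        maxima.append(max(data))
    return means, maxima

for set_size in (100, 1000, 10000):
    means, maxima = batch_stats(200, set_size)
    print(set_size,
          round(statistics.stdev(means), 4),   # spread of the means shrinks
          round(statistics.fmean(maxima), 2))  # typical maximum keeps growing
```

Running this shows the two behaviours discussed below: the spread of the means narrows as the sets get bigger, while the typical maximum does not settle down but keeps climbing.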
Here are more examples, in which the blue histograms have been scaled down to fit them in. Ideally, the red and the blue histograms should have the same area, as they do in the graphs above.
From these histograms we can deduce the following results for the means and maxima.
Means - The mean (blue) becomes more reliable with more data, as we would expect. Its distribution becomes narrower and more symmetrical, eventually approximating to a normal (or Gaussian) distribution, a very important statistical function.
Maxima - The maxima are very variable for a given number of data in a set, and as we increase the number of data in a set, the maximum does not settle down to a constant value, and the range of maximum values in a batch of data sets does not decrease very quickly. For many distributions, the maximum (and minimum and range where applicable) are not good representations of the distribution. This is very bad news in a great many situations. When the water comes over the bank, or the plane undershoots the runway, it is no consolation to know that the mean value (however well measured) was well within requirements.
The problem we see here is that because the maximum value increases with the number of data, no existing records can include the maximum value that could ever occur in the future. Nor can the existing records exactly predict the maximum value that can occur during the next N years. All we can do is build to some estimate, such as the highest flood expected within 25 years, 50 years, 100 years, etc. Even then, the red histograms show that there is a distinct probability of a higher result than the average expectation, so that the "50 year flood" could occur next year, next month, next week, or tomorrow (though with decreasing probability).
Furthermore, although the histograms above do show a trend that we can measure, with real situations we cannot adopt this procedure. Flood records cannot be manufactured: we have to use the existing records, which may not be numerous enough to show what kind of distribution they follow. The distribution may even vary with time because of changes in climate, water usage or land usage.
The normal (or Gaussian) distribution
The normal distribution is of the form dn/dv = A exp(-(v - b)²/(2c²)), where A is a constant, b is the mean, and c is the standard deviation. The constant A is often chosen to normalise the graph to an area of unity, producing a standard function which can be scaled to fit any number of data.
The diagrams below show some results for a normal distribution, made in the same way as those for the exponential distribution. The first two histograms, linear and logarithmic, include 100000 values each. The last two blue histograms have been scaled down.
The properties of distributions are of great importance in many industries. Suppose, for example, that the correct performance of a component depends on some variable v not exceeding a specified value v(max), or perhaps it must not fall below a value v(min). The simplest procedure might be to specify the component very conservatively, but in a very competitive field this could risk loss of sales to a rival, especially in a field which is developing rapidly.
Therefore it is desirable to be able to specify, with great confidence, the probability that the required limit will be violated. The problem is that for very tight limits, the number of tests must be very large: to be confident that fewer than one value in 100 million fails, a huge number of values must be tested.
One way out is to assume that the distribution of the values is known. It is then possible to measure the mean and standard deviation, which is relatively easy, and to calculate from them the expected probabilities, using the known properties of the distribution. For the normal distribution, a criterion of six standard deviations is sometimes used, corresponding to a success rate of around 99.9999998 %, or a failure rate of about 0.0000001973 %, which seems very small. But the number of bits on a hard disc, for example, is very high, as is the number of bits per day transmitted by millions of communications channels. Only a minute error rate is tolerable.
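For the normal distribution, that failure rate can be checked directly with the complementary error function. This sketch assumes the usual two-sided convention (failures beyond k standard deviations on either side of the mean):

```python
import math

# Two-sided failure fraction beyond k standard deviations
# of a normal distribution: erfc(k / sqrt(2)).
for k in (3, 4, 5, 6):
    frac = math.erfc(k / math.sqrt(2))
    print(f"{k} sigma: failure fraction = {frac:.3e}")

# At 6 sigma the fraction is about 1.97e-9, i.e. about
# 0.0000001973 %, the figure quoted above.
```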
The next pictures show some distributions that all have the same mean (blue) and the same standard deviation (green). Their maxima and minima (red) differ significantly.
The next picture shows the first and last distributions together to show the difference.
The shift may not look much, but the effect on the numbers of values beyond a given limit can be great. In the diagram below, the numbers of data to the right of the black line are very different in the two histograms: in the upper one, the number is too small to show.
The next picture shows two normal distributions (red and blue) differing by a factor of 1.1 in width.
The purple curve was calculated as follows. For each horizontal value v, the area under each curve to the right of that position was calculated; these values are the integrals of the curves from v to infinity. The ratio of the two results was plotted as a purple point.
We see that for these two fairly similar distributions, the ratio rises rapidly away from the middle. The vertical lines represent a scale in standard deviations from the mean, while the horizontal lines are at intervals of ten. So at four standard deviations from the mean, the purple curve shows a factor of about 24. At six standard deviations the ratio would be very large. The next diagram, with a grid in steps of 100, shows that at six standard deviations the ratio is about 1080, a colossal increase from 24.
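That tail-area calculation is easy to sketch. The idealised formula below will not exactly reproduce values read off the diagrams, but it shows the essential point, which is how violently the ratio of tail areas grows with distance from the mean:

```python
import math

def tail(v, sigma):
    """Area of a normal distribution of width sigma to the right of v."""
    return 0.5 * math.erfc(v / (sigma * math.sqrt(2)))

# Two normal distributions differing by a factor of 1.1 in width, as
# in the diagram above; the ratio of their right-hand tail areas
# grows very rapidly with distance from the mean.
for k in (2, 4, 6):
    ratio = tail(k, 1.1) / tail(k, 1.0)
    print(f"{k} standard deviations: ratio = {ratio:.1f}")
```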
This property of distributions that include an exponential is responsible for some common physical phenomena. If a chemical reaction requires a given molecular energy to take place, only those molecules which have at least that much energy will participate. The distribution of molecular energies has an exponential component. An increase in temperature will increase the number of active molecules out of all proportion to the change in temperature. The properties of semiconductors may change strongly with temperature for a similar reason.
Returning to the subject of specifying components, the problem is that the numbers of values in the extreme tails of the distribution are sensitive to small changes in the form of the distribution. If, for example, we have wrongly assumed a normal distribution, the results may be wrong. Although "common sense" is often a dangerous guide, it does seem reasonable that although we can measure good values of mean and standard deviation with, say, 100 thousand values, extrapolating from 100 thousand to 100 million values requires caution.
Looking at the problem another way, we might define a value above which we want to find only 1 in N of the values, where N is something large such as 100 million. If we test only 100 million values, we might find one value above the limit, none above the limit, two above the limit, etc. To get a sensible result, we need to take a lot more data. If A is the number of data above the limit, the fractional root mean square uncertainty on A is given roughly by the reciprocal of the square root of A. If we want an inaccuracy of 1% (1/100), say, we need A = 10000 values. If these are one hundred millionth of the total data, that total must be 10000 × 100 million, which is a million million.
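The arithmetic of the preceding paragraph runs as follows:

```python
import math

# Relative statistical uncertainty on a count A is roughly 1/sqrt(A).
A = 10_000                     # events above the limit, for 1 % accuracy
print(1 / math.sqrt(A))        # 0.01, i.e. 1 %

# If those events are one hundred-millionth of all the data,
# the total data set required is:
total = A * 100_000_000
print(total)                   # 10**12, a million million
```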
Accidents happen in threes
A major accident is sometimes followed quite soon by a second, and even a third, though the events are not necessarily at nearby places. What is going on?
Let us imagine a system in which some kind of event happens in a completely random way. In any short period of time ΔT, the probability of one event happening is the same. A long period of time T consists of a great many of these short periods, one after another. We can make ΔT as short as we like, and make more of them to fill up the period T.
What are we likely to see in a stream of random events? Before looking at some events, let us try to guess what the result might be. If we try to fit, say, N events into a period T, we can find an infinite number of ways of doing it, unless time is quantised into indivisible units, for which there is no evidence.
One possibility is to make the events equally spaced. Exact equality is only one out of an infinite number of possibilities, and therefore has zero probability. It won't happen. If we allow a slight variation from equality within some band of error, the probability becomes finite. If the spread of each event in time is δt, then you can calculate the probability of all N events being in such a band. Try it.
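One rough way to try it, treating each event as needing to land in its own window of width δt around the equally spaced positions, and allowing any of the N! orderings, gives the estimate N!(δt/T)^N. This is a back-of-envelope assumption, not an exact result:

```python
import math

def p_equal_spacing(n, dt_over_t):
    """Rough probability that n uniformly random events in a period T
    all land within windows of width dt around the equally spaced
    positions, for any assignment of events to windows (an estimate)."""
    return math.factorial(n) * dt_over_t ** n

# Even with generous windows of 1 % of the period, the probability
# collapses as the number of events grows.
for n in (5, 10, 20):
    print(n, p_equal_spacing(n, 0.01))
```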
Jocelyn Bell, using a radio telescope, found regular pulses of radio waves from a certain area in the celestial sphere. Tests showed that the pulses were extremely regular. They emanated from a rotating neutron star, called a pulsar. The regularity was easily explained by the well known law of conservation of angular momentum.
Another extreme case would occur if all N events happened at the same time. This has zero probability, but the probability becomes finite if we only insist on the events occurring within a duration δt. Again, this low probability can be calculated. In such a case there would be N - 1 very short intervals and two very long intervals, within the period T.
We begin to see that although anything approaching regularity is unlikely, so is the occurrence of extreme bunching and long empty periods. The actual distribution in time is easy to calculate. First, though, we will look at some simulated events, calculated simply by choosing their positions in time randomly.
These ten sets of events were all calculated in exactly the same way. They do look "random", but they show an apparent tendency to "bunching", which is the source of the phrase "accidents come in threes". If people are asked to draw random lines, or to invent random numbers, they will often make them more uniformly spaced than a truly random process would, in an attempt to avoid too much repetition. Yet we know that tossing a coin or a die can easily result in two or more consecutive equal results. In fact the probability of the next result is in no way influenced by previous results, if the system is unbiased - there is no "law of averages" as commonly understood.
The actual distribution of the random events looks like the next diagram.
It is the exponential distribution whose properties we looked at earlier. The blue line shows the mean, and the red line shows the maximum value of this set of data. In a set of random events, if we were to count the number of intervals between 0 and δt, δt and 2δt, 2δt and 3δt, etc, the most frequent would be those between 0 and δt, the shortest intervals of all.
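That counting experiment can be sketched directly. The number of events and the bin width are illustrative choices, and the seed is fixed for repeatability:

```python
import random

random.seed(4)

# 100000 events placed at uniformly random times in a period T = 1.
times = sorted(random.random() for _ in range(100_000))
intervals = [b - a for a, b in zip(times, times[1:])]

# Count intervals in bins of width dt: the first bin (the shortest
# intervals of all) should be the most frequent, falling off
# exponentially from bin to bin.
dt = 1 / 100_000          # roughly the mean interval
bins = [0, 0, 0, 0]
for iv in intervals:
    k = int(iv / dt)
    if k < 4:
        bins[k] += 1
print(bins)
```

The first bin dominates, confirming that in a truly random stream the shortest gaps between events are the commonest, which is exactly the "bunching" that surprises our intuition.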
The next two pictures show randomness in size as well as position.
Here is a picture of some starlings on wires.
Although they are not evenly spaced, their spacings are not completely random either, because the birds do not like to be closer together than a certain distance. The next picture shows a similar effect with gulls.
A similar effect should also be observed on busy roads, with the distance between vehicles being always greater than the safe minimum. It does not always happen like this. Sometimes a wave of compression passes back through the vehicles on a busy road, when a driver suddenly uses the brakes, and all the ones behind have to do the same, each one lagging slightly behind the one in front. The cumulative effect on a motorway can bring the traffic down to very slow speeds, long after the original driver has regained full speed.
More about the statistics of timing
Imagine that we have a signal which occurs at some time that we wish to measure. It could be the turning on of a switch, the collision of two metal objects, or a flash of lightning. Almost any kind of signal can be turned into an electrical one using a transducer.
In order to measure the instant at which the signal occurs, we can use an electrical clock - a regular series of pulses. At each clock pulse we can use a logic circuit to inspect the signal wire to see whether it has happened, and we will see an output of zero until the signal has occurred, and one afterwards, as in the following diagram.
The electrical system can be thought of as a time-to-digital converter, or TDC, since we get a digital output. The grey lines represent the clock pulses, and the circles represent the logic output at each one. In order to ascertain the instant of switching, all we can do is connect two values as shown. There is no way of knowing what happened between the clock pulses. The zero crossing of the real signal and the interpolated measurement occur at different times. It is clear that the error in time cannot be more than half a clock period.
For a great many measurements, a histogram of timing errors would look like the following diagram.
All outcomes in a range of one clock period are equally likely, and the root-mean-square error in a large number of measurements is CP/√12, where CP is the clock period. The formula gives rms error = 0.289CP. In order to reduce the timing error we could reduce the clock period, that is, use a faster clock.
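A quick simulation reproduces the CP/√12 figure. The clock period and the number of trials are arbitrary, and the model assumes the reported time is the nearest clock edge, so the error is uniform over one period:

```python
import random
import math

random.seed(5)

CP = 1.0  # clock period (arbitrary units)

# True signal times spread uniformly over many clock periods; the TDC
# reports the nearest clock edge, so the error is uniform over one period.
errors = []
for _ in range(100_000):
    t = random.uniform(0, 1000 * CP)
    measured = round(t / CP) * CP
    errors.append(measured - t)

rms = math.sqrt(sum(e * e for e in errors) / len(errors))
print(rms)  # close to CP / sqrt(12), about 0.289 * CP
```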
There is another way, using the same clock. We use an analogue-to-digital-converter, or ADC. The interpolation now looks like the next diagram.
The result ought to be better than with the TDC, because the samples are all on the signal curve instead of being logic levels. The histogram is shown below.
It doesn't look any better than the first histogram, and it isn't even level. Why is this? The answer is seen by looking at the samples. Because the signal rises so sharply, most of the samples are either fully high or fully low. They tell us about the amplitude, but not about the timing. The solution is to make the signal worse. Yes, worse. We will slow it down, using an analogue filter. In the diagram below, the error is much smaller than before, because the samples we are using are on the slope of the signal, which is time-dependent.
The histogram, shown in the next diagram, is much narrower than before. If the signal were less curved, or if we used a better interpolation than a straight line, the result would be even better.