Monday, October 15, 2007

Testing Your Stove Test - the Beauty of the Normal Distribution

We all have a regular need to show people how much we care about the accuracy and reliability of the claims we make for our products (I am assuming that the introduction of a new stove to a region is just like the introduction of any new consumer product to a "market" - we want lots of sales and satisfied customers) - as such we should have tests (for stoves and other products) in place that clearly show what the performance of our contributions is. Within industry, quality control (QC) is taken very seriously, and if everyone had to argue about the results of every quality test there would be a world of trouble. Instead everyone agrees on statistical methods that not only show that the performance of our stoves is what we say it is, but also let us use present data to predict the future performance of many more stoves in the field. We create tests to measure performance, but how do we test our tests - to show that people can trust our results?

The very first way to start evaluating the suitability of a procedure/test (time to boil, firepower, individual emissions, etc.) is to determine whether the results of the test are "well behaved" - a somewhat un-technical expression indicating that a set of data resulting from applying the test multiple times has the statistics of a "normal" (or Gaussian) distribution (see http://en.wikipedia.org/wiki/Normal_distribution). A large chunk of quality control (QC) is based on this mathematical distribution, which assumes that any measured variable (say the time to boil 2 liters of water with the WoodGas stove, just following the WBT procedure) has a real and precise value, but that small errors are introduced during the measurement process so that there is a spread in its actual measured values. We can discuss where these errors originate later, but the assumption is that they are random in nature, so the distribution has a single well-defined peak. If not, then there is a real question as to whether the test is reliable.
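If you would rather do this first check with a script than a spreadsheet, here is a minimal sketch in Python. The time-to-boil repeats are made-up numbers, just for illustration: compute the mean and spread, then apply a standard normality test (Shapiro-Wilk) as one quick way to judge whether "well behaved" is a reasonable assumption.

```python
# A minimal "well behaved" check, assuming a set of hypothetical
# repeated time-to-boil measurements (minutes) from one test procedure.
import numpy as np
from scipy import stats

boil_times = np.array([28.5, 31.2, 29.8, 30.4, 27.9,
                       32.1, 30.0, 29.3, 31.6, 28.8])  # hypothetical repeats

mean = boil_times.mean()
std = boil_times.std(ddof=1)  # sample standard deviation
print(f"mean = {mean:.1f} min, std dev = {std:.1f} min")

# Shapiro-Wilk test: a low p-value (say, below 0.05) is evidence the
# data are NOT normal, i.e. the test may not be "well behaved".
w_stat, p_value = stats.shapiro(boil_times)
print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {p_value:.3f}")
```

With only a handful of repeats such a test is weak evidence either way, which is exactly why plotting the distribution (as below) is still the first thing to do.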

As an example, we can use Frank's (for once this is his real name) adobe brick crush strength measurements - 20 real measured data points taken with whatever his standard procedure is. In the first figure I show how we use Excel's "frequency" function to create a table of the probability distribution of his data values. Figure 2 shows that when the data are plotted, the distribution indeed appears well behaved (it has a nice, bell-shaped appearance), so I think he can assume that both his brick making and crush testing methods are sound and reliable (and therefore that he has only random sources of error). Tests that are "out of control" are easy to spot - the distribution may have no peak, or two peaks, or worse. Frank can assume that his average and standard deviation define the distribution well, so he can use simple statistics to see if design changes are an improvement. Without this confidence in the consistency of his manufacturing he could be wasting time as he explores changes, because he can't tell if they are effective!
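For those who prefer a script, here is a rough Python equivalent of that "frequency" table. The 20 crush-strength readings are hypothetical stand-ins for Frank's actual data, and the 5 lb/in2 bins are an assumed choice:

```python
# A rough equivalent of Excel's FREQUENCY table, using 20 hypothetical
# crush-strength readings (lb/in^2) in place of Frank's real data.
import numpy as np

strengths = np.array([312, 298, 305, 320, 315, 290, 308, 300, 295, 310,
                      325, 303, 317, 299, 306, 311, 294, 322, 301, 309])

# Bin edges spanning the data, like the "bins" column fed to FREQUENCY.
edges = np.arange(285, 335, 5)
counts, _ = np.histogram(strengths, bins=edges)

# A crude text histogram - one peak and a roughly bell shape is what
# "well behaved" looks like here.
for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"{lo}-{hi} lb/in^2: {'*' * n}  ({n})")
```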

Next I show that I have gone way overboard to demonstrate that his distribution quite closely (mathematically) matches a normal one, and now we can see what sort of shape he would get if he measured hundreds (or thousands) of bricks - for example, he'll eventually get some points outside his present range, but not very many (these would fall in the tails of the distribution). I use two methods - one where I calculate the expected normal distribution based on just the data points that he has provided ("coarse"), and another indicating what he might see if he performed thousands of tests ("fine" - assuming his resolution is 1 lb/in2). Frank's data and the two calculated distributions look very similar.
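Here is a sketch of how those two calculated curves could be produced. The mean and standard deviation are hypothetical values standing in for a fit to the 20 readings; Frank's actual numbers may differ:

```python
# "Coarse" vs "fine" expected normal curves, assuming a hypothetical
# fitted mean and standard deviation (lb/in^2) from 20 bricks.
import numpy as np
from scipy import stats

mu, sigma, n = 307.0, 9.5, 20           # hypothetical fit to the 20 readings
edges = np.arange(285, 335, 5)          # same 5 lb/in^2 bins as the table

# "Coarse": expected count in each bin for n samples, from the normal
# CDF evaluated at the bin edges.
coarse = n * (stats.norm.cdf(edges[1:], mu, sigma)
              - stats.norm.cdf(edges[:-1], mu, sigma))

# "Fine": the smooth curve thousands of tests would approach, at a
# resolution of 1 lb/in^2, scaled to counts per 5-wide bin so it can
# be overlaid on the measured histogram.
x = np.arange(edges[0], edges[-1] + 1.0)
fine = n * 5 * stats.norm.pdf(x, mu, sigma)
```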

Now Frank is pretty confident that he can develop a specification that meets his needs for well-engineered adobe brick wall structural stability, and he can make bricks by the thousands while testing individual bricks only periodically (always just as few as is statistically necessary!), and his customers can clearly see that he has a quality operation. He would set up his brick performance specification so that walls made from his bricks fail only an acceptable percentage of the time - hopefully this doesn't mean that he has to reject most of his bricks; if it does, then either his manufacturing process is out of control, or his test method is not capable enough to do the job he is asking of it. Luckily we can do a separate evaluation of the test itself, just to see if it is up to the task of helping him maintain a particular set of specifications (this is called a Gage R&R).
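As a worked illustration of that specification step, assuming a hypothetical lower spec limit of 280 lb/in2 and the same hypothetical fit as above, the normal CDF predicts what fraction of bricks would fall below spec:

```python
# Checking a hypothetical specification against the fitted distribution.
from scipy import stats

mu, sigma = 307.0, 9.5     # hypothetical fitted mean and std dev (lb/in^2)
lower_spec = 280.0         # hypothetical minimum acceptable crush strength

fraction_failing = stats.norm.cdf(lower_spec, mu, sigma)
print(f"Predicted fraction of bricks below spec: {fraction_failing:.2%}")
```

If that predicted fraction is unacceptably high, Frank either tightens his manufacturing or questions whether the test itself is capable - which is exactly what the Gage R&R is for.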

But this does put a rather large onus on the developer of the SOP (standard operating procedure, e.g. for the WBT or the emissions tests) - it must be very clear (I suggest colored highlights and such to make filling in the blanks easier, and eventually there may be no need for all the calculated value cells to be shown, since the operator does not need these to do the job) and just about bombproof in every language. The SOP has to include every step, such as the actual measurement of the moisture content of the fuel. The present WBT is written so that it implies that every wood species, moisture content, wood size and shape, and operator skill level will result in the same adjusted (for species and MC) measured values. It may be best, as Dean suggests, to just use this test to compare preliminary designs, but any test should be "well behaved" and should have this demonstrated early on. While we are learning, it is not unreasonable to use a standard fuel type to simplify things - for example, the U.S. has a special agency (NIST) just for supplying industries with "standards" for their testing purposes. The fact that not every stove can use the same fuel type (nor is a world standard available) is certainly unfortunate, and hinders cross comparisons!

Besides being able to apply standard statistical techniques, the beauty of demonstrating that the results of a test are well behaved (and that a normal distribution is followed) is that we can now predict the behavior of future stoves - what stove developer, funding agency, or NGO wouldn't want this, and what climate change researcher wouldn't want to use such data in their analysis? We should all be interested in making sure our tests are reliable, and cover the costs of this simple start ourselves - build it into budgets up front. It's all about repeating trials often enough that you know you are close enough to the mean for your goals. Don't be surprised if you find that you can only "know" the value of a measurement to plus/minus 30%, such that you can't tell the difference between a 5 liter time to boil of 21 minutes and one of 39 minutes (30 minute average +/- 30%). And when we are all doing the same measurements as well as we can, I expect that occasional claims for incredible fuel savings (or amount of carbon avoided, or number of trees saved, or ecological/climate impact avoided) will not be as extreme as they are now - my work on the Darfur stove from Berkeley has taught me that we still have a way to go until we test and report appropriately.
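To make the "repeat trials enough" point concrete, here is a small sketch that computes a 95% confidence interval for the mean time to boil from a handful of hypothetical, scattered repeats - with few trials the interval can easily be a large fraction of the mean, which is the plus/minus 30% problem in miniature:

```python
# A 95% confidence interval for the mean time to boil, assuming a
# handful of hypothetical, scattered repeats around 30 minutes.
import math
from scipy import stats

times = [21.0, 39.0, 33.0, 26.0, 35.0, 24.0, 31.0, 29.0]  # hypothetical
n = len(times)
mean = sum(times) / n
std = math.sqrt(sum((t - mean) ** 2 for t in times) / (n - 1))

# Student-t half-width: with few, scattered repeats this can be a
# large fraction of the mean; more repeats shrink it as 1/sqrt(n).
half = stats.t.ppf(0.975, n - 1) * std / math.sqrt(n)
print(f"mean = {mean:.1f} min, 95% CI = +/- {half:.1f} min "
      f"({100 * half / mean:.0f}% of the mean)")
```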

The WBT keeps advancing, but it is already plenty close for testing whether people can follow it reliably. First you develop the procedure (these days it always has photos when it is done, and we place big picture-covered posters in all labs), then you see if the results are well behaved and the test is capable, then you test the process with an R&R once you have specifications... - it is the normal order of things, so this effort is right on time. And when you are done there is no more need to argue about test result uncertainties again.

All other studies of the impact of individual or combined variables on the test results unfortunately should be briefly delayed until the test methods themselves can be proven capable; then (statistically) designed experiments can be quickly used to test the impact of a variable (say moisture content or fuel type/size/shape) or a combination of them. This formal process is called "design of experiments" (or DOE), and it aims to determine the sensitivity of a test result to key variables - again, such as fuel moisture content, fuel species, and other factors that might be important in determining the outcome of the tests.
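As a flavor of what a DOE looks like in practice, here is a minimal two-level full factorial sketch (one common DOE form) with two hypothetical factors, fuel moisture content and stick size; every combination is run so that the main effects and their interaction can both be estimated:

```python
# A minimal two-level full factorial design, with hypothetical factors
# and levels chosen purely for illustration.
from itertools import product

factors = {
    "moisture_content": ["10%", "25%"],   # hypothetical low/high levels
    "stick_size": ["small", "large"],
}

# Every combination gets run (ideally replicated, in random order), so
# each factor's effect - and their interaction - can be estimated.
runs = list(product(*factors.values()))
for i, run in enumerate(runs, 1):
    print(f"run {i}: " + ", ".join(f"{k}={v}" for k, v in zip(factors, run)))
```

Adding a third factor just doubles the run count again, which is why proving the test itself is capable first matters - otherwise the noise swamps the effects you are trying to measure.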

1 comment:

Unknown said...

Hi Charlie

I'm very happy to see the serious thought you are giving to quality control and the careful use of statistics - these issues are all too often overlooked, and all too frequently claims are made which do not have a rational (statistical) basis. However I think that in both this article and the earlier one on "Measurements and Statistics for Stovers" you are missing the most crucial issue: how well the WBT predicts field performance. In that earlier article you mention, for example, that one could introduce a "longer boil time, to reduce startup/transient effects" to improve the GR&R results. This no doubt would work to improve the repeatability of the test, but the WBT would become even less reliable as a predictor of field performance. There is a very clear interaction between stove type and numerous experimental variables, so a test procedure which defines these variables will favour particular types of stoves. If the procedure doesn't reasonably match how the stoves will ultimately be used, no matter how well behaved the results are, or how good the GR&R is, then there is a risk that performance improvements made in the lab on the basis of the WBT do not improve field performance. Figure 2 here is a fairly good example of this.

My personal opinion is that we should stop chasing the holy grail of the *international* WBT, and instead tailor it to the product's target market. If we want to brag about how good our stove is, a suitable measure would be percentage improvement over a (local) baseline.

Finally, have you seen the Design of Experiments work I did? It can be downloaded here (see Chapter 3).