Validate, validate, validate …

I had a query recently from a researcher who devised a variant of the GPS to incorporate trunk data. He’d submitted it for publication but the reviewer asked for evidence that the scale had been validated and he wanted to know how to respond. It made me stop and think about the whole process of validation. It’s one of those areas in which the concepts evolved within psychometrics, where they are relevant, have been allowed to spill over into other areas, where they are not.


For the uninitiated the field appears complex. I remember a PhD student whom we once asked to validate a scale coming back a week later completely confused – she did master it eventually but there was a steep learning curve. Read the relevant chapter in Portney and Watkins for example and you are conducted on a whistle-stop tour of face, content, criterion-related and construct validity in 20 pages. Altman and Bland whip through these even more quickly and add in internal consistency for good measure.

I don’t have enough space in a blog article to go into why this is all necessary (Altman and Bland provide a succinct summary) but I do want to explore when it is necessary, which I feel is very poorly understood. Stating it rather boldly, validation of a scale is required when we don’t know what we are measuring. Psychometrics evolved to support psychologists and behavioural scientists who wanted to quantify concepts such as happiness or anxiety. Neither happiness nor anxiety is defined in terms of numbers so the researcher has to go through a process of convincing her or his peers that the scale she or he has devised is a valid measure of what the rest of us understand by the terms. In our own field, health-related quality of life or patient satisfaction or even general terms like gross motor function or mobility are similar qualitative terms. If we want to assign a numerical value to these then we need to go through the same process. As our understanding of the underlying issues becomes more sophisticated then so does the battery of different types of validity that we need to establish in order to convince others that our scale represents what we say it represents.

By contrast, however, such a process of validation is not required if we do know what we are measuring. If we are measuring length, time, speed or joint angle, moment and power then there are very precise definitions of the terms we are seeking to measure and there is absolutely no need to go through this full validation process. The question we need to ask is whether the tests are accurate rather than whether they are valid. This requires a completely different set of techniques. The GPS is a derivative of joint angle measurements and I would argue that a consideration of accuracy is required rather than one of validity.

Of course there is a subsequent question which is whether any measurement is useful. Just because a variant of the GPS including trunk data is well defined and accurate doesn’t necessarily mean it is useful in any particular context. That, however, is yet another and different question.


Just a minute

During the CMAS standards meeting last week there was some discussion about how repeatable our measurements need to be. I was struck by a comment from Rosie Richards from the Royal National Orthopaedic Hospital at Stanmore that six degrees is the angle represented by one minute on a clock (apparently the idea originally came from her colleague Matt Thornton). Her point was that this doesn’t feel like a very big angle and that if we are working to this sort of accuracy then we are doing pretty well. I’d agree with her and think that if there is ever any discussion of just how accurate gait analysis is then using this as an illustration is really powerful.

Corn Exchange clock, Bristol. This clock actually has two minute hands. The red one records GMT and the black one the local time in Bristol, which is 190 km west of London and thus about ten minutes behind! (C) Rick Crowley, Creative commons licence.

The evidence supports this. In our systematic review, Jenny McGinley and I suggested that measurement variability of more than 5 degrees was concerning and showed that most repeatability studies for most joint angles report variability of less than this. They are thus, of course, within the one minute limit as well.

It’s also interesting to note that the variability within normal gait is generally less than 6 degrees. I’ve tabulated the standard deviations from our recent comparison of normative data below. Hip rotation at one centre pushes above the limit (but this is almost certainly a consequence of measurement error). The only other variable that exceeds this is foot progression (which I’ll return to below). This should be of interest to those who think that they should be able to use differences in gait pattern as a biometric to identify people. To do this successfully would require variability within the 1 minute limit to distinguish between people.  Personally, I think this is a big ask from the CCTV camera footage that the biometricians would like to base their analysis upon.

Average standard deviations across gait cycles for different gait variables

This doesn’t mean we should be  complacent, however. In the figure below I’ve compared Verne  in the average normal pose at the instant of foot contact (grey outline) and then increased his leading hip flexion by 6 degrees (and adjusted the trailing foot pitch to bring the foot into contact with the ground again while all the other joint angles remain the same). You can see that this has increased step length by over 10%. If there was an additional 6 degree increase in trailing hip extension as well then this would double. The additive effect of such variability may help explain why foot progression in the table above is a little higher than the other measures in that it can be considered as a combination of the transverse plane rotations at pelvis, hip, knee and ankle rather than a “single” joint angle.

Effect on step length of increasing leading hip flexion by six degrees
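A rough feel for the size of this effect can be had from a very simple geometric sketch. It assumes two straight legs of equal length pivoting about a common hip, with the angle of each leg measured from the vertical; the angles of 25° and 20° and the function name are illustrative assumptions, not values taken from the figure, and knee and ankle motion are ignored.

```python
import math

def step_length(leg_length, lead_angle_deg, trail_angle_deg):
    """Horizontal foot separation for two straight legs pivoting at a common hip.

    Angles are the inclination of each leg from the vertical (an assumed,
    simplified stand-in for hip flexion and extension).
    """
    lead = math.sin(math.radians(lead_angle_deg))
    trail = math.sin(math.radians(trail_angle_deg))
    return leg_length * (lead + trail)

LEG = 1.0  # leg length in arbitrary units; it cancels out of the ratios below

baseline = step_length(LEG, 25, 20)        # hypothetical starting pose
lead_only = step_length(LEG, 25 + 6, 20)   # leading leg 6 degrees further forward
both = step_length(LEG, 25 + 6, 20 + 6)    # trailing leg 6 degrees further back too

print(f"leading leg only: {lead_only / baseline - 1:.1%} longer step")
print(f"both legs:        {both / baseline - 1:.1%} longer step")
```

Under these assumptions the six degree change at the leading hip lengthens the step by a little over 10%, and adding the same change at the trailing hip roughly doubles that, in line with the figure.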

In summary the one minute limit seems an extremely useful way of describing how accurate our measurement systems are and we should take considerable confidence from this. On the other hand we shouldn’t be complacent as variability of this level in specific joint parameters can have quite substantial impacts on the biomechanics of walking.

Readers outside the UK may not fully appreciate the title of this blog, which is a reference to one of the oldest comedy shows on BBC radio, broadcast regularly since 1967. It is one of the purest and most exuberant celebrations of the English language that I know. Episodes are not being broadcast at present but when they are they can be listened to internationally (I think) through the BBC iPlayer.

Spot the difference

So how can we use the standard error of measurement? I spent a considerable part of a recent post criticising the ICC but it’s clear from correspondence with several people that the properties of the SEM are not well understood. The SEM is a type of standard deviation (Bland and Altman, 1996, actually refer to it as the within-subject standard deviation). It assumes that measurements of any variable will be distributed about the mean value (in this post we’ll assume that the mean value is the true value which needn’t necessarily be true, but is the best we can do for most clinically useful measurements). Measurements a long way from the mean are far less likely than those close to the mean and the distribution of many measurements follows a specific shape called the normal distribution. It can be plotted as below and the golden rule is that the probability of finding a measurement between any two values is given by the area under the curve (hence the differently shaded blue areas in the figure).

Normal distribution and standard deviation (figure from the Wikipedia article on standard deviation)

If the distribution is normal then it is described by just two parameters: the mean (which coincides with the mode and the median) and the standard deviation, which measures the spread. 68% of measurements fall within ± one SEM of the mean. This means that 32% (about 1 in 3) fall outside. So if you only take one measurement then on nearly a third of occasions the true value of whatever you are trying to measure will be further than one SEM from the actual measurement. On 16% (one in six) of occasions the true value will be higher than the measured value by one SEM or more and on 16% of occasions it will be lower. This isn’t particularly reassuring so in classical measurement theory scientists tend to prefer to concentrate on the ±2 SEM range within which 95% of measurements fall (this still means that on 1 in 20 occasions the true value will lie more than 2 SEM from a single measurement).
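The 68% and 95% figures come straight from the area under the normal curve. A minimal sketch in Python, using the standard library's `statistics.NormalDist` (the function name `fraction_within` is just an illustrative choice):

```python
from statistics import NormalDist

def fraction_within(k):
    """Fraction of normally distributed measurements within +/- k SD of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2):
    inside = fraction_within(k)
    print(f"within +/-{k} SEM: {inside:.1%}, outside: {1 - inside:.1%}")
```

This gives roughly 68% within one SEM and a little over 95% within two; the exact multiplier for 95% is 1.96 rather than 2, which is close enough for the arguments here.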

This type of analysis gets quite scary when applied to gait analysis measures. I’ll focus on a physical exam measure as an example because then we don’t need to worry about variation across the gait cycle. Fosang et al. (2003) calculated the SEM for the popliteal angle as 8°. This means that if a single measurement of 55° is made on any particular person then there is a 1 in 3 chance that the true measurement is greater than 63° or less than 47°. If we want 95% confidence then all we can say is that the true value lies somewhere between 39° and 71°. Data from Jozwiak et al. (1997) suggest that the one standard deviation range for the normal population of boys is from 14° to 50° (you do need to make some assumptions to extract these values) and the two standard deviation range is from -4° to 68°. Thus the 95% confidence limits on our measurement of 55° (39° to 71°) suggest the true value could lie anywhere from well within the 1SD range to outside the 2SD range. As a clinical measurement this isn’t very informative.
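The arithmetic behind these ranges is simply the measurement plus or minus one or two SEM. A small sketch, using the SEM of 8° quoted from Fosang et al. (2003) and the example measurement of 55° (the function name is an illustrative choice):

```python
def true_value_range(measured, sem, n_sem):
    """Range within n_sem standard errors of measurement of a single reading."""
    return measured - n_sem * sem, measured + n_sem * sem

SEM_POPLITEAL = 8.0  # degrees, as quoted in the text from Fosang et al. (2003)
MEASURED = 55.0      # degrees, the example measurement used in the text

print(true_value_range(MEASURED, SEM_POPLITEAL, 1))  # (47.0, 63.0): ~68% range
print(true_value_range(MEASURED, SEM_POPLITEAL, 2))  # (39.0, 71.0): ~95% range
```

Two SEM either side of 55° runs from 39° to 71°, which straddles the upper end of both the one and two standard deviation ranges for the normal population.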

Things look even gloomier when you want to compare two measurements! We very often want to know if there is any evidence of change. Has a patient improved, or have they deteriorated, either as the result of a disease process or as a consequence of some intervention? We take one measurement and some weeks later we make another measurement to compare. There is measurement variability associated with both measurements so we are even less certain about the difference between the two measurements than we are about either individual measurement. Any decent clinical measurement text book will tell you that the variability in the difference between two measures will be 1.4 (√2) times the SEM for an individual measurement.

Going back to the popliteal angle measurement this means that in order to have 95% confidence that two measurements are different we need to have measured a difference of greater than 22° (22.4° actually, being 1.4*2*8°). This is huge and may make you just want to give up and get a job sweeping the roads which doesn’t require you to think about what you are doing. Don’t give up though – for all this sounds pretty grim it is better than using the old surgeon’s trick of eyeballing the measurement and recording “tight hamstrings”.
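As a quick check on that figure, here is a minimal sketch of the calculation, assuming the two measurement errors are independent and normally distributed with the same SEM (the function name is illustrative):

```python
import math

def minimum_detectable_difference(sem, z=2.0):
    """Smallest measured difference giving ~95% (two-tailed, z = 2) confidence of change.

    The sqrt(2) factor arises because the difference of two independent
    measurements, each with standard deviation sem, has standard deviation
    sqrt(2) * sem.
    """
    return z * math.sqrt(2) * sem

print(round(minimum_detectable_difference(8.0), 1))  # popliteal angle, SEM = 8 degrees
```

Using √2 without rounding it to 1.4 gives 22.6° rather than 22.4°, which makes no practical difference to the conclusion.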

There are some other factors to consider as well. We may not be interested in detecting a difference but want confidence that what we are doing is not actually harming our patient. So take two measurements of popliteal angle and let’s assume the later measurement is lower (better) than the first. On 2.5% of occasions the true difference will be less than the 95% confidence limit (we will have over-estimated the change by more than the confidence limits) but the other 2.5% who are outside the confidence limits have had an even more positive change (we have under-estimated the change). We thus have 97.5% confidence in an improvement greater than the lower limit. There is a strong argument that we should be using what is called a one-tailed distribution to correct for this, in which case we only need 2.3 * SEM in order to have 95% confidence of an improvement. This still works out as 18°.

We can also question the need for 95% confidence. How often do doctors or allied health professionals ever have 95% confidence in what they are doing? Why should we demand so much more of our statistical measures than we do of other areas of our practice? In some cases we might want 95% confidence (if we are going to spend many thousands of pounds operating on a child with cerebral palsy and requiring them and their family to engage in an 18 month rehabilitation programme) but in others this might be overkill (if we want to assess the outcome of a routine programme of physical therapy). In many clinical situations having 90% confidence that a treatment has not been detrimental may be sufficient. If we drop to requiring 80% confidence then the measured difference need only be as low as 1.2 times the SEM. The table below allows you to work out the minimal difference you need to measure (in multiples of the SEM) to be confident that an improvement has occurred (one tailed). I wouldn’t drop much below 80% because there is limited sense in drawing formal conclusions from data if you know you’re going to be wrong on 1 in every 5 occasions.

Minimum difference between two measurements (all in terms of SEM)
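The one-tailed multipliers quoted in the text (2.3 for 95% confidence, 1.2 for 80%) can be recomputed from the normal distribution. This is a sketch under the same assumptions as above (independent, normally distributed errors with equal SEM); the confidence levels listed are illustrative choices.

```python
import math
from statistics import NormalDist

def sem_multiplier(confidence):
    """One-tailed multiple of the SEM that a measured difference must exceed."""
    z = NormalDist().inv_cdf(confidence)  # one-tailed critical value
    return z * math.sqrt(2)               # sqrt(2) because two measurements are compared

for confidence in (0.95, 0.90, 0.80):
    print(f"{confidence:.0%} confidence: {sem_multiplier(confidence):.2f} x SEM")
```

Multiplying by the SEM for the measure in question (8° for the popliteal angle) converts these into degrees.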

Before you start thinking that the picture is too rosy remember that not harming your patients is a pretty low standard of care. If we are delivering any care package we really want confidence that it is helping. To manage this statistically we need to define a minimal clinically important difference (MCID). This is the minimum value of change that you consider is useful to the patient as a result of the treatment. If you are simply trying to prevent deterioration then the value may be zero and the analysis above is appropriate. For most interventions, however, you want improvement and to have confidence of that improvement the difference in your measurements needs to exceed the MCID by the number of SEM stated in the right hand column of the table. In some ways this analysis is depressing. The hard truth is that there is significant measurement variability in the measurements that most of us rely on (gait analysis is very little better than physical examination). Most of the time we are deceiving ourselves if we think that we have hard evidence of anything from a single clinical measurement from an individual.
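The MCID rule described above can be sketched in the same way: the measured difference has to exceed the MCID plus the relevant one-tailed multiple of the SEM. The MCID of 10° used below is purely hypothetical, chosen only to illustrate the arithmetic, and the function name is an illustrative choice.

```python
import math
from statistics import NormalDist

def required_difference(mcid, sem, confidence=0.95):
    """Measured difference needed for one-tailed confidence that true change exceeds mcid."""
    multiplier = NormalDist().inv_cdf(confidence) * math.sqrt(2)
    return mcid + multiplier * sem

SEM_POPLITEAL = 8.0  # degrees, as quoted in the text from Fosang et al. (2003)

# MCID of zero: confidence only that the patient has not deteriorated
print(round(required_difference(0.0, SEM_POPLITEAL), 1))
# Hypothetical MCID of 10 degrees: confidence of a clinically useful improvement
print(round(required_difference(10.0, SEM_POPLITEAL), 1))
```

With an MCID of zero this reduces to the 18° or so discussed earlier; any positive MCID simply adds to it.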

In many ways, though, I think that this is one of the strengths of clinical gait analysis, particularly in the way it brings together so many different measurements including kinematics, kinetics, physical exam, video and EMG. Although we have limited confidence in any of the individual measurements the identification of patterns within such a wide range of measurements can give considerably more confidence in our overall clinical decision making.

The other thing I’d point out is that none of the discussion above would have been possible on the basis of a measure of reliability such as the ICC. Fosang et al. (2003) quote the ICC for popliteal angle as 0.72. I defy anyone to construct a rational interpretation of the impact of measurement error on clinical interpretation on the basis of that number!


Bland, J. M., & Altman, D. G. (1996). Measurement error and correlation coefficients. BMJ, 313(7048), 41-42.

Fosang, A. L., Galea, M. P., McCoy, A. T., Reddihough, D. S., & Story, I. (2003). Measures of muscle and joint performance in the lower limb of children with cerebral palsy. Dev Med Child Neurol, 45(10), 664-670.

Jozwiak, M., Pietrzak, S., & Tobjasz, F. (1997). The epidemiology and clinical manifestations of hamstring muscle and plantar foot flexor shortening. Dev Med Child Neurol, 39(7), 481-483.


Metrology or psychometrics


July’s over so time to move on from the Determinants of Gait. We’re starting detailed development of teaching material for our new masters degree programme in clinical gait analysis. I’m working on the measurement theory section at the moment and have been reflecting on how to approach this. I’ve got an engineering background and automatically assume that the language we should use to describe measurement is that of classical measurement theory which I’m going to refer to as metrology.

Modern metrology really started with the French Revolution when a political motivation emerged to standardise measurement systems across the country. Out of this emerged an international process for the standardisation of measurement which is now overseen by the Conférence Générale des Poids et Mesures, still based in Paris. They publish the International Vocabulary of Metrology (VIM) which is really the international “Bible” for measurement theory. The Vocabulary is designed to be universal including the statement that “metrology includes all theoretical and practical aspects of measurement, whatever the measurement uncertainty and field of application.”

One of the things that interests me about measurement in medicine in general and in rehabilitation in particular is that, in some respects, it is developing separately to this paradigm which is accepted almost universally in the physical and biological sciences and in engineering and chemistry. Measurement in medicine and rehabilitation is becoming increasingly conceived within the framework of psychometrics. Why, if all the rest of the world is handling measurements one way, do psychology and now rehabilitation need to adopt a different approach?

Whilst its foundations can be traced back to Darwin (see Wikipedia), psychometrics really came of age in the middle of the twentieth century and is thus a much more recent development than metrology. As the name implies it was developed by psychologists for their work studying concepts such as self-esteem or happiness or even pain which are less specifically defined than quantities in other branches of science. Over-simplifying a bit – metrology was developed to measure things that are specifically defined whereas psychometrics was developed to measure things that are not.

If the quantity you are measuring is specifically defined (e.g. someone’s height) then it is sensible to ask how accurate the measurement is (are you measuring what you claim to be measuring) and this is the fundamental challenge of metrology. If the quantity you are measuring is not specifically defined (e.g. how happy someone is) then the question of how accurate you are is rather meaningless. Psychometry thus focusses on the twin alternative questions of how reliable (repeatable) and valid measurements are.

Others may argue but I am convinced that there is a hierarchy here. If a quantity is well enough defined to determine how accurately it can be measured, then assessing repeatability and validity is second best. If you want to do the job properly you should use metrology to assess accuracy. Psychometric assessment of reliability and validity should be confined to quantities for which the superior option is not possible.

I think that the insidious onset of psychometry has made people lazy. I suspect that there would have been considerably more effort expended on improving measurements in biomechanics if the community had focussed on ensuring accuracy of measurements rather than accepting second best (and rather flattering) measures of repeatability derived from an essentially psychometric approach.

Any volunteers to man (or woman!) the barricades against the insurgence of psychometrics where it isn’t needed or wanted?