This post is really a follow-up to the rant I had about psychometrics about a month ago. Again it's prompted by preparing some material on *Measurement Theory* for our Masters programme. It focusses on the use of reliability indices for assessing the variability associated with measurements. Needless to say, reliability indices are a central feature of the psychometric approach.

The more I think about these the more worked up I get. How can something so useless be so widely implemented? The main problem I have with the indices is that they are almost impossible to make any sense of. Fosang et al. (2003) reported an interrater intra-class correlation coefficient (ICC) for the popliteal angle of 0.78. What on earth does this mean? According to Portney and Watkins (Portney & Watkins, 2009) this rates as “good”. How good? If I measure a popliteal angle of 55° for a particular patient how do I use the information that the ICC is 0.78? Perhaps even more important, if another assessor measures it to be 60° a few weeks later how do we interpret that?

What is even more frustrating is that there is a far superior alternative, the standard error of measurement (SEM – don't confuse it with the standard error of the mean, which sounds similar but is something entirely different). This expresses the variability in the same units as the original measure. It is essentially a form of the standard deviation, so we know that 68% of repeat measures are likely to fall within ± one SEM of the true value. Fosang et al. also report that the SEM for the popliteal angle is 6.8°. Now if we measure a popliteal angle of 55° for a particular patient we have a clear idea of how accurate our measurement is. We can also see that the difference of 5° in the two measurements mentioned above is less than the SEM and there is thus quite a reasonable possibility that the difference is simply a consequence of measurement variability rather than of any deterioration in the patient's condition. (Rather depressingly, we need a difference of nearly 3 times the SEM to have 95% confidence that the difference in two such measurements is real.)
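The “nearly 3 times the SEM” figure follows from treating the SEM as a standard deviation: a real change must exceed 1.96 × √2 × SEM (the √2 appears because both measurements carry their own error). A quick sketch in Python – the 6.8° value is from Fosang et al., the rest is standard normal-theory arithmetic:

```python
import math

# SEM for the popliteal angle from Fosang et al. (2003)
sem = 6.8  # degrees

# For 95% confidence that a *difference* between two measurements
# is real, it must exceed 1.96 * sqrt(2) * SEM (each measurement
# contributes its own error, so the difference has SD = sqrt(2)*SEM).
mdc95 = 1.96 * math.sqrt(2) * sem
print(f"Minimal detectable change (95%): {mdc95:.1f} degrees")

# The 5 degree difference discussed in the text is well below this,
# so it may simply reflect measurement variability.
print(5 < mdc95)
```

This is often called the minimal detectable change (MDC95); the multiplier 1.96 × √2 ≈ 2.77 is where “nearly 3 times the SEM” comes from.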

Quite often the formula for the SEM is given as

SEM = SD√(1 − ICC)

This suggests that the SEM is a derivative of the ICC which is quite misleading. The SEM is quite easy to calculate directly from the data and should really be seen as the primary measure of reproducibility with the ICC the derivative measure:

ICC = 1 − (SEM/SD)²
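As a sketch of what “calculate directly from the data” means, here is a minimal Python example. The within-subject standard deviation gives the SEM straight from repeated measurements, and the ICC then follows as the derivative quantity. The two raters' readings are invented purely for illustration:

```python
import statistics

# Hypothetical popliteal-angle measurements (degrees): two raters,
# the same five patients. Illustrative numbers only, not real data.
rater_a = [55, 62, 48, 70, 58]
rater_b = [60, 59, 52, 67, 55]

# SEM directly from the data: the within-subject standard deviation.
# With two measurements per subject the within-subject variance is
# (difference)^2 / 2, averaged across subjects.
within_var = statistics.mean((a - b) ** 2 / 2 for a, b in zip(rater_a, rater_b))
sem = within_var ** 0.5

# The ICC is then the derivative measure: ICC = 1 - (SEM/SD)^2,
# where SD is the between-subject SD of the sample (subject means).
means = [(a + b) / 2 for a, b in zip(rater_a, rater_b)]
sd = statistics.stdev(means)
icc = 1 - (sem / sd) ** 2

print(f"SEM = {sem:.1f} deg, ICC = {icc:.2f}")
```

Note how the SEM drops out of the data with no reference to the ICC at all, whereas the ICC cannot be stated without also committing to a particular sample SD.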

There are at least six different varieties of the ICC, representing different models for exactly how reliability is defined. Although the differences between the models appear quite subtle, the ICCs calculated on the basis of the different models vary considerably (see pages 592-4 of Portney & Watkins, 2009 for a good illustration of this). It is quite common to find publications which don't even tell you which model has been used.

Simplifying a little, the ICC is defined as the ratio of the variability arising from true differences in the measured variable between individuals in the sample (variance = σ_T²) to the total variability, which is the sum of the true variability and the measurement error (variance = σ_T² + σ_E²), thus

ICC = σ_T² / (σ_T² + σ_E²)

Unfortunately this means that the ICC doesn't just reflect the measurement error but also the characteristics of the sample chosen. If the sample you choose has a large range of true variability then you will get a higher ICC even if the measurement error is exactly the same. This means that, even if you can work out how to interpret the ICC clinically, you can only do so sensibly with an ICC calculated from a sample that is typical of your patient population. It is nonsensical, for example, to assess the ICC from measurements on a group of healthy individuals (which is common in the literature because it is generally easier) and then apply the results to a particular patient group.
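A few lines of Python make the sample-dependence concrete. The two “true” standard deviations below are invented purely for illustration, while the SEM is the Fosang et al. value; the measurement error is identical in both cases, yet the ICC changes substantially:

```python
# Same measurement error, two samples of different heterogeneity:
# the ICC changes even though the instrument hasn't.
sem = 6.8  # measurement error (degrees), from Fosang et al. (2003)

# True between-subject SDs are hypothetical, chosen for illustration.
for label, true_sd in [("narrow (healthy) sample", 8.0),
                       ("wide (patient) sample", 20.0)]:
    icc = true_sd**2 / (true_sd**2 + sem**2)
    print(f"{label}: true SD = {true_sd} deg -> ICC = {icc:.2f}")
```

With these numbers the same instrument scores roughly 0.58 in the narrow sample and 0.90 in the wide one – “moderate” or “good” depending entirely on who you measured, not how well you measured them.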

Luckily there is a safeguard here: for most measures we are interested in, the true variability in a group of healthy individuals is lower than that in the patients we are interested in, so the ICC calculated from the healthy individuals is likely to be a conservative estimate of the ICC for the patient group.

Interpretation of the ICC is generally based on descriptors. Fleiss (1986) suggested that an ICC in the range 0.4 – 0.75 was *good* and over 0.75 was *excellent*. Portney and Watkins (2009) are a little more conservative, regarding values below 0.75 as *poor* to *moderate* and those above 0.75 as *good*. In their latest edition, however, they do suggest that “for many clinical measurements, reliability should exceed 0.90 to ensure reasonable validity [sic]”.

It is possible to do a little maths to explore these ratings. Using the formula above we can calculate the ICC for different values of the SEM (σ_E, expressed as a percentage of the standard deviation of the true variability within the sample, σ_T).
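The relationship is easy to tabulate directly from the formula above; writing r = σ_E/σ_T, the ICC reduces to 1/(1 + r²). A short Python sketch:

```python
# ICC as a function of the measurement error expressed as a
# percentage of the true between-subject SD (sigma_T).
# From ICC = sigma_T^2 / (sigma_T^2 + sigma_E^2) = 1 / (1 + r^2),
# where r = sigma_E / sigma_T.
for pct in [20, 40, 60, 80, 100]:
    r = pct / 100
    icc = 1 / (1 + r**2)
    print(f"SEM = {pct:3d}% of sigma_T -> ICC = {icc:.2f}")
```

The table this produces runs from an ICC of 0.96 (error only 20% of the true SD) down to 0.50 (error equal to the true SD).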

You can see that even if the measurement error is the same size as the true variability in the sample studied, the ICC is still 0.5, so Fleiss's early suggestion that an ICC as low as 0.4 represents good reliability is a little suspicious. On his scale, *excellent* reliability starts at an ICC of 0.75, which corresponds to the measurement error still being over half (about 60%) of the standard deviation of the true variability – that doesn't sound particularly *good* to me, let alone *excellent*! Even Portney and Watkins' cut-off of 0.90 for clinical measurements still allows the measurement error to be almost exactly a third of the true variability. All in all I'd suggest that either set of descriptors is extremely flattering.

So the ICC is ambiguously defined, difficult to interpret, compounds reproducibility with sample heterogeneity, and has a clearly superior alternative in the SEM. Why on earth is it so popular? I'd suggest the reason lies in the numbers above – if your reproducibility statistics aren't very good then put them through the ICC calculator and you'll feel a great deal better about them. Award yourself an ICC of just over 0.75 and feel that nice warm glow inside as you allow Fleiss to label you as excellent!

PS

You might ask how we ever got into this situation, and I suspect the answer may lie in the original paper of Shrout and Fleiss (1979) and the example they use to discuss the use of the ICC:

“For example Bayes (1972) desired to relate ratings of interpersonal warmth to nonverbal communication variables …”

Does it surprise us that measures developed to quantify reliability of variables such as *interpersonal warmth* and *nonverbal communication* may not be directly applicable to clinical biomechanics? Perhaps interdisciplinary collaboration can be taken a little too far.


Fleiss, J. L. (1986). *Design and Analysis of Clinical Experiments*. New York: John Wiley & Sons.

Fosang, A. L., Galea, M. P., McCoy, A. T., Reddihough, D. S., & Story, I. (2003). Measures of muscle and joint performance in the lower limb of children with cerebral palsy. *Dev Med Child Neurol, 45*(10), 664-670.

Portney, L. G., & Watkins, M. P. (2009). *Foundations of clinical research: applications to practice.* (3rd ed.). Upper Saddle River, NJ: Prentice-Hall.

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: uses in assessing rater reliability. *Psychological Bulletin, 86*, 420-428.

At the end of the day, Richard, it is a single index for correspondence and agreement; previous “old school” measures such as correlation and t-tests fall short by a greater margin. I do agree, however, that alternatives should be sought, but you must be clear in your distinction between reliability in terms of “absolute” measures such as the SEM and “relative” measures such as the ICC, and also about the requirement of normally distributed data for the SEM. Good post though – a topic that needs further discussion in biomechanics.

Thanks to Quin Louw from Cape Town, who's drawn my attention to a paper in the Journal of Clinical Epidemiology going into some of these issues in a little more detail. I'd recommend downloading a copy if you are interested in the topic:

de Vet, H. C., Terwee, C. B., Knol, D. L., & Bouter, L. M. (2006). When to use agreement versus reliability measures. *J Clin Epidemiol, 59*(10), 1033-1039.

As you might gather from what I’ve written in the main post I don’t think they’ve been critical enough of the ICC but it’s still an informative read.

Hi Richard, is there an alternative to the SEM for non-normally distributed data? Would such a measure then reflect reliability or agreement?

Interesting question, Lynn. The SEM represents the range within which 68% of measurements fall about the true value. The equivalent for data that isn't normally distributed is the inter-quartile range, which is by definition the range in which the middle 50% (−25% to +25%) of measurements lie. (I often wonder why we don't have a mid-68-percentile range – then descriptors of normally distributed and non-normally distributed data would be reported in directly comparable units – but that's another issue!) I've never seen anyone calculate a within-person inter-quartile range, but that's essentially what is required.

The tidiest way to handle this would be to apply a transformation to the data to make it normally distributed. You can then perform the analysis and back-transform to give yourself the inter-quartile ranges.
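A minimal sketch of the transform-and-back-transform idea, assuming a log transformation suits the data; the measurements are invented for illustration:

```python
import math
import statistics

# Hypothetical repeated measurements on one subject, positively
# skewed (one long right-tail value). Illustrative numbers only.
repeats = [12.0, 15.0, 11.0, 30.0, 14.0]

# Log-transform to pull in the long right tail.
logged = [math.log(x) for x in repeats]

# Compute the spread on the (roughly normal) log scale...
mean_log = statistics.mean(logged)
sd_log = statistics.stdev(logged)

# ...then back-transform +/- 1 SD to give an interval in the
# original units. Note the interval comes out asymmetric, as it
# should for skewed data.
lower = math.exp(mean_log - sd_log)
upper = math.exp(mean_log + sd_log)
print(f"~68% interval: {lower:.1f} to {upper:.1f}")
```

The asymmetry of the back-transformed interval is the point: a single symmetric ± SEM in the original units would misrepresent skewed data.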

I’d see this as representing measurement error or variability or reproducibility. I’m not convinced that the distinction many people make between agreement and reliability is particularly useful.

Do note that some people see the ICC as preferable because the SEM requires normally distributed data. This is rubbish, because the analysis that produces ICC values depends on the assumption of normally distributed values in almost exactly the same way as the calculation of the SEM. There is also a subtle issue that the measurement error may be normally distributed even if the true inter-subject variability isn't. In this case a direct calculation of the within-subject standard deviation may be a good estimate of the SEM, but I suspect that calculating it using a method based on analysis of variance would be suspect. There are some quite subtle issues here, though – probably more subtle than is justified by the quality of the data that most of us are able to collect as a basis for these analyses.

Thank you for the brilliant explanation.

Are there some range values for the SEM or SEM% similar to those for the ICC (e.g. poor, moderate, good)?

Or maybe some percentage of the mean for the SEM?