Easy vs meaningful metrics
Due to financial and human resource constraints, proxy indicators are inevitably used at times in lieu of measuring what really matters.
Proxy indicators are genuinely useful and provide important information for decision-making and course corrections, but they should never be conflated with the actual goals they stand in for.
A typical example is tracking quantity rather than quality. Number of schools built, number of graduates, number of natural resource managers, number of acres of protected parkland, number of people with access to water, and so on: these are all helpful bits of information. Unfortunately, they tell us little about the quality of the education, the quality of the nature conservation, or the quality of the water being accessed. Expanding the number of acres under park protection is better than not, unless these are merely “paper parks” that offer no meaningful ecosystem or wildlife protection.
Unfortunately, the data that are easiest to collect are often input and output metrics that tell us little about outcomes. The development community is familiar with this problem, since every project or program must grapple with the challenge of choosing the right metrics and methods for data collection. Fortunately, there are many useful resources for those in the M&E/MEL/MERL field to consult when trying to identify good indicators.
The “tyranny of metrics”
Jerry Z. Muller’s book, The Tyranny of Metrics (2018), weighs in on these challenges. In fact, he presents a much-needed counterpoint to the growing obsession with using metrics to assess performance — whether in the nonprofit sector, business, or government. He argues persuasively that the problem is not “that metrics are intrinsically tyrannical, but rather that they are frequently used in ways that are dysfunctional and oppressive.”
Muller provides both a philosophical critique and a practical one based upon case studies from higher education, K-12 education, medicine, policing, the military, business and finance, and philanthropy and foreign aid.
In chapter 15, he outlines eleven “unintended but predictable negative consequences” of our metrics-obsessed organizational cultures. Each of these is important to bear in mind, but I want to focus here on the first one he mentions — “Goal displacement through diversion of effort to what gets measured” — because it resonates so strongly with my own critical perspective on monitoring and evaluation in the international development sector.
In essence, Muller argues that too often indicators end up supplanting goals in ways that push us further from achieving those goals. This, then, is the paradox of metrics: tracking metrics helps identify what is working and what isn’t so that, in theory, we can improve efficacy, efficiency, and ultimately outcomes. But because the metrics are usually just proxies for what really matters, pursuing them as ends in and of themselves may backfire. For one thing, when people and organizations are rewarded according to their scores on these proxy measures, “gaming the system” predictably emerges: people pursue good numbers at the cost of good outcomes.
Muller illustrates this kind of perverse incentive with some striking cases. For example, in New York State, after physician report cards became publicly available, physicians became more risk-averse. In other words, they improved their patient-outcome scores by taking on fewer “high-risk” cases:
“In New York State, for example, the report cards for surgeons report on postoperative mortality rates for coronary bypass surgery, that is, what percentage of the patients operated upon remain alive thirty days after the procedure. After the metrics were instituted, the mortality rates did indeed decline — which seems like a positive development. But only those patients who were operated upon were included in the metric. The patients who the surgeons declined to operate on because they were more high-risk — and hence would bring down the surgeon’s score — were not included in the metrics.”
— Jerry Z. Muller, The Tyranny of Metrics (2018), p. 117
It turned out that many high-risk patients were simply being sent to an out-of-state clinic. Excluding the likeliest deaths from the calculation in this way depressed the reported mortality rate.
If all we know is the “success rate” of a physician, then of course that information will be interpreted as “the better the success rate, the better the physician.” However, if the scores do not reflect the condition of the patient at the time of the surgery, then aside from weeding out some truly awful surgeons, the scores tell us surprisingly little about who the best surgeons are. Surely, we would be most impressed by a doctor who not only had excellent patient recovery rates but also took on the most challenging cases — especially if our own case was “high-risk.”
In this example, a poorly designed indicator may be to blame.[i] A better approach to measuring performance might be to report separate 30-day postoperative mortality rates for low-risk and high-risk patients. Even then, we would still want to know what proportion of a particular physician’s patients falls into each risk category (i.e., how much expertise does the doctor have with high-risk cases?). Perhaps an even better indicator would be a weighted score that accounts for this, giving more weight to survival among high-risk patients.
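To make the gaming arithmetic concrete, here is a minimal sketch in Python. Every number in it is invented for illustration (the case counts, deaths, and expected-mortality benchmarks are assumptions, not real data), and the observed-to-expected ratio shown is just one common way to operationalize the risk-weighting idea described above.

EXPECTED_MORTALITY = {"low": 0.03, "high": 0.25}  # assumed population benchmarks

# Hypothetical surgeon who is good with hard cases:
# 90 low-risk operations with 2 deaths, 10 high-risk operations with 1 death.
caseload = {
    "low": {"ops": 90, "deaths": 2},
    "high": {"ops": 10, "deaths": 1},
}

def naive_rate(cases):
    """Deaths per operation, ignoring patient risk entirely."""
    ops = sum(c["ops"] for c in cases.values())
    deaths = sum(c["deaths"] for c in cases.values())
    return deaths / ops

def stratified_rates(cases):
    """Mortality reported separately for each risk class."""
    return {risk: c["deaths"] / c["ops"] for risk, c in cases.items()}

def observed_to_expected(cases):
    """Observed deaths divided by the deaths expected for this case mix.
    Values below 1.0 mean better-than-expected outcomes."""
    observed = sum(c["deaths"] for c in cases.values())
    expected = sum(c["ops"] * EXPECTED_MORTALITY[risk] for risk, c in cases.items())
    return observed / expected

# Gaming the naive metric: refer all ten high-risk patients out of state.
gamed = {"low": caseload["low"]}

print(f"naive mortality, full caseload:  {naive_rate(caseload):.1%}")
print(f"naive mortality, gamed caseload: {naive_rate(gamed):.1%}")
print(f"stratified mortality: {stratified_rates(caseload)}")
print(f"O/E ratio, full caseload:  {observed_to_expected(caseload):.2f}")
print(f"O/E ratio, gamed caseload: {observed_to_expected(gamed):.2f}")

In this toy example, the naive rate improves (3.0% down to about 2.2%) when the ten high-risk patients are referred away, while the observed-to-expected ratio worsens (about 0.58 up to 0.74), so a surgeon judged on the risk-adjusted score has no incentive to offload hard cases.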
Being vigilant against goal displacement
Of course, most of us in the development field already know that it is important to design good indicators. And yet the problem persists.
It seems that awareness alone is insufficient: we must be proactively vigilant to detect and avert goal displacement and other unintended negative consequences of our chosen metrics. It is not enough to design a results framework — or a data dashboard — and then follow it on faith alone. We must regularly revisit our indicators and critically reflect on whether they — as well as any performance incentive systems that use them — are producing negative outcomes. This can be built into an annual review process.
Ultimately, this kind of internal critique requires a willingness to adapt and change, as well as the use of some expert judgment (although what counts as expert judgment — not to mention, who is an expert — is another discussion).
Just because metrics can lead to goal displacement doesn’t mean we shouldn’t use any. It does mean that we have to weigh the relative value of some measures against others and make trade-offs (especially considering the financial and human resource constraints that limit how much and what kinds of data can be collected).
Following an excellent analysis by Lant Pritchett at the Center for Global Development blog, we also need to appreciate that indicators reflect components of a system. To understand how well the system is functioning, and what is needed to improve it, we must interpret each indicator in the context of the other indicators and of the system as a whole. Regardless of how much information is available (more is not always better), some expert judgment is still required to interpret it correctly and wisely.
If we are not intentional about this process — whether as part of our regular team meetings or quarterly progress reporting — then it can be very easy to fall into the trap of losing sight of the forest for the trees.
Use imagination to avoid creating perverse incentives
One way to avoid selecting distracting or perverse indicators in the first place is to use our imagination. After identifying a set of plausibly useful indicators, we can take the time to imagine the ways this information could push or pull our team away from the bigger project or program goals. Does the pursuit of better scores on a given indicator actually distract us from the goal, or even work against another, perhaps more important, goal?
In other words, can you predict any unintended negative consequences of using these metrics? If they can be predicted, they can be avoided. To some extent, making accurate predictions requires knowledge of human psychology and behavior, and the burgeoning field of behavioral economics has become especially helpful not only in recognizing perverse incentives but in predicting them.
In his essay “When Economic Incentives Backfire,” Samuel Bowles provides an oft-cited example:
“When six day-care centers in Haifa, Israel, began fining parents for late pickups, the number of tardy parents doubled. The fine seems to have reduced their ethical obligation to avoid inconveniencing the teachers and led them to think of lateness as simply a commodity they could purchase.”
Once he puts it this way, the result is not really surprising. Similarly, if we track the number of students enrolled in school without any additional indicators of the quality of their education, it should not surprise us that schooling does not necessarily equate to learning.
Obviously, this is more than an accounting or assessment issue. One heartbreaking story is told by Pritchett in his book, The Rebirth of Education: Schooling Ain’t Learning (2013). He recounts a village meeting in Uttar Pradesh, India, following a school assessment performed by an MIT-based research team. The assessment had revealed that the village’s children had apparently learned very little at their school. One middle-aged father stood up and said directly to the school principal:
“You have betrayed us. I have worked like a brute my whole life because, without school, I had no skills other than those of a donkey. But you told us that if I sent my son to school, his life would be different from mine. For five years I have kept him from the fields and work and sent him to your school. Only now I find out that he is thirteen years old and doesn’t know anything. His life won’t be different. He will labor like a brute, just like me.”
— A father in Uttar Pradesh, quoted in The Rebirth of Education (2013), p. 2
Unfortunately, whatever performance metrics had been used at this school had failed to ensure positive learning outcomes. According to Pritchett, a major reason for this kind of failure is that too often schools aim to look as if they are excelling — never mind whether they actually are. Such isomorphic mimicry emerges in many institutional contexts — and will be the subject of another post.
Concluding remarks
For now, I will conclude by emphasizing that we must not lose sight of the human development goals we truly care about. This, more than any well-chosen metric, is what keeps our mission on track. We have to be vigilant and self-critical, willing to say we’re not making meaningful progress even when our indicators say we’re “on target.” This requires good judgment. Let’s not abdicate the role of expert judgment to data analytics. As the statistician George Box famously put it, “All models are wrong, but some are useful.” Let us interpret data usefully, bearing in mind their limitations, and never lose sight of their purpose.
To help us in this regard, Muller concludes his book with a checklist for “when and how to use metrics” (see also this list [PDF] of “criteria for selection of high-performing indicators” developed by Goldie MacDonald with the Centers for Disease Control).
Endnotes
[i] Although, as Muller is careful to note, a similar pattern apparently emerged in Massachusetts even though the scorecards there were not made public. I think, though, it’s probably fair to assume that publicly available scores raise the stakes of the game, so to speak.