Appendix E: Blinded Reviewer Comments (continued)
Health Care Efficiency Measures: Identification, Categorization, and Evaluation
Chapter 4 - Assessing Measures
Section | Comments | Response |
---|---|---|
Chapt. 4 - Assessing Measures | P49, paragraph 1 - First sentence is awkward. "We suggest …" reads easier. | This change was made. |
Chapt. 4 - Assessing Measures | P49, paragraph 1 - consider adding Appropriateness or Suitability to stated purpose as a criterion. The authors actually cite this as a key reason that stakeholders cited measures developed in the academic world as inadequate for answering their questions (see top of page 50) | Added actionability as a criterion. |
Chapt. 4 - Assessing Measures | Page 49, "Importance". You may want to ignore my comment here, but I disagree with your assertion that measures in peer-reviewed literature "…are more important to a scholarly audience." In reality, the vast majority of these papers are not important to anyone; they simply represent academics publishing papers to be publishing papers (that's something we often do in academe). I suppose your comment could be considered true, in that these articles are important to the authors, in that the new publications can be listed on the authors' annual reports to their departments. They don't really expect anyone to actually make use of the findings. In the last line of the page, you note that Newhouse questioned the utility of existing efficiency measures for policy, but isn't he really questioning the technique for deriving the measures (SFA) rather than the measures themselves? | We agree and have modified the paragraph to reflect this comment. |
Chapt. 4 - Assessing Measures | P49, last paragraph - Consider moving the first sentence in the last paragraph to the end of the previous paragraph. | Done. |
Chapt. 4 - Assessing Measures | P50, paragraph 2 - The last sentence implies a value judgement that I'm not convinced is universally the case (i.e. that multi-input, multi-output measures are superior). | Judgment has been removed. |
Chapt. 4 - Assessing Measures | Page 50, end of second paragraph: as per my comment on the executive summary section, nuanced multi-input, multi-output measures are probably a good thing. In this context, yes, it would be harder to convince policy makers with them than with a single numerical judgment, but I bet providers would like them better (and they might convert more readily to quality improvement programs). | Have removed the suggestion that these are necessarily superior. |
Chapt. 4 - Assessing Measures | Page 50, beginning of third paragraph: vendors respond to market needs (see also next comment). | Added |
Chapt. 4 - Assessing Measures | Page 50, end of third paragraph: "…perform fairly well" begs the question: what is the definition of good performance of these measures? The market has issued calls for PFP and tiered networks, and the efficiency index is promoted as a solution. Some of the consultants actually help create the perceived need. For example, Mercer Human Resources uses the efficiency index to tier networks, calculates savings from removing physicians with high O/E ratios, and promotes tiering as a solution. I think the best one can say is that the measures respond to the markets' perceived needs. The performance of these measures is exactly part of the research program for which the report calls. | Modified text to reflect this comment. |
Chapt. 4 - Assessing Measures | Page 50, second to last paragraph (and actually you use this language in several other places). You state that "…reliability of most of these measures have been evaluated…by the vendors." Actually, what the vendors supply are measurement tools - person-level risk adjusters and/or episode groupers. Efficiency measures are developed with the aid of these tools, but evaluating the tools is not the same as evaluating the measures developed with them. | Made this change. |
Chapt. 4 - Assessing Measures | Page 50, bottom: yes, the lack of testing is surprising. The rapid rise in health care costs creates understandable pressure for fast and simple solutions (such as tiered networks). Another editorial comment. | No response necessary. |
Chapt. 4 - Assessing Measures | P51, paragraph 1 - Again, the statement about efficiency measures not capturing quality or outcomes. Is this a failing, or are side-by-side comparisons of cost-efficiency and quality an acceptable alternative? | Included the idea of side-by-side comparisons. |
Chapt. 4 - Assessing Measures | Page 51, second paragraph: Sorry, I couldn't follow this one at all! | We have clarified this paragraph. |
Chapt. 4 - Assessing Measures | Page 51, second paragraph from bottom: Actually, the commercial insurance databases that I've seen do span multiple sites. We have that, the GIC insurers in Massachusetts have it, many (most?) Blues plans have data from multiple sites. What they may not have is significant market penetration. In most markets you need to pool multiple insurers to get a good sample size for an individual physician. Is that what you are thinking here? | This section has been revised and now focuses on the challenges of using aggregated administrative data. |
Chapt. 4 - Assessing Measures | P51, paragraph 4 - Sentence 2 is not true. Sentence 3 implies a possibility that already exists. Re sentence 2: Commercial health plans' administrative data span multiple sites of care. Self-insured purchasers have administrative data that span multiple sites of care. Re sentence 3: Several purchaser initiatives pool these commercial databases (e.g. MAGIC, Care-Focused Purchasing). All of the BQIP pilots have pooled administrative data, including both Commercial and Medicare data. Several states now mandate that all commercial payers submit complete claims data to the state (e.g. New Hampshire, Maine, Kansas) and several are considering such legislation (e.g. Massachusetts, Nevada). New Hampshire and Maine make their pooled administrative data available for research. Providers are identifiable in these datasets whereas payers are not. | |
Chapt. 4 - Assessing Measures | Page 51, bottom: Understanding services as overuse or underuse helps. | We did not introduce this construct into the report. |
Chapt. 4 - Assessing Measures | Page 52, second paragraph. Drop last two sentences, since they also appear in the following paragraph. | This change was made. |
Chapt. 4 - Assessing Measures | P52, paragraph 3 - Collaborative projects that pool administrative data can negotiate lower per physician costs for economic and quality profiles from proprietary vendors. | We did not include this observation. |
Chapt. 4 - Assessing Measures | P52, paragraph 6 - Flexible pricing: discuss the possibility of calculating an average payment per service code across payers, as mentioned on page 16. | Added this to the research agenda and to the discussion in this area; a sketch of this calculation appears after this table. |
Chapt. 4 - Assessing Measures | Page 52, bottom: Again, this is a perspective issue, and another area where the improved typology helps (also helps frame the debate). For example, a participant in CDHP probably wants to see real prices (to understand the out-of-pocket costs), while a plan measuring relative provider efficiency would use standardized dollars to remove biases due simply to contractual differences. | Did not add this comment here; CDHP is referenced elsewhere. |
Chapt. 4 - Assessing Measures | p.53, last paragraph - This paragraph as worded is confusing. I think you are saying that as you move down the variables on the Y axis (various applications for efficiency measurement), they should meet more rigorous criteria on the X axis. Some might argue with this premise, but it should be clearer regardless. (Perhaps some shading on the chart.) Also, the applications need a brief description for the reader who may not intuitively understand how efficiency measurement would be relevant in this case. | We reworded this paragraph to make this more transparent. |
Chapt. 4 - Assessing Measures | p.54, Table 11 - Should the other attributes derived from stakeholder groups be included in the table on the X axis? | We added the attribute "actionable." |
Chapt. 4 - Assessing Measures | Page 54, Table: It seems to me that the question in ranking the uses is, how much harm would occur if a mistake were made? Or more specifically, how much harm would occur to patients and secondarily to physicians if a mistake were made? In terms of increasing consequences of making a mistake, we would order the uses as internal review, quality improvement, PFP, public reporting, tiered copayment, and network selection. Not sure where research fits, probably around the quality improvement level. Also not sure where health plan selection by purchasers fits, probably around the public reporting level. Public reporting is the line where reputations get damaged. If you mistakenly hurt my reputation, you probably meet my lawyer. Network selection is the highest stakes because it disrupts doctor-patient relationships, potentially harming patients by forcing them to switch physicians. And from the physician and plan point of view, if you take away my livelihood in error, you definitely get to meet my lawyer! At least in PFP we could always give the money back (in fact I was in charge of adjusting the RIPA PFP payments in response to errors uncovered). | We have reordered the rows in this way. |
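The O/E ratios and standardized pricing discussed in several comments above (pages 50 and 52) rest on the same arithmetic: price each service at a payer-neutral standardized rate, sum those prices to an episode cost, and divide each physician's observed cost by the expected cost for the same mix of episodes. The sketch below illustrates only that calculation; every service code, price, claim, and expected cost in it is invented, and no vendor's actual grouper or fee schedule is represented.

```python
# Illustrative O/E (observed/expected) efficiency index.
# All codes, prices, claims, and expected costs are hypothetical;
# real vendor groupers and fee schedules differ. Sketch only.

# Standardized fee schedule: one payer-neutral price per service code,
# e.g., the average allowed amount per code across payers.
STANDARD_PRICE = {"99213": 75.0, "99214": 110.0, "70553": 900.0}

# Claim lines already assigned to episodes and physicians (invented).
claims = [
    {"physician": "A", "episode": "ep1", "code": "99213", "units": 2},
    {"physician": "A", "episode": "ep1", "code": "70553", "units": 1},
    {"physician": "B", "episode": "ep2", "code": "99214", "units": 1},
    {"physician": "B", "episode": "ep3", "code": "99213", "units": 1},
]

# Expected (peer-average) standardized cost per episode (invented).
EXPECTED_EPISODE_COST = {"ep1": 800.0, "ep2": 150.0, "ep3": 90.0}

def oe_index(claims):
    """Return {physician: observed/expected standardized cost}."""
    observed, expected, seen = {}, {}, set()
    for line in claims:
        doc = line["physician"]
        observed[doc] = observed.get(doc, 0.0) + (
            STANDARD_PRICE[line["code"]] * line["units"])
        # Count each episode's expected cost once per physician.
        if (doc, line["episode"]) not in seen:
            seen.add((doc, line["episode"]))
            expected[doc] = (expected.get(doc, 0.0)
                             + EXPECTED_EPISODE_COST[line["episode"]])
    return {doc: observed[doc] / expected[doc] for doc in observed}

print(oe_index(claims))  # {'A': 1.3125, 'B': 0.7708...}
```

A ratio above 1.0 flags higher-than-expected resource use. As the comments above note, how well such an index actually performs for tiering or network selection is precisely the open research question the report identifies.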
Chapter 5 - Discussion
Section | Comments | Response |
---|---|---|
Chapt. 5 - Discussion | Page 55, last bullet on page. The statement is not true because you omit, among others, DCGs. | DxCGs are now included. |
Chapt. 5 - Discussion | p.55 - I would recommend grouping the "conclusion" bullets by content area or under some type of heading. At the moment they are a bit scattered. Many need some additional detail, particularly stating potential policy implications (if within the scope of this paper; I am not sure). It would seem that the research agenda could then map back to the findings by identifying gaps. | Conclusion has been written in prose rather than bullet form. |
Chapt. 5 - Discussion | p.56 - Ditto for grouping under the research agenda, and perhaps some prioritization. This is a long list. | Research agenda has been grouped but not prioritized. |
Chapt. 5 - Discussion | P56, Future Research - See suggestions for additional research areas above under page 8 of Executive Summary. | These additions were made. |
Chapt. 5 - Discussion | Page 56, last bullet before Future Research: One idea that might relate to this bullet is that by definition, there is one way to fix an instance of underuse (i.e. supply the underused service); but there are an indefinite number of ways to spend extra money! And they can often be justified. Defining overuse means proving the negative (it is no benefit to do the extra MRI, it is no benefit to use the new drug off-label, etc.). That makes overuse inherently harder to define and drive out. | No change made. |
Chapt. 5 - Discussion | Page 56, Future research: as mentioned above, I would suggest bullets about driving out waste, and most important, about defining desired outcomes (and their connection to patient preferences) so that systems have targets for the quality improvement programs that will make them more efficient. | We have not included this comment. |
Appendix
Section | Comments | Response |
---|---|---|
Appendix | B-5, 1st heading - "SEARCH #1" should be moved up ahead of "DATABASES SEARCHED 2000 - 11/2005." | This change was made. |
Appendix | E-14, header - Should be labeled "Appendix E". I would love to have seen a Reason Code for why each study was excluded. | Reason code was provided. |
Editorial Comment
Section | Comments | Response |
---|---|---|
Editorial Comment | Editorial comment: We believe there is a strong argument that tiered networks are not socially equitable. If tiering worked, then those patients with richer or stronger insurers would be able to access the “best” physicians, while other patients (likely the underserved) would pay more to see the “worst” physicians. In addition, in markets with little excess physician capacity, only the first tiered network works. The “good” physicians' practices fill and then only the “less good” are available, no matter what the tiering says. | No response necessary. |
Which measures are ready for use?
Section | Comments | Response |
---|---|---|
Which measures are ready for use? | None of the peer-reviewed input/output measures seem to be very useful. However, the relative efficiency measures currently used by payors are very useful and begin to provide sustainable cost reduction possibilities by (1) improving provider selection -- more on total efficiency and less on unit price; (2) beginning to be used with tiered co-payments; and (3) informing health plan selection -- which health plans provide the best total value. It is likely that these relative performance measures, with enough data (a key component), should be practical for internal review and improvement of physician and physician-hospital systems, but I would expect a large degree of resistance to use of these. "Report cards" on individual practice are likely to be very controversial, as has been the case with morbidity reporting in several Eastern states. Pay for Performance (P4P) may well be linked at some future date to some combination of quality and efficiency reporting. | No response necessary. |
Which measures are ready for use? | All of them are ready for internal review and improvement -- they all start to give a view of efficiency that is important, but they all need a fair amount of refinement before being used for other uses. I think that PMPM is certainly an easy metric of efficiency that is currently used by purchasers to select health plans; however, it has to be fully severity-adjusted to be meaningful in any way when comparing premiums. (Large employers simply have plans reprice their claims to compare one plan to another and therefore do not need to have the data severity adjusted since it is their own.) When it comes to public reporting, P4P, tiering, and network selection, I personally believe that the efficiency measures you've identified are only suitable to identify the outliers, and then again, only if there are large enough sample sizes from which to calculate the scores. A couple of years ago, BTE and Leapfrog issued a White Paper on measuring provider efficiency. The Paper outlined some of the necessary conditions for use of some of the more common efficiency measurement products. Those conditions still hold true. And the reason for my statement is that there are very few instances currently where payers have enough data to meet the conditions we specified. | No response necessary; a sketch of the severity adjustment described here appears after this table. |
Which measures are ready for use? | Population-based measures are more developed and better suited to a capitated environment, while the episode-based measures are more widely used and better suited to a fee-for-service environment. With regard to performance improvement, population-based measures are difficult to use for improving efficiency as they're generally too high level. Episode groupers hold more promise for improving efficiency, when drill-down reporting capabilities are made available to physicians, but physicians won't engage with them unless there are economic incentives to do so. I don't think either type of measure is ready to be used for pay for performance at the physician level, and maybe not even at the group level. On the other hand, tiering may provide a sufficient incentive for physicians to engage in understanding episode-based measures and working with them to effect improvements in efficiency. | No response necessary. |
Which measures are ready for use? | Which measures are ready now and for what use: It is great to see you using the concept that the rigor of the measure has to match the use of the measure. I made exactly the same argument in the FMA report to the Massachusetts Medical Society on the GIC tiering system. I am very glad that it is self-evident (to you) – it is not self-evident to everyone! NCQA makes the same comments in the HEDIS efficiency performance measures guidelines they released for public comment in February. I'm going to go out on a limb here and wonder out loud if any of these are appropriate uses for efficiency measures. I wonder if the real use of an efficiency measure is as a community or societal indicator. If Rochester, NY is less efficient than other communities at producing high-quality, good-outcome episodes of care for a given condition, then we better start figuring out why and fixing the system. In a model of medical care, such as Wagner's chronic care model, PFP, public reporting, tiered networks, etc. would be means of activating physicians (I think they had a different idea when they discussed activating patients, not just getting them to change doctors based on scores). These tools, however, actually get in the way of quality improvement. I have a comment on the LASIK example to show how that happens. Toyota is efficient at producing moderately priced cars that are safe, start every time and very rarely need to go to the shop (my personal definition of high quality in a car). I expect you understand this better than I do, but this seems like a place to point out that (as I understand it) Toyota succeeded by driving out waste, reducing variation, enlisting their production workers in improving their systems, you know, all the Deming ideas that have become formalized with Six Sigma and lean processes. I don't think they tiered their workers into above and below average. I apologize again for editorializing but could not resist! The AHRQ report has a certain scope and this is perhaps beyond its borders. | No response necessary. |
Which measures are ready for use? | I am not sure I understand how to use table 11 to answer this question but I will give it a try. I am less clear on “what efficiency measures” are being evaluated for different purposes such as public reporting, payment etc. From the experience at IHA, we will be testing this year both episode based and population based measures for use in pay for performance. If data are complete and measures are valid and reliable, then IHA will plan to include the following efficiency measures in the P4P program: (Table “Measure Description” located at the end of comments) | No response necessary. |
Which measures are ready for use? | The assumption you make is that “academic models” are more appropriate, yet there has been little take-up by vendors/purchasers/plans. | No response necessary. |
Which measures are ready for use? | The report suggests that efficiency measures be evaluated using the same framework for evaluating quality measures. That is, efficiency measures should be evaluated based on importance, scientific soundness, and feasibility. Without information about the importance, scientific soundness, and feasibility of each measure identified in the report, it is difficult to determine which measures are ready for use. For example, the last column of Appendix F of the report (“Data on reliability, sensitivity analysis, validity reported?”) indicates that none of the measures published in peer-reviewed literature appears to have been thoroughly tested in terms of reliability, sensitivity analysis, and validity. There seems to be some data available for 1 or 2 of these elements but no measure has data reported on all 3 of these elements. In addition, for those measures in which data on reliability, validity, and/or sensitivity has been reported, this information was not provided in the report. Therefore, it is difficult to evaluate the scientific soundness of the identified measures. Another example is that the report does not specify whether the published measures are in the public domain. As stated in the report, most of the vendor-developed measures are proprietary and may impose cost barriers during implementation. This type of information would help evaluate the feasibility of implementing the measures. | No response necessary. |
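Several comments above turn on severity adjustment of PMPM costs: a raw per-member-per-month figure is comparable across plans or groups only after case mix is removed. Below is a minimal sketch of that adjustment, assuming each member already carries a concurrent risk score from some risk adjuster; all scores, costs, and plan names are invented.

```python
# Severity-adjusted PMPM comparison: illustrative sketch with invented
# data. Real risk adjusters produce the member-level scores; here the
# scores are simply assumed to exist.

members = [
    # (plan, months enrolled, total allowed cost, concurrent risk score)
    ("PlanX", 12, 4800.0, 2.0),
    ("PlanX", 6, 600.0, 0.5),
    ("PlanY", 12, 2400.0, 1.0),
    ("PlanY", 12, 1200.0, 0.5),
]

def adjusted_pmpm(members):
    """Return {plan: (raw PMPM, severity-adjusted PMPM)}.

    The adjusted figure divides raw PMPM by the member-month-weighted
    average risk score, so a plan with sicker enrollees is not
    penalized for its case mix.
    """
    totals = {}  # plan -> [cost, member_months, score * months]
    for plan, months, cost, score in members:
        t = totals.setdefault(plan, [0.0, 0.0, 0.0])
        t[0] += cost
        t[1] += months
        t[2] += score * months
    out = {}
    for plan, (cost, mm, score_mm) in totals.items():
        raw = cost / mm
        out[plan] = (raw, raw / (score_mm / mm))
    return out

for plan, (raw, adj) in adjusted_pmpm(members).items():
    print(f"{plan}: raw PMPM ${raw:,.2f}, adjusted PMPM ${adj:,.2f}")
```

In this invented example the two plans' raw PMPMs differ by a factor of two ($300 versus $150), yet both adjust to the same $200 once the weighted risk is divided out, which is the reviewers' point about unadjusted comparisons being uninterpretable.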
Are there published measures not included?
Section | Comments | Response |
---|---|---|
Are there published measures not included? | Not that I know of. | No response necessary. |
Are there published measures not included? | While there are other efficiency measures (e.g., ratios of dollar costs to mean dollars), these are from the grey literature, not peer-reviewed literature. Perhaps a longer discussion of these efficiency measures would be more useful. This would likely entail more site visits/conference calls with the major health plan and provider group users of these efficiency measures. In particular, I believe (but am not sure) that Kaiser Permanente may be using efficiency measures when rating its physicians in the large medical groups in N. CA, S. CA and the Pacific Northwest regions. | We judged that we captured the major non-peer-reviewed efficiency measures; while a search for other grey-literature efficiency measures might provide additional information, we do not judge them to be a high priority. |
Are there published measures not included? | There are no other published efficiency measures that I am aware of beyond those identified in the report. | No response necessary. |
Are there published measures not included? | Other published efficiency measures: I am not aware of any. | No response necessary. |
Are there published measures not included? | In terms of other published efficiency measures, I did not get a clear sense of this list, but rather noticed some examples of efficiency measures included in your report. For the episode based measures, the cost of care measures can be broken down into its most granular components (i.e. cost of care for a specific healthcare service for a specific episode), not sure how these would factor into the proposed typology. | No response necessary. |
Are there vendor developed measures not included?
Section | Comments | Response |
---|---|---|
Are there vendor developed measures not included? | A population-based vendor tool that was not discussed is DxCG. It originated as DCGs in the published literature. DxCG offers both concurrent (historical) and predictive models. The former are useful for profiling primary care physicians or comparing groups/networks based on PMPM costs, adjusted for the disease burden in the populations they care for (as reflected by their DxCG index). The principal researchers involved in developing DxCG are Arlene Ash and Randy Ellis. | We added DxCG. |
Are there vendor developed measures not included? | Yes -- one missing vendor is the Cave Consulting Group's "Marketbasket" efficiency measures. At least two large health insurers are making use of this system. | We added Cave Consulting Group's efficiency measures. |
Are there vendor developed measures not included? | There are no other major vendor-developed efficiency measures that I am aware of beyond those identified in the report. | No response necessary. |
Are there vendor developed measures not included? | Other vendors you may want to consider adding are the following: | CAVE and DxCG were added. |
Are there vendor developed measures not included? | You did capture the more important ones | No response necessary. |
Are there vendor developed measures not included? | Other major vendor-developed measures: HBOC-McKesson has a product they call Pattern Profiler. It matches physician procedure utilization and intensity against what given diagnoses would be expected to require. For example, a 99215 level office visit would not be appropriate for a diagnosis of pharyngitis. A visit for hypertension could be coded at 99214 instead of 99213, but only so many times a year. They also evaluate radiology and other physician procedures. They have developed norms from a large clinical knowledge database that they have been working on for decades. A flaw in the system is that it does not evaluate other inputs such as pharmacy. | We did not include HBOC-McKesson's product because of time and resource limitations. (A sketch of the screening logic the reviewer describes appears below.) |
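The description in the last row amounts to rule-based screening: for each diagnosis, an expected ceiling on visit intensity and on annual frequency. Pattern Profiler's actual norms are proprietary and are not reproduced here; the sketch below only illustrates the kind of check the reviewer describes, using the pharyngitis and hypertension examples from the comment with invented thresholds.

```python
# Rule-based utilization screening in the spirit the reviewer
# describes. The rule table is invented for illustration; the real
# product's norms come from a proprietary clinical knowledge base.

# diagnosis -> (highest appropriate E/M code, max such visits per year)
RULES = {
    "pharyngitis": ("99213", 2),   # so a 99215 visit would be flagged
    "hypertension": ("99214", 3),  # 99214 allowed, but only a few times a year
}

# Relative intensity ranking of office-visit E/M codes.
EM_LEVEL = {"99211": 1, "99212": 2, "99213": 3, "99214": 4, "99215": 5}

def screen(visits):
    """visits: list of (diagnosis, em_code) for one patient-year."""
    flags, counts = [], {}
    for dx, code in visits:
        max_code, max_per_year = RULES[dx]
        if EM_LEVEL[code] > EM_LEVEL[max_code]:
            flags.append(f"{code} exceeds expected intensity for {dx}")
        counts[(dx, code)] = counts.get((dx, code), 0) + 1
        if code == max_code and counts[(dx, code)] > max_per_year:
            flags.append(f"more than {max_per_year} {code} visits/year for {dx}")
    return flags

print(screen([("pharyngitis", "99215"),
              ("hypertension", "99214"), ("hypertension", "99214"),
              ("hypertension", "99214"), ("hypertension", "99214")]))
```

As the reviewer notes, a screen of this kind covers visit and procedure intensity but says nothing about other inputs such as pharmacy.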