Do fault prediction models that guide testing and other efforts to improve software reliability lead to finding different or additional faults in the next release, do they improve the process of finding the same faults that would occur if the models were not used, or do they have no impact at all? In this challenge paper, we describe the difficulties involved in estimating the effects of this sort of intervention, and we discuss ways to answer that question empirically and to assess any changes that do occur. We present several experimental design options and discuss the pros and cons of each.