Mayer, Thomas, “Data mining: a reconsideration,” Journal of Economic Methodology 7:2 (2000), 183-194.
- Data Mining
- In the good sense, “data mining” means fitting multiple econometric specifications (in the simple case, multiple OLS regressions) to the data. This is both reasonable and scientific.
- In the bad sense, many economists implicitly equate data mining with running many regressions and then reporting only the one(s) that “work.”
- It is important to report any results that contradict the hypotheses, even if those results seem very implausible.
- Unbiased data mining means leaving results unreported only for one of the following reasons:
- The results fail statistical diagnostic tests
- Their statistical test results are inferior to those of the reported results
- They support the reported results
- They are obviously wrong (such as a significantly negative coefficient that collective experience says should be positive)
- The only case where biased data mining (purposefully omitting all contrary results) is acceptable is when the author is trying to show only that a hypothesis might be correct. In this case, the author’s intent should be clearly stated.
- Contrary evidence can usually be found against any hypothesis or theory, so sometimes all we can do is show that we might be right.
- Even unbiased data mining may be unacceptable if the researcher chooses diagnostic tests and/or significance cutoff levels with which his readers may not agree.
- One possibility is for researchers to simply report more specifications, and for readers and referees to require them.
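
As an illustration of reporting more specifications, here is a minimal Python sketch that fits one OLS regression for every subset of a set of control variables and prints the coefficient of interest from each fit, rather than only the one that "works." It is not taken from Mayer's article: the simulated data, the variable names (`x_main`, `z1` to `z3`) and the helper functions `ols` and `all_specifications` are hypothetical and chosen only for the example, and the standard errors are the conventional homoskedastic ones.

```python
import itertools
import numpy as np


def ols(y, X):
    """Return OLS coefficients and conventional (homoskedastic) standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]          # residual degrees of freedom
    sigma2 = resid @ resid / dof           # estimated error variance
    cov = sigma2 * np.linalg.inv(X.T @ X)  # covariance matrix of the estimates
    return beta, np.sqrt(np.diag(cov))


def all_specifications(y, x_main, controls):
    """Fit y on a constant, x_main and every subset of the controls.

    Returns one row per specification, so nothing is silently dropped.
    """
    n = len(y)
    names = list(controls)
    rows = []
    for k in range(len(names) + 1):
        for subset in itertools.combinations(names, k):
            cols = [np.ones(n), x_main] + [controls[c] for c in subset]
            X = np.column_stack(cols)
            beta, se = ols(y, X)
            # Index 1 is the coefficient on x_main (index 0 is the constant).
            rows.append({"controls": subset, "coef": beta[1], "t": beta[1] / se[1]})
    return rows


# Simulated data, purely for illustration (assumed, not from the article).
rng = np.random.default_rng(0)
n = 200
controls = {name: rng.normal(size=n) for name in ("z1", "z2", "z3")}
x_main = rng.normal(size=n) + 0.5 * controls["z1"]
y = 1.0 + 0.3 * x_main + 0.8 * controls["z1"] + rng.normal(size=n)

results = all_specifications(y, x_main, controls)
for row in results:
    label = " + ".join(row["controls"]) or "(none)"
    print(f"controls: {label:<14} coef = {row['coef']:+.3f}  t = {row['t']:+.2f}")
```

With three candidate controls this prints eight rows, one per specification. A reader who disagrees with the author's preferred specification or significance cutoff can then judge how sensitive the coefficient is to the choice.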