Data mining: a reconsideration

Mayer, Thomas, “Data mining:  a reconsideration,” Journal of Economic Methodology 7:2 (2000), 183-194.

  • Data Mining
    • In the good sense, “data mining” means fitting multiple econometric specifications (in the simple case, multiple OLS regressions) to the data.  This is both reasonable and scientific.
    • In the bad sense, which many economists implicitly assume, “data mining” means running many regressions and then reporting only the one(s) that “work.”
    • It is important to report any results that are contrary to the hypotheses, even if they seem very unlikely.
  • Unbiased data mining means omitting results only for the following reasons:
    • The results fail statistical diagnostic tests
    • Their statistical test results are inferior to those of the reported results
    • They support the reported results, so omitting them hides no contrary evidence
    • They are obviously wrong (such as a significantly negative coefficient that collective experience says should be positive)
  • The only case where biased data mining (purposefully omitting all contrary results) is acceptable is when the author is trying to show that a hypothesis might be correct.  In this case, the author’s intent should be clearly stated.
    • Contrary evidence can usually be found to all hypotheses and theories, so sometimes all we can do is show that we might be right.
  • Even unbiased data mining may be unacceptable if the researcher chooses diagnostic tests and/or significance cutoff levels with which his readers may not agree.
  • One possibility is for researchers to simply report more specifications, and for readers and referees to require them.
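
As a rough illustration of "reporting more specifications," the sketch below fits one OLS regression for every subset of control variables and prints the coefficient of interest from each, rather than only the specification that "works." It is not taken from Mayer's paper: the data are synthetic, and the names `fit_ols`, `all_specifications`, `x`, `z1`, and `z2` are hypothetical choices made for this example.

```python
from itertools import combinations

import numpy as np


def fit_ols(X, y):
    """Return OLS coefficients and conventional (homoskedastic) standard errors.

    X must already contain a constant column.
    """
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]           # degrees of freedom
    sigma2 = resid @ resid / dof            # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)   # coefficient covariance matrix
    return beta, np.sqrt(np.diag(cov))


def all_specifications(y, focus, controls):
    """Regress y on the focus variable plus every subset of the controls.

    Returns one dict per specification, so nothing is silently dropped.
    """
    n = len(y)
    names = sorted(controls)
    rows = []
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            columns = [np.ones(n), focus] + [controls[c] for c in subset]
            beta, se = fit_ols(np.column_stack(columns), y)
            # Index 1 is the focus variable (index 0 is the constant).
            rows.append({"controls": subset, "coef": beta[1],
                         "se": se[1], "t": beta[1] / se[1]})
    return rows


# Synthetic data (an assumption for illustration only): z1 is correlated
# with x and affects y, so leaving it out changes the estimate on x.
rng = np.random.default_rng(0)
n = 200
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
x = 0.5 * z1 + rng.normal(size=n)
y = 1.0 + 0.3 * x + 0.8 * z1 + rng.normal(size=n)

results = all_specifications(y, x, {"z1": z1, "z2": z2})
for r in results:
    label = ", ".join(r["controls"]) or "(none)"
    print(f"controls: {label:<8} coef={r['coef']:.3f} "
          f"se={r['se']:.3f} t={r['t']:.2f}")
```

Printing every row lets a reader who prefers a different diagnostic test or significance cutoff apply it to the full set of specifications instead of relying on the author's selection.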