Dangers of data mining: The case of calendar effects in stock returns

Sullivan, Ryan, Allan Timmermann, and Halbert White, “Dangers of data mining: The case of calendar effects in stock returns,” Journal of Econometrics 105 (2001), 249-286.

Using the same set of data to both formulate and test a hypothesis introduces data-mining biases.  Calendar effects in stock returns are a prime example of findings driven by the data rather than by theory.  Evaluated properly, however, these calendar effects are not statistically significant.

Researchers have documented day-of-the-week effects, week-of-the-month effects, month-of-the-year effects, and effects for the turn of the month, the turn of the year, and holidays, none of which was predicted ex ante by theory.  When enough such rules are tested on the same set of U.S. publicly traded common stock returns, some of them are bound to outperform a benchmark by pure statistical chance, no matter which criterion is used to compare performance.
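To get a feel for the scale of the problem, a back-of-the-envelope calculation (the 5% level, the independence of the tests, and the count of 244 rules from the reduced universe below are assumptions for illustration; in reality the rules are highly correlated, which is why the paper needs White's bootstrap rather than this formula):

\[
P(\text{at least one spurious rejection}) \;=\; 1 - (1-\alpha)^{N} \;=\; 1 - (0.95)^{244} \;\approx\; 0.999996
\]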

This paper uses 100 years of data to examine a “full universe” of 9453 calendar-based investment rules and a “reduced universe” of 244 rules.  Each investment strategy is evaluated jointly with the many similar strategies that were searched over.  The authors report a nominal p-value and White’s reality-check p-value for each null hypothesis of no effect; White’s p-value adjusts for the data-mining bias.

Conclusions:  Nominal p-values are highly significant for many strategies, but White’s reality check p-values are not significant for any calendar-based strategy.
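A minimal sketch of the reality-check idea, not the paper's implementation (the performance measure, the crude circular block bootstrap, and all parameter choices below are simplifying assumptions; the paper uses the stationary bootstrap of Politis and Romano): resample the joint performance of all rules, record the best rule's performance in each resample, and compare the observed best against that distribution.

```python
import numpy as np

def reality_check_pvalue(perf, n_boot=1000, block=10, seed=0):
    """Simplified version of White's reality check.

    perf : (T, K) array of period-by-period performance (e.g. excess
           returns over a benchmark) for K candidate calendar rules.
    Returns a p-value for H0: the best rule has no true outperformance,
    adjusted for the search across all K rules.
    """
    rng = np.random.default_rng(seed)
    T, K = perf.shape
    mean_perf = perf.mean(axis=0)
    observed_best = np.sqrt(T) * mean_perf.max()

    best_boot = np.empty(n_boot)
    for b in range(n_boot):
        # crude circular block bootstrap of the time index
        starts = rng.integers(0, T, size=T // block + 1)
        idx = np.concatenate(
            [np.arange(s, s + block) % T for s in starts])[:T]
        boot_mean = perf[idx].mean(axis=0)
        # centre each rule at its sample mean, then take the max over rules
        best_boot[b] = np.sqrt(T) * (boot_mean - mean_perf).max()

    return float((best_boot >= observed_best).mean())
```

The key step is taking the maximum over all K rules inside the bootstrap loop; a nominal p-value would bootstrap each rule in isolation and therefore ignore the search.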

The Loss Function Has Been Mislaid

McCloskey, Donald N., “The Loss Function Has Been Mislaid,” The American Economic Review Vol 75, No 2 (1985), 201-205.

  • Statistical significance is not the same as economic significance (most authors, most professors, and their audiences confuse the two).
  • Significance tells us how precisely a coefficient is estimated, i.e., how likely it is that the observed estimate lies close to its true value.
  • It is the scientist who must decide whether that estimated value is large enough to have intuitive/economic meaning (a made-up numerical illustration follows).
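A numerical illustration of the distinction (all numbers are invented): a precisely estimated coefficient can be statistically significant at any conventional level and still be too small to matter.

\[
\hat{\beta} = 0.0002, \qquad \mathrm{s.e.}(\hat{\beta}) = 0.00002
\quad\Longrightarrow\quad t = \frac{0.0002}{0.00002} = 10,
\]

yet if the regressor varies over a range of only about 10 units in the data, the implied change in the outcome is at most \(10 \times 0.0002 = 0.002\), which may be economically negligible.  Whether it is negligible is a judgment about the loss function, not a statistical question.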

Data mining: a reconsideration

Mayer, Thomas, “Data mining:  a reconsideration,” Journal of Economic Methodology 7:2 (2000), 183-194.

  • Data Mining
    • In the good sense, “data mining” means fitting multiple econometric specifications (in the simple case, multiple OLS regressions) to the data.  This is both reasonable and scientific.
    • In the bad sense, many economists implicitly equate data mining with running many regressions and then only reporting the one(s) that “work.”
    • It is important to report any results that are contrary to the hypotheses, even if they seem very unlikely.
  • Unbiased data mining means omitting results only for the following reasons:
    • The results fail statistical diagnostic tests
    • Their statistical test results are inferior to those of the reported results
    • They support the reported results
    • They are obviously wrong (such as a significantly negative coefficient that collective experience says should be positive)
  • The only case where biased data mining (purposefully omitting all contrary results) is acceptable is when the author is trying to show only that a hypothesis might be correct.  In this case, the author’s intent should be clearly stated.
    • Contrary evidence can usually be found to all hypotheses and theories, so sometimes all we can do is show that we might be right.
  • Even unbiased data mining may be unacceptable if the researcher chooses diagnostic tests and/or significance cutoff levels with which his readers may not agree.
  • One possibility is for researchers to simply report more specifications, and for readers and referees to require them.
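A minimal sketch of what reporting every specification could look like (statsmodels, pandas-style inputs, and the variable names are assumptions, not anything from the paper): fit the focus regression with every subset of the candidate controls and tabulate all of the estimates, including the ones that "don't work."

```python
import itertools
import statsmodels.api as sm

def report_all_specs(y, X, focus, controls):
    """OLS of y on the focus regressor plus every subset of `controls`.

    y        : outcome series
    X        : pandas DataFrame containing `focus` and all `controls`
    Reports the focus coefficient from every specification, not just
    the one(s) that "work".
    """
    rows = []
    for k in range(len(controls) + 1):
        for subset in itertools.combinations(controls, k):
            cols = [focus] + list(subset)
            res = sm.OLS(y, sm.add_constant(X[cols])).fit()
            rows.append((cols, res.params[focus], res.bse[focus]))
    for cols, b, se in rows:
        print(f"{', '.join(cols)}: beta_{focus} = {b:.3f} (se = {se:.3f})")
    return rows
```

Readers and referees could then see at a glance how much the headline estimate depends on the choice of specification.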

Let’s Take the Con Out of Econometrics

Leamer, Edward E., “Let’s Take the Con Out of Econometrics,” The American Economic Review, Vol 73, No 1 (1983),  31-43.

  • statistical inference is not a precise laboratory-style science (parable of farmers, birds, and shade)
    • econometricians can interpret data, but cannot usually perform controlled experiments
    • even with randomly selected samples, the bias of the estimators may be presumed small, but it cannot safely be assumed to be zero
    • the sampling uncertainty falls as the sample size increases
    • the uncertainty surrounding model misspecification does not fall with increased sample size, and cannot be inferred from the data
      • One way to decrease this uncertainty might be to collect data from two separate [non]experiments whose biases are independently distributed.  The combined estimate then has a bias equal to the average of the two individual biases, and only half the misspecification uncertainty.
  • Only a model with infinitely many variables and infinite data would be beyond all scrutiny
    • For any data set, there are infinitely many polynomial equations that fit the data points equally well.
    • For any experiment or nonexperiment, infinitely many variables could plausibly affect the observed outcome (creating severe degrees-of-freedom problems).
    • For a model with unlimited parameters, a finite data set can suggest infinitely many sets of parameter values, each fitting the data to a different degree and appearing more or less believable.
  • Prior assumptions are the key
    • All inferences rely on assumptions formed before looking at the data.
    • It is best to use assumptions that are generally accepted, that are convenient, and that generate the same results as the other assumptions in their (broad) class.
  • The Horizon Problem
    • Starting with a model and then adjusting the horizon until the model fits is a problem.
    • Starting with the data and then inferring a model is a problem, because it is impossible to tell whether the data validate the data-inspired model.
    • Start with a model, determine beforehand the horizon that will be sufficient to validate it, and limit (but do not rule out) adding variables ex post.
  • Conclusions
    • Accept that all inferences rely on assumptions about which variables to include, how to collect the data, etc.
    • Make those assumptions beforehand, and then show that the results are not very sensitive to them (a sketch of such a sensitivity report follows).
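A minimal sketch of that kind of sensitivity reporting, in the spirit of Leamer's later "extreme bounds" analysis (the assumption sets, variable names, and use of statsmodels are placeholders): choose the focus coefficient and the alternative assumption sets before looking at the data, then report the range of estimates rather than a single preferred one.

```python
import statsmodels.api as sm

def sensitivity_report(y, X, focus, assumption_sets):
    """Estimate the focus coefficient under each pre-specified assumption set.

    y               : outcome series
    X               : pandas DataFrame of candidate regressors
    assumption_sets : dict mapping a label to a list of control-variable
                      names, written down before looking at the data.
    """
    estimates = {}
    for label, controls in assumption_sets.items():
        res = sm.OLS(y, sm.add_constant(X[[focus] + controls])).fit()
        estimates[label] = res.params[focus]
    lo, hi = min(estimates.values()), max(estimates.values())
    print(f"{focus}: estimates range from {lo:.3f} to {hi:.3f} "
          f"across {len(estimates)} assumption sets")
    return estimates

# Hypothetical usage:
# sensitivity_report(y, X, "education",
#     {"minimal": [], "demographics": ["age", "female"],
#      "full": ["age", "female", "parental_income"]})
```

If the reported range is narrow, the inference survives the change of assumptions; if it is wide, the data alone do not settle the question.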