The estimable Larry Swedroe and our friends at Alpha Architect have saved me a lot of work by wonderfully summarizing my colleagues’ 1 new paper and the literature it addresses. Thus, I’ll keep this mercifully brief.
Data mining is the idea that backtested results are generally overstated because researchers try many different things and don’t report the failures but, rather, have a large bias toward reporting the successes. If that is the case, then traditional inference (“hey, it’s a t-statistic of two!”) is not valid. 2 It has always been a concern, and researchers have always taken it seriously. I can tell you that we have been obsessed with this issue since before the New York Rangers won their first Stanley Cup in 54 years (and now, sadly, a new 27-year drought is well underway), before anyone knew who Jerry Seinfeld was (unless they watched Johnny), and, almost as important, since there was a nasty wall in Berlin. 3 Whether one is just trying to understand markets or actually implementing these strategies for clients and/or oneself, the concern that highly touted results may be exaggerated is an obvious worry!
We have always, as almost anyone who’s ever had to sit through a presentation from us can confirm, addressed this in two ways: out-of-sample evidence and theory/story. A factor found to have a positive realized return in one place over a limited time period, 4 no matter how good the return, is always suspect. How will it hold up in other geographies, time periods, or even asset classes? As the great economist Henny Youngman often said, take Fama and French’s original work on the value factor. 5 They did this mainly using one value factor (price-to-book) in one country (the USA) in one asset class (within equities) over a period that now seems way too short (1963 through the late 1980s). 6 I know I’m old, but we now have more out-of-sample data than they had in-sample data. 7 Since then, researchers have tested it in other countries, using other valuation measures, for selecting equity exposure across countries, and in multiple other asset classes. Researchers have even extended the original results back to 1929 or even further. And, of course, there’s that little matter of the factors holding up since the late 1980s too. 8
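For the mechanically inclined, here is a minimal sketch of the kind of long-short sort underlying these tests. Everything in it is hypothetical (the panel, the column names, the 30/70 breakpoints); it shows the shape of the exercise, not Fama and French’s (or anyone’s) exact methodology. The point is that the same few lines can be re-run on other countries, periods, and valuation measures, which is exactly what the out-of-sample literature did.

```python
import pandas as pd

def value_factor_returns(df: pd.DataFrame) -> pd.Series:
    """Monthly returns of an illustrative long-short value factor.

    `df` is a hypothetical panel with one row per (date, stock) and columns
    'date', 'price_to_book', and 'next_month_return'. Cheap stocks (low
    price-to-book) are held long, expensive ones short, equal-weighted.
    """
    def one_month(month: pd.DataFrame) -> float:
        cheap = month['price_to_book'] <= month['price_to_book'].quantile(0.3)
        expensive = month['price_to_book'] >= month['price_to_book'].quantile(0.7)
        return (month.loc[cheap, 'next_month_return'].mean()
                - month.loc[expensive, 'next_month_return'].mean())

    # Re-running this on another country, period, or valuation measure is
    # the essence of an out-of-sample test.
    return df.groupby('date').apply(one_month)
```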
Taken as a whole, the results have been an outstanding confirmation of the initial findings for the value factor (and I will be discussing other factors soon). I say this as one very conscious of (nay, obsessed with) value’s recent difficulties (and as one with the temerity to claim that these excruciating results, until a recent partial resurgence, don’t really change our long-term estimate of the value factor’s expected return). And, while you need to account for valuation changes and for value’s negative correlation with momentum to see the strength of the simple price-to-book value factor in the USA since 1990 (BTW, adjusting for valuation changes and examining value in combination with momentum are legit!), the totality of the out-of-sample evidence is way stronger than just that. 9
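To see why that negative correlation with momentum matters, here is a toy bit of arithmetic. The Sharpe ratios and the -0.5 correlation are made-up illustrative numbers, not estimates of anything we run: two modest strategies that tend to zig when the other zags combine into something notably better than either alone.

```python
import numpy as np

def blend_sharpe(sr_value: float, sr_momentum: float, corr: float) -> float:
    """Sharpe ratio of a 50/50 blend of two unit-volatility return streams."""
    blend_mean = 0.5 * sr_value + 0.5 * sr_momentum      # at unit vol, mean equals Sharpe
    blend_vol = np.sqrt(0.25 + 0.25 + 2 * 0.25 * corr)   # volatility of the 50/50 mix
    return blend_mean / blend_vol

# Made-up numbers: two so-so standalone Sharpes of 0.3 with a -0.5 correlation
# combine into 0.6; the diversification is the point, not the inputs.
print(blend_sharpe(0.3, 0.3, -0.5))  # -> 0.6
```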
I’ve only discussed value, but I could write a very similar paragraph for momentum (itself the subject of some pretty cool super-long-term backtests). Other stalwarts of the quant factor pantheon, like quality and low-risk investing, are similar, if not quite as exhaustive, success stories. 10
In addition to out-of-sample evidence, we also require 11 some understanding of why the factor should work to begin with. These don’t have to be provable single stories. For instance, both a believer in efficient markets and one who thinks markets are very warped by behavioral finance can have good (but separate) reasons to believe in the value factor. But there has to be some sense to it. Famous examples used to illustrate the perils of data mining include timing the market using butter production in Bangladesh or the winner of the Super Bowl. These are edifying and make us laugh. But it’s harder to judge when the researcher is dealing with less obviously ridiculous 12 measures. If your new factor is made up of measures from Compustat, CRSP, and FRED, it will likely never be as obviously ex ante ridiculous as these cautionary tales. All I can say here is that we try very hard to have high standards!
Data mining isn’t the only criticism faced by the factor literature. A more basic problem is that a backtest might never have been right to begin with. A “replication” crisis is most specifically about this – not being able to even replicate the original work. 13 At AQR, we usually try to replicate academic research to see how it holds up, and, similarly, when an internal researcher finds a new result, we have someone else independently replicate the work to ensure that it is valid.
Next, a backtest could have been pristine (i.e., correctly implemented and not exaggerated by repeated iterations) and had a great, and in fact true, story behind it, but cease to work going forward as more and more people learn about the pattern and invest in the factor, perhaps because of this self-same pristine research. Indeed, enough investment based on the factor can arbitrage away its efficacy. A strategy can actually still work going forward even when “everyone” knows about it, but only up to a point (enough capital can drive away any edge). 14 It’s very fair to worry that a strategy might be arbitraged away once revealed and over-invested in. 15 It’s a whole other thing to just assume it must be the case.
So, none of these are brand-new issues, but there is a relatively new and growing literature on them. There have been numerous papers over the last few years (again, see Larry’s recent summary) examining the “factor zoo” (a term coined by the great John Cochrane). “Zoo” isn’t generally used as a compliment, and it is not so used here. While specific papers of course vary, the general consensus has been that factors have been disappointing since their “discovery” (whether measured by out-of-sample returns as a fraction of backtested returns, by the fraction of factors that hold up out-of-sample, or by the really terrible ones that didn’t even replicate in-sample). Apparently the Zoo was indeed too big, data mining too prevalent, and we’ve all learned a well-deserved lesson. Time to move on.
Not so fast! These papers, most of them well done, have often stirred up financial media reaction that is, well, less well done. Stories like “investors only get 50-60% of a backtest going forward because it’s been r-b-trajed!!!”, with a tone implying that factor investors are ivory tower idiots who can’t cope when confronted with how the real world works beyond their backtests, are not uncommon. 16, 17 As I will soon discuss, we think these same 50-60% type results are a cause for celebration, and we have thought so for nearly thirty years.
Since data mining has been such a concern of ours for decades, we take this burgeoning literature quite seriously. Some of our responses have been to note that we don’t believe in hundreds of factors; we believe in a handful of factor types, and not even in every factor beloved by the literature. If you measure value (again, using it as the example) 100 different ways and average them, you are not testing 100 independent factors, as is, unfortunately, sometimes asserted. By averaging them you are not cherry-picking the best way but creating, in our view, a more robust factor (and thus this should be viewed as more positive evidence that value is not data-mined). 18 So we’ve always thought the factor zoo is smaller than many others believe, and we note that we only believe in the parts of it that clear our many hurdles. We have never felt the need or desire to defend every paper ever written on factor investing. 19
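As a toy illustration of that point (the measures and data are hypothetical, and this is not our production process), averaging many correlated versions of the same idea yields one composite signal, not a hundred separate bets:

```python
import pandas as pd

def composite_value(signals: pd.DataFrame) -> pd.Series:
    """Combine many valuation measures into one composite value signal.

    `signals` is a hypothetical table with one row per stock and one column
    per measure (say book/price, earnings/price, sales/price, ...). Each
    column is z-scored across stocks and then averaged, so the output is a
    single, more robust value signal rather than 100 separate factors.
    """
    zscores = (signals - signals.mean()) / signals.std()
    return zscores.mean(axis=1)
```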
Finally, at its simplest, we have never, not for a second, believed that one should expect to get results going forward that are as good as a backtest. We may fight hard, and on net successfully, to control data mining, but that doesn’t mean we can eliminate it. Some always creeps into even the best-intentioned backtests. As rough (very rough) justice, we’ve always started from an assumption that you’d get half the backtested results going forward (of course, specific situations might call for different estimates 20 ). Thus, oddly, when papers come out saying that if you had invested in all the published factors right after they were known you’d only get, like, half the backtest going forward, others see a problem while we generally pump our fists in a kind of obnoxious “yes!!!”-like motion.
So, what’s left to do? I’ve handled the “it’s all data mining or arbitraged away” critics just fine without my sagacious colleagues, no? Well, you might have noticed that my counter-arguments to critics of factor investing are piecemeal. They aren’t a formal test showing that we’re likely right; they are just a series of guideposts that give us great confidence. Well, along come Jensen, Kelly, and Pedersen 21 to test, and test brilliantly, what we have argued, largely anecdotally, for many years. 22 They build a framework (warning: their paper is a lot more “mathy” than mine are these days, oh to be a young geek again) that is painstaking about broad factor inclusion and replication, that accounts for the correlations among factors (again, there ain’t really hundreds of factors), and, most importantly IMHO, that doesn’t start from the, frankly, silly notion that “it’s a failure if we don’t get 100% of a backtest out-of-sample.” 23 Their Bayesian framework starts with the simple, intuitive prior that the factors have zero expected return. The in-sample results then push our estimate up from that starting point of zero, but not one-for-one. If you started out fairly confident that something was likely zero, and then observed a bunch of data saying it was positive, you would be foolish to conclude “oh, now it’s as good as what I just observed.” Your prior counts too. The idea is simple and intuitive. It’s a needed and brilliant formalization of something we, and we believe many other factor investors, have always done. Simply put, we do not assume you should do as well as a backtest, and this new paper formalizes and tests this notion.
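For the curious, the shrinkage intuition can be sketched in a few lines. This is only the simplest conjugate normal-normal version of the idea with made-up numbers, not the paper’s actual hierarchical model: start with a prior centered at zero, observe a backtest, and the posterior lands somewhere in between.

```python
def shrunk_alpha(sample_mean: float, sample_se: float, prior_sd: float) -> float:
    """Posterior mean of a factor's return under a zero-mean normal prior.

    sample_mean : the backtested (in-sample) average return
    sample_se   : its standard error
    prior_sd    : how far from zero we think true factor returns plausibly range

    With a N(0, prior_sd^2) prior and a N(sample_mean, sample_se^2) likelihood,
    the posterior mean shrinks the backtest toward zero, but not all the way.
    """
    weight = prior_sd**2 / (prior_sd**2 + sample_se**2)
    return weight * sample_mean

# Made-up numbers: a 4% backtested annual return, a 2% standard error, and a
# prior that true factor returns are probably within a couple percent of zero.
print(shrunk_alpha(0.04, 0.02, 0.02))  # -> 0.02, i.e., roughly half the backtest
```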
Their results are rather startlingly (even to me) positive for the field in general. I’ll leave you in Larry’s hands for, again, a very well-done summary and just quote from the main paper’s abstract here:
“The majority of asset pricing factors: (1) can be replicated, (2) can be clustered into 13 themes, the majority of which are significant parts of the tangency portfolio, (3) work out of sample in a new large data set covering 93 countries, and (4) have evidence that is strengthened (not weakened) by the large number of observed factors.”
In plainer words, done properly (using a consistent methodology, accounting for the simultaneous testing of many factors, comparing to a fair yardstick rather than 100% of a backtest, studying the broadest set of factors in the most places, using global data, and even making their code and data available online), the study of ex post “after the research paper came out and told everybody the good news” factor results is extremely supportive of factor investing. 24
I think this is one of the most important papers in our field for a long time. I am, of course, incredibly biased, both by my own interests and the esteem in which I hold my colleagues (ex-Theis of course; I don’t trust that guy as far as I could throw a Danish sumo wrestler). 25 So I do encourage you to read the actual paper and decide for yourself if I’m right. But I am. 26
Oh, and I’m not sure this blog qualifies as brief, so I may have lied in the beginning when I promised you that. But, it would’ve been a lot less brief without Larry’s excellent summary, so a big thank you to him!
Article by Cliff Asness, AQR

