Design Choices, Machine Learning, And The Cross-section Of Stock Returns

Figure: Cumulative performance of machine learning portfolios


Minghui Chen

Technische Universität München (TUM) - Department of Financial Management and Capital Markets

Matthias X. Hanauer

Technische Universität München (TUM); Robeco Asset Management

Tobias Kalsbach

Technische Universität München (TUM)

November 23, 2024

Abstract

We fit over one thousand machine learning models for predicting stock returns, systematically varying design choices across algorithm, target variable, feature selection, and training methodology. Our findings demonstrate that the non-standard error in portfolio returns arising from these design choices exceeds the standard error by 59%. Furthermore, we observe a substantial variation in model performance, with monthly mean top-minus-bottom returns ranging from 0.13% to 1.98%. These findings underscore the critical impact of design choices on machine learning predictions, and we offer recommendations for model design. Finally, we identify the conditions under which non-linear models outperform linear models.


1 Introduction

Machine learning (ML) models for predicting stock returns have gained substantial popularity in both academic studies and industry practice in recent years. For example, Freyberger et al. (2020), Gu et al. (2020), and Chen et al. (2024) predict stock returns for the United States, while Rasekhschaffe and Jones (2019) and Tobek and Hronec (2021) focus on developed markets, and Hanauer and Kalsbach (2023) examine emerging markets. Overall, these studies show that models allowing for non-linearities and interactions—such as neural networks and tree-based models—yield superior out-of-sample (OOS) returns compared to linear models.

However, machine learning in asset pricing is still a developing field, and existing studies vary considerably across several key design choices. For instance, when predicting future stock returns, Gu et al. (2020) and Freyberger et al. (2020) employ the excess return over the risk-free rate as the target variable, whereas Tobek and Hronec (2021) and Hanauer and Kalsbach (2023) use the abnormal return relative to the market. While these studies use a continuous target variable, Rasekhschaffe and Jones (2019) advocate predicting categories, such as distinguishing outperformers from underperformers. Additionally, Gu et al. (2020), Tobek and Hronec (2021), and Hanauer and Kalsbach (2023) implement an expanding window to train their models, while Freyberger et al. (2020) and Rasekhschaffe and Jones (2019) use rolling windows. This variety underscores the lack of common research standards in machine learning for stock return prediction and makes it challenging to compare performance across studies.
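
To make the training-window distinction concrete, the minimal Python sketch below illustrates how an expanding scheme differs from a rolling one. The 120-month rolling length is our own placeholder, not a setting taken from any of the cited studies.

```python
def train_months(all_months, t, scheme="expanding", window=120):
    """Months available for training when predicting month t.

    A stylized illustration of the two window schemes; the 120-month
    rolling length is an assumption, not the papers' actual setting.
    """
    history = [m for m in all_months if m < t]  # strictly before the prediction month
    if scheme == "expanding":
        return history                          # expanding: all past months
    return history[-window:]                    # rolling: only the most recent `window` months
```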

In this study, we analyze the variation in seven key design choices in machine learning studies and document the resulting differences. These design choices include (1) algorithm, (2) target variable, (3) target transformation, (4) post-publication treatment, (5) feature selection, (6) training window, and (7) training sample. To assess the importance of these choices, we examine all possible combinations, resulting in a total of 1,056 machine learning models. Each model is trained on a common set of features for the U.S. stock market, and we evaluate their out-of-sample performance using top-minus-bottom decile portfolios. In doing so, we document the economic relevance of design choices and provide recommendations for model design.
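
As a rough illustration of how such a design grid can be enumerated, the sketch below crosses hypothetical option lists with itertools.product. The option values shown are placeholders, not the paper's exact specification; it is the authors' actual grid that yields the 1,056 models.

```python
from itertools import product

# Placeholder option lists for the seven design choices; the paper's
# actual options (and their counts) differ and multiply out to 1,056.
design_space = {
    "algorithm":         ["ols", "elastic_net", "gbrt", "random_forest", "neural_net"],
    "target":            ["excess_return", "market_abnormal", "capm_adjusted"],
    "target_transform":  ["continuous", "categorical"],
    "post_publication":  ["keep", "drop"],
    "feature_selection": ["all", "preselected"],
    "training_window":   ["expanding", "rolling"],
    "training_sample":   ["full", "subsample"],
}

models = [dict(zip(design_space, combo)) for combo in product(*design_space.values())]
print(len(models))  # 5 * 3 * 2 * 2 * 2 * 2 * 2 = 480 with these placeholder options
```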

The main findings of our study can be summarized as follows: First, we document substantial variation in top-minus-bottom decile returns across different machine learning models. For example, monthly mean returns range from 0.13% to 1.98%, with corresponding annualized Sharpe ratios ranging from 0.08 to 1.82.

Second, we find that the variation in returns due to these design choices, i.e., the non-standard error, is approximately 1.59 times higher than the standard error from the statistical bootstrapping process. This magnitude is comparable to or even exceeds findings in related studies on non-standard errors, such as ratios of 1.60 in Menkveld et al. (2023), 1.06 in Soebhag et al. (2024), 1.10 in Walter et al. (2024), and 1.55 in Fieberg et al. (2024). Thus, design choices in financial machine learning are of substantial importance. Among these choices, we find that post-publication treatment, training window, target transformation, algorithm, and target variable are the most influential. Consequently, researchers need to be particularly cautious regarding the assumptions underlying their choices. By contrast, feature selection and training sample have little impact.
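
Conceptually, the comparison works as follows: the non-standard error is the dispersion of mean portfolio returns across the model variants, while the standard error comes from bootstrapping the time series for a given model. The sketch below illustrates this logic on placeholder data; it is our stylized reading of the procedure, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder months-by-models matrix of top-minus-bottom portfolio returns.
returns = rng.normal(0.01, 0.05, size=(600, 1056))

# Non-standard error: dispersion of mean returns across design choices.
mean_by_model = returns.mean(axis=0)
non_standard_error = mean_by_model.std(ddof=1)

# Standard error: bootstrap the time dimension for each model, then average.
n_boot = 1000
boot_means = np.empty((n_boot, returns.shape[1]))
for b in range(n_boot):
    idx = rng.integers(0, returns.shape[0], size=returns.shape[0])
    boot_means[b] = returns[idx].mean(axis=0)
standard_error = boot_means.std(axis=0, ddof=1).mean()

# The paper reports a ratio of roughly 1.59 on the actual portfolio
# returns; this synthetic data will not reproduce that number.
print(non_standard_error / standard_error)
```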

Third, we provide recommendations for design choices based on prediction goals and economic effects, and we identify conditions under which non-linear models outperform linear models. For example, the recommended target variable depends on the prediction goals. If the aim is to forecast higher relative raw returns, as is common in cross-sectional stock return studies, the abnormal return relative to the market is more suitable than the excess return over the risk-free rate. Conversely, if the goal is to achieve high market-risk-adjusted returns, CAPM beta-adjusted returns are preferable, as feature importance analysis shows that this target effectively captures the low-risk effect. Finally, non-linear ML models significantly outperform linear OLS models only when using abnormal returns relative to the market as the target variable, employing continuous target returns, or adopting expanding training windows.
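
For intuition, the sketch below constructs the three candidate target variables from generic return columns. The column names are our assumptions rather than the paper's variable definitions.

```python
import pandas as pd

def make_targets(df: pd.DataFrame) -> pd.DataFrame:
    """Construct the three candidate targets discussed above.

    Assumes columns: ret (stock return), rf (risk-free rate),
    mkt (market return), beta (rolling CAPM beta). These names are
    placeholders, not the paper's actual variable definitions.
    """
    out = df.copy()
    out["excess_ret"] = out["ret"] - out["rf"]    # excess over the risk-free rate
    out["abnormal_ret"] = out["ret"] - out["mkt"]  # abnormal return relative to the market
    out["capm_adj_ret"] = (
        out["ret"] - out["rf"] - out["beta"] * (out["mkt"] - out["rf"])
    )                                              # CAPM beta-adjusted return
    return out
```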

Our study makes three key contributions to the existing literature. First, we contribute to the body of research that employs machine learning models for stock return prediction (Rasekhschaffe and Jones, 2019; Freyberger et al., 2020; Gu et al., 2020; Tobek and Hronec, 2021; Azevedo et al., 2023; Cakici et al., 2023; Hanauer and Kalsbach, 2023; Howard, 2024). These studies typically apply specific design choices when tuning the machine learning models. However, these choices often vary across studies. In contrast, we assess the importance of seven key design choices by systematically analyzing all possible combinations. Consequently, we validate the replicability of the machine learning predictions and provide deeper insights into the performance variations resulting from these different choices. Our study is also related to Bali et al. (2024), who analyze the stock-level variation of 100 machine learning predictions as a predictive signal for the cross-section of stock returns. Unlike our study, they focus on the variation itself as a predictive signal. Furthermore, the 100 different return predictions are derived solely from random forest models that randomly select 76 features from a total set of 153 features, without varying other design choices.

Second, we contribute to studies that provide guidelines for finance research. For instance, Ince and Porter (2006) offer guidelines for handling international stock market data, Harvey et al. (2016) propose a higher hurdle for testing the significance of potential factors, and Hou et al. (2020) recommend methods for mitigating the impact of small stocks in portfolio sorts. By offering guidance on design choices for machine learning-based stock return predictions, we help reduce uncertainties in model design and enhance the interpretability of prediction results.

Finally, we contribute to the literature on non-standard errors in finance. Menkveld et al. (2023) introduce the concept of non-standard errors and apply it to six hypotheses using EuroStoxx 50 index futures data. Soebhag et al. (2024) and Walter et al. (2024) apply this concept to stock portfolio sorts and factor constructions, while Fieberg et al. (2024) focus on cryptocurrency portfolio sorts. These studies find that non-standard errors are often more prominent than standard errors. We confirm that design choices play a crucial role in machine learning-based return predictions, with non-standard errors of similar or even higher magnitude.

In an independent and contemporaneously written working paper, Lalwani et al. (2024) also investigate the role of methodological choices in the performance of machine learning strategies. However, in contrast to our study, they primarily focus on the effect of different sample filters, such as size, price, age, and industries, and differences in evaluation choices, such as the number of quantiles or portfolio weighting. More importantly, they do not investigate the effects of design choices related to target, target transformation, post-publication treatment, and feature preselection. Their reported ratio of 1.99 between the non-standard error and the standard error for gross value-weighted returns is slightly higher than ours.

Our study has important implications for machine learning research in finance. A deeper understanding of the critical design choices is essential for optimizing machine learning models, thereby enhancing their reliability and effectiveness in predicting stock returns. By addressing variations in research settings, our work helps researchers demonstrate the robustness of their findings and reduce non-standard errors in future studies. This, in turn, allows for more accurate and nuanced interpretations of results.

The remainder of this study is structured as follows: In Section 2, we describe our data, data sources, the seven identified research design choices, and portfolio performance measurements. In Section 3, we examine the impact of these seven choices and compare standard errors with non-standard errors. In Section 4, we provide a guide to these design choices. Section 5 summarizes our findings.

2 Data and methodology

2.1 Data

Our analysis is based on U.S. common shares (codes 10 and 11) that are listed on NYSE, NYSE MKT (formerly AMEX), or NASDAQ. We retrieve monthly market data from the Center for Research in Security Prices (CRSP) and merge it with return predictors from the Open Source Asset Pricing (OSAP, March 2022 version) library of Chen and Zimmermann (2022). We select the 163 clear and 44 likely predictors as the lagged features in the machine learning algorithms. Our sample period ranges from January 1957 to December 2021.
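
A hedged sketch of how such a sample could be assembled with pandas is shown below; the file names, column names, and merge keys are our assumptions, not the paper's actual data pipeline.

```python
import pandas as pd

# File names, column names, and keys below are assumptions. OSAP signals
# are assumed to be lagged already, so month-t features only use
# information available before month t.
crsp = pd.read_parquet("crsp_monthly.parquet")          # permno, date, ret, mktcap, shrcd, exchcd
osap = pd.read_parquet("osap_signals_2022_03.parquet")  # permno, date, plus 207 signal columns

sample = (
    crsp.loc[crsp["shrcd"].isin([10, 11])]              # common shares only
        .loc[lambda d: d["exchcd"].isin([1, 2, 3])]     # NYSE, NYSE MKT (AMEX), NASDAQ
        .merge(osap, on=["permno", "date"], how="inner")
        .loc[lambda d: d["date"].between("1957-01-01", "2021-12-31")]
)
```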

To mitigate the impact of small and illiquid stocks, we exclude microcaps, which are defined as stocks with a market capitalization below the 20th percentile of NYSE market capitalization (cf. Hou et al., 2020). After filtering, we have 1,632,495 monthly observations and a monthly average of 2,093 stocks.
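
The sketch below illustrates one way to implement such a monthly NYSE-breakpoint filter in pandas; the column names are placeholders.

```python
import pandas as pd

def drop_microcaps(df: pd.DataFrame) -> pd.DataFrame:
    """Exclude stocks below the monthly 20th percentile of NYSE market cap.

    Assumes columns: date, mktcap, exchcd (CRSP exchange code, 1 = NYSE).
    Column names are placeholders, not the paper's actual implementation.
    """
    # Monthly 20th-percentile breakpoint computed from NYSE stocks only.
    nyse_p20 = (
        df.loc[df["exchcd"] == 1]
          .groupby("date")["mktcap"]
          .quantile(0.20)
          .rename("nyse_p20")
    )
    df = df.join(nyse_p20, on="date")
    # Keep stocks at or above the breakpoint, regardless of their exchange.
    return df.loc[df["mktcap"] >= df["nyse_p20"]].drop(columns="nyse_p20")
```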

Finally, we follow Gu et al. (2020), Avramov et al. (2023), Leippold et al. (2022), Hanauer and Kalsbach (2023), and Howard (2024) in preprocessing the features. More specifically, we cross-sectionally rank the stocks each month by each characteristic into the [-1, 1] interval. In case of missing feature values, we set the value to 0.
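
The sketch below shows one way to implement this ranking step; the exact mapping of percentile ranks into [-1, 1] may differ slightly from the authors' implementation.

```python
import pandas as pd

def rank_features(df: pd.DataFrame, features: list[str]) -> pd.DataFrame:
    """Cross-sectionally map each characteristic into [-1, 1] per month
    and set missing values to 0, following the preprocessing described above."""
    out = df.copy()
    for col in features:
        # rank(pct=True) maps values to (0, 1]; rescale to (-1, 1].
        # NaNs stay NaN here and are set to 0 afterwards.
        ranks = out.groupby("date")[col].rank(pct=True)
        out[col] = 2.0 * ranks - 1.0
    return out.fillna({col: 0.0 for col in features})
```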

2.2 Research design choices

When predicting stock returns using machine learning algorithms, researchers and practitioners face a number of important methodological choices. We identify such variations in design choices in several published machine learning studies, all of which predict the cross-section of stock returns. More specifically, these studies include Gu et al. (2020), Freyberger et al. (2020), Avramov et al. (2023), and Howard (2024) for the U.S. market, Rasekhschaffe and Jones (2019) and Tobek and Hronec (2021) for global developed markets, Hanauer and Kalsbach (2023) for emerging markets, and Leippold et al. (2022) for the Chinese market. In total, we identify variations in seven common research design choices across these studies and categorize them into four main types relating to the algorithm, target, features, and training process. Table 1 summarizes the specific design choices of these studies.

Table 1: Design choices in published machine learning studies

Note that we focus on design choices regarding the setup of the model training and do not consider differences with respect to the sample period, the set of features, or the portfolio construction to evaluate the predictions. Figure 1 illustrates the seven research design choices. In the remainder of this section, we explain each choice in more detail.

Figure 1: Research design choices

See the full paper here.


The post above was drafted collaboratively by the Hedge Fund Alpha Team.