Value (book-to-market), out of sample, 2016 to 2026

The claim

Fama and French’s 1992 paper is arguably the single most cited empirical asset pricing paper ever written. It documents that size and book-to-market equity jointly capture the cross-sectional variation in average stock returns that had previously been attributed to a grab bag of other variables. The abstract, first two sentences verbatim:

“Two easily measured variables, size and book-to-market equity, combine to capture the cross-sectional variation in average stock returns associated with market beta, size, leverage, book-to-market equity, and earnings-price ratios. Moreover, when the tests allow for variation in beta that is unrelated to size, the relation between market beta and average return is flat, even when beta is the only explanatory variable.”

(Fama and French 1992). The follow-up Fama and French 1993 paper in the Journal of Financial Economics formalized the B/M insight into the HML factor via the now-canonical 2×3 size × B/M construction: NYSE median as the size breakpoint, 30/40/30 B/M breakpoints for growth/neutral/value, six value-weighted portfolios, with the HML return defined as 0.5 * (SV + BV) - 0.5 * (SG + BG). In the original 1963-1991 sample, HML returned approximately 0.4% per month.

This verdict tests whether the effect survives an independent post-2015 sample using the same 2×3 construction.

What we tested

This is the fourth Nullberg verdict and the first that does not trivially fail. The first three found a sign-inverted MAX, a decayed-to-null momentum, and an underpowered directional survivor in profitability. Value is a different kind of factor again: its claim is about the market’s pricing of accounting book equity, tested against a time-varying denominator that moves with price.

Sample and data

2016-01-05 to 2026-04-09 daily closes, 122 usable formation months in the primary spec
5,568 US equities in the price cache, merged with 8,479 rows of FMP company profile data, further merged with raw quarterly book equity (totalStockholdersEquity from balance_sheets.parquet, 485K rows) and raw quarterly shares outstanding (weightedAverageShsOut from income_statements.parquet, 493K rows)
Book equity and shares outstanding joined point-in-time via pandas.merge_asof on filingDate, so no look-ahead
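
A minimal sketch of the point-in-time join, on toy data. Only filingDate and totalStockholdersEquity come from the description above; the month_end column, the symbol AAA, and the values are hypothetical, and the verdict script may organize the join differently.

```python
import pandas as pd

# Hypothetical toy data: one symbol, three formation month-ends, two filings.
month_ends = pd.DataFrame({
    "symbol": ["AAA", "AAA", "AAA"],
    "month_end": pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"]),
})
filings = pd.DataFrame({
    "symbol": ["AAA", "AAA"],
    "filingDate": pd.to_datetime(["2019-11-05", "2020-02-10"]),
    "totalStockholdersEquity": [100.0, 120.0],
})

# merge_asof requires both frames to be sorted on their time keys.
month_ends = month_ends.sort_values("month_end")
filings = filings.sort_values("filingDate")

# direction="backward" picks the latest filing with filingDate <= month_end,
# so a filing published in February is never visible at the January month-end.
pit = pd.merge_asof(
    month_ends,
    filings,
    left_on="month_end",
    right_on="filingDate",
    by="symbol",
    direction="backward",
)
print(pit[["month_end", "filingDate", "totalStockholdersEquity"]])
```

The January row picks up the November 2019 filing, and the February and March rows pick up the filing dated 2020-02-10.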

Methodology

For each formation month t, attach the most recent filed book equity and shares outstanding to every symbol (via as-of merge on filingDate ≤ end of month t)
Compute a time-varying market cap as shares_outstanding * end_of_month_close. Unlike the previous three verdicts, which used the FMP snapshot market cap, this one tracks price moves between filings, which is the correct way to form a B/M score when the denominator is the market value of equity.
Compute B/M = totalStockholdersEquity / time_varying_market_cap
Exclude financial firms (sector in {Financial Services, Real Estate}) to match the Fama-French SIC 6 convention. Financial firms have leverage structures that distort B/M in ways the original paper deliberately excluded.
Exclude stocks with non-positive book equity (standard value-research practice)
Form portfolios: see below for each specification
Hold for one month, rebalance monthly
Report i.i.d. t-statistic as the pre-registered standard, plus a Newey-West 12-lag HAC t-statistic as an additional rigor check because monthly long-short factor returns are often mildly autocorrelated
Split the sample at its midpoint and report the primary specification on each half separately as a stability check
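
The B/M scoring and exclusion steps above can be sketched roughly as follows. The sector, weightedAverageShsOut and end_of_month_close column names mirror the text, but the single-frame layout, the helper name book_to_market, and the toy values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical one-month panel after the point-in-time join.
panel = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC", "DDD"],
    "sector": ["Technology", "Financial Services", "Energy", "Real Estate"],
    "totalStockholdersEquity": [500.0, 900.0, -50.0, 300.0],
    "weightedAverageShsOut": [100.0, 200.0, 50.0, 80.0],
    "end_of_month_close": [20.0, 10.0, 8.0, 5.0],
})

EXCLUDED_SECTORS = {"Financial Services", "Real Estate"}


def book_to_market(df):
    """Attach time-varying market cap and B/M, then apply the two exclusions."""
    out = df.copy()
    # Market cap moves with price between filings.
    out["market_cap"] = out["weightedAverageShsOut"] * out["end_of_month_close"]
    out["bm"] = out["totalStockholdersEquity"] / out["market_cap"]
    keep = (~out["sector"].isin(EXCLUDED_SECTORS)) & (out["totalStockholdersEquity"] > 0)
    return out.loc[keep].reset_index(drop=True)


scored = book_to_market(panel)
print(scored[["symbol", "market_cap", "bm"]])
```

In this toy panel only AAA survives: BBB and DDD are dropped by the sector filter and CCC by the non-positive book equity filter.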

Four specifications reported in full:

Primary: Fama-French 1993 HML 2×3 construction, financials excluded. NYSE median market cap is the size breakpoint, 30/40/30 NYSE B/M percentiles are the value breakpoints, six value-weighted portfolios, HML = 0.5*(SV+BV) - 0.5*(SG+BG). The canonical factor construction.
Sensitivity A: simple D10-D1 on B/M, value-weighted, NYSE breakpoints, financials excluded. A direct analog of the other Nullberg verdicts’ primary spec (ten-decile sort, top decile minus bottom decile). This spec is expected to produce a wider spread than HML 2×3 because it compares the extreme 10% tails rather than the 30% tails.
Sensitivity B: HML 2×3 including financials. Same construction as primary but with financial firms and REITs left in. Shows the effect of the exclusion.
Sensitivity C: simple D10-D1, equal-weighted, filtered, winsorized, financials excluded. Equal-weighted analog to Sens A. Tests whether the value effect exists on smaller stocks.
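
As a rough illustration of the primary construction, the sketch below computes one month of HML from a toy cross-section. The column names (is_nyse, market_cap, bm, next_ret), the helper name hml_2x3, and the treatment of stocks sitting exactly on a breakpoint are assumptions for illustration, not taken from the verdict script.

```python
import pandas as pd


def hml_2x3(df):
    """One-month HML from a cross-section with NYSE-based breakpoints."""
    nyse = df[df["is_nyse"]]
    size_bp = nyse["market_cap"].median()            # NYSE median size
    bm_lo, bm_hi = nyse["bm"].quantile([0.3, 0.7])   # 30/40/30 NYSE B/M

    size = pd.Series("B", index=df.index)
    size[df["market_cap"] <= size_bp] = "S"
    value = pd.Series("N", index=df.index)
    value[df["bm"] <= bm_lo] = "G"
    value[df["bm"] > bm_hi] = "V"
    bucket = size + value                            # e.g. "SV", "BG"

    # Value-weighted return of each of the six portfolios.
    weighted = (df["market_cap"] * df["next_ret"]).groupby(bucket).sum()
    vw = weighted / df["market_cap"].groupby(bucket).sum()
    return 0.5 * (vw["SV"] + vw["BV"]) - 0.5 * (vw["SG"] + vw["BG"])


# Hypothetical cross-section with one stock per bucket.
toy = pd.DataFrame({
    "is_nyse": [True] * 6,
    "market_cap": [10.0, 10.0, 10.0, 100.0, 100.0, 100.0],
    "bm": [0.1, 0.5, 1.0, 0.1, 0.5, 1.0],
    "next_ret": [0.00, 0.01, 0.03, 0.01, 0.01, 0.02],
})
hml = hml_2x3(toy)
print(round(hml, 4))
```

With one stock per bucket the value weighting is trivial; on real data each bucket holds many stocks and the market_cap weights matter.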

Pre-registered verdict thresholds, calibrated for the Fama-French HML canonical of ~0.4% per month:

Replicated: mean(HML) > 0.003 AND t > 2
Degraded: 0 < mean(HML) ≤ 0.003 AND t > 2
Failed: mean ≤ 0 OR t ≤ 2
Inconclusive: a data quality issue prevents a clean call

The 0.003 floor is approximately 75% of the canonical 0.4% per month. Paper-specific calibration is consistent with the earlier Novy-Marx verdict which used a 0.002 floor for its smaller canonical effect.
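
The rubric reduces to a few comparisons. A minimal sketch follows; the function name classify and its argument names are hypothetical, and the inputs in the usage lines are made-up values rather than results.

```python
def classify(mean_monthly, t_stat, data_ok=True, floor=0.003, t_min=2.0):
    """Map a monthly mean (as a decimal) and a t-statistic to a rubric label."""
    if not data_ok:
        return "INCONCLUSIVE"
    if mean_monthly <= 0 or t_stat <= t_min:
        return "FAILED"
    if mean_monthly > floor:
        return "REPLICATED"
    return "DEGRADED"  # 0 < mean <= floor with t > t_min


# Hypothetical inputs, one per outcome.
print(classify(0.010, 2.5))                 # REPLICATED
print(classify(0.002, 2.5))                 # DEGRADED
print(classify(0.010, 1.5))                 # FAILED
print(classify(0.010, 2.5, data_ok=False))  # INCONCLUSIVE
```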

The numbers

Primary specification: HML 2×3, value-weighted, no financials

Metric	Value
Sample months	122
Median stocks / month	2,169
Mean monthly HML	+1.698%
Std dev monthly	6.65%
i.i.d. t-statistic	+2.82
Newey-West 12-lag t-statistic	+1.94
Annualized Sharpe	+0.88
% months positive	54.9%
Worst month	-13.16%
Best month	+28.48%

Pre-registered call on the full-sample i.i.d. rubric: mean > 0.003 AND t > 2, so the verdict is REPLICATED. This is the first REPLICATED verdict in the Nullberg archive.
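
The i.i.d. t-statistic and the annualized Sharpe ratio in the table follow directly from the reported mean, standard deviation, and month count. The sketch below redoes that arithmetic, assuming the Sharpe ratio is the monthly mean over the monthly standard deviation scaled by the square root of 12 with no risk-free adjustment; that convention is our assumption, although it reproduces the reported figure.

```python
import math

mean_monthly = 0.01698   # +1.698% per month
std_monthly = 0.0665     # 6.65% per month
n_months = 122

# i.i.d. t-statistic: mean divided by its standard error.
t_iid = mean_monthly / (std_monthly / math.sqrt(n_months))

# Annualized Sharpe ratio of the long-short return (assumed convention).
sharpe_annual = mean_monthly / std_monthly * math.sqrt(12)

print(round(t_iid, 2), round(sharpe_annual, 2))
```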

Two important caveats follow immediately below.

Caveat 1: the Newey-West HAC adjustment is just below the threshold

Monthly factor long-short returns typically show mild positive autocorrelation, which the i.i.d. t-statistic ignores. Applying a Newey-West HAC correction with a 12-lag Bartlett kernel (a common choice for monthly data) inflates the standard error, lowering the t-statistic from +2.82 to +1.94. This is below the same t > 2 threshold the rubric uses. Under a strict Newey-West interpretation, the primary would fail.

We report both statistics. The rubric was pre-registered with an i.i.d. t-stat to maintain cross-verdict consistency (MAX, momentum, and profitability all used i.i.d. tests), so the rubric call is REPLICATED. But the reader should know that under a more conservative standard error the primary specification narrowly misses the bar.
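
For readers who want to see what the HAC adjustment does mechanically, here is a hand-rolled sketch of a Newey-West t-statistic for the mean of a return series, using Bartlett weights. The function name newey_west_t is hypothetical, the simulated series is made up, and small-sample conventions (for example degrees-of-freedom corrections) may differ from the verdict script, so the numbers will not match the table exactly.

```python
import numpy as np


def newey_west_t(returns, lags=12):
    """t-statistic of the mean using a Bartlett-kernel long-run variance."""
    r = np.asarray(returns, dtype=float)
    n = r.size
    e = r - r.mean()
    lrv = np.dot(e, e) / n                       # lag-0 autocovariance
    for k in range(1, lags + 1):
        weight = 1.0 - k / (lags + 1.0)          # Bartlett weight
        lrv += 2.0 * weight * np.dot(e[k:], e[:-k]) / n
    return r.mean() / np.sqrt(lrv / n)


# Made-up AR(1) monthly series with positive autocorrelation.
rng = np.random.default_rng(0)
shocks = rng.normal(0.0, 0.05, size=122)
r = np.empty(122)
r[0] = 0.01 + shocks[0]
for t in range(1, 122):
    r[t] = 0.01 + 0.3 * (r[t - 1] - 0.01) + shocks[t]

print(newey_west_t(r, lags=0))   # reduces to the i.i.d. case
print(newey_west_t(r, lags=12))  # HAC-adjusted
```

With lags=0 the long-run variance collapses to the plain sample variance, so the function returns the i.i.d. t-statistic (with a population-variance denominator).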

Caveat 2: the replication is driven by the second half of the sample

The sample splits at its midpoint as follows:

Half	Months	Mean monthly	i.i.d. t
First half: 2016-01 to 2021-01	61	+0.585%	+0.73
Second half: 2021-02 to 2026-02	61	+2.810%	+3.20

Only the second half of the sample supports the replication. The first half is statistically null. If Nullberg had run this replication in January 2021, the verdict would have been FAILED.

The annual mean HML returns tell the same story:

Year	Mean monthly HML
2016	+1.56%
2017	+2.84%
2018	-0.07%
2019	-0.88%
2020	-0.74%
2021	+1.77%
2022	+9.24%
2023	+2.31%
2024	-0.30%
2025	+1.13%
2026 (partial)	+2.38%

2022 alone averages +9.24% per month, which is more than 5x the full-sample mean. That year was the well-documented “value revenge” when rising rates and the tech selloff crushed growth stocks and rewarded profitable low-multiple value stocks. Three of the top five monthly HML returns in the entire 122-month sample (+28.5% in October 2022, +21.0% in August 2022, +20.9% in May 2022) are from 2022.
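
The annual table, the half-sample table, and the full-sample mean can be cross-checked against each other with simple arithmetic. The sketch below assumes 12 months for each full year and 2 months for 2026, which is inferred from the 122-month count and the 2026-02 end of the second half rather than stated explicitly.

```python
# Mean monthly HML by year, in percent, copied from the table above.
annual = {
    2016: 1.56, 2017: 2.84, 2018: -0.07, 2019: -0.88, 2020: -0.74,
    2021: 1.77, 2022: 9.24, 2023: 2.31, 2024: -0.30, 2025: 1.13,
    2026: 2.38,
}
months = {year: 12 for year in annual}
months[2026] = 2  # assumed: partial year covering two formation months

total_months = sum(months.values())
weighted_mean = sum(annual[y] * months[y] for y in annual) / total_months

# The two halves have 61 months each, so the full mean is their average.
halves_mean = (61 * 0.585 + 61 * 2.810) / 122

# How large 2022 is relative to the reported full-sample mean.
ratio_2022 = annual[2022] / 1.698

print(total_months, round(weighted_mean, 3), round(halves_mean, 3), round(ratio_2022, 2))
```

Both reconstructions land within rounding of the reported +1.698%.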

A trimmed analysis (dropping the top 5 and bottom 5 monthly returns) still produces a mean of +1.323% with t = +2.99, so the effect is not driven by a literal handful of freak months. But it is a regime-driven effect: value was flat-to-negative from 2018 through 2020, then experienced an enormous rebound from mid-2021 onward as the macro regime changed.
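
A trimmed check of this kind can be sketched in a few lines. The helper name trimmed_stats is hypothetical, and the input in the usage line is a made-up sequence rather than the HML series.

```python
import numpy as np


def trimmed_stats(returns, k=5):
    """Mean and i.i.d. t-statistic after dropping the k largest and k smallest values."""
    r = np.sort(np.asarray(returns, dtype=float))
    core = r[k:-k] if k > 0 else r
    mean = core.mean()
    t = mean / (core.std(ddof=1) / np.sqrt(core.size))
    return mean, t


# Made-up input: the integers 1..20, so the trimmed core is 6..15.
mean, t = trimmed_stats(range(1, 21), k=5)
print(mean, t)
```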

Sensitivity A: simple D10-D1, VW NYSE breakpoints, no financials

Metric	Value
Mean monthly	+3.173%
i.i.d. t-stat	+2.41
Newey-West t-stat	+2.05

The simple D10-D1 spread is larger in magnitude than HML 2×3 because it uses a 10%-10% extreme rather than a 30%-30% split. It clears t > 2 under both the i.i.d. and the Newey-West standard errors. This makes it a more conservative confirmation of the primary call: the most extreme value-vs-growth spread survives the HAC correction, whereas the 2×3 construction does not.
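
For comparison with the 2×3 construction, a value-weighted D10-D1 spread with NYSE breakpoints might look like the sketch below. The column names (is_nyse, market_cap, bm, next_ret), the helper name d10_minus_d1, and the handling of ties at the breakpoints are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def d10_minus_d1(df):
    """Top-decile minus bottom-decile B/M spread, value-weighted."""
    nyse_bm = df.loc[df["is_nyse"], "bm"]
    lo, hi = nyse_bm.quantile([0.1, 0.9])    # NYSE decile breakpoints
    d1 = df[df["bm"] <= lo]                   # growth extreme
    d10 = df[df["bm"] > hi]                   # value extreme

    def vw_ret(group):
        return np.average(group["next_ret"], weights=group["market_cap"])

    return vw_ret(d10) - vw_ret(d1)


# Hypothetical cross-section: ten stocks with evenly spaced B/M.
toy = pd.DataFrame({
    "is_nyse": [True] * 10,
    "market_cap": [50.0] * 10,
    "bm": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "next_ret": [0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.05],
})
spread = d10_minus_d1(toy)
print(round(spread, 4))
```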

Sensitivity B: HML 2×3, including financials and real estate

Metric	Value
Mean monthly	+1.602%
i.i.d. t-stat	+2.78
Newey-West t-stat	+1.92

Including financials and real estate produces a nearly identical picture. The Fama-French SIC 6 exclusion matters less than the convention would suggest in our sample, which is itself informative.

Sensitivity C: simple D10-D1, equal-weighted, filtered, no financials

Metric	Value
Median stocks / month	1,689
Mean monthly	+0.157%
i.i.d. t-stat	+0.35
Newey-West t-stat	+0.31

Under equal-weighting with the standard price/liquidity filter, the value effect disappears. This is the same pattern we observed in the profitability verdict: the effect is concentrated in value-weighted large-cap portfolios and is absent from equal-weighted mid-cap universes. Consistent with Asness, Frazzini, Israel, Moskowitz, and Pedersen’s post-2010 work characterizing classic quality and value factors as large-cap phenomena.

What this means

Strict reading of the pre-registered rubric: REPLICATED. The full-sample HML 2×3 clears both the mean floor and the t threshold under i.i.d. assumptions, and the rubric does not condition on sub-sample stability or HAC adjustments.

Honest reading of what the data actually says:

The effect survived the 2016-2026 out-of-sample window. That is not nothing. Three of the four specifications we ran clear the significance bar under i.i.d. assumptions, and one of them (Sens A) also clears under the stricter Newey-West standard.
The effect is concentrated in the post-2021 regime. The first half of our sample is indistinguishable from zero. A researcher running the same script at the end of January 2021 would have returned a FAILED verdict. What changed is the macro environment: rising rates, a growth selloff, and the subsequent mean-reversion in valuation dispersion.
The effect is a large-cap phenomenon. Value-weighted large-cap portfolios show it. Equal-weighted mid-cap portfolios do not.
The Newey-West adjustment is a real concern for HML 2×3 specifically. Monthly HML returns have enough positive autocorrelation that the HAC-adjusted t is 31% lower than the i.i.d. t. The simple D10-D1 spread is less affected, possibly because its wider cross-section averages out more of the autocorrelation.

A fair summary is that value works now because of the 2022 rebound, not because of a consistently priced risk premium. Whether a reader wants to call that “replicated” depends on whether they believe the pre-registered rubric should condition on sub-sample stability. Our rubric does not, so we call it REPLICATED. The page is written so that any reader who disagrees can see exactly the decomposition that would change their mind.

Comparative picture across four verdicts

Paper	Factor class	Primary mean	i.i.d. t	NW t	Verdict
MAX, Bali Cakici Whitelaw 2011	Lottery	-1.801%	-2.57	n/a	Failed, inverted
Momentum, Jegadeesh Titman 1993	Trend	-0.134%	-0.16	n/a	Failed, decayed
Profitability, Novy-Marx 2013	Quality	+0.503%	+0.93	n/a	Failed, underpowered survivor
Value, Fama-French 1992	Value	+1.698%	+2.82	+1.94	Replicated (regime-driven)

One sign inversion, one decay to null, one underpowered directional survivor, and one regime-driven marginal replication. Four factor classes, four distinct verdict outcomes, and the rubric has now landed on both sides of the pre-registered line. That is the archive’s first demonstration that Nullberg is not a machine for stamping FAILED on every paper.

Reproducibility

The replication is a single Python file reading the 10-year daily OHLCV cache plus raw FMP quarterly book equity (from balance sheets) and raw FMP quarterly shares outstanding (from income statements). The fundamentals join is a point-in-time pandas.merge_asof on filingDate. Market cap is time-varying (shares_outstanding * end_of_month_close). Financial firms are excluded to match the Fama-French convention. Newey-West 12-lag HAC standard errors are computed alongside i.i.d. standard errors. The script runs in about 26 seconds.

Script: scripts/verdicts/fama_french_1992_value.py
Results JSON: scripts/verdicts/fama_french_1992_value.results.json
Monthly long-short CSVs: ..._primary.csv, ..._sensA.csv, ..._sensB.csv, ..._sensC.csv

What we will track from here

This verdict enters the archive as replicated (regime-driven). It is reviewed when any of the following happens:

An additional year of forward-sample data materially changes the second-half mean. If 2027 is weak, the regime-dependent story strengthens.
A rolling-window decomposition shows the effect reverting to null as we move the evaluation window.
A stricter Newey-West specification with more lags (24 or more, or a data-driven bandwidth of the kind proposed by Andrews 1991) pushes the primary t-statistic further below 2.
A larger cross-sectional universe or a time-varying VW with historical shares outstanding changes the magnitude materially.

If the first or second of these happens, the verdict flips to DEGRADED or FAILED with a dated changelog.

Bibliography

Fama, Eugene F., and Kenneth R. French. “The Cross-Section of Expected Stock Returns.” Journal of Finance 47(2), 1992, pp. 427-465.
Fama, Eugene F., and Kenneth R. French. “Common risk factors in the returns on stocks and bonds.” Journal of Financial Economics 33(1), 1993, pp. 3-56.
Bali, Turan G., Nusret Cakici, and Robert F. Whitelaw. “Maxing Out: Stocks as Lotteries and the Cross-Section of Expected Returns.” Journal of Financial Economics 99(2), 2011, pp. 427-446. Nullberg verdict: failed, inverted.
Jegadeesh, Narasimhan, and Sheridan Titman. “Returns to Buying Winners and Selling Losers: Implications for Stock Market Efficiency.” Journal of Finance 48(1), 1993, pp. 65-91. Nullberg verdict: failed, decayed.
Novy-Marx, Robert. “The other side of value: The gross profitability premium.” Journal of Financial Economics 108(1), 2013, pp. 1-28. Nullberg verdict: failed, underpowered.