Three Measures of Third-Base Greatness: Z-Scores, WAR, and wRC+

Introduction

At this point in the third-base study, we have three different ways of measuring greatness.

The first is our own z-score framework. It asks how far a player separated from other third basemen in the same season and across a career.

The second is WAR. It asks how much total value a player produced, including offense, defense, baserunning, position, replacement level, and playing time.

The third is wRC+. It asks how strong a hitter was after adjusting for league and park context, with 100 set as league average.

Each measure is useful.

Each measure answers a different question.

That is why this comparison matters.

A player can dominate by z-score because he separates from his third-base peers. A player can dominate by WAR because he accumulates value across many seasons. A player can dominate by wRC+ because his offensive rate quality is extraordinary.

The goal of this chapter is not to declare that one metric is correct and the others are wrong.

The goal is to compare the stories they tell.

For Study 1, I focused on regular third basemen with at least five qualified third-base seasons and matched values for career WAR, career wRC+, and our career z-score measures. That produced a working sample of 239 third basemen.

The central question is:

Which third basemen remain elite when judged by z-scores, WAR, and wRC+ together?

The answer begins with Mike Schmidt.

Across the three-metric composite, Schmidt is the clear anchor of the study. He ranks first in combined z-score, first in WAR, and second in wRC+. He is the player who survives every test.

But the rest of the list is more interesting than a simple ranking.

Eddie Mathews, Chipper Jones, Wade Boggs, George Brett, Home Run Baker, Alex Rodriguez, Ron Santo, Scott Rolen, and Jose Ramirez also emerge as strong cross-metric performers.

At the same time, several players reveal the tension between the metrics.

Brooks Robinson ranks extremely high by combined z-score and WAR, but much lower by wRC+. Dick Allen ranks first by wRC+, but much lower by combined z-score and WAR. Adrian Beltre ranks third by WAR but much lower by wRC+.

Those differences are not problems.

They are the point of the chapter.

The Three Metrics

This study compares three broad dimensions: Combined career z-score, career WAR, and career wRC+.

The combined z-score is internal to this project. WAR and wRC+ are external validation measures.

The three metrics are not interchangeable.

They measure different things.

The Combined Z-Score

The combined z-score is based on two career components: the Model C offensive career score and the traditional defensive career score.

Each career score is standardized across the third-base regular sample.

The standardized offensive score is:

z_{\mathrm{Offense},i} = \frac{ \mathrm{Offense}_{i} - \overline{\mathrm{Offense}} }{ s_{\mathrm{Offense}} }

The standardized defensive score is:

z_{\mathrm{Defense},i} = \frac{ \mathrm{Defense}_{i} - \overline{\mathrm{Defense}} }{ s_{\mathrm{Defense}} }

The combined z-score is:

\mathrm{Combined\ Z}_{i} = z_{\mathrm{Offense},i} + z_{\mathrm{Defense},i}

This score rewards players who separate from other third basemen in both offensive and traditional defensive dimensions.

It is not the same thing as WAR.

It does not directly assign run values. It does not use replacement level. It does not use park factors in the way WAR or wRC+ does. It is a peer-separation measure.

That is its strength.

It asks:

How far did this third baseman stand from the position?
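The standardization and sum above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: the player labels and career scores are hypothetical, and the use of the sample standard deviation (n - 1) is an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values against the sample mean and standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample SD (n - 1); an assumption, not confirmed by the text
    return [(v - m) / s for v in values]

# Hypothetical career scores for five players; illustrative only, not study data.
players = ["A", "B", "C", "D", "E"]
offense = [12.0, 8.0, 3.0, -2.0, -6.0]
defense = [1.0, 9.0, -3.0, 4.0, -1.0]

z_off = z_scores(offense)
z_def = z_scores(defense)

# Combined Z is the simple sum of the two standardized components.
combined = {p: zo + zd for p, zo, zd in zip(players, z_off, z_def)}

for p, c in sorted(combined.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{p}: {c:+.2f}")
```

In this toy sample, player B leads because he is above average in both components, even though player A has the best offensive score. That is the two-dimensional separation the combined score is meant to reward.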

WAR

WAR is a broader value metric.

In this chapter, WAR is used as a career value measure for the player’s qualified third-base seasons in our merged dataset. WAR includes offense, defense, baserunning, positional value, replacement value, and playing time.

For the purpose of this chapter, we can think of WAR abstractly as:

\begin{aligned} \mathrm{WAR}_i &= \mathrm{Offense}_i + \mathrm{Defense}_i + \mathrm{Baserunning}_i \\ &\quad+ \mathrm{Position}_i + \mathrm{Replacement}_i \end{aligned}

That is not intended as a full WAR formula. It is a conceptual summary.

WAR asks a different question from the z-score model.

It asks:

How much total value did this player produce?

That is why Adrian Beltre, Wade Boggs, Brooks Robinson, Scott Rolen, Graig Nettles, and Buddy Bell can look stronger by WAR than they do by wRC+ alone.

WAR values more than hitting.

wRC+

wRC+ is an offensive rate measure.

It is scaled so that 100 is league average:

wRC^+ = 100

A hitter with a 120 wRC+ is roughly 20 percent better than league average offensively:

wRC^+ = 120

A hitter with an 80 wRC+ is roughly 20 percent below league average offensively:

wRC^+ = 80

For this study, wRC+ answers a narrower question:

How good was the hitter?

It does not measure third-base defense. It does not measure total career value. It does not reward playing third base well. It isolates offensive rate quality.

That is why Dick Allen can rank first in wRC+ without ranking first in the other systems.

Why a Composite Score Is Useful

Because the three measures use different scales, we cannot simply add raw z-score, WAR, and wRC+.

Instead, I converted each metric into a percentile rank.

For each player:

\begin{aligned} P_{\mathrm{Combined\ Z},i} &= \mathrm{PercentileRank}(\mathrm{Combined\ Z}_i) \\ P_{\mathrm{WAR},i} &= \mathrm{PercentileRank}(\mathrm{WAR}_i) \\ P_{wRC^+,i} &= \mathrm{PercentileRank}(wRC^+_i) \end{aligned}

Then I calculated a three-metric composite percentile:

\mathrm{Composite}_{i} = \frac{ P_{\mathrm{Combined\ Z},i} + P_{\mathrm{WAR},i} + P_{wRC^+,i} }{3}

Higher values indicate players who rank well across all three systems.

This composite is not meant to replace the individual metrics. It is a summary tool.

It rewards broad agreement.

A player who ranks high in all three metrics will rise. A player who is exceptional in one metric but weaker in the others will still be visible, but not necessarily at the top of the composite.

That is why this study is useful.

It separates all-around consensus from metric-specific greatness.
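A small Python sketch shows how the composite behaves. The percentile-rank convention used here (percent of the sample at or below a value) is an assumption, because the text does not specify one, and the four players and their values are hypothetical.

```python
def percentile_rank(values):
    """Percent of the sample at or below each value (one common convention)."""
    n = len(values)
    return [100.0 * sum(1 for other in values if other <= v) / n for v in values]

# Hypothetical inputs for four players; not study data.
names = ["P1", "P2", "P3", "P4"]
combined_z = [3.1, 0.4, 1.9, -0.8]
war = [90.0, 35.0, 50.0, 20.0]
wrc_plus = [145, 118, 104, 160]

p_z = percentile_rank(combined_z)
p_war = percentile_rank(war)
p_wrc = percentile_rank(wrc_plus)

# Composite is the unweighted mean of the three percentile ranks.
composite = {
    name: (a + b + c) / 3
    for name, a, b, c in zip(names, p_z, p_war, p_wrc)
}

for name, score in sorted(composite.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.1f}")
```

Here P4 has the best wRC+ in the sample but the lowest combined z-score and WAR, so his composite lands at 50.0. P1, who is strong in all three, leads at about 91.7. That is the contrast between broad agreement and metric-specific greatness.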

Figure 1: The Top 25 Composite Performers

Place Figure 1 here.

Figure 1. Top third basemen across combined z-score, WAR, and wRC+.

The top 25 composite chart gives the broadest view of the results.

The top ten are: 1. Mike Schmidt 2. Eddie Mathews 3. Chipper Jones 4. Wade Boggs 5. George Brett 6. Home Run Baker 7. Alex Rodriguez 8. Ron Santo 9. Scott Rolen 10. Jose Ramirez.

It is not simply an offensive list. Brooks Robinson does not reach the top ten because wRC+ pulls him down, but the list still includes two-way and value-based players such as Wade Boggs, Ron Santo, and Scott Rolen.

It is not simply a WAR list either. Adrian Beltre ranks third by WAR, but thirteenth by the composite because his wRC+ rank is lower than his WAR and z-score ranks.

It is not simply a wRC+ list. Dick Allen ranks first by wRC+, but he does not land near the top of the composite because his combined z-score and WAR ranks are lower.

The top composite list rewards players who remain strong across the different definitions of greatness.

That is why Schmidt is first.

He is not merely great by one method. He is great by all three.

The Top Players by Each Metric

The top players change depending on the question.

By combined z-score, the top five are: 1. Mike Schmidt 2. Brooks Robinson 3. Nolan Arenado 4. Scott Rolen 5. Wade Boggs.

This list rewards two-dimensional separation. Robinson and Arenado rise because traditional defense is included.

By WAR, the top five are: 1. Mike Schmidt 2. Eddie Mathews 3. Adrian Beltre 4. Wade Boggs 5. Brooks Robinson.

This list rewards total value and career accumulation.

By wRC+, the top five are: 1. Dick Allen 2. Mike Schmidt 3. Eddie Mathews 4. Harmon Killebrew 5. John McGraw.

This list rewards offensive rate quality.

These are three different lists because they are answering three different questions.

The question is not which list is correct.

The question is what each list reveals.

Figure 2: Rank Movement Across the Three Systems

Place Figure 2 here.

Figure 2. How the top composite third basemen rank by combined z-score, WAR, and wRC+.

The rank-comparison figure shows how players move across the three measures.

Mike Schmidt barely moves. That is the signature of a consensus number one. His profile is not dependent on one definition of value.

Eddie Mathews is similarly strong. He ranks high in WAR and wRC+, and still remains strong in the combined z-score system.

Chipper Jones is also stable. His defensive score is not strong, but his offensive value is so high that he remains near the top.

The movement becomes more interesting with players like Adrian Beltre, Nolan Arenado, and Scott Rolen.

Beltre ranks extremely high by WAR but much lower by wRC+. That makes sense. His case is not purely about offensive rate. It is about durability, defense, and total value.

Arenado ranks very high by combined z-score but much lower by wRC+. Again, that makes sense. His profile is two-dimensional and defense-forward.

Rolen is a balanced case. He ranks very high by combined z-score and WAR but lower by wRC+. That reflects his two-way value.

This figure shows that the metrics are not redundant.

They overlap, but they do not tell the same story.

Figure 3: Combined Z-Score Versus WAR

Place Figure 3 here.

Figure 3. Combined career z-score versus career WAR among third-base regulars.

The combined z-score and WAR relationship is strong.

The fitted line is:

\mathrm{WAR} = 15.98 + 6.69(\mathrm{Combined\ Z})

The model fit is:

R^2 = 0.782

That means the combined z-score explains about 78 percent of the variation in career WAR among third-base regulars.

This is important.

It tells us that our z-score framework is not just an internal ranking system. It aligns strongly with a major external value metric.

But the scatterplot also shows meaningful differences.

Mike Schmidt sits at the upper-right extreme. His combined z-score and WAR both identify him as historically exceptional.

Brooks Robinson sits high in combined z-score and WAR, but his shape is different. His combined z-score is powered by traditional defense rather than offensive dominance.

Adrian Beltre sits higher in WAR than his combined z-score alone would predict. That suggests his total value, longevity, and broader WAR components are stronger than the simplified z-score model fully captures.

Nolan Arenado sits high in combined z-score but lower in WAR relative to the line. That may reflect career length, active-career status, or differences between traditional defensive separation and WAR’s defensive valuation.

The relationship is strong, but the residuals still matter.

They show where the systems disagree.
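One way to read those residuals is to compute them directly from the fitted line. The sketch below uses the intercept and slope reported above. The two players and their values are hypothetical and are not taken from the study data.

```python
INTERCEPT = 15.98  # intercept of the fitted line reported in the text
SLOPE = 6.69       # slope of the fitted line reported in the text

def predicted_war(combined_z):
    """Career WAR implied by the fitted line for a given combined z-score."""
    return INTERCEPT + SLOPE * combined_z

def residual(actual_war, combined_z):
    """Positive: more WAR than the line predicts. Negative: less."""
    return actual_war - predicted_war(combined_z)

# Hypothetical (combined z, career WAR) pairs; not values from the study.
examples = {"Player X": (4.0, 60.0), "Player Y": (5.0, 40.0)}

for name, (z, war) in examples.items():
    print(f"{name}: predicted {predicted_war(z):.2f}, "
          f"residual {residual(war, z):+.2f}")
```

A positive residual, as for the hypothetical Player X, is the Beltre-style pattern: more WAR than the z-score alone would predict. A negative residual, as for Player Y, is the Arenado-style pattern.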

Figure 4: Offensive Z-Score Rate Versus wRC+

Place Figure 4 here.

Figure 4. Average offensive z-score per qualified third-base season versus career wRC+.

The relationship between average offensive z-score and wRC+ is also strong.

The fitted line is:

wRC^+ = 100.89 + 5.41(\mathrm{Average\ Offensive\ Z})

The fit is:

R^2 = 0.740

This confirms the earlier wRC+ validation result.

Average offensive z-score is a strong predictor of wRC+ because both are measuring offensive quality, though in different ways.

The equation says that each additional point of average offensive z-score corresponds to about 5.41 additional points of career wRC+:

\beta_1 = 5.41
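As a worked illustration, take a hypothetical average offensive z-score of 2.0 (an assumed value, not one from the study). Plugging it into the fitted line gives:

\widehat{wRC^+} = 100.89 + 5.41(2.0) = 100.89 + 10.82 = 111.71

By the same line, an average offensive z-score of 0 maps to the intercept, about 101, which sits close to the league-average mark of 100.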

This is why the offensive names rise in this figure.

Dick Allen, Mike Schmidt, Eddie Mathews, Chipper Jones, Alex Rodriguez, George Brett, Home Run Baker, Wade Boggs, Al Rosen, and David Wright all appear as strong offensive profiles.

Brooks Robinson, by contrast, is much closer to the middle of the wRC+ distribution. That is not a criticism. It simply reflects that Robinson’s greatness is not primarily a wRC+ case.

That is exactly why this comparison matters.

Figure 5: Cross-Metric Rank Disagreements

Place Figure 5 here.

Figure 5. Largest cross-metric rank disagreements among notable third basemen.

The disagreement chart is one of the most useful figures in the study.

It identifies players whose rankings differ sharply across combined z-score, WAR, and wRC+.

The rank spread is:

\begin{aligned} \mathrm{RankSpread}_{i} &= \max\left( r_{\mathrm{CombinedZ},i}, r_{\mathrm{WAR},i}, r_{\mathrm{wRC}^{+},i} \right) \\ &\quad- \min\left( r_{\mathrm{CombinedZ},i}, r_{\mathrm{WAR},i}, r_{\mathrm{wRC}^{+},i} \right) \end{aligned}

A large spread means the player looks very different depending on the metric.
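The rank spread is simple to compute. The sketch below applies the definition to ranks reported in this chapter, listed in the order combined z-score, WAR, wRC+. The function name `rank_spread` is only an illustrative label.

```python
def rank_spread(z_rank, war_rank, wrc_rank):
    """Difference between a player's worst and best rank across the three metrics."""
    ranks = (z_rank, war_rank, wrc_rank)
    return max(ranks) - min(ranks)

# Ranks as reported in this chapter: (combined z-score, WAR, wRC+).
ranks = {
    "Mike Schmidt": (1, 1, 2),
    "Brooks Robinson": (2, 5, 114),
    "Dick Allen": (87, 51, 1),
    "Adrian Beltre": (9, 3, 58),
    "Nolan Arenado": (3, 15, 57),
}

spreads = {name: rank_spread(*r) for name, r in ranks.items()}

for name, s in sorted(spreads.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {s}")
```

Schmidt's spread of 1 is the numerical form of a consensus profile. Robinson's spread of 112 and Allen's spread of 86 mark the two largest disagreements among these five players.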

Some of the most interesting disagreement cases are: Brooks Robinson, Dick Allen, Adrian Beltre, Nolan Arenado, Willie Kamm, Gary Gaetti, Harmon Killebrew, Edwin Encarnacion, Jim Ray Hart, and Deacon White.

These players are not mistakes in the data.

They are interpretive opportunities.

Brooks Robinson is a defensive and WAR giant, but not a wRC+ giant.

Dick Allen is an offensive-rate giant, but not a top combined z-score or WAR third-base regular in this framework.

Adrian Beltre is a WAR giant, but wRC+ does not fully capture his case.

Willie Kamm is extremely strong by the combined z-score framework because of traditional defense, but he is not similarly high by wRC+.

Edwin Encarnacion is much stronger by wRC+ than by third-base z-score or WAR within the third-base framework, partly because his career offensive identity extends beyond a long regular third-base profile.

The disagreement chart shows why a single number is not enough.

The Schmidt Result

Mike Schmidt is the central result of Study 1.

He ranks: Combined z-score rank: 1; WAR rank: 1; wRC+ rank: 2; Composite rank: 1.

This is almost the perfect cross-metric profile.

Schmidt is not merely the best by our internal model. He is also the best by WAR and nearly the best by wRC+.

That matters because it means his result is robust.

He is not a product of one method.

He is the player who remains elite when the question changes.

If the question is peer separation, Schmidt wins.

If the question is total value, Schmidt wins.

If the question is offensive rate quality, Schmidt is still almost at the top.

That is the strongest possible case.

Eddie Mathews, Chipper Jones, and the Offensive Greatness Group

Eddie Mathews ranks second by the composite.

He ranks: Combined z-score rank: 6; WAR rank: 2; wRC+ rank: 3; Composite rank: 2.

That is a very strong cross-metric profile. Mathews does not have Schmidt’s complete separation, but he remains elite everywhere.

Chipper Jones ranks third by the composite: Combined z-score rank: 8; WAR rank: 6; wRC+ rank: 6; Composite rank: 3.

Chipper’s case is offense-forward. His traditional defensive component is not strong, but his offensive quality is so high that he remains elite across the systems.

George Brett and Home Run Baker also belong in this broad offensive greatness group. They are strong by wRC+, strong by WAR, and strong enough by combined z-score to remain near the top.

This group shows that offensive greatness can carry a third-base profile a long way.

Wade Boggs and the On-Base Profile

Wade Boggs ranks fourth by the composite: Combined z-score rank: 5; WAR rank: 4; wRC+ rank: 14; Composite rank: 4.

Boggs is a fascinating case because he is not a home-run power archetype. His greatness is built around contact, on-base skill, batting average, plate discipline, and sustained offensive quality.

The fact that he ranks so highly in the composite confirms that the model is not simply rewarding slugging power.

Boggs was a different kind of offensive star, and the metrics recognize it.

Rolen, Beltre, Arenado, and Two-Way Value

Scott Rolen, Adrian Beltre, and Nolan Arenado show why WAR and combined z-score are necessary companions to wRC+.

Rolen ranks: Combined z-score rank: 4; WAR rank: 10; wRC+ rank: 34; Composite rank: 9.

Beltre ranks: Combined z-score rank: 9; WAR rank: 3; wRC+ rank: 58; Composite rank: 13.

Arenado ranks: Combined z-score rank: 3; WAR rank: 15; wRC+ rank: 57; Composite rank: 15.

These are not weak wRC+ players. But their all-time third-base cases are not primarily wRC+ cases.

They are two-way cases.

Rolen is balanced. Beltre is a total-value and longevity case. Arenado is a defense-forward combined z-score case.

If this study used only wRC+, these players would be underrated.

If it used only WAR, their offensive shape would be less visible.

If it used only z-score, the relationship to broader value would be less clear.

The three-metric comparison gives the fuller picture.

Brooks Robinson and the Limits of wRC+

Brooks Robinson is the clearest example of a player whose greatness is not offensive-rate greatness.

He ranks: Combined z-score rank: 2; WAR rank: 5; wRC+ rank: 114.

That is a huge split.

It makes perfect sense.

Robinson’s historical case is not based on being one of the greatest offensive third basemen. It is based on defense, durability, and total value.

The combined z-score model sees him because traditional defense is included. WAR sees him because total value includes defense. wRC+ does not see him in the same way because wRC+ is an offensive metric.

That is not a flaw in wRC+. It is a reminder that wRC+ answers a narrower question.

Dick Allen and the Limits of Third-Base Accumulation

Dick Allen is the opposite case.

He ranks: wRC+ rank: 1; Combined z-score rank: 87; WAR rank: 51.

Allen’s offensive rate quality is extraordinary. But within this third-base regular framework, he does not accumulate the same kind of third-base-specific z-score or WAR profile as Schmidt, Mathews, Chipper, Boggs, or Brett.

This shows the difference between a great hitter who played third base and a great third baseman across the entire profile. That is an important distinction.

Allen is not diminished by this result. The study simply clarifies what kind of greatness he represents.

He is a wRC+ giant. He is not the top all-around third-base regular by the three-metric composite.

Why This Study Is Interesting

The value of Study 1 is that it prevents the project from becoming metric-dependent.

If Schmidt ranked first only by our z-score model, the conclusion would be interesting but narrower.

But Schmidt also ranks first by WAR and second by wRC+. That makes the conclusion much stronger.

At the same time, the disagreements prevent the chapter from becoming too simple.

Brooks Robinson, Dick Allen, Adrian Beltre, Nolan Arenado, Scott Rolen, Wade Boggs, and others show that greatness has different forms.

The study therefore supports two conclusions at once: 1. Mike Schmidt is the clearest cross-metric third-base anchor. 2. Different metrics reveal different kinds of third-base greatness.

Both points are important.

Limitations

This chapter uses regular third basemen with at least five qualified third-base seasons and matched values across the three systems. That makes the comparison cleaner, but it also means the study is focused on third-base regulars, not every player who ever appeared at third base.

The combined z-score uses this project’s offensive Model C and traditional defensive model. It does not include modern defensive metrics, park factors, or full run-value modeling.

WAR includes many components that the z-score model does not.

wRC+ is a rate statistic and should not be treated as an accumulated career value measure. That is why the study includes average offensive z-score per qualified season when comparing to wRC+.

The composite percentile score is a summary tool. It is not a new definitive metric. It is best used to identify players who remain strong across multiple systems.

Conclusion

Study 1 compares three ways of measuring third-base greatness: combined z-score, WAR, and wRC+.

The main result is clear.

Mike Schmidt is the strongest cross-metric third baseman in the study.

He ranks first by combined z-score, first by WAR, second by wRC+, and first by the three-metric composite.

Eddie Mathews, Chipper Jones, Wade Boggs, George Brett, Home Run Baker, Alex Rodriguez, Ron Santo, Scott Rolen, and Jose Ramirez also emerge as strong cross-metric performers.

But the disagreements are just as important.

Brooks Robinson shows that wRC+ cannot capture defensive greatness.

Dick Allen shows that offensive rate greatness is not the same as all-around third-base accumulation.

Adrian Beltre, Nolan Arenado, and Scott Rolen show the importance of two-way value.

The larger conclusion is this:

Third-base greatness is not one-dimensional.

Z-scores show peer separation.
WAR shows total value.
wRC+ shows offensive quality.

The best third basemen are the ones who remain visible when the lens changes.

By that standard, Mike Schmidt stands at the center of the argument.

The Shape of a Pitching Staff: A Dendrogram of MLB Team Pitching, 2001-2025

The earlier chapter asked a ranking question: which MLB organizations built the best pitching staffs from 2001 to 2025?

This post asks a slightly different question.

Which teams pitched alike?

That is not the same thing. Two teams can both be good without being similar. One team might dominate through strikeouts and fielding-independent indicators. Another might prevent runs through contact control, ground balls, park fit, or bullpen management. A third might have a mixed profile, with decent run prevention but weaker underlying indicators. Ranking tells us who was best. Clustering tells us who had the same shape.

That is why a dendrogram is useful. It does not begin with a leaderboard. It begins with resemblance.

The question becomes: if we describe each franchise by its long-term pitching profile, which franchises naturally group together?

The method

Each franchise was described using season-normalized pitching variables from 2001 through 2025. That step is important because pitching environments changed dramatically during this period. A 4.00 ERA in 2001 does not mean exactly the same thing as a 4.00 ERA in 2025.

So each team-season was first compared to its own season.

z_{i,y,m} = \frac{ X_{i,y,m} - \overline{X}_{y,m} }{ s_{y,m} }

Here, \(X_{i,y,m}\) is team \(i\)’s value for metric \(m\) in season \(y\), \(\overline{X}_{y,m}\) is the league average for that metric in that season, and \(s_{y,m}\) is the standard deviation of that metric across teams in that season.

For lower-is-better metrics, such as ERA-, FIP-, xFIP-, SIERA, BB%, HR/9, HR/FB, and Hard%, I reversed the sign so that higher values always mean a more favorable pitching profile.
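The within-season standardization and sign reversal can be sketched as follows. This is a minimal illustration under stated assumptions: the row layout, the function name `season_z_scores`, and the team-season values are hypothetical, and the sample standard deviation is assumed because the text does not specify which form was used.

```python
from collections import defaultdict
from statistics import mean, stdev

# Metrics the text lists as lower-is-better; their z-scores get a sign flip.
LOWER_IS_BETTER = {"ERA-", "FIP-", "xFIP-", "SIERA", "BB%", "HR/9", "HR/FB", "Hard%"}

def season_z_scores(rows, metric):
    """Return {(team, season): z}, standardized within each season.

    rows: list of dicts with 'team', 'season', and the metric as keys.
    Higher output always means a more favorable pitching profile.
    """
    by_season = defaultdict(list)
    for row in rows:
        by_season[row["season"]].append(row[metric])

    sign = -1.0 if metric in LOWER_IS_BETTER else 1.0
    out = {}
    for row in rows:
        vals = by_season[row["season"]]
        z = (row[metric] - mean(vals)) / stdev(vals)  # sample SD; an assumption
        out[(row["team"], row["season"])] = sign * z
    return out

# Hypothetical team-seasons; not values from the study.
rows = [
    {"team": "AAA", "season": 2001, "ERA-": 90},
    {"team": "BBB", "season": 2001, "ERA-": 100},
    {"team": "CCC", "season": 2001, "ERA-": 110},
    {"team": "AAA", "season": 2025, "ERA-": 105},
    {"team": "BBB", "season": 2025, "ERA-": 95},
    {"team": "CCC", "season": 2025, "ERA-": 100},
]

z = season_z_scores(rows, "ERA-")
for key, value in sorted(z.items()):
    print(key, f"{value:+.2f}")
```

In the toy data, team AAA's 90 ERA- in 2001 becomes +1.00 after the sign flip, while its 105 ERA- in 2025 becomes -1.00. Each value is judged only against the other teams in its own season.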

The clustering used these long-term franchise traits:

Category | Variables
Value | WAR
Dominance | K-BB%
Command | BB% prevention
Run prevention | ERA-
Fielding-independent skill | FIP-, xFIP-, SIERA
Home-run control | HR/9, HR/FB
Contact profile | GB%, Hard% prevention
Starter usage | Quality-start rate

After calculating each franchise’s average profile, I standardized the franchise-level variables and used Ward hierarchical clustering. The distance between teams is based on how far apart their standardized pitching profiles are.

d(i,j) = \sqrt{ \sum_{m=1}^{p} \left( z_{i,m} - z_{j,m} \right)^2 }

The dendrogram then links the most similar teams first and gradually joins them into larger groups.
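To make the linking step concrete, here is a naive pure-Python sketch of Ward-style agglomeration on a handful of hypothetical two-variable profiles. It is not the code used for the figure. In practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"` would normally do this work, and library merge heights may be scaled differently from the raw cost computed here. The team labels and values are invented for illustration.

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two standardized profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Coordinate-wise mean of a list of profiles."""
    return [sum(col) / len(points) for col in zip(*points)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    na, nb = len(a), len(b)
    return (na * nb) / (na + nb) * euclidean(centroid(a), centroid(b)) ** 2

def ward_merges(profiles):
    """Return the merge order as a list of (members_a, members_b, cost)."""
    clusters = {(name,): [vec] for name, vec in profiles.items()}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        # Pick the pair whose merge adds the least within-cluster variance.
        cost, a, b = min(
            ((ward_cost(clusters[a], clusters[b]), a, b)
             for i, a in enumerate(keys) for b in keys[i + 1:]),
            key=lambda t: t[0],
        )
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, cost))
    return merges

# Hypothetical standardized franchise profiles (two variables each).
profiles = {
    "T1": [1.5, 1.2],
    "T2": [1.4, 1.0],
    "T3": [-1.0, -1.3],
    "T4": [-1.2, -1.1],
    "T5": [0.1, 0.2],
}

merges = ward_merges(profiles)
for a, b, cost in merges:
    print(a, "+", b, f"cost={cost:.3f}")
```

The most similar pair (T1 and T2) joins first, the next closest pair (T3 and T4) joins second, and the middle profile T5 attaches to the nearer group before the two large groups finally merge. A dendrogram draws the same sequence from the bottom up.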

Figure 1. MLB team pitching identity dendrogram, 2001-2025

The first thing to notice is that the dendrogram is not only a quality ranking. It does separate many of the best pitching organizations, but it also captures style.

The Dodgers, Yankees, Astros, Guardians, Phillies, Cubs, and Red Sox form a major cluster. That makes sense. These organizations score well across the strongest modern pitching indicators: WAR, K-BB%, FIP-, xFIP-, and SIERA. This is the elite skill-and-value cluster.

But the Braves, Giants, and Cardinals form a different group. They are not grouped with the Dodgers and Yankees, even though they include strong pitching organizations. Their similarities lie more in run prevention, home-run suppression, ground-ball tendency, and contact management. In other words, their profile is not merely “good pitching.” It is a particular kind of good pitching.

The Rays, Brewers, Padres, Blue Jays, and Diamondbacks form another interesting group. This cluster is more modern and peripheral-driven. These teams tend to show some strength in strikeout-minus-walk skill and xFIP/SIERA-style indicators, but they are not as dominant in long-term value or run prevention as the elite group.

At the other end, the Rockies, Royals, Orioles, Reds, and Rangers group together as long-term struggling pitching profiles. That does not mean each franchise was bad every year. It means that across the full 2001-2025 window, their average profile shares several weaknesses: lower WAR, weaker K-BB%, weaker fielding-independent indicators, and poorer home-run prevention.

The six main clusters

The dendrogram produced six useful interpretive groups:

Cluster | Teams | Interpretation
Elite skill and value staffs | BOS, CHC, CLE, HOU, LAD, NYY, PHI | Strongest overall skill profile, especially WAR, K-BB%, FIP-, xFIP-, and SIERA
Contact-control run preventers | ATL, SFG, STL | Strong run prevention, home-run control, ground-ball tendency, and starter length
Modern peripheral builders | ARI, MIL, SDP, TBR, TOR | Better in K-BB%, xFIP, and SIERA than in long-term WAR or ERA dominance
Mixed middle profiles | ATH, CHW, DET, LAA, MIN, NYM, SEA, WSN | No single shared identity as strong as the other groups, generally middle-range profiles
Low-dominance HR suppressors | MIA, PIT | Weak dominance indicators but relatively better home-run suppression
Long-term struggling profiles | BAL, CIN, COL, KCR, TEX | Broadly weak long-term pitching profile across value, dominance, and fielding-independent metrics

Why the clusters formed

Figure 2 explains the dendrogram. It shows the average profile of each cluster.

The elite skill-and-value group is strong almost everywhere that modern pitching analysis would expect. The group is especially strong in K-BB%, FIP-, xFIP-, SIERA, and WAR. This is the clearest “modern excellence” cluster.

The contact-control group is different. Atlanta, San Francisco, and St. Louis are not grouped primarily because of overwhelming strikeout dominance. They cluster because of run prevention, home-run prevention, ground-ball tendency, and starter length. This is a more traditional-looking run-prevention cluster.

The modern peripheral group is subtle. Arizona, Milwaukee, San Diego, Tampa Bay, and Toronto do not dominate the long-term value category. Still, they show enough similarity in strikeout-minus-walk skill and advanced indicators to cluster together. This feels like a group of organizations that, at different times, leaned into modern pitching design without producing the same full-period dominance as the Dodgers or Yankees.

The struggling group is also clear. Baltimore, Cincinnati, Colorado, Kansas City, and Texas sit below average in most of the key categories. Colorado’s presence is not surprising because pitching in Denver creates a unique environmental problem. But the dendrogram is not only about park effects. The broader cluster reflects weaker long-term skill indicators too.

Why the Dodgers separate

The Dodgers are inside the elite skill-and-value cluster, but they remain visually distinctive. They join the group later than several other teams inside that cluster. That is important.

It suggests that the Dodgers are similar to other elite pitching organizations but are still somewhat their own thing. Their long-term profile is so strong across so many categories that they are not simply interchangeable with the Yankees, Astros, Guardians, Phillies, Cubs, or Red Sox.

That matches the earlier chapter. The Dodgers were the top pitching organization by average normalized score from 2001 to 2025. The dendrogram confirms the same story from a different angle. They are not only highly ranked. They have a recognizable organizational profile.

Houston and Cleveland as development stories

Houston and Cleveland are also revealing.

The Astros cluster with the elite organizations, but their full-period story contains more transformation than the Dodgers’ story. Houston’s early-2010s pitching collapse and late-2010s pitching rise are both part of the same 25-year average. Even with those bad seasons included, Houston still clusters with the strongest pitching organizations. That says something about how powerful the later Houston pitching model became.

Cleveland’s placement also makes sense. The Guardians are not always discussed like the Dodgers or Yankees because they operate with a different market profile, but the data places them in the same broad pitching family. Their identity is built around development, strike-zone control, and repeatable pitching skill.

Tampa Bay, Milwaukee, and the modern middle

The Rays and Brewers are especially interesting because they appear in the modern peripheral group rather than the elite skill-and-value group.

That is not an insult. It may actually be the more interesting result.

Tampa Bay and Milwaukee often represent modern pitching creativity: bullpen flexibility, role adaptation, pitcher development, and tactical staff construction. But over the full 25-year period, the dendrogram places them closer to San Diego, Toronto, and Arizona than to the Dodgers or Yankees.

That suggests a distinction between modernity and dominance. A team can be tactically modern without producing the same long-term value profile as the very top organizations.

What the dendrogram adds

The dendrogram adds something that a leaderboard cannot.

A leaderboard says:

The Dodgers were first.
The Yankees were second.
The Astros, Guardians, Cubs, Red Sox, Phillies, and Braves followed.

The dendrogram says something different:

The Dodgers, Yankees, Astros, Guardians, Cubs, Red Sox, and Phillies belong to a shared skill-and-value family.

The Braves, Giants, and Cardinals form a separate contact-control and run-prevention family.

The Rays, Brewers, Padres, Blue Jays, and Diamondbacks form a modern peripheral family.

The Rockies, Royals, Orioles, Reds, and Rangers share a long-term struggling profile.

That is the value of clustering. It changes the question from “who was better?” to “who was built alike?”

Conclusion

The central lesson is that pitching identity has shape.

Some organizations built strong staffs across nearly every modern indicator. Some built staffs that prevented runs through contact management and home-run control. Some had modern peripheral strengths without the same long-term value profile. Some struggled across nearly every measure.

The dendrogram does not replace the chapter’s rankings. It deepens them.

From 2001 to 2025, the best pitching organizations were not merely collecting arms. They were building systems. The Dodgers built the most consistent system. The Yankees, Astros, Guardians, Phillies, Cubs, and Red Sox clustered near them because they also produced strong long-term skill profiles. The Braves, Giants, and Cardinals showed a different path, one built more around run prevention and contact control.

The larger point is simple: team pitching is not just performance. It is identity.

And the dendrogram shows us how identity takes shape.

Pitching in the Age of the Strikeout: MLB Team Pitching, 2001-2025

Pitching is one of the most difficult parts of baseball to measure because it sits at the intersection of many things. A pitcher controls the ball, but not everything that happens after contact. A defense turns balls in play into outs, or fails to. A park changes the meaning of a fly ball. A league environment changes the meaning of a 4.00 ERA. A bullpen changes the way we understand a starter. A front office changes the way we understand a pitching staff.

That is why a long-term team pitching study has to be deliberate and careful. If we simply rank every team from 2001 to 2025 by ERA, we are mixing together very different run environments. A 3.70 team ERA in one season does not mean exactly the same thing as a 3.70 team ERA in another. The offensive environment changes. The baseball changes. Strikeout rates change. Bullpen usage changes. Even the definition of a normal starting pitcher changes.

So the goal of this chapter is not merely to ask which team had the lowest ERA. The better question is this: which organizations consistently produced strong pitching staffs relative to their own era?

That is a critical distinction. A team does not pitch in the abstract. It pitches in a particular season, against a particular league, with a particular baseball, inside a particular tactical environment. The 2001 Diamondbacks, the 2011 Phillies, the 2017 Guardians, the 2018 Astros, and the 2024 Braves all belong to the same broad story, but they do not belong to the same pitching world.

The data in this study covers MLB team pitching from 2001 through 2025. The core variables include ERA, ERA-, FIP, FIP-, xFIP, xFIP-, SIERA, WAR, K%, BB%, K-BB%, HR/9, HR/FB, complete games, quality starts, and several contact-profile measures. For 2001, xFIP, SIERA, and detailed contact data are incomplete, so the 2001 season is included in the main study but handled carefully where those variables are missing.

The central finding is straightforward: from 2001 to 2025, team pitching moved decisively toward strikeout-based run prevention. The best organizations were not simply the ones that prevented runs in a given season. They were the ones that repeatedly built staffs with strong strikeout-minus-walk rates, strong fielding-independent indicators, and enough depth to remain competitive across changing offensive environments.

The Dodgers stand out most clearly. Over the full 25-year period, they were the strongest pitching organization by average normalized pitching score. The Yankees, Astros, Guardians, Cubs, Red Sox, Phillies, and Braves also appear near the top. But the Dodgers are the outlier, not because of one spectacular season, but because of repeated organizational excellence.

Method: comparing teams within seasons

The first methodological problem is that pitching statistics are unstable over time. A league-average pitching staff in 2001 did not look like a league-average pitching staff in 2025. Strikeouts increased. Complete games declined. Velocity rose. Bullpen usage expanded. Home-run rates surged and retreated. If we compare raw numbers across all years, we risk confusing historical context with team quality.

To solve this, each team-season was compared only to the other teams from the same season. The 2011 Phillies were compared to the league as a whole in 2011, the 2018 Astros to the league in 2018, and the 2024 Braves to the league in 2024. This allows us to ask which staffs were exceptional relative to the environment in which they actually pitched.

The basic within-season z-score is:

z_y(X_{i,y}) = \frac{ X_{i,y} - \overline{X}_y }{ s_y(X) }

Here, (X_{i,y}) is a statistic for team (i) in season (y), (\overline{X}_y) is the league average for that statistic in that season, and (s_y(X)) is the standard deviation across teams in that season.

For statistics where higher is better, such as WAR and K-BB%, the z-score is used directly. For statistics where lower is better, such as ERA-, FIP-, xFIP-, and SIERA, the sign is reversed. This keeps the interpretation consistent. A higher score always means better pitching.

q_{i,y,m} = \begin{cases} z_y(m_{i,y}), & \text{if higher values are better} \\ -z_y(m_{i,y}), & \text{if lower values are better} \end{cases}

The overall pitching score is then the average of the available component scores:

\text{Pitching Score}_{i,y} = \frac{1}{|M_{i,y}|} \sum_{m \in M_{i,y}} q_{i,y,m}

The metric set is:

M = \left\{ \text{WAR}, \text{K-BB\%}, \text{ERA-}, \text{FIP-}, \text{xFIP-}, \text{SIERA} \right\}

This score is not meant to be the only possible definition of pitching quality. It is a deliberately balanced measure. It includes actual run prevention through ERA-, fielding-independent performance through FIP- and xFIP-, skill-based run estimation through SIERA, skill-based dominance through K-BB%, and overall value through WAR.
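The scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual code: the function names, the three-team toy season, and the choice of the sample standard deviation are all assumptions, since the text does not say which form of the standard deviation was used.

```python
from statistics import mean, stdev

# Metrics where a lower raw value means better pitching, per the text.
LOWER_IS_BETTER = {"ERA-", "FIP-", "xFIP-", "SIERA"}


def season_z_scores(values):
    """Within-season z-scores for one metric; `values` maps team -> raw value."""
    avg = mean(values.values())
    sd = stdev(values.values())  # assumption: sample standard deviation
    return {team: (x - avg) / sd for team, x in values.items()}


def pitching_scores(season):
    """`season` maps metric -> {team: value}; returns team -> composite score."""
    components = {}
    for metric, values in season.items():
        # Flip the sign so that a higher score always means better pitching.
        sign = -1.0 if metric in LOWER_IS_BETTER else 1.0
        for team, z in season_z_scores(values).items():
            components.setdefault(team, []).append(sign * z)
    # Average over whichever component scores are available for each team.
    return {team: mean(qs) for team, qs in components.items()}


# Hypothetical three-team season with two of the six metrics.
toy = {
    "K-BB%": {"AAA": 16.0, "BBB": 12.0, "CCC": 8.0},
    "ERA-": {"AAA": 90.0, "BBB": 100.0, "CCC": 110.0},
}
scores = pitching_scores(toy)
```

In the toy season, team AAA is one standard deviation better than average on both metrics, so its composite score is 1.0. A team missing a metric (as with xFIP or SIERA in 2001) would simply be averaged over fewer components.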

K-BB% is especially important because it captures the two plate appearance outcomes most directly controlled by the pitcher: strikeouts and walks.

\text{K-BB\%} = \text{K\%} - \text{BB\%}

FIP also deserves special attention because it attempts to isolate the events most directly connected to the pitcher: home runs, walks, hit batters, and strikeouts.

\text{FIP} = \frac{ 13 \cdot \text{HR} + 3 \cdot (\text{BB} + \text{HBP}) - 2 \cdot \text{K} }{ \text{IP} } + c_{\text{FIP}}

The constant (c_{\text{FIP}}) places FIP on an ERA-like scale. Because the run environment changes from season to season, FIP- and other indexed measures help compare teams more fairly.
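As a rough sketch, the FIP formula translates directly into code. The team totals and the constant 3.10 below are hypothetical values chosen only for illustration. Innings must be true decimal innings, not the box-score notation in which ".1" means one third.

```python
def fip(hr, bb, hbp, k, ip, c_fip):
    """Fielding Independent Pitching, following the formula above."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + c_fip


# Hypothetical team totals; c_fip = 3.10 is an illustrative constant,
# not the value for any particular season.
team_fip = fip(hr=180, bb=500, hbp=60, k=1400, ip=1440.0, c_fip=3.10)
print(round(team_fip, 2))
```

With these inputs the weighted events sum to 1,220 over 1,440 innings, so the staff's FIP comes out near 3.95.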

The league changes: strikeouts become the center of pitching

The first figure shows the most important league-wide transformation: the rise of strikeouts.

Figure 1. K%, BB%, and K-BB% trends, 2001-2025

In 2001, the league strikeout rate was about 17.3%. By 2025, it was about 22.2%. That is not a small tactical adjustment. It is a structural change in how pitching works. The modern pitching staff is built around missing bats in a way that the early 2000s staff was not.

Walk rate did not change nearly as dramatically. In 2001, the league walk rate was about 8.4%. In 2025, it was about 8.4% again. There were fluctuations in between, but the broad pattern is clear: strikeouts rose much more than walks did.

That means K-BB% increased substantially. In 2001, league K-BB% was about 8.9%. In 2025, it was about 13.8%. That is the heart of the modern pitching revolution. The best staffs are not just striking out more hitters. They are increasing the gap between strikeouts and walks.
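As a quick check on the arithmetic, using the rounded league rates quoted above:

\text{K-BB\%}_{2001} \approx 17.3 - 8.4 = 8.9 \qquad \text{K-BB\%}_{2025} \approx 22.2 - 8.4 = 13.8

Because the walk rate is the same at both endpoints, the entire 4.9-point gain in K-BB% comes from the 4.9-point gain in K%.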

This is why K-BB% belongs near the center of the composite score. It is simple, but powerful. It strips pitching down to a basic contest: can the staff create strikeouts without giving back too many free baserunners?

The answer, increasingly, is yes. But not all teams answered equally well.

Run prevention, FIP, and the changing meaning of ERA

The second figure compares ERA, FIP, xFIP, and SIERA over time. Note that some of the lines overlap in the figure.

Figure 2. ERA, FIP, xFIP, and SIERA trends, 2001-2025

One of the striking features of the long-term data is that ERA does not move in one simple direction. The league ERA was about 4.42 in 2001. It fell to about 3.74 in 2014, rose to about 4.51 in 2019, and then settled around 4.16 in 2025.

That pattern matters because it reminds us that pitching quality cannot be evaluated solely by raw ERA. A team can have a lower ERA because it is genuinely better, but it can also have a lower ERA because the entire league is scoring less. Likewise, a higher ERA in a high-offense environment may not be as bad as it looks.

The 2014 season is a useful example. League run prevention was strong. ERA, FIP, xFIP, and SIERA all sat at relatively low levels. A good pitching staff in 2014 had to be judged against that lower-scoring context. The opposite problem appears in 2019, when the home-run environment pushed run scoring upward. A team that survived 2019 with strong FIP-based indicators deserves credit, given the more difficult environment.

That is why indexed statistics such as ERA- and FIP- are valuable. They tell us how a staff performed relative to league average, where lower is better.

\text{ERA-} = 100 \cdot \frac{ \text{Team ERA} }{ \text{League ERA} } \quad \text{adjusted for context}

A team with an ERA- of 90 was roughly 10% better than league average by that measure. A team with an ERA- of 110 was roughly 10% worse. The same logic applies to FIP-, except the foundation is fielding-independent pitching rather than actual runs allowed.
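A small sketch makes the point concrete. It uses a simplified ERA- with no park adjustment, which is an assumption: the published statistic also adjusts for context. The league ERAs are the 2014 and 2019 figures quoted earlier, and the 3.70 team ERA is illustrative.

```python
def era_minus(team_era, league_era):
    """Simplified ERA-: 100 * team ERA / league ERA, with no park adjustment."""
    return 100.0 * team_era / league_era


# Same raw team ERA, two different league environments from the text.
low_run = era_minus(3.70, 3.74)   # 2014 league ERA
high_run = era_minus(3.70, 4.51)  # 2019 league ERA
print(round(low_run, 1), round(high_run, 1))
```

Under these assumptions, the same 3.70 ERA grades as roughly league average in the 2014 environment (about 99) and roughly 18% better than average in the 2019 environment (about 82).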

This distinction becomes crucial when comparing staffs across 25 seasons. The 2011 Phillies and the 2018 Astros both appear as historically great staffs, but they do not look great in exactly the same way. The Phillies represent a more traditional elite rotation model. The Astros represent the modern strikeout, command, and run-prevention model.

The home-run environment

The third figure shows the home-run environment.

Figure 3. HR/9 and HR/FB trends, 2001-2025

Home runs are one of the most important pressure points in modern pitching analysis because they connect individual pitcher skill, batted-ball profile, park context, and league environment. In 2001, league HR/9 was about 1.14. It dipped below 1.00 in several seasons, including 2010 and 2014, then spiked dramatically in 2019, reaching about 1.41 HR/9.

The 2019 season stands out immediately. It was not merely a season with more scoring. It was a season in which the relationship between contact and damage changed. HR/FB also rose sharply, reaching about 15.3% in 2019. That created a very different environment for pitchers.

This is one reason xFIP can be useful. FIP uses actual home runs allowed. xFIP replaces them with an expected total based on fly balls allowed and the league-average HR/FB rate. Neither statistic is perfect. FIP gives pitchers the actual cost of the home runs they allowed. xFIP asks whether that home-run rate was likely to persist.

For a team-level study, the difference between FIP and xFIP can be revealing. A team with a much better FIP than xFIP may have suppressed home runs unusually well, perhaps through park effects, pitcher skill, batted-ball management, or some combination of these. A team with a much worse FIP than xFIP may have been punished by an elevated home-run rate.
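The FIP-xFIP gap can be sketched as follows. xFIP here swaps actual home runs for fly balls multiplied by the league HR/FB rate, which is the standard construction. The team totals and the constant 3.10 are hypothetical, and the 15.3% league rate is the 2019 figure quoted above.

```python
def fip(hr, bb, hbp, k, ip, c_fip):
    """FIP using actual home runs allowed."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + c_fip


def xfip(fb, lg_hr_per_fb, bb, hbp, k, ip, c_fip):
    """xFIP: same formula, with home runs replaced by expected home runs."""
    expected_hr = fb * lg_hr_per_fb
    return (13 * expected_hr + 3 * (bb + hbp) - 2 * k) / ip + c_fip


# Hypothetical staff: 200 actual HR on 1,500 fly balls in a 15.3% HR/FB league.
staff = dict(bb=500, hbp=60, k=1400, ip=1440.0, c_fip=3.10)
team_fip = fip(hr=200, **staff)
team_xfip = xfip(fb=1500, lg_hr_per_fb=0.153, **staff)
gap = team_fip - team_xfip  # negative: fewer HR allowed than expected
```

This hypothetical staff allowed about 30 fewer home runs than its fly-ball total would predict, so its FIP sits roughly a quarter of a run below its xFIP. Whether that reflects park, skill, or luck is exactly the question the gap raises.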

The home-run environment also helps explain why season-normalization is necessary. A team pitching in 2019 faced a different kind of run-prevention problem than a team pitching in 2014. The raw numbers alone cannot tell us whether a staff was good. They have to be interpreted against the league context.

The disappearance of the complete game

The fourth figure captures one of the clearest tactical changes in baseball.

Figure 4. Complete games and quality starts, 2001-2025

In 2001, MLB teams combined for 199 complete games. In 2002, that number was 214. By 2025, it had fallen to 29.

This is not a gradual stylistic preference. It is a transformation in pitcher usage. The complete game went from a normal, if still special, part of pitching to a rare event. The starting pitcher’s job changed. The bullpen’s job changed. The manager’s job changed. The entire architecture of run prevention changed.

Quality starts also declined. In 2001, teams combined for 2,342 quality starts. In 2025, that number was 1,676. Unlike complete games, quality starts did not come close to vanishing; they simply became less central to how team pitching is organized.

This matters because a traditional pitching staff was often understood through the front of the rotation. The ace mattered. The number two starter mattered. The innings-eater mattered. In the modern game, those categories still matter, but they are less complete descriptions of team pitching quality. A staff can be excellent because it has dominant starters, but it can also be excellent because it has a deep bullpen, matchup flexibility, velocity, strikeout depth, and player-development infrastructure.

This is one reason team-level pitching analysis is valuable. It captures the staff as an organization, not merely as a list of starting pitchers.

The organizational scoreboard

The franchise-level results show which organizations repeatedly built strong pitching staffs across the full period.

Figure 5. Franchise average pitching score, 2001-2025

The top organizations by average normalized pitching score were:

Rank  Franchise  Avg Score  Avg Rank  Top-5 Seasons
1     LAD        1.043      5.80      16
2     NYY        0.785      8.12      10
3     HOU        0.470      11.20     8
4     CLE        0.447      12.16     9
5     CHC        0.363      12.80     6
6     BOS        0.361      12.44     6
7     PHI        0.356      11.88     6
8     ATL        0.344      12.60     7

The Dodgers are the clear leader. Their average rank was 5.80 across 25 seasons, and they finished in the top five 16 times. That is a remarkable level of consistency.
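The three summary columns can be derived from team-season scores along these lines. This is a sketch under assumptions: the function name and toy data are hypothetical, and ties within a season are broken by team code, which the text does not specify.

```python
from collections import defaultdict
from statistics import mean


def franchise_summary(scores, top_n=5):
    """`scores` maps (season, team) -> pitching score.

    Returns team -> average score, average within-season rank,
    and number of seasons ranked in the top `top_n`.
    """
    by_season = defaultdict(list)
    for (season, team), score in scores.items():
        by_season[season].append((score, team))

    ranks = defaultdict(list)
    for rows in by_season.values():
        # Rank 1 is the best score in that season.
        for rank, (_, team) in enumerate(sorted(rows, reverse=True), start=1):
            ranks[team].append(rank)

    totals = defaultdict(list)
    for (_, team), score in scores.items():
        totals[team].append(score)

    return {
        team: {
            "avg_score": mean(totals[team]),
            "avg_rank": mean(ranks[team]),
            "top_n": sum(r <= top_n for r in ranks[team]),
        }
        for team in totals
    }


# Hypothetical two-season, three-team league; top_n=1 counts first-place finishes.
toy = {
    (2001, "AAA"): 1.2, (2001, "BBB"): 0.1, (2001, "CCC"): -1.3,
    (2002, "AAA"): 0.4, (2002, "BBB"): 0.9, (2002, "CCC"): -1.3,
}
summary = franchise_summary(toy, top_n=1)
```

Ranking within each season before averaging is what keeps the franchise comparison on the same era-relative footing as the scores themselves.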

The Yankees also stand out. They were not as dominant as the Dodgers by average score, but they were consistently strong. Their average rank was 8.12, and they had 10 top-five seasons.

Houston’s position is interesting because the Astros’ 25-year period includes both very bad years and elite years. Their full-period average ranks third, but that average hides a sharp organizational transformation. The 2013 Astros appear among the worst pitching seasons in the dataset. The 2018 and 2019 Astros appear among the strongest. That makes Houston one of the most dramatic before-and-after stories in the study.

Cleveland also deserves attention. The Guardians were not merely good in one season. They produced nine top-five seasons across the full period, including the remarkable 2017 staff and the shortened-season 2020 staff. Cleveland’s results point toward a consistent ability to develop or acquire pitching skill, especially strikeout and command skill.

The Phillies are different. Their full-period average is strong, but their story is anchored by the 2011 staff, the top single-season result in the study. The Phillies’ score is not just about consistency. It is about peak excellence.

The best team pitching seasons

The best individual team pitching seasons in the study were:

Figure 6. Best team pitching seasons, 2001-2025

Rank  Season  Team  Score  WAR    ERA-   FIP-   K-BB%
1     2011    PHI   2.472  29.45  78.58  82.88  14.75%
2     2017    CLE   2.464  30.35  72.49  75.44  20.59%
3     2018    HOU   2.442  28.63  75.89  78.23  21.17%
4     2013    DET   2.089  26.26  89.31  82.13  15.79%
5     2024    ATL   2.059  23.62  84.13  86.61  18.47%

The top three are especially revealing because they show three different versions of elite pitching.

The 2011 Phillies represent the great traditional staff. Their rotation was the center of the story. Their ERA- was 78.58, meaning they were far better than league average at preventing runs. Their FIP- was also excellent at 82.88. They were not merely outperforming their peripherals. They were genuinely strong across the major indicators.

The 2017 Guardians look like a bridge between traditional excellence and modern dominance. Their ERA- was 72.49, the best among the top five listed here, and their FIP- was 75.44. Their K-BB% was 20.59%, which is extraordinary. This is a staff that combined run prevention, fielding-independent strength, and strikeout-minus-walk dominance.

The 2018 Astros are the modern model. Their K-BB% was 21.17%, the highest among these top five. Their ERA- and FIP- were both outstanding. They did not merely prevent runs. They controlled the plate appearance.

That phrase may be the key to the whole chapter: the modern elite staff controls the plate appearance. It wins by turning fewer balls into uncertain events. More strikeouts. Fewer walks. Better home-run control. Better matchup deployment. Better depth.

The 2013 Tigers are also fascinating. Their FIP- was much stronger than their ERA-, which suggests a staff whose fielding-independent indicators were better than its actual run prevention. That kind of gap is analytically useful because it may point toward defense, sequencing, bullpen leakage, park effects, or simple variation.
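The size of that gap is easy to check from the table. The sketch below copies the ERA- and FIP- values for the five listed seasons and computes ERA- minus FIP-; the variable names are illustrative only.

```python
# (season, team) -> (ERA-, FIP-), copied from the top-five table above.
top_five = {
    (2011, "PHI"): (78.58, 82.88),
    (2017, "CLE"): (72.49, 75.44),
    (2018, "HOU"): (75.89, 78.23),
    (2013, "DET"): (89.31, 82.13),
    (2024, "ATL"): (84.13, 86.61),
}

# Positive gap: actual run prevention was worse than the FIP-based estimate.
gaps = {key: round(era_m - fip_m, 2) for key, (era_m, fip_m) in top_five.items()}
largest = max(gaps, key=gaps.get)

for (season, team), gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(season, team, f"{gap:+.2f}")
```

Detroit is the only one of the five with a positive gap, at about +7.2 index points. The other four staffs all posted an ERA- slightly better than their FIP-.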

The 2024 Braves round out the top five, showing that the modern model remains alive. Strong WAR, strong ERA-, strong FIP-, and excellent K-BB% place them among the best team pitching seasons of the last 25 years.

The heat map view: organizational memory

The heat map shows how pitching strength is distributed over time.

Figure 7. Team pitching score heat map, 2001-2025

A heat map is useful because it shows continuity. A table gives us leaders. A heat map gives us memory.

The Dodgers’ consistency becomes visible immediately. They do not merely spike and disappear. They remain strong across many different league environments. This suggests that their pitching success is not just the product of one rotation or one era. It is organizational.

The Yankees also show long-term strength, although with a different shape. They remain regularly above average, but the Dodgers’ top-end consistency is stronger.

Houston’s pattern is more dramatic. The Astros transitioned from poor pitching during the rebuilding years to elite pitching in the late 2010s and beyond. This makes them one of the best examples of organizational reinvention in the dataset.

Cleveland’s pattern is also compelling. The Guardians rarely have the resources of the largest-market teams, but the pitching results are consistently strong enough to suggest a real developmental identity. Cleveland’s peak seasons are not accidents.

Tampa Bay deserves separate attention as well. The Rays do not rank at the very top over the full period, but their modern pitching identity is clear. They are one of the organizations most associated with bullpen creativity, opener usage, and flexible staff construction. A team-level study captures some of that, although a starter-reliever split would make the story even sharper.

While the heat map shows the full league, a smaller set of franchise trajectories makes the organizational story easier to see. Figure 8 follows several teams that help define the period: the Dodgers, Yankees, Astros, Guardians, Rays, Phillies, and Braves. The Dodgers show sustained excellence. Houston shows dramatic organizational reinvention. Cleveland and Tampa Bay show the value of pitching development and tactical adaptation. Philadelphia shows the difference between peak rotation dominance and long-term consistency.

Figure 8. Selected franchises, 2001-2025

Three eras of team pitching

Breaking the study into periods helps clarify the historical movement.

From 2001 to 2009, the leading organizations were:

Period     Rank  Franchise  Avg Score  Avg Rank  Top-5 Seasons
2001-2009  1     CHC        0.980      6.89      4
2001-2009  2     LAD        0.840      7.33      5
2001-2009  3     ARI        0.805      8.78      5
2001-2009  4     BOS        0.791      8.22      4
2001-2009  5     NYY        0.733      9.00      4

The early period is more rotation-centered. The Cubs, Dodgers, Diamondbacks, Red Sox, and Yankees all had strong stretches. This era still belongs partly to the older model of staff construction. Starting pitching carries more of the symbolic weight. Complete games are declining, but they have not yet collapsed to modern levels.

From 2010 to 2019, the leaders were:

Period     Rank  Franchise  Avg Score  Avg Rank  Top-5 Seasons
2010-2019  1     LAD        1.202      4.40      7
2010-2019  2     NYY        0.925      6.70      4
2010-2019  3     WSN        0.733      9.20      3
2010-2019  4     CLE        0.709      10.20     6
2010-2019  5     TBR        0.687      8.60      2

This is the period when the modern pitching environment becomes much clearer. Strikeouts rise. Velocity rises. Bullpen roles become more specialized. Cleveland, Tampa Bay, Washington, and Los Angeles all become central parts of the story.

From 2020 to 2025, the leaders were:

Period     Rank  Franchise  Avg Score  Avg Rank  Top-5 Seasons
2020-2025  1     LAD        1.082      5.83      4
2020-2025  2     PHI        0.941      5.67      3
2020-2025  3     MIL        0.808      8.00      2
2020-2025  4     TBR        0.808      6.83      2
2020-2025  5     ATL        0.658      11.33     2

The modern period is especially interesting because it includes the shortened 2020 season, the post-2020 workload reset, and the continuing dominance of strikeout-based staff construction. The Dodgers remain first. The Phillies rise. The Brewers and Rays become central examples of modern pitching development and staff management. The Braves also emerge strongly, especially with the 2024 season.

The worst seasons and the cost of weak pitching infrastructure

The worst team pitching seasons are just as revealing as the best ones.

At the bottom of the dataset are seasons such as the 2025 Rockies, 2006 Royals, 2023 Athletics, 2024 Rockies, and 2013 Astros. These seasons combine weak WAR, poor run prevention, poor FIP-based indicators, and low K-BB%.

The 2025 Rockies had a pitching score of -2.757, with a 125.19 ERA-, 119.75 FIP-, and only an 8.47% K-BB%. The 2006 Royals were similarly poor, with a 124.43 ERA-, 118.20 FIP-, and a 4.15% K-BB%. The 2023 Athletics had a 132.94 ERA-, 122.15 FIP-, and 9.49% K-BB%.

These are not merely bad ERAs. They are broad staff failures. When a team is poor in both run prevention and fielding-independent indicators, the problem is deeper than sequencing or defense. It suggests that the staff is not controlling the strike zone, not limiting damaging contact enough, and not producing enough value.

The 2013 Astros are especially important because they later became one of the strongest pitching organizations in the study. That contrast gives us a natural case study in organizational transformation. Bad pitching staffs do not have to remain bad forever. But the transformation requires more than one good pitcher. It requires a system.

What the study suggests

This first pass suggests several conclusions.

First, team pitching from 2001 to 2025 became increasingly strikeout-centered. The rise in K% and K-BB% is the central statistical movement of the period. It changed what good pitching looks like.

Second, raw ERA is not enough for a long-term study. ERA remains important because runs allowed are real. But ERA must be placed next to FIP, xFIP, SIERA, K-BB%, and indexed measures such as ERA- and FIP-. Otherwise, we risk mistaking league environment for team quality.

Third, complete games and traditional starter workload declined dramatically. This changes how we should think about team pitching. A great staff is no longer just a great rotation. It is a complete run-prevention system.

Fourth, the Dodgers are the strongest pitching organization of the 2001-2025 period. Their dominance is not just peak dominance. It is consistency. They averaged a top-six pitching rank across 25 seasons and finished in the top five 16 times.

Fifth, several organizations deserve deeper case studies. The Astros show organizational reinvention. The Guardians show player-development strength. The Rays show tactical creativity. The Phillies show the power of peak rotation excellence. The Braves show modern staff strength. The Yankees show long-term high-level stability.

Conclusion: pitching as organizational identity

The most important lesson from this study is that pitching is no longer best understood as a collection of individual arms. At the team level, pitching has become an organizational identity.

The best teams do not merely find pitchers. They shape pitching environments. They develop velocity. They manage workloads. They build bullpens. They optimize matchups. They control the strike zone. They use data to turn raw stuff into repeatable advantage.

That is why the Dodgers’ long-term record matters. It is not just that they had good pitchers. Many teams have good pitchers for a year or two. The Dodgers repeatedly built strong pitching staffs across different run environments, tactical eras, and roster cycles.

The same broader lesson applies to Houston, Cleveland, Tampa Bay, Milwaukee, Atlanta, Philadelphia, and New York. The details differ, but the underlying pattern is the same. Modern pitching excellence is systemic.

From 2001 to 2025, baseball shifted toward a game in which the best staffs increasingly controlled plate appearances. Strikeouts rose. Walks became more costly. Home runs reshaped risk. Complete games disappeared. Bullpens expanded. The old image of pitching as one starter carrying a game into the ninth inning gave way to something more distributed, more specialized, and more organizational.

The great pitching staffs of this period are therefore not just statistical outliers. They are historical markers. They show how the game changed, and how the smartest organizations changed with it.

 

The Shape of Defense: What MLB Fielding Metrics Tell Us So Far This Season

Defense is the hardest part of baseball to measure cleanly. That made it worth studying across all 30 MLB teams, using data through the end of June 2026.

Hitting leaves a visible trail. A batter walks, strikes out, singles, doubles, homers, or makes an out. Pitching is more complicated, but it still has a fairly direct statistical language. Strikeouts, walks, home runs, velocity, chase rate, and contact quality all point in recognizable directions. Defense is different. Good defense often appears as absence. The ball that does not fall. The extra base that is not taken. The throw that does not need to be dramatic because the fielder got to the ball early enough.

That makes defensive analysis both frustrating and interesting. One number rarely tells the whole story. Fielding percentage tells us whether a team usually completes the plays it reaches, but it says little about how many plays it reaches in the first place. Errors measure visible mistakes, but not invisible range. Modern metrics try to correct for that. Defensive Runs Saved, Outs Above Average, Fielding Run Value, FanGraphs Def, framing value, arm value, and range value each capture a different part of the defensive picture.

For this study, I used three FanGraphs team defensive leaderboards. The goal was not simply to rank teams. The better question is how the different defensive systems agree, where they disagree, and what kind of defense each team is actually playing.

The main conclusion is clear: so far this season, the Cubs are the strongest defensive team in baseball by a wide margin. But the deeper conclusion is more interesting. OAA, FRV, and FanGraphs Def are telling very similar stories. DRS is related to those measures, but it is not identical. Traditional fielding percentage has some relationship to defensive value, but not nearly enough to stand on its own.

Defense, in other words, is not one thing.

Building a Composite Defensive Score

To compare teams across multiple defensive systems, I created a composite z-score using four broad measures:

  1. Defensive Runs Saved, or DRS
  2. Outs Above Average, or OAA
  3. Fielding Run Value, or FRV
  4. FanGraphs Def

Each metric was standardized across the 30 MLB teams. The z-score for team (i) on metric (m) is:

z_{i,m} = \frac{x_{i,m} - \mu_m}{\sigma_m}

where (x_{i,m}) is team (i)’s value on metric (m), (\mu_m) is the league average for that metric, and (\sigma_m) is the standard deviation across teams.

The composite defensive score is then:

D_i = \frac{ z_{i,\mathrm{DRS}} + z_{i,\mathrm{OAA}} + z_{i,\mathrm{FRV}} + z_{i,\mathrm{Def}} }{4}

This score does not claim that all defensive metrics are perfect or that they rest on the same philosophy. It is simply a way to ask a practical question: which teams look good across several major defensive metrics simultaneously?

The top of the list is not subtle.

Rank  Team  Composite Z  DRS  OAA  FRV  Def
1     CHC   2.41         57   38   34   34.06
2     LAD   1.73         61   23   20   21.77
3     BOS   1.32         42   18   17   19.20
4     ARI   1.30         25   23   23   19.08
5     SDP   1.08         17   16   22   19.61
6     STL   0.89         18   16   20   10.75

The Cubs are not merely first. They are first by a lot. Their composite z-score is 2.41, meaning they are far above league average across the combined defensive measures. The Dodgers are also excellent, but they are closer to the next group than they are to Chicago.

At the other end of the chart, the weakest defensive teams are also fairly clear.

Team  Composite Z  DRS  OAA  FRV  Def
MIN   -1.50        -32  -19  -19  -17.35
SEA   -1.36        -2   -27  -22  -18.83
LAA   -1.24        -3   -18  -23  -19.03
PHI   -1.08        -29  -18  -9   -7.46
ATH   -1.08        -4   -19  -16  -15.51

Minnesota rates last by the composite score. Seattle and the Angels are also deep in negative territory, though they arrive there in slightly different ways. Minnesota is hurt badly by DRS and modern range-based measures. Seattle is particularly poor by OAA and FRV standards. The Angels are near the bottom in FRV and FanGraphs Def.

The first lesson is that the defensive standings have a clear shape. Chicago is alone at the top. The Dodgers lead the next tier. Boston, Arizona, San Diego, and St. Louis form a strong second group. At the bottom, Minnesota, Seattle, the Angels, Philadelphia, and the Athletics are the weakest group.

But rankings are only the beginning.

OAA and FRV Mostly Agree

The tightest relationship in the study is between Outs Above Average and Fielding Run Value.

The regression equation is:

\mathrm{FRV}_i = 0.83 \cdot \mathrm{OAA}_i + 0.77

with:

R^2 = 0.819

That is a strong relationship. This means that about 82% of the variation in team FRV is explained by team OAA in this dataset.
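To see what the fitted line implies for individual teams, the sketch below plugs the OAA totals of the six teams in the composite table into the regression equation and compares the result with their published FRV. The coefficients are the rounded values quoted above, so the residuals are approximate.

```python
def predicted_frv(oaa, slope=0.83, intercept=0.77):
    """FRV predicted from OAA using the rounded regression coefficients."""
    return slope * oaa + intercept


# team -> (OAA, FRV), copied from the composite table above.
teams = {
    "CHC": (38, 34), "LAD": (23, 20), "BOS": (18, 17),
    "ARI": (23, 23), "SDP": (16, 22), "STL": (16, 20),
}

# Residual = actual FRV minus predicted FRV.
residuals = {
    team: round(frv - predicted_frv(oaa), 2) for team, (oaa, frv) in teams.items()
}
for team, res in residuals.items():
    print(team, f"{res:+.2f}")
```

For the Cubs, the line predicts about 32.3 FRV from 38 OAA against an actual 34, a residual under two runs. San Diego is the largest miss of the six, with an FRV nearly eight runs above what its OAA alone would predict.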

That makes intuitive sense. OAA and FRV are closely related modern defensive concepts. Both are trying to move beyond errors and fielding percentage. Both are interested in actual plays made relative to expected plays. Both reward teams that turn difficult batted balls into outs.

The Cubs sit in the upper-right corner of the chart. They are not just good by one metric. They are extreme by both. Chicago has 38 OAA and 34 FRV. Arizona, Los Angeles, San Diego, Boston, and St. Louis also occupy the positive area of the chart. At the bottom, Seattle, the Angels, Minnesota, and the Athletics cluster in negative territory.

This is useful because it gives confidence. If two modern metrics with related but not identical constructions point in the same direction, the result is more persuasive. The Cubs are not a leaderboard accident. Their defensive advantage appears in multiple systems.

DRS Tells a Related but Different Story

DRS also matters, but it does not align with FRV as tightly as OAA does.

The regression equation is:

\mathrm{FRV}_i = 0.43 \cdot \mathrm{DRS}_i - 3.98

with:

R^2 = 0.418

That is still a meaningful relationship, but it is much weaker than the OAA-FRV relationship. DRS and FRV are clearly not measuring the same thing in the same way.

This is where the study becomes more interesting. The Dodgers have the highest DRS total in the dataset, with 61, but they trail the Cubs in FRV, OAA, and Def. The Cubs have slightly lower DRS than Los Angeles, but they dominate in OAA and FRV. Tampa Bay is another example of disagreement. The Rays have a positive DRS total, but their FRV is negative. Philadelphia is negative in both, but much worse in DRS than FRV.

The correlation table reinforces the point. The correlations among the major modern measures are:

Pair         Correlation
OAA and FRV  0.90
FRV and Def  0.97
OAA and Def  0.94
DRS and OAA  0.65
DRS and FRV  0.65
DRS and Def  0.66

The formula for correlation is:

r_{XY} = \frac{ \sum_i (x_i - \bar{x})(y_i - \bar{y}) }{ \sqrt{\sum_i (x_i - \bar{x})^2} \sqrt{\sum_i (y_i - \bar{y})^2} }

This tells us that DRS belongs in the conversation, but it should not be treated as interchangeable with OAA or FRV. When DRS disagrees with the Statcast-style measures, that disagreement is not a nuisance. It is evidence that defensive measurement still depends on the assumptions built into each system.
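The correlation formula above can be implemented directly. The sketch below is a plain-Python version with a hypothetical function name. It is applied to the six teams from the composite table only, so its result is not the 30-team correlation reported in the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation, term for term as in the formula above."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples of size >= 2")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sqrt(ss_x) * sqrt(ss_y))


# OAA and FRV for CHC, LAD, BOS, ARI, SDP, STL (six-team subset only).
oaa = [38, 23, 18, 23, 16, 16]
frv = [34, 20, 17, 23, 22, 20]
r_subset = pearson_r(oaa, frv)
print(round(r_subset, 2))
```

For a regression with a single predictor, R^2 equals the square of the correlation. That is why the reported R^2 of 0.819 for OAA and FRV lines up with a correlation of about 0.90, and the R^2 of 0.418 for DRS and FRV lines up with about 0.65.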

 

The Cubs Are Winning with Range

The component data explains why Chicago is separated from the rest of the league.

 

The Cubs have:

Component  Value
Range      31
Arm        7
Framing    -3
Blocking   0
Throwing   -2
FRV        34

That is the key to the whole study. Chicago is not leading because of framing. It is not leading because of blocking. It is leading because of range.

Range is the largest component in the dataset. The Cubs have a Range value of 31. The next highest teams are Arizona at 19, the Dodgers at 18, Boston at 16, and San Diego at 13. That gap is enormous. It suggests that Chicago is turning a large number of balls in play into outs that an average defense might not convert.

A simple component model is:

\mathrm{FRV}_i = \mathrm{Range}_i + \mathrm{Arm}_i + \mathrm{Framing}_i + \mathrm{Blocking}_i + \mathrm{Throwing}_i + \epsilon_i

The additional term at the end of the equation, (\epsilon_i), is included because published component totals may not always sum perfectly to the displayed total due to rounding, classification, or metric construction. But the practical interpretation is still clear. In this dataset, Range is the dominant separating variable.
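Applied to the Cubs’ published components, the model gives:

31 + 7 + (-3) + 0 + (-2) = 33 \qquad \epsilon_{\mathrm{CHC}} = 34 - 33 = 1

So the listed components account for all but one run of Chicago’s FRV total, and Range alone accounts for 31 of the 34.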

Arizona is similar to Chicago in shape, though not in magnitude. The Diamondbacks have 19 Range and 23 FRV. The Dodgers combine 18 Range with 6 Arm. Boston combines 16 Range, 3 Arm, and 2 Blocking. San Diego has a more balanced profile, with 13 Range, 4 Arm, and positive throwing value.

Toronto is the most interesting contrast. The Blue Jays have a strong FRV total of 12, but they do not get it from range. Their Range value is -2. Their Framing value is 13. Toronto is a reminder that not all good defenses are built the same way. Some teams prevent runs by reaching more balls. Others gain value through catcher receiving. A single defensive ranking can hide those differences.

Traditional Fielding Stats Still Miss the Heart of Defense

Fielding percentage remains familiar, but it is not enough.

The regression between fielding percentage and FanGraphs Def is:

\mathrm{Def}_i = 1958.5 \cdot \mathrm{FP}_i - 1930.6

with:

R^2 = 0.227

That means that fielding percentage explains only about 23% of the variation in FanGraphs Def in this dataset.

This should not be surprising. Fielding percentage is calculated as:

\mathrm{FP} = \frac{\mathrm{PO} + \mathrm{A}} {\mathrm{PO} + \mathrm{A} + \mathrm{E}}

Putouts and assists are important, as are errors. But this formula has a blind spot. It only evaluates plays that become official chances. It does not ask how many difficult balls were reached. It does not ask whether a fielder’s first step turned a hit into an out. It does not measure the value of positioning. It does not capture the difference between a clean single and a ball that a better defense would have converted into an out.

This is why a team can have a decent fielding percentage and still grade poorly by modern defensive metrics. Avoiding errors is not the same thing as preventing hits. Completing routine plays is not the same thing as creating outs.

The traditional fielding formula is useful, but incomplete. It measures reliability on the chances a fielder actually handles. Modern defensive metrics try to measure territory, difficulty, and run value.
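The fielding-percentage formula and the fitted line reported in this section can be sketched in a few lines of Python. The putout, assist, and error totals below are hypothetical, and the function names are our own. Because the regression explains only about 23% of the variation, the predicted values should be read as rough tendencies, not forecasts.

```python
def fielding_pct(putouts, assists, errors):
    """FP = (PO + A) / (PO + A + E)."""
    chances = putouts + assists + errors
    if chances == 0:
        raise ValueError("no chances recorded")
    return (putouts + assists) / chances

def predicted_def(fp):
    """Fitted line from this section: Def = 1958.5 * FP - 1930.6."""
    return 1958.5 * fp - 1930.6

# Hypothetical team totals that differ only by 20 errors.
fp_a = fielding_pct(4000, 1500, 80)
fp_b = fielding_pct(4000, 1500, 100)
gap = predicted_def(fp_a) - predicted_def(fp_b)
print(round(fp_a, 4), round(fp_b, 4), round(gap, 1))
```

Note what the function never sees: any ball that was not reached. Two teams with identical inputs here can still differ widely in range.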

What the Rankings Mean

The strongest defensive teams so far fall into three groups.

The first group is Chicago alone. The Cubs are the best team in the study because they combine elite OAA, elite FRV, elite FanGraphs Def, and excellent DRS. Their defense is driven by range, and the size of that range advantage is the most important finding in the data.

The second group includes the Dodgers, Red Sox, Diamondbacks, Padres, and Cardinals. These teams are all meaningfully above average. The Dodgers have the strongest DRS figure and excellent marks across the board. Arizona and San Diego are especially strong in FRV. Boston is balanced and rates highly in all major measures. St. Louis is not quite as strong by FanGraphs Def, but its OAA and FRV remain impressive.

The third group is more complicated. Atlanta, Cleveland, Toronto, Texas, Kansas City, and the Yankees all have positive composite scores, but each has a different profile. Toronto is especially noteworthy because its positive value comes primarily from framing rather than range. Texas has positive OAA but neutral FRV, which keeps its composite score closer to the middle.

At the bottom, Minnesota is the weakest team overall. Seattle and the Angels are also poor by modern defensive value. Philadelphia is unusual because DRS dislikes it more than FRV does, while the Athletics are consistently weak across several measures.

Why This is Important

Defense affects the interpretation of everything else.

A pitcher on a great defensive team is not working in the same environment as a pitcher on a poor defensive team. A staff backed by Chicago’s range advantage may see more balls converted into outs. A staff backed by a weak range defense may see more balls fall in, even when contact quality is similar. That matters when evaluating ERA, run prevention, pitcher luck, and even team overperformance.

The same is true for team analysis. A club with strong pitching numbers may be getting help from its defense. A club with disappointing run prevention may have a fielding problem hidden underneath the pitching line. Defense is not just a separate category. It is part of the context in which pitching statistics are produced.

This is especially important because modern baseball produces so many batted-ball measurements. We can now ask not just whether a ball was caught, but whether it should have been caught. We can compare actual outs to expected outs. We can separate the routine from the exceptional. That changes the language of defense.

Errors used to dominate the conversation because they were visible. Range is harder to see, but it is often more important.

Conclusion

So far this season, the defensive story is clear at the top. The Cubs have been the best defensive team in baseball by the combined evidence of DRS, OAA, FRV, and FanGraphs Def. Their advantage is not cosmetic. It is large, broad, and especially driven by range.

The Dodgers are also excellent, and they lead in DRS. Boston, Arizona, San Diego, and St. Louis form a strong second tier. At the bottom, Minnesota, Seattle, the Angels, Philadelphia, and the Athletics have been the weakest defensive teams by the composite measure.

But the broader lesson is methodological. Defense should not be reduced to one statistic. OAA, FRV, and FanGraphs Def are closely aligned. DRS is related but more independent. Fielding percentage captures a small part of the picture, but it misses much of what makes modern defense valuable.

The old defensive question was: did the fielder make an error?

The better question is: how many outs did the defense create that an average defense would not have created?

That is where the Cubs separate themselves. Not merely by being clean. By getting to the baseball.

 

Zeno’s Paradox: The Infinite Hidden Inside a Single Step

At first glance, Zeno’s paradox seems ridiculous.

Of course, Achilles catches the tortoise. Of course, an arrow moves through the air. Of course, I can walk across a room. Well, duh!

We know these things before anyone begins arguing. Motion is one of the most ordinary facts of experience. Every thrown ball, every running child, every falling leaf, every car moving down a road seems to refute Zeno before he even begins.

And yet the paradox remains.

That is what makes Zeno interesting. His argument does not endure because it convinces us that motion is impossible. It survives because it reveals something strange about the way we explain motion. Zeno takes an everyday event and slows it down until the ordinary becomes puzzling. He asks us to look not at the fact that something moves, but at what must be true for motion to be intelligible.

Before I can cross a room, I must first cross half the room. Before I can cross the remaining distance, I must cross half of that. Then half again. Then half again. The distances become smaller and smaller, but the number of required divisions seems to grow without end.

The paradox begins with a simple observation: A finite distance can be divided into infinitely many parts.

That is the unsettling idea at the heart of Zeno’s paradox. The problem is not that the room is too large. The problem is that even a small room appears to contain an infinite structure.

The question becomes: how can a person complete an infinite number of tasks in a finite amount of time?

The Dichotomy Paradox

One of Zeno’s most famous arguments is often called The Dichotomy Paradox. The word “dichotomy” means a division into two parts. In this paradox, every journey must be divided in half.

Suppose I want to walk from one side of a room to the other. To reach the far wall, I first need to reach the halfway point. Once I reach the halfway point, I still need to reach the halfway point of the remaining distance. Then I need to reach the next halfway point. And so on.

The sequence looks like this:

\frac{1}{2},\ \frac{1}{4},\ \frac{1}{8},\ \frac{1}{16},\ \frac{1}{32},\ldots

Each distance is smaller than the one before it. But there is no final term. No matter how many halfway points I cross, another halfway point remains.

That is the apparent trap. If every motion requires completing infinitely many sub-motions, then motion seems impossible. Before I can finish the journey, I must finish an infinite sequence of smaller journeys.

Yet I do finish the journey.

That tension is the paradox.

Figure 1. Divided Finite Distance.

Mathematically, the total distance can be written as an infinite series:

\frac{1}{2}+\frac{1}{4}+\frac{1}{8}+\frac{1}{16}+\cdots

At first, this looks like an endless accumulation. But modern mathematics gives us a clear answer:

\frac{1}{2}+\frac{1}{4}+\frac{1}{8}+\frac{1}{16}+\cdots = 1

More formally:

\sum_{n=1}^{\infty}\left(\frac{1}{2}\right)^n = 1

The infinite series has a finite sum.

This is the key mathematical insight. An infinite number of terms does not necessarily mean an infinite total. The terms can shrink quickly enough that their sum approaches a finite limit.

That is why the walker reaches the wall. The distances get smaller, and the times required to cross them also get smaller. The infinite sequence does not require infinite time.
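The convergence can be watched numerically. The sketch below adds the halving distances one at a time and tracks how much of the room remains; the function name is our own.

```python
def partial_sums(n_terms):
    """Running totals of 1/2 + 1/4 + 1/8 + ... for the first n_terms terms."""
    total = 0.0
    sums = []
    for n in range(1, n_terms + 1):
        total += 0.5 ** n
        sums.append(total)
    return sums

sums = partial_sums(30)
# Distance still left to the wall after each step.
remaining = [1.0 - s for s in sums]

for n in (1, 2, 5, 10, 30):
    print(n, sums[n - 1], remaining[n - 1])
```

After n steps the remaining distance is exactly (1/2)^n. It never reaches zero at any finite step, yet it becomes smaller than any positive distance we care to name.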

Still, this answer should not make us dismiss Zeno too quickly. The modern solution is powerful, but it also shows why the paradox mattered in the first place. Zeno forced later thinkers to clarify the relationship between infinity, space, time, and motion.

He did not merely ask a trick question. He discovered a pressure point.

Achilles and the Tortoise

The most famous version of Zeno’s argument is Achilles and the tortoise.

Imagine Achilles, the great runner, racing against a tortoise. Since Achilles is much faster, the tortoise receives a head start. Once the race begins, Achilles quickly reaches the place where the tortoise started. But by that time, the tortoise has moved a little farther ahead.

Achilles then reaches that new position. But again, the tortoise has moved forward.

Achilles reaches the next position. The tortoise has moved again.

This continues indefinitely.

The distances shrink. The tortoise’s lead becomes smaller and smaller. But in Zeno’s framing, Achilles must first reach every previous position occupied by the tortoise. Since there are infinitely many such positions, it seems Achilles can never catch up.

Again, common sense rebels.

Of course Achilles catches the tortoise.

But Zeno is not really betting on the tortoise. He is asking whether motion can be explained if every interval contains infinitely many smaller intervals.

Figure 2. Race Diagram.

Let the tortoise begin with a head start of distance (d). Let Achilles run at velocity (v_A), and let the tortoise move at velocity (v_T). If Achilles is faster, then:

v_A > v_T

The time it takes Achilles to catch the tortoise is:

t_{\text{catch}} = \frac{d}{v_A - v_T}

This equation gives a finite answer. Achilles catches the tortoise when the initial head start has been eliminated by the difference between their speeds.

For example, suppose the tortoise starts 10 meters ahead. Achilles runs at 10 meters per second. The tortoise moves at 1 meter per second. Then:

t_{\text{catch}} = \frac{10}{10 - 1} = \frac{10}{9}

So Achilles catches the tortoise in about 1.11 seconds.

t_{\text{catch}} \approx 1.11\ \text{seconds}

The paradox dissolves mathematically. But it does not disappear philosophically. Zeno’s description of the race is not false in the ordinary sense. Achilles really does pass through the tortoise’s earlier positions. There really are infinitely many possible subdivisions of the race. What Zeno gets wrong is the assumption that infinitely many subdivisions require infinitely much time.
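Zeno's own description of the race can be simulated directly. In the sketch below, each stage lets Achilles run to the tortoise's previous position, exactly as Zeno describes, and the stage times are added up. The function names are our own, and constant speeds are assumed.

```python
def catch_time_closed_form(d, v_a, v_t):
    """Catch time from the formula d / (v_A - v_T)."""
    if v_a <= v_t:
        raise ValueError("Achilles must be faster than the tortoise")
    return d / (v_a - v_t)

def catch_time_by_stages(d, v_a, v_t, stages):
    """Total time after a given number of Zeno stages."""
    gap = d
    elapsed = 0.0
    for _ in range(stages):
        dt = gap / v_a      # time for Achilles to reach the tortoise's last position
        elapsed += dt
        gap = v_t * dt      # how far the tortoise moved in that time
    return elapsed

exact = catch_time_closed_form(10, 10, 1)
for stages in (1, 2, 3, 20):
    print(stages, catch_time_by_stages(10, 10, 1, stages), exact)
```

The stage times form a geometric series with ratio v_T / v_A, so the running total climbs toward 10/9 seconds and never passes it. Infinitely many stages fit inside a finite time.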

The modern answer depends on the idea of convergence.

The partial sums of a shrinking series approach a limit. For example:

S_n = \sum_{k=1}^{n}\left(\frac{1}{2}\right)^k

As (n) increases, (S_n) gets closer and closer to 1.

\lim_{n\to\infty} S_n = 1

This is the heart of the mathematical solution. The sequence has infinitely many steps, but the total distance is finite. The total time is finite too, assuming the motion is continuous and the speed remains well-behaved.

Figure 3. Infinite Steps

The Arrow Paradox

Zeno’s Arrow paradox attacks motion from another direction.

Imagine an arrow flying through the air. At any single instant, the arrow occupies a particular position. At that instant, it is exactly where it is. It is not yet at the next position, nor is it at the previous one.

So, Zeno asks, where is the motion?

If time is made of instants, and if the arrow is motionless at each instant, then how can motion arise from a collection of motionless moments?

This paradox is different from the Dichotomy and Achilles arguments. It is not mainly about an infinite sequence of distances. It is about time itself. If time is composed of indivisible instants, then motion becomes difficult to locate. At a single frozen instant, nothing appears to move.

A photograph captures this problem nicely. A photograph of a moving car does not show motion itself. It shows a car at a position. Motion appears only when we understand the position as part of a sequence.

Modern physics and calculus answer this by treating velocity not as a visible change inside a single instant, but as an instantaneous rate of change.

Average velocity is easy to understand:

v_{\text{avg}} = \frac{\Delta x}{\Delta t}

This says that average velocity equals change in position divided by change in time.

Instantaneous velocity is more subtle. It is defined as the limit of average velocity as the time interval becomes arbitrarily small:

v(t) = \lim_{\Delta t\to 0}\frac{x(t+\Delta t)-x(t)}{\Delta t}

The arrow does not need to move “inside” a frozen instant. Its motion is represented by the way its position changes over time. Velocity belongs to the structure of the function, not to a single isolated snapshot.
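The limit definition can be illustrated with a shrinking time interval. The position function below is hypothetical, chosen only because its exact velocity is easy to check by hand (x(t) = 5t^2 has velocity 10t).

```python
def position(t):
    """Hypothetical position function x(t) = 5 * t^2."""
    return 5.0 * t * t

def average_velocity(x, t, dt):
    """Change in position divided by change in time over [t, t + dt]."""
    return (x(t + dt) - x(t)) / dt

# Average velocity at t = 2 over smaller and smaller intervals.
estimates = [(dt, average_velocity(position, 2.0, dt)) for dt in (1.0, 0.1, 0.01, 0.001)]
for dt, v in estimates:
    print(dt, v)
```

For this function the average velocity over an interval of length dt is exactly 20 + 5·dt, so the estimates close in on 20 as the interval shrinks. No single snapshot contains that number; it comes from how neighboring positions relate.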

That is a powerful mathematical response. But again, Zeno has forced us to become more precise. He makes us distinguish between position and motion, between an instant and an interval, between a snapshot and a process.

The arrow paradox is not silly. It is a warning about confusing the parts of a description with the whole of reality.

Infinity as the Real Subject

The reason Zeno’s paradoxes endure is that they are not really about turtles, arrows, or people crossing rooms. They are about infinity.

There are at least two kinds of infinity at work here.

First, there is the infinity of division. A line segment can be divided in half, then half again, and so on. There is no obvious stopping point. This suggests that space may be infinitely divisible.

Second, there is the infinity of sequence. Once we begin listing the required steps, the list seems endless. First half the distance. Then half the remainder. Then half again.

Zeno’s genius was to combine these two ideas and turn them against motion.

If every finite act contains infinitely many parts, then how can any finite act be completed?

The modern answer is that infinitely many parts can form a finite whole. That answer now seems familiar because infinite series are part of standard mathematics. But the idea is far from obvious. It is one of the great achievements of mathematical thought.

A simple geometric series shows the point:

a + ar + ar^2 + ar^3 + \cdots = \frac{a}{1-r}

provided that:

|r| < 1

In the Dichotomy paradox, the first term is:

a = \frac{1}{2}

and the common ratio is:

r = \frac{1}{2}

So:

\frac{a}{1-r} = \frac{\frac{1}{2}}{1-\frac{1}{2}} = \frac{\frac{1}{2}}{\frac{1}{2}} = 1

The infinite sum equals the finite distance.

This is why Zeno’s argument fails mathematically. But it fails in a revealing way. It shows that common sense alone is not enough. We needed a theory of limits to explain what everyday experience already knew.

The Difference Between Solving and Dismissing

It is tempting to say that calculus solved Zeno’s paradox and leave it there.

In one sense, that is true. The mathematics of limits gives a clean answer to the problem of infinite subdivision. Achilles catches the tortoise. The walker crosses the room. The arrow moves.

But there is a difference between solving a paradox and dismissing it.

A bad paradox depends on a cheap trick. Once the trick is exposed, nothing remains.

Zeno’s paradox is different. Even after the mathematical answer is given, the original problem remains intellectually productive. It continues to ask useful questions.

What is continuity?

What is an instant?

Is space made of points, or are points abstractions we impose on space?

Is time a flowing reality, or a coordinate in a mathematical model?

Does mathematics describe the world directly, or does it provide a structure that predicts the world?

These are not dead questions. They return in different forms in philosophy, physics, and mathematics. Zeno’s paradox survives because it sits near the boundary between lived experience and formal explanation.

We live in motion. But to explain motion, we must translate it into distance, time, velocity, sequence, and limit. Each translation clarifies something. Each translation also changes the problem.

The Paradox as a Lesson in Explanation

There is a deeper lesson here.

Zeno shows that an explanation can fail even when the reality being explained is obvious.

Motion happens. No serious person doubts that. But saying “motion happens” is not the same as explaining how motion is possible within a particular theory of space and time.

That distinction matters far beyond ancient philosophy.

In science, statistics, and history, we often begin with facts that seem obvious. A species changes. A river cuts a valley. A baseball player declines with age. A market rises or falls. A civilization expands. A population migrates.

But explanation requires structure. We need a model. We need assumptions. We need a way to connect observations to causes.

Zeno’s paradox reminds us that the structure of explanation can become unstable. Sometimes the model makes the obvious seem impossible. When that happens, the answer is not to reject experience immediately. It is to examine the assumptions inside the model.

That may be the real value of the paradox.

Zeno slows us down. He makes us ask what we mean by motion, distance, time, and completion. He takes a simple act and reveals the hidden machinery of thought inside it.

A single step across a room becomes a philosophical event.

Why the Paradox is Still Discussed

Zeno was wrong if his goal was to prove that motion is impossible.

But he was right that motion is stranger than it appears.

The paradox matters because it teaches humility. We should be careful when we assume that ordinary experience is simple. The simplest events often contain the deepest assumptions.

Walking across a room feels immediate. But when analyzed mathematically, it opens into infinity.

A runner passing a tortoise feels obvious. But when divided into successive positions, it becomes a puzzle about convergence.

An arrow flying through the air feels undeniable. But when frozen into instants, it becomes a question about time.

In each case, Zeno forces us to notice that reality and explanation are not identical. Reality happens. Explanation tries to account for how it happens. The gap between the two is where paradox lives.

The modern mathematical answer is beautiful:

\sum_{n=1}^{\infty}\left(\frac{1}{2}\right)^n = 1

An infinite process can have a finite limit.

But the philosophical lesson is just as important:

The world may move easily, but our concepts do not always move with it.

Conclusion: The Infinite in the Ordinary

Zeno’s paradox begins with common sense and ends with infinity.

That is why it remains powerful. It does not take us away from ordinary life. It takes ordinary life more seriously than we usually do.

A walk across the room becomes a question about infinite division. A race becomes a question about convergence. An arrow becomes a question about time, instants, and change.

The paradox is not really asking whether motion exists. It is asking whether our account of motion is coherent.

That is a much better question.

Achilles catches the tortoise. The arrow reaches the target. I cross the room.

But after Zeno, none of these things seems quite as simple as it did before.

The world still moves.

The mystery is that we can explain it at all.

 

Season-Level Validation: Do Third-Base Offensive Z-Scores Predict wRC+?

Introduction

The first wRC+ validation study used a career-level FanGraphs export.

That study was useful. It showed that, among regular third basemen, average Model C offensive score per qualified season strongly predicted career wRC+. It also showed that traditional defense did not predict wRC+, which was exactly what we wanted from a negative-control test.

But the career-level study had one limitation.

wRC+ is fundamentally a season-level offensive rate statistic. Our offensive z-score system is also built season by season. So the cleanest validation test is not career score against career wRC+.

The cleanest test is:

Does a third baseman’s season-level offensive z-score predict his season-level wRC+?

This chapter answers that question.

The answer is yes.

Using the season-level FanGraphs export, the Model C offensive season score explains about 69 percent of the variation in season wRC+ among qualified third-base seasons.

R^2 = 0.692

The fitted model is:

wRC^+ = 101.47 + 5.86(\text{Model C Offensive Season Score})

That is a strong result.

Just as important, the traditional defensive score does not predict wRC+:

R^2 = 0.002

This is exactly the pattern the project needed.

Offensive z-scores predict offense.

Traditional defensive z-scores do not.

That means the Model C offensive score is not merely identifying generally good players. It is measuring offensive quality.

Data Used in the Season-Level Study

The FanGraphs season-level export included:

9,152 player-season rows through 2025
Season
Name
Team
PA
wRC+
PlayerId
MLBAMID

The broader third-base season dataset included:

3,188 qualified third-base seasons
Season range: 1880–2025

The merge was very strong:

Matched seasons: 3,163
Unmatched seasons: 25
Match rate: 99.2%
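A merge of this kind, and the match rate it produces, can be sketched as follows. The key structure (player ID plus season), the IDs, and the wRC+ values are all hypothetical; a real merge would likely need ID and name cleanup first.

```python
# Hypothetical third-base seasons keyed by (player_id, season).
third_base_seasons = [("p001", 1980), ("p001", 1981), ("p002", 1999), ("p003", 1905)]

# Hypothetical season-level export: (player_id, season) -> wRC+.
fangraphs_wrc = {("p001", 1980): 120, ("p001", 1981): 135, ("p002", 1999): 110}

matched = [(key, fangraphs_wrc[key]) for key in third_base_seasons if key in fangraphs_wrc]
unmatched = [key for key in third_base_seasons if key not in fangraphs_wrc]
match_rate = len(matched) / len(third_base_seasons)

# The counts reported above give the same kind of ratio.
reported_rate = 3163 / 3188
print(match_rate, round(reported_rate * 100, 1))
```

Keeping the unmatched list, instead of silently dropping those rows, is what makes it possible to say which kinds of seasons failed to match.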

The remaining unmatched seasons were mostly older Negro Leagues or historical ID cases. The modern and post-integration major-league seasons matched very well.

This makes the season-level validation much cleaner than the first career-level wRC+ test.

Why Season-Level Validation Matters

The career-level wRC+ test asked whether accumulated third-base offensive separation was related to career offensive quality.

The season-level test is more direct.

It asks:

In a given season, does the offensive z-score model identify the same kind of offensive performance that wRC+ identifies?

This is a better test because both measures are season-specific.

The z-score model compares a third baseman to other third basemen in the same season. wRC+ compares a hitter’s offensive production to the league and park context of that season.

They are not the same statistic.

But they should be related.

If Model C is measuring offensive quality, high Model C scores should correspond to high wRC+ values.

That is what the data show.

The Model C Offensive Score

The Model C offensive score uses seven components:

OBP
ISO
BB/PA
SO/PA, inverted
Net SB/PA
R/PA
RBI/PA

Each component is converted into a same-position, same-season z-score.

The basic z-score formula is:

z = \frac{x - \mu}{\sigma}

Where:

x = \text{the player's value}
\mu = \text{the same-position, same-season peer-group mean}
\sigma = \text{the same-position, same-season peer-group standard deviation}

This is the central idea of the study.

Raw numbers ask how large a number is. Z-scores ask how far a player separated from his peer group.
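A minimal sketch of the z-score step is shown below. The OBP values are hypothetical, and the use of the sample standard deviation is an assumption; the population version would change the scale slightly but not the ordering.

```python
import statistics

def z_scores(values):
    """Z-score of each value relative to its own peer group."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("peer group has no spread")
    return [(x - mu) / sigma for x in values]

# Hypothetical OBP values for five third basemen in one season.
peer_obp = [0.300, 0.320, 0.340, 0.360, 0.380]
zs = z_scores(peer_obp)
print([round(z, 2) for z in zs])
```

The same .380 OBP would earn a smaller z-score in a season where the peer group was stronger or more spread out. That is the sense in which z-scores measure separation, not size.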

Offensive Component Equations

On-base percentage is:

OBP = \frac{H + BB + HBP}{AB + BB + HBP + SF}

Slugging percentage is:

SLG = \frac{TB}{AB}

Isolated power is:

ISO = SLG - AVG

Walk rate is:

BB/PA = \frac{BB}{PA}

Strikeout rate is:

SO/PA = \frac{SO}{PA}

Net stolen bases are:

NetSB = SB - CS

Net stolen-base rate is:

NetSB/PA = \frac{SB - CS}{PA}

Run rate is:

R/PA = \frac{R}{PA}

RBI rate is:

RBI/PA = \frac{RBI}{PA}
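The seven component rates can be computed from a season's counting stats as sketched below. The stat line is hypothetical, the dictionary keys are our own labels, and batting average is taken as H/AB.

```python
def offensive_rates(s):
    """Model C component rates from one season of counting stats."""
    ab, pa = s["AB"], s["PA"]
    avg = s["H"] / ab
    slg = s["TB"] / ab
    obp = (s["H"] + s["BB"] + s["HBP"]) / (ab + s["BB"] + s["HBP"] + s["SF"])
    return {
        "OBP": obp,
        "ISO": slg - avg,
        "BB/PA": s["BB"] / pa,
        "SO/PA": s["SO"] / pa,
        "NetSB/PA": (s["SB"] - s["CS"]) / pa,
        "R/PA": s["R"] / pa,
        "RBI/PA": s["RBI"] / pa,
    }

# Hypothetical season line (illustration only).
season = {"PA": 600, "AB": 520, "H": 150, "TB": 260, "BB": 70, "HBP": 5, "SF": 5,
          "SO": 100, "SB": 10, "CS": 4, "R": 90, "RBI": 95}
rates = offensive_rates(season)
for name, value in rates.items():
    print(name, round(value, 3))
```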

The strikeout component is inverted because lower strikeout rates are better:

z_{\text{Low SO/PA}} = -\left( \frac{ (SO/PA)_i - \overline{(SO/PA)}_{\text{peer}} }{ s_{SO/PA,\text{peer}} } \right)

The full Model C offensive season score is:

\begin{aligned} \text{Season Score} &= z_{\text{OBP}} + z_{\text{ISO}} + z_{\text{BB/PA}} + z_{\text{Low SO/PA}} \\ &\quad + z_{\text{NetSB/PA}} + z_{\text{R/PA}} + z_{\text{RBI/PA}} \end{aligned}

This score measures offensive separation from same-season third-base peers.
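Putting the pieces together, the season score is a sum of seven peer-group z-scores with the strikeout component sign-flipped. The sketch below uses three hypothetical players and the sample standard deviation (an assumption). It leaves out any playing-time weighting, so it illustrates the equation above, not the full project pipeline.

```python
import statistics

COMPONENTS = ["OBP", "ISO", "BB/PA", "SO/PA", "NetSB/PA", "R/PA", "RBI/PA"]
INVERTED = {"SO/PA"}  # lower is better, so the z-score is negated

def season_scores(peers):
    """peers: name -> dict of component rates for one position-season."""
    scores = {name: 0.0 for name in peers}
    for comp in COMPONENTS:
        values = [rates[comp] for rates in peers.values()]
        mu = statistics.mean(values)
        sd = statistics.stdev(values)  # sample SD (an assumption)
        for name, rates in peers.items():
            z = (rates[comp] - mu) / sd
            scores[name] += -z if comp in INVERTED else z
    return scores

# Hypothetical peer group (illustration only).
peers = {
    "A": {"OBP": 0.380, "ISO": 0.220, "BB/PA": 0.12, "SO/PA": 0.14,
          "NetSB/PA": 0.010, "R/PA": 0.15, "RBI/PA": 0.16},
    "B": {"OBP": 0.330, "ISO": 0.150, "BB/PA": 0.08, "SO/PA": 0.18,
          "NetSB/PA": 0.005, "R/PA": 0.12, "RBI/PA": 0.12},
    "C": {"OBP": 0.300, "ISO": 0.110, "BB/PA": 0.06, "SO/PA": 0.22,
          "NetSB/PA": 0.000, "R/PA": 0.10, "RBI/PA": 0.10},
}
scores = season_scores(peers)
print({name: round(score, 2) for name, score in scores.items()})
```

Because every component is centered on the peer mean, the scores across a peer group sum to zero. A score of 0 therefore describes a roughly average third baseman for that season.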

Regression Framework

The main validation model is:

wRC^+_s = \alpha + \beta_1(\text{Model C Offensive Season Score}_s) + \varepsilon_s

Where:

wRC^+_s = \text{FanGraphs wRC+ for season } s
\alpha = \text{intercept}
\beta_1 = \text{slope for the offensive z-score}
\varepsilon_s = \text{residual error}

The coefficient of determination is:

R^2 = 1 - \frac{ \sum_s \left( wRC^+_s - \widehat{wRC^+}_s \right)^2 }{ \sum_s \left( wRC^+_s - \overline{wRC^+} \right)^2 }

A higher value of R^2 means the model explains more of the variation in wRC+.
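The regression and the R^2 calculation can be sketched with ordinary least squares in plain Python. The score and wRC+ values below are hypothetical, and the function names are our own.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = alpha + beta * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta

def r_squared(xs, ys, alpha, beta):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical season scores and wRC+ values (illustration only).
scores = [-4.0, -2.0, 0.0, 1.0, 3.0, 6.0]
wrc = [80, 88, 104, 105, 122, 135]
alpha, beta = fit_line(scores, wrc)
r2 = r_squared(scores, wrc, alpha, beta)
print(round(alpha, 2), round(beta, 2), round(r2, 3))
```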

Main Season-Level Result

The fitted offense-only model is:

wRC^+ = 101.47 + 5.86(\text{Model C Offensive Season Score})

The result is:

R^2 = 0.692

This means the Model C offensive season score explains about 69.2 percent of the variation in season-level wRC+ among matched qualified third-base seasons.

That is a strong validation result.

The slope is also meaningful:

\beta_1 = 5.86

Each additional point of Model C offensive season score corresponds to about 5.86 additional points of wRC+.

For example, a player with an offensive score of 0 projects as:

wRC^+ = 101.47 + 5.86(0) = 101.47

A player with an offensive score of 5 projects as:

wRC^+ = 101.47 + 5.86(5) = 130.77

A player with an offensive score of 10 projects as:

wRC^+ = 101.47 + 5.86(10) = 160.07

This is exactly the pattern expected if Model C is capturing offensive dominance.

Figure 1: Model Comparison

Figure 1. How well season-level third-base metrics predict wRC+.

The first figure compares several models.

The offensive z-score model performs well:

R^2_{\text{Offensive z-score}} = 0.692

The traditional defensive score performs almost not at all:

R^2_{\text{Traditional Defense}} = 0.002

Adding traditional defense to offense does not meaningfully improve the result:

R^2_{\text{Offense + Defense}} = 0.692

Adding plate appearances produces only a tiny improvement:

R^2_{\text{Offense + PA}} = 0.695

The WAR_off benchmark is higher:

R^2_{\mathrm{WAR}_{\mathrm{off}}} = 0.846

That is expected. WAR_off is already a sophisticated offensive value measure. It is included only as a benchmark, not as a competing z-score model.

The important comparison is offense versus defense.

The offensive z-score predicts wRC+ strongly. The defensive score does not.

Figure 2: Offensive Z-Score Versus wRC+

Figure 2. Season wRC+ versus Model C offensive season score among third basemen.

This figure shows the main relationship directly.

The x-axis is:

\text{Model C Offensive Season Score}

The y-axis is:

wRC^+

The fitted line is:

wRC^+ = 101.47 + 5.86x

R^2 = 0.692

The pattern is clear.

High offensive z-score seasons generally produce high wRC+ seasons. Miguel Cabrera’s 2013 season, Chipper Jones’s 1999 season, Mike Schmidt’s 1980 and 1981 seasons, George Brett’s 1985 season, and Alex Rodriguez’s 2007 season all sit in the upper-right region.

That is exactly where they should be.

The plot also shows interesting residual cases. Some seasons have high wRC+ relative to their Model C score. Others have lower wRC+ than the z-score model predicts.

Those differences are not necessarily errors. They show that Model C and wRC+ measure offense from different angles.

Figure 3: Actual Versus Predicted wRC+

Figure 3. Actual versus predicted season wRC+ using the offensive z-score model.

The prediction equation is:

\widehat{wRC^+}_s = 101.47 + 5.86(\text{Model C Offensive Season Score}_s)

The residual is:

\text{Residual}_s = wRC^+_s - \widehat{wRC^+}_s

Players near the diagonal are well predicted. Players above the diagonal have higher wRC+ than the z-score model predicts. Players below the diagonal have lower wRC+ than the z-score model predicts.
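The prediction and residual equations translate directly into code. The sketch below uses the rounded coefficients reported in this chapter; the three seasons are hypothetical and exist only to show the sign convention.

```python
ALPHA, BETA = 101.47, 5.86  # rounded coefficients from the fitted model

def predicted_wrc_plus(score):
    """Predicted wRC+ from a Model C offensive season score."""
    return ALPHA + BETA * score

def residual(actual_wrc_plus, score):
    """Actual minus predicted: positive means wRC+ beat the prediction."""
    return actual_wrc_plus - predicted_wrc_plus(score)

# Hypothetical seasons: (label, Model C score, actual wRC+).
seasons = [("Season A", 5.0, 140), ("Season B", 5.0, 125), ("Season C", 0.0, 101)]
for label, score, actual in seasons:
    print(label, round(predicted_wrc_plus(score), 2), round(residual(actual, score), 2))
```

Season A sits above the diagonal and Season B below it, even though the two have the same Model C score. That gap is the kind of disagreement residual analysis is meant to surface.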

The figure shows that most seasons fall around the diagonal, which is why the model produces a strong R^2.

It also shows the value of residual analysis. The most interesting seasons are often the ones that do not land exactly where the model expects.

Figure 4: The Defensive Negative Control

Figure 4. Traditional defensive score does not predict season wRC+.

The negative-control model is:

wRC^+ = \alpha + \beta_1(\text{Traditional Defensive Season Score}) + \varepsilon

The fitted result is:

wRC^+ = 102.46 + 0.51(\text{Traditional Defensive Season Score})

R^2 = 0.002

This is one of the most important results in the chapter.

The traditional defensive score explains almost none of the variation in wRC+.

That is exactly what should happen.

wRC+ is an offensive metric. A traditional defensive score should not meaningfully predict it. The fact that it does not strengthens the validation.

It shows that the Model C offensive score is measuring offense specifically, not simply general player quality.

Figure 5: Residuals

Figure 5. Largest season-level wRC+ residuals from the offensive z-score model.

The residual equation is:

\text{Residual}_s = wRC^+_s - \widehat{wRC^+}_s

Positive residuals mean the season had a higher wRC+ than predicted by the z-score model.

Negative residuals mean the season had a lower wRC+ than predicted.

The largest positive residuals include:

Matt Williams 1995
Jim Finigan 1954
Jack Gleason 1884
Sean Berry 1995
Ron Cey 1981
George Scott 1970
Mike Schmidt 1981
Bill Joyce 1894

The largest negative residuals include:

Art Devlin 1905
Chone Figgins 2011
Jerry Royster 1977
Pie Traynor 1922
Chuck Harmon 1954
Bubba Phillips 1960
Charlie Hayes 1999
Maikel Garcia 2024

These residuals are worth studying because they show where the z-score model and wRC+ disagree most.

Interpreting Positive Residuals

A positive residual means wRC+ sees more offensive value than the z-score model predicts.

There are several possible reasons.

First, wRC+ is built from run values and is park- and league-adjusted. Model C is built from peer separation in selected categories. The two systems overlap strongly, but they are not identical.

Second, Model C includes runs and RBI rates. Those are useful for describing offensive dominance, but they can also be influenced by lineup context. wRC+ is more directly centered on offensive production independent of team context.

Third, partial seasons can create interesting differences. Matt Williams 1995, for example, had a very high wRC+ in fewer plate appearances than a full season. The z-score model includes playing-time weighting, so a shorter season can be pulled downward relative to a rate statistic.

That does not mean either measure is wrong.

It means they are answering slightly different questions.

Model C asks:

How much offensive separation did this third baseman produce in this season?

wRC+ asks:

How strong was this hitter's offensive production after league and park adjustment?

Those are related questions, not identical questions.

Interpreting Negative Residuals

A negative residual means the z-score model predicted a higher wRC+ than the player actually had.

This can happen when a player scores well in the Model C components but not as well in wRC+.

For example, a player may separate from third-base peers in runs, RBI, baserunning, or contact profile without producing the same level of park- and league-adjusted offensive value.

Art Devlin 1905 is the largest negative residual in this run. Pie Traynor 1922, Ossie Vitt 1915, and several other early-era or context-sensitive seasons also appear in the negative tail.

This is not surprising.

The farther back the data go, the more differences we expect between a transparent peer-z-score model and a modern run-value metric such as wRC+.

The residuals are not a failure of the model. They are a useful diagnostic tool.

Why This Season-Level Result Matters

This season-level validation is probably the cleanest offensive test in the project.

The WAR validation showed that the combined offense-defense model predicts total value.

The career wRC+ validation showed that average offensive z-score predicts career offensive quality.

But this season-level wRC+ validation is even more direct.

It compares:

\text{Season Offensive Z-Score}

to:

\text{Season } wRC^+

The result is strong:

R^2 = 0.692

That means Model C captures a substantial share of the same offensive signal captured by wRC+.

The defensive negative control confirms the interpretation:

R^2_{\text{Defense Only}} = 0.002

That is almost zero.

Offensive z-scores predict offense. Traditional defensive z-scores do not.

That is exactly the validation pattern we wanted.
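One way to read the two R^2 values side by side is to convert them to correlation magnitudes. The sketch below assumes each model is a one-predictor linear regression with an intercept, in which case R^2 equals the squared Pearson correlation. The function name is hypothetical and is not part of the study's code.

```python
import math


def implied_correlation(r_squared):
    """Return |r| implied by R^2 for a one-predictor linear regression.

    Assumption: one predictor plus an intercept, so R^2 = r^2.
    The sign of r is not recoverable from R^2 alone.
    """
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R^2 must be between 0 and 1")
    return math.sqrt(r_squared)


# R^2 values quoted in this chapter.
offense_r = implied_correlation(0.692)  # season offensive z-score vs. season wRC+
defense_r = implied_correlation(0.002)  # defensive negative control

print(round(offense_r, 3), round(defense_r, 3))
```

Under that assumption the offensive score correlates with season wRC+ at roughly 0.83 in magnitude, while the defensive score sits near 0.04.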

How This Fits With the Earlier Validation Studies

The validation sequence now has three layers.

First, the WAR study showed that offense and traditional defense together predict total value:

R^2_{\text{Career WAR, Offense + Defense}} = 0.814

Second, the career-level wRC+ study showed that average offensive z-score predicts career offensive quality:

R^2_{\text{Career wRC+}} = 0.740

Third, this chapter shows that season-level offensive z-score predicts season-level wRC+:

R^2_{\text{Season wRC+}} = 0.692

Together, these results give the project a strong methodological foundation.

The z-score model is not WAR.

It is not wRC+.

It is a simpler and more transparent peer-separation model.

But it clearly captures real value-related information.

Limitations

This chapter should still be read carefully.

The FanGraphs season-level file matched almost all qualified third-base seasons, but not every season. The unmatched cases were mostly older Negro Leagues seasons or players listed under historical IDs.

The Model C offensive score is not park-adjusted in the same way as wRC+. It is same-position and same-season adjusted through z-scores, but that is not identical to league and park adjustment.

Model C also includes runs and RBI rates, which are not purely individual batter skill measures. They can reflect lineup and team context.

Finally, wRC+ is itself a model. It is extremely useful, but it is not a perfect measure of all offensive contribution. It does not treat baserunning the same way Model C does, and it does not ask the same positional-peer question.

So the correct conclusion is not:

Model C is the same as wRC+.

The correct conclusion is:

Model C strongly predicts wRC+, while preserving a different interpretive question.

That is exactly what we want from a validation study.

Conclusion

The season-level wRC+ validation gives the clearest offensive support for the third-base z-score project.

The main model is:

wRC^+ = 101.47 + 5.86(\text{Model C Offensive Season Score})

The result is:

R^2 = 0.692

That means the offensive z-score model explains about 69 percent of the variation in FanGraphs season-level wRC+ among matched qualified third-base seasons.

The traditional defensive score explains almost none:

R^2 = 0.002

That negative-control result is crucial.

The offensive model predicts offense.

The defensive model does not.

The broader implication is clear.

The z-score system is not just an internal ranking device. It aligns strongly with established external value metrics.

WAR validates the two-dimensional model.

wRC+ validates the offensive model.

And the season-level wRC+ study confirms that Model C captures a real offensive signal year by year.

 

Do Third-Base Offensive Z-Scores Predict wRC+?

Introduction

The WAR validation chapter tested the full two-dimensional model.

It asked whether our third-base z-score framework could predict total player value. The answer was yes. Offensive z-scores predicted WAR. Traditional defensive z-scores added substantial explanatory power. The combined model performed especially well at the career level.

But WAR is broad.

WAR includes offense, defense, baserunning, positional adjustment, replacement level, and playing time. That makes it useful, but it also makes it complex. If the question is whether our offensive z-score model really measures offensive quality, WAR is not the cleanest validation target.

For that, we need an offense-only benchmark.

That is where wRC+ becomes useful.

FanGraphs wRC+ is designed to measure offensive production relative to league and park context, with 100 as league average. A 120 wRC+ means a hitter was about 20 percent better than league average. An 80 wRC+ means about 20 percent below league average.

So the validation question becomes simple:

Does our Model C offensive z-score predict FanGraphs wRC+?

The answer is yes.

Among third-base regulars with at least five qualified third-base seasons, the average Model C offensive score per qualified season explains a large share of career wRC+ variation:

wRC^+ = 100.89 + 5.41(\text{Model C Offensive Score per Qualified Season})

R^2 = 0.740

That is a strong relationship.

Just as important, the traditional defensive score does not meaningfully predict wRC+:

R^2 = 0.022

That negative-control result matters. It tells us that the offensive z-score model is not simply measuring general player quality. It is measuring offense.

Why wRC+ Is the Right Validation Target

The earlier WAR validation was a broad test.

It asked:

Do our offense-defense scores predict total value?

This chapter asks something narrower:

Does our offensive z-score predict an established offensive metric?

That is a cleaner test of Model C.

The offensive z-score model was built from same-position, same-season peer comparisons. It was not designed to reproduce wRC+. It does not directly use the same run-value formula. It does not include park adjustments in the same way. It includes runs and RBI, which wRC+ does not treat as independent batter skills in the same way. It includes baserunning through net stolen bases, while wRC+ is focused on hitting.

Even so, the relationship is strong.

That is useful validation.

It means Model C is not just producing interesting internal rankings. It is also aligned with an external offensive measure.

Data Used in the Study

The FanGraphs file used for this chapter was a career batting leaderboard export. Because the file was career-level rather than season-level, this first wRC+ validation is a career-level study.

The merge was very successful.

The third-base career dataset included 897 third-base players.

The FanGraphs wRC+ merge matched 786 of those 897 players, or about 88 percent.

Among regular third basemen, defined as players with at least five qualified third-base seasons, the merge matched 239 of 240 players.

That gives us a strong sample for the validation test.

The main analysis focuses on the regulars because wRC+ is a rate statistic, and very short careers can create noisy results. A five-qualified-season cutoff helps identify players with enough third-base playing time to make the comparison meaningful.

The Model C Offensive Score

The offensive score used in this validation is the same Model C score used throughout the third-base study.

Model C uses seven offensive components:

OBP
ISO
BB/PA
SO/PA, inverted
Net SB/PA
R/PA
RBI/PA

The basic z-score formula is:

z = \frac{x - \mu}{\sigma}

Where:

x = \text{the player's value}

\mu = \text{the same-position, same-season peer-group mean}

\sigma = \text{the same-position, same-season peer-group standard deviation}

This equation asks a simple question:

How far above or below the third-base peer group was this player?

That is the core of the whole study.

Offensive Component Equations

On-base percentage is:

OBP = \frac{H + BB + HBP}{AB + BB + HBP + SF}

Slugging percentage is:

SLG = \frac{TB}{AB}

Batting average is:

AVG = \frac{H}{AB}

Isolated power is:

ISO = SLG - AVG

Walk rate is:

BB/PA = \frac{BB}{PA}

Strikeout rate is:

SO/PA = \frac{SO}{PA}

Net stolen bases are:

NetSB = SB - CS

Net stolen-base rate is:

NetSB/PA = \frac{SB - CS}{PA}

Run rate is:

R/PA = \frac{R}{PA}

RBI rate is:

RBI/PA = \frac{RBI}{PA}

The strikeout component is inverted because fewer strikeouts are better:

z_{\text{Low SO/PA}} = -\left( \frac{ (SO/PA)_i - \overline{(SO/PA)}_{\text{peer}} }{ s_{SO/PA,\text{peer}} } \right)

The full Model C offensive season score is:

\begin{aligned} \text{Season Score} &= z_{\text{OBP}} + z_{\text{ISO}} + z_{\text{BB/PA}} + z_{\text{Low SO/PA}} \\ &\quad + z_{\text{NetSB/PA}} + z_{\text{R/PA}} + z_{\text{RBI/PA}} \end{aligned}

This produces one offensive score for each qualified third-base season.
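The season score described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual code: the player names and stat lines are invented, and the use of the population standard deviation is an assumption, since the text does not say whether the peer-group deviation is a population or sample figure.

```python
from statistics import mean, pstdev

# Component names follow the Model C list; SO/PA is the only inverted one.
COMPONENTS = ["OBP", "ISO", "BB/PA", "SO/PA", "NetSB/PA", "R/PA", "RBI/PA"]
INVERTED = {"SO/PA"}


def model_c_season_scores(peers):
    """Return {player: summed z-score} for one season's peer group.

    `peers` maps a player name to a dict of component values.
    """
    scores = {name: 0.0 for name in peers}
    for comp in COMPONENTS:
        values = [stats[comp] for stats in peers.values()]
        mu = mean(values)
        sigma = pstdev(values)  # assumption: population standard deviation
        if sigma == 0:
            continue  # no spread in this component, so it adds nothing
        for name, stats in peers.items():
            z = (stats[comp] - mu) / sigma
            scores[name] += -z if comp in INVERTED else z
    return scores


# Hypothetical three-player peer group, for illustration only.
peers = {
    "Player A": {"OBP": 0.380, "ISO": 0.220, "BB/PA": 0.12, "SO/PA": 0.15,
                 "NetSB/PA": 0.010, "R/PA": 0.15, "RBI/PA": 0.16},
    "Player B": {"OBP": 0.330, "ISO": 0.150, "BB/PA": 0.08, "SO/PA": 0.18,
                 "NetSB/PA": 0.005, "R/PA": 0.12, "RBI/PA": 0.12},
    "Player C": {"OBP": 0.290, "ISO": 0.100, "BB/PA": 0.05, "SO/PA": 0.22,
                 "NetSB/PA": 0.000, "R/PA": 0.10, "RBI/PA": 0.09},
}
scores = model_c_season_scores(peers)
```

Because every component is centered on the peer-group mean, the scores in any one season sum to zero. The model measures separation from peers, not absolute production.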

Playing-Time Weighting

The broader study uses a playing-time weight so that a partial season does not count the same as a full season.

The weight is:

w = \min\left(1, \frac{PA}{600}\right)

The weighted season score is:

\text{Weighted Offensive Season Score} = \text{Model C Offensive Season Score} \times w

The career offensive score is:

\text{Career Offensive Score} = \sum_{s=1}^{n} \text{Weighted Offensive Season Score}_s

This career score is cumulative. It rewards repeated separation from third-base peers.

But wRC+ is not cumulative. It is a rate-style offensive measure. That creates an important methodological issue.
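The weighting and career sum can be sketched as follows. The function names and the three sample seasons are hypothetical; only the 600-PA cap comes from the text.

```python
def playing_time_weight(pa, full_season_pa=600):
    """Weight a season by plate appearances, capped at 1.0."""
    return min(1.0, pa / full_season_pa)


def career_offensive_score(seasons):
    """Sum weighted season scores.

    `seasons` is a list of (raw_season_score, plate_appearances) tuples.
    """
    return sum(score * playing_time_weight(pa) for score, pa in seasons)


# Hypothetical career: one full season, one half season, one 450-PA season.
seasons = [(4.0, 650), (6.0, 300), (-1.0, 450)]
career = career_offensive_score(seasons)
```

In this made-up example the 6.0 season counts for only 3.0 because it covered half a season of plate appearances. That is the same mechanism that pulls a short, high-rate season downward relative to a rate statistic.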

Why We Use Average Offensive Score per Qualified Season

Because wRC+ is rate-based, the best predictor is not simply total career offensive score.

A player with many seasons can accumulate a large career score even if his average season was not historically great. Another player with fewer seasons can have a higher offensive level but a lower accumulated score.

So for this validation, the primary predictor is:

\text{Average Offensive Score} = \frac{ \text{Career Offensive Score} }{ \text{Qualified Third-Base Seasons} }

Or:

\text{Average Offensive Score} = \frac{ \sum_{s=1}^{n} \text{Weighted Offensive Season Score}_s }{ n }

Where:

n = \text{number of qualified third-base seasons}

This gives us an offensive quality measure rather than a pure accumulation measure.

That distinction matters.

The cumulative career offensive score still predicts wRC+, but not as well as the average score.

For third-base regulars:

Average offensive score per qualified season:
R² = 0.740

Cumulative career offensive score:
R² = 0.661

The average score is a better validation measure because it matches the rate-like nature of wRC+.
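A small sketch shows why the two predictors can rank players differently. Both careers below are invented for illustration, and the function name is an assumption.

```python
def average_offensive_score(weighted_season_scores):
    """Career offensive score divided by the number of qualified seasons."""
    n = len(weighted_season_scores)
    if n == 0:
        raise ValueError("need at least one qualified season")
    return sum(weighted_season_scores) / n


# Hypothetical careers: a long steady one and a shorter, stronger one.
long_career = [2.0] * 12
short_career = [3.5] * 6

long_total, short_total = sum(long_career), sum(short_career)
long_avg = average_offensive_score(long_career)
short_avg = average_offensive_score(short_career)
```

The long career wins on the cumulative score (24.0 to 21.0), while the short career wins on the average (3.5 to 2.0). A rate statistic such as wRC+ behaves like the second comparison.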

Regression Framework

The basic validation model is:

wRC^+_i = \alpha + \beta_1(\text{Average Offensive Score}_i) + \varepsilon_i

Where:

wRC^+_i = \text{FanGraphs career wRC+ for player } i

\alpha = \text{intercept}

\beta_1 = \text{effect of one additional average offensive z-score point}

\varepsilon_i = \text{residual error}
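For readers who want to see the mechanics, a one-predictor least-squares fit and its R^2 can be sketched in plain Python. This is a generic textbook calculation, not the study's own code, and the toy data are invented.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Return (alpha, beta, r_squared) for the model y = alpha + beta * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return alpha, beta, 1 - ss_res / ss_tot


# Hypothetical average offensive scores and career wRC+ values.
avg_scores = [-2.0, 0.0, 1.0, 3.0, 5.0]
career_wrc = [90.0, 102.0, 105.0, 118.0, 127.0]
alpha, beta, r_squared = fit_simple_ols(avg_scores, career_wrc)
```

Here alpha plays the role of the intercept and beta the role of the slope in the equation above.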

The fitted model for third-base regulars is:

wRC^+ = 100.89 + 5.41(\text{Average Offensive Score})

R^2 = 0.740

This means that each additional point of average Model C offensive score is associated with about 5.41 points of career wRC+.

A player with an average offensive score of 0 projects near league average:

wRC^+ = 100.89 + 5.41(0)

wRC^+ = 100.89

A player with an average offensive score of 3 projects as:

wRC^+ = 100.89 + 5.41(3)

wRC^+ = 117.12

A player with an average offensive score of 6 projects as:

wRC^+ = 100.89 + 5.41(6)

wRC^+ = 133.35

This is exactly the kind of relationship we hoped to see.
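The three projections above can be reproduced with a short function. The intercept and slope are the fitted values reported in this chapter; the function name is hypothetical.

```python
ALPHA = 100.89  # fitted intercept for third-base regulars
BETA = 5.41     # fitted slope per point of average offensive score


def predict_career_wrc_plus(average_offensive_score):
    """Project career wRC+ from the average Model C offensive score."""
    return ALPHA + BETA * average_offensive_score


for score in (0, 3, 6):
    print(score, round(predict_career_wrc_plus(score), 2))
```

Run as written, this prints 100.89, 117.12, and 133.35 for scores of 0, 3, and 6, matching the worked values in the text.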

Figure 1: Model Comparison

Figure 1. How well third-base z-scores predict FanGraphs wRC+.

The first figure compares the validation models.

The most important result is:

R^2 = 0.740

for the average offensive score model among regular third basemen.

The cumulative offensive score also performs well:

R^2 = 0.661

But the average score is better because wRC+ is a rate metric.

The traditional defensive score performs very poorly as a wRC+ predictor:

R^2 = 0.022

That is not a problem. It is exactly what we want.

Defense should not predict wRC+ very well. If it did, that would suggest either a hidden confounding problem or a model that was mixing offensive and defensive signals.

The offense-plus-defense model is nearly identical to the offense-only model:

R^2 = 0.741

That small difference tells us that traditional defense adds almost nothing to the prediction of wRC+. Again, this strengthens the interpretation.

The offensive model predicts offense. The defensive model does not.

Figure 2: Average Offensive Z-Score Versus wRC+

Figure 2. Career wRC+ versus average offensive z-score among third-base regulars.

The second figure shows the main relationship directly.

The x-axis is:

\text{Model C Offensive Score per Qualified Third-Base Season}

The y-axis is:

wRC^+

The fitted line is:

wRC^+ = 100.89 + 5.41x

R^2 = 0.740

The upward trend is clear.

Players with high average offensive z-scores tend to have high career wRC+ values. Mike Schmidt, Chipper Jones, Eddie Mathews, George Brett, Wade Boggs, Dick Allen, and Al Rosen all sit in the upper-right region. Players with lower offensive z-score averages tend to have lower wRC+ values.

This is a strong validation of Model C.

The z-score model is not simply rewarding raw counting totals. It is recovering a meaningful offensive signal that corresponds closely to an established offensive metric.

Figure 3: Actual Versus Predicted wRC+

Figure 3. Actual versus predicted career wRC+ using the offense-only model.

The actual-versus-predicted plot shows how well the model estimates wRC+.

The prediction equation is:

\widehat{wRC^+} = 100.89 + 5.41(\text{Average Offensive Score})

The residual is:

\text{Residual}_i = wRC^+_i - \widehat{wRC^+}_i

Players near the diagonal are well predicted. Players above the diagonal have higher wRC+ than the model predicts. Players below the diagonal have lower wRC+ than the model predicts.

This figure shows that the model captures the broad structure very well, but it also shows useful outliers.

That is important.

The purpose of validation is not only to confirm that the model works. It is also to identify where it differs from an established metric.
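The residual calculation can be sketched as a small ranking routine. The coefficients are the fitted values from this chapter. The player labels, scores, and wRC+ values are placeholders rather than data from the study.

```python
def residual_table(players, alpha=100.89, beta=5.41):
    """Return (name, residual) pairs sorted from most positive to most negative.

    `players` is a list of (name, average_offensive_score, actual_wrc_plus).
    Residual = actual wRC+ minus predicted wRC+.
    """
    rows = []
    for name, score, actual in players:
        predicted = alpha + beta * score
        rows.append((name, actual - predicted))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Placeholder inputs, for illustration only.
players = [
    ("Hitter X", 2.0, 120.0),   # actual above prediction
    ("Hitter Y", 2.0, 105.0),   # actual below prediction
    ("Hitter Z", 0.0, 100.89),  # exactly on the fitted line
]
table = residual_table(players)
```

The top of the sorted table corresponds to players above the diagonal, and the bottom to players below it.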

Figure 4: The Defensive Negative Control

Figure 4. The traditional defensive score does not meaningfully predict wRC+.

The negative-control model is:

wRC^+ = \alpha + \beta_1(\text{Traditional Defensive Score per Qualified Season}) + \varepsilon

The fitted equation is:

wRC^+ = 105.34 - 1.49(\text{Traditional Defensive Score per Qualified Season})

R^2 = 0.022

This means traditional defense explains only about 2.2 percent of the variation in career wRC+ among regular third basemen.

That is a very small relationship.

This is one of the most important findings in the chapter. It shows that the validation is specific. Offensive z-scores predict offensive value. Traditional defensive z-scores do not.

The negative-control test strengthens the model.

It tells us that Model C is not simply identifying famous players or good players in general. It is identifying an offensive quality.

Figure 5: Residuals

Figure 5. Largest wRC+ residuals from the offensive z-score model.

The residual equation is:

\text{Residual}_i = wRC^+_i - \widehat{wRC^+}_i

Positive residuals mean the player’s FanGraphs wRC+ is higher than the z-score model predicts.

Negative residuals mean the player’s FanGraphs wRC+ is lower than the z-score model predicts.

The largest positive residuals include:

Edwin Encarnacion
David Freese
Dick Allen
Cal Ripken Jr.
Deacon White
Joe Torre
Larry Parrish
Wade Boggs

These players had higher wRC+ values than the offensive z-score model predicted.

The largest negative residuals include:

Ossie Vitt
Art Devlin
Jim Gilliam
Jose Ramirez
Billy Werber
Bob Jones
Chone Figgins
Hans Lobert

These players had lower wRC+ values than the model predicted.

The residuals are not merely mistakes. They show where the two systems differ.

Interpreting the Positive Residuals

Positive residuals are especially interesting because they identify players whose wRC+ is better than our average offensive z-score model expects.

For example, Edwin Encarnacion has a large positive residual. His career wRC+ is much stronger than his average third-base z-score profile suggests. This may reflect the fact that much of his offensive identity was formed outside a long traditional third-base career. Since our model focuses on qualified third-base seasons, while FanGraphs career wRC+ reflects his broader batting career, the comparison can produce differences.

David Freese also appears as a positive residual. His wRC+ is higher than expected from the third-base z-score model.

Dick Allen is another important case. He had enormous offensive quality, and his wRC+ remains higher than the model predicts, even though the model already rates him strongly.

Wade Boggs is also above prediction. That may reflect the way wRC+ values his on-base skill and batting quality more directly than a model that also includes runs, RBI, power, and baserunning components.

Interpreting the Negative Residuals

Negative residuals tell the opposite story.

Ossie Vitt is much lower in wRC+ than the offensive z-score model predicts. Art Devlin, Jim Gilliam, Jose Ramirez, Billy Werber, Bob Jones, Chone Figgins, and Hans Lobert also fall below prediction.

These cases require careful interpretation.

Some players may be rewarded in our Model C framework because they separated from their third-base peers in components that do not translate as strongly into wRC+. Runs, RBI, stolen-base value, and contact profile can influence the z-score model differently than wRC+.

Jose Ramirez is especially interesting. The model predicts a higher wRC+ than his current FanGraphs career mark. That may reflect his strong same-position separation across multiple components, including power, walks, baserunning, runs, and RBI. It may also reflect the fact that his career is still active.

A negative residual does not mean the z-score model is wrong. It means the z-score model and wRC+ are measuring offense from different angles.

That difference is analytically useful.

What the wRC+ Validation Shows

The wRC+ validation supports the offensive model in three ways.

First, the relationship is strong:

R^2 = 0.740

Second, the slope is meaningful:

\beta_1 = 5.41

That means each additional average offensive z-score point corresponds to about 5.41 points of wRC+.

Third, the negative-control test works:

R^2_{\text{Defense Only}} = 0.022

Traditional defense does not predict wRC+.

That is exactly what should happen if the model is behaving properly.

Why This Complements the WAR Validation

The WAR validation and wRC+ validation answer different questions.

The WAR validation asked:

Do offense and traditional defense together predict total value?

The answer was yes.

The career-level offense-plus-defense model for regular third basemen had:

R^2 = 0.814

The wRC+ validation asks:

Does the offensive z-score model predict offensive quality?

The answer is also yes.

The average offensive score model has:

R^2 = 0.740

Together, these two validation studies are stronger than either one alone.

WAR validates the broader two-dimensional structure.

wRC+ validates the offensive dimension specifically.

The negative control confirms that the defensive dimension is not pretending to be offense.

This gives the project a stronger methodological foundation.

What the Study Does Not Prove

This chapter should not be overread.

It does not prove that Model C is better than wRC+. It does not prove that wRC+ is perfect. It does not prove that every residual is meaningful. It does not prove that the z-score model captures park effects, full run values, league quality, or all contextual differences.

The FanGraphs file used here is at the career level. That means this chapter does not yet test season-by-season wRC+ against season-by-season z-scores.

A season-level wRC+ study would be even cleaner because it would compare:

\text{Season Offensive Z-Score}

directly against:

\text{Season } wRC^+

That should be the next step if we obtain a season-level FanGraphs export.

For now, this chapter provides strong career-level validation.

Conclusion

The wRC+ validation study answers a direct question:

Do third-base offensive z-scores predict an established offensive metric?

Yes.

Among third-base regulars, the average Model C offensive score per qualified season strongly predicts FanGraphs career wRC+:

wRC^+ = 100.89 + 5.41(\text{Average Offensive Score})

R^2 = 0.740

The cumulative offensive score also predicts wRC+, but less strongly:

R^2 = 0.661

Traditional defense does not meaningfully predict wRC+:

R^2 = 0.022

That is exactly the pattern we wanted.

The offensive model predicts offense.

The defensive model does not.

The combined validation framework now has both breadth and specificity.

The WAR study showed that offense plus defense predicts total value.

The wRC+ study shows that the offensive z-score model predicts offensive quality.

That is a major validation result for the third-base project.

 

Squam Lake (Flash Fiction)

Kellen was dead, and that was a good thing. She felt safe, as safe as a young woman prancing around the middle of Reverse Vampire territory could. She thought she knew what was what (after all, she was a woman of the world, right?). Lucky for her, I’ve got her back.

Behold all who hear me; I am a modern-day Van Helsing. And, yes, I am talking about THAT Van Helsing.

Author’s Note: Not that I need to brag, but I am a direct descendant of the great Van Helsing. Yeah, howdy, little old me, the man nearly everyone calls Hillbilly Jedediah, carries the DNA of the greatest monster hunter that ever lived. What does your DNA look like once it is untangled and exposed?

My tale won’t take long to tell. I am working on a memoir, but I need to live several hundred more years before any publisher worth their salt will give me a sit-down. So, here it is (such as it is).

It was a day like any other at Squam Lake, androids were dreaming of electric sheep, and the U.S. dollar was in a deadly tug of war with the Japanese Yen. All seemed to be right with the world. Of course, I didn’t sleep; how could I when all h-e-double-hockey-sticks was breaking loose everywhere I looked? I can’t save everyone; that’s impossible; I have to pick and choose. On this day, for reasons beyond my capacity to understand, I decided to give her my attention. Usually, I would say that if someone is foolish enough to go to Reverse Vampire Central (during an RV convention, no less), they deserve whatever they get.

How did I find him out? It’s just one of those things, some real inexplicable nonsense. It was the kind of lapse that can be made 1000 times and never get you into trouble. Maybe it is just lousy RV karma. Maybe he “just ain’t living right,” as every evangelical will tell you is the reason for everything bad that happens to any poor son of a biscuit that happens to zig when they should have zagged. Yeah, it finally happened; I was able to expose him, to show him for what he truly is. I exposed him, I directed a bright light on his deepest colors.

It was a simple e-mail…short, nothing more than a few words. I intercepted it the way I usually do; a simple keylogger sent the message directly to me. “They are tricksy rabbits.” That is all he had to write. What happened next will make your toes curl.

After I received the message, I called her in two seconds. “Get the heck out of there, dagnabbit; he is the one I have been looking for. Evan is the Reverse Vampire! I am sure of it; run as fast as you can.”

She made it two steps before her left hamstring was ripped from her leg. I didn’t want to think about what I knew he would do with the fresh, human meat. One thing is sure: he didn’t like it at room temperature.

I could immediately sense it; I felt her pain. What else could I do? I gathered up my resolve, opened a portal, and headed east. You know, I didn’t have to save her; it wasn’t my job. Looking back, I guess I kind of felt sorry for her. Who knows, maybe I even liked her. I have since given it lots of thought, and I still don’t know why I risked my life that day.

The incantation complete, the portal opened up only a few feet from Evan.

“Put her down, Now!”

Evan looked back at me; he was half-crazed, licking the blood off the detached muscle. I could tell he was silently cursing in his feeble little mind, a half-sized brain with only enough room inside for murder and carnage.

So, I did it; I used The Device. It does take a heck of a toll on me, but, like I said, I guess maybe I like her. As it stands, she is fine (I sent her back to a time just before the trip to Squam Lake), Evan is a fetus (best I could do), and I really need a beer. On second thought, my cousin, Naomi Crump, makes the vilest moonshine I have ever experienced, and I could use a week-long bender.

 

The Potato Paradox Is Not Really a Paradox

The potato paradox is one of those little mathematical oddities that feels impossible the first time you hear it.

Suppose you have 100 pounds of potatoes. The potatoes are 99 percent water. After sitting out for a while, they dry slightly and reach 98 percent water content.

How much do they weigh now?

The instinctive answer is something close to 99 pounds. After all, the water percentage only dropped by one point. How much difference could that make?

The correct answer is 50 pounds.

That is the shock of the potato paradox. A change from 99 percent water to 98 percent water halves the total weight.

At first glance, this feels absurd. But there is no contradiction. The trick is not in the arithmetic. The trick is in the denominator.

The key idea is that the amount of non-water material does not change. The potatoes lose water, but they do not lose dry potato matter.

Let the initial total weight be:

W_0 = 100

Let the initial water fraction be:

p_0 = 0.99

The dry matter is the part that is not water:

D = W_0(1 - p_0)

Substituting the values:

D = 100(1 - 0.99) = 1

So the original 100 pounds of potatoes contains 99 pounds of water and 1 pound of dry matter.

That 1 pound is the anchor of the whole problem.

After drying, the potatoes are 98 percent water. That means they are 2 percent dry matter. But the dry matter is still 1 pound. So we need to find the new total weight W1 such that 1 pound is 2 percent of the total.

The equation is:

D = W_1(1 - p_1)

where:

p_1 = 0.98

So:

W_1 = \frac{D}{1 - p_1} = \frac{1}{0.02} = 50

The potatoes now weigh 50 pounds.

That means the water weight has fallen from 99 pounds to 49 pounds:

99-49=50

So the potatoes lost 50 pounds of water.
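The arithmetic is easy to check in code. This is a minimal sketch that holds the dry matter fixed, as the argument above does. The function and parameter names are illustrative.

```python
def final_weight(initial_weight, initial_water_fraction, final_water_fraction):
    """Total weight after drying, holding the dry matter constant."""
    for fraction in (initial_water_fraction, final_water_fraction):
        if not 0 <= fraction < 1:
            raise ValueError("water fractions must be in [0, 1)")
    dry_matter = initial_weight * (1 - initial_water_fraction)
    return dry_matter / (1 - final_water_fraction)


new_weight = final_weight(100, 0.99, 0.98)
water_lost = 100 - new_weight
print(round(new_weight, 6), round(water_lost, 6))
```

With the classic inputs, both the new total weight and the water lost come out to 50 pounds, up to floating-point rounding.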

The paradoxical feeling comes from confusing a small percentage-point change with a small physical change. Going from 99 percent water to 98 percent water sounds tiny because the percentage dropped by only one point. But the dry matter share doubled.

Originally, the dry matter was 1 percent of the total:

\frac{1}{100} = 0.01

After drying, the dry matter is 2 percent of the total:

\frac{1}{50} = 0.02

The dry matter did not increase. The denominator decreased.

That is the entire puzzle.

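The general formula clarifies the structure. If the initial weight is W0, the initial water fraction is p0, and the final water fraction is p1, then the dry matter is:

D = W_0(1 - p_0)

The final weight is:

W_1 = \frac{D}{1 - p_1}

Substituting the expression for D:

W_1 = \frac{W_0(1 - p_0)}{1 - p_1}

So the general potato paradox equation is:

W_1 = W_0 \cdot \frac{1 - p_0}{1 - p_1}

For the classic potato problem:

W_1 = 100 \cdot \frac{1 - 0.99}{1 - 0.98} = 100 \cdot \frac{0.01}{0.02} = 50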

This is why the puzzle is so effective. The numbers look nearly identical:

99\% \quad \text{and} \quad 98\%

But the meaningful comparison is not between 99 and 98. It is between the dry percentages:

1\% \quad \text{and} \quad 2\%

That is a doubling.

The closer a quantity is to 100 percent water, the more sensitive the total weight becomes to small changes in the water percentage. This can be seen by writing the total weight as a function of the water fraction:

W(p) = \frac{D}{1 - p}

Here D is fixed. The only thing changing is p, the water fraction. As p approaches 1, the denominator becomes very small. A small change in the denominator can produce a large change in the total.

The sensitivity is visible in the derivative:

\frac{dW}{dp} = \frac{D}{(1 - p)^2}

As p approaches 1, the denominator (1 - p)^2 becomes extremely small. That makes the total weight very sensitive to changes in p.
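The sensitivity can also be tabulated numerically. The sketch below assumes the fixed-dry-matter relationship W(p) = D / (1 - p) and its derivative D / (1 - p)^2, with D set to 1 pound as in the classic problem. The function names are illustrative.

```python
def total_weight(dry_matter, water_fraction):
    """Total weight when dry matter is fixed: W(p) = D / (1 - p)."""
    return dry_matter / (1 - water_fraction)


def weight_sensitivity(dry_matter, water_fraction):
    """Rate of change of total weight with water fraction: D / (1 - p)^2."""
    return dry_matter / (1 - water_fraction) ** 2


# One pound of dry matter at several water fractions.
for p in (0.50, 0.90, 0.98, 0.99):
    print(p, round(total_weight(1, p), 2), round(weight_sensitivity(1, p), 2))
```

The sensitivity column grows much faster than the weight column as p nears 1, which is why a one-point change near 99 percent moves the total so much more than the same change near 50 percent.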

This is not just some kind of bizarre potato trick. It is a lesson about ratios, percentages, and hidden bases. Percentages are always percentages of something. When that “something” changes, intuition can fail.

The same kind of error appears in many places. A business may say its costs fell from 99 percent of revenue to 98 percent of revenue, which sounds modest. But if profit rises from 1 percent to 2 percent, profit has doubled. A baseball player’s out rate, a hospital’s survival rate, an investment’s expense ratio, or a website’s conversion rate can all create similar illusions. Near the extremes, small percentage-point changes can hide large relative changes.

So is the potato paradox really a paradox? Not in the strict sense.

The potato paradox is most properly classified as a veridical paradox: a result that appears impossible at first but is actually true. Its force comes from a denominator effect. The dry matter remains fixed while the total weight changes, so a one-percentage-point drop in water content produces a surprisingly large drop in total weight.

A true paradox usually involves a contradiction, or at least a deep tension between two apparently valid ideas. The potato paradox does not contain a contradiction. It contains a surprise. Once the dry matter is kept fixed, the result follows directly.

The puzzle feels paradoxical because our intuition focuses on the water percentage. The math focuses on the dry matter percentage. Those are complements, but psychologically they behave very differently.

The statement “the potatoes go from 99 percent water to 98 percent water” sounds like almost nothing changed.

The statement “the potatoes go from 1 percent dry matter to 2 percent dry matter” sounds much more dramatic.

Both statements describe the same situation. One hides the effect. The other reveals it.

That is why the potato paradox is useful. It reminds us that percentages are not self-explanatory. We have to ask what the denominator is, what remains fixed, and what is actually changing.

The potatoes did not violate logic. They exposed a weakness in ordinary intuition.

The paradox is not in the potatoes; it lies in how we perceive percentages.

 

 

Below the Line: The Lowest-Scoring Qualified Offensive Third Basemen

Introduction

The earlier chapters looked at greatness.

They asked which third basemen separated most strongly from their positional peers. That led naturally to players such as Mike Schmidt, Chipper Jones, Eddie Mathews, George Brett, Wade Boggs, Jose Ramirez, and others. Those players live in the upper tail of the distribution. They are the positive outliers.

The previous chapter reversed the question and studied the center. It asked which third basemen were most average, which players sat closest to the offensive norm of the position.

This chapter moves to the other side.

It asks: Which qualified third basemen were farthest below the offensive standard of their own positional peers?

That question is not the same as asking who the worst third basemen were. This study is offense-only. It does not include defense, throwing, range, durability beyond qualification, leadership, baserunning beyond the offensive variables included in Model C, postseason value, or WAR. A player could score poorly here and still have had defensive value. He could have stayed in the lineup because of glove work, team need, positional scarcity, reputation, or organizational context.

For that reason, the most accurate wording is: lowest-scoring qualified offensive third basemen.

That wording is important. All these men were professional athletes. Many of us would love to be on this list.

The results are still interesting. In the combined Model A and Model C framework, the lowest-scoring multi-season offensive third baseman is Ken Reitz. He is followed by Aurelio Rodriguez, Charley Smith, Ke’Bryan Hayes, Lee Tannehill, Pedro Feliz, Bob Aspromonte, Bubba Phillips, Placido Polanco, and Frank O’Rourke.

The single-season list is different. The lowest Model C third-base season belongs to Jimmy Austin in 1912, followed by Chris Truby in 2002, Matt Dominguez in 2014, Chris Johnson in 2014, Eddie Mulligan in 1921, and Billy Purtell in 1910.

Together, these lists ask a deeper baseball question: How can a player qualify repeatedly while remaining far below the offensive center of his position?

The answer is likely found in the parts of the game this model does not measure.

The Framework

The same basic scoring system used in the dominance chapters is used here.

Each player-season is compared only to other qualified third basemen from the same season. This means a third baseman from 1912 is not directly compared to one from 2014. Each player is judged against the offensive expectations of his own season and position.

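The basic z-score equation is:

\[ z = \frac{x - \mu}{\sigma} \]

Where:

\(x\) is the player’s value in a given category for that season, \(\mu\) is the mean of that category among qualified third basemen in the same season, and \(\sigma\) is the standard deviation of that category within the same group.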

A z-score of zero means the player was exactly average in that category. A positive z-score means he was above average. A negative z-score means he was below average.
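As a minimal sketch of the peer comparison described above, the following Python computes z-scores for one category within a single season. The player names and on-base percentages are hypothetical, the function name is illustrative, and the use of the population standard deviation is an assumption, since the text does not say which form was used.

```python
from statistics import mean, pstdev

def season_z_scores(values):
    """Return a z-score for each player's value within one season's peer group."""
    mu = mean(values.values())
    # ASSUMPTION: population standard deviation; the chapter does not specify.
    sigma = pstdev(values.values())
    if sigma == 0:
        # Every player is identical in this category, so nobody separates.
        return {name: 0.0 for name in values}
    return {name: (x - mu) / sigma for name, x in values.items()}

# Hypothetical on-base percentages for four qualified third basemen in one season.
obp_by_player = {
    "Player A": 0.380,
    "Player B": 0.340,
    "Player C": 0.320,
    "Player D": 0.300,
}

z = season_z_scores(obp_by_player)
for name, score in z.items():
    print(f"{name}: {score:+.2f}")
```

Because the mean and standard deviation come only from that season's qualified third basemen, a 1912 player and a 2014 player are never placed on the same raw scale.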

The Model A season score is built from the category z-scores for on-base skill, slugging, home-run rate, walks, runs, and RBI.

The Model C season score uses a broader offensive framework. It is built from the category z-scores for isolated power, walks, contact, net stolen-base value, run scoring, and RBI production.

Isolated power is slugging percentage minus batting average:

\[ \mathrm{ISO} = \mathrm{SLG} - \mathrm{AVG} \]

Net stolen-base rate credits stolen bases and debits times caught stealing, expressed as a rate rather than a raw total.

The strikeout component is inverted because fewer strikeouts are better:

\[ z_{K}^{\text{inv}} = -z_{K} \]

The raw score is then weighted by playing time.

For the lowest-scoring study, the logic is simple.

In the dominance chapters, higher scores were better.
In this chapter, lower scores identify weaker offensive separation.

A strongly negative season score indicates the player was far below the third-base peer group across the model’s categories taken together, even if he was not below average in every single one.
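The scoring steps above can be sketched in Python. This is an illustration under stated assumptions, not the study's actual formula: the equal category weights, the form of the playing-time weight, the reference plate-appearance total, and all of the z-score values are hypothetical. Only the category list and the inverted strikeout term come from the text.

```python
def model_c_raw_score(z):
    """Combine category z-scores into a raw Model C style score.

    ASSUMPTION: every category gets equal weight; the chapter does not list
    its weights. The strikeout term is subtracted so that fewer strikeouts
    (a negative strikeout z-score) raise the total.
    """
    return z["iso"] + z["bb"] - z["k"] + z["net_sb"] + z["runs"] + z["rbi"]

def playing_time_weight(pa, reference_pa):
    """ASSUMPTION: plate appearances relative to a reference total, capped at 1."""
    return min(pa / reference_pa, 1.0)

# Hypothetical category z-scores for one weak offensive season.
z = {"iso": -1.2, "bb": -0.8, "k": 0.5, "net_sb": -0.3, "runs": -1.0, "rbi": -0.9}

raw = model_c_raw_score(z)
weighted = raw * playing_time_weight(pa=520, reference_pa=650)
print(f"raw = {raw:.2f}, weighted = {weighted:.2f}")
```

In this sketch the above-average strikeout rate (z = 0.5) pulls the score down rather than up, which is the effect of the inversion.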

Measuring Multi-Season Weakness

Single seasons can be strange. A player can have one unusually poor season due to injury, age, bad luck, or a temporary collapse.

A multi-season regular is different.

For that reason, this chapter also calculates an average season score for players with at least five qualified third-base seasons.

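The career average is:

\[ \bar{S} = \frac{1}{n} \sum_{i=1}^{n} S_i \]

where \(S_i\) is the player’s score in qualified season \(i\) and \(n\) is the number of qualified seasons.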

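The combined Model A and Model C average score is the mean of the two career averages:

\[ \bar{S}_{\text{combined}} = \frac{\bar{S}_{A} + \bar{S}_{C}}{2} \]

where \(\bar{S}_{A}\) and \(\bar{S}_{C}\) are the player’s average season scores under Model A and Model C.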

This combined score identifies players who were low-scoring under both definitions of offense.

That is important because a player might look poor under one model but less poor under another. The combined list is stricter. It asks which players remained far below average whether offense was defined by the Model A power/run-production framework or by the broader Model C framework.
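A short Python sketch of the multi-season calculation follows. The five-season threshold comes from the text. The season scores are hypothetical, and treating the combined score as the simple mean of the two career averages is an assumption, though one that is consistent with the published tables.

```python
from statistics import mean

MIN_SEASONS = 5  # qualification threshold stated in the chapter

def career_average(season_scores):
    """Average season score, or None if the player lacks enough qualified seasons."""
    if len(season_scores) < MIN_SEASONS:
        return None
    return mean(season_scores)

def combined_average(avg_a, avg_c):
    """ASSUMPTION: the combined score is the simple mean of the two model averages."""
    return (avg_a + avg_c) / 2

# Hypothetical season scores for one player under each model.
model_a_seasons = [-5.0, -4.0, -6.0, -3.0, -2.0]
model_c_seasons = [-4.0, -3.0, -5.0, -4.0, -4.0]

avg_a = career_average(model_a_seasons)
avg_c = career_average(model_c_seasons)
print(f"Model A avg = {avg_a:.2f}, Model C avg = {avg_c:.2f}, "
      f"combined = {combined_average(avg_a, avg_c):.2f}")
```

A player with four qualified seasons returns None and drops out of the multi-season lists, no matter how poor those four seasons were.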

The Lowest-Scoring Multi-Season Third Basemen

The combined Model A and Model C results identify the lowest-scoring multi-season third basemen.

Rank  Player             Years      Qualified Seasons  Combined Avg. Score
1     Ken Reitz          1973–1980  8                  -5.28
2     Aurelio Rodriguez  1969–1980  12                 -4.69
3     Charley Smith      1961–1967  5                  -4.49
4     Ke’Bryan Hayes     2021–2025  5                  -4.01
5     Lee Tannehill      1904–1909  5                  -3.90
6     Pedro Feliz        2004–2010  7                  -3.81
7     Bob Aspromonte     1962–1971  8                  -3.69
8     Bubba Phillips     1957–1963  6                  -3.59
9     Placido Polanco    2001–2013  6                  -3.29
10    Frank O’Rourke     1926–1930  5                  -3.23
11    Chris Johnson      2010–2014  5                  -3.11
12    Maikel Franco      2015–2022  7                  -2.97
13    Ed Sprague         1993–1999  7                  -2.96
14    Ray Knight         1979–1987  7                  -2.92
15    Enos Cabell        1976–1982  6                  -2.87

Ken Reitz is the most prominent result. His combined average score of -5.28 is far below the third-base peer baseline. He ranked first in the Model A low-score list and second in the Model C low-score list. That means his offensive weakness was not a product of one particular model. It appeared under both definitions.

Aurelio Rodriguez is second. His result is especially notable because he had twelve qualified third-base seasons in the study. That is a long run. A player who qualifies that often is doing something valuable enough to stay in the lineup. In this case, the value almost certainly lies outside this offensive model.

Charley Smith ranks third overall and first in Model C alone. That makes him one of the clearest examples of a player whose broad offensive profile sat far below the third-base baseline.

Ke’Bryan Hayes ranks fourth in the combined list through 2025. That is a striking modern result. It should be interpreted carefully because he is still an active player, and his defensive reputation is not part of the model. In fact, Hayes is a useful reminder of why this chapter must remain offense-only. A low offensive score does not equal low total player value.

Pedro Feliz, Bob Aspromonte, Bubba Phillips, Placido Polanco, and others reinforce the same point. Several of these players had reputations or roles that extended beyond offensive production. The model captures only their offensive separation from third-base peers.

The Lowest-Scoring Model C Regulars

Model C alone gives a slightly different list.

The top ten lowest-scoring Model C third-base regulars are:

Rank  Player             Years      Qualified Seasons  Avg. Model C Score
1     Charley Smith      1961–1967  5                  -5.41
2     Ken Reitz          1973–1980  8                  -4.95
3     Aurelio Rodriguez  1969–1980  12                 -4.52
4     Lee Tannehill      1904–1909  5                  -3.93
5     Pedro Feliz        2004–2010  7                  -3.63
6     Jim Presley        1985–1990  6                  -3.46
7     Chris Johnson      2010–2014  5                  -3.40
8     Ed Sprague         1993–1999  7                  -3.16
9     Bob Aspromonte     1962–1971  8                  -3.14
10    Ke’Bryan Hayes     2021–2025  5                  -3.03

Model C pushes Charley Smith to the top. It also moves Jim Presley into the top ten, while some players who looked especially poor under Model A fall slightly.

This is significant because Model C rewards a low strikeout rate and net stolen-base value. A player who was weak under Model A might recover somewhat in Model C if he made contact, ran well for the position, or contributed in ways not captured by slugging and home-run rate. Conversely, a player can look worse in Model C if he lacks those broader offensive contributions.

The Model C list therefore does not simply duplicate Model A. It identifies players whose offensive weakness remained visible even when the model became broader.

The Lowest-Scoring Individual Seasons

Single-season results tell a different story.

The lowest-scoring Model C third-base seasons are:

Rank  Player-Season           Model C Score
1     Jimmy Austin, 1912      -8.69
2     Chris Truby, 2002       -8.47
3     Matt Dominguez, 2014    -8.12
4     Chris Johnson, 2014     -7.83
5     Eddie Mulligan, 1921    -7.76
6     Billy Purtell, 1910     -7.65
7     Jose Hernandez, 2003    -7.43
8     Ray Knight, 1987        -7.19
9     Terry Pendleton, 1996   -7.16
10    Todd Cruz, 1983         -7.16
11    Travis Jackson, 1936    -7.11
12    Brooks Robinson, 1958   -7.10
13    Brandon Drury, 2019     -7.07
14    Charley Smith, 1965     -7.04
15    Pete Suder, 1941        -7.01

Jimmy Austin’s 1912 season is the lowest Model C third-base season in the dataset. Chris Truby’s 2002 season is close behind. Matt Dominguez and Chris Johnson both appear in 2014, suggesting that the modern third-base peer group that year set a difficult offensive baseline for weaker performers.

The single-season list includes some surprising names. Brooks Robinson appears for 1958. Terry Pendleton appears for 1996. Ray Knight appears for 1987. Travis Jackson appears for 1936. These are reminders that a poor offensive season does not define a player’s entire career. A great defender, a former star, an aging veteran, or a player with a different value profile can still appear on a low offensive season list.

That is why the chapter separates seasons from regulars.

A bad season is a moment.
A low multi-season score is a pattern.

Model A Versus Model C: Agreement and Disagreement

The next question is whether the two models agree about the lowest-scoring third-base regulars.

The relationship between Model A average score and Model C average score is positive but not especially strong: \(R^2 = 0.218\).

This is an important result. The two models do not agree perfectly on offensive weakness.
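For readers who want to reproduce this kind of agreement check, here is a minimal Python sketch. It assumes the reported figure is the squared Pearson correlation between each player's Model A and Model C career averages, which is what R² equals for a simple one-variable linear fit. The input values are hypothetical and are not taken from the study.

```python
def r_squared(xs, ys):
    """Square of the Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

# Hypothetical career averages for five regulars under each model.
model_a = [-4.0, -3.0, -2.5, -1.0, 0.5]
model_c = [-2.0, -3.5, -1.0, -2.0, 0.0]

r2 = r_squared(model_a, model_c)
print(f"R^2 = {r2:.3f}")
```

A value near 1 would mean the two models order players almost identically. A value near 0.2 means most of the variation in one model's averages is not explained by the other.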

Some players are poor under both definitions. Ken Reitz, Aurelio Rodriguez, Lee Tannehill, Pedro Feliz, Bob Aspromonte, and Bubba Phillips fall into this group. Their low scores are relatively stable.

Other players are more model-sensitive. Placido Polanco, for example, ranks fourth on the Model A low-score list but thirty-fourth on the Model C low-score list. That means Model C saw more offensive value in his broader profile than Model A did. Enos Cabell shows a similar pattern, ranking seventh in Model A but forty-sixth in Model C.

Ke’Bryan Hayes is another interesting case. He ranks second in Model A and tenth in Model C. Model C does not erase the offensive weakness, but it makes it less extreme.

Charley Smith moves in the opposite direction. He ranks tenth in Model A but first in Model C. That suggests his broader offensive profile was even weaker than his Model A profile.
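The Model A averages themselves are not tabulated in this chapter, but they can be backed out for players who appear in both published tables. The sketch below assumes the combined score is the simple mean of the Model A and Model C averages, so the implied Model A average is twice the combined score minus the Model C score. Because the published figures are rounded to two decimals, each implied value may be off by a few hundredths.

```python
# (combined average, Model C average) copied from the two tables in this chapter.
published = {
    "Ken Reitz": (-5.28, -4.95),
    "Aurelio Rodriguez": (-4.69, -4.52),
    "Charley Smith": (-4.49, -5.41),
    "Ke’Bryan Hayes": (-4.01, -3.03),
    "Lee Tannehill": (-3.90, -3.93),
    "Pedro Feliz": (-3.81, -3.63),
    "Bob Aspromonte": (-3.69, -3.14),
    "Chris Johnson": (-3.11, -3.40),
}

def implied_model_a(combined, model_c):
    """ASSUMPTION: combined = (A + C) / 2, so A = 2 * combined - C."""
    return round(2 * combined - model_c, 2)

implied = {name: implied_model_a(*scores) for name, scores in published.items()}

# List the eight players from lowest implied Model A average to highest.
lowest_first = sorted(implied, key=implied.get)
for name in lowest_first:
    print(f"{name}: {implied[name]:.2f}")
```

Under that assumption, Reitz has the lowest implied Model A average, Hayes the second lowest, and Smith sits well above both, which matches the ordering described in the text. Smith and Johnson are the two players here whose Model C average is clearly lower than their implied Model A average.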

The low \(R^2\) is therefore not a problem. It is informative. It shows that offensive weakness, like offensive greatness, depends partly on how offense is defined.

Why Did These Players Qualify?

This is the baseball question beneath the numbers.

If these players scored so poorly on offense, why did they qualify for multiple seasons?

The answer is almost certainly that teams were not evaluating them by this offensive model alone.

Several explanations are possible.

First, some players had defensive value. Third base requires reaction time, arm strength, and infield skill. A weak hitter could remain in the lineup if he saved runs with the glove. Ken Reitz, Aurelio Rodriguez, Pedro Feliz, Ke’Bryan Hayes, and Brooks Robinson all remind us that third base has never been purely an offensive position.

Second, offensive expectations change by era. A third baseman who looks weak in one period may have been more tolerable because the league or position valued defense more heavily. The same-season peer adjustment controls for the offensive environment, but it does not control for managerial tolerance or roster construction.

Third, some players may have held jobs because of scarcity. Teams need someone to play third base every day. A club may accept weak offense if the alternatives are worse, injured, inexperienced, or defensively unplayable.

Fourth, reputation matters. Veterans sometimes continue to receive playing time after their offense declines. Single-season low scores often capture this. A player can be valuable earlier in his career and still produce a very poor qualified season later.

Fifth, team context matters. A weak-hitting third baseman on a strong offensive team may be easier to carry than the same player on a weak offensive team.

This is what makes the low-score study valuable. It does not merely identify poor offensive performances. It points toward the hidden parts of player value and team decision-making.

The Ethics of the Label

A chapter like this needs careful language.

It would be easy to call these players “the worst third basemen.” That would be inaccurate.

The model measures only offense. It does not measure defense. It does not measure total value. It does not measure WAR. It does not measure the reasons a manager kept writing a player’s name into the lineup.

A better phrase is:

lowest-scoring qualified offensive third basemen

or:

the weakest offensive third-base regulars in this peer-adjusted framework

That phrasing keeps the result honest.

Ken Reitz may rank first here, but the statement is not “Ken Reitz was the worst third baseman.” The statement is:

Among players with at least five qualified third-base seasons, Ken Reitz had the lowest combined average offensive score in the Model A and Model C framework.

That is precise.

Precision matters, especially when the result is negative.

Comparison With Averageness

This chapter also helps clarify the difference between average and weak.

The previous chapter identified Casey Blake as the most average combined third-base regular. Blake was close to the center of the third-base offensive distribution. His profile was neither strongly positive nor strongly negative.

Ken Reitz is different. He was not centered. He was far below the offensive center. His negative average score means he repeatedly trailed his third-base peers across the model categories.

The distinction can be summarized this way:

Casey Blake = closest to the center

Ken Reitz = farthest below the center among multi-season regulars

Mike Schmidt = farthest above the center

That gives the third-base study a complete structure.

Dominance.
Averageness.
Weakness.

All three are relative to the same positional baseline.

What This Adds to the Larger Study

The low-score chapter adds an important dimension to the project.

The dominance chapters showed the upper tail. The average chapter showed the center. This chapter shows the lower tail.

Together, they make the distribution visible.

A position is not defined only by its stars. It is also defined by the players who stayed in the lineup despite weak offense. Those players reveal the position’s tolerance limits. They show where defense, reputation, scarcity, and roster construction may have mattered enough to overcome poor offensive production.

At third base, the lower tail includes both obscure names and recognizable ones. It includes long-career regulars, defensive specialists, aging veterans, and players with uneven offensive records. That variety makes the list more interesting than a simple ranking of failure.

The numbers identify the pattern. Baseball history explains why the pattern existed.

Conclusion

The lowest-scoring third-base study completes the first full positional distribution.

The main results are:

Lowest combined multi-season offensive regular: Ken Reitz

Lowest Model C multi-season offensive regular: Charley Smith

Lowest Model C individual third-base season: Jimmy Austin, 1912

Most notable modern low-score regular: Ke’Bryan Hayes

Most important caution: defense and WAR are not included

The results should be interpreted carefully. This is not a list of the worst third basemen in total value. It is a list of the lowest-scoring qualified offensive third basemen within this peer-adjusted framework.

That distinction makes the chapter stronger.

The most interesting question is not merely who scored lowest. It is why they played. A player who repeatedly qualifies despite weak offense must have offered something else, or must have occupied a context in which the team accepted the offensive cost.

That is where the baseball story begins.

The numbers show the lower tail.
The roster decisions explain why the lower tail existed.

Third base now has three points of reference:

Mike Schmidt: the upper tail

Casey Blake: the center

Ken Reitz: the lower tail

Together, they describe the full offensive shape of the position.