A Few Thoughts on MLB Batting Averages and Scoring

The folks with a serious interest in baseball have been meticulously recording the numbers the game generates since the 19th century, giving us one of the longest continuous statistical datasets in professional sports. Using MLB league totals from 1871 through 2025, I have traced the story of offense through a single, elegant metric: runs per game per team (R/G).

The chart below (based on raw data graciously provided by baseball-reference.com) visualizes the average runs scored per game per team by decade, beginning in the 1920s, an era often considered the dawn of modern baseball. I view 1920 as the beginning of the modern era, mainly due to the standardization of the balls used in the games. Before this date, the balls were haphazardly procured; there were no standards imposed, and none were implied. One game might finish with a score of 43–36, and the next might be 2–1. This was a result of the baseball (and yes, I mean singular ball) used in the game.


The figure tells an interesting story:

  • 1930s: Offensive explosion. The live-ball era fully matured, and league scoring topped 5 runs per game.

  • 1960s: The “Pitcher’s Decade.” Offense collapsed, bottoming out at 3.42 R/G in 1968, the “Year of the Pitcher.”

  • 1990s: The power surge. League scoring rebounded to nearly 5 runs per game, driven by expansion, smaller parks, and the home-run boom. Surely, there are no other explanations, right? Cough, cough, hack, hack…

  • 2020s: The analytics paradox (but not really). Despite smarter lineups and stronger hitters, offense has fallen again, down to 4.4 R/G in recent seasons. More on this later…
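For readers who want to reproduce the decade view, here is a minimal sketch of the averaging step. It assumes you already have one league-wide R/G value per season; the function name and the sample numbers are hypothetical placeholders, not the real league totals. It also weights every season equally, which is an assumption rather than a description of how the chart was built.

```python
from collections import defaultdict


def decade_average_rpg(season_rpg):
    """Average league runs per game per team (R/G) by decade.

    season_rpg maps a season year to that season's league R/G.
    Every season counts equally, regardless of schedule length.
    """
    by_decade = defaultdict(list)
    for year, rpg in season_rpg.items():
        decade = (year // 10) * 10  # 1968 -> 1960
        by_decade[decade].append(rpg)
    return {d: sum(v) / len(v) for d, v in sorted(by_decade.items())}


# Placeholder values for illustration only, NOT real league totals.
sample = {1930: 5.0, 1931: 5.2, 1968: 3.5, 1969: 4.1}
for decade, avg in decade_average_rpg(sample).items():
    print(f"{decade}s: {avg:.2f} R/G")
```

A games-weighted average would be a reasonable alternative for strike-shortened or pandemic-shortened seasons.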

BATTING AVERAGES

 

While run scoring has fluctuated wildly, the league batting average has remained remarkably stable. From 1920 onward, the overall mean is .262, almost identical to the all-time mark of .260 since 1871.

The highest batting averages came during the explosive decades of the 1920s and 1930s, while today’s hitters hover around .245, the lowest sustained level since the Dead Ball Era (1900-1920).

ANALYTICS

The offensive (and defensive) landscape of MLB can’t be understood without the analytics revolution, which ushered in a seismic shift in how teams interpret performance. It is, without doubt, the most transformative movement in the history of the game.

Baseball’s analytics revolution unfolded in three waves. The first began in the late 1970s, when writer Bill James published his Baseball Abstracts and coined the term “sabermetrics,” introducing a generation of fans and front offices to the idea that baseball could be studied scientifically. The second wave arrived around 2000, when the Oakland Athletics—immortalized in Moneyball—used data-driven roster construction to compete on a small budget. Their success sparked a league-wide shift toward on-base percentage, run efficiency, and market inefficiency analysis. The third and most mind-bending stage came in 2015 with the introduction of Statcast, a tracking technology that measures exit velocity, launch angle, spin rate, and player movement in real time. Together, these eras changed baseball from a sport of intuition to one of precision, where every swing, pitch, and sprint is quantified and optimized.

The following chart overlays those analytical milestones onto league scoring trends. Note how the average runs per game increased steadily until mathematics started to play a central role in baseball strategy.


  • 🟠 2000 – Moneyball / Analytics Era: Teams begin valuing on-base skills and cost efficiency.

  • 🔴 2015 – Statcast Era: Tracking technology transforms player evaluation and biomechanics.

Interestingly, runs per game spiked during the early pre-Moneyball years (late 1990s) but declined sharply once every team adopted similar analytical models. The advantage disappeared as the playing field leveled and pitchers harnessed data to exploit hitters’ weaknesses. League-wide defense also vastly improved; the players had a much better idea of where to position themselves batter by batter and pitch by pitch.

THE APPARENT DATA PARADOX

Baseball-flavored analytics were initially designed to optimize offense, yet their full integration has arguably optimized defense and pitching instead. By 2025, batting averages and runs per game are both at their lowest sustained levels in decades—even as individual player performance is measured with unprecedented precision.

The result is a kind of equilibrium: fewer balls in play, more strikeouts and home runs, and an ongoing debate about whether efficiency has made the game better or simply duller.

And yes, there is a strong parallel between what has happened in baseball and what the 3-point shot has brought to the NBA. Just as basketball front offices realized that a 3-point shot is worth 50% more than a regular 2-point shot, baseball players were strongly advised that a home run is worth a lot more than a single or walk.
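The 3-point arithmetic is easy to check. The sketch below computes expected points per shot and the 3-point make rate needed to match a given 2-point rate; the 50% shooting figure is an illustrative assumption, not league data.

```python
def expected_points(make_probability, shot_value):
    """Expected points from one attempt."""
    return make_probability * shot_value


def break_even_three_point_pct(two_point_pct):
    """3-point make rate with the same expected value as a 2-pointer."""
    return two_point_pct * 2 / 3


two_pt = 0.50  # illustrative assumption
print(expected_points(two_pt, 2))
print(round(break_even_three_point_pct(two_pt), 3))
```

In other words, a shooter who makes half of his 2-pointers only needs to make about a third of his 3-pointers to break even.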

Take a moment to look over the following table. I am struck by the downward trend in batting average. It sure seems like the table is calling out for a similar study using on-base and slugging percentages. I will address this issue in a future post.

Metric                     1920–2000   2010s   2020s
Avg. Batting Avg. (BA)     .264        .254    .245
Avg. Runs per Game (R/G)   ~4.5        4.38    4.45

The 2010s and 2020s mark the first back-to-back decades of declining batting average since the 1960s. Despite this, run scoring remains relatively stable. Interesting, isn’t it? Even though there is only one batter and nine defenders, the offense-minded have concluded that home runs, even with the resultant declines in batting average and on-base percentage, are much more desirable than any other alternatives. This is a big reason why batting averages have gone down, defense and pitching have improved, and average runs per game have stayed consistent.

CONCLUSION

The numbers reveal something profound: baseball’s statistical evolution mirrors its cultural one, suggesting a fundamental constancy in its design. Each new wave of data, whether Bill James’ notebooks or Statcast’s terabytes, has changed how players are valued and how teams win. Yet through all of it, the sport’s core equilibrium remains intact. The league batting average, while steadily going down, still results in scoring of about 4½ runs per game, just as it did a hundred years ago. In the end, baseball adapts, but it rarely strays too far from its mathematical mean. I find that very intriguing.

The next post builds on the themes touched on in this short essay. I want to know where all the .300 hitters have gone, and I have decided to write about it. The next post will build on the work of Stephen Jay Gould, one of the most influential and essential evolutionary biologists of the last century. Perhaps most importantly, he was a big baseball fan who used his considerable talents to write about the sport he loved.

 

Analyzing Max Exit Velocity (2020)

In baseball analytics, exit velocity—specifically, the maximum exit velocity—is a critical metric. It measures the speed at which a ball leaves the bat, providing insights into a player’s power and potential impact. I am looking at max exit velocity data from the 2020 season. This visualization offers a clear and detailed view of how max exit velocities are distributed among players and a smoothed density estimate to reveal underlying trends. My first observation is amazement at how hard these balls are being hit. It is truly astonishing.

Forget batting average; this metric is more diagnostic than many others that are typically (especially historically) referenced. If you are putting a team together, you want players who hit the ball hard. And yes, the harder the better. This line of reasoning is all about a player’s ceiling; it has nothing to do with the dribbling groundballs that find a spot between defenders. Such “seeing eye” base hits are of little predictive value.

In 2020, exit velocity data’s importance escalated as teams began using it for more refined scouting and player development decisions. This season saw an exceptionally high interest in advanced metrics, partly because of the pandemic-shortened season. This led teams and analysts to seek more data-driven insights into player performance.

I used a histogram with an overlaid density curve to visualize max exit velocity data. Here’s what each part of this plot conveys:

  • Histogram: The histogram separates the exit velocity data into intervals (bins) and shows how many players achieved max exit velocities within each range. Each bar represents a specific range of velocities and provides a quick overview of where most data points (player exit velocities) lie.
  • Density Curve: The smoothed density curve overlaid on the histogram estimates the data’s distribution, offering insights into how the data might spread beyond discrete bins. This curve helps us visualize peaks and concentration points without the rigidity of bin divisions.
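The two pieces described above can be sketched without any plotting library. This is a minimal illustration using a hand-rolled Gaussian kernel density estimate; the velocities, bin width, and bandwidth are all hypothetical choices, not the values behind the actual plot.

```python
import math


def histogram_counts(values, start, width, n_bins):
    """Count values in n_bins equal-width bins beginning at `start`."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density at x."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


# Hypothetical max exit velocities in mph, for illustration only.
max_ev = [104.2, 106.8, 107.5, 108.1, 108.9, 109.4, 110.2, 111.0, 112.6, 116.3]
counts = histogram_counts(max_ev, start=104.0, width=2.0, n_bins=7)
density = gaussian_kde(max_ev, bandwidth=1.5)

print(counts)
print(density(108.5) > density(116.0))
```

A smaller bandwidth makes the curve follow individual players more closely, while a larger one smooths over them, which is the same trade-off as choosing narrower or wider bins.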

Key Insights from the 2020 Max Exit Velocity Data

  1. Concentration Around the Mean: The density curve reveals a central concentration of exit velocities in the range of approximately 105-111 mph. This concentration suggests that most players in the 2020 season achieved max exit velocities within this range, indicating a consistent performance level among players regarding hitting power.
  2. Distribution Shape: The distribution is roughly symmetric, with a slight skew towards higher velocities. This shape is typical in sports metrics, where most players fall near the average performance level while a few outliers achieve exceptional numbers.
  3. High-End Outliers: The density curve and histogram both suggest that a few players in 2020 achieved exceptionally high max exit velocities, reaching up to 118 mph. These outliers represent some of the league’s top power hitters, whose performances exceed the average exit velocities and pose a significant offensive threat to opposing teams. And in case you were wondering, Pete Alonso of the New York Mets hit a ball at 118.4 mph to lead the league. If facing such a batter, I would point to first base and take my chances with the next guy. If first were occupied, I certainly wouldn’t put anything over the plate. I wouldn’t even see the line drive coming back at me.

Why This Visualization Matters

A histogram with a density curve provides a quantitative view of max exit velocity data. This visualization helps scouts, coaches, and analysts quickly assess the distribution of max exit velocities across players. The density curve also offers a smooth, continuous view of the data, making it easier to observe trends and concentrations without the constraints of bin width.

Closing Thoughts

This histogram with a density overlay captures a snapshot of the league’s hitting power, revealing the typical max exit velocities and highlighting exceptional outliers.

This exemplifies how data analytics can deepen our understanding of baseball. By looking beyond averages and focusing on distribution, we gain a richer perspective on the league’s players. Whether you’re a data enthusiast or a baseball fan, this analysis offers a powerful glimpse into the metrics driving modern baseball.

 

Exploring Arm Strength in MLB (2020-2024): A Positional Comparison

Introduction

When I think about baseball, arm strength is one of the first things that comes to mind—especially when comparing players across different positions. Whether it’s a third baseman making a quick throw across the diamond (Brooks Robinson, anyone?) or an outfielder firing a rocket from the warning track (Roberto Clemente was awesome), a strong and accurate arm can make all the difference. Recently, I dove into some data from Major League Baseball covering the years 2020 to 2024 to better understand how arm strength varies by position, and I’d like to share what I found.

Comparing Average Arm Strength Across Positions

I started by looking at the average arm strength for each position. Unsurprisingly, outfielders—particularly those in right field—have the strongest arms, while positions like first base require less power behind the throw.

This bar chart shows the average arm strength for each position (excluding catcher) in miles per hour. Outfielders (RF, CF, LF) clearly lead the way, with center fielders and right fielders consistently throwing the hardest. It makes sense: outfielders must make long throws back into the infield, often in critical situations where arm strength is key.

Are you surprised? I might have thought that shortstops would have edged out left fielders and maybe even center fielders. That said, it is close.

As always, box plots allow us to get a more granular view of the raw data. Here is what I found.

Notice the outliers among first basemen. Lots of them get very little on their throws. That is unsurprising; many players are positioned there for their offense, with defense being an afterthought.

As readers of this blog know, I have a special relationship with violin plots. Here is the same data in that form.

Once again, the poor arms of a select group of first basemen are highlighted. I consider that fact to be a big takeaway from this plot.

Infield vs. Outfield: A Clear Difference

Next, I wanted to break things down further and compare infielders’ arm strength versus outfielders. Unsurprisingly, outfielders, who cover more ground and make longer throws, generally have stronger arms.

The box plot below shows the distribution of arm strength between infield and outfield players. Outfielders not only have higher average arm strength, but the range of arm strength is wider, too. Some outfielders, particularly those in right field, can really get after it when a runner is rounding second.

I would like to tell you something interesting about this plot. Over 35 years ago, I was taught a trick (more properly, a heuristic) at Harvard University. If there is a space between the bodies of the box plots, then the data set is worthy of further exploration. If you look closely, you can see a thin space between the boxes, so I decided to investigate further to see if the differences in arm strength are statistically significant. We will get to that in a bit.
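The gap-between-the-boxes heuristic can be checked numerically instead of by eye. Here is a minimal sketch, with hypothetical arm strengths in mph. It uses the standard library's default quartile method, which may differ slightly from the one a plotting library uses to draw its boxes.

```python
import statistics


def iqr_box(values):
    """Return (Q1, Q3), the bottom and top edges of a box plot's box."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q1, q3


def boxes_are_separated(group_a, group_b):
    """True when the two boxes do not overlap at all."""
    a_q1, a_q3 = iqr_box(group_a)
    b_q1, b_q3 = iqr_box(group_b)
    return a_q3 < b_q1 or b_q3 < a_q1


# Hypothetical arm strengths in mph, for illustration only.
infield = [78.0, 80.5, 82.0, 83.5, 85.0, 86.0]
outfield = [86.5, 88.0, 89.5, 90.0, 91.5, 93.0]

print(iqr_box(infield), iqr_box(outfield))
print(boxes_are_separated(infield, outfield))
```

A gap is only a prompt to look further; it is not a significance test in its own right.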

Looking for Patterns: Correlations Between Positions

Before we get to the hard-core statistics, I wanted to explore whether there is a relationship between arm strength at different positions. For instance, do shortstops tend to have arm strength similar to that of second basemen or third basemen? To find out, I ran a correlation analysis.

This heatmap shows how arm strength at one position correlates with another. There are some interesting patterns here—positions like second base (2B) and shortstop (SS) show a strong correlation, likely because they both require quick, strong throws in the infield. The outfield positions also show high correlations with each other, which makes sense given the similar demands placed on their arms.
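A correlation heatmap is just a grid of pairwise Pearson coefficients. The sketch below builds that grid by hand. It assumes, purely for illustration, that each list position is the same player measured at several positions; the numbers are made up, and the `arm_*` names are borrowed from the Tukey table later in this post.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """columns: dict of name -> list. Returns a nested dict of r values."""
    names = list(columns)
    return {
        a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names
    }


# Hypothetical data: index i is the same (made-up) player in every list.
arm_by_position = {
    "arm_2b": [78.0, 80.5, 82.0, 84.5, 86.0],
    "arm_ss": [80.0, 83.0, 83.5, 87.0, 88.5],
    "arm_3b": [79.5, 82.0, 85.0, 84.0, 89.0],
}

matrix = correlation_matrix(arm_by_position)
for name, row in matrix.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

Players who rarely appear at a second position would leave gaps in real data, so a real analysis would need a rule for handling missing pairs.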

Here are the Statistics

The results of the one-way ANOVA test (which compares group means by analyzing the variance between and within groups) indicate the following:

  • F-statistic: 261.67

Since the p-value is extremely small (well below the typical, and totally arbitrary, significance threshold of 0.05), we can reject the null hypothesis. This suggests statistically significant differences in arm strength across the different positions. In other words, the differences in arm strength are very unlikely to be an accident of sampling.
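For the curious, the F-statistic is the ratio of between-group variance to within-group variance. Here is a minimal sketch of that calculation on tiny made-up groups. It does not compute the p-value, which requires the F distribution (for example via `scipy.stats.f_oneway`), and it is not the code that produced the result above.

```python
def one_way_anova_f(groups):
    """F-statistic for a one-way ANOVA over a list of groups."""
    all_values = [v for g in groups for v in g]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Made-up groups for illustration only.
print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))
```

A large F means the positions differ from each other far more than players at the same position differ among themselves.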

I have never done this before in my blog, but I decided to take an even deeper dive into this data set. I view this blog as more or less an introduction to what I find interesting. I don’t want to get into the weeds; many blogs and websites do that. Today, though, is different. Early this morning, I ran my 4 miles despite not wanting to get out of bed. My hip, which needs to be replaced, barked the entire time. I guess I am in a mood… Here is what I did next.

group1   group2   meandiff   p-adj   lower    upper    reject
arm_1b   arm_2b   4.0267     0       2.74     5.30     TRUE
arm_1b   arm_3b   8.4252     0       7.1      9.75     TRUE
arm_1b   arm_cf   12.6281    0       11.3     13.92    TRUE
arm_1b   arm_lf   11.1761    0       9.93     12.42    TRUE
arm_1b   arm_rf   13.3679    0       12.1     14.63    TRUE
arm_1b   arm_ss   8.977      0       7.6      10.32    TRUE
arm_2b   arm_3b   4.3985     0       3.2      5.57     TRUE
arm_2b   arm_cf   8.6014     0       7.46     9.738    TRUE
arm_2b   arm_lf   7.1494     0       6.067    8.23     TRUE
arm_2b   arm_rf   9.3412     0       8.2      10.45    TRUE
arm_2b   arm_ss   4.9503     0       3.75     6.14     TRUE
arm_3b   arm_cf   4.2029     0       3.02     5.38     TRUE
arm_3b   arm_lf   2.7509     0       1.61     3.88     TRUE
arm_3b   arm_rf   4.9427     0       3.78     6.10     TRUE
arm_3b   arm_ss   0.5518     0.84    -0.69    1.79     FALSE
arm_cf   arm_lf   -1.4519    0.02    -2.54    -0.35    TRUE
arm_cf   arm_rf   0.7398     0.45    -0.38    1.86     FALSE
arm_cf   arm_ss   -3.6511    0       -4.86    -2.43    TRUE
arm_lf   arm_rf   2.1918     0       1.12     3.26     TRUE
arm_lf   arm_ss   -2.1991    0       -3.36    -1.037   TRUE
arm_rf   arm_ss   -4.3909    0       -5.57    -3.202   TRUE

These are the results from Tukey’s HSD (Honestly Significant Difference) test, which provides pairwise comparisons between arm strengths for different positions. Yeah, I know your eyes are glazing over, but bear with me. Here’s how to interpret the key columns:

  1. Group1 and Group2: These columns represent the two positions being compared. For example, “arm_1b” vs. “arm_2b” compares the arm strength of first basemen with second basemen.
  2. Meandiff: This column shows the difference in the average arm strength between the two groups, computed as the second group (Group2) minus the first group (Group1). A positive number means the arm strength of the second group is higher than that of the first group.
    • For example, the mean difference between first basemen (arm_1b) and second basemen (arm_2b) is 4.03 mph, meaning first basemen tend to have lower arm strength compared to second basemen.
  3. p-adj: This is the adjusted p-value, which tests the statistical significance of the difference. If this value is below 0.05, it indicates that the difference is statistically significant.
    • For most comparisons, the p-values are extremely low (0.0), indicating strong evidence that arm strength significantly differs between these positions.
  4. Lower and Upper: These are the lower and upper bounds of the 95% confidence interval for the mean difference, the range within which the actual mean difference will likely fall.
    • For example, the confidence interval for the difference between arm_1b and arm_2b runs from 2.74 to 5.30 mph, suggesting that the actual difference lies within this range.
  5. Reject: This column tells whether the difference between the two groups is statistically significant. If it says “True,” the test rejects the null hypothesis, meaning the difference between the two positions is significant.
    • In this case, “True” appears in many rows, indicating that the arm strengths differ significantly between most pairs of positions.
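To make the meandiff column concrete, here is a minimal sketch that reproduces only that column for every pair of groups. It uses the group2-minus-group1 sign convention that the numbers in the table imply, and it does not compute the adjusted p-values or confidence intervals, which need the studentized range distribution. The throw velocities are hypothetical.

```python
from itertools import combinations


def pairwise_mean_differences(groups):
    """groups: dict of name -> list of values.

    Returns (group1, group2, meandiff) tuples with
    meandiff = mean(group2) - mean(group1).
    """
    means = {name: sum(v) / len(v) for name, v in groups.items()}
    return [
        (a, b, means[b] - means[a]) for a, b in combinations(sorted(means), 2)
    ]


# Hypothetical throw velocities in mph, for illustration only.
sample = {
    "arm_1b": [70.0, 72.0, 74.0],
    "arm_ss": [80.0, 82.0, 84.0],
    "arm_rf": [85.0, 86.0, 87.0],
}

for g1, g2, diff in pairwise_mean_differences(sample):
    print(f"{g1} {g2} {diff:+.2f}")
```

With this convention, a negative value in a row such as arm_rf vs. arm_ss simply means the second group throws softer than the first.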

Key Insights

  • Significant differences: Almost all pairwise comparisons show statistically significant differences. For example:
    • Outfielders (CF, RF, LF) generally have higher arm strength compared to infielders (1B, 2B, 3B, SS).
    • Third basemen (arm_3b) also tend to have higher arm strength than first basemen (arm_1b), as shown by an 8.43 mph difference.
  • Largest differences: The biggest differences are between infield positions like first base and outfield positions like right field (arm_rf), where the arm strength difference can be over 13 mph.

Even though my hip is killing me, I feel very good about the results of this study.

Wrapping Up

So, what did I learn from all this? First, outfielders—especially those in right and center field—are in a league of their own regarding arm strength. Conversely, infielders don’t need the same power, but positions like third base and shortstop still require strong arms for those quick, long throws.

Running the ANOVA and Tukey’s test confirmed that these differences in arm strength are not random results due to the vagaries of sampling. Understanding these variations can be crucial for teams looking to optimize their defensive lineups or scout new talent.

Examining the data and seeing how arm strength varies across MLB positions was fascinating. I hope you enjoyed it. I am going to grab a beer and contemplate the disappointment of my team, the Cleveland Guardians, disastrously ending another year. Meh, what else is new?

Even More Catcher Info: 2023 Blocking Data

Catcher defense, especially the ability to block pitches, can often go unnoticed but significantly impact the game. Preventing wild pitches and passed balls can save crucial runs and give pitchers confidence to throw in the dirt when necessary. In 2023, several catchers distinguished themselves as exceptional blockers. Let’s take a look at some of the data.

This analysis uses metrics like “blocks above average,” passed balls/wild pitches (PBWP), and more to examine the best catchers at blocking pitches during the season. Below, I break down the data to highlight the elite performers.

1. Top 10 Catchers by Blocks Above Average

“Blocks above average” is a critical statistic that tells us how much better (or worse) a catcher is compared to the league average at blocking pitches. Here’s a look at the top 10 catchers based on this metric:

As shown, Sean Murphy from the Atlanta Braves leads the way with 16 blocks above average, followed closely by Alejandro Kirk and Nick Fortes. These catchers were above average in keeping pitches in front of them, saving runs for their teams.

2. Actual vs. Expected PBWP

Next, take a look at the actual vs. expected number of passed balls and wild pitches (PBWP). The scatter plot below visualizes this comparison:

Catchers whose actual PBWP is lower than expected (below the red line) performed better than average. Catchers like Sean Murphy and J.T. Realmuto are among those outperforming expectations, while others are closer to the expected values. Note that the majority of catchers were about average.

3. Blocks Above Average Per Game

Another critical metric is the rate at which catchers accumulate blocks above average per game. This accounts for differences in playing time and offers a normalized view of performance. Here’s a look at the top 10 catchers:

The usual suspects are once again prominent. Notice that Yainer Diaz ranked number one in the league in this critical category.
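The per-game normalization is a simple division followed by a sort. Here is a minimal sketch; the catcher labels, block totals, and game counts are hypothetical placeholders rather than the 2023 figures.

```python
def rank_by_blocks_per_game(catchers, top_n=10):
    """catchers: list of (name, blocks_above_average, games) tuples.

    Returns (name, rate) pairs sorted from best to worst rate.
    """
    rated = [
        (name, blocks / games) for name, blocks, games in catchers if games > 0
    ]
    return sorted(rated, key=lambda item: item[1], reverse=True)[:top_n]


# Placeholder rows for illustration only.
sample = [
    ("Catcher A", 12, 100),
    ("Catcher B", 8, 40),
    ("Catcher C", -3, 90),
]

for name, rate in rank_by_blocks_per_game(sample):
    print(f"{name}: {rate:+.3f} blocks above average per game")
```

Notice how a part-time catcher can jump to the top of a rate-based list, which is why a minimum playing-time cutoff is worth considering.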

4. Comprehensive Heatmap

To better understand each catcher’s performance, I’ve compiled several blocking metrics into a heatmap. This chart includes statistics such as catcher blocking runs, blocks above average, actual vs. expected PBWP, and blocks above average per game:

The heatmap above gives a comprehensive view of the top 10 catchers. The varying shades show how these catchers compare across multiple metrics, with Sean Murphy, Alejandro Kirk, and Nick Fortes again emerging as the top performers. This heatmap allows us to see the nuances in their blocking ability, with some excelling at reducing passed balls while others are better at accumulating blocks above average on a per-game basis.

Conclusion

Nuance and subtlety are the operative words here. Asking who was the best defensive catcher in 2023 has a complex and interesting answer. What should we value in a catcher’s defense? Which metric is more important to winning than the others? Can you settle for a below-average pop time if your catcher is brilliant at framing pitches? Lots of great questions that require thoughtful answers. Stay tuned; I will continue posting my analyses. And yes, I do intend to publish some (hopefully) thoughtful conclusions.

 

Pop Time: A Critical Metric for Catchers

In baseball, a catcher’s Pop Time can be the difference between catching a base-stealer and letting them slide in safely. Pop Time measures how quickly a catcher transfers the ball from their mitt to second base, factoring in the catcher’s footwork, exchange, and arm strength. This metric provides a more comprehensive assessment of a catcher’s defensive capabilities than arm strength alone, making it crucial in evaluating how effectively a catcher can control the running game.

This post explores the distribution of pop times among various MLB catchers, with visualizations such as a histogram, Kernel Density Estimate (KDE) plot, violin plot, and box plot. We’ll also examine some key summary statistics and update the analysis with the best pop times recorded during the 2023 season.


What is Pop Time?

Pop Time is the time it takes for a catcher to throw the ball to second base during a steal or pickoff attempt. It measures the time elapsed from when the pitch hits the catcher’s mitt to when the throw reaches the center of the base. MLB’s average pop time for a throw to second base is 2.01 seconds, but elite catchers are significantly faster.

Pop Time considers three main factors:

  • Footwork: The catcher’s ability to quickly get into a throwing position.
  • Exchange: How fast the catcher transfers the ball from the glove to the throwing hand.
  • Arm Strength: The velocity of the throw.
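The three factors above can be folded into a rough back-of-the-envelope estimate: time spent getting the ball out of the glove, plus time the ball spends in the air. The sketch below lumps footwork and exchange into one number and treats the throw as traveling at a constant average speed, both of which are simplifying assumptions. The input values are hypothetical.

```python
# Home plate to second base is 127 feet 3 3/8 inches.
HOME_TO_SECOND_FT = 127 + 3.375 / 12
MPH_TO_FPS = 5280 / 3600  # miles per hour -> feet per second


def estimated_pop_time(exchange_seconds, avg_throw_mph):
    """Rough pop time: exchange time plus flight time at a constant speed.

    exchange_seconds lumps footwork and glove-to-hand transfer together.
    avg_throw_mph is the average speed over the whole throw, which is
    lower than the velocity at release.
    """
    flight_seconds = HOME_TO_SECOND_FT / (avg_throw_mph * MPH_TO_FPS)
    return exchange_seconds + flight_seconds


# Hypothetical inputs for illustration only.
print(round(estimated_pop_time(0.70, 70.0), 2))
print(round(estimated_pop_time(0.70, 75.0), 2))
```

Playing with the inputs shows why a quick exchange can make up for a merely average arm, and vice versa.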

Catchers with exceptional Pop Times obviously offer a much higher probability of recording an out.


Best Pop Times from 2023

Below are the best average Pop Times to second base on stolen-base attempts (minimum 15 SB attempts) from the 2023 MLB season:

  • J.T. Realmuto: 1.90 seconds
  • Yan Gomes: 1.93 seconds
  • Jorge Alfaro: 1.94 seconds
  • Austin Hedges: 1.94 seconds
  • Manny Piña: 1.94 seconds
  • Gary Sánchez: 1.94 seconds

These elite catchers consistently post Pop Times well below the league average, making them highly effective at throwing out would-be base stealers. J.T. Realmuto, whose reputation precedes him, leads the pack with an impressive 1.90 seconds.


Pop Time Distribution: A Closer Look

To better understand how Pop Times vary among catchers, I visualized the distribution using a histogram:

The histogram shows that most catchers’ Pop Times cluster around 1.95–2.0 seconds, with very few recording times below 1.90 seconds. The majority of catchers are near the league average of 2.01 seconds, but the elite catchers separate themselves by consistently being faster than this threshold.


Kernel Density Estimate (KDE) Plot

A Kernel Density Estimate (KDE) plot smooths out the distribution to provide a clearer picture of the underlying trends:

The KDE plot highlights the peak of Pop Times around 1.95 seconds, confirming that most catchers perform near this time. The data skews slightly to the right, indicating that a few catchers have slower pop times exceeding 2.0 seconds, but most fall below this threshold.


Violin Plot: Visualizing Distribution and Density

I also created a violin plot, which combines the features of a KDE and a box plot to visualize both the distribution and the density of pop times:

The violin plot shows that most catchers fall within a narrow range of 1.90 to 2.00 seconds. The distribution is dense around 1.95 seconds, with fewer catchers having significantly faster or slower times. This plot also highlights that catchers like J.T. Realmuto are outliers, excelling well beyond the typical range.


Box Plot: Highlighting Key Statistics

The box plot below offers a simple yet informative view of the data, focusing on the central tendency and spread of Pop Times:

Key points from the box plot:

  • Median Pop Time: 1.97 seconds
  • Interquartile Range (IQR): The middle 50% of pop times fall between 1.93 and 1.99 seconds.
  • Outliers: A few catchers have slower times above 2.0 seconds, but these are rare.

Summary Statistics

The summary statistics for Pop Times further illustrate how closely clustered most catchers are around the league average:

  • Mean Pop Time: 1.96 seconds
  • Standard Deviation: 0.051 seconds (indicating low variability)
  • Minimum Pop Time: 1.83 seconds
  • Maximum Pop Time: 2.09 seconds
  • 25th Percentile: 1.93 seconds
  • 50th Percentile (Median): 1.97 seconds
  • 75th Percentile: 1.99 seconds

These statistics show that most catchers perform within a narrow band, with the elite catchers falling below 1.90 seconds.
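The quartiles reported above are enough to work out where a box plot would start flagging outliers. This sketch applies the common 1.5 × IQR rule, which is an assumption about the convention used, since box plot tools can be configured differently.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences: k * IQR beyond each quartile."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles taken from the summary statistics in this post.
low, high = tukey_fences(1.93, 1.99)
print(round(low, 2), round(high, 2))  # 1.84 2.08
```

Under that convention, the reported minimum (1.83 seconds) and maximum (2.09 seconds) would each sit just outside a fence.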


Conclusion

Pop Time is a critical metric for evaluating a catcher’s ability to control the running game. While arm strength is important, Pop Time provides a fuller picture by incorporating footwork and exchange speed. This type of analysis also lets us ignore the pitcher and focus exclusively on the catcher’s skills.

Our analysis of Pop Times using visual tools like histograms, KDE plots, violin plots, and box plots shows that most catchers fall within a narrow range of 1.95 to 2.0 seconds, with a few standout performers excelling beyond this. The data from the 2023 season illustrates how slight differences in Pop Time can significantly impact a catcher’s effectiveness at throwing out base stealers.

For catchers, a fast Pop Time can be the difference between a successful defensive play and allowing the opposing team to gain momentum on the bases. I hope you are enjoying this deep dive into the nuances of catching; I certainly am. It is fascinating, isn’t it?

Whiff Percentages in Baseball: A Little EDA Goes A Long Way

In baseball analytics, understanding a player’s whiff percentage—the rate at which they miss the ball when swinging—can offer key insights into their performance. A higher whiff percentage suggests a tendency to miss pitches, while a lower percentage indicates better contact with the ball.
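As a quick illustration of the definition, whiff percentage is simply swings and misses divided by total swings. The function name and the counts below are hypothetical.

```python
def whiff_pct(whiffs, swings):
    """Percentage of swings that miss the ball entirely."""
    if swings <= 0:
        raise ValueError("swings must be positive")
    return 100.0 * whiffs / swings


# Hypothetical counts for illustration only.
print(whiff_pct(120, 480))  # 25.0
```

Note that the denominator is swings, not pitches seen, so a patient hitter and a free swinger can post the same whiff percentage.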

In this post, I explore whiff percentages from both leagues across several years using three different visualization techniques: box plots, violin plots, and a line plot of medians. Each method offers a unique perspective on the data, and together, they help paint a comprehensive picture of trends in whiff percentages from 2015 to 2023. All players with approximately 200 plate appearances in that given year are included in the study.


1. Box Plot: Visualizing the Distribution by Year

A box plot is a simple yet powerful tool to summarize the distribution of whiff percentages each year. It shows the median (the line within each box), the interquartile range (the box itself), and any outliers (the dots outside the whiskers).

This box plot gives us several insights:

  • Consistency: In certain years, the boxes are compact, indicating less variation in whiff percentages (e.g., 2015).
  • Outliers: Some years have extreme values, shown as dots, which highlight players who either significantly outperformed or underperformed compared to the rest.
  • Year-to-Year Comparison: The height of the boxes gives a sense of how spread out the whiff percentages were for each year, helping to identify years with more variability in player performance.

Why use a box plot? Box plots are ideal when you want to compare distributions without being distracted by individual data points. They provide a clean, uncluttered view of how the overall performance fluctuated from year to year, and they highlight outliers effectively.


2. Violin Plot: Adding Depth to Distribution Analysis

A violin plot enhances the box plot by providing additional information about the shape of the distribution. It combines aspects of a box plot with a kernel density estimate, which helps visualize the probability distribution of the data. I will mention once again that I invented these plots, much to the chagrin of my peers, many decades ago. See my “A Crush, A Data Viz, and a Book Long Postponed” post for that tragic tale.

This violin plot offers some extra depth:

  • Distribution Shape: You can see how the whiff percentages are spread out within each year. Some years have narrow violins, suggesting that most players had similar whiff percentages, while others are more spread out, indicating more variability.
  • Density: The wider sections of the violin show where most data points are concentrated, allowing us to see not just the range but also the density of players’ performances in each year.

Why use a violin plot? Violin plots are particularly useful when you want a more nuanced understanding of the data distribution. While box plots are excellent for a high-level summary, violin plots allow us to see the underlying density, which can reveal patterns not visible in box plots alone.


3. Line Plot of Medians: Tracking Trends Over Time

Finally, to understand the overall trend in whiff percentages, I created a line plot of the median whiff percentage for each year. The median is a robust measure of central tendency, making it ideal for highlighting general shifts without being overly influenced by outliers.

This plot shows us:

  • Overall Trend: The line plot helps reveal whether the median whiff percentage is increasing, decreasing, or remaining stable over time. If the line rises, it suggests that players are swinging and missing more often as the years progress, while a falling line indicates better contact rates.
  • Key Years: Significant upward or downward trends in specific years are easily spotted. These could prompt further investigation into why such changes occurred, whether due to rule changes, player performance shifts, or other factors.

Why use a line plot? A line plot of medians is a simple way to capture the long-term trend. It smooths out individual variations and provides a clear picture of how the “middle” of the data is changing over time.
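
To illustrate why the median is the safer summary, here is a small sketch that groups records by year and compares the median with the mean. The records are invented for illustration; only the grouping and the comparison matter.

```python
import statistics

# Hypothetical (year, whiff %) records, invented for illustration only.
records = [
    (2019, 22.0), (2019, 24.0), (2019, 26.0),
    (2020, 23.0), (2020, 25.0), (2020, 27.0), (2020, 55.0),  # 55.0 is an outlier
    (2021, 21.0), (2021, 23.0), (2021, 25.0),
]

# Group the whiff percentages by year.
by_year = {}
for year, whiff in records:
    by_year.setdefault(year, []).append(whiff)

medians = {year: statistics.median(v) for year, v in sorted(by_year.items())}
means = {year: statistics.mean(v) for year, v in sorted(by_year.items())}

for year in medians:
    print(year, "median:", medians[year], "mean:", round(means[year], 1))
```

In this toy data the single extreme value in 2020 pulls the mean up sharply while the median barely moves, which is the behavior the line plot relies on.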


Conclusion: Insights from Multiple Perspectives

By using these three visualizations—box plots, violin plots, and line plots—we gain a multi-dimensional understanding of whiff percentages in baseball:

  • The box plot provides a clean, high-level comparison of distributions across years, highlighting outliers and general performance variability.
  • The violin plot offers a deeper look at how player performances are distributed within each year, revealing the shape and density of the data.
  • The line plot of medians shows the overall trend, capturing how the middle of the distribution shifts over time.

Each plot tells a part of the story, and when combined, they provide a comprehensive view of player performance over the years. Whether you’re a data enthusiast, baseball analyst, or interested bystander, these tools can help unlock valuable insights into the game. And yes, I find the trend reversal after the 2020 season curious. The great thing about Exploratory Data Analysis (EDA) is that it can strongly suggest what questions must be asked in subsequent stages of analysis. That is certainly what happened here.

 

Frame This: MLB Catchers (2023)

I took a deeper dive into MLB Catchers for the year 2023. I found lots of interesting stuff. Let’s get to it.

In this post, I decided to focus on catcher framing. Some catchers are better than others at fooling umpires into calling a ball a strike. That is what catcher framing is all about. This may surprise some of you, but all this data is now readily available. Every pitch is tracked with impressive accuracy, with terabytes of data generated for each game played.

I created this figure to illustrate the standardized zones used for pitches thrown to home plate. The following is taken from the perspective of the catcher and home plate umpire.

Take Zone 11, for example. The reams of data tell us the percentage of pitches in that area that are taken and called strikes. In 2023, 19.2% of all pitches thrown into that zone were called strikes. Austin Hedges, then a catcher for the world-champion Texas Rangers, managed to get 27.6% of those pitches called strikes by the sweaty man crouching behind him. Get the idea? Hedges’ strike rate for that zone led all of MLB.

Hedges’ work in Zone 13 was even more impressive. The league average for pitches thrown up and away to right-handed batters was 23.6%. Hedges managed to get strike calls on 42.2% of those pitches. Extraordinary.
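
To put those two zones side by side, here is a quick bit of arithmetic in Python using only the four rates quoted above. The variable names are mine; the difference is in percentage points and the ratio is relative to the league rate.

```python
# Called-strike rates quoted above, in percent.
league_rate = {11: 19.2, 13: 23.6}
hedges_rate = {11: 27.6, 13: 42.2}

for zone in league_rate:
    diff = hedges_rate[zone] - league_rate[zone]    # percentage points above league
    ratio = hedges_rate[zone] / league_rate[zone]   # multiple of the league rate
    print(f"Zone {zone}: +{diff:.1f} points, {ratio:.2f}x the league rate")
```

Measured either way, the Zone 13 gap is roughly twice the size of the Zone 11 gap.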

I ran a Cluster Analysis of the framing data across all the zones to identify the top ten catchers in MLB in 2023. Hedges and Patrick Bailey of the San Francisco Giants stand apart based on their superior performance.

And, yes, what is a top ten list without a bottom ten list? There might be a name or two on there that will surprise you.

In a previous post, I had identified J.T. Realmuto as having an outstanding defensive season in 2023. Regarding pitch framing, he ranked a ridiculous 63rd. I admit, I found that unexpected.

Now, we can move on to something very cool. I have known what heatmaps are for a long time, but I have never needed to create one. It simply never came up. Guess what is next; go ahead.

I want to point out one aspect of this map: Hedges was well below the league average regarding framing pitches in Zone 14. I must admit, that is curious. I do not know why he would be so bad in that area and excel in all the other zones. I have no explanation for that anomalous chunk of data.

And, yes, I also generated a heatmap for the bottom ten catchers in 2023.

Another strange fact is that Martin Maldonado was very good at getting strike calls in Zone 11 but well below average in all the others. Does that have something to do with the pitchers on the Houston Astros in 2023? That line of reasoning might lead to a possible explanation.

I thought that was the end of this post, but I decided to test the new AI release that ChatGPT just dropped. I asked it for recommendations on how it would display this data. It offered up something very cool. Here are Radar Plots of the top 5 and bottom 5 catchers for pitch framing for the 2023 season.

Note that Hedges in Zone 13 and Miguel Amaya in Zone 17 stand out.

These plots are beautiful, but I haven’t decided on their utility. Are they diagnostic enough to merit their use? We will look more into that question in future posts.

At least for now, the takeaway is that determining the best defensive catcher in 2023 is much more subtle and nuanced than one might have imagined. Stay tuned; there is more to come.

 

Baseball Has a Strange Math Issue

My last post was about the defensive capabilities of MLB catchers in 2023. I mentioned that there was more to come. As I was researching the follow-up post, I came across something bizarre. As soon as I stop violently shaking my head back and forth, I will show you what I found.

This post was supposed to be about framing pitches. Some catchers are very good at fooling umpires into calling strikes on pitches that are actually balls. There is lots of excellent data to quantify the ability of any catcher to do this. As you might guess, this is a precious skill that any team would want to have in their catcher.

As I reviewed the data and put together a strategy to analyze and visualize it for the post, I realized that I needed to draw pictures of home base, more commonly called home plate. Why home base, then? That is what it is called in the official baseball rule book. How did I end up on a web page showing those rules? That is an excellent question.

I searched for the dimensions of home plate; it wasn’t something I had committed to memory. Trust me, I know the numbers now, and I doubt I will ever forget. Here’s why…

The following paragraph is taken from Official Baseball Rules, 2024 edition, published by the Office of the Commissioner of Baseball.

2.02 Home Base. Home base shall be marked by a five-sided slab of whitened rubber. It shall be a 17-inch square with two of the corners removed so that one edge is 17 inches long, two adjacent sides are 8½ inches and the remaining two sides are 12 inches and set at an angle to make a point.

So, what’s the big deal? The rule book describes an impossible figure. The shape described does not, and cannot, exist. Unbelievable, isn’t it? Look at the drawing I conjured up.

 

Figure 1. Home plate as it should be and home plate as described in rule book.

 

I suppose a lawyer could litigate this. It seems that the intent was for the angle formed at the point to be 90 degrees, which it clearly is not when following the description from the rule book. It takes slightly more than 12 inches (about 12.02) to meet the requirements of Pythagoras and his ubiquitous theorem. Is Major League Baseball concerned about this? Apparently not. Am I concerned that they have fudged a famous geometry theorem? I’ll crank up some Mozart and mull it over for a bit. My guess is I won’t lose much sleep.
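
For anyone who wants to check the numbers, here is a short sketch of the arithmetic in Python. It assumes, as the rule text implies, that the two slanted sides meet at the point and span the full 17-inch width of the plate. With a right angle at the point, Pythagoras requires 12² + 12² = 17², but 288 is not 289.

```python
import math

front_edge = 17.0     # inches, per Rule 2.02
slanted_side = 12.0   # inches, per Rule 2.02

# If the point is a right angle, the slanted sides are the legs of a right
# isosceles triangle whose hypotenuse is the 17-inch width of the plate.
side_for_right_angle = front_edge / math.sqrt(2)

# If the slanted sides really are 12 inches, the angle at the point is instead:
half_angle = math.asin((front_edge / 2) / slanted_side)
point_angle = math.degrees(2 * half_angle)

print(f"side needed for a 90-degree point: {side_for_right_angle:.3f} in")
print(f"angle at the point with 12-inch sides: {point_angle:.2f} degrees")
```

Either the sides are a hair longer than 12 inches or the point is a hair wider than a right angle; the rule as written cannot have both.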

 

Pitching is (or was) more Important than Hitting? Who knew?

 

This analysis examines the relationship between a team’s On-base Plus Slugging (OPS) and their total wins in Major League Baseball (MLB) over a five-year period from 2004 to 2008. OPS is a key statistic in baseball that combines on-base percentage and slugging percentage, providing a comprehensive measure of a player’s (or team’s) ability to get on base and hit for power. The scatterplot visualizes this relationship, with each point representing a team’s OPS and corresponding number of wins for a particular season. The data points are colored by year, allowing us to observe any patterns or trends across the seasons. Coloring by year, however, proved not to be very useful.

A linear regression model was applied to determine if there is a significant correlation between OPS and team wins. The analysis revealed an R-squared value of 0.196. The R-squared value indicates that approximately 19.6% of the variance in team wins can be explained by their OPS, suggesting a moderate correlation. While OPS is a useful statistic, the relatively low R-squared value implies that other factors, such as pitching, defense, and managerial decisions, also play a significant role in determining a team’s success over a season.
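
As a rough illustration of where an R-squared value comes from, here is a minimal least-squares sketch in plain Python. The six team-seasons are invented, so the output will not match the 0.196 reported above, and the function name linear_fit is hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus R-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared = 1 - (unexplained variation / total variation in y)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical team-seasons (OPS, wins), invented for illustration only.
ops = [0.700, 0.720, 0.740, 0.760, 0.780, 0.800]
wins = [70, 85, 74, 90, 79, 95]

slope, intercept, r_squared = linear_fit(ops, wins)
print(f"wins = {slope:.1f} * OPS + {intercept:.1f}, R-squared = {r_squared:.3f}")
```

The closer the points hug the fitted line, the smaller the unexplained variation and the closer R-squared gets to 1.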

The analysis covers data from five consecutive MLB seasons, providing a broad overview of the relationship between OPS and wins over multiple years. The consistency of the trend line and equation across the years indicates that the OPS-wins relationship is relatively stable during this time period.  However, given the moderate R-squared value, this analysis suggests that while OPS is an important metric for assessing team performance, it should be considered alongside other variables for a more comprehensive understanding of what drives a team’s success.

In a recent post, I demonstrated that WHIP is much more predictive of a team’s record than OPS, at least in the mid-2000s. I don’t think anyone will be surprised to learn that pitching is more important than hitting if you want to win baseball games. There will be more on that and related topics coming soon.

 

Now, Isn’t This Interesting?

 

This scatterplot visualizes the relationship between a baseball team’s WHIP (Walks plus Hits per Inning Pitched) and the number of wins they achieved during the seasons from 2004 to 2008. I included both the AL and NL in this analysis. Each point on the graph represents a team in a specific year, with the color indicating the corresponding season. The WHIP is plotted on the x-axis, while the number of wins is plotted on the y-axis. This visualization allows us to observe if there is a pattern or trend between these two variables across different years.

A trendline, represented by a solid red line, has been added to the scatterplot, which provides a general indication of the relationship between WHIP and wins. The slope of the line suggests that as WHIP increases, the number of wins tends to decrease. The strength of this relationship is indicated by the R-squared value of 0.49, meaning that WHIP accounts for approximately 49% of the variability in the number of wins. This moderate R-squared value suggests a fairly significant correlation between the two variables.

In summary, the scatterplot illustrates a moderate negative correlation between WHIP and team wins, indicating that WHIP is a meaningful factor in a team’s success, though not the sole determinant. Including both leagues from 2004 to 2008 allows for an interesting, if limited, analysis over multiple seasons, with the trendline and R-squared value providing insights into the overall pattern between these two metrics. This plot highlights the importance of WHIP in predicting team performance while suggesting that other factors certainly contribute to a team’s total wins.

Here is a scatterplot illustrating wins in terms of team ERA (earned run average). When I was a kid, I didn’t think ERA was very valuable, and the following plot shows that it has less explanatory value than WHIP.

As we saw in a previous post, payroll differences explained approximately 19 percent of the variability in win totals. Team ERA explains about 44 percent of the variability, while the WHIP metric has more explanatory value (49 percent) when determining what leads to wins in major league baseball. I will keep posting more information as my research progresses.
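
One way to compare the three figures is to convert each R-squared into the size of the underlying correlation. For a straight-line fit with a single predictor, R-squared is the square of the Pearson correlation coefficient, so the square root recovers its magnitude. The sketch below uses only the percentages quoted above.

```python
import math

# Share of the variability in wins explained, as quoted above.
r_squared = {"payroll": 0.19, "ERA": 0.44, "WHIP": 0.49}

for metric, r2 in r_squared.items():
    print(f"{metric:8s} R-squared = {r2:.2f}  |r| = {math.sqrt(r2):.2f}")
```

The sign of the correlation comes from the slope of the trendline, which is negative for WHIP since wins fall as WHIP rises.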