Scales of Unusualness: 2023 MLB Catchers (Defense)

The hierarchical cluster tree, or dendrogram, visualizes the relationships among 2023 MLB catchers based on their defensive statistics. As always, players who are joined low in the tree have similar defensive profiles, meaning their statistics in categories like putouts, assists, errors, and caught stealing percentage are more alike. The height of the horizontal line that joins two players or groups (the distance) indicates how dissimilar they are: the lower the line, the more similar the players are in their defensive performance.
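
For readers who want to see where those heights come from, here is a minimal sketch of agglomerative clustering in plain Python. It is not the software used for the plot. The average-linkage rule, the function name `agglomerate`, and the four made-up catchers are all assumptions for illustration.

```python
from itertools import combinations
from math import dist


def agglomerate(points):
    """Average-linkage agglomerative clustering.

    points: dict mapping a label to a tuple of numeric features.
    Returns a list of (members_a, members_b, height) merges, in merge order.
    """
    clusters = [(name,) for name in points]
    merges = []

    def avg_distance(a, b):
        # Mean Euclidean distance over all cross-cluster pairs.
        total = sum(dist(points[p], points[q]) for p in a for q in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Merge the two clusters that are currently closest together.
        a, b = min(combinations(clusters, 2), key=lambda pair: avg_distance(*pair))
        merges.append((a, b, avg_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical standardized stats for four made-up catchers (not real data).
catchers = {
    "A": (0.1, 0.2),
    "B": (0.2, 0.1),
    "C": (-0.3, -0.2),
    "D": (3.0, 2.5),
}

merges = agglomerate(catchers)
for a, b, height in merges:
    print(f"{a} + {b} join at height {height:.2f}")
```

The height of each merge is what a dendrogram draws as the horizontal line. A player who joins last, and at a large height, is the one sitting off by himself.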

The visualization highlights individual performance and helps teams or analysts compare players across a wide range of defensive metrics. For example, catchers clustered together likely share similar defensive styles or capabilities, making it easier to compare catchers in terms of their effectiveness behind the plate. Furthermore, the dendrogram’s structure shows which players stand out as outliers due to superior or weaker performance compared to their peers, giving teams valuable insights for recruitment, strategy, or training decisions.

Note that J.T. Realmuto is off by himself. Despite not receiving a Gold Glove Award, his defensive performance in 2023 was ostensibly exceptional. In a future post, I will drill down into the advanced metrics to see why he was overlooked. Don’t be surprised if the dendrogram I created in this post is deemed suspect in a few days or so.

 

 

Pitching is (or was) more Important than Hitting? Who knew?

 

This analysis examines the relationship between a team’s On-base Plus Slugging (OPS) and its total wins in Major League Baseball (MLB) over a five-year period from 2004 to 2008. OPS is a key statistic in baseball that combines on-base percentage and slugging percentage, providing a comprehensive measure of a player’s (or team’s) ability to get on base and hit for power. The scatterplot visualizes this relationship, with each point representing a team’s OPS and corresponding number of wins for a particular season. The data points are colored by year, allowing us to observe any patterns or trends across the seasons. Coloring by year proved not to be very useful.

A linear regression model was applied to determine if there is a significant correlation between OPS and team wins. The analysis revealed an R-squared value of 0.196. The R-squared value indicates that approximately 19.6% of the variance in team wins can be explained by their OPS, suggesting a moderate correlation. While OPS is a useful statistic, the relatively low R-squared value implies that other factors, such as pitching, defense, and managerial decisions, also play a significant role in determining a team’s success over a season.
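
If you want to reproduce this kind of number yourself, the following is a minimal sketch of a least-squares fit and its R-squared in plain Python. The helper name `fit_line` and the six OPS and win values are invented for illustration; they are not the 2004 to 2008 data, so the output will not match 0.196.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept, plus R-squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared = 1 - (unexplained variation / total variation).
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical team OPS and win totals, invented for illustration only.
ops = [0.720, 0.735, 0.750, 0.765, 0.780, 0.795]
wins = [78, 70, 88, 75, 92, 83]

slope, intercept, r_squared = fit_line(ops, wins)
print(f"wins = {slope:.1f} * OPS + {intercept:.1f}, R^2 = {r_squared:.3f}")
```

R-squared is the share of the variation in wins that the fitted line accounts for, which is why a value of 0.196 leaves roughly 80 percent to other factors.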

The analysis covers data from five consecutive MLB seasons, providing a broad overview of the relationship between OPS and wins over multiple years. The consistency of the trend line and equation across the years indicates that the OPS-wins relationship is relatively stable during this time period. However, given the moderate R-squared value, this analysis suggests that while OPS is an important metric for assessing team performance, it should be considered alongside other variables for a more comprehensive understanding of what drives a team’s success.

In a recent post, I demonstrated that WHIP is much more predictive of a team’s record than OPS, at least in the mid-2000s. I don’t think anyone will be surprised to learn that pitching is more important than hitting if you want to win baseball games. There will be more on that and related topics coming soon.

 

Now, Isn’t This Interesting?

 

This scatterplot visualizes the relationship between a baseball team’s WHIP (Walks plus Hits per Inning Pitched) and the number of wins they achieved during the seasons from 2004 to 2008. I included both the AL and NL in this analysis. Each point on the graph represents a team in a specific year, with the color indicating the corresponding season. The WHIP is plotted on the x-axis, while the number of wins is plotted on the y-axis. This visualization allows us to observe if there is a pattern or trend between these two variables across different years.

A trendline, represented by a solid red line, has been added to the scatterplot, which provides a general indication of the relationship between WHIP and wins. The slope of the line suggests that as WHIP increases, the number of wins tends to decrease. The strength of this relationship is indicated by the R-squared value of 0.49, meaning that WHIP accounts for approximately 49% of the variability in the number of wins. This moderate R-squared value suggests a fairly significant correlation between the two variables.

In summary, the scatterplot illustrates a moderate negative correlation between WHIP and team wins, indicating that WHIP is a meaningful factor in a team’s success, though not the sole determinant. Including both leagues from 2004 to 2008 allows for an interesting, if limited, analysis over multiple seasons, with the trendline and R-squared value providing insights into the overall pattern between these two metrics. This plot highlights the importance of WHIP in predicting team performance while suggesting that other factors certainly contribute to a team’s total wins.

Here is a scatterplot illustrating wins in terms of team ERA (earned run average). When I was a kid, I didn’t think ERA was very valuable, and the following plot shows that it has less explanatory value than WHIP.

As we saw in a previous post, payroll differences explained approximately 19 percent of the variability in win totals. Team ERA explains about 44 percent of the variability, while the WHIP metric has more explanatory value (49 percent) when determining what leads to wins in major league baseball. I will keep posting more information as my research progresses.
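
To put those percentages on a common footing, here is a small sketch that converts each R-squared quoted in these posts into the size of the underlying correlation. It relies on the fact that, for a straight-line fit with one predictor, R-squared is the square of the correlation coefficient. Only the four R-squared values come from the posts; the variable names are mine.

```python
from math import sqrt

# R-squared values quoted in these posts (2004-2008, team level).
# Payroll was fit against winning percentage; the others against wins.
r_squared = {"payroll": 0.19, "OPS": 0.196, "ERA": 0.44, "WHIP": 0.49}

# For a one-predictor linear fit, |r| = sqrt(R^2). The sign is not recoverable
# from R^2 alone, so only the magnitude is reported here.
implied_r = {name: sqrt(r2) for name, r2 in r_squared.items()}

for name, r2 in sorted(r_squared.items(), key=lambda item: item[1]):
    print(f"{name:8s} R^2 = {r2:.3f}  |r| = {implied_r[name]:.2f}  unexplained = {1 - r2:.0%}")
```

Seen this way, WHIP corresponds to a correlation of about 0.70 in magnitude, while payroll and OPS both sit near 0.44.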

 

2018 AL WAR vs OPS

 

The scatterplot titled “2018 AL WAR vs OPS (Colored by Position)” visually explores the relationship between Wins Above Replacement (WAR) and On-base Plus Slugging (OPS) for players in the American League during the 2018 season. Each point on the plot represents a player, with OPS on the x-axis and WAR on the y-axis, and the points are colored according to the player’s position. This allows us to observe how players across different positions performed in terms of their offensive output and overall contribution to their teams.

Notably, the plot highlights standout players such as Mookie Betts and Mike Trout, who are positioned in the upper right corner, indicating their exceptional performance. Betts, then an outfielder for the Boston Red Sox, and Trout, still a center fielder for the Los Angeles Angels, both had extremely high OPS and WAR values. Their positions in the plot underscore their status as two of the most valuable players in the league during the 2018 season.

In contrast, Chris Davis, a first baseman for the Baltimore Orioles, is positioned in the lower-left corner of the plot. Davis had one of the lowest OPS and WAR values in 2018, indicating his struggles. The spread of points across the plot also reveals how different positions cluster in certain areas, with players like Davis standing out as outliers in underperformance. At the same time, Betts and Trout exemplify top-tier performance. This is a pretty cool visualization of this type of data. I find scatterplots useful.

 

Here’s a Little 3D For You

 

How is this for a different perspective? The 3D Cluster Analysis of 2023 National League (NL) shortstops visually represents player performance using an extra dimension, highlighting their key differences and similarities. Using a sophisticated technique called Principal Component Analysis (PCA), the high-dimensional performance metrics of the shortstops were reduced to three principal components, which encapsulate most of the variance in the data. This dimensionality reduction (or expansion, if you prefer) allows for a clear visualization in three-dimensional space, where each player’s position reflects his overall performance. The players are grouped into three distinct clusters, each represented by a different color, providing insights into how these athletes compare to one another based on their statistics.

The clusters were determined using the K-means clustering algorithm (much more of that down the line), which groups players with similar performance metrics into the same cluster. As noted above, the plot shows three clusters: Cluster 1 in blue, Cluster 2 in green, and Cluster 3 in red. Each cluster represents a subset of players with comparable performance profiles. For instance, the lone player in Cluster 3 (Mookie Betts), shown in red, exhibits stronger or more consistent performance in certain areas, distinguishing him from those in the other clusters.
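
For the curious, here is a minimal sketch of the K-means idea in plain Python. It is not the code behind the plot. The seven 3D points are invented stand-ins for principal-component scores, the PCA step itself is skipped, and the starting centroids are picked by hand (real implementations usually choose them randomly).

```python
from math import dist


def k_means(points, centroids, iterations=20):
    """Plain Lloyd's algorithm. Returns (labels, centroids)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # nothing moved, so the assignment is stable
        centroids = new_centroids
    return labels, centroids


# Hypothetical 3D scores for seven made-up players (not real data).
points = [
    (-1.0, -0.8, 0.1), (-1.2, -1.0, -0.1), (-0.9, -1.1, 0.0),
    (0.8, 0.9, 0.2), (1.0, 1.1, -0.2), (1.1, 0.7, 0.0),
    (6.0, 5.5, 4.0),  # one player far from everyone else
]
seeds = [points[0], points[3], points[6]]  # hand-picked starting centroids

labels, final_centroids = k_means(points, seeds)
print(labels)
```

With data shaped like this, the far-away point ends up in a cluster by itself, which is the same pattern the plot shows for Betts.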

Unsurprisingly, Betts is once again highlighted in the analysis. Notice that he is off by himself in red, focusing our attention. This emphasis allows for a closer examination of where Betts stands relative to his peers in the 2023 NL shortstop group. While I do believe that the two-dimensional plot from the last post is more diagnostic, no one can deny how cool the 3D plot looks. And that is why I published this post.

 

Scales of Unusualness: Offensive Production of NL Shortstops in 2023

To the surprise of no one, Mookie Betts was, by far, the most unusual offensive performer last year among NL shortstops. If you study the plot, you can follow the line connecting Betts to the other players.

 

Betts is a cluster of one. His offensive production was so far above all the other shortstops that no one could cluster with him. And that, I must say, is highly unusual.

 

Scales of Unusualness: Offensive Production of AL Second Basemen in 2009

The title of this post is unusual (you see what I did). Why? Historically, I have chosen titles inspired by the band Arctic Monkeys. They had a propensity for overly long song titles that may or may not have anything to do with the actual song. For the longest time, they were my favorite band. The last two CDs, though, have given me pause. I listened to both of them hundreds of times, and I must admit that I just don’t get it. Maybe on some deep (nearly subconscious) level, I have given this post a most unlikely title as a mild form of protest. I am dubious of my potential impact.

This short essay is about the offensive production of American League second basemen in 2009. I will view each player’s statistics through an Exploratory Data Analysis lens, specifically by creating different Scales of Unusualness.

Here is a table of some of the data used for the study. Of course, the variables would be very different if we used players from last year. In 2009, no exit velocity or launch angle data were available. Even with this partial data set, many advanced metrics were eliminated from the table to make it more legible. Variations of these variables (and the others) were used to create the plot, which will appear shortly.

Table 1. AL Second Basemen, 2009

If we were just considering batting average (BA), it would be easy to rank the players. In fact, the players happen to be ranked in descending order based on that column, with Cano leading the way at .320. What happens if we want to consider all the variables together? Human brains aren’t very good at that task, but computers have no problem.

Next comes the scale that was referenced earlier. We can take each column and standardize the data by giving each value a z-score. A z-score measures how many standard deviations a value lies from the mean of its column; the standardized columns carry a “Z” prefix. Cano had the highest batting average, and you can see that his z-score for “Z_BA” is 1.68, which is more than double the next highest number in the column. That means his batting average for that year was highly unusual compared to other AL second basemen.
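
Here is a minimal sketch of that standardization in plain Python. The batting averages are invented, so the output will not match Table 2, and the use of the sample standard deviation (dividing by n - 1) is an assumption; the post does not say which version was used.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample SD (n - 1); an assumption, not confirmed by the post
    return [(v - mu) / sigma for v in values]


# Hypothetical batting averages for eight made-up players (not the 2009 table).
batting_averages = [0.310, 0.297, 0.291, 0.283, 0.272, 0.260, 0.245, 0.228]

z_ba = z_scores(batting_averages)
for ba, z in zip(batting_averages, z_ba):
    print(f"BA {ba:.3f} -> Z_BA {z:+.2f}")
```

After standardizing, every column has a mean of zero and a standard deviation of one, so a batting average and a home run total can be compared on the same scale.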

Table 2. AL Second Basemen Z-Scores, 2009

One interesting note. Look at the table and see if you can determine the two most unusual players when all the data is considered. I don’t think it can be done. There are eight variables, and that is six or seven too many. Fortunately, a technique called Cluster Analysis quickly solves the problem. Below is a Cluster Tree, or Dendrogram, of the computer’s analysis.

 

Figure 1. Dendrogram

The plot shows two large categories, those with high and those with relatively low offensive productivity. Among the top performers, the software identified Ben Zobrist as the most unusual. That means that Zobrist had the best offensive season of any AL second baseman that year. If you study the plot, you will see that Alberto Callaspo finishes a close second.

I would like to point out a couple more things. The plot shows that Maicer Izturis and Howie Kendrick had the most similar seasons. Their statistics were highly correlated in their unusualness with respect to the other players. Who knew?

So, as you might have guessed, there is a payoff to this post. A Scale of Unusualness doesn’t just identify the best or most productive offensive player; it works equally well on both ends of the scale. The most unusual offensive second baseman in the AL in 2009 wasn’t Zobrist; it was the unfortunate Nick Punto (with Chris Getz closing fast). Punto was much more unusually bad than Zobrist was unusually good. My guess is that when I include defensive metrics, Punto will more than redeem himself. You can’t play from 2001 – 2014 (and win a World Series) without being a big-time player. This one-year snapshot does not do him justice. Maybe I will post the defensive analysis next. Perhaps I will include offense and defense together in a more comprehensive study. Now that I think of it, I should take a break and give those Arctic Monkeys’ CDs another 300 listens.

 

19 Percent…huh?

I spent a lot of time putting together the lone figure in this post. My forthcoming baseball book will be filled with plots like the one that follows. I have known many people whose eyes glaze over when presented with figures or graphs (including professors who should know better). Pay a little attention to this one; you will be rewarded.

Between 2004 and 2008, there was a growing disparity in the payrolls of clubs in Major League Baseball. Much was written about the unfairness of this. I agree with those who thought it outrageous that one team could spend 8 or 9 times what others could afford to pay their players. Consequently, every season began with plenty of fan bases lamenting the stone-cold truth that their teams had no chance to compete for a title or make the playoffs.

Growing up as a fan of the team then known as the Cleveland Indians, I knew that as soon as a young player started to excel, he was on his way out of town. It was a simple fact that other larger market clubs could easily outbid us for a young star’s services. Such was life in the big city.

Every year, big-money teams seemed to crush the less fortunate, and no one seemed to care. The fact that always got me going was that if a team (think Yankees) signed a player to a big contract and that guy floundered, all they did was treat the signing as a sunk cost and go about their business. Clubs like Cleveland, on the other hand, could be crippled by one bad signing. That is a statement of fact.

So, let’s see if we can gain some insight. One of the great things about a scientific mindset is that we can cut through the narratives and what people think is true, and get at the mathematical heart of the issue at hand. The following figure does just that.

I plotted payroll data from 2004 through 2008 against the winning percentage of all MLB teams. I colored the data points using a playoffs variable to simplify the plot. I think it makes it more interesting and easier to read.

Figure 1. 2004 – 2008 MLB Payroll versus Win Percentage Data.

The scatterplot is basically a blob (yes, the Yankees are in the upper right corner). That means a minimal relationship exists between a team’s payroll and that team’s record, at least for these 5 years. Note the equation in the lower right of the plot. It tells us that payroll explains only about 19 percent of the variation in MLB teams’ records. In other words, knowing a team’s payroll was of very little value in predicting the number of games it would win.

Surprised? Well…if payroll is not a predictor of a team’s success, what is? If payroll accounts for about 19 percent of the outcome, what explains the other 81 percent? I will be looking into that in my book. Perhaps I will find that left-handed middle relievers are the key to success (I doubt it), or maybe if you are putting together a team, you need batters with high exit velocities, pitchers with exceptional strikeout rates, and outfielders who can run like the wind. I am going to try my best to find out.