Baseball Players’ Performance Could Be Better Measured by Machine Learning

In the film “Moneyball,” a young economics graduate and a cash-strapped Major League Baseball coach devise a novel method of determining the worth of baseball players. Thanks to their creative approach to player statistics and salaries, the Oakland A’s were able to recruit talented players that other teams had passed on, rebuilding the franchise without going over budget.

Recent research at the Penn State College of Information Sciences and Technology could have a comparable effect on the sport. The team has developed a machine learning model that could assess baseball players’ and teams’ short- and long-term performance more accurately than the sport’s existing statistical analysis methods.

Their method, which builds on recent developments in natural language processing and computer vision, would fundamentally change, and could improve, how the state of a game and a player’s influence on it are measured.

The current family of methods, known as sabermetrics, relies on how often a player or team achieves a discrete event, such as hitting a double or a home run, said Connor Heaton, a doctoral candidate in the College of IST. However, these methods do not consider the context surrounding each action.

“Think about a scenario in which a player recorded a single in his last plate appearance,” said Heaton. “He could have hit a dribbler down the third base line, advancing a runner from first to second and beat the throw to first, or hit a ball to deep left field and reached first base comfortably but didn’t have the speed to push for a double. Describing both situations as resulting in ‘a single’ is accurate but does not tell the whole story.”

Heaton’s model instead learns the significance of in-game events from their effect on the game and the context in which they occur. By treating the game as a sequence of events, it then produces numerical representations of how players affect the game.

“We often talk about baseball in terms of ‘this player had two singles and a double yesterday,’ or ‘he went one for four,’” said Heaton. “A lot of the ways in which we talk about the game just summarize the events with one summary statistic. Our work is trying to take a more holistic picture of the game and to get a more nuanced, computational description of how players impact the game.”

Heaton’s approach draws on sequential modeling techniques from natural language processing, which help computers learn the role or meaning of different words. He applied the same idea to teach his model the role or meaning of different events in a baseball game, such as a batter hitting a single. He then modeled the game as a sequence of events to offer a fresh perspective on existing statistics.
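To make the language analogy concrete, the sketch below treats each in-game event as a “word” and maps it to an integer ID, the usual first step before a sequence model can learn from it. This is only an illustration of the general idea, not the researchers’ implementation: the event labels, the `<pad>` and `<unk>` placeholder tokens, and the function names are all assumptions made for the example.

```python
from typing import Dict, List


def build_vocab(games: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct event label an integer ID, in order of first appearance."""
    # Hypothetical reserved tokens: padding and "unknown event".
    vocab: Dict[str, int] = {"<pad>": 0, "<unk>": 1}
    for game in games:
        for event in game:
            if event not in vocab:
                vocab[event] = len(vocab)
    return vocab


def encode(game: List[str], vocab: Dict[str, int]) -> List[int]:
    """Turn a sequence of event labels into a sequence of integer IDs."""
    return [vocab.get(event, vocab["<unk>"]) for event in game]


# Made-up event sequences standing in for two games.
games = [
    ["single", "strikeout", "double", "walk"],
    ["walk", "home_run", "strikeout"],
]
vocab = build_vocab(games)

# "triple" never appeared in the games above, so it maps to <unk>.
encoded = encode(["single", "walk", "triple"], vocab)
print(encoded)
```

A sequence model would then learn a vector for each ID from the events that tend to surround it, much as language models learn word meanings from neighboring words.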

“The impact of this work is the framework that is proposed for what I like to call ‘interrogating the game,’” said Heaton. “We’re viewing it as a sequence in this whole computational scaffolding to model a game.”

The model’s output can describe a player’s short-term impact on the game, or form. These form embeddings, represented as 64-element vectors and created by adapting work from computer vision, reflect a player’s influence on the game over a short period, such as 15 plate appearances, or over a longer period, such as the player’s career.

Additionally, when combined with traditional sabermetrics, the form embeddings can predict the winner of a game with over 59% accuracy.

Heaton compared plots of the same data using embeddings produced by his approach and by the conventional sabermetrics approach. Over time, sabermetrics-based representations of a player’s influence can be erratic, varying drastically from one game to the next. Heaton’s method helps “smooth out” the way players are described over time, while still allowing for fluctuation in player performance.

“Both embeddings can help differentiate good players from bad players,” said Heaton. “But ours provides much more nuance into the exact way in which the good players impact the game.”

To train their model, the researchers used data already collected by systems installed at major league stadiums that track detailed information on every pitch thrown, including player positions in the field, base occupancy, pitch velocity, and pitch rotation. They focused on two types of data: pitch-by-pitch data, to analyze information such as pitch type and launch angle; and season-by-season data, to examine position-specific statistics such as walks and hits per inning pitched (WHIP) for pitchers and on-base plus slugging percentage (OPS) for batters.
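For readers unfamiliar with the two season-level statistics named above, the following sketch computes them from their standard definitions: walks and hits per inning pitched is (walks + hits) / innings pitched, and on-base plus slugging is on-base percentage plus slugging percentage. The season totals are invented for illustration and do not come from the study.

```python
def whip(walks: int, hits: int, innings_pitched: float) -> float:
    """Walks and hits per inning pitched, for pitchers."""
    return (walks + hits) / innings_pitched


def on_base_pct(hits: int, walks: int, hit_by_pitch: int,
                at_bats: int, sac_flies: int) -> float:
    """Times on base divided by the plate appearances that count toward OBP."""
    return (hits + walks + hit_by_pitch) / (at_bats + walks + hit_by_pitch + sac_flies)


def slugging_pct(singles: int, doubles: int, triples: int,
                 home_runs: int, at_bats: int) -> float:
    """Total bases divided by at-bats."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


# Invented season totals, for illustration only.
pitcher_whip = whip(walks=50, hits=150, innings_pitched=200.0)

singles, doubles, triples, home_runs = 100, 30, 5, 15
hits = singles + doubles + triples + home_runs  # 150
batter_ops = (
    on_base_pct(hits, walks=50, hit_by_pitch=5, at_bats=500, sac_flies=5)
    + slugging_pct(singles, doubles, triples, home_runs, at_bats=500)
)

print(f"WHIP: {pitcher_whip:.3f}")
print(f"OPS:  {batter_ops:.3f}")
```

Both numbers summarize a whole season in one value, which is the kind of context-free summary the researchers’ model is meant to complement.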

Each pitch in the collected dataset is identified by three features: the game in which it occurred, the at-bat number within the game, and the pitch number within the at-bat. Using just these three pieces of information, the researchers were able to fully reconstruct the sequence of events that make up an MLB game.
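The three identifiers work as a sort key. As a rough sketch, assuming pitch records arrive in arbitrary order, sorting on game, at-bat number, and pitch number recovers each game’s sequence. The `Pitch` record, its field names, and the sample values are hypothetical and are not taken from the researchers’ dataset.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Dict, List


@dataclass(frozen=True)
class Pitch:
    game_id: str        # which game the pitch belongs to
    at_bat_number: int  # position of the at-bat within the game
    pitch_number: int   # position of the pitch within the at-bat
    description: str    # what happened on the pitch (illustrative labels)


def reconstruct_games(pitches: List[Pitch]) -> Dict[str, List[Pitch]]:
    """Order pitches by (game, at-bat, pitch) and group them by game."""
    ordered = sorted(
        pitches, key=lambda p: (p.game_id, p.at_bat_number, p.pitch_number)
    )
    # groupby needs its input sorted by the grouping key, which it is here.
    return {
        game_id: list(group)
        for game_id, group in groupby(ordered, key=lambda p: p.game_id)
    }


# Deliberately shuffled, made-up records.
pitches = [
    Pitch("game_B", 1, 1, "ball"),
    Pitch("game_A", 2, 1, "called_strike"),
    Pitch("game_A", 1, 2, "in_play"),
    Pitch("game_A", 1, 1, "foul"),
]
games = reconstruct_games(pitches)
game_a_events = [p.description for p in games["game_A"]]
print(game_a_events)
```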

When a pitch is thrown, any of 325 possible changes to the game state may occur, including changes to base occupancy and the ball-strike count. The researchers combined this information with existing pitch-by-pitch data describing the thrown pitch and the at-bat action, so that each play could be described in terms of what happened, how it happened, and who was involved. They also fed in players’ sabermetric statistics.
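One way to picture a game change is as the difference between the state before a pitch and the state after it. The sketch below is an assumption-laden illustration: the article does not say how the 325 changes are enumerated, so the `GameState` and `StateChange` structures and their fields are hypothetical and cover only the two elements the article names, base occupancy and the ball-strike count.

```python
from typing import NamedTuple, Tuple


class GameState(NamedTuple):
    bases: Tuple[bool, bool, bool]  # runner on first, second, third
    balls: int
    strikes: int


class StateChange(NamedTuple):
    bases_before: Tuple[bool, bool, bool]
    bases_after: Tuple[bool, bool, bool]
    balls_delta: int
    strikes_delta: int


def describe_change(before: GameState, after: GameState) -> StateChange:
    """Summarize how one pitch altered base occupancy and the count."""
    return StateChange(
        bases_before=before.bases,
        bases_after=after.bases,
        balls_delta=after.balls - before.balls,
        strikes_delta=after.strikes - before.strikes,
    )


# Made-up example: runner on first, 1-1 count, and the pitch is a ball.
before = GameState(bases=(True, False, False), balls=1, strikes=1)
after = GameState(bases=(True, False, False), balls=2, strikes=1)
change = describe_change(before, after)
print(change)
```

Because each distinct change is a hashable tuple, it could be assigned an ID and treated as one “word” in the sequence a model reads.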

The project combines Heaton’s interest in historical statistical analysis of baseball with his academic focus on natural language processing.

“There’s this whole ecosystem built up around modeling language and the sequence of words,” said Heaton. “It seems like there was potential for it to be adopted to model sequences of other things; to just generalize it a little bit. I started thinking about sports analytics and it just seemed like there was a lot that could be done to improve both our understanding of the game and how the game is modeled computationally.”

The researchers hope their study will serve as a solid foundation for a new way of describing how players in baseball and other sports affect the course of the game.

“This work has the potential to significantly advance the state of the art in sabermetrics,” said Prasenjit Mitra, professor of information sciences and technology and co-author on the paper. “To the best of our knowledge, ours is the first to capture and represent a nuanced state of the game and utilize this information as the context to evaluate the individual events that are counted by traditional statistics, for example, by automatically building a model that understands key moments and clutch events.”

Heaton and Mitra’s paper, “Using Machine Learning to Describe How Players Impact the Game in the MLB,” was one of seven finalists in the 2022 Research Paper competition at the MIT Sloan Sports Analytics Conference earlier this month.