Classification/Valuation Problem: Baseball

  • Thread starter WWGD
  • Tags
    Baseball
In summary, the conversation discusses a data project that aims to classify baseball players as overvalued or undervalued by comparing two valuations, V1 and V2, that are expressed in different units. V1 is based on salary while V2 is a weighted sum of player statistics. The goal is to monetize V2 by regressing V1 against it and then compute a unit-free ratio V1/V2'. There is a discussion on how to properly normalize the data and potential issues with the model. The conversation also touches on sabermetrics and scraping data from a website for analysis.
  • #1
WWGD
Hi All,
I am doing a small data project that consists of classifying baseball players as either overvalued or undervalued. I have two valuations, V1 and V2, for each player, though in different "currencies", and I am trying to see how to express both in the same currency. I have been going over Michael Lewis' book "Moneyball", but I don't want to copy his ideas (or, more accurately, the ideas he describes in the book).
The first valuation, V1, is based on salary: the average salary over the previous 3 years. The second, V2, is a weighted sum of player statistics and has "runs" units. The statistics I am considering are batting average, OBA, number of HRs, HR per at-bat, etc. The key idea is that baseball is about runs: a team wins the game by scoring more runs than its opponent. So the statistics that correlate highly with run scoring or run prevention (preventing the opposing team from scoring runs) are weighted highly in the player's value V2. I then want to compute the ratio V1/V2 of the two indices. But I want this ratio to be unit-free, meaning both valuations must be expressed in the same units. Unfortunately, V2 is in "runs" units, so I want to transform it in a reasonable way into $ units, which are the units of V1.
My idea was to transform the latter score (the weighted sum of selected statistics) into the first type, calling the result V2', i.e., to monetize the weighted-sum index by regressing one valuation against the other (regressing V1 against V2) and using the resulting fit, assuming the regression is significant, i.e., that we are confident enough that the slope of the regression line is not 0.

Does this regression idea make sense? If you are not familiar with Baseball, I think we can do something very similar with Soccer.

Ultimately, the decision on how fairly a player is valued would be given by:
1) If V1/V2' > 1, then the player is overvalued.
2) If V1/V2' = 1, then the player is accurately valued.
3) If V1/V2' < 1, then the player is undervalued.

Any ideas on how to monetize the index V2 into V2', so that the quotient V1/V2' is unit-free?
So far I have thought of regressing one index against the other.
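
A minimal sketch of the regression idea described in this post, assuming a simple one-predictor least-squares fit of V1 on V2. The function names, the tolerance `tol` around a ratio of 1, and all the numbers are made up for illustration; they are not from any real data set.

```python
# Sketch: monetize a runs-based index V2 by regressing salary V1 on it,
# then classify players by the unit-free ratio V1 / V2'.
# All names and numbers here are hypothetical.

def fit_line(x, y):
    """Ordinary least squares fit y ~ a + b*x; returns (a, b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def classify(v1, v2, tol=0.05):
    """Label each player using V1 / V2', where V2' = a + b*V2 is in dollars.

    tol is an assumed tolerance band around 1, since an exact ratio of 1
    essentially never occurs with real numbers.
    """
    a, b = fit_line(v2, v1)
    labels = []
    for salary, runs in zip(v1, v2):
        ratio = salary / (a + b * runs)
        if ratio > 1 + tol:
            labels.append("overvalued")
        elif ratio < 1 - tol:
            labels.append("undervalued")
        else:
            labels.append("fair")
    return labels

v2_runs = [40.0, 55.0, 70.0, 85.0, 100.0]         # hypothetical weighted-stat index
v1_salary = [2.0e6, 6.0e6, 5.0e6, 9.0e6, 8.0e6]   # hypothetical 3-year average salary
labels = classify(v1_salary, v2_runs)
print(labels)
```

One caveat with this sketch: if the fitted intercept is negative, V2' can be zero or negative for weak players, and the ratio then stops being meaningful.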
 
  • #2
It sounds like you need to normalize to a standard reference player who is known to be fairly paid. Otherwise all you can do would be to do something like a standardized regression to identify players that are paid at the average rate.
 
  • #3
Dale said:
It sounds like you need to normalize to a standard reference player who is known to be fairly paid. Otherwise all you can do would be to do something like a standardized regression to identify players that are paid at the average rate.

Thanks for your reply. What do you mean by standardized regression?
 
  • #4
A standardized regression is one where you standardize all of the variables before the regression, meaning you subtract the mean and divide by the standard deviation.
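
A small sketch of that standardization, assuming a single predictor and the sample standard deviation. The data and the names `zscores`, `standardized_slope` and `gap` are hypothetical. With one predictor, the slope of the standardized regression equals the Pearson correlation.

```python
import math

def zscores(values):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def standardized_slope(x, y):
    """OLS slope of z(y) on z(x); no intercept is needed since both have mean 0."""
    zx, zy = zscores(x), zscores(y)
    return sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)

runs = [40.0, 55.0, 70.0, 85.0, 100.0]          # hypothetical runs index
salary = [2.0e6, 6.0e6, 5.0e6, 9.0e6, 8.0e6]    # hypothetical salaries

z_runs, z_salary = zscores(runs), zscores(salary)
slope = standardized_slope(runs, salary)

# Dimensionless residual: positive means paid above the average rate
# for that level of production, negative means paid below it.
gap = [zs - slope * zr for zs, zr in zip(z_salary, z_runs)]
print(round(slope, 4), [round(g, 2) for g in gap])
```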
 
  • #5
If you are interested in a dimensionless version of the ratio V1/V2, is it going to make sense to redefine V2 by a regression of something against V1? If the regression is V2 = A (V1) + B, then your ratio V1/V2 will be computed as V1/(A V1 + B). Any error in representing a particular player's V2 by A V1 + B will result in an error in the ratio associated with that player.

If you use Dale's approach, you translate the raw V1 and V2 measurements to dimensionless "Z-scores". You might need further manipulations if you decide that an "average" player isn't "fairly" valued.
 
  • #6
FWIW: In US baseball, for players with major league contracts, I think you will find the distribution of salaries is really multi-modal. Players' yearly salaries vary by orders of magnitude: from ##10^5## ... ##10^8## per year. Your model as stated might have problems. You decide.
 
  • #7
jim mcnamara said:
FWIW: In US baseball, for players with major league contracts, I think you will find the distribution of salaries is really multi-modal. Players' yearly salaries vary by orders of magnitude: from ##10^5## ... ##10^8## per year. Your model as stated might have problems. You decide.
That is a good hint. If the runs measure is similarly multi-modal, then it could be that the residuals are still normal. But it is definitely something to check.
 
  • #8
Stephen Tashi said:
If you are interested in a dimensionless version of the ratio V1/V2, is it going to make sense to redefine V2 by a regression of something against V1? If the regression is V2 = A (V1) + B, then your ratio V1/V2 will be computed as V1/(A V1 + B). Any error in representing a particular player's V2 by A V1 + B will result in an error in the ratio associated with that player.

If you use Dale's approach, you translate the raw V1 and V2 measurements to dimensionless "Z-scores". You might need further manipulations if you decide that an "average" player isn't "fairly" valued.

Thanks, I was thinking of using regression but not necessarily in this way. But that is a good point. Thanks all for your input.
 
  • #10
Sorry if this is OT, but does anyone know how to scrape data (preferably, but not necessarily using Python) from a site without knowing the data type used? I have a source site http://www.usatoday.com/sports/mlb/salaries/ and would like to download the data to do basic analysis. I saw the source code but I was not able to figure out the data type. Do I contact the webmaster? I know some basic Python data structures and methods, but these assume knowledge of the data type of the source data.
 
  • #11
It's pretty simple. Here's a simple way to do it using BeautifulSoup. I added the pandas bit just to make analysis and writing to a csv easier.

Code written in python
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://www.usatoday.com/sports/mlb/salaries/"

# Download the page and parse the HTML
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

# Column names for cells 1..7 of each table row (cell 0 is skipped, as before)
fields = ["name", "team", "pos", "salary", "years", "value", "annual"]

rows = []
for row in soup.find_all('tr')[1:]:  # skip the header row
    col = row.find_all('td')
    if len(col) < 8:
        continue  # not a data row
    # get_text() also works when a cell contains nested tags,
    # where .string would be None
    rows.append([cell.get_text(strip=True) for cell in col[1:8]])

df = pd.DataFrame(rows, columns=fields)

df.to_csv("somefilename", index=False)
 
  • #12
MarneMath said:
It's pretty simple. Here's a simple way to do it using BeautifulSoup. I added the pandas bit just to make analysis and writing to a csv easier.

Excellent, Marne, thanks!
 
  • #13
MarneMath said:
It's pretty simple. Here's a simple way to do it using BeautifulSoup. I added the pandas bit just to make analysis and writing to a csv easier.

Just a quick question, Marne, do we need graphlab/pip to do the downloading and installation?
 
  • #14
For me personally, I have Anaconda installed, which is a distribution that contains nearly every scientific library you'll ever need and also functions as a package manager. So I have no idea what you'll need in your case. I imagine that a GitHub repository exists for Beautiful Soup, and you could simply pull that and then run setup.py in your libraries if you are opposed to using pip or Anaconda. I know a GitHub repository exists for pandas, since I used it when I needed to make a contribution.
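
Whichever installer is used, a quick way to see what is still missing is to ask Python which of the modules the scraper imports can be found. This is only a sketch; the helper name `missing_packages` is made up. On PyPI the packages are called `requests`, `beautifulsoup4` (imported as `bs4`) and `pandas`, so something like `pip install requests beautifulsoup4 pandas` would be the usual pip route.

```python
import importlib.util

def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# Modules imported by the scraping script in this thread
needed = ["requests", "bs4", "pandas"]
print("Missing:", missing_packages(needed))
```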
 
  • #15
MarneMath said:
For me personally, I have Anaconda installed, which is a distribution that contains nearly every scientific library you'll ever need and also functions as a package manager. So I have no idea what you'll need in your case. I imagine that a GitHub repository exists for Beautiful Soup, and you could simply pull that and then run setup.py in your libraries if you are opposed to using pip or Anaconda. I know a GitHub repository exists for pandas, since I used it when I needed to make a contribution.
Sorry to bother you one more time, Marne, I would just like to know which version of Python you are running.

EDIT: Never Mind
 
  • #16
Dale said:
do something like a standardized regression
Dale said:
subtract the mean and divide by the standard deviation.
I disagree. Imagine the weakest player is measured at 1000 runs, earning $1000; the next scores 1001, earning $11000, the next 1002 runs earning $21000, and so on, each extra run earning $10000. In runs/$, the weakest player is clearly the best value for money.

@WWGD, you need to determine the monetary value of extra runs. You could look at what teams earn in prizes and sponsorship and compare that with scores. Even then, the value of a player to a team is the marginal benefit, i.e. how much would they lose by switching to a weaker player. That depends on how that team compares to others. A team that wins every game comprehensively could afford to save some on salaries.
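
One way to read the example in this post, sketched in code: because salary is an exact linear function of runs, the standardized scores of the two coincide, so a z-score comparison sees every player as equally "fairly" paid, while dollars per run differs widely. The list is extended to five players following the "and so on" pattern; the variable names are made up.

```python
import math

def zscores(values):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

runs = [1000 + k for k in range(5)]             # 1000, 1001, 1002, ...
salary = [1000 + 10000 * k for k in range(5)]   # $1000, $11000, $21000, ...

# Standardized view: the gap between the two z-scores is zero for everyone
z_gap = [zs - zr for zs, zr in zip(zscores(salary), zscores(runs))]

# Average cost per run: the weakest player is by far the cheapest
dollars_per_run = [s / r for s, r in zip(salary, runs)]

# Marginal cost of one extra run between neighbouring players
marginal = [(salary[k + 1] - salary[k]) / (runs[k + 1] - runs[k])
            for k in range(len(runs) - 1)]

print([round(g, 6) for g in z_gap])
print([round(d, 2) for d in dollars_per_run])
print(marginal)
```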
 
  • #17
Thanks all,
I was considering doing a Mann-Whitney U test to tell whether the two baseball player rankings, by salary and by WAR (Wins Above Replacement), are similar. Still, does anyone know how to compare the two distributions if the rankings are not similar at some confidence level?
 
  • #18
WWGD said:
Thanks all,
I was considering doing a Mann-Whitney U test to tell whether the two baseball player rankings, by salary and by WAR (Wins Above Replacement), are similar. Still, does anyone know how to compare the two distributions if the rankings are not similar at some confidence level?
Have you looked up rank correlation?
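
For reference, a minimal sketch of one rank correlation, Spearman's, computed as the Pearson correlation of the two rank vectors. The salary and WAR numbers are invented, and the helper names are hypothetical; `scipy.stats.spearmanr` computes the same coefficient if SciPy is available.

```python
import math

def ranks(values):
    """Rank 1 = largest value; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

salary = [2.0e6, 6.0e6, 5.0e6, 9.0e6, 8.0e6]   # hypothetical salaries
war = [1.0, 1.5, 3.0, 4.0, 5.5]                # hypothetical WAR values

rho = spearman(salary, war)
print(round(rho, 3))
```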
 
  • #19
haruspex said:
Have you looked up rank correlation?
Thanks, but I don't see how to use this. I am trying to identify the elements in both sets who are not equally-ranked. I can see how this allows me to compare both distributions ( and the test being non-parametric is a plus), but I don't see how to identify differently-ranked elements.
 
  • #20
WWGD said:
Thanks, but I don't see how to use this. I am trying to identify the elements in both sets who are not equally-ranked. I can see how this allows me to compare both distributions ( and the test being non-parametric is a plus), but I don't see how to identify differently-ranked elements.
If you plot the rankings against each other, x and y, and do a straight line fit, you can take the rank mismatch of a given player as the (square of the) distance of the point from the line.
Note that the straight line in question should not be the usual linear regression. That minimises ##\sum (\Delta y)^2##. I suggest a symmetric regression, minimising the sum of squares of the distances from the points to the line, ##\sum \left[(\Delta y)^2 + (\Delta x)^2\right]##. See e.g. http://stats.stackexchange.com/ques...en-linear-regression-on-y-with-x-and-x-with-y
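
A sketch of that symmetric fit, assuming it is taken to be orthogonal (total least squares) regression in its closed form for one x and one y. The ranks and function names are hypothetical, and the formula assumes the two variables have nonzero covariance. When both axes are rankings without ties, their variances are equal, so the fitted slope comes out as 1 for positively related rankings and the squared distance reduces to half the squared rank difference.

```python
import math

def orthogonal_fit(x, y):
    """Line y = a + b*x minimising the sum of squared perpendicular distances."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    # Closed-form slope; requires sxy != 0
    b = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    a = my - b * mx
    return a, b

def squared_distances(x, y):
    """Squared perpendicular distance of each point from the fitted line."""
    a, b = orthogonal_fit(x, y)
    return [(yi - a - b * xi) ** 2 / (1 + b ** 2) for xi, yi in zip(x, y)]

salary_rank = [1, 2, 3, 4, 5, 6]   # hypothetical rank by salary
war_rank = [2, 1, 3, 6, 4, 5]      # hypothetical rank by WAR

intercept, slope = orthogonal_fit(salary_rank, war_rank)
mismatch = squared_distances(salary_rank, war_rank)
worst = max(range(len(mismatch)), key=lambda i: mismatch[i])
print(slope, intercept, mismatch, worst)
```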
 
  • #21
Dale said:
That is a good hint. If the runs measure is similarly multi-modal, then it could be that the residuals are still normal. But it is definitely something to check.
I am a little confused by this requirement of normality for residuals. AFAIK, Gauss-Markov, e.g., only requires the errors to be uncorrelated, with mean 0 and constant variance. This gives us BLUE: regression coefficients that are unbiased and with minimal variance (among linear unbiased estimators). What else do we get if the errors are normally distributed (and, BTW, does this mean that each ##\epsilon_i## is normal or that the set ##\{ \epsilon_i \}## itself is normally distributed)? IIRC, normality implies that the coefficients agree with the maximum likelihood estimators? (Sorry for the necropost.)
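
For what it is worth, a short sketch of what the normality assumption adds, written for a linear model ##y_i = x_i^{\top}\beta + \epsilon_i##. The symbols ##x_i##, ##\beta##, ##\sigma^2## and ##X## are notation assumed here, not taken from the thread. The assumption is placed on each error term ##\epsilon_i##; the observed residuals are only used to check it.

```latex
\epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)
\;\Longrightarrow\;
\ell(\beta, \sigma^2)
  = -\frac{n}{2}\log\left(2\pi\sigma^2\right)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2

\hat{\beta}_{\text{MLE}}
  = \arg\min_{\beta}\sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2
  = \hat{\beta}_{\text{OLS}},
\qquad
\hat{\beta} \mid X \sim N\!\left(\beta,\ \sigma^2\left(X^{\top}X\right)^{-1}\right)
```

So under normal errors the least-squares coefficients coincide with the maximum likelihood estimators, and their sampling distribution is exactly normal, which is what makes the usual t and F tests exact in finite samples. Without normality those tests are generally only approximate.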
 
  • #22
One way is to look at market pricing - do a regression on the log salary for all players (would want to take the log due to the large right tail) against a basket of statistics (or percentile rankings of them) you think are relevant, then the players with positive (negative) alphas are 'overpaid' ('underpaid') relative to the market

From a baseball standpoint - your stats are not going to be meaningful unless they are done separately per position. A slugging stat that is good for a shortstop or catcher (not to mention pitcher) will not be good for an outfielder
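
A rough sketch of the market-pricing idea above, simplified to a single predictor instead of a basket of statistics. The WAR-like index, the salaries and the helper names are all hypothetical. Following the per-position remark, a fit like this would presumably be run separately for each position.

```python
import math

def fit_line(x, y):
    """Ordinary least squares fit y ~ a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

index = [1.0, 2.0, 3.0, 4.0, 5.0]                        # hypothetical production index
salary = [630_000, 2_500_000, 2_000_000, 12_600_000, 7_900_000]  # hypothetical salaries

# Regress log10(salary) on the index to tame the long right tail
log_salary = [math.log10(s) for s in salary]
a, b = fit_line(index, log_salary)

# Residual ("alpha") in log10 units: 10**residual is actual / market-implied salary
residuals = [ls - (a + b * xi) for xi, ls in zip(index, log_salary)]
labels = ["overpaid" if r > 0 else "underpaid" for r in residuals]
print([round(r, 2) for r in residuals], labels)
```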
 
  • #23
BWV said:
One way is to look at market pricing - do a regression on the log salary for all players (would want to take the log due to the large right tail) against a basket of statistics (or percentile rankings of them) you think are relevant, then the players with positive (negative) alphas are 'overpaid' ('underpaid') relative to the market

From a baseball standpoint - your stats are not going to be meaningful unless they are done separately per position. A slugging stat that is good for a shortstop or catcher (not to mention pitcher) will not be good for an outfielder
Thank you. My basic idea is that baseball is about consistently scoring more runs than your opponents. So any stat that is conducive to either scoring runs (hits, 2Bs, etc.) or preventing your opponent from scoring (fielding avg., double plays performed, etc.) is what matters, essentially.
 
  • #24
WWGD said:
Thank you. My basic idea is that baseball is about consistently scoring more runs than your opponents. So any stat that is conducive to either scoring runs (hits, 2Bs, etc.) or preventing your opponent from scoring (fielding avg., double plays performed, etc.) is what matters, essentially.

Yes, but the difficulty is equating stats across very different positions and teams: how do you compare fielding percentage between a catcher and an outfielder, or between a player on a team with excellent pitching that induces a lot of ground and fly balls vs a team with bad pitching that gives up a lot of hits? Also, stats are commonly adjusted for ballparks: a slugging stat in Colorado is not the same as one in Miami (due to altitude).

You might look at Wins Above Replacement (WAR), which attempts to be a single statistic that can be compared across positions.

https://www.fangraphs.com/library/misc/war/
 
  • #25
MarneMath said:
For me personally, I have Anaconda installed, which is a distribution that contains nearly every scientific library you'll ever need and also functions as a package manager. So I have no idea what you'll need in your case. I imagine that a GitHub repository exists for Beautiful Soup, and you could simply pull that and then run setup.py in your libraries if you are opposed to using pip or Anaconda. I know a GitHub repository exists for pandas, since I used it when I needed to make a contribution.
MarneMath, anyone, sorry for the necropost. I am ready to do some scraping, what version of Anaconda can I use? Must it be 2.7?
 

What is the classification/valuation problem in baseball?

The classification/valuation problem in baseball is the challenge of accurately determining the value of players and teams in terms of their performance and contributions to the game. This involves evaluating various statistics and metrics to determine the worth of players and teams in comparison to others.

What factors are considered when classifying/valuing players and teams in baseball?

Factors that are commonly considered when classifying/valuing players and teams in baseball include individual player statistics such as batting average, home runs, and earned run average, as well as team statistics such as wins and losses, run differential, and playoff appearances. Other factors may also be taken into account, such as player contracts and market value.

How is the classification/valuation problem in baseball addressed?

The classification/valuation problem in baseball is addressed through the use of advanced statistical analysis, such as sabermetrics, which takes into account a wide range of factors and attempts to provide a more accurate evaluation of player and team performance. Additionally, teams may also use scouting reports, expert opinions, and other methods to assess the value of players and teams.

What challenges are associated with the classification/valuation problem in baseball?

One challenge associated with the classification/valuation problem in baseball is the ever-changing nature of the game, as new statistics and metrics are constantly being developed and used. Additionally, the subjectivity of evaluating player and team performance can also make it difficult to reach a consensus on their value. There may also be biases and limitations in the data used for analysis.

How important is the classification/valuation problem in baseball?

The classification/valuation problem in baseball is extremely important, as it can have a significant impact on team decisions, such as player contracts, trades, and roster composition. It also plays a role in determining the success of teams and players, as well as their perceived value in the eyes of fans and the media.
