# A Classification/Valuation Problem: Baseball

#### WWGD

Gold Member
Hi All,
I am doing a small data project that consists of classifying baseball players as either overvalued or undervalued. I have two valuations, V1 and V2, for each player, but they are in different "currencies", and I am trying to see how to express both in the same currency. I have been going over M. Lewis' book "Moneyball", but I don't want to copy his ideas (or, more accurately, the ideas he describes in the book).

The first valuation, V1, is based on salary: the average salary over the previous 3 years. The second, V2, is a weighted sum of player statistics and has units of "runs". The statistics I am considering are batting average, OBA, number of HRs, HR per at bat, etc. The key idea is that baseball is about runs: a team wins the game by scoring more runs than its opponent, so the statistics that correlate highly with run scoring or run prevention (meaning preventing the opposing team from scoring runs) are weighted heavily in the player's value V2.

I then want to compute the ratio V1/V2 of the two indices, and I want this ratio to be unit-free, meaning both valuations must be expressed in the same units. Unfortunately, V2 is in runs, so I want to transform it in a reasonable way into dollars, which are the units of V1. My idea is to monetize the weighted-sum index, calling the result V2', by regressing V1 against V2 and using the fitted relationship (assuming the regression is significant, i.e., that we are confident enough that the slope of the regression line is not 0). Does this regression idea make sense? If you are not familiar with baseball, I think we can do something very similar with soccer.

Ultimately, the decision on how fairly a player is valued would be given by:
1) If V1/V2' > 1, the player is overvalued.
2) If V1/V2' = 1, the player is accurately valued.
3) If V1/V2' < 1, the player is undervalued.

Any ideas on how to monetize the index V2 into V2', so that the quotient V1/V2' is unit-free?
I have thought of regressing one index against another.

#### Dale

Mentor
It sounds like you need to normalize to a standard reference player who is known to be fairly paid. Otherwise, all you can do is something like a standardized regression to identify players who are paid at the average rate.

#### WWGD

Science Advisor
Gold Member
> It sounds like you need to normalize to a standard reference player who is known to be fairly paid. Otherwise, all you can do is something like a standardized regression to identify players who are paid at the average rate.

Thanks for your reply. What do you mean by standardized regression?

#### Dale

Mentor
A standardized regression is one where you standardize all of the variables before the regression, meaning you subtract the mean and divide by the standard deviation.

#### Stephen Tashi

Science Advisor
If you are interested in a dimensionless version of the ratio V1/V2, is it going to make sense to redefine V2 by a regression of something against V1? If the regression is V2 = A (V1) + B, then your ratio V1/V2 will be computed as V1/(A V1 + B). Any error in representing a particular player's V2 by A V1 + B will result in an error in the ratio associated with that player. If you use Dale's approach, you translate the raw V1 and V2 measurements to dimensionless "Z-scores". You might need further manipulations if you decide that an "average" player isn't "fairly" valued.

#### jim mcnamara

Mentor
FWIW: In US baseball, for players with major league contracts, I think you will find the distribution of salaries is really multi-modal. Players' yearly salaries vary by orders of magnitude: from $10^5$ ... $10^8$ per year. Your model as stated might have problems. You decide.

#### Dale

Mentor
> FWIW: In US baseball, for players with major league contracts, I think you will find the distribution of salaries is really multi-modal. Players' yearly salaries vary by orders of magnitude: from $10^5$ ... $10^8$ per year. Your model as stated might have problems. You decide.

That is a good hint. If the runs measure is similarly multi-modal, then it could be that the residuals are still normal. But it is definitely something to check.

#### WWGD

Science Advisor
Gold Member
> If you are interested in a dimensionless version of the ratio V1/V2, is it going to make sense to redefine V2 by a regression of something against V1? If the regression is V2 = A (V1) + B, then your ratio V1/V2 will be computed as V1/(A V1 + B). Any error in representing a particular player's V2 by A V1 + B will result in an error in the ratio associated with that player. If you use Dale's approach, you translate the raw V1 and V2 measurements to dimensionless "Z-scores". You might need further manipulations if you decide that an "average" player isn't "fairly" valued.

Thanks, I was thinking of using regression but not necessarily in this way. But that is a good point. Thanks all for your input.

#### WWGD

Science Advisor
Gold Member
Sorry if this is OT, but does anyone know how to scrape data (preferably, but not necessarily, using Python) from a site without knowing the data type used? I have a source site, http://www.usatoday.com/sports/mlb/salaries/, and would like to download the data to do basic analysis. I looked at the source code but was not able to figure out the data type. Do I contact the webmaster? I know some basic Python data structures and methods, but these assume knowledge of the data type of the source data.

#### MarneMath

Education Advisor
It's pretty simple. Here's a simple way to do it using BeautifulSoup. I added the pandas bit just to make analysis and writing to a csv easier.

Code written in Python:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http://www.usatoday.com/sports/mlb/salaries/"
page = requests.get(url)
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(page.text, "html.parser")

name = []
team = []
pos = []
salary = []
years = []
value = []
annual = []

# Skip the header row, then read the cells of each table row.
for row in soup.find_all('tr')[1:]:
    col = row.find_all('td')
    column_1 = col[1].string.strip()
    name.append(column_1)
    column_2 = col[2].string.strip()
    team.append(column_2)
    column_3 = col[3].string.strip()
    pos.append(column_3)
    column_4 = col[4].string.strip()
    salary.append(column_4)
    column_5 = col[5].string.strip()
    years.append(column_5)
    column_6 = col[6].string.strip()
    value.append(column_6)
    column_7 = col[7].string.strip()
    annual.append(column_7)

# Collect the columns in a DataFrame and write them out as CSV.
columns = {"name": name, "team": team, "pos": pos, "salary": salary,
           "years": years, "value": value, "annual": annual}
df = pd.DataFrame(columns)
df.to_csv("somefilename", index=False)
```

#### WWGD

Science Advisor
Gold Member
> It's pretty simple. Here's a simple way to do it using BeautifulSoup. I added the pandas bit just to make analysis and writing to a csv easier.

Excellent, Marne, thanks!
#### WWGD

Science Advisor
Gold Member
> It's pretty simple. Here's a simple way to do it using BeautifulSoup. I added the pandas bit just to make analysis and writing to a csv easier.

Just a quick question, Marne: do we need graphlab/pip to do the downloading and installation?

#### MarneMath

Education Advisor
Personally, I have Anaconda installed, which is a distribution that contains nearly every scientific library you'll ever need and also functions as a package manager. So I have no idea what you'll need in your case. I imagine that a GitHub repository exists for Beautiful Soup, and you could simply pull that and then run setup.py in your libraries if you are opposed to using pip or Anaconda. I know a GitHub repository exists for pandas, since I used it when I needed to make a contribution.

#### WWGD

Science Advisor
Gold Member
> Personally, I have Anaconda installed, which is a distribution that contains nearly every scientific library you'll ever need and also functions as a package manager. So I have no idea what you'll need in your case. I imagine that a GitHub repository exists for Beautiful Soup, and you could simply pull that and then run setup.py in your libraries if you are opposed to using pip or Anaconda. I know a GitHub repository exists for pandas, since I used it when I needed to make a contribution.

Sorry to bother you one more time, Marne, just to know which version of Python you are running.
EDIT: Never mind.

#### haruspex

Science Advisor
Homework Helper
Gold Member
2018 Award
> do something like a standardized regression
>
> subtract the mean and divide by the standard deviation.

I disagree. Imagine the weakest player is measured at 1000 runs, earning $1000; the next scores 1001 runs, earning $11000; the next 1002 runs, earning $21000; and so on, each extra run earning $10000. In runs per dollar, the weakest player is clearly the best value for money.

@WWGD, you need to determine the monetary value of extra runs. You could look at what teams earn in prizes and sponsorship and compare that with scores. Even then, the value of a player to a team is the marginal benefit, i.e. how much would they lose by switching to a weaker player. That depends on how that team compares to others. A team that wins every game comprehensively could afford to save some on salaries.
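
The counterexample above can be checked numerically. The following is a minimal sketch, assuming NumPy and a hypothetical league of five players built from the numbers in the post; the names `zscore`, `z_gap`, and `runs_per_dollar` are illustrative only.

```python
import numpy as np

# Hypothetical league from the example above: player i is measured at
# 1000 + i runs and earns 1000 + 10000 * i dollars.
i = np.arange(5)
runs = 1000.0 + i
salary = 1000.0 + 10000.0 * i


def zscore(x):
    """Standardize: subtract the mean and divide by the standard deviation."""
    return (x - x.mean()) / x.std()


z_runs = zscore(runs)
z_salary = zscore(salary)

# Salary is an exact linear function of runs, so the z-scores coincide and
# a standardized comparison rates every player as paid at the average rate.
z_gap = z_salary - z_runs

# Runs per dollar tells a different story: it drops with each extra run.
runs_per_dollar = runs / salary

print("z-score gap:    ", np.round(z_gap, 12))
print("runs per dollar:", np.round(runs_per_dollar, 4))
```

Standardizing removes both the units and the scale, so it cannot say how many dollars an extra run is worth; that has to come from outside information, such as the marginal value of runs discussed above.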

#### WWGD

Gold Member
Thanks all,
I was considering doing a Mann-Whitney U test to tell whether the two baseball player rankings, by salary and by WAR (Wins Above Replacement), are similar. Still, does anyone know, if the rankings are not similar at some confidence level, how to compare the two distributions?

#### haruspex

Homework Helper
Gold Member
2018 Award
> Thanks all,
> I was considering doing a Mann-Whitney U test to tell whether the two baseball player rankings, by salary and by WAR (Wins Above Replacement), are similar. Still, does anyone know, if the rankings are not similar at some confidence level, how to compare the two distributions?

Have you looked up rank correlation?

#### WWGD

Gold Member
> Have you looked up rank correlation?

Thanks, but I don't see how to use this. I am trying to identify the elements in both sets that are not equally ranked. I can see how this allows me to compare both distributions (and the test being non-parametric is a plus), but I don't see how to identify differently ranked elements.

#### haruspex

Homework Helper
Gold Member
2018 Award
> Thanks, but I don't see how to use this. I am trying to identify the elements in both sets that are not equally ranked. I can see how this allows me to compare both distributions (and the test being non-parametric is a plus), but I don't see how to identify differently ranked elements.

If you plot the rankings against each other, x and y, and do a straight line fit, you can take the rank mismatch of a given player as the (square of the) distance of the point from the line.
Note that the straight line in question should not be the usual linear regression. That minimises $\sum (\Delta y)^2$. I suggest a symmetric regression, minimising the sum of squares of the distances from the points to the line, $\sum \left[ (\Delta y)^2 + (\Delta x)^2 \right]$. See e.g. http://stats.stackexchange.com/questions/22718/what-is-the-difference-between-linear-regression-on-y-with-x-and-x-with-y
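
One way to sketch the symmetric fit described above, assuming NumPy and six hypothetical players with made-up ranks (the names `players`, `salary_rank`, `war_rank`, and `orthogonal_fit` are illustrative, not from the thread). It relies on the standard result that the line minimising perpendicular distances passes through the centroid of the points along their first principal axis.

```python
import numpy as np

# Hypothetical ranks for six players (1 = best); not real data.
players = ["A", "B", "C", "D", "E", "F"]
salary_rank = np.array([1, 2, 3, 4, 5, 6], dtype=float)
war_rank = np.array([2, 1, 3, 6, 4, 5], dtype=float)


def orthogonal_fit(x, y):
    """Fit the line minimising summed squared perpendicular distances.

    Returns the centroid (a point on the line), the unit direction of the
    line, and each point's signed perpendicular distance from the line.
    """
    pts = np.column_stack([x, y])
    centroid = pts.mean(axis=0)
    centered = pts - centroid
    # Rows of vt are the principal axes: vt[0] is the line direction,
    # vt[1] is the unit normal to the line.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction, normal = vt[0], vt[1]
    signed_dist = centered @ normal
    return centroid, direction, signed_dist


centroid, direction, signed_dist = orthogonal_fit(salary_rank, war_rank)
mismatch = signed_dist ** 2

# List players from most to least mismatched.
for idx in np.argsort(-mismatch):
    print(players[idx], round(float(mismatch[idx]), 3))
```

When both inputs are tie-free ranks of the same players and the rankings are positively correlated, the two axes have identical spread, so the fitted line is y = x and the mismatch reduces to half the squared rank difference. With raw (unranked) salary and WAR values the fit would depend on how the two axes are scaled.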

#### WWGD

Gold Member
> That is a good hint. If the runs measure is similarly multi-modal, then it could be that the residuals are still normal. But it is definitely something to check.

I am a little confused by this requirement of normality for residuals. AFAIK, Gauss-Markov, e.g., only requires the errors to be uncorrelated, with mean 0 and equal variance. This gives us BLUE: regression coefficients that are unbiased and have minimal variance (among linear unbiased estimators). What else do we get if the errors are normally distributed? (And, BTW, does this mean that each $\epsilon_i$ is normal, or that the set $\{ \epsilon_i \}$ itself is normally distributed?) IIRC, normality implies that the coefficients agree with the maximum likelihood estimators? (Sorry for the necropost.)
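
As a sketch of what the extra assumption buys, take the usual linear model in matrix form. The symbols $y$, $X$, $\beta$, $\sigma^2$, $n$ (observations), and $p$ (coefficients) are assumed notation, not from the thread, and the errors are assumed independent with a common variance:

$$
y = X\beta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I_n)
\;\Longrightarrow\;
\hat\beta = (X^\top X)^{-1} X^\top y \sim N\!\left(\beta,\ \sigma^2 (X^\top X)^{-1}\right),
\qquad
\frac{\hat\beta_j - \beta_j}{\widehat{\mathrm{se}}(\hat\beta_j)} \sim t_{n-p}.
$$

$$
\ell(\beta, \sigma^2) = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\,\lVert y - X\beta \rVert^2
$$

Under this assumption the log-likelihood is maximised over $\beta$ exactly where the residual sum of squares is minimised, so the least-squares coefficients coincide with the maximum likelihood estimators, and the t (and F) distributions used for tests and confidence intervals hold exactly in finite samples rather than only approximately. In this formulation the assumption is on the joint distribution of the error vector, which makes each $\epsilon_i$ individually $N(0, \sigma^2)$.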

#### BWV

One way is to look at market pricing: regress log salary for all players (you would want to take the log because of the large right tail) against a basket of statistics (or percentile rankings of them) that you think are relevant. The players with positive (negative) alphas are then 'overpaid' ('underpaid') relative to the market.

From a baseball standpoint, your stats are not going to be meaningful unless they are done separately per position. A slugging stat that is good for a shortstop or catcher (not to mention a pitcher) will not be good for an outfielder.
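
The market-pricing regression above might look like the following minimal sketch. It assumes NumPy and uses synthetic, made-up data: the two statistics, the coefficients in `true_coef`, and the noise level are all hypothetical, and the residual is read as the "alpha" of each player.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, hypothetical data: two made-up performance statistics per player.
n_players = 200
stats = rng.normal(size=(n_players, 2))
true_coef = np.array([0.8, 0.3])
log_salary = 14.0 + stats @ true_coef + rng.normal(scale=0.5, size=n_players)

# Ordinary least squares of log salary on the statistics plus an intercept.
X = np.column_stack([np.ones(n_players), stats])
coef, *_ = np.linalg.lstsq(X, log_salary, rcond=None)

# The residual plays the role of the player's "alpha": positive means paid
# more than the fitted market rate for those statistics, negative means less.
alpha = log_salary - X @ coef

overpaid = np.argsort(-alpha)[:5]
underpaid = np.argsort(alpha)[:5]
print("most overpaid (row indices): ", overpaid)
print("most underpaid (row indices):", underpaid)
```

Because the response is log salary, an alpha of 0.1 corresponds to being paid roughly 10% above the fitted market rate. Following the point about positions, one such fit would be run per position rather than on the pooled data.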

#### WWGD

Gold Member
> One way is to look at market pricing: regress log salary for all players (you would want to take the log because of the large right tail) against a basket of statistics (or percentile rankings of them) that you think are relevant. The players with positive (negative) alphas are then 'overpaid' ('underpaid') relative to the market.
>
> From a baseball standpoint, your stats are not going to be meaningful unless they are done separately per position. A slugging stat that is good for a shortstop or catcher (not to mention a pitcher) will not be good for an outfielder.

Thank you. My basic idea is that baseball is about consistently scoring more runs than your opponents. So the stats that are conducive to either scoring runs (hits, 2Bs, etc.) or preventing your opponent from scoring (fielding average, double plays performed, etc.) are what matter, essentially.

#### BWV

> Thank you. My basic idea is that baseball is about consistently scoring more runs than your opponents. So the stats that are conducive to either scoring runs (hits, 2Bs, etc.) or preventing your opponent from scoring (fielding average, double plays performed, etc.) are what matter, essentially.

Yes, but the difficulty is equating stats across very different positions and teams. How do you compare fielding percentage between a catcher and an outfielder, or between a player on a team with excellent pitching that induces a lot of ground and fly balls and one on a team with bad pitching that gives up a lot of hits? Also, stats are commonly adjusted for ballparks: a slugging stat in Colorado is not the same as one in Miami (due to altitude).

You might look at Wins Above Replacement (WAR), which attempts to be a single statistic that can be compared across positions.

https://www.fangraphs.com/library/misc/war/


#### WWGD

Gold Member
> Personally, I have Anaconda installed, which is a distribution that contains nearly every scientific library you'll ever need and also functions as a package manager. So I have no idea what you'll need in your case. I imagine that a GitHub repository exists for Beautiful Soup, and you could simply pull that and then run setup.py in your libraries if you are opposed to using pip or Anaconda. I know a GitHub repository exists for pandas, since I used it when I needed to make a contribution.

MarneMath, anyone, sorry for the necropost. I am ready to do some scraping. What version of Anaconda can I use? Must it be 2.7?