Classification/Valuation Problem: Baseball

WWGD · Nov 14, 2016

Hi All,
I am doing a small data project that consists of classifying baseball players as either overvalued or undervalued. I have two valuations, V1 and V2, for each player, though in different "currencies", and I am trying to see how to express both in the same currency. I have been going over M. Lewis' book "Moneyball", but I don't want to copy his ideas (or, more accurately, the ideas he describes in the book).
The first valuation, V1, is just salary: the average salary over the previous 3 years. The second, V2, is a weighted sum of player statistics and has units of "runs". The statistics I am considering are batting average, OBA, number of HRs, HR per at-bat, etc. The key idea is that baseball is about runs: a team wins the game by scoring more runs than its opponent. So the statistics that correlate highly with run scoring or run prevention (meaning preventing the opposing team from scoring runs) are weighted highly in the player's value V2. Then I want to compute the ratio V1/V2 of the two indices. But I want this ratio to be unit-free, meaning both valuations must be expressed in the same units. Unfortunately, V2 is in "runs" units, so I want to transform it in a reasonable way into $ units, which are the units of V1.
My idea was to transform the latter score (the weighted sum of selected statistics) into the first type, calling it V2', i.e., to monetize the weighted-sum index by regressing one valuation against the other (regressing V1 against V2) and using the resulting fit, assuming the regression is significant, i.e., that we are confident enough that the slope of the regression line is not 0.

Does this regression idea make sense? If you are not familiar with Baseball, I think we can do something very similar with Soccer.

Ultimately, the decision on how fairly a player is valued would be given by:
1) If V1/V2' > 1, then the player is overvalued.
2) If V1/V2' = 1, then the player is accurately valued.
3) If V1/V2' < 1, then the player is undervalued.

Any ideas on how to monetize the index V2 into V2', so that the quotient V1/V2' is unit-free?
I have thought of regressing one index against the other.

Dale · Nov 14, 2016

It sounds like you need to normalize to a standard reference player who is known to be fairly paid. Otherwise, all you can do is something like a standardized regression to identify players who are paid at the average rate.

WWGD · Nov 14, 2016

Dale said:

It sounds like you need to normalize to a standard reference player who is known to be fairly paid. Otherwise, all you can do is something like a standardized regression to identify players who are paid at the average rate.

Thanks for your reply. What do you mean by standardized regression?

Dale · Nov 14, 2016

A standardized regression is one where you standardize all of the variables before the regression, meaning you subtract the mean and divide by the standard deviation.
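A minimal sketch of that standardization, using only Python's standard library. The salary and runs figures are made-up illustrative numbers, not real player data, and the helper name `z_scores` and the idea of comparing the two z-scores by subtraction are assumptions for illustration rather than something prescribed in the thread.

```python
from statistics import mean, stdev

def z_scores(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Made-up illustrative data: salary in dollars (V1) and a runs-based index (V2).
salary = [500_000, 1_200_000, 4_000_000, 9_500_000, 20_000_000]
runs = [40.0, 55.0, 60.0, 90.0, 85.0]

z_salary = z_scores(salary)
z_runs = z_scores(runs)

# Both series are now unit-free, so they can be compared directly.
# A positive gap means the player is paid more than the runs index suggests,
# relative to the average pay rate in this sample.
gap = [zs - zr for zs, zr in zip(z_salary, z_runs)]
```

Note that this only measures pay relative to the sample average; it says nothing about whether the average player is fairly paid.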

Stephen Tashi · Nov 15, 2016

If you are interested in a dimensionless version of the ratio V1/V2, is it going to make sense to redefine V2 by a regression of something against V1? If the regression is V2 = A (V1) + B, then your ratio V1/V2 will be computed as V1/(A V1 + B). Any error in representing a particular player's V2 by A V1 + B will result in an error in the ratio associated with that player.

If you use Dale's approach, you translate the raw V1 and V2 measurements to dimensionless "Z-scores". You might need further manipulations if you decide that an "average" player isn't "fairly" valued.

jim mcnamara · Nov 15, 2016

FWIW: In US baseball, for players with major league contracts, I think you will find the distribution of salaries is really multi-modal. Players' yearly salaries vary by orders of magnitude: from $10⁵ to $10⁸ per year. Your model as stated might have problems. You decide.

Dale · Nov 15, 2016

jim mcnamara said:

FWIW: In US baseball, for players with major league contracts, I think you will find the distribution of salaries is really multi-modal. Players' yearly salaries vary by orders of magnitude: from $10⁵ to $10⁸ per year. Your model as stated might have problems. You decide.

That is a good hint. If the runs measure is similarly multi-modal, then it could be that the residuals are still normal. But it is definitely something to check.

WWGD · Nov 15, 2016

Stephen Tashi said:

If you are interested in a dimensionless version of the ratio V1/V2, is it going to make sense to redefine V2 by a regression of something against V1? If the regression is V2 = A (V1) + B, then your ratio V1/V2 will be computed as V1/(A V1 + B). Any error in representing a particular player's V2 by A V1 + B will result in an error in the ratio associated with that player.

If you use Dale's approach, you translate the raw V1 and V2 measurements to dimensionless "Z-scores". You might need further manipulations if you decide that an "average" player isn't "fairly" valued.

Thanks, I was thinking of using regression but not necessarily in this way. But that is a good point. Thanks all for your input.

DrClaude · Nov 15, 2016

I guess you already know about sabermetrics, but just in case: https://en.wikipedia.org/wiki/Runs_created

WWGD · Nov 15, 2016

Sorry if this is OT, but does anyone know how to scrape data (preferably, but not necessarily using Python) from a site without knowing the data type used? I have a source site http://www.usatoday.com/sports/mlb/salaries/ and would like to download the data to do basic analysis. I saw the source code but I was not able to figure out the data type. Do I contact the webmaster? I know some basic Python data structures and methods, but these assume knowledge of the data type of the source data.

MarneMath · Nov 28, 2016

It's pretty simple. Here's a simple way to do it using BeautifulSoup. I added the pandas bit just to make analysis and writing to a csv easier.

Code written in python

Code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http://www.usatoday.com/sports/mlb/salaries/"

# Download the page and parse its HTML
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

# One list per column of the salary table
name = []
team = []
pos = []
salary = []
years = []
value = []
annual = []

# Skip the header row, then read the cells of each table row
for row in soup.find_all('tr')[1:]:
    col = row.find_all('td')
    name.append(col[1].string.strip())
    team.append(col[2].string.strip())
    pos.append(col[3].string.strip())
    salary.append(col[4].string.strip())
    years.append(col[5].string.strip())
    value.append(col[6].string.strip())
    annual.append(col[7].string.strip())

# Collect the columns into a DataFrame and write it out as CSV
columns = {"name": name, "team": team, "pos": pos, "salary": salary, "years": years, "value": value, "annual": annual}
df = pd.DataFrame(columns)

df.to_csv("somefilename", index=False)

WWGD · Nov 28, 2016

MarneMath said:

It's pretty simple. Here's a simple way to do it using BeautifulSoup. I added the pandas bit just to make analysis and writing to a csv easier.

Excellent, Marne, thanks!

WWGD · Nov 28, 2016

MarneMath said:

It's pretty simple. Here's a simple way to do it using BeautifulSoup. I added the pandas bit just to make analysis and writing to a csv easier.

Just a quick question, Marne, do we need graphlab/pip to do the downloading and installation?

MarneMath · Nov 28, 2016

For me personally, I have Anaconda installed, which is a distribution that contains nearly every scientific library you'll ever need and also functions as a package manager. So I have no idea what you'll need in your case. I imagine that a GitHub repository exists for Beautiful Soup, and you could simply pull that and then run setup.py in your libraries if you are opposed to using pip or Anaconda. I know a GitHub repository exists for pandas, since I used it when I needed to make a contribution.

WWGD · Nov 29, 2016

MarneMath said:

For me personally, I have Anaconda installed, which is a distribution that contains nearly every scientific library you'll ever need and also functions as a package manager. So I have no idea what you'll need in your case. I imagine that a GitHub repository exists for Beautiful Soup, and you could simply pull that and then run setup.py in your libraries if you are opposed to using pip or Anaconda. I know a GitHub repository exists for pandas, since I used it when I needed to make a contribution.

Sorry to bother you one more time, Marne, just to know which version of Python you are running.

EDIT: Never Mind

haruspex · Nov 30, 2016

Dale said:

do something like a standardized regression

Dale said:

subtract the mean and divide by the standard deviation.

I disagree. Imagine the weakest player is measured at 1000 runs, earning $1000; the next scores 1001, earning $11000, the next 1002 runs earning $21000, and so on, each extra run earning $10000. In runs/$, the weakest player is clearly the best value for money.

@WWGD, you need to determine the monetary value of extra runs. You could look at what teams earn in prizes and sponsorship and compare that with scores. Even then, the value of a player to a team is the marginal benefit, i.e. how much would they lose by switching to a weaker player. That depends on how that team compares to others. A team that wins every game comprehensively could afford to save some on salaries.
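The arithmetic of the example above can be checked with a short sketch. It assumes five players following the stated pattern (each extra run earns $10,000) and reuses a hypothetical `z_scores` helper; it is only an illustration of the objection, not a valuation method.

```python
from statistics import mean, stdev

def z_scores(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical players from the example: 1000 runs for $1000, 1001 runs for $11000, ...
runs = [1000 + k for k in range(5)]
salary = [1000 + 10_000 * k for k in range(5)]

# Salary is an exact linear function of runs, so the z-scores coincide
# and standardization reports every player as paid at the average rate.
z_gap = [zs - zr for zs, zr in zip(z_scores(salary), z_scores(runs))]

# Runs per dollar tells a very different story: the weakest player
# delivers 1 run per dollar, the next one under 0.1 runs per dollar.
runs_per_dollar = [r / s for r, s in zip(runs, salary)]
```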

WWGD · Dec 15, 2016

Thanks all,
I was considering doing a Mann-Whitney U test to tell whether the two baseball player rankings, by salary and by WAR (Wins Above Replacement), are similar. Still, does anyone know how to compare the two distributions if the rankings are not similar at some confidence level?

haruspex · Dec 15, 2016

WWGD said:

Thanks all,
I was considering doing a Mann-Whitney U test to tell whether the two baseball player rankings, by salary and by WAR (Wins Above Replacement), are similar. Still, does anyone know how to compare the two distributions if the rankings are not similar at some confidence level?

Have you looked up rank correlation?
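As a rough illustration of rank correlation, here is a sketch of Spearman's coefficient for two rankings of the same players. It assumes there are no tied ranks, and the ranks themselves are made up for five hypothetical players.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rank correlation for two rankings of the same items (no ties)."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Made-up ranks for five hypothetical players (1 = highest).
salary_rank = [1, 2, 3, 4, 5]
war_rank = [2, 1, 3, 5, 4]

rho = spearman_rho(salary_rank, war_rank)
```

The result is a single number summarising how well the two rankings agree overall; it does not by itself single out individual players.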

WWGD · Dec 15, 2016

haruspex said:

Have you looked up rank correlation?

Thanks, but I don't see how to use this. I am trying to identify the elements that are not equally ranked in both sets. I can see how this allows me to compare both distributions (and the test being non-parametric is a plus), but I don't see how to identify differently ranked elements.

haruspex · Dec 15, 2016

WWGD said:

Thanks, but I don't see how to use this. I am trying to identify the elements that are not equally ranked in both sets. I can see how this allows me to compare both distributions (and the test being non-parametric is a plus), but I don't see how to identify differently ranked elements.

If you plot the rankings against each other, x and y, and do a straight line fit, you can take the rank mismatch of a given player as the (square of the) distance of the point from the line.
Note that the straight line in question should not be the usual linear regression, which minimises Σ(Δy)². I suggest a symmetric regression, minimising the sum of squares of the distances from the points to the line, Σ[(Δy)²+(Δx)²]. See e.g. http://stats.stackexchange.com/ques...en-linear-regression-on-y-with-x-and-x-with-y
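One way to sketch such a symmetric fit is orthogonal (total least squares) regression, where the line passes through the centroid along the direction of greatest spread. The function names and the ranks below are hypothetical, and this is only one possible reading of the suggestion above.

```python
import math
from statistics import mean

def orthogonal_fit(xs, ys):
    """Line minimising the sum of squared perpendicular distances.

    Returns the centroid (mx, my) and the angle theta of the line's direction.
    """
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    # Angle of the principal axis of the 2x2 scatter matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return mx, my, theta

def signed_distances(xs, ys):
    """Signed perpendicular distance of each point from the fitted line."""
    mx, my, theta = orthogonal_fit(xs, ys)
    return [-(x - mx) * math.sin(theta) + (y - my) * math.cos(theta)
            for x, y in zip(xs, ys)]

# Made-up ranks for five hypothetical players.
salary_rank = [1, 2, 3, 4, 5]
war_rank = [2, 1, 3, 5, 4]

dist = signed_distances(salary_rank, war_rank)
mismatch = [d ** 2 for d in dist]  # squared distance as the mismatch score
```

The sign of each distance shows on which side of the line a player falls, and the squared distance gives the size of the mismatch. When both axes are complete rankings of the same players without ties, the two spreads are equal, so for positively related rankings the fitted line is simply the diagonal through the centroid.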

WWGD · Dec 10, 2018

Dale said:

That is a good hint. If the runs measure is similarly multi-modal, then it could be that the residuals are still normal. But it is definitely something to check.

I am a little confused by this requirement of normality for residuals. AFAIK, Gauss-Markov, e.g., only requires the errors to be uncorrelated, with mean 0 and constant variance. This gives us BLUE: regression coefficients that are unbiased and have minimal variance (among linear unbiased estimators). What else do we get if the errors are normally distributed? (And, BTW, does this mean that each ##\epsilon_i## is normal, or that the set ##\{ e_i \}## itself is normally distributed?) IIRC, normality implies that the coefficients agree with the maximum likelihood estimators? (Sorry for the necropost.)
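A short worked derivation may help with the maximum likelihood point. It assumes the standard linear model ##y_i = x_i^T \beta + \epsilon_i## with independent ##\epsilon_i \sim N(0, \sigma^2)##; the symbols ##x_i##, ##\beta## and ##\sigma^2## are notation introduced here, not taken from the thread. Under that assumption the log-likelihood is

$$\ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i^T\beta\right)^2$$

For any fixed ##\sigma^2##, maximising ##\ell## over ##\beta## is the same as minimising the sum of squared residuals, so the maximum likelihood estimate of ##\beta## coincides with the least-squares estimate. Normality of the errors also makes the usual t and F tests exact in finite samples rather than only approximate.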

BWV · Dec 17, 2018

One way is to look at market pricing: do a regression of log salary for all players (you would want to take the log due to the large right tail) against a basket of statistics (or percentile rankings of them) you think are relevant. The players with positive (negative) alphas are then 'overpaid' ('underpaid') relative to the market.

From a baseball standpoint, your stats are not going to be meaningful unless they are done separately per position. A slugging stat that is good for a shortstop or catcher (not to mention pitcher) will not be good for an outfielder.
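The log-salary regression can be sketched as follows. To stay self-contained it uses a single made-up statistic instead of a basket, ignores the per-position split, and fits the line with a hand-written least-squares helper; the data, the `ols_fit` name and the reading of "alpha" as the regression residual are all assumptions for illustration.

```python
import math
from statistics import mean

def ols_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope

# Made-up data for six hypothetical players at one position.
stat = [0.5, 1.0, 2.0, 3.5, 5.0, 6.0]
salary = [600_000, 3_000_000, 2_500_000, 9_000_000, 12_000_000, 30_000_000]

# Take logs to tame the long right tail of salaries.
log_salary = [math.log(s) for s in salary]
intercept, slope = ols_fit(stat, log_salary)

# Residual = actual log salary minus the market-implied log salary.
residuals = [ls - (intercept + slope * x) for x, ls in zip(stat, log_salary)]
labels = ["overpaid" if r > 0 else "underpaid" for r in residuals]
```

With a real basket of statistics the same idea carries over to a multiple regression, fitted separately for each position.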

WWGD · Dec 17, 2018

BWV said:

One way is to look at market pricing: do a regression of log salary for all players (you would want to take the log due to the large right tail) against a basket of statistics (or percentile rankings of them) you think are relevant. The players with positive (negative) alphas are then 'overpaid' ('underpaid') relative to the market.

From a baseball standpoint, your stats are not going to be meaningful unless they are done separately per position. A slugging stat that is good for a shortstop or catcher (not to mention pitcher) will not be good for an outfielder.

Thank you. My basic idea is that baseball is about consistently scoring more runs than your opponents. So any stat that is conducive to either scoring runs (hits, 2Bs, etc.) or preventing your opponent from scoring (fielding average, double plays performed, etc.) is what matters, essentially.

BWV · Dec 17, 2018

WWGD said:

Thank you. My basic idea is that baseball is about consistently scoring more runs than your opponents. So any stat that is conducive to either scoring runs (hits, 2Bs, etc.) or preventing your opponent from scoring (fielding average, double plays performed, etc.) is what matters, essentially.

Yes, but the difficulty is equating stats across very different positions and teams. How do you compare fielding percentage between a catcher and an outfielder, or between a player on a team with excellent pitching that induces a lot of ground and fly balls and one on a team with bad pitching that gives up a lot of hits? Also, stats are commonly adjusted for ballparks: a slugging stat in Colorado is not the same as one in Miami (due to altitude).

You might look at Wins Above Replacement (WAR), which attempts to be a single statistic that can be compared across positions.

https://www.fangraphs.com/library/misc/war/

WWGD · Jun 16, 2019

MarneMath said:

For my personally, I have Anaconda installed which is a distribution that contains nearly every scientific library you'll ever need and also functions as package manager. So I have no idea what you'll need in your case. I imagine that a github exist for beautiful soup and you could simply pull that and then run setup.py in your libraries if you are opposed to using pip or anaconda. I know a github exist for pandas since I used it when I need to make a contribution.

MarneMath, anyone, sorry for the necropost. I am ready to do some scraping, what version of Anaconda can I use? Must it be 2.7?
