# A Classification/Valuation Problem: Baseball

1. Nov 14, 2016

### WWGD

Hi All,
I am doing a small data project that consists of classifying baseball players as either overvalued or undervalued. I have two valuations, V1 and V2, for each player, though in different "currencies", and I am trying to see how to express both in the same currency. I have been going over M. Lewis' book "Moneyball", but I don't want to copy his ideas (or, more accurately, the ideas he describes in the book).

The first valuation, V1, is just salary: the average salary over the last 3 years. The second, V2, is a weighted sum of player statistics and would have "runs" units. The statistics I am considering are: Average, OBA, number of HRs, HR/At bat, etc. The key idea is that the game of baseball is about runs; a team wins the game by scoring more runs than its opponent. So the statistics that correlate highly with run scoring or run prevention (meaning preventing the opposing team from scoring runs) are weighted highly towards the player's value V2.

Then I want to compute the ratio V1/V2 of the two indices. But I want this ratio to be unit-free, meaning both valuations must be expressed in the same units. Unfortunately, V2 is in "runs" units, so I want to transform it in a reasonable way into $ units, which are the units V1 appears in. My idea was to find a way of transforming the latter score (the weighted sum of selected statistics) into the first type, calling it V2', i.e., to monetize the weighted sum index by regressing one valuation against the other, i.e., regressing V1 against V2, and using the resulting data (assuming the regression is significant, i.e., that we are confident enough that the slope of the regression line is not 0). Does this regression idea make sense? If you are not familiar with baseball, I think we can do something very similar with soccer.

Ultimately, the decision on how fairly a player is valued would be given by:

1) If V1/V2' > 1, then the player is overvalued
2) If V1/V2' = 1, then the player is accurately valued
3) If V1/V2' < 1, then the player is undervalued

Any ideas on how to monetize the index V2 into V2', so that the quotient V1/V2' is unit-free? I have thought of regressing one index against the other.

2. Nov 14, 2016

### Dale

### Staff: Mentor

It sounds like you need to normalize to a standard reference player who is known to be fairly paid.
Otherwise, all you can do is something like a standardized regression to identify players that are paid at the average rate.

3. Nov 14, 2016

### WWGD

Thanks for your reply. What do you mean by standardized regression?

4. Nov 14, 2016

### Dale

### Staff: Mentor

A standardized regression is one where you standardize all of the variables before regression, meaning you subtract the mean and divide by the standard deviation.

5. Nov 15, 2016

### Stephen Tashi

If you are interested in a dimensionless version of the ratio V1/V2, is it going to make sense to redefine V2 by a regression of something against V1? If the regression is V2 = A (V1) + B, then your ratio V1/V2 will be computed as V1/(A V1 + B). Any error in representing a particular player's V2 by A V1 + B will result in an error in the ratio associated with that player.

If you use Dale's approach, you translate the raw V1 and V2 measurements to dimensionless "Z-scores". You might need further manipulations if you decide that an "average" player isn't "fairly" valued.

6. Nov 15, 2016

### jim mcnamara

### Staff: Mentor

FWIW: In US baseball, for players with major league contracts, I think you will find the distribution of salaries is really multi-modal. Players' yearly salaries vary by orders of magnitude: from 10^5 to 10^8 per year. Your model as stated might have problems. You decide.

7. Nov 15, 2016

### Dale

### Staff: Mentor

That is a good hint. If the runs measure is similarly multi-modal, then it could be that the residuals are still normal. But it is definitely something to check.

8. Nov 15, 2016

### WWGD

Thanks, I was thinking of using regression but not necessarily in this way. But that is a good point. Thanks all for your input.

9. Nov 15, 2016

### DrClaude

### Staff: Mentor

10. Nov 15, 2016

### WWGD

Sorry if this is OT, but does anyone know how to scrape data (preferably, but not necessarily, using Python) from a site without knowing the data type used?
I have a source site http://www.usatoday.com/sports/mlb/salaries/ and would like to download the data to do basic analysis. I looked at the source code but was not able to figure out the data type. Do I contact the webmaster? I know some basic Python data structures and methods, but these assume knowledge of the data type of the source data.

11. Nov 28, 2016

### MarneMath

It's pretty simple. Here's a simple way to do it using BeautifulSoup. I added the pandas bit just to make analysis and writing to a CSV easier.

Code written in Python:

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://www.usatoday.com/sports/mlb/salaries/"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")  # name the parser explicitly

# One list per column of the salary table
name = []
team = []
pos = []
salary = []
years = []
value = []
annual = []

# Skip the header row, then read cells 1-7 of every table row
for row in soup.find_all('tr')[1:]:
    col = row.find_all('td')
    name.append(col[1].string.strip())
    team.append(col[2].string.strip())
    pos.append(col[3].string.strip())
    salary.append(col[4].string.strip())
    years.append(col[5].string.strip())
    value.append(col[6].string.strip())
    annual.append(col[7].string.strip())

columns = {"name": name, "team": team, "pos": pos, "salary": salary,
           "years": years, "value": value, "annual": annual}
df = pd.DataFrame(columns)
df.to_csv("somefilename", index=False)
```

12. Nov 28, 2016

### WWGD

Excellent, Marne, thanks!

13. Nov 28, 2016

### WWGD

Just a quick question, Marne: do we need graphlab/pip to do the downloading and installation?

14. Nov 28, 2016

### MarneMath

Personally, I have Anaconda installed, which is a distribution that contains nearly every scientific library you'll ever need and also functions as a package manager. So I have no idea what you'll need in your case.
I imagine that a GitHub repo exists for Beautiful Soup, and you could simply pull that and then run setup.py in your libraries if you are opposed to using pip or Anaconda. I know a GitHub repo exists for pandas, since I used it when I needed to make a contribution.

15. Nov 29, 2016

### WWGD

Sorry to bother you one more time, Marne, just to know which version of Python you are running.
EDIT: Never mind.

Last edited: Nov 29, 2016

16. Nov 30, 2016

### haruspex

I disagree. Imagine the weakest player is measured at 1000 runs, earning $1000; the next scores 1001, earning $11000; the next 1002 runs, earning $21000; and so on, each extra run earning $10000. In runs/$, the weakest player is clearly the best value for money.
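The arithmetic in this example can be checked with a short Python sketch. The five-player list and the variable names are illustrative assumptions; only the runs and salary pattern comes from the example itself.

```python
# Hypothetical players following the example: 1000 runs for $1000,
# then every extra run adds $10000 of salary.
players = [(1000 + k, 1000 + 10000 * k) for k in range(5)]  # (runs, salary in $)

# Average value for money: runs bought per dollar of salary
runs_per_dollar = [runs / salary for runs, salary in players]

# Marginal cost: extra dollars paid per extra run between consecutive players
marginal_cost = [
    (s2 - s1) / (r2 - r1)
    for (r1, s1), (r2, s2) in zip(players, players[1:])
]

for (runs, salary), rpd in zip(players, runs_per_dollar):
    print(f"runs={runs}  salary=${salary}  runs/$={rpd:.4f}")
print("marginal $ per extra run:", marginal_cost)
```

The plain ratio ranks the weakest player first, while the marginal cost per run is the same for every step, which is why the ratio alone can mislead.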

@WWGD, you need to determine the monetary value of extra runs. You could look at what teams earn in prizes and sponsorship and compare that with scores. Even then, the value of a player to a team is the marginal benefit, i.e. how much would they lose by switching to a weaker player. That depends on how that team compares to others. A team that wins every game comprehensively could afford to save some on salaries.

17. Dec 15, 2016

### WWGD

Thanks all,
I was considering doing a Mann-Whitney U test to tell whether the two baseball player rankings, by salary and by WAR (Wins Above Replacement), are similar. Still, does anyone know, if the rankings are not similar at some confidence level, how to compare the two distributions?

18. Dec 15, 2016

### haruspex

Have you looked up rank correlation?
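One common rank correlation is Spearman's rho, which is the Pearson correlation of the ranks. Below is a minimal standard-library sketch; the `salary` and `war` numbers are made up for illustration and are not real player data, and the helper names `ranks` and `spearman` are just labels chosen here.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Made-up numbers for illustration only
salary = [0.5, 2.0, 4.5, 9.0, 20.0]  # $ millions
war = [1.2, 0.3, 2.5, 4.0, 3.1]

rho = spearman(salary, war)
print(round(rho, 3))
```

A value near 1 means the two rankings largely agree; it summarises overall agreement and does not by itself single out individual players.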

19. Dec 15, 2016

### WWGD

Thanks, but I don't see how to use this. I am trying to identify the elements in both sets that are not equally ranked. I can see how this allows me to compare both distributions (and the test being non-parametric is a plus), but I don't see how to identify differently ranked elements.

20. Dec 15, 2016

### haruspex

If you plot the rankings against each other, x and y, and do a straight line fit, you can take the rank mismatch of a given player as the (square of the) distance of the point from the line.
Note that the straight line in question should not be the usual linear regression. That minimises Σ(Δy)². I suggest a symmetric regression, minimising the sum of squares of the distances from the points to the line, Σ[(Δx)² + (Δy)²]. See e.g. http://stats.stackexchange.com/ques...en-linear-regression-on-y-with-x-and-x-with-y
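A minimal sketch of this idea, assuming the symmetric fit is taken to be orthogonal (total least squares) regression through the centroid of the points. The player labels and ranks are hypothetical, and `orthogonal_fit` and `squared_distances` are names chosen here for illustration.

```python
import math


def orthogonal_fit(x, y):
    """Fit a line minimising the sum of squared perpendicular distances.

    Returns the centroid (mx, my) and the angle theta of the line.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    # Direction of the principal axis of the point cloud
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return mx, my, theta


def squared_distances(x, y):
    """Squared perpendicular distance of each point from the fitted line."""
    mx, my, theta = orthogonal_fit(x, y)
    return [
        (-(a - mx) * math.sin(theta) + (b - my) * math.cos(theta)) ** 2
        for a, b in zip(x, y)
    ]


# Illustrative ranks only (hypothetical players A..E)
players = ["A", "B", "C", "D", "E"]
salary_rank = [1, 2, 3, 4, 5]
war_rank = [2, 1, 3, 5, 4]

mismatch = squared_distances(salary_rank, war_rank)
# List players from largest to smallest rank mismatch
for name, d2 in sorted(zip(players, mismatch), key=lambda t: -t[1]):
    print(name, round(d2, 3))
```

Players with the largest squared distance are the ones whose salary rank and WAR rank disagree most; which side of the line a point falls on tells you whether the player is ranked higher by salary or by WAR.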