Wittman 100c UCSC Applied Economics Laboratory and Research Seminar: Section 3 wittman@cats.ucsc.edu

III. BASKETBALL PLAYER SALARIES

A. Introduction

Sports statistics create a great opportunity to measure the relationship between productivity and income. The data is much more detailed than that typically available to economists. The basketball data set collected by Kahn and Sherer is very rich and allows us to test a number of hypotheses.

Suppose that we want to find out the role of race in determining salaries. A simple-minded way of doing this is to run the following regression:

ls SAL c RACE

where RACE is 1 if white; 0 otherwise.

The results suggest that there is no discrimination against black basketball players since the coefficient of RACE is negative, implying that whites make less than blacks. (Please note that I sometimes use black and white as shorthand for the preferred African-American and European-American.) While simple income comparisons (between ethnic backgrounds or genders) are commonly made, they are wrong methodologically, since one needs to control for productivity. In this case, productivity means how many baskets and rebounds each player makes. The work by Kahn and Sherer provides guidelines on proper econometric methodology.
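To see what the simple regression actually measures, note that with a single 0/1 regressor the least squares slope equals the difference between the two group means. The following is a minimal Python sketch of that fact. The salary figures are made up for illustration (they are not the Kahn and Sherer data), and `simple_ols` is a hypothetical helper, not part of the regression package used in this course.

```python
def simple_ols(x, y):
    """Return (intercept, slope) of the least squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical salaries, for illustration only (RACE = 1 if white)
race = [1, 1, 0, 0, 0]
sal = [300000, 400000, 350000, 450000, 550000]

intercept, slope = simple_ols(race, sal)

# Group means computed directly
mean_white = sum(s for r, s in zip(race, sal) if r == 1) / race.count(1)
mean_black = sum(s for r, s in zip(race, sal) if r == 0) / race.count(0)

print("slope on RACE:", slope)
print("mean(white) - mean(black):", mean_white - mean_black)
```

The slope is just a raw comparison of averages, which is why it says nothing about pay for equal productivity.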

B. Choice of variables

The Kahn and Sherer article, like most of the articles chosen for study in this course, is an exemplary model of research. Its results are convincing for a variety of reasons: (1) There is not one, but several related studies employing different data, all of which confirm in different ways the basic ideas. (2) The authors undertook various formulations of the econometric model and the effect of RACE was robust to the alternative formulations. (3) The authors have chosen a good data set -- the performance variables are relatively close to the ideal. (4) The authors are aware of the possible biases inherent in the data and account for them.

The purpose of this course is to get you to think for yourself and develop critical understanding. You will not just replicate someone else's work (including mine). In this spirit, one should always critically assess others' work and try to improve on it. With regard to Kahn and Sherer's study, I believe that there is room for improvement in their choice of variables. In choosing variables one should think carefully. One does not just throw in variables which seem to make sense. One chooses the formulation that makes the most sense. Furthermore one needs to carefully consider the data.

I start with the last point first. In this study income is a function of performance. If we do not include bonuses for playoff games, then income does not depend on this year's performance but rather on previous years' performance. That is, salary contracts are made before the start of the season and depend on previous years' performance with the preceding year's performance being most influential (unless there was a multi-year contract). Ideally we would have salary as a function of lagged performance. In this data set, we are given the total points over all seasons. Thus this data set implicitly assumes that performance is the same each year. Such an assumption is incorrect. But that is what we have to work with.

In this study Kahn and Sherer use logs so that, in the original formulation, the variables are multiplied. Suppose that one thought that salary (SAL) should be a function of total offensive rebounds (OFFREB) in a year. Then one might want to have either OFFREB per year as a summary or break it down into constituent parts: OFFREB PER MINUTE * AVERAGE MINUTES PER GAME PLAYED * GAMES PER YEAR. The authors have these last two variables denoted by MINS and GAMES respectively, but they have OFFREB per game not per minute. Given MINS and GAMES, it makes more sense to have offensive rebounds per minute than per game.

Also note that POINTS is career points scored. It should be in the same units as OFFREB (either per game as the authors did or per minute as I have suggested).

I believe that the interesting variable is average minutes played by year, MINPYEAR, rather than its constituent parts, GAMES * MINS. Therefore MINPYEAR should be substituted since the constituent parts give no clue as to worth, and we should save on degrees of freedom when there is no cost in doing so.

Also, I think that the variables should be per minute rather than per game (with minutes then used in place of games), since a per game measure conflates productivity per minute with the number of minutes per game, and the variable games may not vary as much as minutes played per game. Also the negatives are more meaningful per minute. Someone who plays only a few minutes per game will have fewer fouls per game than someone who plays a lot of minutes per game; a measurement of fouls per game would make it look like the more fouls, the higher the pay.
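The point about fouls can be made concrete with a small arithmetic sketch in Python. The two players and their season totals below are hypothetical, chosen only to show how the per game and per minute rankings can reverse.

```python
# Hypothetical season totals for two players (illustrative numbers only)
players = {
    "starter": {"games": 80, "minutes": 2800, "fouls": 240},
    "reserve": {"games": 80, "minutes": 640, "fouls": 80},
}

rates = {}
for name, p in players.items():
    rates[name] = {
        "fouls_per_game": p["fouls"] / p["games"],
        "fouls_per_minute": p["fouls"] / p["minutes"],
    }
    print(name, rates[name])
```

Here the starter commits more fouls per game simply because he is on the court longer, while the reserve commits more fouls per minute. Since starters are also paid more, a per game measure would tend to associate fouls with higher pay.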

We want to capture the negatives and one of the negatives is missing shots. The authors use career field goal percentages (fraction made) but this is already embodied in total points. Again one might want to think of this as a formula. Instead of total field goal points, the authors should have used field goal points attempted per minute times field goal percentage. But better yet, instead of having FTPCT and FGPCT the authors should have had FTMISSED and FGMISSED (field goals missed per minute and free throws missed per minute). Once again, the negatives are in the same unit of account as the positives.

I am somewhat skeptical about the use of CENTER and FORWARD. If players in these positions are better, they should be captured in the other variables such as OFFREB or ASSISTS. To also include CENTER would then be double counting. I do not see CENTER and FORWARD as proxies for other unmeasured variables, but those who know more about basketball may disagree and want to include them. While the authors do not use height, some students wanted to include height because taller players would be more productive, other things being equal. However, we already have these measures of productivity (for example, rebounds) and therefore one should not include height.

STEALS and BLOCKS are such rare events that I doubt that they would add to someone's salary. Now they might be a proxy for other skills, but the rarity of observations suggests little confidence in the coefficients. I might be inclined to drop them from the equation. (1)

I would also be inclined to drop DRAFTNO since most of the other variables should be a good predictor of the number. If I were to keep it, it would be as a residual from the predicted DRAFTNO when the independent variables are the above productivity numbers (See section B2).

There are two kinds of approaches to econometrics--throw everything into the soup (hoping that the econometrics will clarify the relationships) and carefully choosing the key ingredients (so that we know what we are eating intellectually). I prefer the latter approach. Hence I do not want both rebounds and height in my equations.

The authors also use several variables concerning the characteristics of the local area, including RACEMSA, POPSMA, INCSMA. I am skeptical that these variables would be relevant. My skepticism does depend on how I characterize the market for basketball players. I believe that players are in competition with one another. To illustrate, suppose that there are two black players of equal skill and one player plays in a heavily white city and the other in a heavily black city and fans are prejudiced in favor of their own race. The team owner in the heavily black city will not pay more for the black player since he could get the other black player from the white city for less. Hence racial bias will not appear as variations in black pay across cities.

Now theory is a good guide to setting up equations and choosing variables, but ultimately theory needs to be confronted with data. These variables could be left in and we could let the data show whether Wittman is right. My own taste is to not do this regarding the variables under discussion. In general, I like to limit the number of questionable variables thrown into the equation. If it is a central issue, then I will keep such variables in, even if questionable, since that is the question. Here I feel that these other variables are not as central to the question I am trying to answer (is there discrimination, not whether fans are the source of discrimination) and I will choose to not include them.

HOMEATT is also a questionable variable. If the players draw the crowds because of their personalities or whatever beyond the wins implied by WINPCT or scoring, then it may be OK. But it may have nothing to do with the present players or embodied in the other variables and therefore useless. I would be inclined not to use it.

In a nutshell, SEASONS, GAMES, CENTER, FORWARD, FTPCT and FGPCT would be dropped, FGMISSED and FTMISSED would be added, and all measures of productivity would be per minute. I would also drop RACEMSA, POPSMA and INCSMA.

C. Regressions

1. Linear specification

Using the variables we have identified in the last section (and dropping those that I found objectionable), a priori (before looking at the data) my choice of independent variables is:

POINTPM = (2*TLFGM + TRIPTM + TLFTM)/TLMINS

(It is useful to consider the equation for POINTPM in greater detail. Total field goals made (TLFGM) includes 2 pointers and 3 pointers, while free throws are worth 1 point. Therefore a triple pointer gets 2 points for being a field goal plus 1 point for being a triple pointer, which adds up to 3.)

OFFREBPM = OFFREB / TLMINS

DEFREBPM = DEFREB / TLMINS

ASSISTPM = ASSISTS / TLMINS

PFOULSM = PFOULS / TLMINS

MISFGPM = (TLFGA -TLFGM) / TLMINS

MISFTPM = (TLFTA -TLFTM) / TLMINS

Note: All these variables can be generated by putting "genr" before the equation.
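As a cross-check on the definitions above, here is a minimal Python sketch that builds the same per-minute variables from career totals. The function name `per_minute_stats`, the dictionary layout and the career totals are assumptions made for illustration; only the variable names and formulas come from the list above.

```python
def per_minute_stats(p):
    """Build the per-minute variables defined above from career totals."""
    mins = p["TLMINS"]
    return {
        # TLFGM counts every field goal at 2 points; each three pointer adds 1 more
        "POINTPM": (2 * p["TLFGM"] + p["TRIPTM"] + p["TLFTM"]) / mins,
        "OFFREBPM": p["OFFREB"] / mins,
        "DEFREBPM": p["DEFREB"] / mins,
        "ASSISTPM": p["ASSISTS"] / mins,
        "PFOULSM": p["PFOULS"] / mins,
        "MISFGPM": (p["TLFGA"] - p["TLFGM"]) / mins,
        "MISFTPM": (p["TLFTA"] - p["TLFTM"]) / mins,
    }


# Hypothetical career totals (not a real player)
player = {
    "TLFGM": 1000, "TRIPTM": 100, "TLFTM": 500, "TLMINS": 10000,
    "OFFREB": 300, "DEFREB": 700, "ASSISTS": 800, "PFOULS": 600,
    "TLFGA": 2200, "TLFTA": 650,
}
stats = per_minute_stats(player)

# The same points total counted shot by shot:
# 900 two pointers, 100 three pointers, 500 free throws
points_by_shot = 2 * (1000 - 100) + 3 * 100 + 1 * 500

for name, value in stats.items():
    print(name, round(value, 4))
```

Counting shot by shot gives the same points total as the POINTPM numerator, which confirms that adding TRIPTM once is enough to credit each three pointer with its third point.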

Since SEASONS has a zero in it, we first must change the smpl to exclude that observation. Luckily, there were no values of TLMINS that were zero:

smpl if SEASONS > 0

genr MINPS = TLMINS/SEASONS

ls SAL c POINTPM OFFREBPM DEFREBPM ASSISTPM PFOULSM MISFGPM MISFTPM MINPS RACE
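The reason for restricting the sample before generating MINPS is simply to avoid dividing by zero. A minimal Python sketch of the same two steps follows; the three records are hypothetical and the list-of-dictionaries layout is an assumption made for illustration.

```python
# Hypothetical records; the second has SEASONS = 0, like the one excluded above
records = [
    {"PLAYID": 1, "TLMINS": 9000, "SEASONS": 5},
    {"PLAYID": 2, "TLMINS": 150, "SEASONS": 0},
    {"PLAYID": 3, "TLMINS": 2400, "SEASONS": 2},
]

# Counterpart of "smpl if SEASONS > 0": keep only rows with a nonzero divisor
sample = [r for r in records if r["SEASONS"] > 0]

# Counterpart of "genr MINPS = TLMINS/SEASONS" on the restricted sample
for r in sample:
    r["MINPS"] = r["TLMINS"] / r["SEASONS"]

print([(r["PLAYID"], r["MINPS"]) for r in sample])
```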

The regression results are very encouraging concerning the quality of the model.


LS // Dependent Variable is SAL
Date: 4/27/94 / Time: 2:35
SMPL range: 1 - 235
SMPL condition: SEASONS > 0
Number of observations: 234

VARIABLE     COEFFICIENT    STD. ERROR    T-STAT.       2-TAIL SIG.
C            -620775.42     163593.67     -3.7946175    0.000
POINTPM      1332533.7      276639.32     4.8168631     0.000
OFFREBPM     1443412.4      1038525.7     1.3898668     0.165
DEFREBPM     2167483.6      473957.53     4.5731600     0.000
ASSISTPM     1258202.1      426914.68     2.9471979     0.003
PFOULSM      -1711873.7     659265.93     -2.5966360    0.009
MISFGPM      -430593.25     531381.41     -0.8103280    0.418
MISFTPM      187252.73      1552241.7     0.1206337     0.904
MINPS        112.57455      36.068379     3.1211424     0.002
RACE         108078.20      39311.358     2.7492869     0.006

R-squared             0.559371     Mean of dependent      407236.6
Adjusted R-squared    0.541667     S.D. of dependent      351579.2
S.E. of regression    238020.1     Sum of squared resid   1.27E+13
Durbin-Watson stat    1.829886     F-statistic            31.59603
Log likelihood        -3223.867


The R-square of .56 is very high for cross section, especially considering the fact that the independent variables are not the same type of thing as the dependent variable. If one ran consumption against income, both are in dollars and consumption is a large part of income so a high R-square would not be surprising. In time series money might be regressed against money lagged. Again a high R-square would not be surprising. But here the high results are not guaranteed by the formulation of the data.

The F-statistic, 31.6, is large and significant.

More importantly, almost all of the coefficients have the correct sign, giving us considerable confidence in the results. The more points per minute, offensive rebounds per minute, defensive rebounds per minute, assists per minute and minutes played, the higher the salary; the more fouls per minute and missed field goals per minute, the lower the salary. The only wrong sign is associated with missed free throws. It should be negative, but it is positive although not at all significant (0.90 probability). According to these results, being white is worth an extra $108,078 a year. The result is very significant (0.003 as a one tail test). Also according to the results an extra point per minute is worth $1,332,533 (remember this is based on data for 1985-86, when salaries were considerably lower).
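The numbers quoted from the printout are related in a simple way: each t-statistic is the coefficient divided by its standard error, and a one tail significance level is half the reported two tail level when the estimated sign agrees with the hypothesis. A short Python sketch of that arithmetic, using the RACE and ASSISTPM rows of the printout (the dictionary layout is an assumption for illustration):

```python
# (coefficient, standard error, two-tail significance) as reported in the printout
reported = {
    "RACE": (108078.20, 39311.358, 0.006),
    "ASSISTPM": (1258202.1, 426914.68, 0.003),
}

results = {}
for name, (coef, se, two_tail) in reported.items():
    t_stat = coef / se        # t-statistic = coefficient / standard error
    one_tail = two_tail / 2   # appropriate here: both signs match the hypothesis
    results[name] = (t_stat, one_tail)
    print(f"{name}: t = {t_stat:.4f}, one tail significance = {one_tail:.4f}")
```

Because the printout rounds the two tail level to three decimals, the halved figures are approximate.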

While the regression results are very supportive, one multiple regression is not conclusive. One should check whether the results are robust to alternative formulations, and other studies based on other data sets should be undertaken. I will now briefly discuss two alternative specifications based on the same data set.

2. Multiplicative Specification

In the regression just discussed, the independent variables had an additive effect. I chose this because I felt that points and rebounds are additive in their effect on salary, not multiplicative (although minutes and points per minute are clearly multiplicative). Also a linear equation is easier to interpret. However in many empirical studies, it is common to assume a multiplicative effect between the independent variables (equivalently, that the variables are additive in their logs). Therefore, I took logs of all the variables considered in the previous multiple regression. Note that LWHITE = log(RACE + 1). This is because log(0) is undefined while log(1) = 0.

ls LSAL c LPOINTPM LOFFREBPM LDEFREBPM LASSISTPM LPFOULSM LMISFGPM LMISFTPM LMINPS LWHITE


LS // Dependent Variable is LSAL
Date: 4/27/94 / Time: 2:35
SMPL range: 1 - 235
SMPL condition: SEASONS > 0
Number of observations: 234

VARIABLE     COEFFICIENT    STD. ERROR    T-STAT.       2-TAIL SIG.
C            10.553533      0.2910370     36.261831     0.000
LPOINTPM     2.5613401      0.4921478     5.2044122     0.000
LOFFREBPM    0.6260268      1.8475615     0.3388395     0.735
LDEFREBPM    3.5845057      0.8431815     4.2511673     0.000
LASSISTPM    0.5036242      0.7594912     0.6631074     0.507
LPFOULSM     -1.9291459     1.1728495     -1.6448367    0.100
LMISFGPM     -1.4192388     0.9453399     -1.5013000    0.133
LMISFTPM     -1.9459130     2.7614742     -0.7046646    0.481
LMINPS       0.0005184      0.0000642     8.0789216     0.000
LWHITE       0.2844901      0.1008961     2.8196356     0.005

R-squared             0.697809     Mean of dependent      12.63027
Adjusted R-squared    0.685667     S.D. of dependent      0.755267
S.E. of regression    0.423443     Sum of squared resid   40.16415
Durbin-Watson stat    1.816247     F-statistic            57.47251
Log likelihood        -125.8371

The regression results are a bit different than our earlier formulation. In general the coefficients are smaller, and the standard errors higher. LPFOULSM is only significant at the 10% level. However, in some ways the model suggests a better fit: the intercept is positive, the R-square is 0.6978, and LMISFTPM is negative. In any event, it remains true that whites again make more than blacks. (2)
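One way to read the race coefficient in the log equation is sketched below in Python. It assumes natural logs were used; the text does not say so, although a mean LSAL of 12.63 is consistent with natural logs of salaries in the hundreds of thousands of dollars. Since the race variable is log(RACE + 1), it equals log 2 for whites and 0 for blacks, so the implied proportional premium is exp(coefficient * log 2) - 1.

```python
import math

coef_lwhite = 0.2844901          # coefficient reported in the printout above

lwhite_white = math.log(1 + 1)   # RACE = 1, so log(RACE + 1) = log 2
lwhite_black = math.log(0 + 1)   # RACE = 0, so log(RACE + 1) = 0

# Difference in predicted log salary, holding the other variables fixed
log_gap = coef_lwhite * (lwhite_white - lwhite_black)

# Convert the log gap into a proportional difference in salary
premium = math.exp(log_gap) - 1
print(f"Implied white premium: {premium:.1%}")
```

Under these assumptions the premium works out to roughly 22 percent of salary, other things being equal.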

3. An alternate specification

One student suggested a totally different formulation. The measured variables may not capture the true productivity of a basketball player. Sports professionals may be able to better assess productivity than students doing a multiple regression. Therefore the student suggested an equation somewhat similar to the following:

SAL = A + B (TEAMSAL - SAL) + C ALLPRO/SEASONS + D DRAFTNO + E RACE

Because SAL is both the dependent and independent variable in this equation we must group SAL on the left of the equation:

SAL = [ A / (1+B) ] + [ B / (1+B) ] (TEAMSAL - SAL) + [ C / (1+B) ] ALLPRO/SEASONS + [ D / (1+B) ] DRAFTNO + [ E / (1+B) ] RACE

genr ALLPROPS = ALLPRO/SEASONS

genr TSAL = TEAMSAL - SAL

ls SAL c TSAL ALLPROPS DRAFTNO RACE


LS // Dependent Variable is SAL
Date: 4/27/94 / Time: 2:36
SMPL range: 1 - 235
SMPL condition: SEASONS > 0
Number of observations: 234

VARIABLE     COEFFICIENT    STD. ERROR    T-STAT.       2-TAIL SIG.
C            330198.59      61389.895     5.3787124     0.000
TSAL         0.0219649      0.0135413     1.6220707     0.105
ALLPROPS     1270227.8      105624.42     12.025891     0.000
DRAFTNO      -3163.2962     647.31520     -4.8867942    0.000
RACE         14224.233      38777.972     0.3668122     0.714

R-squared             0.470153     Mean of dependent      407236.6
Adjusted R-squared    0.460898     S.D. of dependent      351579.2
S.E. of regression    258142.0     Sum of squared resid   1.53E+13
Durbin-Watson stat    1.721033     F-statistic            50.80000
Log likelihood        -3245.441


The coefficient of DRAFTNO should be negative since a higher DRAFTNO means a later pick. SAL is subtracted from TEAMSAL so SAL is not partially regressed against itself. Note that the sign of E depends on the racism of sportswriters and basketball scouts relative to the racism occurring in salaries. For example, suppose that sportswriters tended to choose whites for ALLPRO and that they overrated whites more than owners of teams overpaid whites. Then the coefficient of RACE would be negative since payment to whites would be less than thought justified by sportswriters (even though owners tended to slightly overpay white players). Still my a priori is that the coefficient of RACE will be positive.

As can be seen, the coefficients are in the predicted direction, but the coefficient of RACE is insignificant (.357 as a one tail test). Once again the R square is quite high and the equation as a whole is very significant.

Note that before I ran the regression, I decided not to include ALLSTAR. This is because I felt that ALLSTAR and ALLPRO would be highly correlated, creating multicollinearity problems. The regression results, LS ALLPRO C ALLSTAR, suggest that I was right to be concerned.

One also needs to be aware of the potential biases that might arise when variables are only imperfect proxies. Consider the variable SAL -- 1985-1986 Pro compensation. As the authors note, SAL does not include non-salary compensation such as bonuses. So what we might think of as yearly income may not be the same as the actual variable chosen. Suppose that SAL underestimates yearly income, that is, E[u] < 0. Then our assumptions justifying the use of least squares are violated and our least squares estimate of the intercept term is biased downwards from the true intercept. Suppose that the non-measured salary is likely to be greater for whites (which the authors argue is the case, but their argument is not that compelling; there is also little reason to believe that the reverse is true). Then the least squares assumption regarding independence between the error term and the variable RACE does not hold, and the least squares estimate of the coefficient on RACE (1 for white) is downward biased from the true relationship. Now if bonuses are not correlated with RACE, then the estimated coefficient of RACE is not biased but its variance is larger than otherwise.
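The direction of the bias can be illustrated with a deliberately stripped-down Python sketch. It ignores the productivity controls, so the coefficient on RACE reduces to a difference in group means, and every figure in it is hypothetical. It compares the racial gap in measured salary with the gap in true income (salary plus unmeasured bonus) under the two cases discussed above.

```python
def mean(values):
    return sum(values) / len(values)


def race_gap(income, race):
    """Mean income of whites minus mean income of blacks (slope on a 0/1 dummy)."""
    white = [y for y, r in zip(income, race) if r == 1]
    black = [y for y, r in zip(income, race) if r == 0]
    return mean(white) - mean(black)


# Hypothetical data: measured salary plus an unmeasured bonus
race = [1, 1, 0, 0]
sal = [500000, 600000, 450000, 550000]
bonus_correlated = [60000, 40000, 10000, 30000]     # larger for whites on average
bonus_uncorrelated = [30000, 10000, 30000, 10000]   # same average in both groups

income_corr = [s + b for s, b in zip(sal, bonus_correlated)]
income_uncorr = [s + b for s, b in zip(sal, bonus_uncorrelated)]

gap_measured = race_gap(sal, race)
gap_true_corr = race_gap(income_corr, race)
gap_true_uncorr = race_gap(income_uncorr, race)

print("gap in measured SAL:", gap_measured)
print("true gap, bonuses correlated with RACE:", gap_true_corr)
print("true gap, bonuses uncorrelated with RACE:", gap_true_uncorr)
```

When bonuses favor whites, the gap measured from SAL alone understates the true gap; when bonuses are unrelated to RACE, the two gaps coincide and only the noise increases.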

D. Other ways of detecting discrimination.

The following four tests would not only be interesting exercises, but also useful contributions to our knowledge. As far as I know, there is no published research on these particular questions.

1. Do whites play longer in general than their skills would suggest?

If this were the case, minutes played per season would be greater for whites. This could be examined by looking at:

ls MINPS c POINTPM OFFREBPM DEFREBPM ASSISTPM PFOULSM RACE


LS // Dependent Variable is MINPS
Date: 07/31/96 Time: 15:33
Sample: 1 235
Included observations: 234
Excluded observations: 1

Variable     Coefficient    Std. Error    T-Statistic   Prob.
C            588.1130       282.4876      2.081907      0.0385
POINTPM      2524.891       311.3882      8.108500      0.0000
OFFREBPM     1461.030       1955.084      0.747298      0.4557
DEFREBPM     5306.779       858.6971      6.180036      0.0000
ASSISTPM     3257.324       808.7326      4.027690      0.0001
PFOULSM      -8804.913      1102.966      -7.982939     0.0000
RACE         -99.46619      76.36512      -1.302508     0.1941

R-squared             0.580260     Mean dependent var      1740.593
Adjusted R-squared    0.569166     S.D. dependent var      724.5671
S.E. of regression    475.5911     Akaike info criterion   12.35857
Sum squared resid     51344422     Schwarz criterion       12.46194
Log likelihood        -1770.985    F-statistic             52.30187
Durbin-Watson stat    1.343757     Prob(F-statistic)       0.000000


Note that I did not include MISFTPM since my earlier results suggested that this variable is unreliable.

2. Do cities with a higher percentage of whites, play whites a higher percentage of the time?

This may have two components: more white players and playing them more often than justified. This is the key to discrimination in competitive markets -- segregation. One set of firms discriminates and the other, non-discriminatory firms gain by reverse discrimination.

The researcher needs to know economics in order to test for discrimination since salaries are part of labor markets. Also it is virtually impossible to test the a priori hypothesis of no discrimination (since statistical tests are designed to reject, not accept).

3. Do teams with more white players do worse than teams with more black players?

This could be fairly easily tested.

4. Discrimination among fans and sportswriters

A test for discrimination among fans or sportswriters would have ALLSTAR and ALLPRO as the dependent variable.3

E. Opportunistic Empiricism

As stated in earlier lectures, one purpose of this course is to make you into unrelenting empiricists so that whenever you hear a "factual" statement you ask the following: (1) how in principle the statement could be tested if any data were freely available and (2) how the statement can actually be tested given existing data.

1. White Men Can't Jump

To illustrate from my own personal experience, when I saw the movie, "White Men Can't Jump," I immediately thought of some hypothetical tests. One could ask for a random sample of black and white men (or black and white pro basketball players) to jump and record how high their feet got off the ground or how high their hands reached (controlling for the person's height). But more exciting from the viewpoint of today's lecture, we have data to indirectly test the hypothesis. 4

Consider the following data:

genr REB = OFFREB + DEFREB

genr REBMIN = REB/TLMINS

Note that REBMIN is only an imperfect measure of jumping ability since getting rebounds also depends on being in the right place at the right time. In econometrics we often have to make use of imperfect proxies. On the other hand, some might say that part of being a good jumper is being at the right place at the right time.

RACE = 1 if white.

genr HEIGHT = 12 * HEIGHTF + HEIGHTI

HEIGHT thus gives the player's height in inches.

[The regression command and the first rows of the printout are missing from the source. The surviving rows follow.]

RACE         [...]          ...0069510    -2.7684331    0.0061
HEIGHT       0.0202774      0.0008790     23.068448     0.0000

R-squared             0.702210     Mean of dependent var   0.191381
Adjusted R-squared    0.699621     S.D. of dependent var   0.081876
S.E. of regression    0.044874     Sum of squared resid    0.463138
Log likelihood        394.1070     F-statistic             271.1784
Durbin-Watson stat    1.891042     Prob(F-statistic)       0.000000

Our a priori expectations are that the coefficient of RACE is negative and the coefficient of HEIGHT is positive. The results are very strong. Both coefficients have the right sign and are highly significant (0.003 and 0.0000, respectively). The R-square is 70%.

In the movie, the white player was not able to do a dunk shot but he was very good at shooting from a distance. Unfortunately, the data collected by Kahn and Sherer does not have statistics on dunk shots. However, other data may provide clues to jumping. Two point goals are shot close to the hoop, while 3 point goals and free throws are shot from farther away and are less likely to involve jumping.

The next few regressions adjust the sample set to the following:

smpl if SEASONS > 0 and TRIPTM > 0 and TLFTM > 0 and TLMINS > 0 and TLFGA > 0

genr TLS1 = 2*(TLFGM - TRIPTM) / (3 * TRIPTM + TLFTM)

TLS1 is the ratio of points made from close up to points made at a distance. 2*(TLFGM - TRIPTM) assumes that the field goal measure includes 2 and 3 point shots.

ls TLS1 c HEIGHT RACE


LS // Dependent Variable is TLS1
Date: 3-25-1994 / Time: 21:42
SMPL range: 2 - 235
SMPL condition: SEASONS > 0 AND TRIPTM > 0 AND TLFTM > 0 AND TLMINS > 0 AND TLFGA > 0

[The coefficient rows of the printout are missing from the source. The surviving summary statistics follow.]

                                   Mean of dependent var   4.003704
Adjusted R-squared    -0.004663    S.D. of dependent var   1.476499
S.E. of regression    1.479937     Sum of squared resid    350.4343
Log likelihood        -293.6690    F-statistic             0.624016
Durbin-Watson stat    1.966688     Prob(F-statistic)       0.537087


In this formulation the coefficient of HEIGHT should be positive and the coefficient of RACE should be negative. The results are only mildly confirming. The signs are in the correct direction, but the levels of significance are 0.134 and 0.305. The R-square is 0.008.

There is no one correct way of defining variables and setting up equations. I have combined several variables into one dependent variable measure (TLS1). The above equation looks for comparative advantage, not absolute advantage (a black could be twice as good as a white player in two point field goals and three times as good in three pointers, and hence would look comparatively worse using the measure I have invented).
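The distinction between comparative and absolute advantage can be checked with the TLS1 formula itself. In the Python sketch below, the two players and their totals are hypothetical, and the function `tls1` is an illustrative helper that mirrors the genr statement for TLS1. Player B makes twice as many two pointers and three times as many three pointers as player A, with equal free throws.

```python
def tls1(tlfgm, triptm, tlftm):
    """Ratio of points scored close up to points scored at a distance."""
    close_points = 2 * (tlfgm - triptm)     # two point field goals only
    distance_points = 3 * triptm + tlftm    # three pointers plus free throws
    return close_points / distance_points


# Hypothetical totals of shots made (illustrative only)
player_a = {"two_pt": 100, "three_pt": 10, "ft": 20}
player_b = {"two_pt": 200, "three_pt": 30, "ft": 20}


def ratio(p):
    # TLFGM includes both two and three point field goals
    tlfgm = p["two_pt"] + p["three_pt"]
    return tls1(tlfgm, p["three_pt"], p["ft"])


ratio_a = ratio(player_a)
ratio_b = ratio(player_b)
print("TLS1 for player A:", ratio_a)
print("TLS1 for player B:", round(ratio_b, 3))
```

Player B is better at everything in absolute terms yet has the lower TLS1, because the measure only picks up the mix of close and distant scoring.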

Another possibility is to control for overall basketball ability, perhaps measured by minutes played in a season. One might then use one of the two following equations:

genr TLS2 = TLMINS/SEASONS

ls TLFGM c TLS2 HEIGHT RACE


For a copy of the printout see the full (paper) copy



or,

genr TLS3 = TLFGM/TLFGA


For a copy of the printout see the full (paper) copy


On the other hand, the statement about white men not being able to jump may be a statement about basketball ability in general and measured on an absolute scale. In this way we would not want to control for ability in general since the statement would imply that blacks had a higher ability in general. Total points per minute might be regressed against RACE and height.

genr POINTS = 2*(TLFGM - TRIPTM) + 3*TRIPTM + TLFTM

genr POINTPM = POINTS/TLMINS

ls POINTPM c RACE HEIGHT


For a copy of the printout see the full (paper) copy

But here we know the answer already: since blacks make up 75% of the National Basketball Association players and only 11% of the population, blacks are on average better players than whites.

Which of these equations is best? Obviously, it depends on the question you are trying to ask. But one can also judge the question. The last equation is boring because we know the general answer already. Equation 1 answers the initial question most directly, but it is in the same spirit as equation 5. It is a judgment call, but my feeling is that equation 2 (where the dependent variable is TLS1) is best. It asks whether blacks play a different type of game than whites, not whether they are better. I think that this is a more interesting question with a more interesting answer since the answer is not so obvious. Equations 3 and 4 ask similar questions to 2, but not with such a direct and clear measure.

2. Predicting draft number

This data set also contains information about college performance. For example, CFGM stands for field goals made in college. One could predict draft number based on college performance (The better the college performance, the lower the draft number). Unfortunately, colleges play in different quality leagues so the numbers are not that meaningful (I do very well against my 8 year old). So if possible, one would want to have a proxy for quality of competition (FFOUR is a possibility).

3. How skilled are basketball scouts?

Even with the rudimentary skills taught in this course, I believe that students are capable of producing publishable research (in secondary journals) if they ask the right questions. I know virtually nothing about statistical studies of sports, but I suspect that the following question has not been answered previously with econometric tools and if cleverly done, might be publishable: What is the relation between draft choice and eventual performance? A rudimentary stab at this question might look at the following equation:

genr POINTPS = POINTS/SEASONS

genr REBOUNDS = (OFFREB + DEFREB)/SEASONS

genr ASSISTPS = ASSISTS/SEASONS

ls DRAFTNO c POINTPS REBOUNDS ASSISTPS


For a copy of the printout see the full (paper) copy

A more sophisticated study and a better data set would account for the fact that some draft choices are no longer playing (a real bad choice if they were drafted recently). Alternatively, one might confine the study to the first 2 or 3 years after the draft. One should always be aware of missing data and how it might alter the observed empirical results.

Now that there are free agents, draft choice is not as important as in the past. One could test whether there is declining care in choice by seeing whether R squared has declined over time.

I do not want to spend a great deal of time on this issue. I just wanted to suggest that there are lots of questions that can be answered with the data sets provided in this course.



Data Files

Data File: NBADATA.ASC

Source: Kahn, Lawrence M.; Sherer, Peter D., "Racial Differences in Professional Basketball Players' Compensation," Journal of Labor Economics v6, n1 (Jan. 1988):40-61.

Name Variable Description

ABAGAMES I3 (F3.0) number of ABA games
ALLPRO I2 (F2.0) number of times all league 1st or 2nd team
ALLSTAR I2 (F2.0) number of times named to all-star team
ASSISTS I4 (F4.0) total pro assists
BLOCKS I4 (F4.0) total pro shots blocked
BYEAR I2 (F2.0) birth year (e.g. 55=1955)
CAWARDS I1 (F1.0) total college player of the year awards plus times named to first or second All-America Team
CFGA I4 (F4.0) total college field goals attempted
CFGM I4 (F4.0) total college field goals made
CFTA I3 (F3.0) total college free throws attempted
CFTM I3 (F3.0) total college free throws made
CGAMES I3 (F3.0) total college games
CHAMP I2 (F2.0) number of pro championship teams played on
CMINS I4 (F4.0) total college minutes
CONF I2 (F2.0) field not used
CREB I4 (F4.0) total college rebounds
CSEA I1 (F1.0) total college seasons
CTRPA I3 (F3.0) total college three point goals attempted
CTRPM I2 (F2.0) total college three point goals made
DEFREB I5 (F5.0) total pro defensive rebounds
DISQUAL I2 (F2.0) number of times disqualified
DRAFTNO I3 (F3.0) college draft number
EARLY I1 (F1.0) dummy variable for leaving college early
FFOUR I1 (F1.0) number of trips to final four (college)
GPLAY I3 (F3.0) number pro playoff games played
HEIGHTI I2 (F2.0) inches to be added onto
HEIGHTF I1 (F1.0) height in feet, e.g. 6 or 7
NOTCOL I1 (F1.0) dummy variable for not attending college
OFFREB I4 (F4.0) total pro offensive rebounds
PFOULS I4 (F4.0) total pro fouls committed
PLAYID I3 (F3.0) player ID number
POSITION I1 (F1.0) position (1 or 5= center; 2,4 or 7= forward; 3 or 6= guard)
PRODEF I2 (F2.0) number of times 1st or 2nd all-defensive team
RACE I1 (F1.0) race, 1= white, 0= black
SAL I7 (F7.0) 1985-6 pro compensation
SEASONS I2 (F2.0) total pro seasons
STEALS I4 (F4.0) total pro steals
TEAM I2 (F2.0) NBA team (in alphabetical order: e.g. 1= Atlanta, 2= Boston, etc.)
TEAMCH I2 (F2.0) number of pro team changes
TLFGA I5 (F5.0) total pro field goals attempted
TLFGM I5 (F5.0) total pro field goals made
TLFTA I5 (F5.0) total pro free throws attempted
TLFTM I5 (F5.0) total pro free throws made
TLGAMES I4 (F4.0) total pro (NBA or ABA) games played
TLMINS I5 (F5.0) total pro minutes played
TRIPTA I3 (F3.0) total pro three point goals attempted
TRIPTM I3 (F3.0) total pro three point goals made
WEIGHT I3 (F3.0) weight in pounds
YPLAY I2 (F2.0) number of years in the pro playoffs

The following variables refer to the player's 1985-86 team

ARENA I5 (F5.0) arena capacity
COL83 F7.4 1983 SMSA cost of living index
HOMEAT I6 (F6.0) previous season's home attendance
INCOME I5 (F5.0) 1983 SMSA per capita income in dollars
MAX I4 (F4.2) maximum ticket price in dollars
MIN I4 (F4.2) minimum ticket price in dollars
POPCIT I5 (F5.1) 1980 city population (divided by 10000)
POPCMA I5 (F5.1) 1980 Consolidated Metropolitan Area population (divided by 10000)
POPMSA I5 (F5.1) 1980 Standard Metropolitan Statistical Area population (divided by 10000)
RACECIT I3 (F3.1) percent of 1980 population in the city that was black
RACECMA I3 (F3.1) percent of 1980 population in the Consolidated Metropolitan Area that was black
RACEMSA I3 (F3.1) percent of 1980 population in the Standard Metropolitan Statistical Area that was black
TEAMSAL I8 (F8.0) total team salary
TOTAT I7 (F7.0) previous season's total attendance (home plus away)
WINPCT I3 (F3.3) previous season's winning percentage

Notes:

(1) If a player only played a few minutes in a season, then our confidence in his output per minute variables would be reduced. In such a situation, weighted least squares should be used.

(2) The R-squares of the two equations cannot be directly compared since one is measuring percent explanation of the variation in SAL and the other percent explanation of the variation in LOG(SAL).

References:

Kahn, Lawrence M.; Sherer, Peter D., "Racial Differences in Professional Basketball Players' Compensation," Journal of Labor Economics v6, n1 (Jan. 1988):40-61.
