This is a supplemental reference to aid readers in understanding Kucoach's sevenfortyseven.com articles, written by his alter ego, Nathan Sliver. In case you're unaware, these articles are a parody of Nate Silver's fivethirtyeight.com, which means Ku employs data-based techniques to try to explain our league.
Multivariable Regression
Ku often employs regression as a method of explaining relationships. A simple regression between two variables, say a player's FT rating and a player's FT percentage, draws a line through the observations such that the sum of the squared differences between each point and the line is minimized. In other words, it draws a line through a bunch of points to try to explain the relationship between them. Like so.
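To make the "minimize the squared differences" idea concrete, here's a minimal Python sketch. Nothing in it is league data; the FT numbers and the 0.45/0.04 relationship are fabricated so the fit has something to find.

    import numpy as np

    rng = np.random.default_rng(0)
    ft_rating = rng.integers(1, 11, size=50)                 # hypothetical 1-10 ratings
    ft_pct = 0.45 + 0.04 * ft_rating + rng.normal(0, 0.05, 50)

    # polyfit with degree 1 finds the slope and intercept that minimize
    # the sum of squared vertical distances from the points to the line
    slope, intercept = np.polyfit(ft_rating, ft_pct, 1)
    print(f"FT% ~ {intercept:.3f} + {slope:.3f} * FT rating")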
When the model includes more than one explanatory variable (e.g. a player's inside, jump shooting, free throw, and 3 point ratings explaining offensive efficiency), our simple human minds can't really visualize it that well, because it requires more than the 2 or 3 dimensions our brains can handle, so we resort to numbers.
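Since we're resorting to numbers anyway, here's a hedged sketch of what fitting that kind of model looks like in Python with statsmodels. The ratings are simulated and the column names are my own placeholders, but the summary() printout contains rough analogues of every table discussed below.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 200
    df = pd.DataFrame({
        "inside": rng.integers(1, 11, n),    # placeholder rating columns
        "jumper": rng.integers(1, 11, n),
        "ft": rng.integers(1, 11, n),
        "three": rng.integers(1, 11, n),
    })
    # fake offensive efficiency driven by the four ratings plus noise
    df["off_eff"] = (100 + 1.5 * df["inside"] + 1.0 * df["jumper"]
                     + 0.5 * df["ft"] + 0.8 * df["three"] + rng.normal(0, 3, n))

    X = sm.add_constant(df[["inside", "jumper", "ft", "three"]])
    model = sm.OLS(df["off_eff"], X).fit()
    print(model.summary())   # fit stats, F test, and coefficient estimates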
Summary of Fit Table
Really the only number I'll ever reference in this table is the adjusted R squared. This number tells you how much of the variation in the dependent variable (offensive efficiency) is explained by the independent variables in the model. In the above example, FT rating explains 70% of the variance observed in FT percentage. If the model is good, then the remainder of the variance is explained by randomness; however, it is often the case that the remaining variation is explained by omitted variables (in this case, maybe strength, stamina, and whether the FT was shot at home). The adjusted R squared is preferable to the plain R squared because it penalizes the model for including irrelevant variables, whereas the R squared will always increase when you add a new variable, whether relevant or not.
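To see that penalty in action, here's a small sketch: bolt a pure-noise column onto a model and the plain R squared creeps up while the adjusted R squared doesn't reward the junk. All numbers are fabricated.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    n = 100
    ft_rating = rng.integers(1, 11, n)
    ft_pct = 0.45 + 0.04 * ft_rating + rng.normal(0, 0.05, n)
    junk = rng.normal(size=n)                     # irrelevant variable

    base = sm.OLS(ft_pct, sm.add_constant(ft_rating)).fit()
    padded = sm.OLS(ft_pct,
                    sm.add_constant(np.column_stack([ft_rating, junk]))).fit()

    print(base.rsquared, base.rsquared_adj)
    print(padded.rsquared, padded.rsquared_adj)   # R^2 never drops; adj R^2 can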
Analysis of Variance Table
The only number you need to worry about here is the p-value reported alongside the F stat (usually labeled Prob > F). If this number is very small, less than 0.05, then the model as a whole is doing real explanatory work, i.e. the explanatory variables taken together explain the dependent variable better than a model with no variables at all would.
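In statsmodels terms (assuming your software reports an equivalent field), that small number is exposed as f_pvalue, separate from the F stat itself:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 150
    X = rng.normal(size=(n, 3))
    y = X @ np.array([1.0, 0.5, -0.7]) + rng.normal(0, 1, n)

    model = sm.OLS(y, sm.add_constant(X)).fit()
    # a big F stat and a tiny p-value mean the model is doing real work
    print(model.fvalue, model.f_pvalue)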
Residuals by Predicted Plot
On occasion I will include a residuals by predicted plot. With this plot we want the pattern of points to look random. If there is any sort of pattern or trend to the points, it suggests the data are skewed or the model is poorly specified.
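If you want to draw one yourself, here's a sketch with matplotlib on fake data; since the model below is correctly specified, the cloud should come out patternless.

    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    n = 200
    x = rng.normal(size=(n, 2))
    y = 2 + x @ np.array([1.0, -0.5]) + rng.normal(0, 1, n)

    model = sm.OLS(y, sm.add_constant(x)).fit()
    plt.scatter(model.fittedvalues, model.resid, s=10)   # residuals vs predicted
    plt.axhline(0, color="gray", linewidth=1)
    plt.xlabel("Predicted")
    plt.ylabel("Residual")
    plt.show()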
Variable/Indicator Parameterization Function Table
This is the most important part of the results. This table describes the relationships between the explanatory variables and the dependent variable. This is best explained with an example.
Let’s look at the estimate for cohesion. You can essentially think of this as a coefficient in a linear equation, or the marginal effect of each variable. In this case, the estimate is 0.033. This means that if cohesion increases by one level, then we would expect a 0.033 increase in winning percentage, on average and holding all else equal, i.e. if two teams were exactly the same but one team had good cohesion and the other had average, we would expect the team with good cohesion to have a winning percentage 0.033 higher.

This is just an estimate, however. Now look over at the lower and upper 95% columns. If we were to gather fresh data and rerun this model 100 times, the 95% interval we compute (here, 0.014 to 0.052) would capture the true effect of cohesion about 95 of those times. If zero is in that range, then our estimate is probably not good enough to talk about, because we’re not sure that the variable has any effect at all. When zero is in the 95% confidence interval, the p-value (prob>|t|) will be greater than 0.05, which is why we are really only interested in estimates whose p-value is lower than 0.05.
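Here's a sketch of pulling those same columns (estimate, 95% bounds, p-value) out of a fitted model. "cohesion" and "chemistry" are stand-in names and the data are simulated; the 0.033 in the text comes from the article's own table, not from this code (though the simulation is rigged to land near it).

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    n = 300
    df = pd.DataFrame({
        "cohesion": rng.integers(1, 6, n),    # stand-in 1-5 cohesion levels
        "chemistry": rng.integers(1, 6, n),
    })
    df["win_pct"] = (0.35 + 0.033 * df["cohesion"] + 0.010 * df["chemistry"]
                     + rng.normal(0, 0.08, n))

    model = sm.OLS(df["win_pct"],
                   sm.add_constant(df[["cohesion", "chemistry"]])).fit()
    ci = model.conf_int()                     # columns 0 and 1 = lower/upper 95%
    table = pd.DataFrame({
        "estimate": model.params,
        "lower95": ci[0],
        "upper95": ci[1],
        "p": model.pvalues,
    })
    print(table[table["p"] < 0.05])           # keep only estimates we can trust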
Well, we'll start with that for now.