Money within the Area of College Sports (With a focus into UCLA's Finances)
One of the most common concerns within the scope of Universities costs is how well schools are allocating their money. Traditionally, schools have a responsibility to their interests but many students today may have trouble understanding the true allotment of their finances. Within this research, the data from the revenues and expenses of UCLA may help answer these questions based on the analysis of the school's spending. In fact, there will be data visualizations and plots that compare UCLA within their sport's divisions finances to truly reveal how contrasting the financial habits are at UCLA.
Specifically, the research question we pose is how transparent and effective is UCLA's allocation of funds for their sports programs. Is it justifiable for many schools to be under intense criticism for their extensive funding in sports as an academic establishment? In order to answer these questions, we took many parts of UCLA's finances, such as ticket sales or student fees, to approach trends that can tell us more about UCLA's financial allocations.
The main motivation to research this topic is the relevancy of the issue. UCLA's football coach, Chip Kelly, is set to make \$23 million dollars on his next contract (with an average of \\$4.7 million annually)1. On the other hand, staff, professors, and other faculty struggle to earn pay close to a deserved standard. Students are the ones that have an impact on the revenue of academic institutions, thus, it is expected that they should know where and how the money is allocated in sports (in particular the large football programs).
As a student at UCLA, I am inclined to deliver these results. My background as a Statistics student helped during the analysis and the findings of this research. I have performed other research projects with relevancy in predicting or using machine learning classification techniques on data. The data from the College Athletics Financial Information (CAFI) Database contains a lot of numerical data that was used to perform statistical analysis.
import pandas as pd
# commands to read the datafile
df = pd.read_csv("University_of_California_Los_Angeles_data.csv")
# http://cafidatabase.knightcommission.org/fbs/pac-12/university-of-california-los-angeles
When reading the data from CAFI, we can see that the data is structured in a 48x30 shape. We will keep this in mind since there are a lot of columns compared to the number of observations. Specific analysis tools will be used for this type of data structure. Overall, we are provided the revenues and expenses from the data. Since there are many subsections of the spending, we can further investigate these variables individually further. One important column that will be useful is Year which will be helpful for time analysis.
df.sample()
The rows are structured by year and essentially sorted by the column Data in which it may come from either UCLA, Pac-12 median, or Football Bowl Subdivision Median. This is useful to understand since we will be grouping by Data in much of the analysis to compare the observations. The observations of the data already reveal much-needed cleaning and data manipulation. This section begins the data cleaning.
# We can remove useless columns that mess with our missing values
df = df.drop("IPEDS ID", axis = 1) # we don't need the ID since it is useless to our analysis
df = df.drop("NCAA Subdivision", axis = 1) # We already know UCLA is in the Football Bowl Subdivision
df = df.drop("FBS Conference", axis = 1) # We already know UCLA is in the Pacific-12 Conference
# We will leave the other nan since the other cols are important observations
df.sample()
Some columns are useless to our analysis. Removing them will help simplify our analysis tools since they do not provide any new insights for each observation.
# We need to check for missing NaN values in the rows as well
print("Shape of our df:", df.shape)
print("Shape of our df with NaN removed:", df.dropna().shape)
# This shows us that dropping we have 2 rows that contain some missing values, we will ignore this
df[df.isna().any(axis=1)]
# rows 1 and 13 have missing values
There are two rows (out of 48) that contain missing values. With limited observations, each row is essential for our analysis. Thus, choose not to omit these rows and instead keep them in mind for later imputation.
# We need to convert the string columns of money ($) back to int using the following library and suggestion reference
# from stackoverflow:
# https://stackoverflow.com/questions/8421922/how-do-i-convert-a-currency-string-to-a-floating-point-number-in-python
from re import sub
from decimal import Decimal
import math
for col in range(2, df.shape[1]):
for row in range(0, df.shape[0]):
if isinstance(df.iloc[row, col], str):
df.iloc[row, col] = int(Decimal(sub(r'[^\d.]', '', df.iloc[row, col])))
# changes each string column from index 2 to the last column into int
df = pd.concat([df[["Data", "Year"]], df.drop(["Data", "Year"], axis = 1).apply(pd.to_numeric)], axis = 1)
# changes cols to int
df.sample()
In order to work with the numerical values provided, we need to restructure the cells containing money observations. The data for \$ observations are removed and reformatted to integer form.
from statistics import mean
import numpy as np
# New column for analysis net profit
df["Net Profit"] = np.array(df["Total Revenues"] - df["Total Expenses"])
# imputting values
new_val = np.nanmean(df[df["Data"] == "University of California-Los Angeles"]["Total Football Coaching Salaries"])
df.loc[13, "Total Football Coaching Salaries"] = new_val
# imputing missing value as mean of UCLA values
One useful column that is not made available in the data is the net profit of an observation. Knowing the revenue and expenses, we can take their difference to make this new column for further analysis. Also, we performed some data imputation on a row. For the missing value, we have imputed the mean of the column grouped by the specific Data (based either on UCLA, Pac-12, or Football Subdivision).
With the newly cleaned data, we are able to make data visualizations that help better understand the structure of the data. Numerical data from the expenses and revenues of the observations will be important to present.
# importing graphing visualiztion libraries
import matplotlib.pyplot as plt
import seaborn as sns
# Data viz
# Basic Histogram of important column variable
fig, ax = plt.subplots(2, 1, figsize = (16, 20))
ucla_rows = df[df["Data"] == "University of California-Los Angeles"] # rows of just UClA observations
for i in range(3, 13):
ax[0].plot(ucla_rows["Year"], ucla_rows[df.columns[i]], label = df.columns[i])
ax[0].legend(loc="upper left")
ax[0].ticklabel_format(style = "sci", axis = "y", scilimits = (6, 6))
ax[0].set(title = "UCLA Expenses Over time", xlabel = "Year", ylabel = "Money ($) in Millions")
for i in range(14, 22):
ax[1].plot(ucla_rows["Year"], ucla_rows[df.columns[i]], label = df.columns[i])
ax[1].legend(loc="upper left")
ax[1].ticklabel_format(style = "sci", axis = "y", scilimits = (6, 6))
ax[1].set(title = "UCLA Revenue Over time", xlabel = "Year", ylabel = "Money ($) in Millions")
One of the ways we can illustrate the entirety of the data is by using each column in a graph. In this case, the two line graphs express the expenses and revenues of UCLA's finances from 2005 to 2020. The graphs contain specific parts of UCLA's expenses or revenues. It may help point out any significant deviations of the funding in specific years. The overall trend of the data seems to positively spike upwards over time. This makes sense since with each year expenses and revenues both grow (because of factors like inflation or more money circulation within the school). The plots can guide us to start comparing and investigating the expenses and revenue further.
Some key takeaways from the line graphs also reveal that some streams of revenue or expenses do not change much over time (e.g. Excessive Transfers Back or Institutional/Government Support) while others have sharp changes (e.g. Support and Admin Compensation w/Severance or Corporate Sponsorship, Advertising, Licensing). Later in the research, we will analyze if these spikes are justifiable or questionable.
# Data viz
# Line graph overtime of Revenues for each UCLA, PAC-12 median, Football sub-division median
fig, ax = plt.subplots(3, 1, figsize = (10, 16))
ucla_rows = df[df["Data"] == "University of California-Los Angeles"] # rows of just UClA observations
ax[0].plot(ucla_rows["Year"], ucla_rows["Total Revenues"], label = "Total Revenues", color = "green")
ax[0].plot(ucla_rows["Year"], ucla_rows["Total Expenses"], label = "Total Expenses", color = "red")
ax[0].legend(loc="upper left")
ax[0].ticklabel_format(style = "sci", axis = "y", scilimits = (6, 6))
ax[0].set(title = "UCLA Football Spending vs Athletics Revenue", xlabel = "Year", ylabel = "Money ($) in Millions")
pac12_rows = df[df["Data"] == "Pacific-12 Conference Median"] # rows of Pacific-12 median
ax[1].plot(ucla_rows["Year"], pac12_rows["Total Revenues"], label = "Total Revenues", color = "green")
ax[1].plot(ucla_rows["Year"], pac12_rows["Total Expenses"], label = "Total Expenses", color = "red")
ax[1].legend(loc="upper left")
ax[1].ticklabel_format(style = "sci", axis = "y", scilimits = (9, 9))
ax[1].set(title = "PAC-12 Football Spending vs Athletics Revenue", xlabel = "Year", ylabel = "Money ($) in Billions")
fbsub_rows = df[df["Data"] == "Football Bowl Subdivision Median"] # rows of football sub-division median
ax[2].plot(fbsub_rows["Year"], fbsub_rows["Total Revenues"], label = "Total Revenues", color = "green")
ax[2].plot(fbsub_rows["Year"], fbsub_rows["Total Expenses"], label = "Total Expenses", color = "red")
ax[2].legend(loc="upper left")
ax[2].ticklabel_format(style = "sci", axis = "y", scilimits = (9, 9))
ax[2].set(title = "Football Bowl Subdivision Spending vs Athletics Revenue", xlabel = "Year", ylabel = "Money ($) in Billions")
It may be difficult to find a relationship with all the streams of expenses and revenues within UCLA's football program. Thus, we will analyze simply the total values. The three figures, shown over time, reveal the expenses and revenues of UCLA with the other controls. Particularly, it is important to include the median of the Football Bowl Subdivision to illustrate the broad trends compared to UCLA's observation. With that graph, there is a generally consistent trend of a net profit where the expenses usually do not pass the revenue. However, we also include the PAC-12 median since it may be more closely aligned with the same factors as UCLA. The Pac-12 median is different from the Football Bowl Subdivision in where sometimes expenses do exceed the revenue, but the trend normally remains close.
While investigating UCLA's total revenues and expenses, it is expected that the lines are closely related but not too close or too far. Surprisingly, the total revenues and expenses for all but 2019 and 2020 are exactly matched where the net profit is 0. We only see a deviation of net loss after 2018, which may be explained by the COVID-19 pandemic where a shutdown in sports activities occured. However, the visual shown is clearly skeptical and worth further investigating. The trend shown in UCLA's line graph is atypical but may be explained by some collection error or caused by nefarious manipulation.
# Data viz
# Basic Histogram of coaching variables
plt.figure(figsize=(12, 8))
plt.hist(ucla_rows["Coaches Compensation"], bins = 5, alpha = 0.4, label = "Coaches Compensation")
plt.hist(ucla_rows["Total Football Coaching Salaries"], bins = 5, alpha = 0.3, label = "Total Football Coaching Salaries")
plt.ticklabel_format(style = "sci", axis = "x", scilimits = (6, 6))
plt.legend(loc = 'upper right')
plt.title('Coaching Spending UCLA')
plt.xlabel('($) Millions')
plt.ylabel('Count')
plt.show()
Coaches' salaries and compensation are key factors that are concerning for some people. It may be important to know the trends of their benefits. In the Coaching Spending UCLA histogram, the data has two peaks. In reality, coaches like UCLA's Chip Kelly are unique in that many programs do not contribute high amounts toward coaching but it does happen to a similar frequency of the minimal times of coaching pay allocation.
# Data viz
# Looking at student/school/coaches cost
print("Student Fees compared to Coaching Salaries")
sns.relplot(
data = ucla_rows,
x = "Student Fees", y = "Total Football Coaching Salaries",
hue = "Data", size = "Total Academic Spending (University-Wide)"
).set(title = "UCLA (millions)")
plt.ticklabel_format(axis = "both", style = "scientific", scilimits=(6, 6))
sns.relplot(
data = pac12_rows,
x = "Student Fees", y = "Total Football Coaching Salaries",
hue = "Data", size = "Total Academic Spending (University-Wide)"
).set(title = "PAC-12 (millions)")
plt.ticklabel_format(axis = "both", style = "scientific", scilimits=(6, 6))
sns.relplot(
data = fbsub_rows,
x = "Student Fees", y = "Total Football Coaching Salaries",
hue = "Data", size = "Total Academic Spending (University-Wide)"
).set(title = "Football sub-division (millions)")
plt.ticklabel_format(axis = "both", style = "scientific", scilimits=(6, 6))
Investigating the coaching salaries further, we can relate them to the Student Fees to determine their true relationship. For each group, the plots above indicate that with an increase in student fees and total academic spending, coaching salaries will also increase. The Football sub-division median data illustrates how a maximum of student fees, from larger spending academic Universities, the coaching salaries will still increase.
Observing the UCLA data, the trend is less aligned with the football subdivision median data. Increased student fees do not firmly align with coaching salaries. Instead, more total academic spending means a higher emphasis on coaching salary spending. The UCLA data may suggest at some point student fees were reduced and the fees are not typically spent on the coaching salaries.
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
def trend_plot(x_var, y_var):
# Prints scatterplot with linear best fit regression line based on x, y
# y is preferably the cost large scaled with millions
fig, ax = plt.subplots(3, 1, figsize = (12, 28))
dataType = [ucla_rows, pac12_rows, fbsub_rows]
for i in range(0, len(dataType)):
X = np.array(dataType[i][x_var])
y = np.array(dataType[i][y_var])
reg = LinearRegression().fit(X.reshape(-1,1), y)
ytrain = reg.intercept_ + reg.coef_ * X
print(str(dataType[i]["Data"].unique()[0]), ":")
print("Regression Coef:", reg.coef_)
print("Regression intercept:", reg.intercept_)
print("MSE:", mean_squared_error(y, ytrain))
print("r-squared:", r2_score(y, ytrain))
ax[i].plot(X,y,'ro',X,ytrain,'b-')
ax[i].ticklabel_format(axis = "y", style = "scientific", scilimits=(6, 6))
ax[i].set(title = str(y_var + " by " + x_var + " " + dataType[i]["Data"].unique()[0]),
xlabel = x_var, ylabel = y_var)
print("\n")
trend_plot("Student Fees", "Donor Contributions")
To investigate student fees further, a linear regression will be useful. Funds from other revenues, like donor contributions, can reveal how essential student fees are to football programs. Mostly, the graphs from the linear regression plots explain that more student fees result in more successful donor contributions. In the case of UCLA, it is difficult to conclude this trend with such a low r-square score (0.17). Despite this, increased student fees do seem to indicate a positive relationship with donor contributions.
trend_plot("Total Football Coaching Salaries", "Ticket Sales")
One of the key questions based on the research question at hand was investigating how significant coaches played a role in the football program's success. Again, developing a linear regression plot but with coaching salaries and a response variable of ticket sales (a good measure of football success) was performed.
Notably, an increase in coaching salaries does seem to increase ticket sales for most football programs. However, the trend line of the Football Bowl Subdivision Median is not a straight line, instead, it is a curve. Essentially, an increase in coaching salaries will help until a max efficiency. Then, more funding in coaching salaries has diminishing returns.
The r-squared value for UCLA's graph is very low (0.006). This can signal that the amount UCLA pays their football coaching salaries is not dependent on ticket sales.
Over the course of analyzing UCLA's financial allocation in their football program, I found many insightful and surprising trends. While even observing the variables, there are many key components to a program's expenses and revenues to explore. One interesting insight conveyed how over time, an increase in revenue and expenses were also related. The following graphs help illustrate how net profit does not increase over the years. As a result, this can mean that more money is put into football programs each year without substantial net gain. It may be riskier to invest in football programs without the proportional returns as more money is empowered in football programs. In addition, student fees also seemed to increase as coaching salaries increased. If students are paying more for their football programs, the coaching should be reflective of the success of the football program.
trend_plot("Year", "Net Profit")
Overall, does UCLA and other academic institutions effectively allocate their football funds? I believe the answer is that most academic institutions do in fact allocate their expenses well. On the other hand, UCLA instead may have questionable funding behaviors. In the Football Bowl subdivision and the PAC-12 conference, the increased investment of student fees and coaching salaries generally produced increased football success. The investment correlates to more recognition (ticket sales, donor contributions) but does not associate with net profit (as explained above).
In contrast, a noteworthy result of analyzing UCLA's financial allocations over the years raised an alarming flag. Unfortunately, I am unable to explain how the revenue and expenses of UCLA matched from the years prior to 2019. Of course, this is unlikely and would require further investigation into the data collection methods. Nevertheless, I was also able to analyze UCLA's funding in coaching salaries. The result of coaching salaries was not dependent on how successful the football program was at UCLA. This implies how UCLA may be overpaying their coaching staff as the returns are not beneficial. However, student fees were not very correlated with coaching salaries. Thus, students' contributions do not clearly fund these salaries, but some other source of expenses does.
The research conducted helps provide context to some of the controversies in college football programs, but it did bring into light new questions. All in all, further inquiry is highly recommended and to keep in check with how money is allocated in sports programs across the US.