Building on Part 1, in this R post I perform data manipulation to prepare 2020-21 university financial data for analysis and visualisation in Part 3.
Introduction
This post builds on insights from the first post, where I imported and inspected Open finance data from the Higher Education Statistics Agency (HESA) website. We now have a good idea of what’s in the financial dataset in question, and how it’s organised (which, as noted, is markedly un-“tidy”).
In the final part of this exploration, I’ll analyse the “surplus/deficit for the year” of the top 5 Scottish universities in 2021 (according to Times Higher Education). The goal in this post is therefore to extract and prepare the relevant data for this analysis.
Data manipulation
I’ll start by importing the libraries again, and the dataset:
library(tidyverse) # import most of the packages needed
library(kableExtra) # for HTML tables
library(here) # to find absolute path to dataset
# import dataset, skipping first 12 rows
uniFinData <- read.csv(here("static/data/financial_statements.csv"), skip=12)
With the insights from the inspection, we have enough information start filtering for the data of interest.
First, it’s helpful to create a vector of the unique names of the top 5 universities. A rough-and-ready keyword search shows the information needed to identify the 5 universities:
uniFinData %>%
filter(grepl("Aberdeen|Edinburgh|Glasgow|Strathclyde|Andrews", HE.Provider)) %>%
distinct(UKPRN, HE.Provider)
## UKPRN HE.Provider
## 1 10007783 The University of Aberdeen
## 2 10007772 Edinburgh Napier University
## 3 10007790 The University of Edinburgh
## 4 10007762 Glasgow Caledonian University
## 5 10002681 Glasgow School of Art
## 6 10007794 The University of Glasgow
## 7 10005337 Queen Margaret University, Edinburgh
## 8 10007803 The University of St. Andrews
## 9 10007805 The University of Strathclyde
As the full provider names are a little unwieldy, I’ll use the relevant UKPRN’s (UK Provider Reference Numbers) for the relevant universities for the purpose of filtering:
universities <- c(10007783, 10007790, 10007794, 10007803, 10007805)
The “surplus/deficit for the year” is calculated from the following broad categories: “Income”, “Expenditure”, “Gain/(loss) on disposal of tangible assets”, “Gain/(loss) on investment property”, “Gain/(loss) on investments”, “Share of operating surplus/(deficit) in associate(s)”, “Share of operating surplus/(deficit) in joint venture(s)”, and “Taxation”. This can be seen from the interactive HESA table here.1 These are all in the Category.marker
column (as can be seen from the list of unique variables at the end of the first post).
I’ll now filter for these financial categories, in addition to filtering for the top 5 universities. I also exclude summary rows where Year.End.Month
is “All”, to exclude duplicate rows, and drop some unnecessary columns:
uniFinData <- uniFinData %>%
filter(UKPRN %in% universities,
Academic.year == "2020/21",
Category.marker %in% c(
"Income",
"Expenditure",
"Gain/(loss) on disposal of tangible assets",
"Gain/(loss) on investment property",
"Gain/(loss) on investments",
"Share of operating surplus/(deficit) in associate(s)",
"Share of operating surplus/(deficit) in joint venture(s)",
"Taxation"),
Year.End.Month != "All") %>%
# drop unnecessary columns
select(HE.Provider, Category.marker,
Category, Value..000s.)
The results, filtering for University of Aberdeen, show that there are still some unwanted rows evident in the Category
column:
HE.Provider | Category.marker | Category | Value..000s. |
---|---|---|---|
The University of Aberdeen | Expenditure | Depreciation and amortisation | 16855 |
The University of Aberdeen | Expenditure | Interest and other finance costs | 5840 |
The University of Aberdeen | Expenditure | Other operating expenses | 59306 |
The University of Aberdeen | Expenditure | Restructuring costs | 0 |
The University of Aberdeen | Expenditure | Staff costs | 146956 |
The University of Aberdeen | Expenditure | Total expenditure | 228957 |
The University of Aberdeen | Gain/(loss) on disposal of tangible assets | Gain/(loss) on disposal of tangible assets | 4493 |
The University of Aberdeen | Gain/(loss) on investment property | Gain/(loss) on investment property | 0 |
The University of Aberdeen | Gain/(loss) on investments | Gain/(loss) on investments | 5780 |
The University of Aberdeen | Income | Donations and endowments | 236 |
The University of Aberdeen | Income | Funding body grants | 87577 |
The University of Aberdeen | Income | Investment income | 2012 |
The University of Aberdeen | Income | Other income | 25764 |
The University of Aberdeen | Income | Research grants and contracts | 45927 |
The University of Aberdeen | Income | Total income | 235939 |
The University of Aberdeen | Income | Tuition fees and education contracts | 74423 |
The University of Aberdeen | Share of operating surplus/(deficit) in associate(s) | Share of operating surplus/(deficit) in associate(s) | 0 |
The University of Aberdeen | Share of operating surplus/(deficit) in joint venture(s) | Share of operating surplus/(deficit) in joint venture(s) | (912) |
The University of Aberdeen | Taxation | Taxation | 0 |
The University of Aberdeen | Taxation | Taxation on research and development expenditure credit |
These unwanted rows are: summary rows starting with the word “Total” (“Total income” and “Total expenditure”), and a row called “Taxation on research and development expenditure credit” with blanks rather than zeros in the Value..000s.
column.2 I remove these next, and also rename some columns for clarity (Category.marker
is renamed to Category
, and Category
is renamed to Subcategory
, to reflect something approaching the role of these columns in the dataset):
uniFinData <- uniFinData %>%
filter(str_detect(Category, "^Total", negate = TRUE),
Category != "Taxation on research and development expenditure credit") %>%
select(University = HE.Provider, Category = Category.marker,
Subcategory = Category, Value000s = Value..000s.)
Here is the result, sorted by Category
(formerly Category.marker
), filtering again for University of Aberdeen:
University | Category | Subcategory | Value000s |
---|---|---|---|
The University of Aberdeen | Expenditure | Depreciation and amortisation | 16855 |
The University of Aberdeen | Expenditure | Interest and other finance costs | 5840 |
The University of Aberdeen | Expenditure | Other operating expenses | 59306 |
The University of Aberdeen | Expenditure | Restructuring costs | 0 |
The University of Aberdeen | Expenditure | Staff costs | 146956 |
The University of Aberdeen | Gain/(loss) on disposal of tangible assets | Gain/(loss) on disposal of tangible assets | 4493 |
The University of Aberdeen | Gain/(loss) on investment property | Gain/(loss) on investment property | 0 |
The University of Aberdeen | Gain/(loss) on investments | Gain/(loss) on investments | 5780 |
The University of Aberdeen | Income | Donations and endowments | 236 |
The University of Aberdeen | Income | Funding body grants | 87577 |
The University of Aberdeen | Income | Investment income | 2012 |
The University of Aberdeen | Income | Other income | 25764 |
The University of Aberdeen | Income | Research grants and contracts | 45927 |
The University of Aberdeen | Income | Tuition fees and education contracts | 74423 |
The University of Aberdeen | Share of operating surplus/(deficit) in associate(s) | Share of operating surplus/(deficit) in associate(s) | 0 |
The University of Aberdeen | Share of operating surplus/(deficit) in joint venture(s) | Share of operating surplus/(deficit) in joint venture(s) | (912) |
The University of Aberdeen | Taxation | Taxation | 0 |
This looks much better, but it’s still necessary to fix some things:
- the additional, non-income/expenditure, categories filtered-for earlier feature in both the
Category
andSubcategory
columns. I’ll usecase_when
to lump these together as “Other” within theCategory
column - the “brackets” issue in the
Value000s
column, mentioned in the first post (an accountancy convention), needs to be addressed. It turns out these are only used where there are losses in the “Other” subcategories – i.e. values classed as typical/operating “Expenditure” lack brackets and are represented as positive values, something which could potentially trip up an analysis. I’ll fix these issues by (i) using the wonderfulstringr
package to remove brackets and incorporate a minus sign, and (ii) multiplying all values classed as “Expenditure” by -1:
uniFinData <- uniFinData %>%
# remove brackets and replace with minus sign
mutate(Value000s = str_replace_all(Value000s, "\\(", "\\-"),
Value000s = str_replace_all(Value000s, "\\)", "")) %>%
mutate(Value000s = as.numeric(Value000s)) %>%
# multiply Expenditure numbers by -1
mutate(Value000s = case_when(Category == "Expenditure" ~ Value000s*-1, TRUE ~ Value000s)) %>%
# Organise the "Category" column (this approach seemed quickest):
mutate(Category = case_when(
Category == "Income" ~ "Income",
Category == "Expenditure" ~ "Expenditure",
TRUE ~ "Other"),
# manually create factor levels, for more logical sorting
Category = factor(Category, levels = c("Income", "Expenditure", "Other"))) %>%
arrange(University, Category)
# display
glimpse(uniFinData)
## Rows: 85
## Columns: 4
## $ University <chr> "The University of Aberdeen", "The University of Aberdeen"…
## $ Category <fct> Income, Income, Income, Income, Income, Income, Expenditur…
## $ Subcategory <chr> "Donations and endowments", "Funding body grants", "Invest…
## $ Value000s <dbl> 236, 87577, 2012, 25764, 45927, 74423, -16855, -5840, -593…
It can be seen that Value000s
is now in a numeric data type (“dbl”, short for “double precision floating point”), and some values are negative - hurrah!
To have a proper look, I converted the Category
column into a factor, with manually ordered levels, so that the data will sort in a logical order in the following kable
table. Filtering again for University of Aberdeen, things look much better now (and, as intended, the “Expenditure”-related and previously bracketed numbers have minus signs):
University | Category | Subcategory | Value000s |
---|---|---|---|
The University of Aberdeen | Income | Donations and endowments | 236 |
The University of Aberdeen | Income | Funding body grants | 87577 |
The University of Aberdeen | Income | Investment income | 2012 |
The University of Aberdeen | Income | Other income | 25764 |
The University of Aberdeen | Income | Research grants and contracts | 45927 |
The University of Aberdeen | Income | Tuition fees and education contracts | 74423 |
The University of Aberdeen | Expenditure | Depreciation and amortisation | -16855 |
The University of Aberdeen | Expenditure | Interest and other finance costs | -5840 |
The University of Aberdeen | Expenditure | Other operating expenses | -59306 |
The University of Aberdeen | Expenditure | Restructuring costs | 0 |
The University of Aberdeen | Expenditure | Staff costs | -146956 |
The University of Aberdeen | Other | Gain/(loss) on disposal of tangible assets | 4493 |
The University of Aberdeen | Other | Gain/(loss) on investment property | 0 |
The University of Aberdeen | Other | Gain/(loss) on investments | 5780 |
The University of Aberdeen | Other | Share of operating surplus/(deficit) in associate(s) | 0 |
The University of Aberdeen | Other | Share of operating surplus/(deficit) in joint venture(s) | -912 |
The University of Aberdeen | Other | Taxation | 0 |
-
For further guidance, see “Understanding University Finance”, from the British Universities Finance Directors Group (BUFDG), p.40. HESA also provides definitions for many of the terms used. ↩︎
-
This category is now redundant, having been merged into “Taxation” from 2017/18 onwards (confirmed via email by HESA). ↩︎