Open HESA financial data 2: data manipulation

Building on Part 1, in this R post I perform data manipulation to prepare 2020-21 university financial data for analysis and visualisation in Part 3.

Introduction

This post builds on insights from the first post, where I imported and inspected Open finance data from the Higher Education Statistics Agency (HESA) website. We now have a good idea of what’s in the financial dataset in question, and how it’s organised (which, as noted, is markedly un-“tidy”).

In the final part of this exploration, I’ll analyse the “surplus/deficit for the year” of the top 5 Scottish universities in 2021 (according to Times Higher Education). The goal in this post is therefore to extract and prepare the relevant data for this analysis.

Data manipulation

I’ll start by importing the libraries again, and the dataset:

library(tidyverse) # import most of the packages needed
library(kableExtra) # for HTML tables
library(here) # to find absolute path to dataset

# import dataset, skipping first 12 rows
uniFinData <- read.csv(here("static/data/financial_statements.csv"), skip=12)

With the insights from the inspection, we have enough information start filtering for the data of interest.

First, it’s helpful to create a vector of the unique names of the top 5 universities. A rough-and-ready keyword search shows the information needed to identify the 5 universities:

uniFinData %>%
  filter(grepl("Aberdeen|Edinburgh|Glasgow|Strathclyde|Andrews", HE.Provider)) %>%
  distinct(UKPRN, HE.Provider)

##      UKPRN                          HE.Provider
## 1 10007783           The University of Aberdeen
## 2 10007772          Edinburgh Napier University
## 3 10007790          The University of Edinburgh
## 4 10007762        Glasgow Caledonian University
## 5 10002681                Glasgow School of Art
## 6 10007794            The University of Glasgow
## 7 10005337 Queen Margaret University, Edinburgh
## 8 10007803        The University of St. Andrews
## 9 10007805        The University of Strathclyde

As the full provider names are a little unwieldy, I’ll use the relevant UKPRN’s (UK Provider Reference Numbers) for the relevant universities for the purpose of filtering:

universities <- c(10007783, 10007790, 10007794, 10007803, 10007805)

The “surplus/deficit for the year” is calculated from the following broad categories: “Income”, “Expenditure”, “Gain/(loss) on disposal of tangible assets”, “Gain/(loss) on investment property”, “Gain/(loss) on investments”, “Share of operating surplus/(deficit) in associate(s)”, “Share of operating surplus/(deficit) in joint venture(s)”, and “Taxation”. This can be seen from the interactive HESA table here.¹ These are all in the Category.marker column (as can be seen from the list of unique variables at the end of the first post).

I’ll now filter for these financial categories, in addition to filtering for the top 5 universities. I also exclude summary rows where Year.End.Month is “All”, to exclude duplicate rows, and drop some unnecessary columns:

uniFinData <- uniFinData %>%
  filter(UKPRN %in% universities,
         Academic.year == "2020/21",
         Category.marker %in% c(
           "Income",
           "Expenditure",
           "Gain/(loss) on disposal of tangible assets",
           "Gain/(loss) on investment property",
           "Gain/(loss) on investments",
           "Share of operating surplus/(deficit) in associate(s)",
           "Share of operating surplus/(deficit) in joint venture(s)",
           "Taxation"),
         Year.End.Month != "All") %>%
  # drop unnecessary columns
  select(HE.Provider, Category.marker, 
         Category, Value..000s.)

The results, filtering for University of Aberdeen, show that there are still some unwanted rows evident in the Category column:

HE.Provider	Category.marker	Category	Value..000s.
The University of Aberdeen	Expenditure	Depreciation and amortisation	16855
The University of Aberdeen	Expenditure	Interest and other finance costs	5840
The University of Aberdeen	Expenditure	Other operating expenses	59306
The University of Aberdeen	Expenditure	Restructuring costs	0
The University of Aberdeen	Expenditure	Staff costs	146956
The University of Aberdeen	Expenditure	Total expenditure	228957
The University of Aberdeen	Gain/(loss) on disposal of tangible assets	Gain/(loss) on disposal of tangible assets	4493
The University of Aberdeen	Gain/(loss) on investment property	Gain/(loss) on investment property	0
The University of Aberdeen	Gain/(loss) on investments	Gain/(loss) on investments	5780
The University of Aberdeen	Income	Donations and endowments	236
The University of Aberdeen	Income	Funding body grants	87577
The University of Aberdeen	Income	Investment income	2012
The University of Aberdeen	Income	Other income	25764
The University of Aberdeen	Income	Research grants and contracts	45927
The University of Aberdeen	Income	Total income	235939
The University of Aberdeen	Income	Tuition fees and education contracts	74423
The University of Aberdeen	Share of operating surplus/(deficit) in associate(s)	Share of operating surplus/(deficit) in associate(s)	0
The University of Aberdeen	Share of operating surplus/(deficit) in joint venture(s)	Share of operating surplus/(deficit) in joint venture(s)	(912)
The University of Aberdeen	Taxation	Taxation	0
The University of Aberdeen	Taxation	Taxation on research and development expenditure credit

These unwanted rows are: summary rows starting with the word “Total” (“Total income” and “Total expenditure”), and a row called “Taxation on research and development expenditure credit” with blanks rather than zeros in the Value..000s. column.² I remove these next, and also rename some columns for clarity (Category.marker is renamed to Category, and Category is renamed to Subcategory, to reflect something approaching the role of these columns in the dataset):

uniFinData <- uniFinData %>%
  filter(str_detect(Category, "^Total", negate = TRUE),
         Category != "Taxation on research and development expenditure credit") %>%
  select(University = HE.Provider, Category = Category.marker, 
         Subcategory = Category, Value000s = Value..000s.)

Here is the result, sorted by Category (formerly Category.marker), filtering again for University of Aberdeen:

University	Category	Subcategory	Value000s
The University of Aberdeen	Expenditure	Depreciation and amortisation	16855
The University of Aberdeen	Expenditure	Interest and other finance costs	5840
The University of Aberdeen	Expenditure	Other operating expenses	59306
The University of Aberdeen	Expenditure	Restructuring costs	0
The University of Aberdeen	Expenditure	Staff costs	146956
The University of Aberdeen	Gain/(loss) on disposal of tangible assets	Gain/(loss) on disposal of tangible assets	4493
The University of Aberdeen	Gain/(loss) on investment property	Gain/(loss) on investment property	0
The University of Aberdeen	Gain/(loss) on investments	Gain/(loss) on investments	5780
The University of Aberdeen	Income	Donations and endowments	236
The University of Aberdeen	Income	Funding body grants	87577
The University of Aberdeen	Income	Investment income	2012
The University of Aberdeen	Income	Other income	25764
The University of Aberdeen	Income	Research grants and contracts	45927
The University of Aberdeen	Income	Tuition fees and education contracts	74423
The University of Aberdeen	Share of operating surplus/(deficit) in associate(s)	Share of operating surplus/(deficit) in associate(s)	0
The University of Aberdeen	Share of operating surplus/(deficit) in joint venture(s)	Share of operating surplus/(deficit) in joint venture(s)	(912)
The University of Aberdeen	Taxation	Taxation	0

This looks much better, but it’s still necessary to fix some things:

the additional, non-income/expenditure, categories filtered-for earlier feature in both the Category and Subcategory columns. I’ll use case_when to lump these together as “Other” within the Category column
the “brackets” issue in the Value000s column, mentioned in the first post (an accountancy convention), needs to be addressed. It turns out these are only used where there are losses in the “Other” subcategories – i.e. values classed as typical/operating “Expenditure” lack brackets and are represented as positive values, something which could potentially trip up an analysis. I’ll fix these issues by (i) using the wonderful stringr package to remove brackets and incorporate a minus sign, and (ii) multiplying all values classed as “Expenditure” by -1:

uniFinData <- uniFinData %>%
  # remove brackets and replace with minus sign
  mutate(Value000s = str_replace_all(Value000s, "\\(", "\\-"), 
         Value000s = str_replace_all(Value000s, "\\)", "")) %>%
  mutate(Value000s = as.numeric(Value000s)) %>%
  # multiply Expenditure numbers by -1
  mutate(Value000s = case_when(Category == "Expenditure" ~ Value000s*-1, TRUE ~ Value000s)) %>%
  # Organise the "Category" column (this approach seemed quickest):
  mutate(Category = case_when(
    Category == "Income" ~ "Income",
    Category == "Expenditure" ~ "Expenditure",
    TRUE ~ "Other"),
    # manually create factor levels, for more logical sorting
    Category = factor(Category, levels = c("Income", "Expenditure", "Other"))) %>%
  arrange(University, Category)

# display
glimpse(uniFinData)

## Rows: 85
## Columns: 4
## $ University  <chr> "The University of Aberdeen", "The University of Aberdeen"…
## $ Category    <fct> Income, Income, Income, Income, Income, Income, Expenditur…
## $ Subcategory <chr> "Donations and endowments", "Funding body grants", "Invest…
## $ Value000s   <dbl> 236, 87577, 2012, 25764, 45927, 74423, -16855, -5840, -593…

It can be seen that Value000s is now in a numeric data type (“dbl”, short for “double precision floating point”), and some values are negative - hurrah!

To have a proper look, I converted the Category column into a factor, with manually ordered levels, so that the data will sort in a logical order in the following kable table. Filtering again for University of Aberdeen, things look much better now (and, as intended, the “Expenditure”-related and previously bracketed numbers have minus signs):

University	Category	Subcategory	Value000s
The University of Aberdeen	Income	Donations and endowments	236
The University of Aberdeen	Income	Funding body grants	87577
The University of Aberdeen	Income	Investment income	2012
The University of Aberdeen	Income	Other income	25764
The University of Aberdeen	Income	Research grants and contracts	45927
The University of Aberdeen	Income	Tuition fees and education contracts	74423
The University of Aberdeen	Expenditure	Depreciation and amortisation	-16855
The University of Aberdeen	Expenditure	Interest and other finance costs	-5840
The University of Aberdeen	Expenditure	Other operating expenses	-59306
The University of Aberdeen	Expenditure	Restructuring costs	0
The University of Aberdeen	Expenditure	Staff costs	-146956
The University of Aberdeen	Other	Gain/(loss) on disposal of tangible assets	4493
The University of Aberdeen	Other	Gain/(loss) on investment property	0
The University of Aberdeen	Other	Gain/(loss) on investments	5780
The University of Aberdeen	Other	Share of operating surplus/(deficit) in associate(s)	0
The University of Aberdeen	Other	Share of operating surplus/(deficit) in joint venture(s)	-912
The University of Aberdeen	Other	Taxation	0

For further guidance, see “Understanding University Finance”, from the British Universities Finance Directors Group (BUFDG), p.40. HESA also provides definitions for many of the terms used. ↩︎
This category is now redundant, having been merged into “Taxation” from 2017/18 onwards (confirmed via email by HESA). ↩︎

Introduction¶

Data manipulation¶

Introduction

Data manipulation