Building on Part 1, in this R post I perform data manipulation to prepare 2020-21 university financial data for analysis and visualisation in Part 3.

Introduction

This post builds on insights from the first post, where I imported and inspected Open finance data from the Higher Education Statistics Agency (HESA) website. We now have a good idea of what’s in the financial dataset in question, and how it’s organised (which, as noted, is markedly un-“tidy”).

In the final part of this exploration, I’ll analyse the “surplus/deficit for the year” of the top 5 Scottish universities in 2021 (according to Times Higher Education). The goal in this post is therefore to extract and prepare the relevant data for this analysis.

Data manipulation

I’ll start by importing the libraries again, and the dataset:

library(tidyverse) # import most of the packages needed
library(kableExtra) # for HTML tables
library(here) # to find absolute path to dataset

# import dataset, skipping first 12 rows
uniFinData <- read.csv(here("static/data/financial_statements.csv"), skip=12)

With the insights from the inspection, we have enough information start filtering for the data of interest.

First, it’s helpful to create a vector of the unique names of the top 5 universities. A rough-and-ready keyword search shows the information needed to identify the 5 universities:

uniFinData %>%
  filter(grepl("Aberdeen|Edinburgh|Glasgow|Strathclyde|Andrews", HE.Provider)) %>%
  distinct(UKPRN, HE.Provider)
##      UKPRN                          HE.Provider
## 1 10007783           The University of Aberdeen
## 2 10007772          Edinburgh Napier University
## 3 10007790          The University of Edinburgh
## 4 10007762        Glasgow Caledonian University
## 5 10002681                Glasgow School of Art
## 6 10007794            The University of Glasgow
## 7 10005337 Queen Margaret University, Edinburgh
## 8 10007803        The University of St. Andrews
## 9 10007805        The University of Strathclyde

As the full provider names are a little unwieldy, I’ll use the relevant UKPRN’s (UK Provider Reference Numbers) for the relevant universities for the purpose of filtering:

universities <- c(10007783, 10007790, 10007794, 10007803, 10007805)

The “surplus/deficit for the year” is calculated from the following broad categories: “Income”, “Expenditure”, “Gain/(loss) on disposal of tangible assets”, “Gain/(loss) on investment property”, “Gain/(loss) on investments”, “Share of operating surplus/(deficit) in associate(s)”, “Share of operating surplus/(deficit) in joint venture(s)”, and “Taxation”. This can be seen from the interactive HESA table here.1 These are all in the Category.marker column (as can be seen from the list of unique variables at the end of the first post).

I’ll now filter for these financial categories, in addition to filtering for the top 5 universities. I also exclude summary rows where Year.End.Month is “All”, to exclude duplicate rows, and drop some unnecessary columns:

uniFinData <- uniFinData %>%
  filter(UKPRN %in% universities,
         Academic.year == "2020/21",
         Category.marker %in% c(
           "Income",
           "Expenditure",
           "Gain/(loss) on disposal of tangible assets",
           "Gain/(loss) on investment property",
           "Gain/(loss) on investments",
           "Share of operating surplus/(deficit) in associate(s)",
           "Share of operating surplus/(deficit) in joint venture(s)",
           "Taxation"),
         Year.End.Month != "All") %>%
  # drop unnecessary columns
  select(HE.Provider, Category.marker, 
         Category, Value..000s.)

The results, filtering for University of Aberdeen, show that there are still some unwanted rows evident in the Category column:

HE.Provider Category.marker Category Value..000s.
The University of Aberdeen Expenditure Depreciation and amortisation 16855
The University of Aberdeen Expenditure Interest and other finance costs 5840
The University of Aberdeen Expenditure Other operating expenses 59306
The University of Aberdeen Expenditure Restructuring costs 0
The University of Aberdeen Expenditure Staff costs 146956
The University of Aberdeen Expenditure Total expenditure 228957
The University of Aberdeen Gain/(loss) on disposal of tangible assets Gain/(loss) on disposal of tangible assets 4493
The University of Aberdeen Gain/(loss) on investment property Gain/(loss) on investment property 0
The University of Aberdeen Gain/(loss) on investments Gain/(loss) on investments 5780
The University of Aberdeen Income Donations and endowments 236
The University of Aberdeen Income Funding body grants 87577
The University of Aberdeen Income Investment income 2012
The University of Aberdeen Income Other income 25764
The University of Aberdeen Income Research grants and contracts 45927
The University of Aberdeen Income Total income 235939
The University of Aberdeen Income Tuition fees and education contracts 74423
The University of Aberdeen Share of operating surplus/(deficit) in associate(s) Share of operating surplus/(deficit) in associate(s) 0
The University of Aberdeen Share of operating surplus/(deficit) in joint venture(s) Share of operating surplus/(deficit) in joint venture(s) (912)
The University of Aberdeen Taxation Taxation 0
The University of Aberdeen Taxation Taxation on research and development expenditure credit
 

These unwanted rows are: summary rows starting with the word “Total” (“Total income” and “Total expenditure”), and a row called “Taxation on research and development expenditure credit” with blanks rather than zeros in the Value..000s. column.2 I remove these next, and also rename some columns for clarity (Category.marker is renamed to Category, and Category is renamed to Subcategory, to reflect something approaching the role of these columns in the dataset):

uniFinData <- uniFinData %>%
  filter(str_detect(Category, "^Total", negate = TRUE),
         Category != "Taxation on research and development expenditure credit") %>%
  select(University = HE.Provider, Category = Category.marker, 
         Subcategory = Category, Value000s = Value..000s.)

Here is the result, sorted by Category (formerly Category.marker), filtering again for University of Aberdeen:

University Category Subcategory Value000s
The University of Aberdeen Expenditure Depreciation and amortisation 16855
The University of Aberdeen Expenditure Interest and other finance costs 5840
The University of Aberdeen Expenditure Other operating expenses 59306
The University of Aberdeen Expenditure Restructuring costs 0
The University of Aberdeen Expenditure Staff costs 146956
The University of Aberdeen Gain/(loss) on disposal of tangible assets Gain/(loss) on disposal of tangible assets 4493
The University of Aberdeen Gain/(loss) on investment property Gain/(loss) on investment property 0
The University of Aberdeen Gain/(loss) on investments Gain/(loss) on investments 5780
The University of Aberdeen Income Donations and endowments 236
The University of Aberdeen Income Funding body grants 87577
The University of Aberdeen Income Investment income 2012
The University of Aberdeen Income Other income 25764
The University of Aberdeen Income Research grants and contracts 45927
The University of Aberdeen Income Tuition fees and education contracts 74423
The University of Aberdeen Share of operating surplus/(deficit) in associate(s) Share of operating surplus/(deficit) in associate(s) 0
The University of Aberdeen Share of operating surplus/(deficit) in joint venture(s) Share of operating surplus/(deficit) in joint venture(s) (912)
The University of Aberdeen Taxation Taxation 0

 
This looks much better, but it’s still necessary to fix some things:

  • the additional, non-income/expenditure, categories filtered-for earlier feature in both the Category and Subcategory columns. I’ll use case_when to lump these together as “Other” within the Category column
  • the “brackets” issue in the Value000s column, mentioned in the first post (an accountancy convention), needs to be addressed. It turns out these are only used where there are losses in the “Other” subcategories – i.e. values classed as typical/operating “Expenditure” lack brackets and are represented as positive values, something which could potentially trip up an analysis. I’ll fix these issues by (i) using the wonderful stringr package to remove brackets and incorporate a minus sign, and (ii) multiplying all values classed as “Expenditure” by -1:
uniFinData <- uniFinData %>%
  # remove brackets and replace with minus sign
  mutate(Value000s = str_replace_all(Value000s, "\\(", "\\-"), 
         Value000s = str_replace_all(Value000s, "\\)", "")) %>%
  mutate(Value000s = as.numeric(Value000s)) %>%
  # multiply Expenditure numbers by -1
  mutate(Value000s = case_when(Category == "Expenditure" ~ Value000s*-1, TRUE ~ Value000s)) %>%
  # Organise the "Category" column (this approach seemed quickest):
  mutate(Category = case_when(
    Category == "Income" ~ "Income",
    Category == "Expenditure" ~ "Expenditure",
    TRUE ~ "Other"),
    # manually create factor levels, for more logical sorting
    Category = factor(Category, levels = c("Income", "Expenditure", "Other"))) %>%
  arrange(University, Category)

# display
glimpse(uniFinData)
## Rows: 85
## Columns: 4
## $ University  <chr> "The University of Aberdeen", "The University of Aberdeen"…
## $ Category    <fct> Income, Income, Income, Income, Income, Income, Expenditur…
## $ Subcategory <chr> "Donations and endowments", "Funding body grants", "Invest…
## $ Value000s   <dbl> 236, 87577, 2012, 25764, 45927, 74423, -16855, -5840, -593…

It can be seen that Value000s is now in a numeric data type (“dbl”, short for “double precision floating point”), and some values are negative - hurrah!

To have a proper look, I converted the Category column into a factor, with manually ordered levels, so that the data will sort in a logical order in the following kable table. Filtering again for University of Aberdeen, things look much better now (and, as intended, the “Expenditure”-related and previously bracketed numbers have minus signs):

University Category Subcategory Value000s
The University of Aberdeen Income Donations and endowments 236
The University of Aberdeen Income Funding body grants 87577
The University of Aberdeen Income Investment income 2012
The University of Aberdeen Income Other income 25764
The University of Aberdeen Income Research grants and contracts 45927
The University of Aberdeen Income Tuition fees and education contracts 74423
The University of Aberdeen Expenditure Depreciation and amortisation -16855
The University of Aberdeen Expenditure Interest and other finance costs -5840
The University of Aberdeen Expenditure Other operating expenses -59306
The University of Aberdeen Expenditure Restructuring costs 0
The University of Aberdeen Expenditure Staff costs -146956
The University of Aberdeen Other Gain/(loss) on disposal of tangible assets 4493
The University of Aberdeen Other Gain/(loss) on investment property 0
The University of Aberdeen Other Gain/(loss) on investments 5780
The University of Aberdeen Other Share of operating surplus/(deficit) in associate(s) 0
The University of Aberdeen Other Share of operating surplus/(deficit) in joint venture(s) -912
The University of Aberdeen Other Taxation 0

 


  1. For further guidance, see “Understanding University Finance”, from the British Universities Finance Directors Group (BUFDG), p.40. HESA also provides definitions for many of the terms used. ↩︎

  2. This category is now redundant, having been merged into “Taxation” from 2017/18 onwards (confirmed via email by HESA). ↩︎