In this series, using R, I explore 2020-21 financial data from the top 5 Scottish universities, downloaded from the HESA website, focusing here on data inspection.
Introduction
In this blog post, together with two related posts, I’ll explore some Open finance data from the Higher Education Statistics Agency (HESA) website with R. After importing and inspecting the data (this post), I’ll do some data cleaning and manipulation (the second post), and then finish with a basic analysis that allows for a financial comparison between some Scottish universities (the final post). I’ll compare the top 5 universities, as stated in Times Higher Education here, i.e. the Universities of Edinburgh, Glasgow, Aberdeen, Dundee, and St. Andrews.
The data I’ll use is reported to HESA by universities annually as part of a “Consolidated statement of comprehensive income and expenditure”. It is available here, where it’s possible to download the source data, which I’ll use below,1 and also inspect the data in an online table. I’ll look exclusively at the data from the academic year 2020–21.
Although it’s not evident from HESA’s online table, the source data is far from “tidy”. By this I mean that we do not have a situation where each column represents a variable, and each row an observation.2 Instead, multiple variables are included within single columns. To get to a point where it’s possible to do something useful with the data, therefore, it’s necessary to do some data wrangling/manipulation. But before doing this, it’s necessary to inspect the data to understand what needs to be manipulated - that’s the goal of this post.
Importing and inspecting the data
To get started, I’ll import necessary packages:
library(tidyverse) # import most of the packages needed
library(kableExtra) # for HTML tables
library(here) # to find absolute path to dataset
I’ll start by loading the first 15 rows of the dataset, and display these using the kable
package:
uniFinData <- read.csv(here("static/data/financial_statements.csv"), nrows=15, header=F)
# display cols 1-4 via kable:
kable(uniFinData[,1:4], format = "html", col.names = NULL, row.names = TRUE) %>%
kable_styling(html_font = "Arial", font_size = 14) %>%
scroll_box(height = "300px")
1 | Title | Consolidated statement of comprehensive income and expenditure by HE provider and academic year | ||
2 | Location | UK | ||
3 | Academic years | 2015/16 to 2020/21 | ||
4 | Data source | HESA / OfS | ||
5 | Data source link | https://www.hesa.ac.uk/data-and-analysis/finances/table-1 | ||
6 | Data file canonical link | https://www.hesa.ac.uk/data-and-analysis/finances/table-1.csv | ||
7 | Licence | Creative Commons Attribution 4.0 International Licence | ||
8 | Code page | Unicode UTF-8 | ||
9 | ||||
10 | Last updated | Jul-22 | ||
11 | ||||
12 | ||||
13 | UKPRN | HE Provider | Country of HE provider | Region of HE provider |
14 | 10007783 | The University of Aberdeen | Scotland | Scotland |
15 | 10007849 | Abertay University | Scotland | Scotland |
(For neatness, I’ll hide the kable
code blocks from now on, with the echo = FALSE
knitr option.)
As can be seen, there is some metadata in the first few rows. This includes, for example, the name of the dataset (“Consolidated statement of comprehensive income and expenditure by HE provider and academic year “), when it was last updated (“Jul-22”), and the dataset’s date range (academic years “2015/16 to 2020/21”). The data table itself starts on row 13.
To import the data table, I’ll skip the first 12 rows:
# import dataset
uniFinData <- read.csv(here("static/data/financial_statements.csv"), skip=12)
# inspect first few rows
glimpse(uniFinData)
## Rows: 101,244
## Columns: 10
## $ UKPRN <int> 10007783, 10007849, 10007856, 10000291, 1000775…
## $ HE.Provider <chr> "The University of Aberdeen", "Abertay Universi…
## $ Country.of.HE.provider <chr> "Scotland", "Scotland", "Wales", "England", "En…
## $ Region.of.HE.provider <chr> "Scotland", "Scotland", "Wales", "East of Engla…
## $ Academic.year <chr> "2015/16", "2015/16", "2015/16", "2015/16", "20…
## $ Financial.year.end <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ Category.marker <chr> "Actuarial gain/(loss) in respect of pension sc…
## $ Category <chr> "Actuarial gain/(loss) in respect of pension sc…
## $ Value..000s. <chr> "(7408)", "(6329)", "(10455)", "(28888)", "(884…
## $ Year.End.Month <chr> "All", "All", "All", "All", "All", "All", "All"…
The glimpse()
function, which prints the first few rows in a transposed format (i.e. with columns listed from top to bottom, and rows from left to right), shows that most columns have been imported as “character” type. The exception is the UK provider number column, UKPRN
, which is imported as a numeric “integer” type.
A couple of things are evident from the first few rows. One is that the same variable name appears in different columns, e.g. “Scotland” appears in both the Country
and Region
columns. So it’s already clear the data isn’t in a “tidy” format.
Another is that the Value..000s.
column, which should be numeric, has been been imported as ‘character’ type. This is due to the use of brackets to signify some negative values, an accountancy convention. This will need to be addressed later, before numerical operations can be performed with this column.
There is an option in read.csv
to import “character” data as “factors”, stringsAsFactors = True
. I’ll leave these columns as “character”-type for now, though (and, in any event, it’s clear that Value..000s.
should be “numeric” rather than “factor”-type).
To calculate the number of unique variables in the dataset the n_distinct
function can be used, in conjunction with summarise()
in dplyr
. The across()
function, together with the everything()
option, applies n_distinct
to all columns:
uniFinData %>%
summarise(across(everything(), n_distinct)) %>%
# the result is a 1-row data frame, which can be printed transposed with glimpse:
glimpse()
## Rows: 1
## Columns: 10
## $ UKPRN <int> 287
## $ HE.Provider <int> 288
## $ Country.of.HE.provider <int> 4
## $ Region.of.HE.provider <int> 12
## $ Academic.year <int> 6
## $ Financial.year.end <int> 43
## $ Category.marker <int> 20
## $ Category <int> 39
## $ Value..000s. <int> 18473
## $ Year.End.Month <int> 12
A couple of points to note from this are, first, columns like Value..000s.
, which have a lot of unique numeric values, are of little interest at this stage as they will not be used for filtering. Columns with a smaller number of categorical text values, which can be used for filtering, are of greater interest.
Second, there is an odd mismatch between the number of unique values in the UKPRN
and HE.Provider
columns. Without knowing anything about the data, we might expect these numbers to match – we might wonder, in particular, whether this mismatch is due to data quality issues (e.g. missing or erroneous data).
It turns out that the mismatch is due to the use of “total” entries in HE.Provider
column, paired with blank entries in the UKPRN
column. This can be seen by grouping the data by UKPRN
and creating a summary of distinct values in the HE.Provider
column:
# examining
uniFinData %>%
group_by(UKPRN) %>%
summarise(num = n_distinct(HE.Provider)) %>%
arrange(desc(num)) %>%
head(3)
## # A tibble: 3 × 2
## UKPRN num
## <int> <int>
## 1 NA 2
## 2 10000163 1
## 3 10000216 1
Sorting, descending, by number shows there are 2 distinct variables in HE.Provider
paired with blank NA
entries in UKPRN
(which are also grouped, as NA
is treated as a value by R here). Looking specifically at these unique values in HE.Provider
shows that they are “total” entries:
uniFinData %>%
# specifically filtering for these NA values:
filter(is.na(UKPRN)) %>%
distinct(HE.Provider)
## HE.Provider
## 1 Total
## 2 Year to date total
Otherwise, there is the expected relationship between the UKPRN
and HE.Provider
. Excluding rows with NA
in UKPRN
, before using summarise
and across
(as above), shows equivalent numbers of distinct variables:
uniFinData %>%
filter(!is.na(UKPRN)) %>%
summarise(across(c(UKPRN, HE.Provider), n_distinct))
## UKPRN HE.Provider
## 1 286 286
Returning now to the main analysis, it’s helpful to create a list of the distinct values in the columns of interest. This can be done by using lapply
to apply the unique
function across all columns of the uniFinData
data frame:
# create a list of all the unique values in the table
uniFinDataUniqueVal <- lapply(uniFinData, unique)
We can now print the elements of the list corresponding to the columns of interest (i.e. those useful for filtering) for reference. From the initial n_distinct
calculation above, these are in columns 3-8, and 10:
# display the list elements of interest
uniFinDataUniqueVal[c(3:8, 10)]
## $Country.of.HE.provider
## [1] "Scotland" "Wales" "England" "Northern Ireland"
##
## $Region.of.HE.provider
## [1] "Scotland" "Wales"
## [3] "East of England" "West Midlands"
## [5] "South West" "Northern Ireland"
## [7] "London" "East Midlands"
## [9] "North West" "Yorkshire and The Humber"
## [11] "South East" "North East"
##
## $Academic.year
## [1] "2015/16" "2016/17" "2017/18" "2018/19" "2019/20" "2020/21"
##
## $Financial.year.end
## [1] "" "2018-07-31" "2018-03-31" "2018-06-30" "2018-02-28"
## [6] "2018-05-31" "2017-12-31" "2018-01-31" "2019-07-31" "2018-08-31"
## [11] "2019-03-31" "2018-12-31" "2018-10-31" "2018-09-30" "2019-06-30"
## [16] "2018-11-30" "2019-01-31" "2019-02-28" "2019-05-31" "2020-07-31"
## [21] "2019-08-31" "2020-03-31" "2019-12-31" "2019-10-31" "2019-09-30"
## [26] "2020-06-30" "2019-11-30" "2020-01-31" "2020-02-28" "2020-05-31"
## [31] "2019-12-28" "2021-07-31" "2020-08-31" "2021-03-31" "2020-12-31"
## [36] "2020-10-31" "2020-09-30" "2021-06-30" "2020-11-30" "2021-01-31"
## [41] "2021-02-28" "2021-05-31" "2020-12-26"
##
## $Category.marker
## [1] "Actuarial gain/(loss) in respect of pension schemes"
## [2] "Total comprehensive income for the year represented by:"
## [3] "Change in fair value of hedging financial instrument(s) plus foreign currency translation"
## [4] "Expenditure"
## [5] "Dividends"
## [6] "Income"
## [7] "Gain/(loss) on disposal of tangible assets"
## [8] "Gain/(loss) on investment property"
## [9] "Gain/(loss) on investments"
## [10] "Miscellaneous types of Other comprehensive income"
## [11] "Surplus for the year attributable to:"
## [12] "Share of operating surplus/(deficit) in associate(s)"
## [13] "Share of operating surplus/(deficit) in joint venture(s)"
## [14] "Surplus/(deficit) before other gains/losses and share of surplus/(deficit) in joint ventures and associates"
## [15] "Surplus/(deficit) before tax"
## [16] "Surplus/(deficit) for the year"
## [17] "Taxation"
## [18] "Total comprehensive income/(expenditure) for the year"
## [19] "Taxation "
## [20] "Unrealised surplus on revaluation of land and buildings"
##
## $Category
## [1] "Actuarial gain/(loss) in respect of pension schemes"
## [2] "Attributable to the non-controlling interest"
## [3] "Attributable to the University"
## [4] "Change in fair value of hedging financial instrument(s) plus foreign currency translation"
## [5] "Depreciation and amortisation"
## [6] "Dividends"
## [7] "Donations and endowments"
## [8] "Endowment comprehensive income for the year"
## [9] "Funding body grants"
## [10] "Gain/(loss) on disposal of tangible assets"
## [11] "Gain/(loss) on investment property"
## [12] "Gain/(loss) on investments"
## [13] "Interest and other finance costs"
## [14] "Investment income"
## [15] "Miscellaneous types of Other comprehensive income"
## [16] "Non-controlling interest"
## [17] "Other income"
## [18] "Other operating expenses"
## [19] "Provider"
## [20] "Research grants and contracts"
## [21] "Restricted comprehensive income for the year"
## [22] "Restructuring costs"
## [23] "Revaluation reserve comprehensive income for the year"
## [24] "Share of operating surplus/(deficit) in associate(s)"
## [25] "Share of operating surplus/(deficit) in joint venture(s)"
## [26] "Staff costs"
## [27] "Surplus/(deficit) before other gains/losses and share of surplus/(deficit) in joint ventures and associates"
## [28] "Surplus/(deficit) before tax"
## [29] "Surplus/(deficit) for the year"
## [30] "Taxation"
## [31] "Taxation on research and development expenditure credit"
## [32] "Total"
## [33] "Total comprehensive income/(expenditure) for the year"
## [34] "Total expenditure"
## [35] "Total income"
## [36] "Total taxation"
## [37] "Tuition fees and education contracts"
## [38] "Unrealised surplus on revaluation of land and buildings"
## [39] "Unrestricted comprehensive income for the year"
##
## $Year.End.Month
## [1] "All" "07, July" "03, March" "06, June"
## [5] "02, February" "05, May" "12, December" "01, January"
## [9] "08, August" "10, October" "09, September" "11, November"
This provides a helpful reference for the next, data wrangling, step, as we have a better sense of the information contained in the dataset, how it’s organised, and the work that needs to be done prior to analysis.
-
Data downloaded on 15/11/22. Aside from some updates for English providers in November 2022, financial data for the 2021-22 academic year won’t be available until the summer of 2023. ↩︎
-
For more on “tidy data”, a term coined by Hadley Wickham, see for example: https://tidyr.tidyverse.org/articles/tidy-data.html. ↩︎