Introducing Multidimensional Data
Most of the data you’ll use on our site is statistical data, facts as figures. Recall that a dataset is a collection of related data. Statistical data is often published in the form of a multidimensional dataset. This may be unfamiliar jargon, but as we’ll see, it’s a familiar and very useful idea.
After reading this tutorial, you’ll:
- Understand the idea of multidimensional data
- Understand the technical terminology of multidimensional data
- Be able to find the precise data you need inside a multidimensional dataset
- Be ready to use PublishMyData to explore multidimensional data in detail
As an example, consider these Olympic medal results, shown in familiar tabular form.
The rows and columns format is familiar from the spreadsheets we use every day, but this is in fact, multidimensional data.
These data have two dimensions, country and medal colour. The country dimension is shown as rows, and the medal colour dimension as columns.
It’s often convenient to use a geographic dimension, such as country: we call these the Reference Area dimension of the dataset. By convention in PublishMyData we always show the Reference Area as rows.
So we’re already very comfortable with working with two-dimensional data, but these data show medal results for the 2012 Olympics, so we can think of the data as including a third dimension, the year of the results. We call a time-based dimension the Reference Period.
Here, Reference Period, the year dimension has the same value for all the data in our table, so we describe this dimension as locked to a value (in this case, 2012). By contrast, the country and medal dimensions can take different values. We call these free dimensions.
To obtain a value we lock one dimension at a time - by choosing what value it takes - until all the dimensions are locked. Here, we choose year = 2012, country = USA, medal = gold. Where our locked dimension values intersect, we have located the observation we’re interested in.
We call the number in the cell the measure. The measure has a value (46) and a unit (medals won). This measure is, intuitively, a count - ie how many of something were there?. It’s also very common to encounter ratios, expressed as a percentage or as somethings per something.
Two dimensions plus one
Let’s imagine that we obtain two more spreadsheets of medal data, covering 2008 and 2004.
Note that our Observation from before - because it locked values for all of its dimensions - remains valid, but we’ve added a bunch of new observations.
Now, suppose we are interested in how the number of gold medals won by each team has changed over time. Our data set contains all this data, but it’s not very convenient to extract it (even in a small toy data set like this.)
Wouldn’t it be more convenient to have a table which locked the medal dimension to ‘gold’ and then showed us how that changed over time?
This is more like it, but if all you have is spreadsheets, you’ll need some Excel skills, and maybe a little bit of trial and error to get here. There’s a better way to do it.
First of all, let’s make our illustrations a bit clearer by getting rid of the numbers, and just using colour to show what’s in each cell. This is just for the illustration though - bear in mind that the measures are all still there!
And finally, let’s adjust how we draw it a little bit. Imagine stacking the spreadsheets front to back. Notice that the data isn’t changing, just we’re adding a third dimension to the illustration.
Here we have a data cube. The one illustrated is a cube, but in practice, the dimensions don’t need to be - and indeed most likely won’t be the same size. For instance, we’ve only shown three countries, but the example could include a hundred more rows.
The important change here is that the data are no longer tied to a two dimensional representation, they fill space.
Let’s look more closely at what we have here.
Our three original spreadsheets, showing the free dimensions Reference Area and Medal and the locked dimension, Years are still here…
…but we now realise that they are slices through the cube, and that there are other ways of slicing the cube…
Each of these slices is a potential two dimensional spreadsheet.
Sliced one direction (top 3), we lock Year and have Reference Area and Medal free.
Remember that free dimension are what we tabulate, so our spreadsheets of these is Reference Area (rows) and Medal (cols) for a particular year - the original setup.
Sliced another way (middle 3), we lock Medal, and have Reference Area and Medal free. Our tabular view will show Reference Area (rows) and Year (cols) for a particular medal colour.
Sliced still another way (bottom 3), we lock Reference Area and have Year and Medal free. Our table will show Medal and Year for a particular country.
This is where we start to see the usefulness of arranging our data as a cube. By choosing a dimension to lock, and its value, we can obtain a table of exactly the view onto our data that we need.
As we’ve seen, by locking all but two dimensions, we end up with a 2d table that is a cross-section through the cube. When one of the two free dimensions is the Reference Period, the cross-section can be more precisely described as a time series - because it shows change over time.
More than three dimensions
So what’s all the fuss about? After all it’s pretty common to see Excel workbooks that do just this - a different slice on each worksheet.
Recall how, a little earlier, looking at our 2d table we realised we’d been assuming a third (year) dimension? Perhaps we might now realise that there’s a fourth dimension - the gender of the athlete. Our tables so far have locked the gender dimension to ‘All’ but maybe we also want to provide charts for male and female athletes.
Recall too how we redrew our table to add the new dimension. Clearly that’s not possible this time - we’ve reached the limits of what can be drawn on a screen - and its not necessarily easy to visualise in your mind’s eye either, but it’s not necessary to visualise it - just keep locking down dimensions, slicing through the hypercube until you have something with just two free dimensions and you’ll have a spreadsheet.
There’s no reason to stop with just four dimensions. We can go on to imagine datasets of any dimensionality. Let’s add a fifth.
It’s almost impossible to visualise this, but again, you don’t need to - just apply the dimension locking trick to slice up the n-dimensional cube and PublishMyData will reduce it to an easily-handled table.
We’ve now reached the limit of what would be practical in Excel - this would need 54 worksheets, but the Data Cube can keep on adding dimensions indefinitely.
And what about our example observation? It’s still there, at the co-ordinates: year = 2012, gender = all, competition = Summer Olympics, country = USA, medal = gold. Turns out it was 5-dimensional all along.
Let’s consider the fortunes of some different geographies in the former Yugoslavia. From the examples above, we might expect to see something like this:
However, this diagram is not correct. The real position is a little bit more complicated. Serbia and Montenegro competed as a single team in the 2004 Olympics, but as separate countries subsequently. It doesn’t make sense to have an observation for just Serbia or just Montenegro in 2004 for any medal, and it doesn’t make sense to have observations with a Reference Area of ‘Serbia AND Montenegro’ in 2008 or 2012. There are holes in our cube: It is sparse.
Note that this is not the same as winning, say, zero gold medals in a given year - which would be a perfectly valid observation. These are holes in the cube because they represent impossible co-ordinates, combinations of dimension that make no real-world sense. Reference Area is particularly prone to this sort of change over time, and its very common to see this pattern in real-world data. Moreover real-world data may include gaps because for some reason (including error) no observation was recorded for a particular combination of dimensions. Sparse cubes are very much the rule, rather than the exception, and sparse tables will result from this.
There is, therefore, no guarantee that a data cube will have an observation for every possible combination of dimensions.
To sum up, here’s a recap of some of the language we’ve covered:
- cross section
- A table. A two dimensional slice.
- data cube
- A way of representing statistical data as observations at co-ordinates specified by the value of their dimensions. Data cubes can have 2, 3, 4 or any number of dimensions.
- A collection of related data issued under a shared useage licence.
- A property of an observation that takes a value. For instance it may be the geographical area to which the observaion relates, the year for which the observation records a value etc. All the observations in a dataset will have the same number of dimensions.
- free dimension
- A dimension which can take several values
- In geometry, the four(+) dimensional eqivalent of a cube. Frequently used to make this stuff sound more complicated than it really is.
- locked dimension
- A dimension for which we have specified a single value
- What the value of an observation represents. Commonly, it will be a count or a ratio. In PublishMyData, this is always a dimension, so a count and ratio will be different observations (though all other dimension values will be the same).
- multidimensional data
- Data which comprises observations for given values of its dimensions
- n-dimensional cube
- In geometry, a cube with n dimensions, where n is a number larger than 3. (See also hypercube)
- An individiual item of data, specified by the intersection of several dimensions at particular values.
- Part of a data cube obtained by locking the values of one or more dimensions. Note that in our examples above all the slices shown have two dimensions, but a six dimensional cube might technically have five, four, three and two dimensional slices.
- Reference Area
- The dimension which specifies the geographic area of an observation.
- Reference Period
- The dimension which specifies the temporal period of an observation.
- Describes a data cube where some combinations of dimension values do not have observations. ie most of the cubes you will meet in real life data.
- A slice through a data cube which has two free dimensions, and which locks a value for all the other dimensions. Such two dimensional slices are useful because they are equivalent to a familiar spreadsheet view.
- time series
- A cross section in which Reference Period is a free dimension. ie A table showing change over time.
Most pages in this site have an API tab which includes contextual details of how to access the data programmatically. The API tab on the Getting Started help page describes some overarching principles.