Why do we want to transform a table from wide to long?

In yesterday’s article I explained how to fetch statistics from GENESIS, using the statistics on causes of death as an example. After downloading all the data and gluing the tables together, you are finally left with one huge monster table.

It has 36 columns (2 genders × 18 age groups) and 2480 rows (31 years × 80 causes of death). Unless you are some kind of savant, you are probably going to feel overwhelmed.

This type of table is called a “cross table” because the values gain meaning through rows and columns crossing each other. The structure is also referred to as “wide”, for obvious reasons. Luckily there is a handy tool called “pivot table”, present in all major spreadsheet applications, that allows us to look at such a table from all sorts of different perspectives. Let’s say we want to know how many women and men aged 20 to 34 died per year of any type of “malignant neoplasm” from 1980 to 2010. Extracting the necessary data manually from such a huge cross table would be quite a drag, but aided by a pivoting tool it becomes a cinch. How pivoting is done in practice I will explain in a separate article.

The thing is, though, that the table holding the data needs to be in a specific structure, so that the spreadsheet application knows how to deal with it and can apply its pivoting mechanisms. This structure is referred to as “long”: every row holds exactly one value, together with the labels that identify it. The picture illustrates how these types relate to each other.

To achieve this transformation we are going to use the melt function of the reshape package for R. In case you’re new to R, check out this wikibook for more information. To make a long story short: if you’re (halfway) seriously interested in statistical computing, there is no way around this tool.

Awesome! So, how is it done now?

Let’s keep it simple for now and assume we have a cross table that looks like this in a spreadsheet:

First we need to export the wide formatted table as CSV into a text file ‘wide.csv’. It is important that the first column also has a name in the cell above it. Let’s say a good name for the a,b column is T; then the content of the CSV file for the above example would look like this:

T,x,y
a,1,2
b,3,4

From there we go with the command line in R.

library(reshape)               # provides melt()

data <- read.table('wide.csv', # read CSV into data frame
    header=TRUE,               # first row holds the field names
    sep=",",                   # comma separated
    quote="\"",                # quotes for strings
    check.names=FALSE          # don't mess with the field names
)

data <- melt(data,             # melt data into long format
    id=c("T"),                 # this column is already 'long'
    variable_name="S"          # name of the to be 'longed' fields
)                              # (reshape2 calls this argument variable.name)

write.table(data,              # write data frame in data to CSV
    file="long.csv",           # output file
    sep=",",                   # comma separated
    quote=FALSE,               # no quotes, so the file matches the listing below
    row.names=FALSE            # don't put line numbers into the first col
)

And here’s the resulting content of ‘long.csv’ (the rows may come out in a different order, which does not matter for pivoting):

T,S,value
a,x,1
a,y,2
b,x,3
b,y,4
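
If you are curious what the melting step boils down to, here is a minimal sketch in plain R that builds the same long table by hand for the toy example. It is not how the reshape package does it internally, just an illustration; the variable names wide, measure and long are my own picks, and the final sorting is only there so the rows line up with the listing above.

```r
# Toy wide table, same content as wide.csv in the example above.
wide <- data.frame(T = c("a", "b"), x = c(1, 3), y = c(2, 4),
                   stringsAsFactors = FALSE)

# Columns to be melted: everything except the id column "T".
measure <- setdiff(names(wide), "T")

# One block of rows per melted column, stacked on top of each other.
long <- data.frame(
    T     = rep(wide$T, times = length(measure)),     # id repeated per block
    S     = rep(measure, each = nrow(wide)),          # former column name
    value = unlist(wide[measure], use.names = FALSE), # the cell values
    stringsAsFactors = FALSE
)

# Sort by T and S so the rows match the listing of long.csv.
long <- long[order(long$T, long$S), ]
rownames(long) <- NULL
print(long)
```

Every cell of the wide table ends up as one row, labelled by its former row name (T) and its former column name (S).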

Now, after reimporting it into Excel, Calc or whatever, the merry pivoting can start! I am going to come back to pivoting as a technology several times on this blog. One of the next articles will be about how to use the pivoting feature in LibreOffice’s spreadsheet program.
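
If you want a quick sanity check of the long table before leaving R, a rough stand-in for a pivot table is base R’s tapply. The following is only a sketch on the toy data, not a replacement for the spreadsheet workflow; the names long, cross and per_T are made up for this example.

```r
# The long toy table, typed in directly so the snippet stands on its own.
long <- data.frame(
    T     = c("a", "a", "b", "b"),
    S     = c("x", "y", "x", "y"),
    value = c(1, 2, 3, 4),
    stringsAsFactors = FALSE
)

# Pivot with T in the rows and S in the columns: this rebuilds the cross table.
cross <- tapply(long$value, list(T = long$T, S = long$S), sum)

# Pivot with only T in the rows: the values of x and y are summed per T.
per_T <- tapply(long$value, long$T, sum)

print(cross)
print(per_T)
```

The first call rebuilds the original wide table; the second collapses the S dimension, which is the kind of regrouping a pivot table does with a few mouse clicks.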

Dealing with four-dimensional cross tables

Maybe you noticed that the table on causes of death is not two- but four-dimensional. As a matter of fact, just a few modifications to the above script are needed to handle that case as well. This is going to be covered in a separate article that will follow soon.
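
Until then, here is a rough sketch of the general idea in plain R, under assumptions of my own: two id columns (year and cause) and value columns whose names encode gender and age group joined by an underscore, like "f_20-24". The real GENESIS table is labelled differently, and all names and numbers below are invented for illustration.

```r
# Hypothetical wide table: two id columns plus columns named "<gender>_<age group>".
wide <- data.frame(
    year      = c(1980, 1981),
    cause     = c("C1", "C1"),
    `f_20-24` = c(10, 11),
    `m_20-24` = c(20, 21),
    check.names = FALSE,          # keep the column names exactly as written
    stringsAsFactors = FALSE
)

ids     <- c("year", "cause")
measure <- setdiff(names(wide), ids)

# Repeat the id columns once per melted column and stack the values.
long <- data.frame(
    wide[rep(seq_len(nrow(wide)), times = length(measure)), ids],
    key   = rep(measure, each = nrow(wide)),
    value = unlist(wide[measure], use.names = FALSE),
    stringsAsFactors = FALSE
)

# Split the former column name into its two dimensions.
parts       <- strsplit(long$key, "_", fixed = TRUE)
long$gender <- sapply(parts, `[`, 1)
long$age    <- sapply(parts, `[`, 2)
long$key    <- NULL
rownames(long) <- NULL
print(long)
```

The result has one column per dimension (year, cause, gender, age) plus the value column, which is the shape a pivoting tool expects.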

(original article published on www.joyofdata.de)