In fits and starts, over the past month I’ve been getting to grips with an exciting new Oyster Card data set from TfL and the wonderfully supportive Andrew Gaitskell, their resident Oyster Card data expert. For those few of you who live under a rock or have been unable to visit London: the Oyster Card is a contact-less electronic ticketing system for entering and exiting the Tube, Docklands Light Railway, Overground, and Rail (I’ll leave the Bus for another post since that turns out to be a much more difficult proposition analytically). Every time a user ‘taps in’ or ‘taps out’ of a given mode of transit this leaves a trace in TfL’s usage database, so this is a very different type of data set from the scheduled activities visualised by my colleague Joan Serras here.
Anyway, thanks to a research partnership established a few months ago we’ve got limited access to this data (i.e. I can’t see any information about the user of the card and, as an additional layer of protection, the card number has been encrypted). What makes this data particularly exciting to us urban planners/researchers is that we have both entries and exits. I never thought I’d find having to fish my Oyster Card out of my pocket on the way out of the Tube to be a ‘plus’, but as a researcher it means that I can assemble a series of discrete trip segments into a single journey. For example, using this data we can tell that user X started their day on a train to Euston from Watford, jumped on the Victoria Line when they reached London, and then exited in Green Park. Suddenly, we’ve gone from only being able to say that y commuters in Watford work in London to being able to talk (and think) about the distribution of such commuters across Central London.
[Note: I’ve reordered this post to bring the visualisations nearer to the start of the post, if you’d like to get more background detail then that’s now near the bottom.]
The visualisations below represent flows in and out of every Oyster-enabled station in Central London at several points in time over the course of a weekday (a Monday in November) and a weekend day (a Sunday in November). Each figure shows the totals over a 10-minute interval starting at (respectively) 8am, 1pm, and 6pm, and for comparative purposes, I’ve placed the Sunday and Monday visualisations next to each other.
The size of the pie chart corresponds to the total volume of people passing through the station (I’ve grouped National Rail and Tube stations of the same name, e.g. Waterloo, Kings Cross, …, into one super-total), scaled against the maximum flow from across the two-day sample, which was roughly 17,000. And yes, that total is in the space of just 10 minutes! I couldn’t think of any ‘natural’ way to differentiate between entries and exits, so I decided to follow Ollie’s lead and use red for entry (think: red is ‘stop walking’) and green for exit (think: green is ‘go about your business’). Ollie is actually working on a similar (but not the same!) project right now that you’ll be seeing shortly at the London Transport Museum, and I don’t think I’m stealing any of his thunder here since he’s taken on the rather more difficult challenge of doing this type of mapping entirely within a web browser and using a dynamic, aggregated feed that could even be done in real-time.
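For anyone wanting to try something similar, the aggregation described above can be sketched roughly as follows. This is an illustration in Python, not the code behind these figures: the record layout, the function names, the sample date, and the choice to scale pie *area* (rather than radius) against the maximum flow are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime
import math

BIN_MINUTES = 10    # interval length used in the post
MAX_FLOW = 17_000   # approximate maximum 10-minute flow quoted in the post


def bin_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 10-minute interval."""
    return ts.replace(minute=ts.minute - ts.minute % BIN_MINUTES,
                      second=0, microsecond=0)


def aggregate(taps):
    """Count entries and exits per (station, interval).

    `taps` is an iterable of (station, timestamp, kind) tuples, where kind
    is 'entry' or 'exit' (a hypothetical layout). Station names are assumed
    to have been merged already, e.g. Rail and Tube 'Waterloo' as one name.
    """
    totals = defaultdict(lambda: {"entry": 0, "exit": 0})
    for station, ts, kind in taps:
        totals[(station, bin_start(ts))][kind] += 1
    return dict(totals)


def pie_radius(total, max_total=MAX_FLOW, max_radius=40.0):
    """Radius such that pie area is proportional to flow (an assumption)."""
    return max_radius * math.sqrt(total / max_total)


# Invented records for illustration only.
sample = [
    ("Waterloo", datetime(2010, 11, 15, 8, 3), "entry"),
    ("Waterloo", datetime(2010, 11, 15, 8, 9, 59), "exit"),
    ("Waterloo", datetime(2010, 11, 15, 8, 10), "entry"),
]
counts = aggregate(sample)
```

Scaling by area rather than radius keeps a station with four times the flow from looking sixteen times as big, but either choice would be defensible.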
[Figures: Sunday and Monday maps shown side by side for each time of day (8am, 1pm, and 6pm).]
So do these visualisations pass the common sense test? My feeling so far is ‘yes’ in that they capture broad-brush dynamics at work in London:
- Life on Sunday clearly gets off to a much later start than it does on the following Monday.
- Weekday peaks between 8am and 9am, and between 5pm and 6pm are self-evident and differ from Sunday’s usage.
What’s more interesting to me is the radical difference in entry/exit patterns over the course of the day on Monday: at 8am we have far more entries than exits on a station-by-station basis, except at the core stations in Central London; at 6pm the balance reverses, with more entries than exits in Central London and many more exits than entries elsewhere around the region. Interestingly, what becomes obvious in the interactive version of this project (done in Processing) is that the ‘peak’ evening load lasts much longer than the equivalent A.M. rush hour: load levels peak at 6pm, but remain quite high until well after 10pm at the most-used stations in the data set. Note too that at 1pm the system is broadly in balance, something that it superficially appears to have in common with Sunday usage across much of the day. This general behaviour also seems to be largely in line with Ollie’s findings from the Barclays bike scheme.
I’m hoping to put together a video of this activity so that you can watch the entire system in action, since there are interesting deviations from expected behaviour that merit further attention, such as an early-morning fillip at Heathrow Airport of passengers departing the country (and we might predict that they will be London residents since they are more likely to have Oyster Cards). But for the time being this should give some sense of what the system looks like at the coarsest level. In a future post I hope to start reporting results from assembling the tap-ins/tap-outs into end-to-end journeys that will give us a sense of the principal axes along which commuters and visitors travel.
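To give a flavour of what assembling tap-ins and tap-outs into journeys might involve, here is a minimal sketch in Python. It is not the method used in this project: the record layout, the function name `assemble_journeys`, and the 20-minute interchange threshold are hypothetical, and real data would need handling for missing taps and other quirks.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical threshold: a new tap-in this soon after a tap-out is
# treated as an interchange within the same journey.
MAX_INTERCHANGE = timedelta(minutes=20)


def assemble_journeys(taps, max_gap=MAX_INTERCHANGE):
    """Chain entry/exit pairs into journeys for each (encrypted) card ID.

    `taps` is an iterable of (card_id, timestamp, station, kind) tuples,
    with kind 'entry' or 'exit'. Returns {card_id: [journey, ...]}, where
    each journey is a list of (origin, destination) segments.
    """
    by_card = defaultdict(list)
    for card_id, ts, station, kind in taps:
        by_card[card_id].append((ts, station, kind))

    journeys = {}
    for card_id, events in by_card.items():
        events.sort()              # chronological order per card
        card_journeys = []
        open_entry = None          # (timestamp, station) awaiting an exit
        last_exit_ts = None
        for ts, station, kind in events:
            if kind == "entry":
                open_entry = (ts, station)
            elif kind == "exit" and open_entry is not None:
                segment = (open_entry[1], station)
                continues = (last_exit_ts is not None
                             and open_entry[0] - last_exit_ts <= max_gap)
                if card_journeys and continues:
                    card_journeys[-1].append(segment)
                else:
                    card_journeys.append([segment])
                last_exit_ts = ts
                open_entry = None
        journeys[card_id] = card_journeys
    return journeys


# Invented records loosely following the Watford example earlier on.
day = datetime(2010, 11, 15)
taps = [
    ("X", day.replace(hour=7, minute=30), "Watford", "entry"),
    ("X", day.replace(hour=8, minute=5), "Euston", "exit"),
    ("X", day.replace(hour=8, minute=9), "Euston", "entry"),
    ("X", day.replace(hour=8, minute=20), "Green Park", "exit"),
    ("X", day.replace(hour=18, minute=0), "Green Park", "entry"),
    ("X", day.replace(hour=18, minute=40), "Watford", "exit"),
]
result = assemble_journeys(taps)
```

With these invented records, the morning's two segments are chained into one journey and the evening return becomes a second journey, because the gap between them is far longer than the interchange threshold.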
To my knowledge, the only work that has ever really tried to tackle this type of aggregate urban movement analysis head-on is Goddard’s Functional Regions within the City Centre: a Study by Factor Analysis of Taxi Flows in Central London (1970, Transactions of the Institute of British Geographers, pp.161-182). Goddard used logs collected from taxi drivers to derive ‘functional regions’ within Central London, showing the principal Origins/Destinations (O/D) for taxis (especially those picked up or dropped off at mainline stations) and highlighting the ‘neighbourhoods’ that these links implied.
As you might expect, there are a lot of tap-ins and tap-outs in the course of a week: I’m looking at 14 million records per weekday, and 6 million records per weekend day. And that’s after cleaning out the irrelevant or corrupted records. So the numbers can add up pretty quickly, especially since, ideally, I want to examine at least a full month’s usage. What we’ve found from previous work with telecommunications is that several weeks’ worth of data should give us a pretty good baseline for what average system behaviour looks like and against which we can test for ‘unusual’ behaviour on a node or link in the network.
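One simple way to think about testing against a baseline is sketched below in Python. The z-score approach, the threshold of 3, the key layout, and all of the numbers are assumptions chosen for illustration; this is not necessarily the test used in the telecommunications work or planned for this data.

```python
from statistics import mean, stdev


def baseline(history):
    """Mean and sample standard deviation for each key.

    `history` maps a key, e.g. (station, time-of-week slot), to a list of
    counts observed for that slot in previous weeks (hypothetical layout).
    Keys with fewer than two observations are skipped.
    """
    return {key: (mean(counts), stdev(counts))
            for key, counts in history.items() if len(counts) >= 2}


def unusual(observed, base, threshold=3.0):
    """Return {key: z_score} where the observed count deviates strongly."""
    flagged = {}
    for key, count in observed.items():
        if key not in base:
            continue
        mu, sigma = base[key]
        if sigma == 0:
            continue  # no variation in the baseline, z-score undefined
        z = (count - mu) / sigma
        if abs(z) >= threshold:
            flagged[key] = z
    return flagged


# Invented counts for illustration only.
history = {
    ("Heathrow", "Mon 05:00"): [100, 110, 90, 100],
    ("Waterloo", "Mon 08:00"): [15000, 15500, 14500, 15000],
}
this_week = {
    ("Heathrow", "Mon 05:00"): 200,
    ("Waterloo", "Mon 08:00"): 15200,
}
flagged = unusual(this_week, baseline(history))
```

Here only the invented Heathrow slot is flagged, since its count sits many standard deviations above its own history while the Waterloo count is well within its usual range.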
For now, however, I’m just looking at the first couple of days in order to see if the results I’m getting feel remotely right, and the visualisations above are the first fruits of that process. The common sense test for big data sets (i.e. do the simplest summary results come at all close to fitting with what a rational human being would expect to see?) is surprisingly useful: most of us have some sense of what the overall system should look like when visualised, and the collision between our expectations and what the data are telling us is happening is often the first sign that something has gone wrong in your processing. I point this issue out because there are no data feeds without their quirks, and very ‘exciting’ early results are almost inevitably the product of a processing mistake.
For that same reason it’s often useful to capture debugging output (and to output as much of it from your processing scripts as you dare) as you go so that it is easy to look back through the log to see what might have happened. I’m a big fan of Perl for doing this type of work since its text-handling abilities are unmatched and it’s the quintessential ‘glue’ language: if you can’t do it directly in Perl, you can always use someone else’s module from CPAN to interface with the target language or application. The much bigger issue is storing the data since, ultimately, we’ll need the ability to query and aggregate something on the order of 330 million records. For that, I’m able to get away with using partitioned tables in MySQL and the performance (even on my iMac) is acceptable well into the billions; the only hitch is the indexing: once your indexes exceed your computer’s RAM, lookups have to go to disk and performance will tank.
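As an illustration of the "log as you go" habit, here is a minimal sketch of a cleaning step that records every rejected line. It is written in Python rather than Perl purely for the example, and the comma-separated layout, field count, and function name `clean` are hypothetical.

```python
import logging

logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("oyster_clean")

# Hypothetical layout: card_id, timestamp, station, kind
EXPECTED_FIELDS = 4
VALID_KINDS = ("entry", "exit")


def clean(lines):
    """Split raw lines into fields, logging each record that is dropped.

    Returns (kept_records, rejected_count).
    """
    kept, rejected = [], 0
    for lineno, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(",")
        if len(fields) != EXPECTED_FIELDS or fields[3] not in VALID_KINDS:
            rejected += 1
            # Keep the offending line in the log for later inspection.
            log.debug("line %d rejected: %r", lineno, line)
            continue
        kept.append(fields)
    log.info("kept %d records, rejected %d", len(kept), rejected)
    return kept, rejected


# Invented input for illustration only.
raw = [
    "a1,2010-11-15 08:03:00,Waterloo,entry\n",
    "garbage\n",
    "b2,2010-11-15 08:04:00,Bank,exit",
    "c3,2010-11-15 08:05:00,Bank,topup",
]
kept, rejected = clean(raw)
```

The summary line at the end is the kind of thing worth comparing against expectations: a sudden jump in the rejected count is an early hint that the feed or the parsing has changed.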