Thoughts on social media interactions from conversations at JISCexpo

I’ve just come back from the JISCexpo end-of-programme meeting in Manchester — I’d attended as a developer on the Open Citations Project. While there I met some of the University of Lincoln’s web team, and it was interesting to see how they were using the web to interact with current and prospective students.

Take a look at their main Twitter account:

Screenshot of the University of Lincoln’s Twitter page.

They’re being a lot more interactive than I’ve seen elsewhere within HE. They’re actually responding to queries and pointing people in the right direction for further information. On the other hand, most of what we do is regurgitate RSS feeds.

This got me wondering whether we should strive to be using social media in a more bi-directional fashion. I’m not saying that it should be to the detriment of publishing news articles — one could mix them on the same account or have both news and conversational accounts.

Appearing willing to respond to tweets would give a “beneficial air of friendliness”, which could translate into “conversions” and new opportunities. The people who manage these accounts have a lot of knowledge about their University, department or college that they’re likely not going to think to share until asked, but which would be useful to a lot of people on the Internet.

Relatedly, one of their students had put together three videos[1, 2, 3] which he placed on YouTube and labelled as “banned adverts”. These have racked up about 2 million views between them.

Seeing these, their press people commissioned him to produce a clearing advert, which has garnered them even more (positive) publicity. It’s great to see them innovating in finding ways to generate interest among potential students and the wider Internet.


High memory usage when using Python’s tarfile module

The Python tarfile module is a handy way to access files within tar archives without needing to unpack them first. You can iterate over files using the following pattern:

import tarfile

tar = tarfile.open(filename, 'r:gz')
for tar_info in tar:                  # tar_info is the metadata for a
                                      #   member of the archive.
    file = tar.extractfile(tar_info)  # file is a file-like object, or None
    if file is None:                  #   for members that aren't regular
        continue                      #   files (e.g. directories).
    for line in file:                 # We can do standard file-like
        print line,                   #   things.

Behind the scenes, each TarFile object maintains a list of members of the archive, and keeps this updated whenever you read or write members. This is fine for small archives, particularly if you want to access the metadata without having to re-read the archive. (TarFile objects have getmember, getmembers, and getnames methods for this kind of access.)
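For instance, here’s a quick sketch of that kind of metadata access (the archive name and member path are placeholders):

import tarfile

tar = tarfile.open('archive.tar.gz', 'r:gz')
print tar.getnames()                      # names of every member
tar_info = tar.getmember('docs/README')   # metadata for one member, by name
print tar_info.size, tar_info.mtime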

This list of members contains the TarInfo objects for every file in the archive. When you’ve got an archive with 18 million members (as I have), this list can no longer fit in memory. It’s not documented (as far as I can tell), but the solution is to periodically set the members attribute on the TarFile object to the empty list:

import tarfile

tar = tarfile.open(filename, 'r:gz')
for tar_info in tar:
    file = tar.extractfile(tar_info)
    do_something_with(file)
    tar.members = []

Obviously one loses the metadata-access functionality described above, but hopefully now my scripts will terminate in reasonable time!


The PRONOM vocabulary

The National Archives are currently preparing a vocabulary specification for describing file formats that appear in digital repositories, targeted at the field of digital preservation. You can find the current version linked from their blog post.

I don’t know much about digital preservation, but I have a number of concerns, both about the scope of their effort, and on a few technicalities.

Scope

The vocabulary provides a way of describing file formats, with the aim that each repository will maintain its own collection of file format descriptions. This could well be unfortunate if each describes the same formats in subtly different – and possibly incompatible – ways, using locally-minted URIs. We end up with a co-reference problem, as there won’t be any easy ways to link my definition of PDF to your definition of PDF.

I imagine it would be much better in the long run were the National Archives able to maintain their own taxonomy of file formats that individual repositories could link to. The linking could be as simple as matching on internet media types, or using file to determine file contents. It would also give convenient hooks for subclassing were one to wish to say “a plain text file encoded in latin1”.

As far as I can tell they don’t seem to have considered hierarchies of file formats, composite file formats, or container file formats. As examples, PDF/A is a narrower format than PDF, .tar.gz is a container format composed with a compression format, and Matroska is a container format for audio and video streams. Saying “it’s an MKV file” doesn’t actually tell you much about what you need to decode it.

Technicalities

Here are some specific issues that are likely easily addressed:

  • The specification is currently available as a PDF. It would be great if this could be made available as an OWL document.
  • The namespace URI is given as http://reference.data.gov.uk/technical-registry/ with a suggested prefix of pronom. Should the URI be more specific?
  • The definitions do not consistently give domains and ranges for properties.
  • The specification alludes to MIME types. These references should really be to their successor, Internet Media Types.
  • The datatypes on the releaseDate and withdrawnDate properties are missing a ‘#’ on the end of the XSD namespace part.
  • The example demonstrates use of dc:description. As far as I am aware this has been deprecated in favour of dcterms:description.
  • There’s a confusion between instances, subclassing and concept schemes. The example describes something that is both an instance of rdfs:Class and pronom:file-format, making it both a class and an instance, something generally discouraged. This technically also makes pronom:file-format a metaclass, which is fun, but probably not what was intended.
    If formats are classes, then they should subclass one another; if they are instances they should probably be part of a skos:ConceptScheme and be linked by skos:narrower and skos:broader.
  • “URIs” like pronom:Image_(Vector) cannot be expressed as QNames in Turtle or SPARQL as parentheses cannot appear in local parts. This will make serialization and querying more verbose than it needs to be. URIs that might percent-encode or -decode to something different are also probably best avoided.
  • There is a convention that class names be capitalised as per ClassName and property names as per propertyName (unless there is a more specific convention, as e.g. VCard). The property and class names in the specification do not follow this convention. Image_(Raster) should probably be RasterImage.
  • Section 4.4 provides “miscellaneous resources”, but each is labelled a property. These are really instances, and should have a class associated with them (e.g. “Endianness”).

Using logging to debug what suds is sending across the wire

suds, a SOAP client for Python, makes good use of Python’s logging module, which makes it easy to work out what it’s sending across the wire.

I’d previously mentioned hacking suds to use an urllib2 handler with debugging enabled. Modifying libraries systemwide, while it works, isn’t the prettiest way to work out what’s happening, and so we should turn to the logging framework.

The logging module contains a basicConfig function, which we could invoke thus:

import logging

logging.basicConfig(level=logging.DEBUG)

However, suds chucks out a lot of stuff at this level, the majority of which we’re not interested in. From observing this (or reading the code) we note that we’re really only interested in the suds.transport.http logger. Hence we attach a handler to just this logger:

import logging
import sys

handler = logging.StreamHandler(sys.stderr)
handler.setLevel(logging.DEBUG)
logger = logging.getLogger('suds.transport.http')
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)

Furthermore, if we’re only interested in outgoing messages, we can attach a filter:

class OutgoingFilter(logging.Filter):
    def filter(self, record):
        return record.msg.startswith('sending:')

handler.addFilter(OutgoingFilter())

We now get lots of messages like the following spewed out on stderr, perfect for working out whether your code is constructing the right SOAP requests!

sending:
URL:https://nexus.ox.ac.uk/EWS/exchange.asmx
HEADERS: {'SOAPAction': u'"http://schemas.microsoft.com/exchange/services/2006/messages/GetUserOofSettings"', 'Content-Type': 'text/xml; charset=utf-8', 'Content-type': 'text/xml; charset=utf-8', 'Soapaction': u'"http://schemas.microsoft.com/exchange/services/2006/messages/GetUserOofSettings"'}
MESSAGE:
<SOAP-ENV:Envelope xmlns:ns0="http://schemas.microsoft.com/exchange/services/2006/types" xmlns:ns1="http://schemas.microsoft.com/exchange/services/2006/messages" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
   <SOAP-ENV:Header/>
   <SOAP-ENV:Body>
      <ns1:GetUserOofSettingsRequest>
         <Mailbox xmlns="http://schemas.microsoft.com/exchange/services/2006/types">
            <ns0:Address>firstname.lastname@unit.ox.ac.uk</ns0:Address>
         </Mailbox>
      </ns1:GetUserOofSettingsRequest>
   </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

Exchange Web Services, suds, and Python

This post is out of date. Please see this new post about Exchange 2010.

I’d quite like to be able to interact with the University’s Exchange 2007 instance (Nexus) programmatically using suds (a Python SOAP client) and Exchange Web Services (EWS), but it’s not straightforward to get it set up. After a couple of hours of debugging, I’ve worked out what was going wrong.

This walk-through uses Alex Koshelev’s EWS-specific fork of suds, which you can grab from BitBucket. You will need to apply the patch attached to this ticket, which we’ll explain later. For the final code, jump straight to the bottom.

Authentication

Exchange Web Services is a SOAP-based API for interacting with an Exchange instance. Its WSDL definition is generally available at http://exchange.example.org/EWS/Services.wsdl. In Oxford’s case, it is here.

Exchange requires you to authenticate before you can access the WSDL, which you can do with the following code:

from suds.transport.https import HttpAuthenticated
from suds.client import Client

transport = HttpAuthenticated(username='abcd0123',
                              password='secret')
client = Client("https://nexus.ox.ac.uk/EWS/Services.wsdl",
                transport=transport)

This works fine until you try to call a method:

print client.service.GetUserOofSettings()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    print client.service.GetUserOofSettings()
  File "/usr/lib/python2.7/site-packages/suds/client.py", line 537, in __call__
    return client.invoke(args, kwargs)
  File "/usr/lib/python2.7/site-packages/suds/client.py", line 597, in invoke
    result = self.send(msg)
  File "/usr/lib/python2.7/site-packages/suds/client.py", line 635, in send
    result = self.failed(binding, e)
  File "/usr/lib/python2.7/site-packages/suds/client.py", line 700, in failed
    raise Exception((status, reason))
Exception: (401, u'Unauthorized ( The server requires authorization to fulfill the request. Access to the Web server is denied. Contact the server administrator.  )')

But we’ve already given it credentials! By turning on debugging in urllib2 (hacking suds.transport.https.HttpAuthenticated to add a handler as explained by Jamie Grove) we realise that each request is being sent without authentication, receiving a 401, and then (for GET requests, such as for the WSDL) being resent with authentication. However, when we POST our GetUserOofSettings() request it doesn’t resend it, probably because doing so is forbidden by the HTTP spec.
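If you want to watch this behaviour yourself, the mechanism involved is urllib2’s debuglevel. A minimal sketch of turning it on for the default opener (not the exact suds hack, which patches the transport’s own opener) looks like this:

import urllib2

# Handlers built with debuglevel=1 print raw HTTP requests and
# responses to stdout, making the 401-then-retry dance visible.
opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1),
                              urllib2.HTTPSHandler(debuglevel=1))
urllib2.install_opener(opener)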

To get round this, we need to make sure the client sends an authenticated request the first time round by specifying the URL and realm explicitly. Let’s define our own transport, modelled on suds.transport.https.HttpAuthenticated:

import urllib2
from suds.transport.http import HttpTransport
from suds.client import Client

class Transport(HttpTransport):
    def __init__(self, **kwargs):
        realm, uri = kwargs.pop('realm'), kwargs.pop('uri')
        HttpTransport.__init__(self, **kwargs)
        self.handler = urllib2.HTTPBasicAuthHandler()
        self.handler.add_password(realm=realm,
                                  user=self.options.username,
                                  passwd=self.options.password,
                                  uri=uri)
        self.urlopener = urllib2.build_opener(self.handler)

transport = Transport(realm='nexus.ox.ac.uk',
                      uri='https://nexus.ox.ac.uk/',
                      username='abcd0123',
                      password='secret')
client = Client('https://nexus.ox.ac.uk/EWS/Services.wsdl',
                transport=transport)

Using this, we get the following much more promising response, complaining about a missing argument, not missing authentication:

suds.WebFault: Server raised fault: ‘The request failed schema validation: The element ‘GetUserOofSettingsRequest’ in namespace ‘http://schemas.microsoft.com/exchange/services/2006/messages’ has incomplete content. List of possible elements expected: ‘Mailbox’ in namespace ‘http://schemas.microsoft.com/exchange/services/2006/types’.’

Namespace issues

Without the patch mentioned earlier in the article, we might attempt the following:

address = client.factory.create('ns1:EmailAddress')
address.Address = 'firstname.lastname@unit.ox.ac.uk'

client.service.GetUserOofSettings(address)

However, this fails:

suds.WebFault: Server raised fault: ‘The request failed schema validation: The element ‘GetUserOofSettingsRequest’ in namespace ‘http://schemas.microsoft.com/exchange/services/2006/messages’ has invalid child element ‘Mailbox’ in namespace ‘http://schemas.microsoft.com/exchange/services/2006/messages’. List of possible elements expected: ‘Mailbox’ in namespace ‘http://schemas.microsoft.com/exchange/services/2006/types’.’

Configuring logging (import logging; logging.basicConfig(level=logging.INFO)) reveals that suds is sending the following:

<?xml version="1.0" encoding="UTF-8"?>
<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
   <Header/>
   <Body>
      <GetUserOofSettingsRequest xmlns="http://schemas.microsoft.com/exchange/services/2006/messages">
         <Mailbox>
            <Address xmlns="http://schemas.microsoft.com/exchange/services/2006/types">firstname.lastname@unit.ox.ac.uk</Address>
         </Mailbox>
      </GetUserOofSettingsRequest>
   </Body>
</Envelope>

It should look like the example in the EWS 2007 documentation, but suds has stuck Mailbox in the messages namespace, not the types one. Applying the patch mentioned above resolves this issue.

Summary

So, grab suds-ews, apply the patch, and use the following code:

import urllib2

from suds.client import Client
from suds.transport.http import HttpTransport

class Transport(HttpTransport):
    def __init__(self, **kwargs):
        realm, uri = kwargs.pop('realm'), kwargs.pop('uri')
        HttpTransport.__init__(self, **kwargs)
        self.handler = urllib2.HTTPBasicAuthHandler()
        self.handler.add_password(realm=realm,
                                  user=self.options.username,
                                  passwd=self.options.password,
                                  uri=uri)
        self.urlopener = urllib2.build_opener(self.handler)


transport = Transport(realm='nexus.ox.ac.uk',
                      uri='https://nexus.ox.ac.uk/',
                      username='abcd0123',
                      password='secret')
client = Client('https://nexus.ox.ac.uk/EWS/Services.wsdl',
                transport=transport)

address = client.factory.create('ns1:EmailAddress')
address.Address = 'firstname.lastname@unit.ox.ac.uk'

response = client.service.GetUserOofSettings(address)
print response

In my case this prints the following:

(reply){
   GetUserOofSettingsResponse = 
      (GetUserOofSettingsResponse){
         ResponseMessage = 
            (ResponseMessageType){
               _ResponseClass = "Success"
               ResponseCode = "NoError"
            }
         OofSettings = 
            (UserOofSettings){
               OofState = "Enabled"
               ExternalAudience = "None"
               Duration = 
                  (Duration){
                     StartTime = 2011-05-14 14:00:00
                     EndTime = 2011-05-15 14:00:00
                  }
               InternalReply = 
                  (ReplyBody){
                     Message = "[snip]"
                  }
               ExternalReply = 
                  (ReplyBody){
                     Message = None
                  }
            }
         AllowExternalOof = "All"
      }
 }

The internal reply fragment can be pulled out using response.GetUserOofSettingsResponse.OofSettings.InternalReply.Message.

And we’re done! You may find the API documentation useful when working out what methods are available to play with.


Live bus locations from ACIS/OxonTime

I’ve spent the day investigating how to get the raw data behind OxonTime’s live bus locations. The idea was that it could at some point be integrated into Mobile Oxford’s transport offering. Here’s how it works.

Please be aware that what follows is the result of a purely academic investigation, and that before using this information it may be worth contacting Oxfordshire County Council to discuss your plans, particularly if you’re going to distribute the results of your endeavours.

Performing a request

By monitoring the HTTP requests made by the applet we can deduce that it fetches data from a particular resource with parameters provided in the querystring. The resource is located at http://oxfordshire.acislive.com/pda/mainfeed.asp, and takes the following parameters:

type
One of STOPS, INIT or STATUS. The first retrieves a list of stops and their locations for a given area. The other two both return a list of buses, with STATUS being used for updates.
maplevel
I believe this only accepts the values 0 through 3, with nothing returned for 0 and 1. The results for 2 and 3 seem identical, so there seems little point in varying it.
SessionID
This seems to be a misnomer as it specifies the area you want to enquire about. We’ll explain how to convert to and from these numbers later.
systemid
Always 35 as it gets unhappy if you change it.
stopSelected
This expects an ATCO code yet seems to be ignored, so you may as well leave it as 34000000701.

vehicles
A comma-separated list of vehicle identifiers you currently believe to be in the area you’re enquiring about. This is only passed when type=STATUS, and lets you find out when the given buses have left the area.

Format of responses

The response in each case is a pipe-delimited list of values, with the first being the action the client should perform in updating its state. The actions you may expect are:

STOP
Gives a list of stop locations in the area. Nearby stops are collected into one line.
NEW
This signifies that a bus is to be found at the given location.
DEL
These only appear when type=STATUS and signify that a bus is no longer at the location. If the bus is still within the requested area there will be a subsequent corresponding NEW action to give its new location. If it has left there will be no such NEW action. The buses that appear here will be a subset of those provided in the vehicles parameter of the querystring.

Now we know the form of the response, let’s see the values returned for each action. We’ll start with STOP.

STOP actions

A STOP action has the form:

STOP|35|naptan-codes|x,y|stop-names|stop-bearings|count

Here’s an example:

STOP|35|693123456^693234567^693345678|12,413|Banbury Road B1^Banbury Road B2^Banbury Road B3|172^179^340|3

As mentioned earlier, stops that are close to one another (e.g. on opposite sides of the road, or a ‘lettered’ group) are collected together, with count giving the number of stops at this location.

naptan-codes, stop-names and stop-bearings are each caret-delimited lists with fairly obvious contents. stop-bearings are given in clockwise degrees from grid north.

x and y give pixel offsets from the top-left corner of the displayed area. More on this later.

I have no idea what the 35 signifies, and currently assume it’s something to do with systemid.

A note on bus stop identifiers

The NaPTAN (National Public Transport Access Nodes) database provides two classes of identifiers, ATCO codes and NaPTAN codes. ATCO codes are up to 12 characters in length, whereas NaPTAN codes consist of nine digits. OxonTime predominantly exposes the latter; these are the numbers beginning ‘693’ displayed at bus stops. However, ATCO codes are used for the stopSelected parameter and are accepted elsewhere in place of their equivalent NaPTAN codes.

The NaPTAN database is currently maintained under license from the Department for Transport by Thales. Access requires a license, which may come with a fee for commercial use. However, you may be interested to note that NaPTAN data is finding its way into OpenStreetMap.

NEW actions

These have the form:

NEW|identifier|orientation|service-name|Operators/common/bus/1|y,x

Here’s an example:

NEW|1024|4|X5|Operators/common/bus/1|45,302

identifier serves to keep track of buses between requests. I don’t know whether it has some further meaning outside of this API. orientation is an integer between 1 and 8 inclusive, corresponding to N, NE, E, SE, S, SW, W and NW respectively. service-name is the same as is used on the rest of the site (e.g. ‘S1’, ‘5’, ‘TUBE’). The next part seems constant and can probably be safely ignored. Finally the offsets are given, only this time with y first; I have no idea why.

DEL actions

These have the form:

DEL|identifier|

Here’s an example:

DEL|1024|

These identifiers match up with those given in NEW actions. Note the trailing pipe.

By periodically making requests with type=STATUS one can process the returned lines as a stream of commands describing how to update the local state. This makes client implementation easier as you are effectively applying a diff, as opposed to having to compare new to old.
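Putting the request parameters and the response format together, a minimal polling client might look like the following sketch. The map number 2560 comes from the table later in this post, and error handling is omitted:

import time
import urllib
import urllib2

BASE_URL = 'http://oxfordshire.acislive.com/pda/mainfeed.asp'

def fetch_actions(map_number, type_='INIT', vehicles=()):
    params = {'type': type_, 'maplevel': 3, 'SessionID': map_number,
              'systemid': 35, 'stopSelected': 34000000701}
    if type_ == 'STATUS':
        params['vehicles'] = ','.join(vehicles)
    response = urllib2.urlopen(BASE_URL + '?' + urllib.urlencode(params))
    return [line.rstrip('\r\n').split('|') for line in response]

buses = {}
while True:
    type_ = 'STATUS' if buses else 'INIT'
    for fields in fetch_actions(2560, type_, buses.keys()):
        if fields[0] == 'NEW':
            buses[fields[1]] = fields   # identifier -> latest sighting
        elif fields[0] == 'DEL':
            buses.pop(fields[1], None)
    time.sleep(30)                      # be gentle with the service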

The co-ordinate system

First off, I’d better give you a disclaimer. This API and its associated co-ordinate system is very specific to the applet that is its only intended client. As such, the co-ordinate system provides exactly what it needs and no more.

Locations are addressable using a combination of SessionID — hereafter known as map number for clarity — and an x and y pixel offset from the top-left corner of that tile.
The maps are each 418 pixels square, and are arranged in a grid aligned with the British National Grid. The important thing to note about this is that grid North is not the same as true North, and that if you intend to plot these things on (for example) a Google or OpenLayers map, you’ll need to get your projections right.

The maps seem to be numbered somewhat arbitrarily, as shown by this map of bus stops and their associated map numbers. Colours are based on a hash of the number of the map they appear on.

A map showing the relative locations of bus stops and the ACIS map numbers for each map.

These were found by requesting bus stops for all map numbers between 2500 and 3000, so are likely not complete. Predicting the map numbers for areas beyond these seems non-trivial or prone to error.

Doing the conversion

Edit: The following are for zoom-level 3 maps. Lower map numbers are at zoom-level 2 and cover nine times the area, so it probably makes more sense to retrieve and parse those.

To help you on your way, here is a Python dictionary which maps between map numbers and their top-left corners expressed as metres East and North of the SV square on the British National Grid. A pixel is equivalent to about 2.2×2.2 metres, as given by scale, making each map about 920 metres square.

map_numbers = {
    2507: (445998.30384769646, 204584.56186062991),
    2511: (446915.61022902746, 204584.49926297821),
    2512: (446913.71674095385, 205502.67433143588),
    2513: (446914.66775580525, 206418.97178315694),
    2516: (447831.1147245816, 205501.4561684193),
    2517: (447831.2934340348, 206419.14371744529),
    2520: (448749.2218840057, 205501.31610451243),
    2521: (448748.08603126393, 206418.72288576054),
    2522: (448749.21670514799, 207337.62523250855),
    2537: (448748.23059583915, 210084.58299463647),
    2538: (448747.09615575505, 211001.00119758685),
    2543: (450582.89109881676, 200918.25920371327),
    2545: (450582.58945117606, 202753.0387530663),
    2546: (450582.18354506971, 203668.85981883231),
    2548: (451497.38359153748, 201836.31023917606),
    2549: (451498.11202925985, 202752.26514351295),
    2550: (451499.38388189324, 203669.3077042877),
    2551: (452417.47236742743, 200918.93886234396),
    2552: (452416.98231942754, 201835.54957236422),
    2554: (452414.24108836311, 203668.59070562778),
    2557: (449664.20383959071, 206418.02172485131),
    2560: (450580.70403987489, 205501.67253490919),
    2561: (450580.96866136562, 206419.57452165778),
    2562: (450592.76460073108, 207336.75147627734),
    2563: (451499.11722530029, 204585.22058827255),
    2564: (451500.02145752736, 205501.68224598444),
    2565: (451499.00067467656, 206418.26238617001),
    2567: (452415.21798651741, 204584.90006837764),
    2568: (452415.5239759339, 205501.35499095405),
    2569: (452415.91032238252, 206418.74768511369),
    2570: (452415.68718924839, 207335.36782698822),
    2572: (449663.33610940503, 209167.13277178022),
    2573: (449665.39498988277, 210084.39421717002),
    2574: (449664.42323207163, 211001.82726484918),
    2575: (450582.29631623952, 208251.89590644254),
    2576: (450581.9053864188, 209169.67328096708),
    2577: (450583.29205249622, 210086.52104155198),
    2583: (452414.4291987372, 208251.72387988595),
    2584: (452415.09925185668, 209169.53539625759),
    2588: (453333.46479139361, 201834.4451865858),
    2589: (453331.95987896924, 202751.40410128961),
    2590: (453332.13594133727, 203668.53383136354),
    2593: (454247.50073518371, 202751.23971250441),
    2594: (454248.12170020386, 203668.51922434368),
    2597: (455165.54771764105, 202751.03839555787),
    2598: (455165.47211411892, 203669.19776463267),
    2603: (453331.21754538687, 204585.5174666637),
    2604: (453331.98750959791, 205502.05752572144),
    2605: (453331.24806580396, 206417.68972628826),
    2606: (453332.1148533299, 207336.05405913451),
    2607: (454249.16141248826, 204585.23652909664),
    2608: (454248.23316329095, 205502.58158632374),
    2609: (454248.31319587171, 206418.29606150393),
    2610: (454248.53242847562, 207334.54058771511),
    2612: (455166.59495257848, 205501.09410160859),
    2613: (455166.73990575952, 206417.83877819678),
    2614: (455163.93216277892, 207334.39865799341),
    2618: (456080.67203641078, 207334.0465145215),
    2619: (453333.25990631629, 208252.02178324395),
    2623: (454248.23726063699, 208251.90292474729),
    2627: (455165.28811999154, 208252.35041511542),
    2631: (456082.48076873791, 208252.31447046637),
    2733: (459748.48375305417, 189000.38567863638),
    2760: (462499.33169628482, 184416.6509955432),
    2761: (462498.71278643585, 185335.01518757801),
    2762: (463414.76846406935, 182586.06825647893),
    2769: (460666.45730665681, 189003.07510846868),
    2770: (461582.15014206048, 186251.43149620973),
    2772: (461583.17539895047, 188083.54467407736),
    2785: (464333.41855637298, 181667.60420131753),
    2795: (467079.15163364826, 179833.99098493406),
    2798: (464332.1937042338, 182585.51349055913),
    2844: (459748.72631959885, 191750.65753935973),
    2847: (456997.5453540723, 194500.33944191789),
    2848: (456999.14465004159, 195420.23667532994),
    2852: (457916.38047195785, 195419.96546529085),
    2854: (458831.49276305328, 193585.87604164047),
    2862: (457000.69281007705, 197252.53583802056),
    2878: (460665.66056860518, 189915.02235057726),
    2880: (460665.57930458471, 191750.16239531059),
    2882: (461581.82949289313, 189919.59644253476),
    2883: (461583.02792168478, 190835.35835896427),
    2884: (461581.33536609553, 191751.06046902022),
    2885: (461580.22909733956, 192669.40825026957),
    2887: (462497.66216840595, 190833.82408314376),
    2892: (463416.08698396623, 191750.39838604111),
    2993: (456998.56464994745, 207334.51643302356),
    2997: (457917.34275805362, 207333.4961082915)
}
scale = (2.2004998202088553, 2.1987468516915558)

These offsets were calculated by:

  • For each stop, finding the absolute position as given by the NaPTAN database
  • Instantiating two variables, Σoffset and Σposition, as null 2D-vectors
  • For each pair of stops appearing on the same map, adding the positive differences of their offsets within the map, and their absolute positions, to the respective variables
  • Dividing Σposition by Σoffset to give a conversion factor between pixels and metres (given above as scale)
  • For each map finding the average of each stop’s location minus its offset multiplied by the conversion factor (given above in map_numbers)
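In code, that calculation might look something like the following sketch. Here observations is an assumed structure mapping each map number to a list of ((x, y) pixel offset, (easting, northing)) pairs gathered from the feed and the NaPTAN database:

from itertools import combinations

sum_offset = [0.0, 0.0]
sum_position = [0.0, 0.0]
for pairs in observations.values():
    for (o1, p1), (o2, p2) in combinations(pairs, 2):
        for i in (0, 1):
            sum_offset[i] += abs(o1[i] - o2[i])
            sum_position[i] += abs(p1[i] - p2[i])

# Metres per pixel on each axis
scale = (sum_position[0] / sum_offset[0], sum_position[1] / sum_offset[1])

map_numbers = {}
for map_number, pairs in observations.items():
    # y offsets increase downwards, so the top-left corner lies north
    # of each stop
    corners = [(p[0] - o[0] * scale[0], p[1] + o[1] * scale[1])
               for o, p in pairs]
    map_numbers[map_number] = (sum(c[0] for c in corners) / len(corners),
                               sum(c[1] for c in corners) / len(corners))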

Here’s a bit of Python to convert between a map number and x and y offsets, and the WGS84 co-ordinate system. I’ve cheated a little in using Django’s GIS functionality which in turn uses the ctypes module to call functions from the GEOS library. If you’re not using Python or don’t want such a large dependency then you may wish to read the documentation linked from this page on the Ordnance Survey website.

from django.contrib.gis.geos import Point

def to_wgs84(map_number, rel_pos):
    """
    Takes an ACIS map number and a two-tuple specifying the offset on that
    map. Returns a Point object under the WGS84 projection.
    """
    corner = map_numbers[map_number]
    # Differing signs as we're applying a left-down offset to a left-up position
    pos = (
        corner[0] + rel_pos[0] * scale[0],
        corner[1] - rel_pos[1] * scale[1],
    )

    # 27700 is BNG; 4326 is WGS84
    return Point(pos, srid=27700).transform(4326, clone=True)

def from_wgs84(point):
    """
    Takes a Point object under any projection and returns a map number and
    two-tuple for that point. Raises ValueError if the point does not lie on
    any maps we know about.
    """
    # Make sure we're using the British National Grid
    pos = point.transform(27700, clone=True)
    for map_number, corner in map_numbers.items():
        rel_pos = (
            (pos.x - corner[0]) / scale[0],
            (corner[1] - pos.y) / scale[1],
        )
        # This is the right map if it appears in the 418 pixel square to the
        # lower-right of the corner.
        if 0 <= rel_pos[0] < 418 and 0 <= rel_pos[1] < 418:
            return map_number, rel_pos
    raise ValueError("Appears on unknown map")

Next steps

The terms of use for the OxonTime website forbid using it for other than personal non-commercial purposes, making it an abuse of the terms to use this data in some sort of mash-up. From my reading of the terms, however, there’s nothing to stop one writing and distributing a client that uses this API directly. If you’ve got the time and inclination, why not write an iPhone/Android/mobile-du-jour client application using all sorts of fancy geolocation and free mapping data?

You could even scrape the real-time information from http://www.oxontime.com/pip/stop.asp?naptan=naptan-code&textonly=1 (just be wary about non-well-formed HTML). Obviously, make yourself aware of the terms, and if in doubt, contact Oxfordshire County Council. Be warned that this API, though likely stable, comes with no guarantee to that effect. Also, it seems a little slow at times, so be gentle and treat it with respect.


Publishing lecture lists as Atom feeds

It’s become apparent that lots of IT departments are looking to make lecture lists available as RSS and Atom feeds.

As such, here’s a work-in-progress example of a lecture being described in Atom (sans namespaces):

<feed>

  <entry>
  
    <!-- Title comprises both course and topic for consumers who don't
         know about the extra namespaced metadata                      -->
    <title>Computer graphics: Clipping and transformations</title>
    <content type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml">
        <h1>Clipping and transformations</h1>
        <p>Lots of interesting stuff</p>
      </div>
    </content>
    <updated>2005-07-31T12:29:29Z</updated>
    <published>2003-12-13T08:29:29-04:00</published>

    <!-- Here we have the schools that may earn credit by taking the
         course associated with this lecture. This would enable a
         consumer to display all courses someone studying for a
         particular degree might be interested in.                     -->
    <oxevent:schools>
      <oxevent:school key="XCOM">Computer Science: Part A</oxevent:school>
      <oxevent:school key="DCOM">Computer Science: Part B</oxevent:school>
    </oxevent:schools>

    <!-- These form a hierarchy and would let users find more easily
         the course or lecture that interests them.                    -->
    <oxevent:unit key="comlab">Computing Laboratory</oxevent:unit>
    <oxevent:series key="cg">Computer graphics</oxevent:series>
    <oxevent:instance key="1">Clipping and transformations</oxevent:instance>
    <oxevent:format key="lecture">lecture</oxevent:format>
    
    <ev:startdate>2010-01-18T16:00:00+00:00</ev:startdate>
    <ev:enddate>2010-01-18T17:00:00+00:00</ev:enddate>

    <!-- Address and lat-long can be grabbed from OxPoints. We're
         looking at importing phone and fax numbers into OxPoints in
         the near future.                                              -->
    <geo:lat>51.75491</geo:lat><geo:long>-1.26078</geo:long>
    <vCard:adr>
      <vCard:street-address>Oxford Playhouse</vCard:street-address>
      <vCard:extended-address>11-12 Beaumont Street</vCard:extended-address>
      <vCard:locality>Oxford</vCard:locality>
      <vCard:postal-code>OX1 2LW</vCard:postal-code>
    </vCard:adr>
    <vCard:tel>+44-1865-305205</vCard:tel>
    <vCard:fax>+44-1865-793748</vCard:fax>
    <oxpoints:where rdf:resource="http://m.ox.ac.uk/oxpoints/id/12345678"/>

  </entry>

</feed>

The content obviously doesn’t make sense, but the form is there. It would be good if there were a feed that exposed all events for a department, as this would significantly cut down on the number of feeds we’d need to retrieve. This feed would contain all lectures, seminars, classes and so forth.

There’s still a lot to be worked out:

  • Speakers could do with being marked up.
  • How do we distinguish one-off lectures or those of general interest?
  • How do we relate classes and seminars to lectures, or state that a set of classes each present the same material (but have been split apart to reduce numbers)?

Comments, as always, are both welcome and encouraged.


Adding event times and location to RSS and Atom feeds

Mobile Oxford is currently looking at displaying events feeds from various sources and we’d like to display the start times, locations and any other bits of metadata we can get our hands on.

This article will use Atom terminology in preference to that of RSS for consistency’s sake.

Background

Most events feeds we’ve come across don’t specify these metadata anywhere other than as free text in the item summary or content. This makes things awkward, as the details can’t be extracted without either natural language parsing or regular expressions; the former is difficult and error-prone, the latter fairly brittle should something unexpected come up.

As such, we’d like to encourage maintainers of events feeds to expose these details in well-known nodes within the structure of the <item> or <entry> node.

This document is the result of lots of trawling the web looking for best practice. Most of what’s here is my inferences from various standards so I apologise if I get things wrong. If you find a better way to do things I’d be very grateful to hear it!

Dates and times

In researching this topic I came across two ways to mark up this information, with one more complicated (and thus more expressive) than the other. We’ll start simple…

The W3C ev namespace

Our first example of how best to do it comes from OxITEMS, the University’s newsfeed system. OxITEMS handles plain newsfeeds, events feeds and podcasts, and allows the content authors to not worry about the vagaries of the standards they need to support.

Here’s a condensed excerpt from the Biochemistry Seminars RSS feed (which, just to confuse you, I’ve expressed as Atom):

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:ev="http://purl.org/rss/1.0/modules/event/">
  <!-- Feed metadata snipped -->

  <entry>
    <!-- Other entry metadata was here -->
    <ev:startdate>2010-01-18T16:00:00+00:00</ev:startdate>
    <ev:enddate>2010-01-18T17:00:00+00:00</ev:enddate>
  </entry>
</feed>

This is using the http://purl.org/rss/1.0/modules/event/ namespace, described by the W3C in a draft specification.

The W3C define the following elements in that namespace: startdate, enddate, location, organizer and type. The first two are expected to be date-times specified as per ISO 8601[1] (e.g. “2009-12-15T13:07:01Z”), whereas the range of acceptable values for the latter three is intentionally left undefined. We’ll explore the use of location later, but for now it is safe to say that in general the latter three aren’t much use in a semantic context.
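As an aside, the machine-readable parts are straightforward for a consumer to pull out. Here’s a sketch using the standard library’s ElementTree (the filename is a placeholder):

from xml.etree import ElementTree

ATOM = '{http://www.w3.org/2005/Atom}'
EV = '{http://purl.org/rss/1.0/modules/event/}'

tree = ElementTree.parse('events.atom')
for entry in tree.findall(ATOM + 'entry'):
    print entry.findtext(ATOM + 'title'),
    print entry.findtext(EV + 'startdate'), entry.findtext(EV + 'enddate')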

The xCal namespace

One of the problems with the ev namespace – as admitted in the draft specification linked above – is that it cannot handle recurrent events. This would get awkward were one to be publishing a feed of lecture series, or a theatre feed where a show runs on multiple nights.[2]

The xCal specification presents an ‘XMLification’ of iCalendar, a functional and widely implemented format for events, todo lists and alarms. The main benefit of using this specification is that we can use the recurrence rules given by the rdate, rrule, exdate and exrule elements.

Here’s a simple example:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:xCal="urn:ietf:params:xml:ns:xcal">
  <!-- Feed metadata snipped -->
  <entry>
    <!-- Other entry metadata was here -->
    <xCal:dtstart>2010-01-18T16:00:00+00:00</xCal:dtstart>
    <xCal:dtend>2010-01-18T17:00:00+00:00</xCal:dtend>
  </entry>
</feed>

This is much the same as the ev namespace. The fun comes when we start defining recurrent events:

<entry>
  <!-- Other entry metadata was here -->
  <xCal:dtstart>2010-01-18T16:00:00+00:00</xCal:dtstart>
  <xCal:dtend>2010-01-18T17:00:00+00:00</xCal:dtend>
  <xCal:rrule>FREQ=WEEKLY;COUNT=8</xCal:rrule>
</entry>

Here we have an event that occurs for eight consecutive weeks starting Monday the 18th of January.

Yes; cramming an entire mini-format into a single text node breaks the concept behind having XML in the first place. However, conversion between xCal and iCalendar should be fairly simple, and there (apparently) exists XSL to convert from xCal to iCalendar.
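On the consuming side, a rule like this can be expanded with the third-party python-dateutil library; a brief sketch:

from datetime import datetime
from dateutil.rrule import rrulestr

start = datetime(2010, 1, 18, 16, 0)
for occurrence in rrulestr('FREQ=WEEKLY;COUNT=8', dtstart=start):
    print occurrence    # 2010-01-18 16:00:00, 2010-01-25 16:00:00, ...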

As of yet I don’t know of any sensible way to describe an event that occurs at different times on different days (e.g. “Every Monday at 2PM, and every Friday at 4PM”). The easiest route is to use xCal:rdate to specify each occurrence explicitly, but this cannot express unbounded recurrence.

Locations

Locations are probably a bit easier to describe, but there seem to be lots of ways to approach the problem. We’ll start with the ev and geo namespaces, before looking at the georss, xCal and vCard namespaces. Finally we’ll look at hooking your feed up to the OxPoints RDF store.

The ev and geo namespaces

Here’s how OxITEMS does it:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:ev="http://purl.org/rss/1.0/modules/event/"
      xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <!-- Feed metadata snipped -->
  <entry>
    <!-- Other entry metadata was here -->
    <ev:location>OX1 3QU (Main Meeting Room, [...])</ev:location>
    <geo:lat>51.759091</geo:lat>
    <geo:long>-1.255121</geo:long>
  </entry>
</feed>

Again, we see the use of the ev namespace. In this case, the information is drawn from the old OxPoints and provides a structured (if not necessarily friendly) bit of data.
The geo namespace is defined by the W3C and is used here to provide a co-ordinate in WGS84. Being machine-readable, this is extremely useful and can easily be used to plot a marker on a map.

In addition to the geo namespace, there’s also the georss namespace, which achieves much the same result:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:georss="http://www.georss.org/georss">
  <!-- Feed metadata snipped -->
  <entry>
    <!-- Other entry metadata was here -->
    <georss:where>
      <gml:Point>
        <gml:pos>51.760010 -1.260350</gml:pos>
      </gml:Point>
    </georss:where>
  </entry>
</feed>

The xCal namespace

It should be pointed out that in isolation neither geo, georss, nor xCal:location is sufficient; the first two are machine-readable but not human-readable, and the converse holds for xCal:location. An example of such a problem may be found in this feed (based on one from Daily Info):

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:ev="http://purl.org/rss/1.0/modules/event/"
      xmlns:xCal="urn:ietf:params:xml:ns:xcal"
      xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <!-- Feed metadata snipped -->
  <entry>
    <!-- Other entry metadata was here -->
    <title>11 Dec 2009 - 17 Dec 2009: Jack and the Beanstalk</title>
    <content type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml">
        <p><em>description was here</em></p>
        <p>
          <a href="http://www.dailyinfo.co.uk/reviews/venue/859/Oxford_Playhouse">
            Oxford Playhouse
          </a>
          (<a href="http://www.oxfordplayhouse.com/">www.oxfordplayhouse.com/</a>),
          11-12 Beaumont Street, Oxford OX1 2LW; Tel. 01865 305305; Fax: 01865 793748.<br />
        </p>
      </div>
    </content>
    <geo:lat>51.75491</geo:lat><geo:long>-1.26078</geo:long>
    <xCal:location><![CDATA[http://www.dailyinfo.co.uk/reviews/venue/859/Oxford_Playhouse]]></xCal:location>
  </entry>
</feed>

I’ve left the content in because, while it is evident to a user which part is the address, a feed parser has no chance of extracting that information. Outside the free text there is no address, and the venue name is not given either. The contents of the xCal:location element cannot be assumed to be dereferenceable and are thus not much use. The best we can do is plot a marker on a map without giving the user any sensibly placed clues as to what it is they’re looking for when they get there.

vCards

There’s also a fourth way to do it, using an XMLised version of vCards, specified by the W3C. Here’s an example:

<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:vCard="http://www.w3.org/2006/vcard/ns#">
  <!-- Feed metadata snipped -->
  <entry>
    <!-- Other entry metadata was here -->
    <vCard:adr>
      <vCard:street-address>Cheese Room, Aardvark College</vCard:street-address>
      <vCard:extended-address>Aardvark Street</vCard:extended-address>
      <vCard:locality>Oxford</vCard:locality>
      <vCard:postal-code>OX1 1AA</vCard:postal-code>
    </vCard:adr>
  </entry>
</feed>

This solution has the advantage of attaching a role to each part of the address, allowing the consuming system to display as much or as little of the address as it chooses.

OxPoints

For feeds specific to the University of Oxford (or at least those describing events at University venues) it may be worth referencing OxPoints. OxPoints is an RDF store containing locations of and relationships between entities at the University, and can be queried in a variety of ways. If you’d like to add such a link I’d suggest you use:

<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:oxpoints="http://ns.ox.ac.uk/namespace/oxpoints/2009/02/owl#">
  <!-- Feed metadata snipped -->
  <entry>
    <!-- Other entry metadata was here -->
    <oxpoints:where rdf:resource="http://m.ox.ac.uk/oxpoints/id/12345678"/>
  </entry>
</feed>

Conclusions

With so many different ways of conveying the same information it could be tough deciding between them. That said, you don’t lose anything by implementing as many as is feasible.

For times use both ev and xCal, and for locations use one or both of geo and georss for latitude-longitudes. For textual locations I’d advise a comma-delimited list for ev:location and xCal:location, as well as a vCard for those parsers that understand it. Again, if you use University venues, consider referencing OxPoints.

Here’s a long-winded example of all of these in use:

<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:oxpoints="http://ns.ox.ac.uk/namespace/oxpoints/2009/02/owl#"
      xmlns:vCard="http://www.w3.org/2006/vcard/ns#"
      xmlns:ev="http://purl.org/rss/1.0/modules/event/"
      xmlns:xCal="urn:ietf:params:xml:ns:xcal"
      xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <!-- Feed metadata snipped -->
  <entry>
    <!-- Other entry metadata was here -->
    <ev:startdate>2010-01-18T16:00:00+00:00</ev:startdate>
    <ev:enddate>2010-01-18T17:00:00+00:00</ev:enddate>
    <xCal:dtstart>2010-01-18T16:00:00+00:00</xCal:dtstart>
    <xCal:dtend>2010-01-18T17:00:00+00:00</xCal:dtend>

    <geo:lat>51.75491</geo:lat><geo:long>-1.26078</geo:long>
    <vCard:adr>
      <vCard:street-address>Oxford Playhouse</vCard:street-address>
      <vCard:extended-address>11-12 Beaumont Street</vCard:extended-address>
      <vCard:locality>Oxford</vCard:locality>
      <vCard:postal-code>OX1 2LW</vCard:postal-code>
    </vCard:adr>
    <vCard:tel>+44-1865-305205</vCard:tel>
    <vCard:fax>+44-1865-793748</vCard:fax>
    <oxpoints:where rdf:resource="http://m.ox.ac.uk/oxpoints/id/12345678"/>
  </entry>
</feed>

Footnotes and references

  1. Specifically a subset of ISO 8601 in which the century must be specified. See the W3C’s note on dates and times for more details. Being a note (as opposed to a recommendation) it is entirely optional whether you format your date-times in this manner, but it’s probably a Good Idea.
  2. This leads to the question of whether you have a feed for each lecture series (with an entry per lecture), or one feed containing all series. The former would seem more sensible if you wanted to attach lecture notes to individual lectures. As of yet I haven’t found a sensible way to associate entries in a series, or to say “this seminar accompanies that lecture” or that “this set of seminars is for Group A, and those for Group B”. More thinking and investigation needed!

Django, class-based views, metaclassing, and view validation

As you may or may not be aware, the University’s mobile portal, Mobile Oxford, is a site built on the Django web framework. I plan to explain some of the concepts behind how it works, partly for everyone else’s insight, and partly to garner some feedback on how we might aim to do things better.

When an HTTP request is handled by a Django website, it attempts to match the local part of the URL against a series of regular expressions. Upon finding a match it passes an object representing the request as an argument to a callable associated with the regular expression. In Python, most callables you will find are class or instance methods, or functions. The Django documentation only briefly refers to the fact that one can use callables other than functions.

Class-based views and metaclasses

We’re using class-based views, a concept that doesn’t seem to have much of a presence on the Internet. The usual approach is to define a method __call__(self, request, …) on a class, an instance of which is then placed in an urlconf. Our approach is the following:

  • Have a base view called, oddly enough, BaseView.
  • Define a method __new__(cls, request, …) that despatches to other class methods depending on the HTTP method.
  • The __new__ method also calls a method to add common context for each of the handlers.
  • We use a metaclass to save having to put @classmethod decorators in front of every method.
  • We never create instances of the view classes; instead, __new__ returns an HttpResponse object and the class itself is placed in the urlconf.

Here’s the code:

from inspect import isfunction

from django.http import HttpResponse
from django.template import RequestContext

class ViewMetaclass(type):
    def __new__(cls, name, bases, dict):
        # Wrap all functions but __new__ in a classmethod before
        # constructing the class
        for key, value in dict.items():
            if isfunction(value) and key != '__new__':
                dict[key] = classmethod(value)
        return type.__new__(cls, name, bases, dict)

class BaseView(object):
    __metaclass__ = ViewMetaclass

    def method_not_acceptable(cls, request):
        """
        Returns a simple 405 response.
        """

        response = HttpResponse(
            "You can't perform a %s request against this resource." %
                request.method.upper(),
            status=405,
        )
        return response

    # We could go on defining error status handlers, but there's
    # little need. These can also be overridden in subclasses if
    # necessary.

    def initial_context(cls, request, *args, **kwargs):
        """
        Returns common context for each of the HTTP method
        handlers. You will probably want to override this in
        subclasses.
        """

        return {}

    def __new__(cls, request, *args, **kwargs):
        """
        Takes a request and arguments from the URL despatcher,
        returning an HttpResponse object.
        """

        method_name = 'handle_%s' % request.method
        if hasattr(cls, method_name):
            # Construct the initial context to pass to the HTTP
            # handler
            context = RequestContext(request)
            context.update(cls.initial_context(request,
                                               *args, **kwargs))

            # getattr returns a bound method, to which we pass the
            # request and initial context
            handler_method = getattr(cls, method_name)
            return handler_method(request, context,
                                  *args, **kwargs)
        else:
            # Our view doesn't want to handle this method; return
            # a 405
            return cls.method_not_acceptable(request)

Our actual view code can then look a little something like this (minus all the faff with input validation and authentication):

from django.http import HttpResponse
from django.shortcuts import get_object_or_404, render_to_response

from myapp.models import Cheese  # wherever your Cheese model lives

class CheeseView(BaseView):
    def initial_context(cls, request, slug):
        return {
            'cheese': get_object_or_404(Cheese, slug=slug)
        }

    def handle_GET(cls, request, context, slug):
        return render_to_response('cheese_detail.html', context)

    def handle_DELETE(cls, request, context, slug):
        context['cheese'].delete()
        # Return a 204 No Content response to acknowledge the cheese
        # has gone.
        return HttpResponse('', status=204)

    def handle_POST(cls, request, context, slug):
        # Allow a user to change the smelliness of the cheese
        context['cheese'].smelliness = request.POST['smelliness']
        context['cheese'].save()
        return HttpResponse('', status=204)

For those who aren’t familiar with metaclasses, I’ll give a brief description of class creation in Python. First, the class statement executes all the code in the class body, using the newly bound objects (mostly the methods) to populate a dictionary. This dictionary is then passed to the __new__ method on the metaclass, along with the name of the class and its base classes. Unless otherwise specified, the metaclass will be type, but the __metaclass__ attribute is used to override this. The __new__ method can alter the name, base classes and attribute dictionary as it sees fit. In our case we are wrapping the functions in class method constructors so that they do not become instance methods.

Other things we could do are:

  • Override handle_DELETE in a subclass to return a 403 Forbidden if the cheese is important (calling super(cls, cls).handle_DELETE if it isn’t)
  • Despatch to other methods from a handler to keep our code looking modular and tidy
  • Override __new__ to add more parameters to the handlers on subclasses

As an example of the last point, we have an OAuthView that makes sure we have an access token for a service and adds to the handler parameters an urllib2 opener carrying the necessary credentials to access a remote resource.

The subclassing view can then simply call opener.open(url) without having to worry about achieving the requisite authorisation.

Using class-based views allows us to define other methods on the views to return metadata about the resource being requested. As an example, we have a method that constructs the content for the breadcrumb trail, and another that returns the metadata for displaying in search results.
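For illustration only, such a method might look like the following sketch (the signature and CheeseIndexView are hypothetical, not our actual API):

class CheeseView(BaseView):
    # ... HTTP handlers as before ...

    def breadcrumb(cls, request, context, slug):
        # Hypothetical: return the parent view and a title for this
        # page, from which a breadcrumb trail can be rendered.
        return (CheeseIndexView, context['cheese'].name)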

Achieving such extensibility with function-based views would be nigh on impossible.

Now for view validation…

As you may have noticed, all these methods (handle_foo, initial_context) have similar signatures. To make sure they’re consistent we have a test that looks through all the installed apps looking for classes that subclass BaseView. It then uses inspect.getargspec to compare the signatures. The alternative to this would be to check them in the metaclass, and raise an appropriate error at class creation time.
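As a rough illustration (not our exact test), the core of such a check might look like:

import inspect

def validate_view(view_class):
    # initial_context takes (cls, request, ...); each handler takes
    # (cls, request, context, ...) with the same trailing arguments.
    expected = inspect.getargspec(view_class.initial_context).args
    for name in dir(view_class):
        if name.startswith('handle_'):
            args = inspect.getargspec(getattr(view_class, name)).args
            assert args[:2] + args[3:] == expected, (
                '%s.%s has an inconsistent signature'
                % (view_class.__name__, name))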

Hopefully you find this useful. We’d be very grateful to hear any suggestions or criticisms that you may have. I certainly don’t suggest this approach is applicable in every case, but it’s helped us to adhere as much as possible to the DRY principle.
