Referring to my log analysis

I talked a while back about an aspect of Logfile Analysis referred to in the TIDSR as Referer Analysis. In simple terms, this looks at the Referer portion of the logfiles and attempts to find patterns or identify interesting items. This is really part art, part science.

We’ve already learnt that the Referer field is less than perfect, but when reviewed over a large enough dataset, some interesting points can be made. The OPMS:Stats database contains 98 million records for files hosted on media.podcasts.ox.ac.uk, covering the period Nov 1st 2010 to Feb 28th 2011. Across these 98 million records there are just shy of 9000 unique referer values. That proportion appears to be lower than the typical website average, which I suspect is because many of the files are accessed by non-web-browser applications (e.g. media players account for 52% of all accesses, 40% are web browsers, 15% are iOS devices) that don’t tend to advertise where the file was linked from.

Let’s take a look at some of the things I’ve found in our own lightweight Referer Analysis using our OPMS:Stats database and tools. Most of the referer entries are typical webpage URLs, although a number are IP-based addresses that don’t seem to lead to actual webpages. By taking these URLs and splitting them on “/” (e.g. using Excel’s Text-to-columns function) you can break up the data and then scan through the column of site names looking for interesting names and patterns.
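The same split-and-scan step can also be scripted rather than done in a spreadsheet. What follows is only a minimal sketch, assuming the referer values have been exported as plain strings; the function names are illustrative and are not part of the OPMS:Stats tools.

```python
from collections import Counter
from urllib.parse import urlsplit

def referer_host(referer):
    """Return the host part of a referer URL, or None if it has none."""
    try:
        # urlsplit only recognises a host when the value has a scheme or a leading //
        return urlsplit(referer.strip()).hostname
    except ValueError:
        # Malformed values (e.g. a broken IPv6 literal) are treated as having no host
        return None

def count_hosts(referers):
    """Count how many referer entries point at each host."""
    hosts = (referer_host(r) for r in referers)
    return Counter(h for h in hosts if h)

# Two referers quoted in this post, plus an empty "-" entry as commonly seen in logs
sample = [
    "http://www.youtube.com/watch?v=1x6rj1Bruyc",
    "http://www.youtube.com/watch?v=kPx4CE1Oscw",
    "http://twitter.com/",
    "-",
]
for host, n in count_hosts(sample).most_common():
    print(host, n)
```

Sorting by count puts the heavy hitters at the top, but it is the long tail of one-off hosts that usually needs reading by eye.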

YouTube

One of the first names that leapt out was YouTube. There are a couple of YouTube URLs in the data that appear to be samples from podcasting videos we offer. These have been uploaded by members of the public (as far as I can determine), but either way, are externally generated from our perspective. Each video helpfully has a link back to the podcast file (hence why they appear in our podcast file logs) and these files have been downloaded via these links. In effect, free advertising and some simple social media at work. As there’s just the two, I’ll do a quick breakdown here:

  • Referer link: http://www.youtube.com/watch?v=1x6rj1Bruyc
    • User “30Comodore” has added this one video, an excerpt of a philosophy debate/lecture on Radical Atheism, and added a link back to the file.
    • This YouTube page has had 340 views (at time of writing), and led to 12 downloads of the file, from IP addresses in Sweden, Ireland, United States, Australia and the UK (all external to Oxford University computer addresses).
  • Referer link: http://www.youtube.com/watch?v=kPx4CE1Oscw
    • User “ChrisLappas” has uploaded this as one of 123 videos they have published to YouTube. It is (coincidentally perhaps?) another Atheism video, this time an excerpt of the Dawkins vs Harries debate marking the 200th Anniversary of the Darwin debate at Oxford.
    • This page has had 831 views and led to 5 downloads from IP addresses in the UK, Finland and Greece.
  • Both links have resulted in downloads without any apparent time clustering; that is, the downloads have occurred throughout the 4 month monitoring window, suggesting this content isn’t especially time-sensitive and is able to attract ongoing attention.
  • Unsurprisingly, all of these files were accessed from (presumably) the same browser used to play the YouTube video, which reflects a typical distribution of User Agents (i.e. IE, Firefox, Safari, Opera and Chrome). Most of these files were downloaded to PCs running some version of Microsoft’s Windows OS.
  • Unusually, in respect of the majority of the data in the log files, these referred downloads all resulted in 200 (Status OK) codes, meaning the complete file was downloaded in a single session (i.e. one click and they took the whole file).
  • Looking at the Reverse DNS entries, most were from public (home?) internet providers, though one download was to a student network at Emory University in the USA.
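The status-code observation in the list above can be checked mechanically. This is a rough sketch, assuming each log record exposes its HTTP status as an integer; the function name and bucket labels are made up for illustration. In HTTP, 200 means the whole resource was returned and 206 (Partial Content) means only a requested byte range was returned.

```python
from collections import Counter

def summarise_statuses(statuses):
    """Bucket HTTP status codes into complete, partial and other responses."""
    buckets = Counter()
    for status in statuses:
        if status == 200:
            buckets["complete"] += 1   # 200 OK: the full file in one response
        elif status == 206:
            buckets["partial"] += 1    # 206 Partial Content: a byte range only
        else:
            buckets["other"] += 1      # redirects, errors, not-modified, etc.
    return buckets

# Illustrative values only, not taken from the real logs
print(summarise_statuses([200, 200, 206, 404]))
```

A referer whose requests all land in the “complete” bucket matches the one-click, whole-file behaviour described for the YouTube links.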

Twitter

Aware that we have been running an experiment to test the impact of promoting podcasts on Twitter, I decided to look for instances of Twitter in the referer data. Twitter appears 6 times in the referer data, and resulted in 15 podcasts being downloaded (2 of which were via the Twitter API). Interestingly, none of the file requests bear the LfI tracking decoration, suggesting none of them were related to our experiment – or perhaps slightly more puzzling, none of the handful of downloads attributed to the experiment appear to have been done via a web-browser with a referer. This might be explained by users using Twitter client software to read the tweets, which then resulted in the URL being passed to a web-browser or media player to access the file, which would have no context with which to seed a referer value. That is, the download software doesn’t know where the link came from, so couldn’t tell us.

11 of the downloads were via the Twitter frontpage (http://twitter.com/) suggesting some opportunistic downloading was done as a tweet referencing our files scrolled by on the Latest feed. For most of the tweets, it is fairly clear that the URL tweeted came from the Podcasts Web Portal, due to the inclusion of a piece of URL decoration used for that location.

StumbleUpon

StumbleUpon is a web recommendation system, and appears to have found some of our podcasts, specifically our very popular General Philosophy feed and also our Podcasting Web Portal.

StumbleUpon accounts for 215 downloads at a cursory count, but a scan of the data suggests this isn’t a simple story. All 215 were performed by one of three User Agents (i.e. systems and software), each of which appears to be a Windows-based PC equipped with Media Center, and came from 3 US-based IP addresses.

The story becomes a little clearer when looking at who and what was downloaded. One of those IP addresses links back to Northern Michigan University public wifi access, but before you get excited about educational usage, the 156 files that user downloaded were purely Album Art. If I were to hazard a guess as to what they were doing, I would say they were simply loading our website via StumbleUpon (the current portal’s inefficient approach is to have a single page load all available album art, much of which is stored on media.podcasts.ox.ac.uk). The bad news: there’s no evidence that after loading the portal page they downloaded any podcasts.
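Separating album art from real podcast downloads per IP address is the step that changes the picture here, and it is easy to sketch. The code below is an illustration only: it assumes album art can be recognised by an image file extension, and the record layout, function names, addresses and URLs are all hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Assumption: album art files carry one of these image extensions
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}

def is_album_art(requested_url):
    """Guess whether a requested file is album art from its extension."""
    path = urlsplit(requested_url).path
    return PurePosixPath(path).suffix.lower() in IMAGE_EXTENSIONS

def downloads_by_ip(records):
    """Map each IP address to counts of album art and other (media) requests.

    `records` is an iterable of (ip_address, requested_url) pairs.
    """
    summary = defaultdict(lambda: {"album_art": 0, "media": 0})
    for ip, url in records:
        kind = "album_art" if is_album_art(url) else "media"
        summary[ip][kind] += 1
    return dict(summary)

# Made-up records using documentation-only addresses and hosts
records = [
    ("192.0.2.1", "http://example.org/feeds/cover.JPG"),
    ("192.0.2.1", "http://example.org/feeds/lecture.mp3"),
    ("192.0.2.2", "http://example.org/feeds/art.png"),
]
for ip, counts in downloads_by_ip(records).items():
    print(ip, counts)
```

An address with a large album art count and no media requests looks like a portal page load rather than someone downloading podcasts.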

*.ac.uk

One common request is to be able to identify accesses from within an educational context. Referer Analysis is perhaps not an ideal way to do this, but it can offer some insight as to where the podcasts are being linked from (which isn’t the same as where the people accessing them are based). Looking for referer data that contains a domain ending in .ac.uk is a handy way of spotting UK FE and HE links. Perhaps unsurprisingly for a University podcasting service, over a third of the referers contain this domain. Of those 3600+ records, only 584 are from outside of Oxford. Due to the large number of examples, I will select the major ones and present them below.

  • The University of Nottingham’s Xpert search system features frequently in the data, and can have 55 downloads attributed to users from that system. Handily the referring URL also includes a search term used to find the data, and includes the likes of “environment”, “philosophy”, “RMIT” & “technology”.
    • However, this count of 55 downloads may be a little shallow. Further digging shows that 43 of those downloads were from one US-based IP address identified as fastsearch.net. The downloads all occurred in a close time period (i.e. one day in late November) and the domain may have some relation to fastsearch.com, which is redirected to Microsoft’s SharePoint website.
    • Of the 12 remaining downloads, 10 were downloaded in two clusters by a US based IP address, and the other two were to an Italian IP and UK IP respectively. The US downloads appear to have been performed by an automated crawler service (“FAST Enterprise Crawler 6 / Scirus scirus-crawler@fast.no; http://www.scirus.com/srsapp/contactus/”).
    • This might suggest that only 2 “valid” downloads were actually performed via the Xpert system!
    • But this only accounts for referers with the http://xpert.nottingham.ac.uk/ domain in use… and there is a much larger number of referer entries that come from http://www.nottingham.ac.uk/xpert/xxxxx and web1.nottingham.ac.uk/xpert. Let’s take a look at these entries in the next point.
    • Xpert has multiple URLs for access, and from a user perspective, they all look the same. Pat Lockley is an Xpert developer and will be joining OUCS next month, so I will have to ask him more about this then. Widening the term to look for anything from nottingham.ac.uk we find they are all from the Xpert system, and account for 270 referer records (and a much broader range of search terms).
    • 411 downloads can now be attributed to Xpert, but as we’ve seen from the above, deeper probing is likely to reduce this considerably. There are a number of clustered downloads from single IP addresses, which is either an enthusiastic consumer, or an automated system – indeed, these 411 downloads have come via 154 IP addresses.
    • I’m going to have to skip delving deeper on this because of time constraints, but it may be interesting to revisit later as the reporting tools develop and make identifying patterns of usage easier to find.
  • MediaPlayer.group.cam.ac.uk features heavily too. This is a demonstration site (presently offline due to technical issues) built for the Steeple project that aggregates podcasting content from multiple institutions. Indeed, over 1000 downloads can be attributed to this Steeple Aggregator demonstrator over this four month data sample.
    • The geographic distribution for visitors is impressive, with 550 unique IP addresses coming in via this channel representing 79 different countries, and very little UK/US/China bias either.
  • Jorum is a JISC funded learning content repository, and has links with Oxford Podcasting through a series of JISC funded projects (Steeple, OpenSpires, etc). They were able to access our podcasting catalogue in a similar fashion to the Steeple Aggregator (MediaPlayer@CAM) and Xpert, and thus can direct users to our content from within their portal system.
    • Jorum.ac.uk features 89 times in the referer database linked to approximately 209 downloads.
    • One user in Israel appears to have been trying to download 16 philosophy lectures via Jorum, but been having some connectivity problems. The logs show 53 requests, many of which are for partial file content along with 16 complete downloads, all attempted over an 8 hour period during the daytime.
  • If we were to take download numbers out of context, then the biggest winner in the UK academic space would be yet another Steeple offshoot, the Ensemble Aggregator demonstrator built by CETIS. Their winning score? 12202 downloads in our sample period.
    • However, don’t get excited: not a single one of those was for an actual podcast. They were all Album Art downloads to 138 unique IP addresses spread amongst 30 countries, suggesting that the search service is attracting attention, but not convincing users to download content after looking.
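The .ac.uk filter used throughout this section, including the split between Oxford and non-Oxford referers, could be sketched roughly as follows. This assumes that “Oxford” means any host under ox.ac.uk (as with media.podcasts.ox.ac.uk); the function names and labels are illustrative only.

```python
from urllib.parse import urlsplit

def is_in_domain(host, domain):
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)

def classify_referer(referer):
    """Label a referer as 'oxford', 'other_ac_uk' or 'elsewhere'."""
    try:
        host = urlsplit(referer).hostname
    except ValueError:
        host = None  # malformed URL, treat as having no host
    if not host:
        return "elsewhere"
    # Check the narrower Oxford domain first, since it also ends in ac.uk
    if is_in_domain(host, "ox.ac.uk"):
        return "oxford"
    if is_in_domain(host, "ac.uk"):
        return "other_ac_uk"
    return "elsewhere"

for url in [
    "http://xpert.nottingham.ac.uk/",
    "http://www.nottingham.ac.uk/xpert/xxxxx",
    "http://www.youtube.com/watch?v=1x6rj1Bruyc",
]:
    print(classify_referer(url), url)
```

Matching on the parsed host rather than on the raw string avoids counting URLs that merely mention “.ac.uk” somewhere in their path or query.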

Google

Rather an obvious one to have left till now, but Google features in a wide range of referer URLs. Indeed, almost one half of all the URLs are via a Google webpage! These URLs have the advantage (perhaps) of featuring the search terms that were used to locate content, and perhaps it is no surprise that many of these searches are for Images rather than Podcasts (image searching being far easier via Google than audio or even video searching). There are many examples of Google’s Translate system being used on our Podcast Web Portal to enable users to browse and download content more easily.

This is a topic area that I think will need further exploration in the future and one that is too deep to cover in this project and posting.
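As a starting point for that exploration, the search terms carried in Google referer URLs can be pulled out with a few lines of code. This is a hedged sketch: it assumes the search phrase sits in the `q` query parameter and that any host containing a “google” label is a Google page. The function name and the example URL are made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit

def google_search_terms(referer):
    """Return the search phrase from a Google referer URL, or None."""
    try:
        parts = urlsplit(referer)
        host = parts.hostname or ""
    except ValueError:
        return None  # malformed URL
    # Assumption: hosts such as www.google.com or www.google.co.uk all have a "google" label
    if "google" not in host.split("."):
        return None
    # Assumption: the search phrase is carried in the "q" parameter
    values = parse_qs(parts.query).get("q")
    return values[0] if values else None

# Hypothetical referer, not taken from the real data
print(google_search_terms("http://www.google.co.uk/search?q=general+philosophy&hl=en"))
```

Feeding the extracted phrases into a simple counter would give a first view of which searches lead people to the files.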

Further Notes

I cannot find any referers that are clearly American Universities (*.edu), suggesting Oxford’s material is not being incorporated into any US-based educational websites. Wikipedia appears 3 times in the referer data, pointing at 3 podcasts, which have been downloaded collectively 45 times. Facebook is fairly absent from this data, with just the one entry to an African Studies file that generated 4 downloads, perhaps suggesting that our material is not entering the social graph and being shared by the students.

Researchers and Academics who have recorded podcasts quite often post links to those podcasts on their own websites, and these appear from time to time in the referer data, though with very low download counts (1-3 appears common). There are many URLs in the data that appear to be blog sites, but at a casual glance, none have been particularly impressive in terms of generating downloads. Again, this might be a limitation of the current visualisation methods.

Concluding

It was originally planned that this analysis would take place early on in the project as part of the Rapid Analysis Report. However, limitations in the data collection and analysis (i.e. almost non-existent) meant this has been delayed until tools could be created and time spent investigating the data they had collected. This unfortunately leaves this analysis feeling a little rushed and somewhat incomplete, but I hope it shows insights that would be worth following up in the future. Google, for one, could be almost an entire project in itself; Blog post extraction and analysis could be another.

This form of analysis is perhaps the only means available to an institution that doesn’t implement particularly sophisticated tracking on its portals or have some form of Business Intelligence system available. It can shed some light onto the users and activities of our consumers, if only by where they are finding our content and its associations. It certainly shows that a broad audience will promote some of the content themselves (YouTube, Twitter, Blogs, etc) and this can lead to further downloads. The impact of various aggregators and portals needs to be appreciated, as the engineering effort required to open up this content is not particularly onerous.

This data covers over a third of all the download data so offers some significant coverage, with the rest likely to be due to the nature of our content (i.e. media files accessed by a media player, not a web browser). Certainly its range offers a fairly lightweight means of assessing broad impact measures in terms of content placement and I feel that is something that can’t be readily overlooked when looking for our users.

Posted in Quantitative, Tech-Heavy, WP2: Initial Rapid Analysis, WP7: Embedding Toolkit