Refering to my log analysis

I talked a while back about an aspect of Logfile Analysis referred to in the TIDSR as Referer Analysis. In simple terms, this looks at the Referer portion of the logfiles and attempts to find patterns or identify interesting items. This is really part art, part science.

We’ve already learnt that the Referer field is less than perfect, but when reviewed over a large enough dataset, some interesting points can be made. The OPMS:Stats database contains 98 million records, for files hosted on, and covers the period 1st Nov 2010 to 28th Feb 2011. For these 98 million records there are just shy of 9,000 unique referer records – a proportion that appears to be lower than the typical website average. I suspect this is because many of the files are accessed by non-web-browser applications (e.g. media players account for 52% of all accesses, web browsers for 40%, iOS devices for 15%), which don’t tend to advertise where the file was linked from.

Let’s take a look at some of the things I’ve found in our own lightweight Referer Analysis using our OPMS:Stats database and tools. Most of the referer entries are typical webpage URLs – albeit there are a number of IP-based systems that don’t seem to lead to actual webpages. By taking these URLs and splitting them on “/” you can break up the data (e.g. using Excel’s Text-to-Columns function) and then scan through the column of site names looking for interesting names and patterns.
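The same split-on-“/” approach can be sketched in a few lines of Python (the referer URLs below are made-up illustrations, not entries from our logs):

```python
# Pull the site name (the part after the scheme) out of each referer
# URL by splitting on "/" - the same trick as Excel's Text-to-Columns.
referers = [
    "http://www.youtube.com/watch?v=abc123",
    "http://podcasts.ox.ac.uk/some/page.html",
    "http://192.168.0.1/feed.xml",
]

site_names = []
for url in referers:
    parts = url.split("/")           # ["http:", "", "www.youtube.com", ...]
    if len(parts) > 2:
        site_names.append(parts[2])  # element 2 is the host name

print(site_names)
# → ['www.youtube.com', 'podcasts.ox.ac.uk', '192.168.0.1']
```

Once the host names are in their own column, sorting and counting them makes the interesting patterns (like the YouTube entries below) easy to spot.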


One of the first names that leapt out was YouTube. There are a couple of YouTube URLs in the data that appear to be samples from podcasting videos we offer. These have been uploaded by members of the public (as far as I can determine), but either way, are externally generated from our perspective. Each video helpfully has a link back to the podcast file (hence why they appear in our podcast file logs) and these files have been downloaded via these links. In effect, free advertising and some simple social media at work. As there’s just the two, I’ll do a quick breakdown here: Continue reading

Posted in Quantitative, Tech-Heavy, WP2: Initial Rapid Analysis, WP7: Embedding Toolkit | Comments Off on Refering to my log analysis

Can you hear me tweeting?

Measuring Impact with Twitter and

For the past month (and ongoing) we have been running an experiment to test what impact advertising selected podcasts via Twitter can have. In this short report we look at the initial findings and walk through the methodology and indirect benefits, noting that Twitter is not a simple panacea for attracting an audience, but that implementing processes involving it can help improve access to content and allow some easy gains to be made in measuring impact.

The Oxford Podcasting service has a Twitter feed for making announcements about our podcasts and related news (@OxfordPodcasts). It has been a fairly under-utilised tool thus far, and we wanted to see if it could make an appreciable impact on selected podcasts being accessed and downloaded. Prior to our experiment, there had been 43 tweets and we had 1,525 followers – a fairly sizeable following, though far from superstar status.

Learning to tweet

We adopted a thrice-weekly posting strategy: a topical (news-related) item on Mondays, an older (>6 months) podcast that had received little attention on Wednesdays, and a selected new (<2 weeks) podcast on Fridays. Twitter has a 140-character posting limit, so there was some initial debate over exactly what to put there. Our first thought was to squeeze in direct links to the podcast file and a related RSS feed, leaving about 85 characters to “sell” the podcast with. We realised this wasn’t going to be ideal, and that whilst we were going to use a URL-shortening service, we really needed a landing page to point at that could give more information on each podcast. As the current web portal is very limited and difficult to redevelop, and because the replacement is not yet ready for production use, we decided that a blog site for the podcasting service would meet our needs. Continue reading
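The character budget works out as simple arithmetic (the ~26-character link length is my assumption about typical shortened URLs of the time, including the “http://” prefix, not a measured figure):

```python
# Rough character budget for a 140-character tweet containing two links.
TWEET_LIMIT = 140
LINK_LEN = 26   # assumed length of a shortened URL, incl. "http://"

links = 2       # direct podcast-file link + related RSS feed link
spaces = 3      # whitespace separating the text and the two links

remaining = TWEET_LIMIT - links * LINK_LEN - spaces
print(remaining)  # → 85 characters left to "sell" the podcast
```

Which is why a single link to a landing page is so much more attractive: dropping one link roughly doubles the space available for the pitch.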

Posted in Quantitative, Statistics, Tech-Moderate, WP6: Publishing to more channels | 1 Comment

User Agent Analysis – Part 2: Name those agents

In writing a system to parse the myriad of User Agent strings that appear in our podcast hosting logs, I have come across a number of interesting (on a very geeky level) observations. The short background is that I have sampled three random log files (one from each year of operation), manually extracted the user-agent strings, collated them, and then begun writing a parser to handle this sample of data. I’m writing in Python, and unfortunately there is not an existing system that really works for our use-case. The biggest source of help so far has been UAS-info, which has an extensive database of systems, browsers and OSes, and a working library of regular expressions to parse for these, with an implementation available in Python.

However, their library is rather short of “Multimedia Player” references – specifically for the iTunes application. Unfortunately, a very large proportion of our podcasts are downloaded with a user agent identifying itself as iTunes. I am in the process of modifying their library to include matching for these agent strings, but I still have a few strings that need further analysis. The following is a quick look at my current (work-in-progress) challenges.

iOS devices and iTunes

The following is a list (not 100% complete) of user agent strings that appear to be iOS devices. The pattern is fairly obvious: Application-Device/ios_version (device_version?; memory_size) Continue reading
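The pattern above can be captured with a regular expression along these lines (a sketch only – the group names are my own, and the real parser may slice the fields differently):

```python
import re

# Matcher for the Application-Device/ios_version (device_version?; memory_size)
# pattern, e.g. "iTunes-iPad-M/4.2.1 (64GB)" or "iTunes-iPhone/4.2.1 (4; 16GB)".
IOS_UA = re.compile(
    r"^(?P<app>iTunes)-(?P<device>[\w-]+)/(?P<ios>[\d.]+)"  # app, device, iOS ver
    r" \((?:(?P<hw>[\w.]+); )?(?P<mem>\d+GB)\)$"            # optional hw rev, memory
)

for ua in ["iTunes-iPad-M/4.2.1 (64GB)", "iTunes-iPhone/4.2.1 (4; 16GB)"]:
    m = IOS_UA.match(ua)
    if m:
        print(m.group("device"), m.group("ios"), m.group("mem"))
# → iPad-M 4.2.1 64GB
# → iPhone 4.2.1 16GB
```

Note how the device-version field is optional in the expression, since some strings (like the iPad ones) omit it.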

Posted in Tech-Heavy, WP2: Initial Rapid Analysis, WP3: Website Enhancement | 4 Comments

User Agent Analysis

As part of a suite of software tools we’re developing to analyse our disparate sources of tracking data, I have been looking at log files (e.g. those produced by Apache Web Server) and in particular, how to break down and analyse the User Agent string.

The UA string (when supplied – there are requests made without one) gives us some hint of what software and systems are being used to access our content. This isn’t authoritative – it’s too easy to spoof the UA, and any number of people do, typically to gain access to sites that differentiate based on your software or computer (e.g. if this is an iOS device, don’t try to display Adobe Flash applications; if this is Internet Explorer 6.0, tell them to go away and update their browser). But in the main (and with a large enough dataset) it is fair to look at these strings and seek patterns, if only to better understand what sort of user experience our downloaders are having – i.e. does their machine natively support our video formats? Are they using the iTunes application? Should we be providing format X because we have so many people on platform Y?

Anyhow, I have been trying to understand the pattern of such strings, and compare that against a sample of our data, so as to build a tool that is flexible enough to import this information, and detailed enough to help us break down the information into usable categories. This is proving non-trivial, so I thought I’d share some of my findings and analysis so far.

In terms of variation: I have been looking at a small sample (two or three days of logs) of data, some from 2009 and some from very recently – there were 130 unique UA strings in the 2009 data, and over 750 in the 2011 sample. The format of the string varies depending on platform, but to give some examples:

  1. Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)
  2. Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv: Gecko/20091102 Firefox/3.5.5
  3. Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
  4. iTunes/10.1.2 (Macintosh; Intel Mac OS X 10.6.6) AppleWebKit/533.19.4
  5. iTunes-iPad-M/4.2.1 (64GB)
  6. iTunes/10.1.2 (Windows; Microsoft Windows 7 Business Edition (Build 7600)) AppleWebKit/533.19.4
  7. curl/7.18.2 (i486-pc-linux-gnu) libcurl/7.18.2 OpenSSL/0.9.8g zlib/ libidn/1.10
  8. iTunes-iPad-M/3.2 (64GB)
  9. iTunes-iPhone/4.2.1 (4; 16GB)
  10. QuickTime/7.6.6 (qtver=7.6.6;cpu=IA32;os=Mac 10.6.6)
  11. AppleCoreMedia/ (iPad; U; CPU OS 4_2_1 like Mac OS X; en_us)
  12. Azureus;Mac OS X;Java 1.6.0_22
  13. QuickTime/7.6.9 (qtver=7.6.9;os=Windows NT 6.1)
  14. Drupal (+
  15. iTunes/9.0 (Windows; N)

The above all occur several thousand times each in our sample data, so they can’t be considered particularly unusual examples, unlike the following, which are some of the uncommon ones:

  1. BTWebClient/300B(24369)
  2. NSPlayer/12.00.7600.16385 WMFSDK/12.00.7600.16385
  3. iTunes/10.1.2 (000000000; 00000 000 00 0 000000) DDDDDDDDDDDDDDDDDDDD
  4. MLBot (
  5. Java/1.6.0_23
  6. Xenu’s Link Sleuth 1.1a
  7. Zune/4.7
  8. Mozilla/5.0 (compatible; Ezooms/1.0;
  9. lwp-request/5.818 libwww-perl/5.820
  10. DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +
  11. Google-Listen/1.1.4 (FRG83D)
  12. VeryCD \xb5\xe7\xc2\xbf v1.1.15 Build 110125 BETA
  13. iTMS
  14. FDM 3.x
  15. podcaster/3.7.3 CFNetwork/485.10.2 Darwin/10.3.1

Whilst I have selected these for being visually unusual, the only other information here is that they are ordered by frequency of occurrence in the sample data, and that the items on the second list appear fewer than 100 times each (sometimes only once or twice).

So, we’re looking for a common format to be able to parse this data. There is some helpful information on MSDN for IE and Microsoft environments, and similarly for Mozilla/Firefox-based browsers. Thus far my working model looks like this:

Application Name / Application Version (A compatibility flag; Browser Version or Platform; Miscellaneous Details) Even more miscellaneous details.
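As a rough sketch (group names are my own, not those of the project’s actual parser), this working model might translate into a regular expression like so:

```python
import re

# The working model: Name/Version (details) extras. Strings that break
# the template (e.g. list 2, item 10) simply fail to match and return None.
UA_MODEL = re.compile(
    r"^(?P<app>[^/\s]+)/(?P<version>\S+)"  # Application Name/Version
    r"(?: \((?P<details>[^)]*)\))?"        # optional parenthesised details
    r"(?: (?P<extra>.*))?$"                # optional trailing miscellany
)

ua = "iTunes/10.1.2 (Macintosh; Intel Mac OS X 10.6.6) AppleWebKit/533.19.4"
m = UA_MODEL.match(ua)
if m:
    print(m.group("app"))      # → iTunes
    print(m.group("version"))  # → 10.1.2
    print(m.group("details"))  # → Macintosh; Intel Mac OS X 10.6.6
    print(m.group("extra"))    # → AppleWebKit/533.19.4
```

The idea is to run this once at import time and store the extracted fields in their own database columns, so later queries never need to pattern-match inside raw strings.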

You can see from just the small sample above (list 2, item 10, breaks this format for starters) that fitting this template programmatically to the range of data is going to be difficult. But it is necessary if we are to easily answer questions about user experience, as performing searches requiring dynamic pattern matching within strings, on datasets comprising many millions of entries, is all but impractical.

Of course, the other key lesson here is not to read too much into such narrow sets of data. My first instinct was surprise at the distribution and quantity of hits for specific UA strings. In our latest sample, item 1 from the first list accounts for around a third of all hits for the day. This seems odd, because it suggests there are a *lot* of Windows 2000-based computers, running Internet Explorer 6, downloading our podcasts. Without the context of where these computers are, or what they are accessing, we might think people using a ten-year-old operating system and a web browser with many known security issues rather odd. I hope that the tools in development will allow me to look at these sorts of datasets, and then drill down for more information that can reveal more about examples such as this.


Posted in Quantitative, Tech-Heavy, WP3: Website Enhancement | 1 Comment

Feedback from Oxford Students (Part 2)

So far the data suggests that most listeners to Oxford’s podcasts are people outside the University of Oxford. Given that a core activity of the University is teaching and learning, it is essential to explore whether and how the podcasts influence the students at Oxford.

To help address this issue, a survey questionnaire was designed and delivered (initially) to a group of 3rd-year undergraduates in the English Faculty at Oxford who attended an optional lecture in Modern English from which the course lecturer produces podcasts. The questionnaire was delivered in class by the tutor, and 28 students (a 93% response rate) took the survey. Below we explore the survey and initial results.

Questions about podcasting in general

Q1 Have you ever listened to podcasts from any of the following sites? (Please tick all that apply)

Roughly half of respondents had listened to podcasts from Oxford’s iTunes U. This may be due to the following reasons:

  • The course tutor produces podcasts on Oxford’s iTunes U and has told their students about this.
  • High levels of news coverage for iTunes U.

It is also worth noting that nearly half of the respondents did not listen to any podcasts at all. This may be because the students had attended the lectures in person, and were surveyed in advance of any exams, before the need for revision was pressing. Continue reading

Posted in Qualitative, Tech-Lite, Uncategorized, WP2: Initial Rapid Analysis | Comments Off on Feedback from Oxford Students (Part 2)