Sifting signal from noise

Think of this as Connecting The Dots part II, as I want to pick up on some of the specifics touched on in that post. My apologies, but this is the tip of an iceberg and is quite a long post.

From the diagram in A Sky Full Of Stars we can identify around 29[1] locations where data can be collected, and in Fishing With A Broken Net we explore what our coverage of those locations is like. In this post I want to focus on explaining more about the data that we presently have available.

Thus far we have the following data sets to analyse:

  1. Apache Access Log files from the fileserver media.podcasts.ox.ac.uk (two servers – labelled Incy and Revisionist in the diagrams – load balanced and presented via that URL)
  2. Apache Access Log files from the Mobile Oxford Portal m.ox.ac.uk
  3. Summarised and sanitised data related to the iTunes U Portal provided weekly by Apple in an Excel spreadsheet file
  4. Summarised and sanitised data related to the Oxford Web Portal provided weekly as a PDF of the Google Analytics report generated by the JavaScript tool embedded into the portal

We also have some related supplementary datasets of interest:

  • A geolocation database allowing us to determine Country of Origin from IP address
  • An RSS feed of RSS feeds published by Oxford’s RSS Server (i.e. a list of all the feeds containing the publicly accessible catalogue information provided as RSS XML data)
  • Access to Reverse DNS information for IP addresses that provide such a reference

…and that’s about it for the moment.

A key problem is that these data sets are not easily related to each other, and we presently lack suitable tools to handle this information – but I’ll get back to that in another posting. Let’s look at what is in these datasets first.

1) Apache Access Log files – media.podcasts.ox.ac.uk

When a visitor’s computer requests a file from media.podcasts, an HTTP request is sent and picked up by one of two servers based at OUCS. This request is handled by the Apache web server software, and each request is summarised and written to a log file on that specific machine. These log files range in size from 1MB to 600MB uncompressed (to save space they are compressed into files ranging from 57KB to 27MB – roughly 5% of their original size) and contain between 4,000 and 1,800,000 logged requests.

This allows me to estimate that there are around 120 million requests to analyse in the available data. I’ll talk more about quantities and distribution of data in Fishing With A Broken Net; for now, let’s focus on what information is being logged.

Apache is configured to record log entries with the following format:

LogFormat "%{%Y-%m-%dT%H:%M:%S%z}t %v %A:%p %h %l %u \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""

This records (See Apache Common Log Format documentation for more details):

  • %{%Y-%m-%dT%H:%M:%S%z}t – Time the request was received, formatted as an ISO 8601 timestamp
  • %v – The canonical ServerName of the server serving the request.
  • %A:%p – Local IP address + “:” + The canonical port of the server serving the request
  • %h – Remote host
  • %l – Remote logname (from identd, if supplied). This will be a dash unless IdentityCheck is set On.
  • %u – Remote user (from auth; may be bogus if return status (%s) is 401)
  • “%r” – First line of request wrapped in quotes
  • %>s – Status. The “>” means that for requests that were internally redirected, this is the status of the *final* request rather than the original one (plain %s gives the original).
  • %b – Size of response in bytes, excluding HTTP headers. In CLF format, i.e. a ‘-’ rather than a 0 when no bytes are sent.
  • “%{Referer}i” – The contents of Referer: header line(s) in the request sent to the server.
  • “%{User-Agent}i” – The contents of User-Agent: header line(s) in the request sent to the server.
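To make the format concrete, here is a minimal parsing sketch in Python. This is my own illustration, not part of the service’s tooling, and the field names are labels I have chosen rather than anything Apache defines:

    import re

    # One named group per field in the LogFormat directive above.
    LOG_PATTERN = re.compile(
        r'(?P<time>\S+) (?P<vhost>\S+) (?P<local>\S+) (?P<host>\S+) '
        r'(?P<logname>\S+) (?P<user>\S+) "(?P<request>[^"]*)" '
        r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referer>[^"]*)" '
        r'"(?P<agent>[^"]*)"'
    )

    def parse_line(line):
        """Return a dict of fields for one log line, or None if it doesn't match."""
        match = LOG_PATTERN.match(line)
        return match.groupdict() if match else None

    # 192.0.2.1 is a documentation address standing in for a real visitor IP.
    sample = ('2009-01-28T06:32:44+0000 media.podcasts.ox.ac.uk 163.1.3.24:80 '
              '192.0.2.1 - - "GET /mat/quantum3-medium-audio.mp3 HTTP/1.1" '
              '200 3605971 "-" "Mozilla/4.0 (compatible; MSIE 5.5)"')
    print(parse_line(sample)['status'])  # -> 200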

Let’s look at a small number of sample records to illustrate this:

2009-01-28T06:32:44+0000 media.podcasts.ox.ac.uk 163.1.3.24:80 www.xxx.yyy.zzz - - "GET /mat/nanotechnology/quantum3-medium-audio.mp3?CAMEFROM=podcastsGET HTTP/1.1" 200 3605971 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"

  • “2009-01-28T06:32:44+0000” is fairly straightforward: 28th January 2009 at 6:32am and 44 seconds, GMT.
  • “media.podcasts.ox.ac.uk 163.1.3.24:80” is the name of our server and its address (and access port – 80, the standard for HTTP traffic).
  • “www.xxx.yyy.zzz” – this is an anonymised (for this posting) visitor IP address. Read What’s In An IP Address to learn more about what this can represent.
  • “- -” – These two dashes highlight useful information that public podcasting just doesn’t tend to provide. The first dash appears because there is no remote logname information; the second because there is no remote user information. The latter, in some circumstances, could have helped identify a person rather than a machine.
  • “GET /mat/nanotechnology/quantum3-medium-audio.mp3?CAMEFROM=podcastsGET HTTP/1.1” – This is actually the most useful bit of data here, but it isn’t always clear as to what it means. I’ll break this down further below.
  • “200” – The request was successful: the item requested was available and our server did its best to send the file to the visitor.
  • “3605971” – the number of bytes sent. Without further information such as the filesize of the requested file we can’t tell if this was the entire file, or just a portion of it.
  • “-” – no referrer information was supplied by the requesting computer, so no hint here as to how or where they found the link to this file they’re now being sent.
  • “Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)” – some information about the User Agent, i.e. the software used to request the file. Or at least, what the software wants us to believe is responsible for requesting the file. This can’t be trusted for many reasons, not least of which is spoofing. Take a look at http://www.useragentstring.com/ to get a breakdown of what your user agent looks like.

“GET /mat/nanotechnology/quantum3-medium-audio.mp3?CAMEFROM=podcastsGET HTTP/1.1”, aka the first line of the request, contains more information that needs further decoding and referencing. “GET” is the HTTP action being used; it does what it says on the tin. “/mat/nanotechnology/quantum3-medium-audio.mp3” is the path portion of the URL initially requested. Using a little insider knowledge about how things can be stored on this server, I can infer that this is related to our Materials department and comes from a feed related to Nanotechnology. The filename suggests this is the third item in the series, that it uses our medium-quality format, and that it is an audio-only file (.mp3). However, this structure is not consistently used, and to extract that background knowledge I had to know which departmental codes are used at Oxford, that there was once a publishing convention for file formats, and something about the Nanotechnology podcast series.

The “?CAMEFROM=podcastsGET” portion of the querystring gives a hint as to the source of the link, as many of the links in our central portals feature some form of “URL decoration” that should be specific to a particular source location. Because this is largely a manually set feature, it too is not consistently used and cannot be completely trusted. Finally, “HTTP/1.1” says which version of the HTTP protocol is being used.
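For those who want to decode request lines programmatically, a short sketch follows. Again this is my own illustration; the path-based inferences lean on the inconsistent conventions just described, so treat the results as hints rather than facts:

    from urllib.parse import urlsplit, parse_qs

    def decode_request(request_line):
        """Split 'GET /path?query HTTP/1.x' into its useful parts."""
        method, url, protocol = request_line.split(' ', 2)
        parts = urlsplit(url)
        segments = [s for s in parts.path.split('/') if s]
        query = parse_qs(parts.query)
        return {
            'method': method,                              # e.g. GET
            'protocol': protocol,                          # e.g. HTTP/1.1
            'unit': segments[0] if segments else None,     # e.g. 'mat' (Materials)
            'filename': segments[-1] if segments else None,
            'camefrom': query.get('CAMEFROM', [None])[0],  # the URL decoration
        }

    info = decode_request('GET /mat/nanotechnology/quantum3-medium-audio.mp3'
                          '?CAMEFROM=podcastsGET HTTP/1.1')
    print(info['unit'], info['camefrom'])  # -> mat podcastsGET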

Implicit in this request is that the file has had one request at this time and date – typically termed a “hit” on the file, a common metric used to measure impact through quantity. Hits are sometimes helpful in a general trend sense, but a hit should not be taken literally to mean that a whole file has been downloaded and viewed by a given visitor.

2009-01-28T17:27:16+0000 media.podcasts.ox.ac.uk 163.1.3.25:80 www.xxx.yyy.zzz - - "GET /devoff/alumni2007_cancer/AlbumCover.png HTTP/1.1" 200 80161 "http://www.sciencelive.org/content/view/131/96/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_4_11; en) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1"

Let’s pick up the differences highlighted in this sample.

  • “/devoff/alumni2007_cancer/AlbumCover.png” – lets me know that this is a static image file being requested, what we call the Album Art. Again, background knowledge allows me to interpret which department and likely podcast feed this Album Art belongs to.
  • “http://www.sciencelive.org/content/view/131/96/” is an example of a referrer link, something of interest in the Referrer Analysis section of the TIDSR. Loading this link into a browser would take us to one of the Steeple Ensemble Aggregator Demonstrators (unfortunately suffering from a terminal problem presently).
  • “Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_4_11; en) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1” is a much more complete User Agent string, but this can be boiled down to saying a visitor was using the Safari Web Browser on their Apple Mac computer.

2009-01-28T17:59:08+0000 media.podcasts.ox.ac.uk 163.1.3.25:80 www.xxx.yyy.zzz - - "GET /devoff/campaign_video/album_cover.png HTTP/1.1" 206 106083 "-" "VLC media player - version 0.9.2 Grishenko - (c) 1996-2008 the VideoLAN team"

The User Agent here, “VLC media player – version 0.9.2 Grishenko – (c) 1996-2008 the VideoLAN team”, is a little more interesting as it identifies itself as a media playback application, not a typical web browser or the iTunes application. There are no clues here as to the computer platform in use, but it does suggest that not everyone relies solely on the web portal or the iTunes U site.

2009-01-28T17:48:24+0000 media.podcasts.ox.ac.uk 163.1.3.24:80 www.xxx.yyy.zzz - - "GET / HTTP/1.0" 200 207 "-" "check_http/1.96 (nagios-plugins 1.4.5)"

This post is about sifting signal from noise, and this entry is an example of noise in terms of impact monitoring. It comes from an automated software check, essentially verifying that our server is responding to HTTP requests. There is no specific media file being requested (as shown by “GET / HTTP/1.0”) and the requester identifies itself as part of the Nagios monitoring software suite. If the IP address can be traced, we may find that this is being done by our own system administrators, or that someone else is sufficiently interested in our hosting to have set up an automatic check.
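Building on the parse_line sketch earlier, a crude noise filter might look something like this. The user agent substrings come from the samples in this post; a real filter would need a much longer and regularly maintained list:

    # User-agent fragments seen in the noise examples in this post.
    NOISE_AGENTS = ('check_http', 'nagios', 'googlebot', 'gsa-crawler')

    def is_noise(record):
        """True if a parsed log record looks like monitoring or crawler traffic."""
        agent = record.get('agent', '').lower()
        if any(marker in agent for marker in NOISE_AGENTS):
            return True
        # Requests for '/' or robots.txt are housekeeping, not media downloads.
        parts = record.get('request', '').split()
        path = parts[1] if len(parts) > 1 else ''
        return path in ('/', '/robots.txt')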

2009-01-28T17:59:48+0000 media.podcasts.ox.ac.uk 163.1.3.24:80 www.xxx.yyy.zzz - - "GET /devoff/campaign_video/clip4-medium-video.mp4?CAMEFROM=podcastsGET HTTP/1.1" 304 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

There are a few interesting quirks to this entry.

  • “clip4-medium-video.mp4?CAMEFROM=podcastsGET” suggests that the request was based on a link published on our Web Portal and for a video file related to Oxford’s fundraising campaign.
  • The “304” status code is telling the requestor that the file has not been modified. This suggests that this requestor has asked for the file before and wants to know if any changes have been made…
  • … and the User Agent would suggest the obvious reason why: this is an automated scan by Google’s search engine looking for updates to its search listings.

This result is almost insignificant unless you’re particularly interested in knowing how often your site is being indexed by these search engines. In effect, this is noise.

2009-01-28T20:43:51+0000 media.podcasts.ox.ac.uk 163.1.3.24:80 www.xxx.yyy.zzz - - "GET /robots.txt HTTP/1.0" 404 208 "-" "gsa-crawler (Enterprise; S5-L9K7Q4E5SESJB; googlebox@oucs.ox.ac.uk)"

Another search-related example, though again noise:

  • “GET /robots.txt HTTP/1.0” is a typical URL for a search engine wanting to know what it is and is not allowed to index.
  • “404” is a popular/infamous status code, saying that the requested file does not exist.
  • The “gsa-crawler (Enterprise; S5-L9K7Q4E5SESJB; googlebox@oucs.ox.ac.uk)” User Agent string gives a rather large hint that this is from a local Google Search Appliance.

One key thing to understand is that these examples are FAR from comprehensive in terms of the range of similar log entries. For example, another significant category of log entries are those generated by internet caching agents (such as Content Delivery Networks like Akamai). Casual observation of the raw data also suggests that popularity varies over time for certain items or sites, as does the type of visitor those podcasts attract – e.g. the variation in entries for a popular item is greater than for less popular material.

2) Apache Access Log files – m.ox.ac.uk

These are very similar to the media.podcasts logs, though with a slightly rearranged ordering of the fields. Initial sampling of this data suggests traffic levels related to podcasting are low and largely dominated by search engines.

3) Apple Weekly iTunes U Data

Apple emails iTunes U site managers a weekly Excel-based data report that contains a sanitised (i.e. pre-processed/filtered/limited) set of results presented in six categories: Summary; Tracks; Browse; Edits; Previews; Users. Whilst the supporting documentation gives a reasonable account of what this data represents, it is still unconfirmed and unsupported information, and occasionally gives rise to trends that seem unbelievable. Part of the work for LfI and the Podcasting Service is to find a relationship between data we can observe and verify, and the information third parties are giving us.

Let’s have a quick overview of what’s available in these spreadsheets:

  • Summary – focuses on overall totals for a given week, with a leaning towards platform differentiation (are you on an iPad, iPod, Mac or PC?) and an attempt at highlighting actions within the iTunes interface (download, subscribe, preview, editpage). The headline figure that has underpinned impact analysis so far at Oxford is the Weekly Total Downloads, calculated as DownloadTrack + DownloadTracks + DownloadiOS + SubscriptionEnclosure (see the sketch after this list).
  • 4 worksheets cover Tracks, each representing a week of downloads on a track-by-track basis. Each provides a “Path” (something that has changed with the evolution of iTunes U), a “Count” of the number of downloads for that track, a “Handle” (which, whilst unique, can change from week to week depending on certain site actions) and a “GUID” field (which didn’t appear until nearly 6 months after the launch of iTunes U in Europe). The GUID is derived from data provided to Apple in our RSS feeds, and should theoretically be both unique and linkable to our RSS catalogue data.
  • 4 worksheets cover the Browse functions – a Path (it is not clear what this actually relates to), a Count, a Handle, and occasionally a GUID.
  • 4 worksheets cover Edit functions; however, since the change to the Public Site Manager in iTunes U, this sheet has been empty (largely because editing is now done outside of the iTunes U interface, unlike the previous system and the internal-only iTunes U sites).
  • 4 worksheets cover Previews, which to quote Apple means: “The Previews sheet lists all the tracks users previewed that week through the DownloadPreview and DownloadPreviewiOS actions, including track paths, counts, handles, and GUIDs. Previewing tracks does not result in the track actually being downloaded to the user’s iTunes U library.” AFAIK, this means playing a podcast within the iTunes U interface or via the video player on an iOS device. The file is still technically downloaded (as this is not a streamed system), but it is then discarded after playback. Also, the extent of the playback is unknown, so it may be 10 seconds, 60 seconds, or the full length. I.e. we can’t tell what the visitor is doing from this number, and assuming it means the same as a download (which is often taken to imply that the visitor will watch it) is not a good assumption.
  • 4 worksheets cover Users, which makes more sense when users have to authenticate to use the software – only necessary for internal iTunes U sites (the private iTunes U setups). Oxford doesn’t use such a private site, so all of our users are anonymous; this sheet of data reflects that, and more recently tends to show a count that is almost equal to all the categories measured together for a week.

The limitation of this data set is that it needs to be combined with external information to be grouped and interpreted effectively. Having only an overall “count” for an item is the same as counting a “hit” on a file, but without any contextual information (such as a datetime, user agent, referrer, etc.)[2].
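To make the Summary arithmetic mentioned above concrete, here is a sketch of how the headline figure could be recomputed from the spreadsheet. The filename, sheet name and column names are assumptions based on Apple’s documentation, so the real layout may well differ:

    import pandas as pd

    # Filename, sheet name and column names are assumptions for illustration.
    summary = pd.read_excel('itunesu_weekly_report.xls', sheet_name='Summary')

    DOWNLOAD_ACTIONS = ['DownloadTrack', 'DownloadTracks',
                        'DownloadiOS', 'SubscriptionEnclosure']

    # Weekly Total Downloads = the sum of the four download-type actions.
    weekly_total_downloads = summary[DOWNLOAD_ACTIONS].sum(axis=1).sum()
    print('Weekly Total Downloads:', weekly_total_downloads)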

4) Google Analytics summary of Oxford Web Portal

I’m going to leave this section blank for the time being as this data has only recently been acquired and needs further investigation. I’ll update this posting for completeness once that’s done.

*WORK IN PROGRESS HERE* 🙂

Other related datasets

So, we’ve touched on the file-access data; let’s take a quick overview of some more datasets that could be combined with some of the above to help answer more interesting questions.

Access to Reverse DNS information for IP addresses that provide such a reference

Ignoring the many weaknesses of what an IP address represents, it can be informative if such an address has a Domain Name associated with it. Domain Names are the human-friendly web addresses we use all the time (www.ox.ac.uk, www.google.com, etc.), and the process of finding something on the internet relies on the Domain Name System (DNS), which translates these textual names into the numeric IP addresses used to navigate the networks between you and the webserver. Reverse DNS (rDNS) allows you (where applicable) to take an IP address and look up its Domain Name. Unfortunately, not all IP addresses have a Domain Name associated with them (and many IP addresses have multiple domain names). Testing a 3-week sample of the data with a free analysis tool called Analog (more on that in other posts) came back with the following summary.

Analog's Domain Name Analysis of a sample of data from media.podcasts
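For reference, a single reverse lookup is straightforward to perform. A minimal Python sketch follows (the example address is a placeholder, and any serious analysis would need to cache these lookups rather than repeat them per request):

    import socket

    def reverse_dns(ip):
        """Return the domain name registered for an IP address, or None."""
        try:
            hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
            return hostname
        except socket.herror:
            # No PTR record: the address has no domain name associated with it.
            return None

    print(reverse_dns('8.8.8.8'))  # placeholder address, e.g. -> dns.google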

The chart shows that around 62% of the requests to media.podcasts came from IP addresses without an associated Domain Name. If you were to rely on this information to answer “where do your visitors come from?”, using the Top Level Domain as a gauge, then you would have to ignore 62% of your visitors because they’re unknown. This nicely leads us onto…

A geolocation database allowing us to determine Country of Origin from IP address

There has been growing demand for geospatial information about web visitors in recent years, and work has been done by a range of parties (mostly commercial organisations) to try to answer that question based in part on your IP address. Whilst this approach has many weaknesses (for example: VPN connections can make you appear to be somewhere you are not; errors in the databases; proxy servers between you and your visitor; etc.), for country-level locating these weaknesses are largely negligible. We are looking at using a suitable country-level GeoIP database to help us. As a test, and using a manual approach (as Excel does have its limits), I took a sample of IP addresses from the media.podcasts data, married them up to the GeoIP data, and produced the following snapshot of where some of our visitors were coming from.

Testing a sample of visitors' IPs against a GeoIP database
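A sketch of how such a lookup could be automated, assuming a MaxMind-style country database and the geoip2 Python module – we have not yet settled on a specific database, so both names here are placeholders:

    from collections import Counter

    import geoip2.database
    from geoip2.errors import AddressNotFoundError

    # GeoLite2-Country.mmdb stands in for whatever country-level database we adopt.
    reader = geoip2.database.Reader('GeoLite2-Country.mmdb')

    def country_of(ip):
        """Map an IP address to an ISO country code, or None if unknown."""
        try:
            return reader.country(ip).country.iso_code
        except AddressNotFoundError:
            return None

    # sample_ips would be drawn from the media.podcasts log data.
    sample_ips = ['192.0.2.1', '198.51.100.7']
    print(Counter(country_of(ip) for ip in sample_ips).most_common())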

Even with this crude (and inaccurate) comparison, you can see a huge difference between the number of Chinese visitors in the GeoIP chart above and in the Domain Name based chart. We hope to use a similar technique to look at a larger and more representative sample of our data during this project.

An RSS feed of RSS feeds published by Oxford’s RSS Server

The RSS^2 (RSS Squared) data feed is an RSS XML listing of all the RSS podcasting feeds published by Oxford – i.e. a list of all the feeds containing the publicly accessible podcast catalogue information provided as RSS XML data. Why do we need this? Well, without it we can’t group our podcasting data into terms we might understand, such as the number of downloads for a particular podcasting feed. To answer any query about the performance of a specific podcast, we would first have to reference this catalogue just to get the URL directing us to the fileserver hosting that podcast, and then filter that data manually to produce an answer. Doing that for one file can presently take in excess of 3 hours; repeating it manually for a podcast feed with 10 items becomes very wasteful. Answering questions that compare OER material to podcasts released without a CC licence is near impossible without being able to build two sets of URLs based on the information exposed by the RSS. To further complicate matters, the RSS Server does not provide any search functionality, so custom queries against that data are not possible. There are also many more questions that rely on implicit knowledge which can only be found programmatically by searching this RSS data, but I’ll talk about those more in an FAQ and Misunderstandings post later.
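As an illustration, walking the RSS^2 feed to build that grouping might look like the sketch below. The feed address is a placeholder and the element names assume standard RSS 2.0:

    import xml.etree.ElementTree as ET
    from urllib.request import urlopen

    # Placeholder address; the real RSS^2 URL is not given here.
    RSS_SQUARED_URL = 'http://rss.oucs.ox.ac.uk/feeds.rss'

    with urlopen(RSS_SQUARED_URL) as response:
        tree = ET.parse(response)

    # In standard RSS 2.0, each <item><link> in the master feed should point
    # at one podcast feed; fetching those in turn yields the media file URLs
    # needed to group log requests by feed.
    feed_urls = [item.findtext('link') for item in tree.iter('item')]
    print(len(feed_urls), 'podcast feeds catalogued')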

A few last points on Signal and Noise

As we’re already overloading ourselves with detail by this stage, I’ll briefly return to the central theme behind this post – working out what data to use and what to discard. As I mentioned, there are an estimated 120 million items of data to analyse for this project, far more than can reasonably be handled using simplistic means. Many of the impact-related questions require some subtle data crafting and careful filtering of information, and most often the joining of several datasets.

One observation from past attempts to look at the log files concerns the choice of which HTTP status codes to respect and which to discard. This is clearly illustrated by the pie chart below, based on a recent 3-week sample of data.

Analog Status Codes - Sample 1

If this is taken at face value, convention would suggest we ignore anything that isn’t a 200 OK response code. That leaves an awful lot of data being excluded without understanding what it represents, and it may be significant – one conversation has suggested that the 206 code relates to the activities of Content Delivery Networks (systems that cache your podcast data around the internet for faster access) and how they “touch” your server each time they allow a file to be downloaded. Given that we know Apple uses a CDN to support the iTunes U store, ignoring this data could mean ignoring many of the downloads generated by the iTunes U site. I feel this is too significant to be left to ignorance and simplicity, and it thus requires further study.
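A quick way to quantify what a “200s only” filter would discard, again building on the parse_line sketch from earlier in this post:

    from collections import Counter

    # Parse a whole log file with parse_line from the earlier sketch;
    # 'access.log' is a placeholder filename.
    records = (parse_line(line) for line in open('access.log'))

    status_counts = Counter(rec['status'] for rec in records if rec)
    total = sum(status_counts.values())

    for status, count in status_counts.most_common():
        # 206 (Partial Content) marks byte-range requests - the kind of
        # traffic CDNs and media players generate - so discarding it
        # blindly may discard real downloads.
        print(f'{status}: {count} ({100 * count / total:.1f}%)')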

I hope in a later Signal to Noise type posting to be able to demonstrate the quantities of data being handled and the filtering done that leads to our information for analysis. All is not as it may seem.

Carl

Footnotes:

  1. This has been revised up as of 19-11-2010 as more recent analysis of our catalogue feeds identified more webhosts linked to by the podcast URLs.
  2. Previous attempts to manually compare samples of data against these figures have usually resulted in confusion and no clear link between the quantities. I feel this has been a result of oversimplification when trying to make comparisons and ineffective filtering being applied to the datasets. I hope this will be resolved by work done on this project to more clearly understand the datasets and to build a tool capable of providing the linkage and filtering needed to clarify this data.