As part of a suite of software tools we’re developing to analyse our disparate sources of tracking data, I have been looking at log files (e.g. those produced by Apache Web Server) and, in particular, at how to break down and analyse the User Agent string.
The UA string (when supplied; some requests are made without one) gives us some hint of what software and systems are being used to access our content. It isn’t authoritative. The UA is easy to spoof, and plenty of people do so, typically to gain access to sites that differentiate based on your software or computer (e.g. if this is an iOS device, don’t try to display Adobe Flash applications; if this is Internet Explorer 6.0, tell them to go away and update their browser). But in the main, and with a large enough dataset, it is fair to look at these strings and seek patterns, if only to understand better what sort of user experience our downloaders are having. Does their machine natively support our video formats? Are they using the iTunes application? Should we be providing format X because we have so many people on platform Y?
Anyhow, I have been trying to understand the pattern of such strings, and compare that against a sample of our data, so as to build a tool that is flexible enough to import this information, and detailed enough to help us break down the information into usable categories. This is proving non-trivial, so I thought I’d share some of my findings and analysis so far.
In terms of variation, I have been looking at small samples of data (two or three days of logs), one from 2009 and another from very recently, in 2011. There were 130 unique UA strings in the 2009 data, and over 750 in the 2011 sample. The format of the string varies depending on platform, but to give some examples:
- Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)
- Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:188.8.131.52) Gecko/20091102 Firefox/3.5.5
- Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
- iTunes/10.1.2 (Macintosh; Intel Mac OS X 10.6.6) AppleWebKit/533.19.4
- iTunes-iPad-M/4.2.1 (64GB)
- iTunes/10.1.2 (Windows; Microsoft Windows 7 Business Edition (Build 7600)) AppleWebKit/533.19.4
- curl/7.18.2 (i486-pc-linux-gnu) libcurl/7.18.2 OpenSSL/0.9.8g zlib/184.108.40.206 libidn/1.10
- iTunes-iPad-M/3.2 (64GB)
- iTunes-iPhone/4.2.1 (4; 16GB)
- QuickTime/7.6.6 (qtver=7.6.6;cpu=IA32;os=Mac 10.6.6)
- AppleCoreMedia/220.127.116.11C148 (iPad; U; CPU OS 4_2_1 like Mac OS X; en_us)
- Azureus 18.104.22.168;Mac OS X;Java 1.6.0_22
- QuickTime/7.6.9 (qtver=7.6.9;os=Windows NT 6.1)
- Drupal (+http://drupal.org/)
- iTunes/9.0 (Windows; N)
The above all occur several thousand times each in our sample data, so they can’t be considered particularly unusual, unlike the following, which are some of the less common examples:
- NSPlayer/12.00.7600.16385 WMFSDK/12.00.7600.16385
- iTunes/10.1.2 (000000000; 00000 000 00 0 000000) DDDDDDDDDDDDDDDDDDDD
- MLBot (www.metadatalabs.com/mlbot)
- Xenu’s Link Sleuth 1.1a
- Mozilla/5.0 (compatible; Ezooms/1.0; email@example.com)
- lwp-request/5.818 libwww-perl/5.820
- DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
- Google-Listen/1.1.4 (FRG83D)
- VeryCD \xb5\xe7\xc2\xbf v1.1.15 Build 110125 BETA
- FDM 3.x
- podcaster/3.7.3 CFNetwork/485.10.2 Darwin/10.3.1
Whilst I have selected these for being visually unusual, the only other information here is that they are ordered by how frequently they occur in the sample data, and that the items on the second list appear fewer than 100 times each (sometimes only once or twice).
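For anyone wanting to produce similar frequency counts from their own logs, here is a minimal sketch in Python. It assumes the Apache "combined" log format, where the User Agent is the last double-quoted field on each line; the function name and the sample log lines are invented for illustration and are not taken from our tools or our data.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format, in which the User Agent is the
# last double-quoted field on the line. A custom LogFormat needs a different
# pattern, and UA strings containing escaped quotes are not handled here.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Return a Counter mapping each UA string to its number of hits."""
    counts = Counter()
    for line in lines:
        match = UA_FIELD.search(line)
        if match is None:
            continue  # line does not end with a quoted field, skip it
        # Apache writes "-" when the request carried no User Agent header.
        counts[match.group(1)] += 1
    return counts


# Hypothetical log lines, made up for this example.
sample = [
    '192.0.2.1 - - [01/Feb/2011:10:00:00 +0000] "GET /a.mp3 HTTP/1.1" 200 123 "-" "iTunes/9.0 (Windows; N)"',
    '192.0.2.2 - - [01/Feb/2011:10:00:05 +0000] "GET /b.mp3 HTTP/1.1" 200 456 "-" "FDM 3.x"',
    '192.0.2.3 - - [01/Feb/2011:10:00:09 +0000] "GET /a.mp3 HTTP/1.1" 200 123 "-" "iTunes/9.0 (Windows; N)"',
    '192.0.2.4 - - [01/Feb/2011:10:00:12 +0000] "GET /c.mp3 HTTP/1.1" 200 789 "-" "-"',
]

counts = count_user_agents(sample)
for ua, hits in counts.most_common():
    print(hits, ua)
```

`len(counts)` gives the number of unique UA strings, and `most_common()` gives the ordering by frequency used for the two lists above.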
So, we’re looking for a common format to be able to parse this data. There is some helpful information on MSDN for IE and Microsoft environments, and similar documentation for Mozilla/Firefox based browsers. Thus far my working model looks like this:
Application Name / Application Version (A compatibility flag; Browser Version or Platform; Miscellaneous Details) Even more miscellaneous details.
You can see from just the small sample above that fitting this template programmatically to the full range of data is going to be difficult (List 2, Item 10 breaks the format for starters). It is going to be necessary, though, if we are to answer questions about user experience easily: performing searches that require dynamic pattern matching within strings, on datasets comprising many millions of entries, is close to impractical.
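To make the working model concrete, here is a rough sketch of how the template might be expressed as a regular expression in Python. This is an illustration under my own assumptions rather than the parser we will end up with: the field names (`name`, `version`, `details`, `extra`) are invented for the example, and the pattern is deliberately loose.

```python
import re

# A loose rendering of the working template:
#   Name/Version (detail; detail; ...) extra details
# All group names are illustrative assumptions.
UA_TEMPLATE = re.compile(
    r'^(?P<name>[^/\s(]+)'          # application name, up to "/", space or "("
    r'(?:/(?P<version>[^\s(]+))?'   # optional "/version"
    r'\s*'
    r'(?:\((?P<details>.*)\))?'     # optional bracketed block; greedy so that
                                    # nested brackets such as "(Build 7600)"
                                    # stay inside it
    r'\s*(?P<extra>.*)$'            # whatever is left over
)


def parse_user_agent(ua):
    """Split a UA string into template fields, or return None if it cannot fit."""
    match = UA_TEMPLATE.match(ua.strip())
    if match is None:
        return None
    fields = match.groupdict()
    details = fields["details"]
    # Break the bracketed block into its semicolon-separated parts.
    fields["details"] = [p.strip() for p in details.split(";")] if details else []
    return fields


for ua in (
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)",
    "iTunes/10.1.2 (Windows; Microsoft Windows 7 Business Edition (Build 7600)) AppleWebKit/533.19.4",
    "FDM 3.x",
):
    print(parse_user_agent(ua))
```

The first two strings fit reasonably well. "FDM 3.x" does not: with no slash, the version lands in the leftover field and `version` comes back as `None`. The greedy bracket match has a cost too, since a string with two separate bracketed groups would have everything between the first "(" and the last ")" lumped into `details`. Strings that fit the template only partially would need rules of their own.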
Of course, the other key lesson here is not to read too much into such narrow sets of data. My first instinct was surprise at the distribution and quantity of hits for specific UA strings. In our latest sample, item 1 from the first list accounts for around a third of all hits for the day. This seems odd, because it suggests there are a *lot* of Windows 2000 based computers, running Internet Explorer 6, downloading our podcasts. Without knowing where these computers are, or what they are accessing, it seems rather odd that so many people would be using a 10 year old operating system and a web browser with many known security issues. I hope that the tools in development will allow me to look at these sorts of data sets, and then drill down for more information that can reveal more about examples such as this.