We have talked about how we collect data from our webservers, that there is a (presently growing) range of webservers in the ecosystem we want to observe, and that we can only observe a limited number of them. In this post, we will explore where those holes are, look at the size of the catch, and then assess whether this is sufficient to paint a clear, representative picture of our activities.
Where are the holes?
Let's start by looking at our knowledge gaps from a timeline perspective. The following graphic highlights the periods for which we have data available to analyse, and the source of that data.
This timeline shows that our longest run of data comes from the Apple-provided iTunes U summary spreadsheets. Whilst file hosting had been available before Oct 2008 (the start of this timeline), there was no active log file collection before Feb 2009, and due to constraints on log storage space, data prior to that period has been erased. m.ox launched around April 2010 and we have been lucky enough to gather data from that portal since its launch. As more data sources are made available, we will endeavour to update this graphic.
A key decision for this project concerns the scope of the analysis, specifically the time period to cover: both the total amount of data to analyse and the time needed to do so exceed the project's resources, and not all data sources are available for the same periods.
This next graphic looks at the total number of podcasts on offer from Oxford, and what proportion of them are on central servers for which we have log files.
The blue area represents the 63.9% of podcasts offered by Oxford that are hosted on servers we have collected data for. The fishes that can be seen through the holes represent the podcasts for which we have no related server log information.
How big is the catch anyway?
We've previously touched on the quantity of data available. A few sample statistics on the data to analyse:
- An estimated 120,000,000 file requests to analyse in the available media.podcasts server log data.
- 112 weeks of summary data from Apple via Excel spreadsheets, including download counts for approximately 1,000 items each week.
- 7 months of file requests to parse and analyse from the m.ox portal.
- 18 weeks of Google Analytics summary reports provided as PDF files.
- In excess of 2,000 podcasts presently published by Oxford across 285 RSS feeds.
In short, this is quantitatively a *lot* of fish/data to sort through. The tools available are not all that powerful, which makes the demand on human resourcefulness that much higher, and a full analysis would exceed this project's timescale.
The data is not evenly distributed over time either. A quick look at the quantities gathered over time produces the following graphic:
Quickly apparent from this is that nearly two thirds of our log file data has been collected in the last few months (which tallies with a surge in demand reported in the Apple iTunes U summary data). Whilst understanding this recent surge is important to our management, it needs to be compared with our historical data for some questions to be answered. This uneven distribution of data also feeds into our scoping decisions.
Is the Apple iTunes U portal significant?
The iTunes U portal is believed to be the largest driver of visitors to Oxford Podcasts; however, there are a few differences between our primary Web Portal and our iTunes U portal to understand. First, not everything on the Web Portal is made available via iTunes U – indeed, only 76% of our podcasts are also in iTunes U.
The following graphic looks at the number of podcasts within iTunes U that we have server log details for presently.
One key statistic we’ve not yet been able to determine is the proportion of traffic from iTunes U compared to other portals. A quick look at a small sample of data would suggest this to be around 80%.
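The quick sampling method behind that 80% estimate (described in the footnotes) amounts to a couple of substring counts over a day's server log. A minimal sketch, assuming a standard text access log – the sample lines below are invented for illustration:

```python
def itunesu_proportion(log_lines):
    """Proportion of requests whose URL carries the CAMEFROM=itunesu tag.

    A crude estimate: the tagged URLs are only advertised inside the
    iTunes U portal, so the ratio approximates its share of traffic.
    """
    total = tagged = 0
    for line in log_lines:
        total += 1
        if "CAMEFROM=itunesu" in line:
            tagged += 1
    return tagged / total if total else 0.0

# Invented log lines for illustration (real data is read from the log files):
sample = [
    "GET /conted/lecture1.mp3?CAMEFROM=itunesu HTTP/1.1",
    "GET /conted/lecture1.mp3 HTTP/1.1",
    "GET /maths/lecture4.mp3?CAMEFROM=itunesu HTTP/1.1",
    "GET /maths/lecture4.mp3?CAMEFROM=itunesu HTTP/1.1",
]
print(itunesu_proportion(sample))  # → 0.75
```

In practice the log lines would be streamed from a day's log file rather than held in a list, but the counting logic is the same.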
How much might we have missed?
There are a number of overlapping factors to consider, but one issue that has been worrying me is whether we have enough data to be representative of our entire collection. To know that, I need to combine some measure of total downloads in a given period with the list of hosts for which we have data. I can now do that by using a 4 week sample of data from Apple to get the downloads (as recorded by Apple) and comparing them against where we believe the files are located.
Now, you might look at that and conclude that 73% isn’t much different from 72%. However, that conclusion would only be valid if the number of downloads was fairly equal amongst each podcast – something which is far from the truth. To highlight this inequality, look at the following charts…
Figure 6 looks at a random week's worth of downloads for the contents of the iTunes U store, as reported by Apple. There are at least 1400 items with one or more downloads in this sample, yet everything looks insignificant next to the small number of very popular items for this week. Whilst the maximum number of downloads varies from week to week, the overall pattern – a few very popular items dwarfing the rest of our catalogue – tends to be the same. Let's look at the same data, but this time with a logarithmic vertical scale:
This chart shows the distribution of downloads over the entire catalogue in more detail. Roughly 50% of the items have 10 or fewer downloads a week, whilst nearly 40% have between 10 and 100 downloads a week.
Whilst the numbers need to be confirmed over the whole data set, our operating estimates are that our Top 3 podcasts account for 50% of our downloads in a given week, and our Top 10 podcasts for 80% of the total. Clearly, if we did not have access to the hosting data for the most popular items, the holes in our data would be far larger than the raw proportion of covered podcasts would suggest.
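To illustrate how such Top-N share figures fall out of a week's download counts, here is a minimal Python sketch. The counts are made-up numbers shaped like our distribution, not our real data:

```python
def top_n_share(downloads, n):
    """Fraction of a week's total downloads captured by the n most popular items."""
    ranked = sorted(downloads, reverse=True)
    total = sum(ranked)
    return sum(ranked[:n]) / total if total else 0.0

# Illustrative weekly counts: three big hits, a modest middle,
# and a long tail of around a hundred small items.
week = [2500, 1500, 1000, 700, 600, 500, 400, 350, 250, 200] + [20] * 100
print(top_n_share(week, 3))   # → 0.5
print(top_n_share(week, 10))  # → 0.8
```

Running the same calculation over each week of the Apple spreadsheets would confirm (or correct) the operating estimates above.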
Explaining the factors that impact on how successful a podcast can be, and therefore explaining these graphs, is an objective of this project.
Considering the above distribution, having data for 73% of the podcasts downloaded from iTunes U (a portal representing nearly 80% of the total for Oxford) is not a bad sample on which to base our analysis. Indeed, if we hypothetically add in coverage of 64% of the remaining 20% of downloads, we could suggest that we have (0.73 × 80%) + (0.64 × 20%) ≈ 71% of all the relevant file request data. Not bad for a large net with many holes in it.
You’re talking fishing… but what have you actually done to calculate the above?
The above results come from combining data from multiple sources (Apple, log files and RSS feeds) using a combination of Python scripts (largely to extract and parse the RSS catalogue data) and Excel. Whilst there are extra notes in the footnotes below, the following summarises the data gathering method behind each figure.
Fig 1: Manual inspection of data sources identifying earliest start dates and that we have continuous information to the present day.
Fig 2: This involves parsing the RSS catalogue feeds to determine the breakdown of podcasts by host. A simple Python script incorporating the Feedparser library allows me to break down the master RSS/OPML feed, which lists all the podcasting RSS feeds from Oxford. Parsing these 285 RSS feeds then gives me the number of items/podcasts and their file URLs. Importing this data into Excel allows me to split the URLs and create a column of fileservers, and a pivot table then counts the number of files on each server. Combining the number of files hosted on servers we have data for and dividing by the total gives the percentage. The diagram is a 50 by 50 grid, with each cell representing 0.04% of the total number of files.
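The host-counting step can be sketched in a few lines of Python. In practice the URL list comes from walking the 285 feeds with feedparser; the example URLs and host names below are hypothetical stand-ins:

```python
from collections import Counter
from urllib.parse import urlparse

def host_breakdown(file_urls):
    """Count podcast files per hosting server (the pivot-table step in Excel)."""
    return Counter(urlparse(url).netloc for url in file_urls)

def coverage(host_counts, hosts_with_logs):
    """Fraction of files sitting on servers for which we hold log data."""
    total = sum(host_counts.values())
    covered = sum(n for host, n in host_counts.items() if host in hosts_with_logs)
    return covered / total if total else 0.0

# Hypothetical file URLs standing in for the parsed RSS catalogue:
urls = [
    "http://media.podcasts.ox.ac.uk/conted/lecture1.mp3",
    "http://media.podcasts.ox.ac.uk/conted/lecture2.mp3",
    "http://some.department.ox.ac.uk/feeds/lecture3.mp3",
    "http://media.podcasts.ox.ac.uk/maths/lecture4.mp3",
]
counts = host_breakdown(urls)
print(coverage(counts, {"media.podcasts.ox.ac.uk"}))  # → 0.75
```

The same two functions, fed the real catalogue, give the 63.9% figure shown in the graphic.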
Fig 3: This is fairly easy. Do a directory listing of the log files (each of which represents roughly a day's worth of data) and collect the file sizes into an Excel spreadsheet. Use an area chart to plot the values over time and you have this chart.
Fig 4: Similar to the approach for Fig 2; however, the RSS URLs advertised in the master RSS feed need modifying to the URLs supplied to iTunes U (a change which reflects the publish status of each item). This reduces the file/item count as explained. From there, the data is put into Excel and the chart produced in the same way again.
Fig 5: This is where life gets harder. To determine the proportion of requests (i.e. log file data) against the various hosts, we need to combine the weekly download statistics from Apple's spreadsheet with URLs that let us determine fileservers. Unfortunately, Apple doesn't include URLs in their track download data, so we need to match URLs to download counts via a "guid" field. The guid data is also present in the RSS data, so that is imported into Excel to match against the URL data. The workaround discussed in footnote 5 is then applied to put a fileserver next to each track's download data, and from there a pivot table produces the number of downloads per fileserver. This data is then translated to percentages and charted in the same way as Fig 2 and Fig 4. As a time-consuming bonus, I also had to manually determine which podcasts were available in iTunes U compared to our Web Portal, so as to highlight any files that might be on iTunes U but not on the Web Portal (there were a small number, mostly related to recent ePub work).
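For comparison, the guid-to-fileserver join done in Excel can be sketched in a few lines of Python. The field layout and sample values below are hypothetical, but the logic mirrors the spreadsheet workaround:

```python
from urllib.parse import urlparse

def attach_hosts(apple_rows, guid_to_url):
    """Join Apple's per-track download counts to fileservers via the guid.

    apple_rows:  iterable of (guid, downloads) pairs from Apple's spreadsheet.
    guid_to_url: mapping of guid -> file URL, built from the RSS catalogue.
    Guids with no catalogue entry fall back to "Feed deleted", mirroring
    the ISNA() branch of the Excel formula described in the footnotes.
    """
    rows = []
    for guid, downloads in apple_rows:
        url = guid_to_url.get(guid)
        host = urlparse(url).netloc if url else "Feed deleted"
        rows.append((guid, host, downloads))
    return rows
```

Summing the download counts per host in the result is the Python equivalent of the pivot table.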
Fig 6 & 7: Nice easy charts. Select a week’s worth of download counts, create a bar line chart. Repeat but changing the vertical axis to a log scale.
- Mostly this is data held on media.podcasts.ox.ac.uk run by OUCS, i.e. centrally hosted.
- What constitutes full analysis is for another posting, but TIDSR methods alone would consume most of the available time on just the media.podcasts data.
- Quick method: choose a server log at random. Count the total number of hits (t). Count the number of requests that include "CAMEFROM=itunesu" in the URL (i). Divide i by t to get the proportion of requests clearly marked as coming from iTunes U. This method has a number of flaws, not least the choice of sample data (one day of data) and the fact that such URLs don't guarantee the request came from the portal (though that's the only place we actively advertise those patterned URLs). Given that iTunes U is only accessible (we think) via the iTunes application, you might expect the number of requests featuring the iTunes application as the requesting platform to be equal to or higher than 80%. In fact, a quick count of requests featuring "iTunes" finds just 75% do. Something fishy is going on here…
- The weaknesses here include: not having historic data for file URLs – some hosting arrangements have changed over the period for which a podcast has been publicly available, though we hope this is a small and insignificant portion; and using Apple's data rather than our own – however, we believe Apple's reporting to be consistent (with perhaps a few reservations born of experience), so a 4 week sample of our largest portal should be fairly indicative of overall demand.
- Several issues with Excel, though two were significant to this report. 1) The LOOKUP() function doesn't work as advertised, in that it couldn't match our guid strings (e.g. "http://rss.oucs.ox.ac.uk/tag:2010-08-17:121720:029:conted/critical_reasoning-aud") across two sheets, so I couldn't use it to build an index linking these identifiers with URLs, and therefore with host machines. Instead I took a multipart approach that allowed me to manually verify the data – a combination of `MATCH(guidcellref,'sheetname'!$L$1:$L$200000,0)`, then `ADDRESS(matchoutput,3,1,,"sheetname")`, and finally `IF(ISNA(INDIRECT(addressoutput)),"Feed deleted",INDIRECT(addressoutput))`. 2) Bar charts can't have more than 255 series – rather a pain when I want to plot 1400 bars along an axis. Instead a line chart had to be used.