I want to take a moment to talk a little more about the disconnect between technical tracking approaches and actual people and their actions, and perhaps float an idea from another field that might be of interest to this discussion.
First up, we’ve talked already about how a “hit” in a log file is the footprint of a computer left on a server, and how that footprint leaves an IP address. Well, we need to complicate that view a little further to aid understanding…
What the above diagram illustrates is the range of possible entities behind an IP address. At its root, an Internet Protocol (IPv4) address is four numbers, each between 0 and 255, joined by dots/periods. This is the means by which, on one level, networks are traversed to allow information to find a given destination.
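As a small illustration of that structure, here is a minimal Python sketch using the standard-library ipaddress module. The function name describe_ipv4 and the example address (taken from a range reserved for documentation) are my own choices for illustration, not anything from our logs.

```python
import ipaddress

# Four numbers of 0-255 each means 256 ** 4 (= 2 ** 32) possible addresses.
TOTAL_IPV4_ADDRESSES = 2 ** 32


def describe_ipv4(text):
    """Split a dotted IPv4 address into its four numbers and its single integer value."""
    addr = ipaddress.IPv4Address(text)  # raises ValueError if the text is not valid IPv4
    return {
        "octets": list(addr.packed),  # the four numbers, each between 0 and 255
        "as_integer": int(addr),      # the same address as one 32-bit number
    }


# 192.0.2.1 is from a block reserved for documentation, used here as a stand-in.
info = describe_ipv4("192.0.2.1")
print(info["octets"])
print(TOTAL_IPV4_ADDRESSES)
```

The total works out at roughly 4.3 billion possible addresses, before any reserved ranges are taken out.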
At its simplest, this is the address for finding one computer/device. Having found this one computer, we then have to appreciate that it could be used by zero or more actual people. Why zero? Automated programs can cause our podcasts to be downloaded; in fact, automated downloading of multiple podcasts is at the heart of the technology. We can’t assume that one download is used by just one person: it could be anywhere between zero (not listened to/viewed) and many people (files passed along, groups watching together, etc.).
So, that’s at its simplest. However, the population of the planet exceeds the available number of IP addresses, so there can’t be a guaranteed one-to-one relationship between addresses and people. There are also more computers than there are IP addresses, which means that a single IP address could easily stand for multiple computers.
The above diagram gives an idea that multiple people can use a computer, multiple computers can share a router (e.g. typical home broadband use has one IP address for the property, shared amongst all the users and computers in that house) and multiple routers can share the IP address of their Internet Service Provider (ISP). For the latter, if you’ve heard of Dynamic IP addresses in relation to your own broadband provision, then you likely have a private IP address, and we would see your download as coming from your ISP and its IP address. If you have a Static IP address, then that may apply to your home router, or perhaps even to the one computer. We, as system administrators and log analysers, can’t tell which.
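To make the layers of sharing a little more concrete, here is a rough Python sketch that sorts an address into "private", "shared" or "public". The classify function and its labels are hypothetical names of mine, and the example addresses are illustrative only. It assumes the 100.64.0.0/10 block that RFC 6598 sets aside for ISP-level address sharing, and relies on the standard-library ipaddress module for the private ranges.

```python
import ipaddress

# Address block set aside by RFC 6598 for sharing within an ISP's own network.
ISP_SHARED_RANGE = ipaddress.ip_network("100.64.0.0/10")


def classify(text):
    """Roughly label an address by the layer of sharing it suggests."""
    addr = ipaddress.ip_address(text)
    if addr in ISP_SHARED_RANGE:
        return "shared (inside an ISP network)"
    if addr.is_private:
        return "private (behind a router)"
    return "public"


for example in ("192.168.1.20", "100.64.0.1", "8.8.8.8"):
    print(example, "->", classify(example))
```

Only the public kind should turn up as the source address in a server log. The private and shared addresses sit behind it and are never recorded, which is exactly why the log alone can't tell us how many computers or people are behind one address.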
On the plus side though, each “hit” is at least one action/download… most of the time (see posts referencing HTTP 206 status codes and CDNs).
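As a sketch of why hits and downloads can differ, the Python below counts raw hits and then makes a rough download estimate. Everything here is an assumption for illustration: the records are hand-written rather than taken from our logs, the function estimate_downloads is hypothetical, and the rule that repeated 206 (Partial Content) responses for the same address and file are pieces of one download is a simplification.

```python
# Hand-written example records (not real log data): (ip, path, http_status).
SAMPLE_HITS = [
    ("192.0.2.10", "/podcasts/ep1.mp3", 200),
    ("192.0.2.10", "/podcasts/ep2.mp3", 200),
    ("198.51.100.7", "/podcasts/ep1.mp3", 206),
    ("198.51.100.7", "/podcasts/ep1.mp3", 206),
    ("198.51.100.7", "/podcasts/ep1.mp3", 206),
    ("203.0.113.5", "/podcasts/ep1.mp3", 200),
]


def estimate_downloads(hits):
    """Count each 200 as one download; collapse 206s per (ip, path) into one."""
    full = sum(1 for _ip, _path, status in hits if status == 200)
    # Assumption: several partial responses to one address for one file
    # are chunks of a single download.
    partial = {(ip, path) for ip, path, status in hits if status == 206}
    return full + len(partial)


distinct_ips = {ip for ip, _path, _status in SAMPLE_HITS}
print(len(SAMPLE_HITS), "hits,", len(distinct_ips), "addresses,",
      estimate_downloads(SAMPLE_HITS), "estimated downloads")
```

None of these three numbers is a count of people, of course; they are just different ways of reading the same footprints.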
Talking of countries again…
I mentioned geographical information derived from your IP address. Well, I was curious to know what the distribution of IP addresses was compared to the populations of a given country. To do that I looked up population by country in a quick internet search and got…
… and then did a similar pie chart comparing the number of IP addresses allocated to each country…
As we can see, America clearly dominates the allocation of IP addresses, and the most populous country in the world (China) appears as a distant second when it comes to IP addresses. Contrast this with the snapshot I showed you of IP addresses accessing one of our servers in a short time period and you can conclude that it’s not a consistent percentage of a population accessing content, and that time and trends over time are critical factors to look at for analysis.
This sounds a little like TV/Radio audience figures…
This mystery over how many people are represented by an IP address, and indeed how many times a podcast is viewed/listened to and by how many people per download, caused me to consider the TV and Radio industries. The analogue development of these media made assessing viewer/listener numbers very challenging, as there was no easy feedback mechanism between the broadcaster and the audience. Advertisers, who were usually paying for these broadcasts, wanted to know the impact of their advertising, or at least the potential impact. Various systems have been used over the years, but it is fair to say most of them have been based on qualitative methods involving sample audiences and surveys/diary keeping or similar. The Wikipedia article on Audience Measurement is a reasonable introduction to these methods for the curious.
It would be interesting if this or a similar project could develop some form of qualitative and/or quantitative measure that could similarly estimate the number of people who have listened to/viewed a podcast that has been downloaded. I fear it is beyond our scope at this stage though. One criticism of such an approach, as noted by the Wikipedia article, is:
Diary-based radio ratings in the U.S. may inflate listenership, because it is only measured in 15-minute increments. Listening at any time during a quarter-hour counts as listening for the entire duration, even if the actual time was just for a song or two.
This reflects similar concerns and queries about the unknown usage patterns of people who download podcasts, and the compromises that would need to be made in order to use such an estimate.
Finally, another use for IP address analysis
It is not uncommon for new IP addresses (new from the point of view of the observer) to be counted as a metric for measuring growth via new audiences. This is one metric that I have not yet seen mentioned in the TIDSR guides, but I think it would be valuable to perform as a longitudinal study of our data.
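A minimal Python sketch of how such a metric might be counted is below. The function new_ips_per_period, the period labels and the sample records are all hypothetical, and it assumes the hits are already in chronological order and grouped into whatever period (week, month) we choose.

```python
def new_ips_per_period(hits):
    """Count, per period, the addresses never seen in any earlier hit.

    hits: iterable of (period_label, ip) pairs in chronological order.
    """
    seen = set()
    new_counts = {}
    for period, ip in hits:
        new_counts.setdefault(period, 0)  # keep periods with no new addresses
        if ip not in seen:
            seen.add(ip)
            new_counts[period] += 1
    return new_counts


# Hand-written example records, not real log data.
sample = [
    ("week-1", "192.0.2.10"),
    ("week-1", "192.0.2.11"),
    ("week-2", "192.0.2.10"),
    ("week-2", "198.51.100.7"),
    ("week-3", "192.0.2.11"),
]
growth = new_ips_per_period(sample)
print(growth)
```

Bear in mind the caveats above: a "new" address may be an existing listener on a different connection, and a returning address may hide any number of new people.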
For those who want to see a few more opinions or get a personal security related viewpoint on IP Address tracing, try reading:
- The Myth and the Truth of the IP Address Tracing by Leo Notenboom
- What does your IP address say about you? by Michael Horowitz
- I’m not going anywhere near the topic of physical layer switching and MAC addresses, and I’m glossing over Network Address Translation (NAT) in general. Though as a small point, our log files don’t record enough detail to be able to detect a NAT machine as the destination, and even if they did, I’m not sure this would be particularly useful information.