As most people within the University will be aware, as well as those who have seen external press articles, over the past week or so we have been experiencing severe difficulties in delivering email to Hotmail. With these problems largely resolved, we now have an opportunity to reflect upon what happened.
How did it start?
Firstly, a brief explanation of our setup for the unfamiliar. OUCS provide a central SMTP relay service (“Oxmail”) that acts as our primary interface for the exchange of email to and from the outside world. As well as handling messages for centrally-provided services, it handles mail from servers within the University’s many constituent departments and colleges.
The problems appear to have originated with a mailing list set up by another University department in order to communicate announcements to a group of several thousand external people. An announcement was sent to that list on Friday 23 September and all was well until Sunday evening, when one of the recipients replied, accidentally hitting “reply all”.
Now, here the trouble started. For many mailing lists, it’s perfectly reasonable to allow subscribers to post without the need for their messages to be moderated. Unfortunately it wasn’t appropriate in this case. It was clearly a mistake, and should have been avoidable. Lessons have been learned in the department that created the mailing list and better procedures put in place for managing mailing lists. Nevertheless this is the sort of error which no doubt has hitherto occurred frequently, generally with minimal negative consequences.
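To illustrate the kind of check that was missing, here is a minimal sketch of an announce-only posting policy. It is not the configuration of the list software involved; the function name and the example addresses are hypothetical.

```python
def needs_moderation(sender: str, approved_posters: set[str]) -> bool:
    """Return True if a post should be held for a moderator.

    Hypothetical policy for an announce-only list: only addresses in
    approved_posters (stored in lower case) may post directly.
    """
    return sender.strip().lower() not in approved_posters


# Illustrative addresses only, not the real list configuration.
approved = {"announcements@dept.example"}

print(needs_moderation("announcements@dept.example", approved))
print(needs_moderation("subscriber@mail.example", approved))
```

Under a policy like this, a subscriber's accidental “reply all” would wait in a moderation queue rather than going out to every member.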
So the result was that the user’s reply went out to the several thousand list members. Unfortunate, but not yet a huge issue. But then other list members started to reply, again sending their messages to the list: “Why am I getting these messages?” “Unsubscribe!” “Please stop this!” “Someone is getting fired in the morning!”
All this happened on a Sunday evening, and, not being funded for 24×7 operations, those who could potentially stem the tide of emails were blissfully unaware of the problems until they arrived in the office after the weekend. By the time the complaints were noticed on Monday morning, over 300 separate messages had been posted to the list. That resulted in several million separate messages either already delivered or floating around on mail queues somewhere. We put a temporary block on the source and started removing messages from the offending mailing list from our queues. By mid-morning the problem was apparently resolved and we moved on to other work.
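The scale follows from simple multiplication: every post to the list is copied to every subscriber. The sketch below assumes 300 posts and tries a few subscriber counts; the subscriber counts are illustrative guesses, since we have only said “several thousand” above.

```python
def fan_out(posts: int, subscribers: int) -> int:
    """Number of individual deliveries generated by posts to a list."""
    return posts * subscribers


# 300 posts is the lower bound quoted above; subscriber counts are assumed.
for subscribers in (3_000, 5_000, 10_000):
    print(f"{subscribers} subscribers: {fan_out(300, subscribers)} deliveries")
```

Even at the low end of these assumptions, a few hundred replies produce close to a million individual deliveries.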
The Microsoft block
Unfortunately, all was not well. The list membership included a large number of recipients with Hotmail or Microsoft Live addresses, who had (not unreasonably) been marking the messages as junk. The usually strong reputation of our mail relays’ IP addresses started falling, and during Monday evening enough messages had been tagged as junk for Microsoft to stop accepting email from us. This was approximately eight hours after the problem was fixed at our end.
We had been expecting something like this to occur for some time, but not for this reason. Over the past few months we have seen a huge number of accounts compromised as a result of successful phishing attacks; despite our efforts at user education we are apparently failing to get through to all our users. Compromised accounts are typically abused to send spam, and we have been warning of the potential for disruption to legitimate messages that may result; there has been one known instance of this (with another provider) over the summer.
On Tuesday morning it was rapidly evident that this case was much worse than anything we had seen in previous incidents. The Hotmail servers were rejecting messages with permanent errors, resulting in immediate bounce notifications to the senders. Messages to recipients at institutions using Live@edu were similarly affected. A large number of users were complaining. As if we needed any more bad news, it turned out that if a user of one particular mail system within the University forwarded all their mail to Hotmail in a particular way, the resulting bounces caused a mail loop (which really shouldn’t happen, but that’s another story). To mitigate this issue and to avoid all mail being rejected outright, we took the decision for Oxmail to “hold” all mail for a limited number of domains, rather than attempting delivery and causing a bounce to be generated. This applied to hotmail.com, hotmail.co.uk, live.com, live.co.uk, and msn.com; mail to other domains hosted on Hotmail infrastructure continued to be rejected.
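The “hold” decision amounts to a lookup on the recipient's domain. The following is a rough sketch of that logic only, not the actual Oxmail configuration; the function name and the "hold"/"deliver" labels are hypothetical, while the domain list is the one given above.

```python
# Domains for which mail was queued rather than attempted (from the list above).
HELD_DOMAINS = frozenset(
    {"hotmail.com", "hotmail.co.uk", "live.com", "live.co.uk", "msn.com"}
)


def routing_action(recipient: str) -> str:
    """Return "hold" for recipients in a held domain, else "deliver".

    Hypothetical helper: matches on the exact domain part of the address.
    """
    domain = recipient.rsplit("@", 1)[-1].strip().lower()
    return "hold" if domain in HELD_DOMAINS else "deliver"


print(routing_action("someone@hotmail.co.uk"))
print(routing_action("someone@example.org"))
```

Because the match is on the recipient's domain alone, addresses at other domains that happen to be hosted on Hotmail infrastructure fall through to normal delivery, which is why those messages continued to be rejected.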
Resolving the issue
Our postmaster team raised a ticket with Microsoft’s email support, while within the security team we mailed our security contacts within Microsoft (messages to Microsoft’s own corporate email system were unaffected) explaining the situation. In particular we stressed that the original cause had been dealt with, and asked whether anything could be done to expedite lifting of the block. Others within OUCS communicated with our sales and more senior contacts. We continued exchanging emails long into Tuesday evening (many of our contacts being on Pacific Time), and were encouraged to make use of Microsoft’s reporting mechanisms. We continued to press the matter on Wednesday, particularly as to how long the disruption could be expected to continue, but little more information was forthcoming. We were particularly unimpressed with the following response via the official support channels:
We are unable to take action for the IP’s … because of their poor reputation within our system. I do apologize if I am unable to provide any details about this situation since we are not a liberty to discuss the nature of the block.
Despite some encouraging signs we remained blocked throughout Friday and into Saturday. Indeed at one point we were told that we should have been unblocked on Thursday, but Oxmail logs proved to the contrary.
It was not until Saturday evening, some time around 8pm BST, that mail once again started to flow. We continued to hold mail until we could monitor its release during working hours on Monday; owing to the large volumes of queued mail and rate-limiting on Hotmail, it took the best part of 24 hours for it all to be delivered. Even several days later, message delivery to Hotmail may be significantly delayed owing to rate-limiting.
Why didn’t you just change IP addresses?
We did consider whether we could switch IP addresses away from the blacklisted ones, or somehow have a special-case diversion for traffic to Hotmail. Such things are not necessarily trivial to change in a complex environment, and we must be wary of unforeseen issues making problems worse. Microsoft’s reputation-based system may not take kindly to a sudden deluge of email from a previously unseen IP address, and attempts to work around the blocks may not have gone down well with those persons we were asking to fix the problem. Furthermore there is the issue of precedent – will we be expected to take such action the next time there is similar disruption to email, even if considerably smaller in scale?
What can be done to avoid such problems in future?
The short answer is that we don’t know. We remain unsure as to whether the blocks were cleared through manual action by Microsoft staff, or simply because after five days of no email traffic, our “bad reputation” in their system expired sufficiently for mail to flow again.
We are happy to admit that the original problem was of the University’s making: as an institution, we screwed up; the fact the blame lies outside OUCS is neither our customers’ nor Microsoft’s concern. But we did what we could to fix the problem as soon as it was identified. The subsequent Microsoft blocks did nothing to prevent further mails from the offending list being delivered to their users, and merely caused major disruption to legitimate emails and business relying on those emails getting through. There has been resulting reputation damage both to Microsoft and to the University. These days admissions, alumni relations and recruitment are all heavily dependent on external email systems including Hotmail.
We cannot help but wonder what would have happened if we had followed the approach of some universities and outsourced student email to Live@edu while keeping staff email in-house. We could easily have ended up with our staff unable to contact students by email at all for over five days; had it happened in the middle of term, the consequences would have been extremely severe.
As a final note, we’d like to thank all of those who have assisted in dealing with this frustrating and time-consuming problem, and all who have had to bear with us while the disruption has continued.