In December and January we completed some service migrations, audited several services, and welcomed new staff members to the team, which makes this a good time to clarify what it means for a migration to be complete. Although we migrate roughly 15-20 servers per year, the number of servers isn’t all that significant; what matters is the number of services on each server. More servers can actually make things easier: in my experience an old host with multiple services on it can be much harder to untangle and migrate than four servers hosting one clearly defined service each. With virtualisation (and our existing configuration management system) our team is moving towards a model of one service per host for reduced complexity. As older systems are replaced, and as our documentation and internal policies/processes mature, the work is getting easier over time.
Our team runs a handful of public/end-user facing services, but these are only the tip of the iceberg: we provide many inter-team and unit-level IT support services, plus the fully team-internal services that in turn support those. As a result, a typical migration task is to move a background or inter-team service that has run for five or six years onto new hardware and software, with little political involvement. You will see little end-user consultation in the checklists below because these are background supporting services, and funding and similar concerns are left out as something settled before reaching this stage.
So this post is aimed at IT support staff performing a similar migration, to give some extra ideas for the questions and checklists to run through. If you spot something that’s been missed, please do mention it in the comments.
Audit the existing team documentation for the service
For a complex service, auditing the existing internal team documentation (going through it, fact checking it and updating it) helps ensure nothing is missed when planning the migration.
The existing documentation should cover, or be modified to cover:
|Requests for change (discussion and links to related support tickets)|
|Known defects / common issues experienced and their solutions|
|Troubleshooting steps for support queries|
|Notes about data feeds, web interfaces and other interactions with other teams for this service|
|Notes about the physical deployment|
|Notes about the network deployment|
|A clear test table for service verification|
|Links to any documentation we provide to the public/end-users for this service|
If this hasn’t been done, the symptom (aside from inaccurate documentation) is that despite the migration being declared complete, small issues crop up over the following month due to missed or misunderstood sub-parts of the service.
For service verification tests I like to keep to a simple table containing something similar to:
- What the test is
- Command to type (and from where)
- Expected result
So for example if I was writing some tests for the DNS system, I might test name resolution for an external domain name, and I’m also interested in ensuring the authoritative name servers for ox.ac.uk don’t give a result, as that would be outside of their design behaviour and indicate something was wrong. So one test might look like:
|Test|Command|Expected Result (resolver)|Expected Result (auth)|
|External site query from internal host|(from a university host) dig www.bbc.co.uk @$dns_ipv4 +tcp and dig www.bbc.co.uk @$dns_ipv4 +notcp|DNS record|negative response|
This example isn’t perfect. The person performing the test has to know to substitute $dns_ipv4 for the DNS server’s IPv4 service interface, and I haven’t fully described what a ‘negative response’ or ‘DNS record’ will look like in their terminal, but it’s a good starting point. It would be one of many tests (test from an external host, test a record from our own domain, test a record that should be invalid…) and as you improve them, the tests you define for service verification typically become a good basis for commands to automate for service monitoring, for example via Zabbix or Nagios.
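As a sketch of that progression from test table to automation, each row can be wrapped in a small shell function that runs a command and greps its output for an expected pattern. The $dns_ipv4 address below is a hypothetical placeholder, and the dig lines (which need network access, so they are shown commented out) mirror the table above:

```shell
#!/bin/sh
# Minimal sketch: turn a service verification table into a script.
# Each check runs a command and greps its output for an expected
# pattern, printing PASS or FAIL.
dns_ipv4="203.0.113.53"   # hypothetical resolver address, substitute your own

check() {
    desc=$1; expect=$2; shift 2
    if "$@" 2>&1 | grep -q "$expect"; then
        echo "PASS: $desc"
    else
        echo "FAIL: $desc"
    fi
}

# The dig checks from the table (commented out as they need network access):
# check "external name via resolver (tcp)" "ANSWER: [1-9]" \
#     dig www.bbc.co.uk @"$dns_ipv4" +tcp
# check "external name via resolver (udp)" "ANSWER: [1-9]" \
#     dig www.bbc.co.uk @"$dns_ipv4" +notcp

# A deterministic stand-in so the script runs anywhere:
check "echo produces expected output" "hello" echo "hello world"
```

The same PASS/FAIL lines are then easy to feed into whichever monitoring system the team uses.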
For our own test tables, the tests include checking that when you log into the server, the Message Of The Day tells you what the server is used for, and if it’s safe to reboot the host for kernel updates or if special consideration is needed. It might also include tests to check that data feeds are coming in correctly (and not just the same data file, never updating), or that permissions are correctly reset on web files if altered (guarding against minor mistakes by team members).
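The ‘data feed is actually updating’ style of check can be automated the same way. A minimal sketch, assuming a feed file that should change at least daily (the path is hypothetical):

```shell
#!/bin/sh
# Sketch of a data-feed freshness check: fail if the feed file has
# not been modified in the last day. The path is a hypothetical example.
feed="/srv/feeds/import.csv"

feed_is_fresh() {
    # find prints the file only if it was modified within 24 hours
    [ -n "$(find "$1" -mtime -1 2>/dev/null)" ]
}

# Demonstrate with a freshly created temporary file:
tmp=$(mktemp)
if feed_is_fresh "$tmp"; then
    echo "feed fresh"
else
    echo "feed stale - investigate the data feed"
fi
rm -f "$tmp"
```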
Audit the public documentation
Our team may have a clear idea of what we believe the service is, but does the public documentation match it? We may not have written the documentation, or the person who did may have left, and we want to ensure we don’t overlook some subtle implied sub-service or service behaviour that would otherwise go unnoticed.
For example, if the public documentation mentions DNS names or IP addresses, then we should avoid changing these wherever possible, so that IT officers and end users aren’t inconvenienced into reconfiguring their clients. If the documentation mentions that we keep logs for 90 days, then we should have 90 days of logs: not less (because we won’t be able to troubleshoot issues up to the stated retention length) and not more (because this is users’ confidential data that we shouldn’t keep longer than we promised, and in the wrong hands it might represent account compromise, financial loss, embarrassment or similar).
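The 90-day promise can be enforced mechanically, typically from a daily cron job. A sketch, with a hypothetical log path and filename pattern:

```shell
#!/bin/sh
# Sketch of enforcing a promised 90-day retention: delete service
# logs older than 90 days. The log directory is a hypothetical example.
logdir="/var/log/myservice"

prune_old_logs() {
    # -mtime +90: last modified more than 90 days ago
    find "$1" -type f -name '*.log' -mtime +90 -print -delete
}

# Demonstration against a throwaway directory:
demo=$(mktemp -d)
touch "$demo/recent.log"
touch -d '100 days ago' "$demo/ancient.log" 2>/dev/null || \
    touch -t 202001010000 "$demo/ancient.log"   # fallback for non-GNU touch
prune_old_logs "$demo"
ls "$demo"    # only recent.log should remain
rm -rf "$demo"
```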
Are there open change requests for this service?
If we’re migrating a service, now might be a good time to implement any open change requests that we can accommodate.
Sometimes we can’t change one aspect without altering other parts of the service, but when re-deploying/migrating the service we have an opportunity to alter the architecture and perhaps still provide the same end user facing service, but with improvements that have been requested.
If we can’t implement the change on this cycle (for cost or staffing reasons), let’s keep the change request in our pile, but document why, so that we know when asked.
If we won’t implement the change (for political reasons, or technical sanity), again let’s keep the change request but document the official statement on why it won’t be implemented, so that we can give a quick, consistent response to queries instead of laboriously explaining each time it’s raised.
Using our knowledge, what can we improve with regards to how the service is delivered?
Requests for change aside, perhaps we can see ways from our experience and skill set to improve the quality of the service, the usability or the maintainability.
If end users have to configure software to use our service, can we alter the service to reduce the configuration?
If we previously had restrictions in place due to service load, can these now be lifted on the newer hardware?
If historical scripts import the data or are used to rebuild configuration files, do those scripts pass basic modern coding sanity checks?
|The code isn’t doing something that’s fundamentally no longer needed|
|The code is documented (e.g. perldoc pod format)|
|Any configuration or static/hardcoded variables are declared near the start (we might separate them out into a configuration file later)|
|The code passes basic static code analysis (perlcritic -5)|
|The code makes use of common team modules for common tasks (Template toolkit, Config::Any, Net::MAC etc)|
|The code meets basic team formatting requirements (run through perltidy)|
|The basic task the code is doing is documented in our team docs as part of the service|
|Does an automated test script exist to help regression test the code after changes?|
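A minimal sketch of running that sanity pass over a directory of legacy scripts, using the tools named above (perlcritic at severity 5, perltidy). The directory name is hypothetical and both tools need to be installed for the checks to do anything useful:

```shell
#!/bin/sh
# Sketch of a pre-migration sanity pass over legacy Perl scripts.
lint_scripts() {
    dir=$1
    for f in "$dir"/*.pl; do
        [ -e "$f" ] || continue        # no scripts in this directory
        echo "== $f"
        perlcritic -5 "$f" || echo "perlcritic: issues in $f"
        # perltidy writes $f.tdy; any difference is formatting drift
        perltidy "$f" && diff -q "$f" "$f.tdy" >/dev/null || \
            echo "perltidy: $f does not match team formatting"
        rm -f "$f.tdy"
    done
}

# Hypothetical usage:
lint_scripts ./legacy-scripts
```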
During the migration
This is usually service specific, but generic planning questions might be:
- Can we eliminate downtime during the migration? (for instance, migrate one node of a cluster at a time with no effect on service?)
- If not can we minimise the downtime by careful planning? (research all the commands in advance, document them as a migration process and test the process)
- If we must have downtime, can we schedule it in a low-usage period (out of hours, such as 7am or similar)?
With the last point, remember to check that, if the worst or supposedly impossible happens, you can physically get into the building where the hardware is (switch/router/server). The only thing worse than a 7am walk of shame to physically turn on/reconnect a device after cutting off your remote access during planned maintenance is doing so only to discover that the building doesn’t open until 9am, turning a ten-minute early-morning service outage into a two-hour outage that’s noticeable to everyone and runs into business hours.
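One common safety net for exactly this scenario is to schedule an automatic rollback before starting the risky change, and cancel it once you have confirmed you still have remote access. A minimal sketch (the restore command is a hypothetical placeholder):

```shell
#!/bin/sh
# Start a background timer that restores the old configuration
# unless it is cancelled in time.
start_rollback_guard() {
    delay=$1; shift
    # detach from our stdout so command substitution does not
    # wait for the timer to expire
    ( sleep "$delay" && "$@" ) >/dev/null 2>&1 &
    echo $!              # caller keeps the PID to cancel the guard
}

# Usage sketch: 15-minute timer around a risky network change.
# The rollback command here is a hypothetical placeholder.
guard=$(start_rollback_guard 900 sh -c 'echo rollback-would-run-here')
# ...perform the change, re-test remote access from outside...
kill "$guard" 2>/dev/null   # all good: cancel the pending rollback
```

The same idea can be done with at(1) (`echo restore-config | at now + 15 minutes`, then `atrm` the job once access is confirmed).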
Decommissioning the old hosts
|Required meta data (such as mail relay summary data used to make annual stats reports) has been copied from the host|
|The host has no outstanding running processes related to its function (e.g. a mail relay has no mail remaining in its queue)|
|If we search the team documentation system, have all references to the old host been updated?|
|Have the previous hosts been marked decommissioned in the inventory system?|
|Have the previous hosts been deracked and all rack cables untangled/removed?|
|Have the previous hosts had their disks wiped (DBAN) and been marked for disposal?|
|In our configuration management system, have references to the previous, now decommissioned hosts been removed?|
|Remove the host from DHCP if present|
|Remove the host from DNS|
|Remove the host service principal from Kerberos|
|Are all hosts involved documented in the team documentation system?|
|Are all hosts involved documented in the team inventory system?|
|Are all hosts involved now monitored in the team monitoring system?|
|In the rack, are all cables labelled at both ends and is the server labelled?|
|Is the service address/name itself being monitored by the team’s monitoring system?|
|Is the host reporting errors into our daemons’ queue?|
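Parts of this checklist can be verified from the command line. For example, a sketch of the ‘removed from DNS’ check (the hostname is hypothetical; getent is used so the check also covers stray /etc/hosts entries):

```shell
#!/bin/sh
# Sketch: a retired host's name should no longer resolve anywhere.
host_is_gone() {
    ! getent hosts "$1" >/dev/null
}

# Hypothetical decommissioned hostname:
if host_is_gone "old-mail-relay.example.org"; then
    echo "OK: name no longer resolves"
else
    echo "WARNING: decommissioned host still resolves"
fi
```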
No doubt you’ll have plenty of service-specific migration checks to perform, but add these:
- Ask another team member, not involved in the migration, to read through the documentation. In my experience this works especially well if you can offer a prize, such as a sweet per unique mistake found (think: Roses, Quality Street). I’m not joking here: people have their own tasks and will generally get bored of reading your documentation within a short space of time, no matter how well structured it is, which means it’s poorly tested. Offering a group of people an incentive costs very little and sparks interest; you’ll have problems found that you hadn’t thought of. Even if you don’t agree with their criticism, give out the reward for each unique issue raised. In my opinion, correcting issues as they check will motivate them further, as it’s obvious you’re taking action based on their feedback.
- Ask someone more junior in your team, or skilled in a different service area, to run through your service verification tasks (without you standing over them). If it’s not clear to them where to run a check from, or how to run it, then don’t criticise their skills; instead make your test documentation clearer. When the key specialist[s] for the service are on holiday and the service appears to break, someone from senior management may be standing over whoever is left, demanding an explanation. At that point you want the service verification tasks to be as clear and comprehensive as possible, so that there’s little opportunity to misunderstand them, and so that running them successfully leaves no doubt that your team’s service is not at fault (or, if it is at fault, the issue is clearly cornered/defined by the tests and easier to fix).
Perhaps the most important concluding point in all of the above is to have the self-discipline not to declare the migration complete to anyone until all service documentation has been tested, any migration support tickets/defects have been successfully addressed, and all traces of the previous service are tidied away.