Tuesday, May 03, 2016

Anatomy Of A Fiasco

Last week I was the epicenter of a confluence of screw-ups the likes of which the world hasn't seen since it got biffed by another planet and the moon was spat out.

It started innocently enough, oozing good intentions and only a whiff of brimstone™ in the air.

For the last year I've been husbanding a set of computer jobs that were written by a since-retired employee to some scheme only he knows. Operational issues arising from user-requested expansion of scope meant that the first job had been belted with the bodge hammer so hard and so often it was now a rat's nest of scripts scheduled all over the clock which worked to do one job - mostly.

The scripts had been amended often, sometimes by someone with only a tenuous grasp of what the update would mean[1]. They also had to be de-scheduled occasionally. The retiree had done this by hand but I wrote a script to do it automatically[2] because I was going on vacation during one of the de-scheduling windows.

Later, I was asked to come up with a filewatcher, a database thingy that reacts to files arriving in a directory. I could not get the blasted thing working, so I wrote one in perl. It worked, indeed is working as I type, very well indeed. Not only that, I realized that for once I had designed something with so much flexibility that it could be used to replace the Rat's Nest code.

The Rat's Nest jobs essentially look for a file that gets transferred into that computer and do things with it. They adopt a technique of checking for the file by name and sleeping for half an hour if they don't see it. Problems arise in that the file has a date component as part of its name and files with the wrong date must not get processed - unless they must[3].

Crossing midnight is especially nasty and causes all sorts of bugs to appear. I've been denying requests for longer run windows on all the other runs that do the same sort of thing for months because I don't want to inherit the same problems as the retiree did. Almost sorting out this issue actually had the Retiree making three copies of RatsNest.sh with slightly different code and running them in different time windows, which (nearly) covers all the bases and traps the incoming files. Sometimes more than once, which is a problem.
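For readers who want to see the shape of the problem, here is a minimal sketch of the check-and-sleep technique described above, written in Python rather than the shell and perl used in the real jobs. Everything named in it is an assumption for illustration: the `feed_YYYYMMDD.csv` naming, the function names, and the deadline parameter are hypothetical, and this is not the logic of RatsNest.sh. The one idea it tries to show is that the date in the filename is decided once, before the waiting starts, so a run that crosses midnight keeps looking for the same file.

```python
import datetime as dt
import time
from pathlib import Path


def expected_name(run_date, prefix="feed_", suffix=".csv"):
    """Build the dated filename to look for (hypothetical naming scheme)."""
    return f"{prefix}{run_date:%Y%m%d}{suffix}"


def wait_for_file(directory, run_date, deadline,
                  poll_seconds=1800, now=dt.datetime.now, sleep=time.sleep):
    """Poll for the file belonging to run_date until it appears or deadline passes.

    run_date is fixed by the caller, not recomputed inside the loop, so the
    target name does not change if the clock rolls past midnight while waiting.
    Returns the Path if the file turned up, or None if the deadline passed.
    """
    target = Path(directory) / expected_name(run_date)
    while True:
        if target.exists():
            return target
        if now() >= deadline:
            return None
        sleep(poll_seconds)  # half an hour by default, as in the post
```

A single run with a deadline on the far side of midnight would, under these assumptions, stand in for several copies of the same script running in different time windows. It would not by itself stop a file being picked up twice if more than one watcher is running.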

By replacing RatsNest.sh with SteviesSpiffyFilewatcher.sh all would be well. The umpteen different ways of doing the same job would become one thing. It was tested. What could go wrong?

I set up a sort of fake run to run alongside RatsNest.sh in order to see what could go wrong. Nothing did[5].

One fly in the ointment was that before I could replace any late-night process, I would have to have the long-ago requested but yet-to-put-in-an-appearance remote computer access facility. This is a set of credentials that would allow me to use bleep to bleep and access the work computers from my laptop. I had been waiting for months for the go-ahead. It arrived at the end of last year. I tested it and it was, after a few teething troubles, dead good.

Other projects got in the way of my implementing Project Unrestrained Genius, but last week H hour, D day, S script was decided upon and I adjusted the scheduler to turn off all copies of RatsNest.sh and switch on PureGenius.sh[6] and I went home.

A sad mistake.

Naturally, I had made a spelling mistake that I hadn't picked up on. I learned a long time ago that if I don't see a problem in fifteen minutes of looking for it I never will, and have adopted review procedures to negate this[7], but this was a small error I didn't know was there and missing it was easy.

But I had planned for just such a screw-up and would be there with my trusty laptop to fix things on-the-fly should All not Be Well. I detected things not being well around 9:30pm (an expected email did not arrive) and activated remote access.

The remote access software announced it was going to update itself, and promptly did so, and that was it for my remote access. I struggled with the software in a World Gone Mad for two hours but couldn't figure out what had gone wrong.

All things being equal, I would then have done what I used to do in 1986 - jump on the next train west and work at my desk to fix the issue - but our office management has sent many emails telling me that unless I add my name to a special list ahead of time every time I need to get in I won't be allowed in after office hours.

I think you can see things were escalating nicely.

I went to bed determined to take the first train in that would arrive around 7 am, but was so worried that I couldn't sleep. I got up at 4:30, showered and ran for a train, just missing one and having to wait until 5:40 for the next one. I finally got into the office, fixed the typo and ran the scripts. I mailed out to everyone I could think of about the situation, who was to blame and why, what had been done and when and grabbed a cup of tea. It was about 7:20 am.

At around 7:40 am I became aware that my email client had disconnected from the email service, and that my explanatory mea culpa was still sitting in my outbox. A frantic series of phone calls revealed the ugly truth that everyone switched to the new cloud-hosted Office 365[8] service a few days before was now working as they did back in 1986, sans email.

I began another frantic series of phone calls to the people I'd been mailing, but no-one was picking up. Turns out that the process I had screwed up was an essential part of the early morning processing and everyone had been up since three trying to figure out what was wrong.

I gnashed my teeth (again) at the paucity of documentation left me by the retiree, and ran upstairs to try and find anyone in the affected user group to tell face-to-face what was happening. Since none of them had been migrated to the failing cloud email system, they found my tales of dropped email service unconvincing, only reluctantly coming around when I suggested their quiet morning didn't mean no-one had problems, just that they couldn't tell anyone about them. Once I had them convinced I had to stand around for five minutes so they could all shout at me.

It transpired that another part of the organization had pushed out a "patch" that had nobbled the network connectivity to the Office 365 cloud, but I didn't find out about that until it was all over bar the punching.

On returning to my desk, one of my colleagues mentioned that he was getting emails to his phone, and an idea formed[9]. I would send out the mea culpa email from my personal account and see who yelled back at me.

Unfortunately my personal email account's name is one of those Fluffybunnystevie.net sort of names and a good 50% of the recipients would simply bin the incoming mail even though I sent it with the subject header AS YOU LOVE LIFE DO NOT BIN THIS EMAIL[10].

It worked, sorta. Two of the upper management wrote back to say vile things in SHOUTY CAPS about me and my extreme incompetence, and mock me for my stupid email account name. About three hours later the cloud email service came back online and my original email went out, causing a reprise of SHOUTY CAPS and a stream of "me too" Replies to All that gave the cloud service a good workout.

So, all things considered, not my best work.

I have at least six more scripts that need replacing with the new style code, so I redesigned my new changes rollout checklist such that in the event I lost remote access nothing happens[11] and poked around, eventually discovering the secret extra step needed to provoke my laptop to connect to the remote service. It seemed that all the Incompetence Demons had fled and gone away.

I can't wait to have another go.

  1. My favorite being the calculation of yesterday's date that would occasionally result in the nothingth of any given month
  2. And was shouted at by the retiree's Luddite colleagues when confronted by the new code
  3. All computer programmers[4] recognize this requirement. Users supply it disappointingly frequently
  4. Now termed "developers"
  5. Naturally this roused my paranoia to new levels, but weeks of testing provoked no anti-programmer demons to manifest
  6. The original name was too long
  7. By grabbing someone who doesn't know what I'm doing and explaining the code to them. Works every time
  8. Now Office 364.5 since it took until lunch time to fix things
  9. I never learn
  10. Now I come to read it back, perhaps the subject header could have been better worded
  11. Which is what I should have done in the first place; turn off the old processes and turn on the new ones from home using the remote access service instead of doing that at work and planning on rolling it back from home

2 comments:

Westville 13 said...

So I found your blog today thanks to a link from the London Reconnections site and have been reading it ever since and howling out loud with laughter and you have Wintersmith (saw Steeleye Span in concert at Sussex University must be 44 years ago) and you namechecked Charles Stross (I am with child waiting for the next Laundry story) so I just wanted to say hallo and I hope you read your comments. I also hope to be reading a lot more of your blogs.

Stevie said...

Of course I read the comments. I have all three committed to memory.

My almond martyr is UEA, but it was what they might call a "bijou university" in my day, a mere 3000 students, not the sprawling edu-conglomerate it has apparently become. I saw Span there in '75 or '76. Still have the ticket stub somewhere.

FYI: I'm not sure becoming pregnant is an effective way of persuading Stross to write a Laundry Story. I've met the man and he seems unthreatened by the prospect of other people's children.