Friday 28 March 2014

Fixed problem with job synchronisation

If you have submitted any jobs over the last few days, you might have noticed that some were getting stuck with a status of "New, allocated".
A single failed database update for a completed job back on the 20th began as a small problem, but as the Tenerife and Web site databases drifted further apart it started to affect more jobs.

"Job Control" - the catchy name for our back-end system which shuffles jobs up and down between the Web site and the telescope in Tenerife - is written with a very conservative "let's not break anything" attitude to data synchronisation. It will fail safe and send messages to humans rather than plough on and make a bigger mess. Eventually, if things get really bad and it can't do anything useful, it stops altogether and waits for human help.
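
For the curious, the fail-safe idea can be sketched roughly as below. This is only an illustration of the general approach, not the actual Job Control code: the class `FailSafeSync`, the `max_failures` threshold, the `notify_human` method and the `flaky_update` function are all made-up names, and the alert list stands in for whatever messaging the real system uses.

```python
from dataclasses import dataclass, field


class SyncHalted(Exception):
    """Raised when too many updates have failed and a human must step in."""


@dataclass
class FailSafeSync:
    # Hypothetical threshold: failed updates tolerated before halting.
    max_failures: int = 3
    failures: list = field(default_factory=list)
    alerts: list = field(default_factory=list)

    def notify_human(self, message):
        # Stand-in for sending an email or raising a monitoring alert.
        self.alerts.append(message)

    def sync(self, updates, apply_update):
        """Apply each update; skip and report failures rather than forcing them."""
        applied = []
        for job_id in updates:
            if len(self.failures) >= self.max_failures:
                # Things are bad enough that carrying on is not useful.
                self.notify_human("Synchronisation halted, waiting for help")
                raise SyncHalted(f"{len(self.failures)} updates failed")
            try:
                apply_update(job_id)
                applied.append(job_id)
            except Exception as exc:
                # Fail safe: leave the job untouched and tell a human.
                self.failures.append(job_id)
                self.notify_human(f"Update for job {job_id} failed: {exc}")
        return applied


def flaky_update(job_id):
    # Hypothetical update function in which job 2 always fails.
    if job_id == 2:
        raise ValueError("database rejected update")


syncer = FailSafeSync(max_failures=3)
applied = syncer.sync([1, 2, 3], flaky_update)
print(applied)  # [1, 3]
```

Note that in this sketch a single failed update does not stop anything: jobs 1 and 3 still go through, and the only sign of trouble is the alert, which is only useful if somebody actually receives it.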

Unfortunately, two things delayed the fix. First, the data synchronisation problem wasn't big enough to stop the system altogether - some data was still going back and forth. Second, Job Control hadn't yet been upgraded to use our overall system problem monitoring software, so I was unaware of the problem until a helpful user emailed to ask what was wrong. (Thanks, you know who you are!)

Now the database update is fixed, Job Control has been upgraded to produce better warnings and errors, and all jobs are synced up correctly. Telescope operation was unaffected (except perhaps for not doing some jobs which might otherwise have been done). Also, jobs that have been correctly synced to the telescope today will have the correct amount of waiting time taken into account.
