Episode8

Power outage restart
Ignacio Valdes  Date: Thu, 14 Aug 2008 17:49:15 -0500

Hello all, We had a power outage with the server going totally down. Here is a terminal session dump of what we had to do to get Taskman restarted. The only thing that differs from what is shown below is that one enters 't' and gets Taskman up and running. The session shows several useful commands, such as how to find out where GT.M is installed, and also shows how I had the right user ID but was in the wrong directory: /home/ivaldes instead of /home/vista/EHR.

login as: ivaldes
ivaldes@IP address password:
Last login: Thu Aug 14 16:59:22 2008 from south124.ich.local
[ivaldes@vista ~]$ su
Password:
[root@vista ivaldes]# su vista
[vista@vista ivaldes]$ gtm

GTM>S DUZ=9

GTM>D ^XUP

Setting up programmer environment
GTM>W $ZS
150374954,XUP+4^XUP,%GTM-E-REQRUNDOWN, Error accessing database /home/vista/EHR/g/mumps.dat. Must be rundown on cluster node vista.ich.local.
GTM>h
[vista@vista ivaldes]$ echo $GTM_DIST

[vista@vista ivaldes]$ alias GTM
alias GTM='/usr/local/gtm/mumps -direct'
[vista@vista ivaldes]$ alias gtm
alias gtm='/usr/local/gtm/mumps -direct'
[vista@vista ivaldes]$ /usr/local/gtm/mupip rundown
[vista@vista ivaldes]$ /usr/local/gtm/mupip rundown -r "*"
%GTM-I-MUFILRNDWNSUC, File /home/vista/EHR/g/mumps.dat successfully rundown
[vista@vista ivaldes]$ gtm

GTM>S DUZ=9

GTM>D ^XUP

Setting up programmer environment
This is a TEST account.

Terminal Type set to: C-VT320

Select OPTION NAME: EVE
     1   EVE       Systems Manager Menu
     2   EVENT CAPTURE (ECS) EXTRACT AU  ECX ECS SOURCE AUDIT     Event Capture (ECS) Extract Audit
     3   EVENT CAPTURE DATA ENTRY  ECENTER     Event Capture Data Entry
     4   EVENT CAPTURE EXTRACT  ECXEC     Event Capture Extract
     5   EVENT CAPTURE MANAGEMENT MENU  ECMGR     Event Capture Management Menu
Press <RETURN> to see more, '^' to exit this list, OR
CHOOSE 1-5: 1  EVE     Systems Manager Menu

WARNING -- TASK MANAGER DOESN'T SEEM TO BE RUNNING!!!!

Select Systems Manager Menu Option: taskman Management

WARNING -- TASK MANAGER DOESN'T SEEM TO BE RUNNING!!!!

Select Taskman Management Option: taskman Management Utilities

Select Taskman Management Utilities Option: r
    1    Remove Taskman from WAIT State
    2    Restart Task Manager
CHOOSE 1-2: 2  Restart Task Manager
ARE YOU SURE YOU WANT TO RESTART TASKMAN? NO//YES  (YES)
Restarting...%GTM-E-JOBFAIL, JOB command failure
%GTM-I-TEXT, Error redirecting stdout (creat) to _ZTM0.mjo
%SYSTEM-E-ENO13, Permission denied

%GTM-E-JOBFAIL, JOB command failure
%GTM-I-TEXT, Failed to set STDIN/OUT/ERR for the job

GTM>h
[vista@vista ivaldes]$ whoami
vista
[vista@vista ivaldes]$ pwd
/home/ivaldes
[vista@vista ivaldes]$ cd /home/vista
[vista@vista ~]$ cd log
bash: cd: log: No such file or directory
[vista@vista ~]$ echo $gtmgbldir
/home/vista/EHR/g/mumps.gld
[vista@vista ~]$ cd EHR
[vista@vista EHR]$ ls
env2  g  logs  o  r  WVEHR-gui  WVEHR-gui.log  WVEHR-VOE1.0-GTM-Routines.tgz
[vista@vista EHR]$ cd logs
[vista@vista logs]$ ls
XWBTCPL.mje  XWBTCPL.mjo
[vista@vista logs]$ gtm

GTM>S DUZ=9

GTM>D ^XUP

Setting up programmer environment
This is a TEST account.

Terminal Type set to: C-VT320

Select OPTION NAME: EVE
     1   EVE       Systems Manager Menu
     2   EVENT CAPTURE (ECS) EXTRACT AU  ECX ECS SOURCE AUDIT     Event Capture (ECS) Extract Audit
     3   EVENT CAPTURE DATA ENTRY  ECENTER     Event Capture Data Entry
     4   EVENT CAPTURE EXTRACT  ECXEC     Event Capture Extract
     5   EVENT CAPTURE MANAGEMENT MENU  ECMGR     Event Capture Management Menu
Press <RETURN> to see more, '^' to exit this list, OR
CHOOSE 1-5: 1  EVE     Systems Manager Menu

WARNING -- TASK MANAGER DOESN'T SEEM TO BE RUNNING!!!!

Select Systems Manager Menu Option: taskman Management

WARNING -- TASK MANAGER DOESN'T SEEM TO BE RUNNING!!!!

Select Taskman Management Option: taskman Management Utilities

Select Taskman Management Utilities Option: r
    1    Remove Taskman from WAIT State
    2    Restart Task Manager
CHOOSE 1-2: 2  Restart Task Manager
ARE YOU SURE YOU WANT TO RESTART TASKMAN? NO//YES  (YES)
Restarting...TaskMan restarted!

Select Taskman Management Utilities Option: mtm Monitor Taskman

Checking Taskman. Current $H=61222,63435 (Aug 14, 2008@17:37:15)
                    RUN NODE=61222,63423 (Aug 14, 2008@17:37:03)
Taskman is current..
Checking the Status List:
  Node     weight  status      time       $J
 EHR:vista         RUN      T@17:37:03   3863      Main Loop

Checking the Schedule List:
     Taskman has 1 task scheduled.
     It is not overdue.

Checking the IO Lists: There are no tasks waiting for devices.

Checking the Job List: There are no tasks waiting for partitions.
     For EHR:CACHEWEB there is 1 tasks.  Out Of Service

Checking the Task List: There are no tasks currently running.
     On node EHR:vista there is 1 free Sub-Manager(s). Status: Run

Enter monitor Action: UPDATE// ^

Select Taskman Management Utilities Option: halt

Do you really want to halt? YES//

Logged out at Aug 14, 2008 5:37 pm
GTM>h
[vista@vista logs]$
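The first restart attempt in the session above appears to have failed because the JOB command tried to create _ZTM0.mjo in the current directory, /home/ivaldes, which the vista user could not write to; the second attempt, started from /home/vista/EHR/logs, succeeded. A minimal pre-flight sketch of that check follows. The function name can_job_here is hypothetical, and the scratch directory merely stands in for a real log directory.

```shell
#!/bin/bash
# Hypothetical helper (not part of GT.M or VistA): succeed only if the given
# directory (default: the current one) exists and is writable by the current
# user, so that JOBbed processes can create their .mjo/.mje files there.
can_job_here() {
    local dir="${1:-$PWD}"
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "OK: $(id -un) can write to $dir"
        return 0
    fi
    echo "WARNING: $(id -un) cannot write to $dir; cd elsewhere before starting GT.M" >&2
    return 1
}

# Demonstration with a scratch directory (assumption: mktemp is available).
scratch=$(mktemp -d)
can_job_here "$scratch"
```

Calling the function with no argument checks the current directory, which is the check that would have flagged the problem before the first restart attempt.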

K.S. Bhaskar Date: Thu, 14 Aug 2008 20:10:11 -0400

Ignacio --

You really ought to consider journaling. See how it's set up on the latest Toasters, for example, and see how simple it is. The Toaster has a small shell script that automatically recovers the database from the journal file on boot up and even starts up Taskman. Of course, if you like to practice typing... 8-)

Regards -- Bhaskar

I, Valdes  Date: Fri, 15 Aug 2008 06:15:26 -0700 (PDT)

Many years as a software engineer before medical school ruined the joy of typing as well as video games for me... Can you please post the script to this thread? -- IV

K.S. Bhaskar  Date: Fri, 15 Aug 2008 09:50:11 -0400

Ignacio --

You can adapt the following to your needs. You will need to turn on before-image journaling.

The script /etc/init.d/wvehrvoe10 is automatically executed by the system when it is booted or shut down:


    #! /bin/bash
    ### BEGIN INIT INFO
    # Provides:          wvehrvoe10
    # Required-Start:    $local_fs
    # Required-Stop:     $local_fs
    # Default-Start:     2 3 4 5
    # Default-Stop:      0 1 6
    # Short-Description: PIP V0.1
    # Description:       Starts and Stops WorldVistA EHR VOE/ 1.0
    ### END INIT INFO

    # Author: K.S. Bhaskar

    # Do NOT "set -e"

    NAME=wvehrvoe10
    PATH=/sbin:/usr/sbin:/bin:/usr/bin
    DESC="WorldVistA EHR VOE/ 1.0"
    SCRIPTNAME=/etc/init.d/$NAME

    #
    # Function that starts WorldVistA EHR VOE/ 1.0
    #
    do_start()
    {
        su -c /opt/wvehrvoe10/gtm_V5.3-001_i686/wvehrstart wvehr
    }

    #
    # Function that stops WorldVistA EHR VOE/ 1.0
    #
    do_stop()
    {
        su -c /opt/wvehrvoe10/gtm_V5.3-001_i686/wvehrstop wvehr
    }

    case "$1" in
      start)
        do_start
        ;;
      stop)
        do_stop
        ;;
      restart|force-reload)
        do_stop
        do_start
        ;;
      *)
        echo "Usage: $SCRIPTNAME {start|stop|restart|force-reload}" >&2
        exit 3
        ;;
    esac



It calls the script /opt/wvehrvoe10/gtm_V5.3-001_i686/wvehrstart, which recovers the database (effectively a no-op if it was shut down cleanly), starts Taskman, and removes journal files that are more than three days old (this is for a demo; adjust to your needs):

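    #!/bin/bash
    cd `dirname $0`
    # Clear out old JOB output/error files
    rm -f tmp/*.mj[oe]
    source ./env
    # Recover the database from the journal, re-enable before-image journaling,
    # then start Taskman; each step runs only if the previous one succeeded
    $gtm_dist/mupip journal -recover -backward g/mumps.mjl \
        && $gtm_dist/mupip set -journal="enable,on,before" -file g/mumps.dat \
        && ./run START^ZTMB
    # Remove journal files more than three days old
    # (the pattern is quoted so that find, not the shell, expands it)
    find g -iname 'mumps.mjl_*' -mtime +3 -exec rm -v {} \;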
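The journal clean-up step at the end of wvehrstart can be tried safely outside a live environment. The sketch below assumes GNU touch (for -d with a relative date) and uses made-up file suffixes in a scratch directory; real rotated journal files carry suffixes generated by GT.M.

```shell
#!/bin/bash
# Scratch directory standing in for the environment's g/ directory.
g=$(mktemp -d)

# One "old" and one "recent" rotated journal file, plus the live journal.
# Assumption: GNU touch, which accepts -d with a relative date.
# The suffixes _old and _recent are invented for this demonstration.
touch -d '5 days ago' "$g/mumps.mjl_old"
touch -d '1 day ago'  "$g/mumps.mjl_recent"
touch "$g/mumps.mjl"

# Same test as in wvehrstart: remove rotated journals more than 3 days old.
# -mtime +3 matches files whose age, counted in whole days, is greater than 3.
find "$g" -iname 'mumps.mjl_*' -mtime +3 -exec rm -v {} \;

# Show what is left in the scratch directory.
ls "$g"
```

Because the pattern requires an underscore after mumps.mjl, the live journal file is never a candidate for removal, whatever its age.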

The script /opt/wvehrvoe10/gtm_V5.3-001_i686/wvehrstop stops Taskman and attempts a clean shut down (not always possible):

    #!/bin/bash
    cd `dirname $0`
    source ./env
    ./run STOP^ZTMKU </dev/null

I use a small script /opt/wvehrvoe10/gtm_V5.3-001_i686/env to set environment variables:


    # env - file to be sourced to create VistA environment
    # This temporary version of the commands to set up the VistA
    # environment assumes that the parent and child use the same
    # version of GT.M.

    export gtmver=`basename $PWD`
    if [[ -d ../parent ]] ; then
        pushd ../parent/$gtmver 1>/dev/null
        source ./env
        popd 1>/dev/null
    fi

    tmp=`dirname $PWD`
    tmp0="$PWD/o($PWD/p $PWD/r $tmp/p $tmp/r)"

    # If there is an existing $routines, this environment comes before it
    if [[ -n $routines ]] ; then
        export routines="$tmp0 $routines"
    else
        export routines="$tmp0"
    fi

    # If a mumps.dat exists (vs. mumps.dat.gz) then this is a usable environment
    if [[ -f $PWD/g/mumps.dat ]] ; then export vista_home=$tmp ; fi

    source gtm/gtmprofile
    export gtmgbldir=$PWD/g/mumps.gld
    export gtmroutines="$routines $gtm_dist"

The net of this is that when the Toaster boots, the database is recovered, and Taskman started. It doesn't matter whether the system was shut down cleanly or whether it crashed. I suggest that production VistA environments, especially in non-ASP environments, be set up along the lines of the Toaster.

Regards -- Bhaskar

Nancy Anthracite  Date: Fri, 15 Aug 2008 10:30:47 -0400

Note that using the script to start and stop VistA itself is not recommended.

The menu system should be used for starting the system, and if you insist on using a script, Expect would be preferable because it would go through the menu system. The routine currently run by the menu option used for Taskman, and the correct one to call, is RESTART^ZTMB.

By using the menu system, you know as well as is possible that patches and checks and balances will be taken into account.

A similar startup routine for use with Cache, which directly calls the routines for starting VistA, is also circulating.

Doing things the "easy way" looks great when you want to do a demo, but for production systems, think seriously about using the menu system. You can consolidate several items in the menu system into one menu if that would make it easier for you, but please don't circumvent the checks and balances. -- Nancy Anthracite

K.S. Bhaskar  Date: Fri, 15 Aug 2008 10:35:02 -0400

Nancy --

Whether for production or for demo purposes, the reason to script Taskman startup is to facilitate the packaging of VistA as an appliance. Are you saying that the wvehrstart script should use RESTART^ZTMB instead of START^ZTMB?

Regards -- Bhaskar

Nancy Anthracite  Date: Fri, 15 Aug 2008 10:57:24 -0400

RESTART instead of START, yes. -- Nancy Anthracite

kdtop  Date: Fri, 15 Aug 2008 15:51:03 -0700 (PDT)

Bhaskar,

I was looking through this script. It looks to me like you are preloading responses for the mumps routine. I was trying to figure out how to do this a year ago and never got a good answer.

So what are you doing here? It looks like you are redirecting standard input. What does that EOF do?

Thanks Kevin

    #!/bin/bash
    cd `dirname $0`
    source ./env
    ./run STOP^ZTMKU </dev/null

K.S. Bhaskar  Date: Fri, 15 Aug 2008 23:42:49 -0400

In bash (and in many other shells), when there is a command such as:

    grvb -mbg kvtz <<GLZNOP
    oinad
    mnjbz
    GLZNOP

it means run the command grvb -mbg kvtz, and as its STDIN (standard input) feed the lines oinad and mnjbz. The GLZNOP on the command line tells it the marker to look for, and the GLZNOP on a line by itself is a marker that says no more input is available for the command. EOF is just slightly more readable to programmers than GLZNOP, but the shell doesn't care - it just matches the word after the << and the word on a line by itself.
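The construct can be tried with ordinary commands in place of the placeholder grvb. The sketch below is only an illustration: it feeds the same two lines to wc and to cat, once with GLZNOP and once with EOF as the marker.

```shell
#!/bin/bash
# Feed two lines to a command's standard input with a here-document.
# The marker word (GLZNOP here) is arbitrary; EOF is only a convention.
count=$(wc -l <<GLZNOP
oinad
mnjbz
GLZNOP
)
echo "wc counted $count line(s)"

# The same two lines, this time with EOF as the marker, copied through cat.
cat <<EOF
oinad
mnjbz
EOF
```

By contrast, a redirection from /dev/null gives the command an empty standard input: any read it attempts sees end-of-file at once, so no responses are preloaded.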

Regards -- Bhaskar

kdtop	Date: Sat, 16 Aug 2008 06:54:00 -0700 (PDT)

VERY helpful! Thanks. This opens all kinds of possibilities....

Thanks again, Kevin

Branden Tanga Date: Sun, 17 Aug 2008 04:50:52 -0700 (PDT)

Hello,

While using GT.M journaling is a good idea, that doesn't necessarily mean that you can always recover your VistA database. This is due to the fact that GT.M journals on the GT.M level, which is sets and kills. VistA operates at the Fileman and business logic level, where one Fileman command is made up of multiple sets and kills. Unfortunately, neither VistA nor Fileman has journaling at its own level.

So let's say that you have a task in taskman that is executing a Fileman command, which in turn is made up of 10 GT.M sets. Your server dies in the middle of that command, at GT.M set 5. GT.M journaling will allow you to recover to GT.M set 5, but your Fileman call never finished, and you cannot automatically roll back past GT.M set 1 because Fileman has no journal record of its own marking set 1. You can manually roll back GT.M past set 1, but that means that YOU the programmer have to know what was being executed, and know which GT.M set you have to roll back to.

Now imagine if you have multiple tasks running concurrently when your server goes down. GT.M will recover happy as a clam, but you will have multiple Fileman calls in various states of completion. What if rolling back past one Fileman call puts another Fileman call in an invalid state? To my knowledge, you cannot roll forward or back through a GT.M journal file based on process id (please correct me if I am wrong here). So all your sets and kills across all your processes are interspersed with each other in the GT.M log.

So what do you do? When I have lost a server and ended up with the results of an incomplete Fileman call, I had to find the incomplete globals and edit them appropriately. Luckily, for my close calls the end user was available to tell me what they were doing. That made it much easier to find what globals were affected. Thus I have never rolled back through a GT.M journal as a result of server failure, I have only moved forward fixing errors as I find them.

Apologies if you already knew this, but I'm not sure how many people have thought of the ramifications caused by VistA not having a journaling system of its own.

Branden Tanga

P.S. I know that GT.M has the capabilities for an application to leverage its journal file, in essence bringing the journal file to the level of your business logic. Unfortunately, VistA does not take advantage of anything like that, and the VistA or Fileman routines would have to be edited.

K.S. Bhaskar  	Date: Mon, 18 Aug 2008 09:47:22 -0400

Branden, this is not a GT.M issue but rather, as you note, a VistA/Fileman design issue: the database engine can recover the database state, but unless the application code uses transaction processing features, you are not guaranteed that the recovered state is Consistent (referring to the ACID transaction properties of Atomicity, Consistency, Isolation and Durability). I don't know what a transaction might be in the health care arena, but consider transferring $100 from your checking account to your savings account, implemented by subtracting $100 from your checking account balance and adding $100 to your savings account balance. In the event of a system crash, either both the subtraction and addition operations should be reflected in the state of the database, or neither should be reflected. It is not acceptable for one to be reflected and the other not to be reflected. The MUMPS language provides TStart and TCommit commands with which you can bracket your code, and which provide Atomicity. Thus, if the application logic is correct (in our example, the transfer is implemented as a subtraction from one account and an addition of the same amount to the other account), we have Consistency.

As you note, VistA/Fileman does not use MUMPS transaction processing commands, and therefore, when a database state is recovered from a crash, it can, and likely will, be Inconsistent. Since VistA has been designed this way, and has operated for years, my guess is that either (a) from an application point of view, transaction Consistency is not important - for example, if a system crashes during registration, perhaps an incomplete registration means that the patient has to be re-registered, and the consequence is simply an unused serial number - or (b) there is application logic to search for and correct Inconsistencies.

It would be good to hear from some application experts on this topic. Thank you very much.

Regards -- Bhaskar

fred trotter	Date: Mon, 18 Aug 2008 10:12:00 -0500

Is it a true statement that ACID compliance for VistA could be implemented entirely in FileMan? Or would it require more fundamental changes in other places?

The problem with Branden's story is that his workaround for a non-ACID crash was to leverage extensive knowledge of how VistA works to figure out where it was broken. Essentially these kinds of efforts prevent the "kernelization" of VistA. Important details of how VistA/MUMPS works are required in order to fix this type of problem. Issues like these ensure that VistA usage grows only as fast as VistA "kernel" expertise, and that grows slowly indeed.

If the VistA project cannot find a way past these kinds of issues it will be eclipsed by other FOSS projects. Either by VistA-based efforts like WebVistA (knowing that it is difficult to tell what that looks like) or by other efforts like OpenMRS, Tolven and ClearHealth proper.

It seems clear that Bhaskar has done his part. He has exposed an API from GTM to handle this issue.

What now?

-- Fred Trotter

K.S. Bhaskar  Date: Mon, 18 Aug 2008 11:58:25 -0400

Fred --

You are thinking like a programmer and not like a business person. Remember that things like ACID properties (and more esoteric things like two phase commit) are technologies intended to assist in business continuity in the face of unplanned events. As a geek at heart, I keep reminding myself that technology is only a means to an end, and not an end unto itself. VistA (at least DHCP) existed well before ACID properties and seems to run well. So, I think the questions to ask (before imposing a requirement of ACIDity) are:

Do the business processes of health care require ACID transaction properties or are the business processes inherently robust in the face of non-Atomicity and non-Consistency? [Isolation and Durability are not at issue here.] If this is the case, is a requirement of ACIDity like requiring brake fluid for restaurants?

If the answer is that the business processes of health care (at least as addressed by VistA) are not inherently robust in the face of non-Atomicity and non-Consistency, then what mechanisms currently exist in VistA that provide these requirements?

Until we look at the above questions first, looking at ACIDity is like putting the cart before the horse. Branden was not the first to experience a VistA system crash. Let's find out what others have done before him after recovering from a crash.

Regards -- Bhaskar

George Timson  Date: Mon, 18 Aug 2008 09:44:51 -0700 (PDT)

Fred Trotter asks:

> Is it a true statement that ACID compliance for VistA could be implemented entirely in FileMan? Or would it require more fundamental changes in other places?

No, it is not a true statement, because other VistA code changes the database without going thru FileMan calls.

Fred comments:

> It seems clear that Bhaskar has done his part. He has exposed an API from GTM to handle this issue.

What Bhaskar exposed was transaction-processing syntax that has been in the MUMPS Standard for a long time, but which the VA chose not to use. GTM of course is to be commended for implementing the MUMPS Standard! ;-)

Fred asks:

> What now?

Well, if someone wants to fund a man-year of retrofitting all VA code with the TS and TC commands, maybe the VA would be willing to change their (SAC) standard, and test and distribute hundreds of transaction-processing changes to their code. But I doubt it, when they don't even take bug-fixes and functionality enhancements from the outside.

Woodhouse Gregory  Date: Mon, 18 Aug 2008 09:47:36 -0700

Production VistA systems normally use journalling. Other measures include the use of RAID and UPS devices. For historical reasons (lack of uniform support across MUMPS implementations) VistA systems have not used transactions. This is no longer the case, but there is plenty of legacy code out there that does not use transactions. Instead, it was/is necessary to restore journaled globals explicitly.

In response to Fred's question: Fileman does not provide ACID support directly: this needs to be handled by the underlying MUMPS system. The role of Fileman is to provide a higher level abstraction than MUMPS globals, and to provide various tools (import/export, reporting, query and update, etc.) Screenman and the Classic APIs also provide (character based) UI support.

Metaphors be with you.

fred trotter  Date: Mon, 18 Aug 2008 11:54:55 -0500

K.S. Bhaskar wrote:

> Fred -- You are thinking like a programmer and not like a business person.

No, exactly the opposite.

> As a geek at heart, I keep reminding myself that technology is only a means to an end, and not an end unto itself. VistA (at least DHCP) existed well before ACID properties and seems to run well.

Under the care and feeding of highly trained experts who do nothing else.

My point is not at all that we need ACID, my point is this:

If system crashes require in-depth knowledge of MUMPS/FileMan/VistA to fix, then users cannot treat VistA as a "kernel". By "kernel" I mean a reliable platform whose internal workings can safely be ignored if certain requirements are respected (i.e. the right hardware, MUMPS implementation, etc etc.)

It would be entirely fine for me to have the VistA community say "Back up VistA every hour. If the system crashes, reinstall the most recent good backup, and send an alert that 1 hour's worth of data has potentially been lost."

That's not great... ACID would be better, but that is what you had to do with MySQL for a long time, and it is an acceptable workaround.

An unacceptable answer is "Use your extensive understanding of VistA internal state to correct the values of Globals that were in use at the time of the crash."

That answer implies that you must be a MUMPS expert to support VistA which is intractable. I am not a C expert but I use the C-based linux kernel all the time.

I am talking about a business problem in the context of one technical solution, but my concern is about the business problem.

-- Fred Trotter

Woodhouse Gregory  Date: Mon, 18 Aug 2008 10:11:17 -0700

On Aug 18, 2008, at 9:44 AM, George Timson wrote:

> Fred Trotter asks: Is it a true statement that ACID compliance for VistA could be implemented entirely in FileMan? Or would it require more fundamental changes in other places?

> No, it is not a true statement, because other VistA code changes the database without going thru FileMan calls.

This is a perennial problem with VistA code. I've long argued that developers should resist the urge to manipulate Fileman globals directly, but even if everyone stopped today, there would still be plenty of code that bypasses Fileman. Another, perhaps more insidious, problem is that developers and systems personnel often manipulate globals to correct errors ("crashes").

> Well, if someone wants to fund a man-year of retrofitting all VA code with the TS and TC commands, maybe the VA would be willing to change their (SAC) standard, and test and distribute hundreds of transaction-processing changes to their code. But I doubt it, when they don't even take bug-fixes and functionality enhancements from the outside.

The SAC has been revised to allow the use of TS and TC, but that doesn't address the legacy code problem (the issue you address above).

Steven McPhelan  Date: Mon, 18 Aug 2008 13:35:06 -0400

George stated "...if someone wants to fund a man-year of retrofitting all VA code with the TS and TC commands.." I understand that George was making a different point. I do not think that one man year is even close to sufficient time to rewrite all the existing VA code to be TP compliant. To make the changes, QA it, and release it would be a very large task indeed. Then it does no good as George implied to undertake such a task and to not put in place the structure to mandate and enforce that all new code from that point forward would only use TP procedures.

All of this is predicated upon the assumption that load testing of such rewritten code to be TP compliant shows that there is no decrease in the number of the transactions filed per time period without the requirement to upgrade the hardware to handle TP vs non-TP processing. I won't get into the practical issues of how the existing code would handle TP rollbacks because the filing failed. For good or bad, many VistA programs file data and proceed on with no checks to see if the filing of the data was indeed successful.

-- Steve

"Rest satisfied with doing well, and leave others to talk of you as they please." - Pythagoras

fred trotter  Date: Mon, 18 Aug 2008 12:45:35 -0500

> So what do you do? When I have lost a server and ended up with the results of an incomplete Fileman call, I had to find the incomplete globals and edit them appropriately. Luckily, for my close calls the end user was available to tell me what they were doing. That made it much easier to find what globals were affected. Thus I have never rolled back through a GT.M journal as a result of server failure, I have only moved forward fixing errors as I find them.

Ok, I will make my question more specific. Is this paragraph illustrative of how to handle a crash moving forward? If this is how crashes are handled, then this is a problem. If there is another procedure that can be followed, then it is important enough to have a description on the WorldVistA wiki. Or to have a link from the wiki to an already published solution. To help, I have created the page:


 * http://vistapedia.net/index.php?title=Restoring_a_VistA_installation

HTH, -FT

fred trotter  Date: Mon, 18 Aug 2008 13:07:19 -0500

Going on to discuss the pure technical issue:

Is there no way to do this on a meta level? What about executing TS and TC commands before and after every routine. So that at a minimum you know roughly in which routine the failure took place.

Perhaps you could have some "named idle journal". So that you could automatically roll back to a time when at the least nothing was happening on the system.

Any time I suggest something like this I usually get back that something like this already happens, or Bhaskar tells me that GTM already does something like this. I know I am way way over my head with regards to how MUMPS works....

-- Fred Trotter

Woodhouse Gregory  Date: Mon, 18 Aug 2008 11:15:40 -0700

On Aug 18, 2008, at 8:58 AM, K.S. Bhaskar wrote:

> Do the business processes of health care require ACID transaction properties or are the business processes inherently robust in the face of non-Atomicity and non-Consistency? [Isolation and Durability  are not at issue here.]  If this is the case, is a requirement of ACIDity like requiring brake fluid for restaurants?

> If the answer is that the business processes of health care (at  least as addressed by VistA) are not inherently robust in the face of non-Atomicity and non-Consistency, then what mechanisms currently   exist in VistA that provide these requirements?

This is interesting. It seems uncontroversial that database integrity is a requirement for health information systems (for example, we wouldn't want a penicillin allergy to be "lost"). In the ACID model, I would be hard pressed to say which of the four properties (atomicity, consistency, isolation and durability) can be dispensed with. But what is less obvious is that the ACID approach is the only route to database integrity. The latest ACM Queue takes this on with a little column whimsically entitled "BASE: an alternative to ACID"


 * http://www.acmqueue.com/modules.php?name=Content&pa=showpage&pid=540&page=1

Results like the CAP theorem have interested me for some time, given that I am interested in (developing) alternatives to heavy-handed approaches to database consistency like message ordering (frequently employed in HL7).

Anyway, the CAP theorem is just another version of a well-known dilemma in database programming: in choosing between the 2-phase and 3-phase commit, you are forced to choose between an algorithm that can fail, even when updating the database is safe, and one that can block indefinitely.

"It is never too late to become reasonable and wise; but if the insight comes too late, there is always more difficulty in starting the change." -- Immanuel Kant

Woodhouse Gregory  Date: Mon, 18 Aug 2008 11:23:46 -0700

Free associating a bit, I can't help but think of a famous result in (mathematical) model theory called Löb's Theorem. It states that a system cannot assert its own soundness without being inconsistent.

fred trotter wrote:

> Going on to discuss the pure technical issue: Is there no way to do this on a meta level? What about executing TS and TC commands before and after every routine. So that at a minimum you know roughly in which routine the failure took place.

> Perhaps you could have some "named idle journal". So that you could automatically roll back to a time when at the least nothing was happening on the system.

> Any time I suggest something like this I usually get back that something like this already happens, or Bhaskar tells me that GTM already does something like this. I know I am way way over my head with regards to how MUMPS works....

This is a good question. It shouldn't be difficult to write a meta-interpreter of the type you describe, though I'm unsure what the performance implications would be.

Basically, you're running into the legacy code problem. Modern MUMPS implementations do support ACID transactions, but this facility was not available when the bulk of VistA was developed. This has led to a controversy between people arguing that it is not feasible to build transaction support into VistA, and people (like me) who argue that it is essential to do so. Unfortunately, this generally mutates into a highly emotional debate over the use of MUMPS, which is not the point at all.

fred trotter  Date: Mon, 18 Aug 2008 14:33:08 -0500

I agree that ACID vs no ACID is probably a waste of time. Any practical suggestions for workarounds for VistA rebuilding?

-- Fred Trotter

Woodhouse Gregory  Date: Mon, 18 Aug 2008 12:39:32 -0700

>> So what do you do? When I have lost a server and ended up with the results of an incomplete Fileman call, I had to find the incomplete globals and edit them appropriately. Luckily, for my close calls the end user was available to tell me what they were doing. That made it much easier to find what globals were affected. Thus I have never rolled back through a GT.M journal as a result of server failure, I have only moved forward fixing errors as I find them.

> Ok, I will make my question more specific. Is this paragraph illustrative of how to handle a crash moving forward? If this is how crashes are handled, then this is a problem. If there is another procedure that can be followed, then it is important enough to have a description on the WorldVistA wiki. Or to have a link from the wiki to an already published solution. To help, I have created the page:


 * http://vistapedia.net/index.php?title=Restoring_a_VistA_installation

It's close - far too close for my comfort. Production systems should always be journaled, but I suspect many people here who may be developers, or who may be just "kicking the tires", may not enable journaling.

"Think globally, act locally." --René Dubos

Chris Richardson  Date: Mon, 18 Aug 2008 14:23:35 -0700

Well, guys, there is nothing left to do but contact your congressmen about this and start a grass-roots effort to get this funding. It would be embarrassing if a foreign government might pay for our software to be properly updated.

Branden Tanga  Date: Thu, 21 Aug 2008 20:27:34 -0700 (PDT)

Sorry to bring up a seemingly dead topic, but I haven't kept up with this thread over the past few days.

I don't see code that directly edits globals as the major issue. The main problem as I see it, is that having transactional processing built into Fileman is not good enough to be able to safely roll back and forward through a VistA log. In the same way that a single Fileman call is made up of multiple Mumps sets and kills, a single VistA transaction can be made up of multiple Fileman calls. So you would need code in VistA itself that defines what a "transaction" is. Likely, this definition would be different for each module in VistA. There is no way for a pure programmer like me to denote VistA transactions, you would need domain experts for each module to mark which action or group of actions are a transaction.

I totally agree, my solution is not optimal. When I had a server failure, I was faced with 2 options:


 * 1) Figure out where in the GT.M journal to roll back to
 * 2) Figure out how to fix the globals manually, and move on.

Because of the risk that rolling back through the journal may cause other Fileman calls to be incomplete, and the ridiculous amount of time it would take to figure out which exact GT.M set or kill I needed to roll back to, I chose #2. I talked to my end users to figure out what they were doing, edited the necessary globals, and if their actions were "finished", then I considered the database recovery as complete as possible. In short, I had to choose the lesser of 2 evils, which was to fix the globals manually and move on.

Skip Ormsby  Date: Fri, 22 Aug 2008 07:01:30 -0400

If my creaky old brain remembers correctly, one of the reasons for non-transaction processing is code like this (from before the prevention of unsubscripted kills and the New command were implemented, although there are still plenty of times the Kill is used):
 * S DIC=4,DIC(0)="AEMQZ" D ^DIC
 * ;Now being a good developer since I am making a Classic call I need to do local variable clean up
 * K ^DIC ; Ahh oops

Generally the favorites were ^DD, ^DIC, and ^DPT in no particular order. Solution - read the journal until you find the unsubscripted Kill and clip it out. It may take X amount of time before you actually notice that something has disappeared, so you would have journal activity that needs to be applied post the unsubscripted Kill.
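A rough sketch of the "find the unsubscripted Kill and clip it out" step, in Python. It assumes a simplified, hypothetical text format with one operation per line (`SET ^GLOBAL(subs)=value` or `KILL ^GLOBAL(subs)`); this is not the actual layout of a GT.M journal extract, so the parsing would have to be adapted to the real records. The function name and the list of protected globals are also just illustrative.

```python
import re

# Hypothetical line format: "KILL ^NAME" with no subscripts after the name.
# This is an assumption, not GT.M's real journal extract layout.
UNSUBSCRIPTED_KILL = re.compile(r"^KILL\s+\^[%A-Za-z][A-Za-z0-9]*\s*$")


def clip_unsubscripted_kills(lines, protected=("^DD", "^DIC", "^DPT")):
    """Split records into those to replay and those to clip out.

    A record is clipped only when it is an unsubscripted KILL of one of
    the protected globals; every other record is kept in original order.
    """
    kept, clipped = [], []
    for line in lines:
        text = line.strip()
        if UNSUBSCRIPTED_KILL.match(text) and text.split()[1] in protected:
            clipped.append(text)
        else:
            kept.append(line)
    return kept, clipped


sample = [
    'SET ^DPT(1,0)="DOE,JOHN"',
    "KILL ^DIC",            # the "Ahh oops" record
    "KILL ^DIC(4,1)",       # subscripted kill, left alone
    'SET ^DPT(2,0)="ROE,JANE"',
]
kept, clipped = clip_unsubscripted_kills(sample)
```

Keeping the clipped records in a separate list makes it possible to review exactly what was removed before the remaining activity is applied.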

-skip "we have met the enemy and he is us." - Pogo

Steven McPhelan  Date: Fri, 22 Aug 2008 08:29:59 -0400

That is how I always handled this problem in the past, though it was long ago and I have not had to do it in years. That is the purpose of the journal: to bring a backup copy up to date with all the transactions since that backup by dejournaling. If there was something I knew I did not want to happen (Skip's K ^DIC example), we would edit the journal file to remove the offending code and then proceed with the normal dejournaling procedures. Of course, if you are not journaling then you do not have this option.

I have not looked in years, are the journal files still just text files or have they been "updated and improved"?

K.S. Bhaskar  	Date: Fri, 22 Aug 2008 10:08:17 -0400

Branden --

There was some off-list discussion of this topic. To summarize, when the VA runs VistA, they rely on the MUMPS implementation to restore the database to the state it was in just before the crash (of course, they use computer hardware and operating systems that don't make a habit of crashing). A combination of their business processes and VistA application logic is such that after a crash, they don't usually need to go in and make changes - in other words, their business processes are in a good state when the MUMPS system recovers the database and VistA is restarted.

I speculate that when units of work need to be done, they are put in queues (in the database) for Taskman background processes to handle, and the design of Taskman is such that when the database is recovered, it picks up unfinished work from the queues. But this is just a guess.

Regards -- Bhaskar

Steven McPhelan  Date: Fri, 22 Aug 2008 16:11:28 -0400

If the taskman globals are journaled, they will be recovered and Taskman will start where he left off. However, any existing jobs running at the time of the crash will not be restarted.

Skip Ormsby  Date: Fri, 22 Aug 2008 20:15:58 -0400

As long as the subject is power outages: at the hospital I was at, we had no-break power for 8 PDP 11-44s that would run the critters for 15-20 minutes, which was long enough for either the main generator to kick in or for one of us to gracefully shut the systems down. When we went to the MSM/486 configuration, we made sure that all of the parallel bars were zip-tied and the plugs were zip-tied so they didn't accidentally come out of the plug. And of course the no-break power would last a very long time, but we never pushed it past 1/2 hour. The biggest problem was more in line with a disk controller going nuts, or a NIC that would go berserk, or bad memory, which in turn would put crud into the database. For us it was a case of fix it and forget it, because there were other fish to fry. We never did have a real power outage to the computer circuit, even when the room was flooded from a broken pipe in the ceiling. We lost the lights, etc., but the computers kept right on humming until we shut them down in a speedy, graceful shutdown.

-skip "we have met the enemy and he is us." - Pogo
