
Running external programs with os.system() has one major disadvantage for the work I'm doing: the external program is prone to entering a state where it stops producing any output but never exits. The "wallclock" function runs an external command and periodically checks whether it has finished, killing it if it runs for too long.

Python, 66 lines
#!/usr/bin/python

import os,time,sys

time_map = { 	"seconds"	: 1,
				"minutes"	: 60,
				"hours"		: 3600,
				"days"		: 86400
			}

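def get_pid_status(pid):
	#True is running, False is not running
	#os.popen is used here because os.popen4 has been deprecated since Python 2.6
	ps_command = "ps -p " + str(pid)
	for line in os.popen(ps_command):
		contents = line.split()
		#Skip blank lines, then look for the pid in the first column of the ps output
		if contents and contents[0] == str(pid):
			return True
	return False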

def wallclock(command,lifetime=120,poll=1,repeat=0,time_unit="hours"):
	#This prevents zombie problems
	import signal
	signal.signal(signal.SIGCHLD, signal.SIG_IGN)
	if time_unit not in time_map:
		print("This is not an accepted unit - treating as seconds. Acceptable units are:")
		for item in time_map:
			print("\t" + item)
	else:
		#Convert lifetime and poll to seconds
		lifetime *= time_map[time_unit]
		poll *= time_map[time_unit]
	
	#The wallclock subroutine runs a command, but kills it and reruns it if it doesn't finish within a certain time period (up to a given number of repeats)
	#It is a "dumb" command - it should behave as a simple replacement for most "os.system()" calls, but with extra options
	pid = os.fork()
	if pid == 0:
		#child process- run the command and exit
		os.setpgrp() # Makes the child the leader of a new process group, so os.kill(-pid, ...) reaches everything it spawns
		os.system(command)
		#print "Child process command executed. I should exit"
		sys.stdout.flush()
		#This requires SIGCHLD to be set to SIG_IGN, to prevent a zombie process from appearing
		os._exit(0)
	else:
		current_time = 0
		while(current_time < lifetime):
			if get_pid_status(pid):
				#print "Child process " +str(pid) +" is running and I will sleep for " + str(poll) + " seconds"
				time.sleep(poll)
				current_time += poll
			else:
				#print "Parent process should now finish and return you"
				signal.signal(signal.SIGCHLD, signal.SIG_DFL)
				return 1
		#If you end up here, time has run out and the program has not finished: kill the child's process group
		os.kill(-pid,signal.SIGKILL)
		signal.signal(signal.SIGCHLD, signal.SIG_DFL)
		if repeat > 0:
			#lifetime and poll are already in seconds here, so pass time_unit="seconds" to avoid scaling them twice
			return wallclock(command,lifetime=lifetime,poll=poll,time_unit="seconds",repeat=repeat-1)
		#give up and return 0
		return 0
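
For comparison, here is a minimal sketch of the same run, wait, kill, and retry idea using the standard subprocess module. It assumes Python 3 on a POSIX system, and the name run_with_limit is hypothetical and not part of the recipe. Unlike wallclock, it takes the lifetime in seconds only and does not touch the SIGCHLD handler.

```python
import os
import signal
import subprocess


def run_with_limit(command, lifetime, repeat=0):
    """Hypothetical helper: run a shell command with a time limit in seconds.

    Returns 1 if the command finished in time, 0 if every attempt timed out.
    """
    for _ in range(repeat + 1):
        # start_new_session=True makes the child the leader of its own
        # process group, so the whole group can be killed in one call.
        proc = subprocess.Popen(command, shell=True, start_new_session=True)
        try:
            proc.wait(timeout=lifetime)
            return 1
        except subprocess.TimeoutExpired:
            # Out of time: kill the child and anything it spawned.
            os.killpg(proc.pid, signal.SIGKILL)
            proc.wait()  # reap the killed child so no zombie is left
    return 0
```

Because the child is reaped explicitly with wait(), this version should not need SIGCHLD to be set to SIG_IGN and restored afterwards.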

I run molecular simulations with the Gromacs engine, driven by Python code. On the whole the simulations are as robust as the setup scripts, but occasionally the "mdrun" program (which does the bulk of the hard number crunching) stops writing output while continuing to run on the processor. When running individual jobs on a cluster this isn't a problem, as the stuck jobs can be killed manually, but when running large numbers of simulations it leads to a slowdown over time, as nodes on the cluster become occupied despite doing no work.

This code is meant to be a simple replacement for os.system() as I was using it: given a time limit and a polling interval, it checks whether a child process has finished, and kills it if it hasn't finished within the time limit. I think repeatedly returning SIGCHLD to its default is a bit ugly, but it was necessary, as the change otherwise prevents other things from working later in the script. The only other thing I might add in the future is a set of other tests, so that running out of time isn't the only reason to kill the job: for example, if mdrun has stopped writing.
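
The "stopped writing" test mentioned above could look something like the following. This is only a sketch under stated assumptions, not part of the recipe: it assumes Python 3 on a POSIX system and assumes the program's progress shows up as changes to the modification time of a single output file. The names run_until_stalled, output_path and stall_limit are hypothetical.

```python
import os
import signal
import subprocess
import time


def run_until_stalled(command, output_path, stall_limit, poll=1.0):
    """Hypothetical helper: kill a command whose output file stops changing.

    Returns 1 if the command exited on its own, 0 if it was killed after
    output_path went unchanged for more than stall_limit seconds.
    """
    proc = subprocess.Popen(command, shell=True, start_new_session=True)
    last_mtime = None
    last_change = time.monotonic()
    while proc.poll() is None:
        try:
            mtime = os.path.getmtime(output_path)
        except OSError:
            mtime = None  # the output file does not exist (yet)
        if mtime != last_mtime:
            # The file was created or written to: reset the stall timer.
            last_mtime = mtime
            last_change = time.monotonic()
        elif time.monotonic() - last_change > stall_limit:
            # No change for too long: kill the child's process group.
            os.killpg(proc.pid, signal.SIGKILL)
            proc.wait()  # reap the killed child
            return 0
        time.sleep(poll)
    return 1
```

A file that is never created counts as stalled here, so stall_limit needs to be longer than the program's normal start-up time.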