NAME


srun_cr - run parallel jobs with checkpoint/restart support

SYNOPSIS


srun_cr [OPTIONS...]

DESCRIPTION


The design of srun_cr is inspired by mpiexec_cr from MVAPICH2 and cr_restart from BLCR.
It is a wrapper around the srun command that enables batch job checkpoint/restart support
when used with Slurm's checkpoint/blcr plugin.

OPTIONS


The srun_cr command-line options are identical to those of the srun command. See "man
srun" for details.

DETAILS


After initialization, srun_cr registers a thread context callback function. It then forks
a process and executes "cr_run --omit srun" with its arguments. cr_run is used to exclude
the srun process from being dumped upon checkpoint. All catchable signals sent to srun_cr,
except SIGCHLD, are forwarded to the child srun process. SIGCHLD is caught so that srun_cr
can mimic the exit status of srun when srun exits. srun_cr then loops, waiting for the
tasks launched by srun to terminate.
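
The fork, signal-forwarding and exit-status steps above can be sketched in Python. This
is a minimal illustration, not the srun_cr source: the names run_wrapped,
exit_code_from_returncode and FORWARDED, the short list of forwarded signals, and the
shell-style 128 + N exit code for a child killed by signal N are all assumptions made for
the example.

```python
import signal
import subprocess
import sys

# Assumption: a short illustrative list. The text above says srun_cr forwards
# every catchable signal except SIGCHLD.
FORWARDED = (signal.SIGINT, signal.SIGTERM, signal.SIGHUP, signal.SIGUSR1)


def exit_code_from_returncode(returncode):
    """Map Popen.returncode to a shell-style exit code.

    Popen reports death by signal N as -N; this sketch reports it as 128 + N.
    """
    if returncode < 0:
        return 128 + (-returncode)
    return returncode


def run_wrapped(argv):
    """Start argv as a child, relay signals to it, and return its exit code."""
    child = subprocess.Popen(argv)

    def forward(signum, _frame):
        # Relay the signal to the child instead of acting on it here.
        if child.poll() is None:
            child.send_signal(signum)

    previous = {sig: signal.signal(sig, forward) for sig in FORWARDED}
    try:
        # Block until the child exits, then mirror its status.
        return exit_code_from_returncode(child.wait())
    finally:
        # Restore whatever handlers were installed before the call.
        for sig, handler in previous.items():
            signal.signal(sig, handler if handler is not None else signal.SIG_DFL)


if __name__ == "__main__":
    # Example: wrap a child that exits with status 3 and exit the same way.
    sys.exit(run_wrapped([sys.executable, "-c", "import sys; sys.exit(3)"]))
```

In this sketch the wrapper would be handed a command line such as the "cr_run --omit
srun" invocation described above. Signal handlers can only be installed from the main
thread in Python, so run_wrapped is meant to be called from there.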

The step launch logic of Slurm is augmented to check if srun is running under srun_cr. If
true, the environment variable SLURM_SRUN_CR_SOCKET should be present, the value of which
is the address of a Unix domain socket created and listened on by srun_cr. After
launching the tasks, srun tries to connect to the socket and sends the job ID, step ID and
the nodes allocated to the step to srun_cr.
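
The handshake above can be sketched in Python as follows. Only the environment variable
name SLURM_SRUN_CR_SOCKET comes from this page. The function names and the one-line text
message "job_id step_id nodelist" are hypothetical; the actual message format used
between srun and srun_cr is not described here.

```python
import os
import socket
import tempfile


def listen_for_step(sock_path):
    """Wrapper side: create a Unix domain socket and listen on it."""
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(sock_path)
    server.listen(1)
    return server


def report_step(job_id, step_id, nodelist, env=os.environ):
    """Launcher side: report the launched step if the socket variable is set."""
    path = env.get("SLURM_SRUN_CR_SOCKET")
    if path is None:
        return False  # not running under the wrapper, nothing to report
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(path)
        # Hypothetical wire format: a single newline-terminated text line.
        client.sendall(f"{job_id} {step_id} {nodelist}\n".encode())
    return True


def receive_step(server):
    """Wrapper side: accept one connection and parse the reported step."""
    conn, _ = server.accept()
    with conn:
        data = b""
        while not data.endswith(b"\n"):
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
    job_id, step_id, nodelist = data.decode().split()
    return int(job_id), int(step_id), nodelist


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "srun_cr.sock")
    server = listen_for_step(path)
    report_step(1234, 0, "node[01-04]", env={"SLURM_SRUN_CR_SOCKET": path})
    print(receive_step(server))
    server.close()
```

The sketch passes the environment in explicitly so that both sides can be exercised in
one process. In the arrangement described above, the wrapper would export the variable
before starting the launcher as a child.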

Upon checkpoint, srun_cr checks to see if the tasks have been launched. If so, srun_cr
first forwards the checkpoint request to the tasks by calling the Slurm API
slurm_checkpoint_tasks() before dumping its process context.
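
The ordering of the checkpoint steps can be sketched in Python. This is an illustration
under stated assumptions: the request is forwarded only when a step has actually been
launched, forward_to_tasks is a hypothetical stand-in for the call to
slurm_checkpoint_tasks(), and dump_self is a hypothetical stand-in for dumping the
wrapper's own process context.

```python
def on_checkpoint(step, forward_to_tasks, dump_self):
    """Run the checkpoint steps in order and return the actions taken.

    step is None until the launcher has reported a launched step; otherwise
    it is a (job_id, step_id, nodelist) tuple.
    """
    actions = []
    if step is not None:
        # Tasks exist, so checkpoint them before the wrapper itself.
        job_id, step_id, nodelist = step
        forward_to_tasks(job_id, step_id, nodelist)
        actions.append("forward")
    # The wrapper's own context is always dumped last.
    dump_self()
    actions.append("dump")
    return actions


if __name__ == "__main__":
    print(on_checkpoint((1234, 0, "node[01-04]"),
                        lambda j, s, n: None,
                        lambda: None))
```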

Upon restart, srun_cr checks to see if the tasks have been previously launched and
checkpointed. If true, the environment variable SLURM_RESTART_DIR is set to the directory
of the checkpoint image files of the tasks. Then srun is forked and executed again. The
environment variable will be used by the srun command to restart execution of the tasks
from the previous checkpoint.
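
The restart step can be sketched in Python as building the environment for the relaunched
child. Only the variable name SLURM_RESTART_DIR comes from this page. The function name
restart_environment and the example directory are hypothetical.

```python
def restart_environment(base_env, image_dir):
    """Return a copy of base_env for the relaunched child.

    image_dir is the directory holding the tasks' checkpoint images, or None
    if the tasks were never launched and checkpointed.
    """
    env = dict(base_env)  # copy, so the caller's environment is untouched
    if image_dir is not None:
        env["SLURM_RESTART_DIR"] = image_dir
    return env


if __name__ == "__main__":
    import os
    # Hypothetical image directory, for illustration only.
    env = restart_environment(os.environ, "/tmp/ckpt/1234.0")
    print(env["SLURM_RESTART_DIR"])
```

The resulting mapping would then be passed to the relaunched child, for example through
the env argument of subprocess.Popen.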

COPYING


Copyright (C) 2009 National University of Defense Technology, China. Produced at National
University of Defense Technology, China (cf, DISCLAIMER).

This file is part of Slurm, a resource management program. For details, see
<http://slurm.schedmd.com/>.

Slurm is free software; you can redistribute it and/or modify it under the terms of the
GNU General Public License as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.

Slurm is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without
even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
