MESSy/CLaMS: Restarts

When the restart event is triggered, the state of the model is dumped with full precision to restart files.

Write restart files

  • At the end of a MESSy simulation restart files are written.
  • Restart files can be written in a given simulation time interval. The simulation can be interrupted and restarted automatically when a given number of cycles is reached (TIMER-User-Manual, 4.4). The interval and the number of cycles can be specified in messy/nml/DEFAULTS/timer.nml, e.g.:

    IO_RERUN_EV = 1,'month','first',0,
    NO_CYCLES   = 12           ! restart cycles without break

    => Restart files are written at the beginning of each new month, and after 12 months the simulation is interrupted and restarted automatically.

  • If the model start time is noon and restart files should be written at noon:

    • use unit 'hours':
      IO_RERUN_EV = 240,'hours','first',0
      or use unit 'days' and set the offset to 86400 [sec]:
      IO_RERUN_EV = 10,'days','first',86400
      This offset is not allowed for units 'years' and 'months'.
  • If the job is submitted to a queue manager, it might be necessary to split the simulation into chain elements. The submodel QTIMER triggers the restart just before the maximum time reserved by the scheduler is reached (Development cycle 2 of the Modular Earth Submodel System, section 4). The queue time limit QWCH can be specified in the messy-script and is substituted into qtimer.nml, e.g.:

    &CTRL
    QTIME  =  $QWCH,0,0,  ! queue time limit (hh,mm,se); 0,0,0  to switch off
    QCLOCK = 'wall',  ! queue clock type (wall|cpu|user|sys)
    QFRAC  = 0.95     ! usable fraction of queue time limit

    => QWCH=4: When 95% of the 4-hour queue time limit has elapsed (measured as wall-clock time here, since QCLOCK = 'wall'), restart files are written and the next chain element is started.
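
The trigger time implied by these settings can be worked out by hand: it is the queue time limit scaled by the usable fraction. A minimal Python sketch of that arithmetic, assuming QTIMER simply multiplies QTIME by QFRAC (the actual QTIMER logic may differ; the function name `restart_trigger` is hypothetical and not part of MESSy):

```python
from datetime import timedelta

def restart_trigger(hours, minutes, seconds, qfrac):
    """Return the elapsed time after which the restart would be triggered.

    Assumption: trigger time = QFRAC * QTIME. A limit of 0,0,0 means the
    queue timer is switched off, so None is returned in that case.
    """
    limit = timedelta(hours=hours, minutes=minutes, seconds=seconds)
    if limit == timedelta(0):
        return None  # QTIME = 0,0,0 switches the timer off
    return limit * qfrac

# QWCH=4 gives QTIME = 4,0,0; with QFRAC = 0.95 the trigger is at 3 h 48 min
print(restart_trigger(4, 0, 0, 0.95))
```

When choosing QWCH, this leaves about 12 minutes of a 4-hour slot for writing the restart files and starting the next chain element.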

Restart model

  • If the file MSH_NO is in the working-directory, the model is started in rerun-mode. MSH_NO contains the number of the last chain-element.
    If you want to run the simulation again from the beginning, remove the file MSH_NO before starting the run script.

  • All files needed for a rerun starting from a specific chain element are saved in the subdirectory save/NNNN of the working directory.

    NNNN is the 4-digit number of the last complete chain element.

    The restart files of the last chain-element are linked into the working directory.

    In order to start a rerun with chain element NNNN+1, the script messy/util/init_restart can be used to link the correct restart files:

    messy-dir/messy/util/init_restart -r NNNN -c MMMM [-d dir]

    NNNN: restart number (number of batch job in a job chain)
    MMMM: cycle number (number of restart within a batch job)

  • Restarts after abnormal termination:

    If the model stops due to an error or a hardware problem, it can be restarted manually.
    In this case, all restart files are located in the working directory and have not been saved to the subdirectories save/NNNN.
    To clean up the working directory and move the restart files to the subdirectories, call the run-script with option '-c':

    xmessy_mmd -c

    Then the most recent restart files can be linked into the working directory:

    messy-dir/messy/util/init_restart -r NNNN -c CCCC

    Now the model can be restarted.
  • The name of the experiment (EXP_NAME in the run-script) must not contain the substring restart.
    All files matching *restart* are removed before the current restart files are linked.
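
The conventions above can be checked before launching a rerun. The following is a minimal Python sketch, not part of MESSy: all helper names are hypothetical, and it assumes that MSH_NO holds a single integer as plain text and that NNNN is zero-padded, neither of which is stated explicitly here.

```python
from pathlib import Path

def last_chain_element(workdir):
    """Return the chain element number stored in MSH_NO, or None if absent.

    Assumption: MSH_NO contains one integer as plain text.
    """
    msh_no = Path(workdir) / "MSH_NO"
    if not msh_no.is_file():
        return None  # no MSH_NO: the model would start from the beginning
    return int(msh_no.read_text().split()[0])

def save_dir(workdir, element):
    """Path of the save/NNNN subdirectory (assumed 4-digit, zero-padded)."""
    return Path(workdir) / "save" / f"{element:04d}"

def exp_name_ok(exp_name):
    """EXP_NAME must not contain the substring 'restart'."""
    return "restart" not in exp_name
```

For example, `save_dir(".", 7)` yields the path `save/0007`, and `exp_name_ok("my_restart_test")` returns `False` because files with that name would be removed before linking.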

messy/Restart (last edited 2023-01-31 10:52:36 by NicoleThomas)