Compile and execute CLaMS programs on JUROPA

Compiler

To use the Intel compiler together with the ParaStation MPI environment, the following module must be loaded:

  • module load parastation/intel 
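
As a quick check before compiling, you can confirm that the module is active (a minimal sketch; the mpif90 wrapper name is an assumption about the ParaStation installation):

  module load parastation/intel
  module list          # parastation/intel should appear among the loaded modules
  which mpif90         # the MPI compiler wrapper should now be found in the PATH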

Compilation

All libraries and module files that are used are linked under the following directories:

  • /lustre/jhome4/jicg11/jicg1108/clams_lib
  • /lustre/jhome4/jicg11/jicg1108/clams_mod

The CLaMS makefiles are adjusted for compilation of CLaMS programs on JUROPA with the Intel compiler (mkincl/config/config.Linux_ifc).

netCDF version 3.6.3 is used. For netCDF 4, see Used Libraries (netCDF3/netCDF4).
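
For orientation, the following hedged sketch shows how a single source file might be compiled and linked against these directories outside of the CLaMS makefiles; the program name myprog and the library name -lnetcdf are assumptions, only the directory paths are taken from above:

  # hedged sketch, not part of the official build: myprog is a placeholder name
  mpif90 -I/lustre/jhome4/jicg11/jicg1108/clams_mod \
         -L/lustre/jhome4/jicg11/jicg1108/clams_lib -lnetcdf \
         -o myprog myprog.f90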

The CLaMS programs can be compiled with:

  • gmake useComp=ifc [useMPI=true] progname

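For example (a hedged illustration; traj is a placeholder for an actual CLaMS program name):

  gmake useComp=ifc traj               # serial build with the Intel compiler
  gmake useComp=ifc useMPI=true traj   # MPI-enabled build
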
Create batch script

Write a batch script that includes the mpiexec command for the newly compiled executable:

  #!/bin/bash -x
  #MSUB -l nodes=<no of nodes>:ppn=<no of procs/node>
  #MSUB -l walltime=<hh:mm:ss>
  #MSUB -e <full path for error file>
  #MSUB -o <full path for output file>
  #MSUB -M <mail-address>
  #      official mail address
  #MSUB -m n|a|b|e
  #      send mail: never, on abort, beginning or end of job

  ### start of jobscript
  cd $PBS_O_WORKDIR
  echo "workdir: $PBS_O_WORKDIR"
  NP=$(cat $PBS_NODEFILE | wc -l)   # nodes*ppn
  echo "running on $NP cpus ..."
  mpiexec -np $NP <executable>

Job Limits

Batch jobs

  • max. number of nodes: 256 (default: 1)
  • max. wall-clock time: 12 hours (default: 30 min)

Interactive jobs

  • max. number of nodes: 2 (default: 1)
  • max. wall-clock time: 30 min (default: 10 min)
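
As an illustration, a batch request at these upper limits would look like this in the jobscript header (a hedged sketch; ppn=8 matches the value used for interactive sessions below):

  #MSUB -l nodes=256:ppn=8       # maximum number of nodes for batch jobs
  #MSUB -l walltime=12:00:00     # maximum wall-clock time for batch jobs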

Submit job

Submit the job with:

  • msub <jobscript>

On success msub returns the job ID of the submitted job.
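
A minimal usage sketch (clams_job.sh is a placeholder name for the jobscript written above):

  # capture the returned job ID for later status queries; tr strips surrounding whitespace
  JOBID=$(msub clams_job.sh | tr -d '[:space:]')
  echo "submitted job $JOBID"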

Useful commands

  • checkjob -v <jobid>
    get detailed information about a job

  • showq [r]
    show status of all (running) jobs

  • canceljob <jobid>
    cancel a job

  • mjobctl -q starttime <jobid>
    show estimated starttime of specified job

  • mjobctl --help
    show detailed information about this command
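
Put together, a typical monitoring sequence might look like this (a hedged sketch reusing the JOBID variable from the submission example above):

  checkjob -v $JOBID              # detailed information about the job
  mjobctl -q starttime $JOBID     # estimated start time
  showq -r                        # status of all running jobs
  canceljob $JOBID                # cancel the job if it is no longer needed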

Problems with timeouts using MPI

Currently, jobs requesting more than 128 nodes (> 1024 cores) fail with PMI barrier timeouts. This is caused by a problem in the ParaStation process management that has not been fixed yet. As a workaround, ParTec recommends setting the following environment variables in the job script of large MPI applications:

  export __PSI_LOGGER_TIMEOUT=3600  
  export PMI_BARRIER_TMOUT=1800  

just before the 'mpiexec' call. Similar problems have been observed on workstations as well.
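
In the batch script from above, the workaround sits directly in front of the mpiexec line (a minimal sketch using only the variables already defined there):

  export __PSI_LOGGER_TIMEOUT=3600
  export PMI_BARRIER_TMOUT=1800
  mpiexec -np $NP <executable>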

Interactive Session

  • Allocate an interactive partition with X11 forwarding (see the example below):
    msub -I -X -l nodes=#:ppn=8
    possible number of nodes: 1 or 2
  • Start your applications
  • Deallocate the interactive partition with:
    exit
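
Put together, an interactive session at the maximum of 2 nodes might look like this (a minimal sketch based on the commands above):

  msub -I -X -l nodes=2:ppn=8    # request 2 nodes with X11 forwarding
  # ... start applications ...
  exit                           # deallocate the interactive partition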
