Debugging the GEOS-5 GCM

This page describes the process of building GEOS-5 Ganymed 1.0 on NCCS discover for debugging purposes.

It is assumed that you are able to build and run the model as detailed on the Ganymed 1.0 Quick Start page. Indeed, if you couldn't, you wouldn't know that you had an issue worth debugging.

Obtaining the model (optional)

If you need to, obtain the model as described in the Ganymed 1.0 Quick Start page.

Compiling the model for debugging

Setup modules for compiling

To compile the model for debugging, first set up the environment by sourcing the g5_modules file located in GEOSagcm/src:

$ cd GEOSagcm/src
$ source g5_modules

This should set up the modules for compiling:

$ module list
Currently Loaded Modulefiles:
 1) comp/intel-11.0.083                      3) lib/mkl-10.0.3.020
 2) mpi/impi-3.2.2.006                       4) other/SIVO-PyD/spd_1.6.0_gcc-4.3.4-sp1
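
Note that g5_modules also sets the BASEDIR environment variable, which the run script later uses to extend LD_LIBRARY_PATH. A quick sanity check (the value will vary by model version and site):

$ echo $BASEDIR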

Make the model with debug options

Now compile the model for debugging by running make with the optional argument BOPT=g:

$ make install BOPT=g

This should take roughly 30 minutes to build.

NOTE: This will build the entire model with debugging symbols. This is often overkill: one can build only the afflicted code with debugging enabled and leave the rest of the model as-is; for now, though, we take the safest bet and compile the whole model.
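
If you do want to restrict debugging symbols to one part of the model, a minimal sketch is to re-run make install with BOPT=g from within the directory holding the suspect code (the component path below is hypothetical; substitute the directory you actually suspect):

$ cd GEOSagcm/src/GEOSgcs_GridComp/SomeSuspect_GridComp   # hypothetical path
$ make install BOPT=g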

Optional method of debug compiling with parallel_build.csh

You can also compile the model with the parallel_build.csh script by passing the optional -debug flag:

> ./parallel_build.csh -debug
g5_modules: Setting BASEDIR and modules for discover25

   ================
    PARALLEL BUILD 
   ================

The build will proceed with 10 parallel processes on 12 CPUs.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
LOG and info will be written to the src/BUILD_LOG_DIR directory.
Do the following to check the status/success of the build:

  cd BUILD_LOG_DIR
  ./gmh.pl [-v] LOG[.n]

Note: Check the latest version of LOG for the most recent build info.



Sp Code|  Org  | Sponsor            | Research
-------+-------+--------------------+----------------------------------
 g0620 | 610.1 | Michele Rienecker  | GMAO - Systems and Data Synthesis

select group: [g0620] 
qsub -W group_list=g0620 -N parallel_build -l select=1:ncpus=12:mpiprocs=10:proc=west -l walltime=1:00:00 -S /bin/csh -V -j oe ./parallel_build.csh
1203711.pbsa1
unset echo
1203711.pbsa1   mathomp4 debug    parallel_b    --    1  12 3892mb 01:00 Q   -- 

As the output states, you can monitor the progress of the job in the BUILD_LOG_DIR directory. (Note: This output will be slightly different depending on the Project Code and Username.)
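
Since parallel_build.csh submits a batch job, you can also check on it with the usual PBS tools while it runs:

$ qstat -u $USER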

Set up a GCM run

Following the Ganymed 1.0 Quick Start page, set up a run using the gcm_setup script.

By now you probably have a similar run that crashed (which is why you are debugging). What you'll want to do is copy the same restarts (and cap_restart) you used in the run that crashed into your new job. The best restarts to use are ones as close in time to the crash as possible: because running the model in debugging mode is so expensive, you want the model to reach the point of failure after as little simulated time as possible.
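
As a sketch, with hypothetical experiment directory names and assuming the usual *_rst restart file naming, the copy might look like:

$ cd /discover/nobackup/$USER/debug_exp
$ cp /discover/nobackup/$USER/crashed_exp/*_rst .
$ cp /discover/nobackup/$USER/crashed_exp/cap_restart .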

Set up the GCM to run under Totalview

Much of what is included in this section is based on information from NCCS's Computing Primer's entry on Totalview.

Determine MPI Layout

Next, determine the MPI layout you will be running under. For example, if you have just set up a 2-degree lat-lon run, you'll most likely be running on 4 nodes with 12 cores per node. This can be determined by looking at both AGCM.rc:

$ head -10 AGCM.rc 

# Atmospheric Model Configuration Parameters
# ------------------------------------------
           NX: 4
           NY: 12
      AGCM_IM: 144
      AGCM_JM: 91
      AGCM_LM: 72
AGCM_GRIDNAME: PC144x91-DC

and gcm_run.j:

$ head -10 gcm_run.j 
#!/bin/csh -f

#######################################################################
#                     Batch Parameters for Run Job
####################################################################### 

#PBS -l walltime=12:00:00
#PBS -l select=4:ncpus=12:mpiprocs=12
#PBS -N test-G10p1-_RUN
#PBS -q general

where the NX and NY entries in AGCM.rc and the #PBS -l select line in gcm_run.j show you that information: NX x NY = 4 x 12 = 48 MPI tasks, which matches 4 nodes at 12 cores (and 12 MPI processes) per node.
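
A quick way to pull those numbers out of the experiment directory (a sketch using standard tools):

$ grep -E '^ *N[XY]:' AGCM.rc
$ grep 'PBS -l select' gcm_run.j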

Submit an Interactive Job

Next, submit an interactive job using the MPI geometry from gcm_run.j. Note that we use xsub because DISPLAY must be passed through, as Totalview is an X application:

$ xsub -I -V -l select=4:ncpus=12:mpiprocs=12,walltime=2:00:00
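
Once the interactive shell starts, it is worth confirming that DISPLAY made it through, since Totalview cannot open its window without it:

$ echo $DISPLAY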

Reload modules for model

Now that you are in a new shell, you should once again source g5_modules so that you have the correct setup.
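
As a sketch (the path to your checkout will differ):

$ cd /path/to/GEOSagcm/src    # hypothetical path to your source tree
$ source g5_modules
$ module list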

Edit gcm_run.j to use Totalview

Once you have everything set up, you must alter gcm_run.j to run Totalview. To do so, open gcm_run.j in your favorite editor. First, add the last two lines shown below (the module load tool/tview-8.9.2.2 and setenv TVDSVRLAUNCHCMD ssh lines) under the Architecture Specific Environment Variables section:

#######################################################################
#           Architecture Specific Environment Variables
#######################################################################

setenv ARCH `uname`

setenv SITE             NCCS
setenv GEOSBIN          /discover/swdev/mathomp4/Models/Ganymed-1_0_p1/GEOSagcm/Linux/bin 
setenv RUN_CMD         "mpirun -perhost 12 -np "
setenv GCMVER           Ganymed-1_0

source $GEOSBIN/g5_modules
setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:${BASEDIR}/${ARCH}/lib

module load tool/tview-8.9.2.2
setenv TVDSVRLAUNCHCMD ssh 

This sets up the environment to run Totalview.

Next, look for the section where the GEOSgcm.x executable is run:

##################################################################
######
######         Perform multiple iterations of Model Run
######
##################################################################
@ counter    = 1 
while ( $counter <= ${NUM_SGMT} )

       @  NPES = $NX * $NY  

# Assume gcm_setup set these properly for the local platform
    setenv I_MPI_USE_DYNAMIC_CONNECTIONS 0
    setenv I_MPI_JOB_STARTUP_TIMEOUT 10000
    setenv DAPL_ACK_RETRY 7
    setenv DAPL_ACK_TIMER 22
    setenv DAPL_RNR_RETRY 7
    setenv DAPL_RNR_TIMER 28
    setenv I_MPI_RDMA_RNDV_WRITE 1
$RUN_CMD $NPES ./GEOSgcm.x 

set rc =  $status
echo       Status = $rc 

where the $RUN_CMD $NPES ./GEOSgcm.x line is the line in question. To run in Totalview, replace it with the three lines (mpdboot, totalview, and exit) shown below:

    <snip>
    setenv DAPL_RNR_TIMER 28
    setenv I_MPI_RDMA_RNDV_WRITE 1
mpdboot -n 4 -r ssh -f $PBS_NODEFILE
totalview ./GEOSgcm.x 
exit

set rc =  $status
echo       Status = $rc
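
Note that the -n 4 argument to mpdboot corresponds to the four nodes requested in this example. If your run uses a different geometry, adjust it to match the select= value in the #PBS line; for example, for an eight-node run:

mpdboot -n 8 -r ssh -f $PBS_NODEFILE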

Run the GCM

Since you are in an interactive session, to run the model just execute the gcm_run.j script:

$ ./gcm_run.j
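
If you also want a record of the run while watching it, a sketch using csh redirection (the log file name is arbitrary):

$ ./gcm_run.j |& tee debug_run.log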

Debugging with Totalview

To debu

Heisenbugs

We note for completeness that there is a chance that the process above could lead to what's known as a Heisenbug: a bug that disappears or changes behavior when you try to observe it, for instance once the code is compiled with debugging options or run under a debugger.