Debugging the GEOS-5 GCM
This page describes the process of building the GEOS-5 GCM on the NCCS discover cluster for debugging purposes.
It is assumed that you are able to build and run the model. One place this is described is on the Ganymed 1.0 Quick Start page. Indeed, if you couldn't, you wouldn't know that you had an issue worth debugging.
Obtaining the model (optional)
If you need to, obtain the model as described in the Ganymed 1.0 Quick Start page.
Compiling the model for debugging
Setup modules for compiling
To compile the model for debugging, first set up the environment by sourcing the g5_modules file located in GEOSagcm/src:
$ cd GEOSagcm/src
$ source g5_modules
This should set up the modules for compiling:
$ module list
Currently Loaded Modulefiles:
  1) comp/intel-11.0.083     3) lib/mkl-10.0.3.020
  2) mpi/impi-3.2.2.006      4) other/SIVO-PyD/spd_1.6.0_gcc-4.3.4-sp1
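If you want to confirm the environment is ready, a quick sanity check (assuming g5_modules exports BASEDIR and puts the Intel compiler on your PATH, as the modules listed above suggest) is:

$ echo $BASEDIR
$ which ifort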
Make the model with debug options
Now compile the model for debugging by passing the optional argument BOPT=g to make:
$ make install BOPT=g
This should take roughly 30 minutes to build.
NOTE: This will build the entire model with debugging symbols. That is often overkill; one can instead build only the afflicted code with debugging enabled and leave the rest of the model optimized. For now, though, we take the safest bet and compile the whole model.
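If you do want to limit the debug build, a hypothetical sequence for rebuilding just one suspect component might look like the following (the component path is only illustrative, and this assumes the build system supports per-directory installs):

$ cd GEOSagcm/src
$ source g5_modules
$ make install                            # full optimized build
$ cd <path_to_suspect_component>          # illustrative placeholder only
$ make install BOPT=g                     # rebuild just this piece with debugging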
Optional method of debug compiling with parallel_build.csh
You can also compile the model with the parallel_build.csh script by passing the optional -debug flag:
> ./parallel_build.csh -debug
g5_modules: Setting BASEDIR and modules for discover25

================
 PARALLEL BUILD
================

The build will proceed with 10 parallel processes on 12 CPUs.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
LOG and info will be written to the src/BUILD_LOG_DIR directory.
Do the following to check the status/success of the build:

  cd BUILD_LOG_DIR
  ./gmh.pl [-v] LOG[.n]

Note: Check the latest version of LOG for the most recent build info.

Sp Code| Org   | Sponsor            | Research
-------+-------+--------------------+----------------------------------
 g0620 | 610.1 | Michele Rienecker  | GMAO - Systems and Data Synthesis

select group: [g0620]
qsub -W group_list=g0620 -N parallel_build -l select=1:ncpus=12:mpiprocs=10:proc=west -l walltime=1:00:00 -S /bin/csh -V -j oe ./parallel_build.csh
1203711.pbsa1
unset echo
1203711.pbsa1   mathomp4 debug    parallel_b   --    1  12  3892mb 01:00 Q   --
As the output states, you can monitor the progress of the job in the BUILD_LOG_DIR directory. (Note: This output will be slightly different depending on your Project Code and Username.)
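For example, to watch the PBS build job and then summarize the most recent log with the gmh.pl helper mentioned in the output above:

$ qstat -u $USER          # watch the parallel_build job
$ cd BUILD_LOG_DIR
$ ./gmh.pl -v LOG         # check the latest build log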
Set up a GCM run
Following the Ganymed 1.0 Quick Start page, set up a run using the gcm_setup script.
By now you probably have a similar run that crashed (which is why you are debugging). Copy the same restarts (and cap_restart) you used in the crashed run into your new job. The best restarts to use are ones that are as close in time to the crash as possible, because running the model in debugging mode is so expensive that you want to reach the point of failure in as few model steps as possible.
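As a sketch, copying the restarts and cap_restart from the crashed experiment into the new one might look like the following (the directory names are placeholders for your own experiment directories, and this assumes the usual *_rst restart file naming):

$ cd /discover/nobackup/$USER/new_experiment
$ cp /discover/nobackup/$USER/crashed_experiment/*_rst .
$ cp /discover/nobackup/$USER/crashed_experiment/cap_restart .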
Set up the GCM to run under Totalview
Much of what is included in this section is based on information from NCCS's Computing Primer's entry on Totalview.
Determine MPI Layout
Next, determine the MPI layout you will be running under. For example, if you have just set up a 2-degree lat-lon run, you'll most likely be running on 4 nodes with 12 cores per node. This can be determined by looking at both AGCM.rc:
$ head -10 AGCM.rc

# Atmospheric Model Configuration Parameters
# ------------------------------------------

NX: 4
NY: 12

AGCM_IM: 144
AGCM_JM: 91
AGCM_LM: 72
AGCM_GRIDNAME: PC144x91-DC
and gcm_run.j:
$ head -10 gcm_run.j
#!/bin/csh -f

#######################################################################
#                     Batch Parameters for Run Job
#######################################################################

#PBS -l walltime=12:00:00
#PBS -l select=4:ncpus=12:mpiprocs=12
#PBS -N test-G10p1-_RUN
#PBS -q general
where the NX and NY entries in AGCM.rc and the #PBS -l select line in gcm_run.j show you that information.
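A quick way to pull out just those entries is something like the following (output shown for the 2-degree example above):

$ grep -E '^NX:|^NY:' AGCM.rc
NX: 4
NY: 12
$ grep 'PBS -l select' gcm_run.j
#PBS -l select=4:ncpus=12:mpiprocs=12

Note that NX times NY (4 x 12 = 48 MPI tasks) should match the number of nodes times mpiprocs (4 x 12 = 48) requested in gcm_run.j.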
Submit an Interactive Job
Next, submit an interactive job using the MPI geometry from gcm_run.j. Note that we use xsub because we need DISPLAY to be passed through, as Totalview is an X application:
$ xsub -I -V -l select=4:ncpus=12:mpiprocs=12,walltime=2:00:00
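Once the interactive job starts, you can verify that X forwarding made it through to the compute node (Totalview needs a working DISPLAY):

$ echo $DISPLAY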
Reload modules for model
Now that you are in a new shell, you should once again source g5_modules so that you have the correct setup.
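For example (the experiment directory below is a placeholder for your own run directory):

$ cd GEOSagcm/src
$ source g5_modules
$ module list                                  # confirm the expected modules are loaded
$ cd /discover/nobackup/$USER/my_experiment    # placeholder: your experiment directory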
Edit gcm_run.j to use Totalview
Once you have everything set up, you must alter gcm_run.j to run Totalview. To do so, open gcm_run.j in your favorite editor. First, add the last two lines shown below (the module load tool/tview-8.9.2.2 and setenv TVDSVRLAUNCHCMD ssh lines) under the Architecture Specific Environment Variables section:
#######################################################################
#           Architecture Specific Environment Variables
#######################################################################

setenv ARCH `uname`

setenv SITE             NCCS
setenv GEOSBIN          /discover/swdev/mathomp4/Models/Ganymed-1_0_p1/GEOSagcm/Linux/bin
setenv RUN_CMD         "mpirun -perhost 12 -np "
setenv GCMVER           Ganymed-1_0

source $GEOSBIN/g5_modules
setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:${BASEDIR}/${ARCH}/lib

module load tool/tview-8.9.2.2
setenv TVDSVRLAUNCHCMD ssh
This sets up the environment to run Totalview.
Next, look for the section where the GEOSgcm.x executable is run:
##################################################################
######
######         Perform multiple iterations of Model Run
######
##################################################################

@ counter    = 1
while ( $counter <= ${NUM_SGMT} )

@ NPES = $NX * $NY

# Assume gcm_setup set these properly for the local platform
setenv I_MPI_USE_DYNAMIC_CONNECTIONS 0
setenv I_MPI_JOB_STARTUP_TIMEOUT 10000
setenv DAPL_ACK_RETRY 7
setenv DAPL_ACK_TIMER 22
setenv DAPL_RNR_RETRY 7
setenv DAPL_RNR_TIMER 28
setenv I_MPI_RDMA_RNDV_WRITE 1

$RUN_CMD $NPES ./GEOSgcm.x

set rc = $status
echo Status = $rc
where $RUN_CMD $NPES ./GEOSgcm.x is the line in question. To run under Totalview, replace this line with the three lines (mpdboot, totalview, and exit) shown below:
<snip>
setenv DAPL_RNR_TIMER 28
setenv I_MPI_RDMA_RNDV_WRITE 1

mpdboot -n 4 -r ssh -f $PBS_NODEFILE
totalview ./GEOSgcm.x
exit

set rc = $status
echo Status = $rc
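Note that the -n 4 argument to mpdboot corresponds to the four nodes requested in the interactive job. If your geometry differs, a sketch that derives the node count from the PBS node file (assuming $PBS_NODEFILE lists one entry per MPI slot, as is typical under PBS) would be:

# csh: derive the node count instead of hard-coding 4 (illustrative)
set nnodes = `sort -u $PBS_NODEFILE | wc -l`
mpdboot -n $nnodes -r ssh -f $PBS_NODEFILE
totalview ./GEOSgcm.x
exit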
Run the GCM
As you have an interactive session, to run the model just run the gcm_run.j script:
$ ./gcm_run.j
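Because a debug run under Totalview can take a while, it can be handy to capture the script's output to a log file as well (csh syntax; the log file name is arbitrary):

$ ./gcm_run.j |& tee gcm_run.log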
Debugging with Totalview
Assuming everything was set up correctly, three windows will pop up when TotalView executes: a process window, a startup parameter window, and a TotalView status window. Several parameters must be set in the startup parameter window. Click on the Parallel tab (see below). Click on the parallel system box to set the MPI stack being used; in the example on this page we are using Intel MPI. Next, select the number of nodes, in this case four. Finally, set the total number of MPI tasks, which is the number of nodes times the number of MPI processes per host; in this case it is 48.
Heisenbugs
We note for completeness that there is a chance the process above could lead to what's known as a Heisenbug: a bug that changes behavior or disappears when you try to observe it, for example one that appears in the optimized build but not in the debug build.