Debugging the GEOS-5 GCM: Difference between revisions
Add up to gcm_run edit |
No edit summary |
||
Line 1: | Line 1: | ||
{{rightTOC}} | {{rightTOC}} | ||
This page describes the process of building GEOS-5 Ganymed 1.0 on NCCS discover for debugging purposes. | This page describes the process of building GEOS-5 Ganymed 1.0 on NCCS discover for debugging purposes. | ||
It is assumed that you are able to build and run the model as detailed on the [[Ganymed 1.0 Quick Start]] page. Indeed, if you couldn't, you wouldn't know that you had an issue worth debugging. | |||
== Obtaining the model (optional) == | == Obtaining the model (optional) == | ||
Line 32: | Line 32: | ||
This should take roughly 30 minutes to build. | This should take roughly 30 minutes to build. | ||
'''NOTE:''' This will build the '''entire''' model with debugging symbols. This is often overkill: in many cases one can build only the affected code with debugging on and leave the rest of the model without. For now, though, we take the safest bet and compile the whole model. | |||
=== Optional method of debug compiling with <code>parallel_build.csh</code> === | === Optional method of debug compiling with <code>parallel_build.csh</code> === | ||
Line 69: | Line 71: | ||
As the output states, you can monitor the progress of the job in the <code>BUILD_LOG_DIR</code> directory. ('''Note''': This output will be slightly different depending on the Project Code and Username.) | As the output states, you can monitor the progress of the job in the <code>BUILD_LOG_DIR</code> directory. ('''Note''': This output will be slightly different depending on the Project Code and Username.) | ||
== Set up a run == | == Set up a GCM run == | ||
Following the [[Ganymed 1.0 Quick Start]] page, [[Ganymed 1.0 Quick Start#Setting up a Run|set up a run]] using the <code>gcm_setup</code> script. | Following the [[Ganymed 1.0 Quick Start]] page, [[Ganymed 1.0 Quick Start#Setting up a Run|set up a run]] using the <code>gcm_setup</code> script. | ||
== | By now you probably have a similar run that crashed (which is why you are debugging). What you'll want to do is to copy the same restarts (and <code>cap_restart</code>) you used in the run that crashed into your new job. The best restarts to use are ones that are as close in time to the crash as possible. Because running the model in debugging mode is so expensive, | ||
== Set up the GCM to run under Totalview == | |||
Much of what is included in this section is based on information from [http://www.nccs.nasa.gov/primer/computing.html#totalview NCCS's Computing Primer's entry on Totalview]. | Much of what is included in this section is based on information from [http://www.nccs.nasa.gov/primer/computing.html#totalview NCCS's Computing Primer's entry on Totalview]. | ||
Line 116: | Line 120: | ||
=== Reload modules for model === | === Reload modules for model === | ||
Now that you are in a new shell, you | Now that you are in a new shell, you should once again [[#Setup modules for compiling|source <code>g5_modules</code>]] so that you have the correct setup. | ||
=== Edit <code>gcm_run.j</code> to use Totalview === | |||
Once you have everything set up, you must alter <code>gcm_run.j</code> to run Totalview. To do so, open <code>gcm_run.j</code> in your favorite editor. First, add the two bolded lines under the <code>Architecture Specific Environment Variables</code> section: | |||
####################################################################### | |||
# Architecture Specific Environment Variables | |||
####################################################################### | |||
setenv ARCH `uname` | |||
setenv SITE NCCS | |||
setenv GEOSBIN /discover/swdev/mathomp4/Models/Ganymed-1_0_p1/GEOSagcm/Linux/bin | |||
setenv RUN_CMD "mpirun -perhost 12 -np " | |||
setenv GCMVER Ganymed-1_0 | |||
source $GEOSBIN/g5_modules | |||
setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:${BASEDIR}/${ARCH}/lib | |||
'''module load tool/tview-8.9.2.2''' | |||
'''setenv TVDSVRLAUNCHCMD ssh ''' | |||
This sets up the environment to run Totalview. | |||
Next, look for the section where the <code>GEOSgcm.x</code> executable is run: | |||
################################################################## | |||
###### | |||
###### Perform multiple iterations of Model Run | |||
###### | |||
################################################################## | |||
$ | @ counter = 1 | ||
while ( $counter <= ${NUM_SGMT} ) | |||
@ NPES = $NX * $NY | |||
# Assume gcm_setup set these properly for the local platform | |||
setenv I_MPI_USE_DYNAMIC_CONNECTIONS 0 | |||
setenv I_MPI_JOB_STARTUP_TIMEOUT 10000 | |||
setenv DAPL_ACK_RETRY 7 | |||
setenv DAPL_ACK_TIMER 22 | |||
setenv DAPL_RNR_RETRY 7 | |||
setenv DAPL_RNR_TIMER 28 | |||
setenv I_MPI_RDMA_RNDV_WRITE 1 | |||
'''$RUN_CMD $NPES ./GEOSgcm.x ''' | |||
set rc = $status | |||
echo Status = $rc | |||
where the line in bold is the line in question. To run in Totalview, replace this line with the three lines shown in bold below: | |||
<snip> | |||
setenv DAPL_RNR_TIMER 28 | |||
setenv I_MPI_RDMA_RNDV_WRITE 1 | |||
'''mpdboot -n 4 -r ssh -f $PBS_NODEFILE''' | |||
'''totalview ./GEOSgcm.x ''' | |||
'''exit''' | |||
set rc = $status | |||
echo Status = $rc | |||
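The launch-line replacement above can also be scripted, which leaves the original <code>gcm_run.j</code> untouched. The following is only a minimal sketch under stated assumptions: <code>make_tv_script</code> is a hypothetical helper name (not part of the GEOS-5 tools), it is written for <code>sh</code> or <code>bash</code> rather than csh, and it assumes the launch line reads <code>$RUN_CMD $NPES ./GEOSgcm.x</code> as shown above. It does not add the <code>module load</code> and <code>setenv TVDSVRLAUNCHCMD</code> lines; those still have to be added by hand.

```shell
# Sketch only: write a Totalview variant of a job script to a new file.
# make_tv_script is a hypothetical helper, not part of the GEOS-5 tools.
make_tv_script() {
    src="$1"   # existing job script, e.g. gcm_run.j
    dst="$2"   # new file to write, e.g. gcm_run_tv.j
    # Swap the model launch line for the three Totalview lines,
    # keeping whatever indentation the original line had.
    sed 's|^\([[:space:]]*\)\$RUN_CMD \$NPES \./GEOSgcm\.x.*$|\1mpdboot -n 4 -r ssh -f $PBS_NODEFILE\
\1totalview ./GEOSgcm.x\
\1exit|' "$src" > "$dst"
}
```

A possible use would be <code>make_tv_script gcm_run.j gcm_run_tv.j</code>, followed by <code>chmod +x gcm_run_tv.j</code> and a quick <code>diff gcm_run.j gcm_run_tv.j</code> to confirm that only the launch line changed.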
=== Run the GCM === | |||
Because you are in an interactive session, you can run the model by executing the <code>gcm_run.j</code> script directly: | |||
$ ./gcm_run.j | |||
== Debugging with Totalview == | |||
To debu | |||
== | == Heisenbugs == | ||
We note for completeness that the process above can produce what is known as a Heisenbug: a bug that shows up in a normal run but vanishes or changes once you rebuild with debugging enabled and go looking for it.