Converting GEOS code from SLES 11 to SLES 12
This page details the changes needed to move a code base from SLES 11 to SLES 12.
If you have questions, please contact the SI Team.
Generic SLES 12 Information
NOTE: As of May 18, 2020, NCCS has moved to SLES 12 by default upon login. This means that when you do ssh discover you will now be on a SLES 12 head node and in the SLES 12 environment (SLURM 19, etc.). In order to get to SLES 11, you must then execute ssh discover-sles11 from a head node. The SLES 11 nodes are not available directly through the Bastion service, so if you are using it, you won't be able to ssh directly to discover-sles11 from outside discover.
NCCS SLES 12 FAQ
NCCS has created a Frequently Asked Questions page to help answer some generic questions about the SLES 12 transition.
Missing tools (xxdiff, ImageMagick, etc.)
You might notice that things like xxdiff and magick no longer exist in your path. On SLES 12, NCCS has moved many non-system-required utilities into Lmod modulefiles. So to get xxdiff, you should:
module load xxdiff
and similar for ImageMagick, tkcvs, and more. If a utility is not available, contacting NCCS Support is the first step.
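If you are not sure of the exact module name, Lmod's spider command can help you find it. A quick sketch (the ImageMagick module name below is an assumption; check the spider output for the real name and version):

# Search the module tree for the tool you need
module spider ImageMagick

# Load it once you know the name (and version, if needed)
module load ImageMagick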
Complaints on login about missing modules
On logging in to SLES 12, you might see complaints that modules cannot be found. The reason is often that you are loading modules in your .bashrc or .tcshrc files. A workaround while we are still in mixed SLES 11/SLES 12 mode is to do something like this for bash:
if [[ -e /etc/os-release ]]
then
   module load <modules-available-on-sles12>
else
   module load <modules-available-on-sles11>
fi
and similar for tcsh:
if ( -e /etc/os-release ) then
   module load <modules-available-on-sles12>
else
   module load <modules-available-on-sles11>
endif
Here we are using the fact that /etc/os-release only exists on the SLES 12 systems as a proxy.
GEOSenv
This also applies to module use statements. If, for convenience, you have been loading GEOSenv at startup in interactive shells, something like:
if [[ -e /etc/os-release ]]
then
   module use -a /discover/swdev/gmao_SIteam/modulefiles-SLES12
else
   module use -a /discover/swdev/gmao_SIteam/modulefiles-SLES11
fi
module load GEOSenv
or in tcsh:
if ( -e /etc/os-release ) then
   module use -a /discover/swdev/gmao_SIteam/modulefiles-SLES12
else
   module use -a /discover/swdev/gmao_SIteam/modulefiles-SLES11
endif
module load GEOSenv
would work, as there is a GEOSenv in both the SLES 11 and SLES 12 SI Team modulefiles. (Of course, any other modules you load must be protected in the same way.)
If you don't do this, you'll get errors like this on SLES 12:
Lmod has detected the following error:  The following module(s) are unknown: "other/git"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore-cache load "other/git"

Also make sure that all modulefiles written in TCL start with the string #%Module

Executing this command requires loading "other/git" which failed while processing the following module(s):

    Module fullname  Module Filename
    ---------------  ---------------
    GEOSenv          /discover/swdev/gmao_SIteam/modulefiles-SLES11/GEOSenv
Eventually you might see something like:
rs_numtiles.x: error while loading shared libraries: libssl.so.0.9.8: cannot open shared object file: No such file or directory
This usually means you are trying to run an executable built on SLES 11 on a SLES 12 node. Not always, but in many cases.
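A quick way to confirm (a general sketch, not GEOS-specific) is to ask the dynamic linker which libraries the executable wants; libraries that only existed on SLES 11 will show as not found on SLES 12:

# List the shared libraries the executable needs and flag any that are missing
ldd ./rs_numtiles.x | grep "not found"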
GEOS-Specific SLES 12 Information
Building code on SLES 12
NCCS has requested that, if at all possible, GEOS parallel builds be done on compute nodes rather than head nodes. If you use parallel_build.csh, this is done for you by default. However, if you use the more manual CMake-then-Make/Ninja build process for GEOS, you should run the make step on a compute node.
NCCS has noticed performance degradation on the head nodes when people run builds like make -j12, since (at the moment) there are not as many SLES 12 head nodes during this transition (some are still SLES 11).
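If you build by hand, one way to get onto a compute node first is an interactive SLURM job. A minimal sketch (the node count and walltime are illustrative, and you may need account, partition, or constraint options appropriate to your allocation):

# Request an interactive shell on one compute node for an hour
srun --nodes=1 --time=01:00:00 --pty /bin/bash

# ...then run the build there, for example:
#   make -j12 install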
g5_modules
The first challenge is getting a g5_modules file that works with your tag. Start by looking at the version of Baselibs your tag uses. Note that if you want to take advantage of both the Haswell and Skylake nodes on SLES 12, you should use Intel MPI as your MPI stack.
NOTE: If you build with MPT, you must do so on the Haswell compute nodes on SLES 12. If you try to build any code with MPT on a Skylake node, it will fail due to missing libraries.
Baselibs 4
If your Baselibs is based on version 4 (say, 4.0.6), the best one to try is:
/gpfsm/dhome/mathomp4/GitG5Modules/SLES12/4.0.11/g5_modules.intel1805.impi1910
If you need MPT, you can use:
/gpfsm/dhome/mathomp4/GitG5Modules/SLES12/4.0.11/g5_modules.intel1805.mpt217
NOTE: This may not be an exact version match, but it is essentially equivalent (it has the same version of ESMF), with newer versions of some libraries that needed updating for newer OSs and Intel 18+.
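As a usage sketch (assuming your tag keeps g5_modules at src/g5_modules, as most old-style tags do), the common approach is to copy the SLES 12 file over your tag's copy and source it from tcsh before building:

# Replace your tag's g5_modules with the SLES 12 one (path layout is an assumption)
cp /gpfsm/dhome/mathomp4/GitG5Modules/SLES12/4.0.11/g5_modules.intel1805.impi1910 src/g5_modules

# From tcsh, load the matching compiler/MPI/Baselibs modules
cd src
source g5_modules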
Baselibs 5
Many GEOS tags use Baselibs 5.1.x, so 5.1.8 is a good substitute. You can use:
/gpfsm/dhome/mathomp4/GitG5Modules/SLES12/5.1.8-Github/g5_modules.intel1805.impi1910
or for MPT:
/gpfsm/dhome/mathomp4/GitG5Modules/SLES12/5.1.8-Github/g5_modules.intel1805.mpt217
For 5.2.x tags, use:
/gpfsm/dhome/mathomp4/GitG5Modules/SLES12/5.2.8-Github/g5_modules.intel1805.impi1910
or for MPT:
/gpfsm/dhome/mathomp4/GitG5Modules/SLES12/5.2.8-Github/g5_modules.intel1805.mpt217
Baselibs 6
If you are using Baselibs 6.0.x, the files to use are:
/gpfsm/dhome/mathomp4/GitG5Modules/SLES12/6.0.4/g5_modules.intel1805.impi1910
/gpfsm/dhome/mathomp4/GitG5Modules/SLES12/6.0.4/g5_modules.intel1805.mpt217
src/Config
NOTE: Most of the fundamental issues with moving to SLES 12 are due to the files in src/Config. Older tags of GEOS do not handle Intel 18+ and Intel MPI 19+ well due to differences in flags and library names. Without updating the files here, the whole build will fall apart because it doesn't know Intel 18+ exists!
Heracles
For a Heracles-based tag, you can start by updating these files to those from bw_Heracles-5_4_p3-SLES12:
ESMA_arch.mk
fdp
Icarus
Old-style make system
For an older Icarus-based tag (one that doesn't have ifort.mk and mpi.mk), try updating these files to those from Icarus-3_2_p9-SLES12:
ESMA_arch.mk
fdp
New-style make system
For a new-style make system with files like ifort.mk and mpi.mk, you will need some changes due to the differences in the compiler and MPI stacks.
mpi.mk
For this file, the Intel MPI section should look like:
ifdef I_MPI_ROOT
   FC := mpiifort
   INC_MPI := $(I_MPI_ROOT)/include64
   LIB_MPI := -L$(I_MPI_ROOT)/lib64 -lmpifort -lmpi # Intel MPI
   LIB_MPI_OMP := -L$(I_MPI_ROOT)/lib64 -lmpifort -lmpi # Intel MPI
else
Jason
For Jason tags, a good reference to compare against is Jason-3_6_p1 when it comes to Config and other make issues.
f2py
One large challenge will be f2py-based files. Any f2py build that depends on $(LIB_SDF) will need updating due to GEOSpyD (the Python stack on SLES 12). The fix can be demonstrated with GFIO_.so. It was originally built as:
GFIO_.$(F2PYEXT): GFIO_py.F90 r4_install
	$(F2PY) -c -m GFIO_ $(M). $(M)$(INC_SDF) \
	        GFIO_py.F90 r4/libGMAO_gfio_r4.a $(LIB_SDF) $(LIB_SYS) \
	        only: gfioopen gfiocreate gfiodiminquire gfioinquire \
	              gfiogetvar gfiogetvart gfioputvar gfiogetbegdatetime \
	              gfiointerpxy gfiointerpnn gfiocoordnn gfioclose :
Notice how we pass $(LIB_SDF) to f2py? The fix for this is to define a new $(XLIBS) variable and add it:
XLIBS =
ifeq ($(wildcard /etc/os-release/.*),)
   XLIBS = -L/usr/lib64 -lssl -lcrypto
endif

GFIO_.$(F2PYEXT): GFIO_py.F90 r4_install
	$(F2PY) -c -m GFIO_ $(M). $(M)$(INC_SDF) \
	        GFIO_py.F90 r4/libGMAO_gfio_r4.a $(LIB_SDF) $(LIB_SYS) $(XLIBS) \
	        only: gfioopen gfiocreate gfiodiminquire gfioinquire \
	              gfiogetvar gfiogetvart gfioputvar gfiogetbegdatetime \
	              gfiointerpxy gfiointerpnn gfiocoordnn gfioclose :
Here we use the fact that the file /etc/os-release doesn't exist on SLES 11.
NOTE that this does not present itself at compile time, but rather as a run-time error like:
ImportError: ..../Linux/lib/Python/GFIO_.so: undefined symbol: SSLeay
A (possibly partial) list of f2py builds that need this fix:
src/GMAO_Shared/GMAO_ods/GNUmakefile
src/GMAO_Shared/Chem_Base/GNUmakefile
src/GMAO_Shared/GMAO_gfio/GNUmakefile
Hardcoded -openmp in make
Build time error
Another common theme is a hardcoded -openmp flag in a GNUmakefile. Intel deprecated the -openmp flag and, by Intel 18, removed it in favor of -qopenmp.
Examples can be seen in GEOSgcs_GridComp/GEOSgcm_GridComp/GEOSagcm_GridComp/GEOSphysics_GridComp/GEOSsurface_GridComp/Shared/Raster/src/GNUmakefile:
RASTER_OMPFLAG =
ifeq ($(ESMA_FC), ifort)
#   RASTER_OMPFLAG = -openmp
endif

OPENMP_FLAG = -openmp
As noted above, Intel changed this flag to -qopenmp, but a better fix is to use the generic OMPFLAG alias:
RASTER_OMPFLAG =
ifeq ($(ESMA_FC), ifort)
#   RASTER_OMPFLAG = $(OMPFLAG)
endif

OPENMP_FLAG = $(OMPFLAG)
Link-time error
If, when you build, the link step complains about things like kmp... not found, this is often due to either trying to link with -openmp or not linking with an OpenMP flag at all. An example is Applications/NCEP_Etc/NCEP_enkf/GNUmakefile_in in GEOSadas-5_17_0p5B:
USER_LDFLAGS = -openmp
which should become:
USER_LDFLAGS = $(OMPFLAG)
Issues with passing SetServices in ESMF_GridCompSetServices
You might occasionally get an error with a call to ESMF_GridCompSetServices. For example, building GEOSadas-5_17_0p5B, you will encounter:
geos_pertmod.F90(328): error #7061: The characteristics of dummy argument 1 of the associated actual procedure differ from the characteristics of dummy argument 1 of the dummy procedure.   [AGCMPERT_SETSERVICES]
      call ESMF_GridCompSetServices ( pertmod_gc, agcmPert_SetServices, rc=ier)
-----------------------------------------------^
compilation aborted for geos_pertmod.F90 (code 1)
The issue is that Intel 18+ is much stricter with Fortran and requires that the procedure being passed to ESMF_GridCompSetServices have the exact signature ESMF requires for a SetServices. This interface is:
interface
  subroutine userRoutine(gridcomp, rc)
    use ESMF_CompMod
    implicit none
    type(ESMF_GridComp)  :: gridcomp  ! must not be optional
    integer, intent(out) :: rc        ! must not be optional
  end subroutine
end interface
So, if we look at the SetServices in GEOS_AgcmPertGridComp.F90, we see:
! !IROUTINE: SetServices -- Sets ESMF services for this component

! !INTERFACE:

  subroutine SetServices ( GC, RC )

! !ARGUMENTS:

    type(ESMF_GridComp), intent(INOUT) :: GC  ! gridded component
    integer,             intent(  out) :: RC  ! return code
Notice that the GC is intent(INOUT), but ESMF's interface does not have this. The solution is to remove the intent:
! !IROUTINE: SetServices -- Sets ESMF services for this component

! !INTERFACE:

  subroutine SetServices ( GC, RC )

! !ARGUMENTS:

    type(ESMF_GridComp)                :: GC  ! gridded component
    integer,             intent(  out) :: RC  ! return code
This can occur with other Gridded Components where RC is optional or has the wrong intent.
Internal Compiler Error with ADA_Module.F90
Occasionally, when you try to build ADA_Module.F90 with Intel 18+, you will get an Internal Compiler Error (ICE) and the build will crash. This is a bug in Intel and can be fixed by changing the optimization level of this file to anything but -O2. However, doing that can (and most likely will) change answers if this file is used.
One possible workaround that has worked is to change your TMPDIR when building. Why? No idea. But it seems to help. If building by hand, do:
setenv TMPDIR /tmp
If using parallel_build.csh, use parallel_build.csh -tmpdir /tmp.
No rule to make target 'it'
This one is due to differences in the base GNU include files: at some point, larger C-style block comments were added to the standard include files, and they trip up the preprocessing step. There are various ways to solve this.
.P90 file
If the error happens with a .P90 file, then src/Config/ESMA_base.mk needs to be changed such that:
.P90.o:
	@sed -e "/\!.*'/s/'//g" $< | $(CPP) -C -ansi -DANSI_CPP $(FPPFLAGS) > $*___.s90
becomes:
.P90.o:
	@sed -e "/\!.*'/s/'//g" $< | $(CPP) -C -nostdinc -std=c99 -DANSI_CPP $(FPPFLAGS) > $*___.s90
In this case, the -nostdinc solved the 'it' issue. It also turns out newer cpp removed the -ansi flag, so we substitute -std=c99.
Other files
If this happens with another Fortran file, usually that means that directory is doing its own preprocessing. For example, GMAO_gems can encounter this because it has:
$(SRCS): %.f90: src/%.f90
	@echo Preprocessing $@ from $<
	-@$(FPP) -P -D_PARALLEL -C $< >$@

communication_primitives.f90: src/communication_primitives.mpi.f90
	@echo Preprocessing $@ from $<
	-@$(FPP) -P -D_PARALLEL -C $< >$@
This needs to change to:
$(SRCS): %.f90: src/%.f90
	@echo Preprocessing $@ from $<
	-@$(FPP) -P -D_PARALLEL -nostdinc -C $< >$@

communication_primitives.f90: src/communication_primitives.mpi.f90
	@echo Preprocessing $@ from $<
	-@$(FPP) -P -D_PARALLEL -nostdinc -C $< >$@
Double continuation characters
Some codes have doubled continuation characters which lead to:
odas/odas_decorrelation.F90(354): warning #5152: Invalid use of '&'. Not followed by comment or end of line
              chi2 = chi0, angle1 = odas_grid.angles(i, j, 1), angle2 = v.angle, scale1 = odas_grid.scales(i, j, 1), scale2 = & &
---------------------------------------------------------------------------------------------------------------------------------^
As the error says, remove the second '&'. Note that this simple error can trigger many additional errors, because the compiler misparses the file after this point.
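One rough way to hunt for these (a sketch; the pattern only catches a doubled '&' at the end of a line, like the case above):

# Find Fortran lines that end in two continuation characters
grep -rnE '&[[:space:]]*&[[:space:]]*$' odas/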
Undefined references to MPI, PMPI, etc
If you encounter messages like this:
ld: /discover/swdev/mathomp4/Baselibs/ESMA-Baselibs-4.0.11-SLES12/x86_64-unknown-linux-gnu/ifort_18.0.5.274-mpt_2.17/Linux/lib/libesmf.so: undefined reference to `MPI::Op::Free()'
when linking an executable, it usually means that the link step is missing $(LIB_MPI). Add it to the link flags (in this case *after* $(LIB_ESMF)).
ld: failed to convert GOTPCREL relocation; relink with --no-relax
On transition to SLES 12, the S2S V2 code encountered this error:
ld: failed to convert GOTPCREL relocation; relink with --no-relax
A possible solution (still under testing) is to do exactly what it says and add -Wl,--no-relax to USER_LDFLAGS.
Why so many Intel MPI (I_MPI) warnings?!
In running older tags with a newer Intel MPI, you might see a lot of startup messages like:
[0] MPI startup(): I_MPI_DAPL_UD_ACK_SEND_POOL_SIZE variable has been removed from the product, its value is ignored
[0] MPI startup(): I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE variable has been removed from the product, its value is ignored
[0] MPI startup(): I_MPI_DAPL_CHECK_MAX_RDMA_SIZE variable has been removed from the product, its value is ignored
[0] MPI startup(): I_MPI_DAPL_UD_RECV_BUFFER_NUM variable has been removed from the product, its value is ignored
[0] MPI startup(): I_MPI_DAPL_UD_SEND_BUFFER_NUM variable has been removed from the product, its value is ignored
[0] MPI startup(): I_MPI_DAPL_UD_REQ_EVD_SIZE variable has been removed from the product, its value is ignored
[0] MPI startup(): I_MPI_FABRICS_LIST environment variable is not supported.
This is because many older tags of the GEOS ADAS and GCM set Intel MPI environment variables that no longer apply on SLES 12. In most cases, the recommendation is to comment out or remove any I_MPI_* settings, as Intel MPI has removed many of them. Please try your jobs without any of these set.
Some exceptions are I_MPI_DEBUG, which is still valid, and the I_MPI_ADJUST family (see below). However, if things still do not work after commenting these out, NCCS Support might need to be contacted.
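To see which of these settings your experiment still carries, a simple grep over your run scripts works (gcm_run.j is just an example name; check whichever scripts your tag uses):

# List every Intel MPI environment variable set in a run script
grep -n 'setenv I_MPI_' gcm_run.j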
Model crashes in an AllReduce
One thing we've found with Intel MPI on SLES 12 is that sometimes the model will crash in an AllReduce call. If so, please try setting:
setenv I_MPI_ADJUST_ALLREDUCE 12
The default value of this seems to have issues.
My code seems to crash in Dynamics Startup
One user trying to run a version of the MERRA-2 AGCM on SLES 12 had a crash around the Dynamics layer:
 NOT using buffer I/O for file: fvcore_internal_rst
 Character Resource Parameter LAYOUT: fvcore_layout.rc
 Integer*4 Resource Parameter COLDSTART: 0
 Integer*4 Resource Parameter AGCM_IM: 180
 Integer*4 Resource Parameter RUN_DT: 450
 Number of Sponge Layers : 9
 Dynamics NSPLIT: 7
 Dynamics KSPLIT: 1
 Dynamics MSPLIT: 0
 Dynamics QSPLIT: 0

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 19837 RUNNING AT borgt005
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
The solution was to make sure you have an input.nml file in your scratch directory. Most GEOS models of recent-ish vintage will have a line like:
cd $SCRDIR
which means the script is now in the scratch directory. After this line, add:
touch input.nml
and that will create the file if not there.
My ensembles are crashing because of mpiexec_mpt
In many tags, if you use an ensemble, there are some hard-coded mpiexec_mpt references that need to be changed for Intel MPI. An example is in:
src/Applications/NCEP_Etc/NCEP_enkf/scripts/gmao/etc/AtmEnsConfig.csh
If you grep for mpiexec_mpt you'll see many references:
AtmEnsConfig.csh
47:setenv MPIRUN_AOD "mpiexec_mpt "
58:#setenv MPIRUN_ATMENKFAERO "mpiexec_mpt -np 384 $AENKFAEROX"
59:setenv MPIRUN_ATMENKFAERO "mpiexec_mpt -np 96 $AENKFAEROX"
67:setenv MPIRUN_ENSIAU "mpiexec_mpt -np $ENSIAU_NCPUS $IAUX"
78:setenv MPIRUN_ATMENKF "mpiexec_mpt -np $AENKF_NCPUS $AENKFX"
90:setenv MPIRUN_ENSGCM "mpiexec_mpt -np $ENSGCM_NCPUS GEOSgcm.x"  # mpiexec_mpt does not work in this context
100:setenv MPIRUN_ENSGCMADJ "mpiexec_mpt -np $ENSGCMADJ_NCPUS GEOSgcmPert.x"  # esma_mpirun does not work in this context
101:setenv ADMRUN_OPT_BEGIN "mpiexec_mpt -np $ENSGCMADJ_NCPUS GEOSgcmPert.x"
116:setenv MPIRUN_ENSANA "mpiexec_mpt -np $ENSGSI_NCPUS GSIsa.x"  # esma_mpirun does not work in this context
129:setenv AENSTAT_MPIRUN "mpiexec_mpt -np $AENSTAT_NCPUS mp_stats.x"
An easy solution is to convert all the mpiexec_mpt references to mpirun, which is the launcher for Intel MPI. (NOTE: if you ever decide to run this with MPT again, make sure you convert back to mpiexec_mpt. HPE MPT does have an mpirun command, but it is WEIRD. Contact Matt if you want to hear a screed about it.)
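If you want to do the conversion in bulk, a sed one-liner works (a sketch; it keeps a backup copy with an .orig suffix):

# Replace every mpiexec_mpt with mpirun in place, keeping a backup
sed -i.orig 's/mpiexec_mpt/mpirun/g' AtmEnsConfig.csh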
All my Perl scripts are failing!
As an overall comparison, a good tag to cvs diff against is GEOSadas-5_25_1_p2. You are looking for changes like those described below.
Shell
Many (if not all) of the Perl scripts in the ADAS will fail due to their dependence on an older version of Shell.pm that doesn't exist on SLES 12. Indeed, use Shell isn't even a valid call on SLES 12!
Now, for many (if not most) scripts, you don't even need a use Shell command. For example, if your file has:
use Shell qw(rm);
or:
use Shell qw(cat cut wc); # shell commands
and a search of the script finds no rm(), cat(), cut(), or wc() calls, remove the line! Often it is cruft from years of accrual.
But the main takeaway is that, in the end, any Perl script with use Shell will need to be changed, and use Shell must be removed. In one case (see below), you can switch to use CPAN::Shell, which is the Shell module on SLES 12, but your goal, again, is to remove use Shell.
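To find which scripts need attention, grep the source tree for the offending line (the src/ path is an assumption about where your scripts live):

# List every Perl script that still pulls in Shell.pm
grep -rl 'use Shell' src/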
rm (the most common)
This is by far the most common change needed. Change all:
rm();
to:
unlink();
cp
If you need cp(), do not use:
use Shell qw(cp);
use:
use File::Copy "cp";
mv
For mv(), use:
use File::Copy "mv";
rcp rsh...
Change:
use Shell qw(cat rcp rsh scp ssh); # make C-shell commands available
to:
use CPAN::Shell qw(cat rcp rsh scp ssh); # make C-shell commands available
Note this doesn't work for some of the Shell commands above because, say, rm(); was removed from Shell.pm.
timelocal.pl
Replace:
require('timelocal.pl');
with:
use Time::Local;
foreach loop issues
Another difference between the Perl on SLES 11 and the Perl on SLES 12 is the handling of some foreach loops in GEOS. Examples can be found in fvsetup in some tags. For example:
foreach $dir qw(ana diag daotovs etc obs prog rs run recycle fcst asens anasa) {
will fail with something like:
syntax error at ./fvsetup line 4557, near "$dir qw(ana diag daotovs etc obs prog rs run recycle fcst asens anasa)"
The solution is to surround the qw() list with its own set of parentheses:
foreach $dir (qw(ana diag daotovs etc obs prog rs run recycle fcst asens anasa)) {
Essentially, all your foreach loops should have parentheses around the list you are looping over.
A GEOSadas user found that these files:
idcheck.pl
fvsetup
gen_silo_arc.pl
monthly.yyyymm.pl.tmpl
had foreach issues.
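A grep like the one below (a rough pattern; adjust the file list to your tag) helps locate loops that still need the parentheses:

# Find foreach loops whose qw() list is not wrapped in parentheses
grep -n 'foreach .* qw(' fvsetup idcheck.pl gen_silo_arc.pl monthly.yyyymm.pl.tmpl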
ImportError: No module named cross_validation
You might encounter this on SLES 12:
from sklearn.cross_validation import cross_val_score
ImportError: No module named cross_validation
This is due to the newer Python stack on SLES 12 (its scikit-learn moved cross_val_score to sklearn.model_selection). A solution is to do:
# sklearn changed where cross_val_score exists
try:
    from sklearn.model_selection import cross_val_score
except ImportError:
    from sklearn.cross_validation import cross_val_score
Use of #PBS pragmas in run scripts
With SLES 12, NCCS has removed support for using #PBS pragmas in SLURM sbatch scripts. You will need to convert them to the appropriate #SBATCH pragmas; various PBS-to-SLURM conversion guides are available on the web.
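As a rough illustration (the job name, walltime, and output file are made-up values), a header conversion looks like this:

# Old PBS-style header (no longer honored on SLES 12):
#PBS -N myjob
#PBS -l walltime=01:00:00
#PBS -o myjob.out

# SLURM equivalent:
#SBATCH --job-name=myjob
#SBATCH --time=01:00:00
#SBATCH --output=myjob.out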