These are notes on the specifics of setting up and running ROMS, mainly on the UCAR computer yellowstone, but also on the NERSC computer hopper and on waddle.
$ in the lines below means the Linux prompt
I organize things in 5 main directories, all assumed to be at the same level (I put them all in a directory .../roms/):
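As a rough sketch of the layout (only the four directories named in the sections below are listed; treat this as an illustration rather than the full set):
.../roms/ROMS/       source code (section 1)
.../roms/makefiles/  compile directories, e.g. makefiles/yellowstone/ptx_01/ (section 2)
.../roms/forcing/    forcing files (section 3)
.../roms/runs/       run directories, e.g. runs/T2005/ (section 4)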
1. Getting the source code, ROMS/, and editing the compiler file
First execute (from any of the machines):
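As a hedged example (the exact command depends on your ROMS account; the username and target directory here are placeholders), getting the source from the standard myroms.org svn server looks like:
$ svn checkout --username youruser https://www.myroms.org/svn/src/trunk ROMS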
NOTE: a helpful command for looking for things is the grep command:
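For example, to search the whole source tree for a variable or CPP flag (the search string is just an illustration):
$ grep -ri ntilei ROMS/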
Then you may have to make a few edits to the compiler file (after saving a copy with the suffix _ORIG; see the example after this list):
For yellowstone use ROMS/Compilers/Linux-ifort.mk and no edits are required.
For hopper you edit ROMS/Compilers/Linux-ftn.mk so that:
For waddle you edit ROMS/Compilers/Linux-pgi.mk so that:
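Whichever machine you are on, save the copy first, e.g. on waddle:
$ cp ROMS/Compilers/Linux-pgi.mk ROMS/Compilers/Linux-pgi.mk_ORIG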
2. Compiling the code, with files in makefiles/
Here you need three files to define a run. In this case I put them all in the directory makefiles/yellowstone/ptx_01/ (or other directories for different cppdefs flags, but note that the three files always have the same names):
Here it is for yellowstone (BOLD = new version):
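As a hedged sketch only, the edits usually go in the user-definable section near the top of the ROMS makefile; the application name and paths below are illustrative assumptions based on the ptx_01 example, not the actual file:
ROMS_APPLICATION ?= PTX_01
MY_HEADER_DIR ?= /glade/p/cgd/oce/people/paulmac/roms/makefiles/yellowstone/ptx_01
BINDIR ?= /glade/p/cgd/oce/people/paulmac/roms/makefiles/yellowstone/ptx_01
USE_MPI ?= on
USE_MPIF90 ?= on
FORT ?= ifort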
Then to compile you (have to!) go to the directory ROMS and execute:
$ make -f /glade/p/cgd/oce/people/paulmac/roms/makefiles/yellowstone/ptx_01/makefile clean
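After the clean, you run the same makefile again without the clean target to do the actual build (a hedged example; the -j flag just parallelizes the compile and is an assumption here):
$ make -f /glade/p/cgd/oce/people/paulmac/roms/makefiles/yellowstone/ptx_01/makefile -j 4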
This takes 5-10 minutes, and results in the executable: roms/makefiles/yellowstone/ptx_01/oceanM
This executable can be used for many different runs, and this is the reason it is separated out into its own directory.
For hopper the instructions are similar, except in the makefile you use FORT ?= ftn
3. The forcing files, in forcing/
Sarah made these using rtools, or I make my own for the SciDAC runs forced by parts of CESM. I move them to yellowstone by going to the directory forcing/ and then doing something like:
$ scp -r email@example.com:/pmraid2/sarahgid/runs/ptx_highT40_2_2004 .
which will prompt for my skua password, and then move the whole pile to roms_forcing/ptx_highT40_2_2004. This takes about a half hour per year.
4. Doing a run, in runs/
Running jobs on yellowstone (UCAR Supercomputer):
Now you operate in the directory runs/T2005/, for example, where you need to have four things:
These four are created using a python script I wrote (on my mac) and some templates. The script creates the four files and the directory they sit in. Email me if you want it.
To start a run you just execute the command (in T2005/):
$ bsub < my_script
and to restart a run execute:
$ bsub < re_script
Useful commands on yellowstone:
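Yellowstone uses LSF (hence bsub), so the standard LSF commands are the ones to know; two examples (the job number is a placeholder):
$ bjobs           lists your queued and running jobs
$ bkill 123456    kills a job by its job ID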
Doing a run on hopper:
To start a run you just execute the command:
$ qsub my_script
A typical "my_script" is a text file with lines like:
This is running on 576 cores (!), and will use 12 hours of walltime in the premium queue.
The different queues have different priorities and allowable walltimes:
For reference, the 40-level ptx run with 5 dyes takes about 1.5 days on hopper with 576 cores, and creates about 1.2 TB of history files (hourly saves).
To restart a run that stopped in the middle all you have to do is change NRREC from 0 to -1, and change the initialization file to OUT/ocean_rst.nc, both changes in the .in file. This assumes you have been saving restart files.
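In the .in file the two changed lines end up looking something like this (ININAME is the standard ROMS keyword for the initialization file; your file names may differ):
NRREC == -1
ININAME == OUT/ocean_rst.nc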
Useful commands on hopper:
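Hopper uses PBS (hence qsub), so the standard PBS commands apply; two examples (the job number is a placeholder):
$ qstat -u $USER    lists your queued and running jobs
$ qdel 123456       kills a job by its job ID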
Doing a run on waddle:
Here is a command to run using MPI with 48 cores. Clearly you have to do this from the directory where oceanM is. The implied directory structure is different from my hopper notes above, so beware.
$ mpirun -np 48 -machinefile hf oceanM ptx2.in > log1 &
...and you are off and running (and returned to the command line because of the &). Lots of useful screen output from ROMS will end up in log1.
NOTE: np is the number of cores, and must match NtileI*NtileJ from your .in file.
NOTE: hf is a text file with a list of waddle nodes David Darr has said you can use. My half of waddle has 8 cores per node and 12 nodes, numbered 0 to 11. For example, to use nodes 6 through 11 the file hf would have 6 lines:
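Assuming the node names follow the n006-style pattern used below, hf would contain:
n006
n007
n008
n009
n010
n011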
Killing a multi-core job on waddle (from David Darr):
On waddle do a "ps aux | grep mpirun" and find the PID number for the mpirun job with your username and then kill it with "kill -9 PID". This almost always works. However, for reasons I don't fully understand it doesn't work a small percentage of the time... in which case I just do the brute force approach.
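Spelled out as commands (the PID is a placeholder):
$ ps aux | grep mpirun
$ kill -9 12345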
To see what is happening on a specific node you can do "ssh n006" (e.g. to get to node 6) and then use top. Type "exit" to return to your main shell.
NOTE: you also need some special lines in the .cshrc in your home directory to get MPI to work. To get there type cd ~, and then use ls -la to see hidden files. My .cshrc has:
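As a hedged illustration only (the paths depend on where MPI is installed on waddle and are assumptions here, not the real ones):
setenv PATH /usr/local/mpich/bin:${PATH}
setenv LD_LIBRARY_PATH /usr/local/mpich/lib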
And that's it!