Princeton Ocean Model Benchmark Results
			on Various Computers
CONTENTS:
                          *** POM code ***
	(1) Contribution by Leo Oey (lyo@splash.princeton.edu)
			(NEC, T90, SUN server, SGI Origin, DEC) 

	(2) Contribution by Dave Schwab (schwab@zeus.glerl.noaa.gov)
			(MSDOS PC, LINUX, DEC, HP) 

	(3) Contribution by Tal Ezer (ezer@splash.princeton.edu)
	             	(parallel code on T3E, DELL PC Windows vs LINUX)
	             	(Pentium 2,3 & 4 vs SGI Origin)

	(4) Contribution by Steve Piacsek (piacsek@new-jersey.nrlssc.navy.mil)
			(T3E, SGI Origin, HPF vs MPI)

	(5) Contribution by Tommy Jensen (jensen@soest.hawaii.edu)
			(Cray SV1/J90se, SGI Origin)

                          *** POM2K code ***
	(6) Contribution by Lech Lobocki (llobocki@is.pw.edu.pl)
			(Visual Fortran, Win; pgf90, Linux, NEC)


-------------------------------------------------------------------

(1)

		      Test Dates: Jan-Mar/1998

		Table 1 Benchmarks w/a Small Problem

The Model:	Rectangular ocean basin w/closed walls on all 4 sides,
		of constant depth=750m, and w/initial T/S=10deg/35psu
Forcing:	None
Dimensions:	320 x 130 x 21
Grid Sizes:	2km x 2km x 37.5m
Time Steps:	240s internal & 6s external
Simulation Time: 0.125 day
RAM Required:	~ 170Mbytes (or 40Mwords)
Contact:	lyo@splash.princeton.edu or (L.Oey, AOS, Sayre Hall, 
		Princeton U., Princeton, NJ 08544)
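
The setup above implies a fixed amount of work per run. A minimal sketch of
that arithmetic follows; the variable names are illustrative only, and the
bytes-per-grid-point figure assumes 1 Mbyte = 1e6 bytes, which the source
does not state.

```python
# Values copied from the problem description above.
IM, JM, KB = 320, 130, 21      # grid dimensions
DTI = 240.0                    # internal (3-D) time step, seconds
DTE = 6.0                      # external (2-D) time step, seconds
SIM_DAYS = 0.125               # simulation time, days
RAM_MBYTES = 170.0             # approximate RAM quoted above

SECONDS_PER_DAY = 86400.0

grid_points = IM * JM * KB
# External steps per internal step (called ISPLIT in later sections).
isplit = round(DTI / DTE)
internal_steps = round(SIM_DAYS * SECONDS_PER_DAY / DTI)
external_steps = internal_steps * isplit
# Assumption: 1 Mbyte = 1e6 bytes; the source only gives "~170Mbytes".
bytes_per_point = RAM_MBYTES * 1.0e6 / grid_points

print("grid points         :", grid_points)
print("external per internal:", isplit)
print("internal steps      :", internal_steps)
print("external steps      :", external_steps)
print("bytes per grid point: %.0f" % bytes_per_point)
```

So each timing in Table 1 covers 45 internal steps (1800 external steps) on
about 0.87 million grid points.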

------------------------------------------------------------------------------
		   Princeton Ocean Model Benchmark RESULTS

Machine (Model# etc.) #CPUs   Auto.Parallel?  Execution Time	Conducted By
		      used    (Compiler Opt.)	(seconds)	(Inst./Contact)
--------------------  -----   -------------   --------------	--------------

NEC SX4			1	   No		  68		Bunmei Taguchi
		      (vector)					(JAMSTEC)
..............................................................................
CRAY T90		1	   No		  69		Leo Oey
     C90		1	   No		 133		(Princeton U.)
		   (both vectors)
..............................................................................
SGI Origin 2000		1	   No		 513		SGI
(32 CPU w/4GB RAM)	2	   Yes		 284		(Liz Orona)
			4	   Yes		 149
			8	   Yes		  93
		       12	   Yes		  77
		       16	   Yes		  78
		       32	   Yes		  91
..............................................................................
Sun E450 Server		1	   No		 543		SUN
(300 MHz)		3	   Yes		 224		(John Strong)

Sun HPC6000 Server	3	   Yes		 247
(250 MHz)		6	   Yes		 156
			8	   Yes		 158
		       10	   Yes		 153
		       16	   Yes		 139

			8	   Code-change*	 100/84(4MbCache)
		       16	   Code-change*	  68/52(4MbCache)

				   *interchange J & K loops


Sun HPC10000 Server	8	   Yes		 207
(250 MHz)	       16	   Yes		 175
..............................................................................
DEC 600MHz Alpha	1	   N/A		 477		Kyle Gilbertson
    533MHz		1	   N/A		 499		(Aspen Sys Inc)
    333MHz		1	   N/A		 893		Leo Oey
------------------------------------------------------------------------------
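
The SGI Origin 2000 times in Table 1 stop improving beyond 12 CPUs. One
common way to quantify this is the Karp-Flatt metric (the experimentally
determined serial fraction); it is not used in the original notes and is
offered here only as a sketch. It assumes the 1-CPU run (compiled without
auto-parallelization) is a fair baseline for the parallel runs.

```python
# SGI Origin 2000 rows of Table 1: CPUs used -> execution time (seconds).
sgi_times = {1: 513, 2: 284, 4: 149, 8: 93, 12: 77, 16: 78, 32: 91}


def karp_flatt(t1, tp, p):
    """Experimentally determined serial fraction for p > 1 processors."""
    speedup = t1 / tp
    return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)


t1 = sgi_times[1]
serial_fraction = {
    p: karp_flatt(t1, tp, p) for p, tp in sgi_times.items() if p > 1
}

for p, e in serial_fraction.items():
    print("%2d CPUs: speedup %.2f, serial fraction %.3f"
          % (p, t1 / sgi_times[p], e))
```

A serial fraction that grows with the CPU count, as it does here from 4 to
32 CPUs, suggests that parallel overhead rather than a fixed serial part
limits the scaling of this small problem.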


	Table 2 SGI/SUN Comparison of 3-GigaBytes Model

		SGI				SUN

Model-->    Origin 2000			     HPC 6000
----------------------------------------------------------
#CPUs
-----
 8		2:28				3:11
16		1:56				2:33
----------------------------------------------------------


	Table 3 More test comparisons from SGI w/bigger problems


        170 MByte         3 GigaByte         10 GigaByte
        320x130x21        500x250x101        1000x500x101
        ----------      --------------       ------------
#CPUs     O2000         O2000    O2000       O2000     <--- Origin 2000
          195MHz        195MHz   250MHz      250MHz
-----     -----         -----    -----       -----
  1        0:54         12:04     8:23       38:44
  2        0:34          6:23     4:55       21:20
  4        0:18          3:50     3:17       12:55
  8        0:13          2:28     2:22       8:25
 12        0:11          2:03     1:58       7:33
 16        0:11          1:56     1:49       7:01
 20        0:13          1:52     1:45       6:22
 24        0:13          1:47     1:47       6:03
 32        0:17          2:28     1:40       6:00
--------------------------------------------------------------------


(2)

Dave Schwab

Dr. David J. Schwab
NOAA Great Lakes Environmental Research Laboratory
2205 Commonwealth Blvd.
Ann Arbor, MI 48105
734-741-2120
734-741-2055 (FAX)

I can share our experiences using
the 20x20x24 (dti=5760,isplit=30,idays=10) pom.f example (not pom97.f)
that was available at GFDL.

POM benchmark: pom.f 
-----------------------------------------------
MSDOS PCs, FPS Compiler, /Ox option
-----------------------------------------------
486 DX2/50                      900.0 seconds
486 DX2/66                      456.0
AMD 486/133                     273.6
P60                             191.0
P100                            126.0
P133                             99.6
P166                             70.0
AMD K6/233                       59.0
PII/300                          27.0
------------------------------------------------
Multiprocessor Linux PC, Portland Group Compiler
------------------------------------------------
Dual PII/333, PGF90              23.6
Dual PII/333, PGF90 -Mconcur     13.8
------------------------------------------------
Workstations
------------------------------------------------
DEC Station 5000/200            286.0
HP 9000/710                     148.0
SGI Indy/150                    108.2
Sun Sparc 4m                    108.0
SGI Indy/200                     79.0
HP 9000/715-100                  61.9
HP 9000/735-100                  59.0
Sun Ultra 2/200                  52.0
DEC Alpha 3000/600               43.9
DEC Alpha 3000/800               37.8
HP K200-2 processor              34.3
HP K200-4 processor              26.5
HP C160                          22.6
------------------------------------------------
Supercomputers
------------------------------------------------
Cray Y-MP4E/232                  11.4
Cray C90                          5.9
------------------------------------------------
Although the results with the Dual PII look encouraging, we found that
with the Portland Group Compiler on the Intel architecture, the speedup
from multiprocessors apparently was better for the 2-d part of the code
than for the 3-d part.  With dti=960 and isplit=5 we obtained the
following times:
------------------------------------------------
HP C160                          85.6
Linux PC (Dual PII/333, PGF90)  130.7
------------------------------------------------
I have not tested pom97.f yet, but perhaps we should use this version
of the code as the basis for future benchmarks, with a provision for
separate 2-d and 3-d benchmarks.
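
A sketch of why the second test weights the 3-d part more heavily. It
assumes that dti is the internal step in seconds, that each internal step
contains isplit external (2-d) steps, and that idays=10 was kept for the
second test; none of this is stated explicitly above. The per-step cost
estimate for the HP C160 further assumes that start-up and output time are
negligible, so treat it as a rough illustration only.

```python
SECONDS_PER_DAY = 86400


def step_counts(dti, isplit, idays):
    """Return (internal 3-d steps, external 2-d steps) for one run."""
    internal = idays * SECONDS_PER_DAY // dti
    external = internal * isplit
    return internal, external


base = step_counts(dti=5760, isplit=30, idays=10)   # original example
short = step_counts(dti=960, isplit=5, idays=10)    # second test

print("dti=5760, isplit=30: internal=%d external=%d" % base)
print("dti=960,  isplit=5 : internal=%d external=%d" % short)

# HP C160 times (seconds) from the two tables above.
hp_base, hp_short = 22.6, 85.6

# The external step count is the same in both runs, so under the stated
# assumptions the extra time is due to the extra internal steps only.
per_internal = (hp_short - hp_base) / (short[0] - base[0])
per_external = (hp_base - per_internal * base[0]) / base[1]

print("HP C160: ~%.3f s per internal step, ~%.4f s per external step"
      % (per_internal, per_external))
```

Both settings give the same external step (192 s) and the same number of
2-d steps, but the second runs six times as many 3-d steps.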

***************************************************************

(3)


***** BENCHMARK calculations with parallel POMmpi on T3E *****
***** for the SEAMOUNT case. CPU excludes setup & 3D output
***** but includes a few 2D printouts.     T. Ezer, May 2000.

 (GFDL T3E: 128 PEs, 450 MHz each, with 32 Mw memory; 1 w = 64 bits)

Parameters: DTE=3.6s, ISPLIT=24, DAYS=0.1 (100 time-steps)

  IM   JM   KB   NPROCS    cpu(sec)

  64   64   32     4      104
  64   64   16     4       55
  64   64   16     8       41
  64   64   16    16       26
  64   64   16    32       21

 128   64   16     8       61
  64  128   16     8       78

 128  128   16     4      225
 128  128   16     8      127
 128  128   16    16       65
 128  128   16    32       44

 256  256   16     4      886
 256  256   16     8      480
 256  256   16    16      241
 256  256   16    32      133

 512  512   32    32      927
 512  512   16    16      917
 512  512   16    32      490
 512  512   16    64      245
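
Several rows of the T3E table keep the number of horizontal grid points per
processor fixed while the grid and NPROCS grow together, which allows a
weak-scaling check. The sketch below groups the rows that way; the grouping
key and the efficiency definition (time of the smallest run in a group
divided by the time of each larger run) are choices made here, not part of
the original benchmark.

```python
from collections import defaultdict

# (IM, JM, KB, NPROCS, cpu seconds) copied from the table above.
rows = [
    (64, 64, 32, 4, 104), (64, 64, 16, 4, 55), (64, 64, 16, 8, 41),
    (64, 64, 16, 16, 26), (64, 64, 16, 32, 21),
    (128, 64, 16, 8, 61), (64, 128, 16, 8, 78),
    (128, 128, 16, 4, 225), (128, 128, 16, 8, 127),
    (128, 128, 16, 16, 65), (128, 128, 16, 32, 44),
    (256, 256, 16, 4, 886), (256, 256, 16, 8, 480),
    (256, 256, 16, 16, 241), (256, 256, 16, 32, 133),
    (512, 512, 32, 32, 927), (512, 512, 16, 16, 917),
    (512, 512, 16, 32, 490), (512, 512, 16, 64, 245),
]

# Group runs with the same KB and the same IM*JM per processor.
groups = defaultdict(list)
for im, jm, kb, nprocs, cpu in rows:
    columns_per_pe = im * jm // nprocs
    groups[(kb, columns_per_pe)].append((nprocs, im, jm, cpu))

weak_efficiency = {}
for key, runs in groups.items():
    if len(runs) < 2:
        continue
    runs.sort()
    base_cpu = runs[0][3]
    weak_efficiency[key] = [
        (n, im, jm, base_cpu / cpu) for n, im, jm, cpu in runs
    ]

for (kb, cols), runs in sorted(weak_efficiency.items()):
    print("KB=%d, %d columns per PE:" % (kb, cols))
    for n, im, jm, eff in runs:
        print("   %3d PEs  %dx%d  efficiency %.2f" % (n, im, jm, eff))
```

For example, 128x128 on 4 PEs, 256x256 on 16 PEs and 512x512 on 64 PEs all
hold 4096 columns per PE, and their times (225, 241, 245 s) stay within
about 10% of each other.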


 Running Time of pom97.f - comparison of Windows PC, LINUX & SGI workstation.
 ----------------------------------------------
 
  All runs with DTE=6 sec, ISPLIT=30, DAYS=1 (Seamount topography)
 
                                                Grid Points
 
Computer        Compiler                (20x10x12)      (65x49x21)
--------        --------                ----------      -----------
                                        MEM=0.3MW       MEM=2.8MW
 
 T90            f90                     4s              56s
 
 WinNT     Digital Visual Fortran       39s (x10)       1336s (x23)
(Dell PentiumII 333MHz)  
 
 GNU/LINUX      pgf90                   52s (x13)       1969s (x35)
(Dell PentiumIII 600MHz)  
(-fast option cut 10%
 -Mconcur (2 processors) 60 & 50%)

 SGI            f77                     96s (x24)       3080s (x55)
(IRIX5.3 IP22) 

(IRIX6.2 IP28 - x3 faster than IRIX5.3) 


*******************************************************************************

  CPU TIMING ON DIFFERENT MACHINES for pom2k code (including netCDF output)
  ---- comparing Linux based DELL PC and SGI Origin 3800 ----
*******************************************************************************

COMPILER        MACHINE         (IMxJMxKB)  days  dte  isplit           CPU

         (Pent2; 333MHz)                   0.500                       655s

g77 -O3         DELL/Linux      (65x49x21)  0.025  6s   30               37s
                                           0.500                       262s

     (Pent3; 256KB; 600MHz)         **** CPU = 474s x days + 25s ****

----------------------------------------------------------------------------
g77 -O3         DELL/Linux      (65x49x21)  0.025  6s   30               13s
                                           0.500                        63s

          (Pent4: 1GB; 2.2GHz)      **** CPU = 105s x days + 10s ****

----------------------------------------------------------------------------
g77 -O3         DELL/Linux      (65x49x21)  0.025  6s   30                9s
                                           0.500                        56s

          (Pent4: 1GB; 2.8GHz)      **** CPU = 99s x days + 7s ****

                             (260x196x21)  0.025  6s   30               64s

----------------------------------------------------------------------------
----------------------------------------------------------------------------

f90 -O3         SGI/Origin3K    (65x49x21)  0.025  6s   30               36s
                                           0.500                        83s

(1GB; 600MHz; 1 processor)          **** CPU =  99s x days + 34s ****

----------------------------------------------------------------------------
f90 -Ofast -apo                 (65x49x21)  0.025  6s   30               54s
                                           0.500                        73s

(4 processors auto paral.)          **** CPU =  40s x days + 53s ****

----------------------------------------------------------------------------
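
The "CPU = a x days + b" lines above can be reproduced from the two timings
given for each machine. The sketch below fits a straight line through the
0.025-day and 0.5-day runs; reading the slope as the cost per simulated day
and the intercept as a fixed start-up and output cost is an interpretation
made here. The dictionary keys are shorthand labels, not names from the
original table.

```python
def fit_line(run_short, run_long):
    """Return (seconds per simulated day, fixed seconds) from two runs.

    Each run is a (days, cpu_seconds) pair.
    """
    (d1, t1), (d2, t2) = run_short, run_long
    slope = (t2 - t1) / (d2 - d1)
    intercept = t1 - slope * d1
    return slope, intercept


# (days, CPU seconds) pairs copied from the 65x49x21 rows above.
timings = {
    "Pent3 600MHz": ((0.025, 37), (0.5, 262)),
    "Pent4 2.2GHz": ((0.025, 13), (0.5, 63)),
    "Pent4 2.8GHz": ((0.025, 9), (0.5, 56)),
    "Origin3K 1 processor": ((0.025, 36), (0.5, 83)),
    "Origin3K 4 processors": ((0.025, 54), (0.5, 73)),
}

fits = {name: fit_line(*runs) for name, runs in timings.items()}

for name, (slope, intercept) in fits.items():
    print("%-22s CPU = %.0fs x days + %.0fs" % (name, slope, intercept))
```

The fitted values agree with the quoted formulas to within rounding of the
measured times. Note that the 4-processor Origin run has the smallest slope
but the largest fixed cost, so it only pays off for longer simulations.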

***************************************************************

(4) Contribution by:

Steve Piacsek, NRL.

Timings of the POM Model: We have applied the model to the whole 
Mediterranean Basin, on a spherical grid of 256x128 with 26 vertical levels. 
This resulted in an average horizontal resolution of 14 km. The code
was run on the SGI Origin2000 (O2K), as well as on the Cray T3E. 
The timings for 15 time steps, running on varying numbers of processors, are 
given below. Only the main time-marching loop, without any I/O, is
included in these figures.

                      Scaling of HPF_POM  and  MPI_POM

                    (Wall time for 15 time steps - in seconds)

         # proc         HPF/T3E         HPF/O2K         MPI/O2K

            4             58.6           398.0           273.0
            8             39.8           222.0           141.0
           16             25.4           104.0            76.0
           32             18.0            55.0            45.0
           64             10.9            30.0            23.0


***************************************************************

(5) Contribution by:

Tommy G. Jensen
International Pacific Research Center (IPRC)
School of Ocean & Earth Science & Technology (SOEST)
University of Hawaii       jensen@soest.hawaii.edu
2525 Correa Rd.            voice:     (808) 956-5468
Honolulu, HI 96822, USA    fax:       (808) 956-9425

POM, Kuroshio Model. Grid: 206x209x32, 30 time steps (5 hours model time). 
Diagnostic mode (T & S are fixed). 

platform                             time (s)   speed up    Mflops 
-------------------------------------------------------------------- 
Cray SV1-1, 12 CPUs                   39.3       6.09       1312.9 
Cray SV1-1,  8 CPUs                   46.7       5.12       1104.9 
Cray SV1-1,  4 CPUs                   73.6       3.25        701.0 
Cray SV1-1,  1 CPU                   239.3                   215.6 

Cray J90se, 12 CPUs                   64.6       7.75        798.7 
Cray J90se,  8 CPUs                   80.2       6.24        643.3 
Cray J90se,  4 CPUs                  136.6       3.67        347.0 
Cray J90se,  1 CPU                   500.8                   103.0 
  
Origin 2k/250, 16 CPUs               106.2       5.72        485.8 
Origin 2k/250, 15 CPUs               101.5       5.99        508.3 
Origin 2k/250, 14 CPUs               101.8       5.97        506.8 
Origin 2k/250, 13 CPUs               101.9       5.96        506.3 
Origin 2k/250, 12 CPUs               107.7       5.64        479.1 
Origin 2k/250, 11 CPUs               111.8       5.43        461.5 
Origin 2k/250, 10 CPUs               115.7       5.25        446.0 
Origin 2k/250,  9 CPUs               123.5       4.92        417.8 
Origin 2k/250,  8 CPUs               131.9       4.61        391.2 
Origin 2k/250,  7 CPUs               145.3       4.18        355.1 
Origin 2k/250,  6 CPUs               159.2       3.82        324.1 
Origin 2k/250,  5 CPUs               182.3       3.33        283.0 
Origin 2k/250,  4 CPUs               213.4       2.85        241.8 
Origin 2k/250,  3 CPUs               276.1       2.20        186.9 
Origin 2k/250,  2 CPUs               396.2       1.53        130.2 
Origin 2k/250,  1 CPU                607.6                    84.9 

Origin 2k/195, 14 CPUs               114.9       7.81        449.0 
Origin 2k/195, 13 CPUs               115.8       7.74        445.6 
Origin 2k/195, 12 CPUs               122.1       7.34        422.6 
Origin 2k/195, 11 CPUs               134.7       6.66        383.0 
Origin 2k/195, 10 CPUs               141.9       6.32        363.6 
Origin 2k/195,  8 CPUs               159.9       5.61        322.7 
Origin 2k/195,  7 CPUs               204.6       4.38        252.2 
Origin 2k/195,  6 CPUs               222.5       4.03        231.9 
Origin 2k/195,  5 CPUs               258.2       3.47        199.8 
Origin 2k/195,  4 CPUs               308.9       2.90        167.0 
Origin 2k/195,  3 CPUs               422.4       2.12        122.2 
Origin 2k/195,  2 CPUs               537.3       1.67         96.0 
Origin 2k/195,  1 CPU                896.8                    57.5 

These parallel runs were done using automatic parallelization/autotasking 
on the original pom97 code. 

The first two series of runs (Cray SV1-1 and J90se) were made by Matt Clark,
SGI/CRAY. They show actual run times. 
Runs on the SGI Origin 2k were done at the International Pacific Research
Center (IPRC) by Takuji Waseda and Tommy Jensen. 
  
Mflops were measured using hpm on the J90se. 
SGI rates are estimated based on those numbers. 

Times for parallel runs on the Origin 2k/250 (250 MHz) 
were obtained with a single user, while run times on the Origin 2k/195
(195 MHz) may be affected slightly by loads from other users. 
----------------------------------------------------------------------- 
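
The "speed up" column above is the 1-CPU time divided by the N-CPU time.
The sketch below recomputes it for a few rows and adds the parallel
efficiency (speed-up divided by N), which is not in the original table. The
last part assumes that the SGI Mflops were obtained by dividing the
operation count implied by the J90se hpm measurement by the run time; the
notes above say only that the SGI rates were "estimated", so this is a
guess that happens to match the tabulated values.

```python
# CPUs -> run time (seconds), a subset of the rows in the table above.
runs = {
    "Cray SV1-1": {1: 239.3, 4: 73.6, 8: 46.7, 12: 39.3},
    "Cray J90se": {1: 500.8, 4: 136.6, 8: 80.2, 12: 64.6},
    "Origin 2k/250": {1: 607.6, 4: 213.4, 8: 131.9, 12: 107.7, 16: 106.2},
    "Origin 2k/195": {1: 896.8, 4: 308.9, 8: 159.9, 12: 122.1, 14: 114.9},
}


def scaling(times):
    """Return {n_cpus: (speedup, efficiency)} relative to the 1-CPU run."""
    t1 = times[1]
    return {n: (t1 / t, t1 / (t * n)) for n, t in times.items() if n > 1}


result = {name: scaling(times) for name, times in runs.items()}

for name, table in result.items():
    for n, (speedup, efficiency) in sorted(table.items()):
        print("%-14s %2d CPUs  speed up %.2f  efficiency %.2f"
              % (name, n, speedup, efficiency))

# Assumed method for the SGI Mflops: same work as the J90se 1-CPU run.
work_mflop = 103.0 * 500.8          # J90se: Mflops x seconds
sgi_250_rate = work_mflop / runs["Origin 2k/250"][1]
sgi_195_rate = work_mflop / runs["Origin 2k/195"][1]
print("estimated 1-CPU Mflops: 250MHz %.1f, 195MHz %.1f"
      % (sgi_250_rate, sgi_195_rate))
```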

***************************************************************

(6) Contribution by:

Lech Lobocki
Warsaw University of Technology, Poland


 --------------------------------

POM2K, seamount problem,  DTE=6 sec, ISPLIT=30, DAYS=0.25

                                               (65x49x21) (251x61x21)
PII/233,    Compaq Visual Fortran 6.5, Win98         234 s    1140 s
PIII/2x733, Portland Group Fortran, Linux RH7.0:
            -fast                                     87 s     460 s
            -Mconcur, NCPUS=2                         54 s     305 s

NEC SX-4, 1CPU, f90 -C vsafe                                   113 s
NEC SX-4, 1CPU, f90 -C hopt                                    106 s


Comments:
The test was done with the code taken as is, with only the changes
necessary to run the model - I had to disable the netcdf
output as I did not have it installed on every machine/compiler.

The printing diagnostics interval option prtd1 was used with its
original value 0.0125, which put some burden on CPU time (although
the screen output was redirected to a file). Giorgio Amati suggested
that the test be repeated for this reason. When prtd1 was changed
to exceed the simulation time, the changes were small for PC boxes
(a few seconds), but large on the NEC (a factor of two), which was
not surprising.

With the reduced printing (initial timestep printouts still remain),
the timing is now:
POM2K, seamount problem,  DTE=6 sec, ISPLIT=30, DAYS=0.25

						(65x49x21)  (251x61x21)
NEC-SX4B, 1 CPU, f90 -C vsafe                                     75 s 
		     -C hopt                                      55 s

PIII/2x733, (133 MHz FSB, 133 MHz SDRAM) Portland Group Fortran, Linux
RH7.0: -fast (1 CPU)                           	 80 s            456 s
        -Mconcur ,  NCPUS=2			 50 s            297 s

PIII/2x733, running VMWare virtual machine with Win-98 SE under Linux
RH7.0,
Compaq Visual Fortran 6.6 (previously DEC VF 5.0 and earlier yet
Microsoft Powerstation 4.0), 'Release' settings (1 CPU):

						 85 s            450 s

There was also no attempt to optimize speed, either by choosing the grid
dimensions, geometry, etc., or by fitting the particular machine
architecture, e.g. cache memory.  It appears that PCs run POM at a rate of
minutes to hours per simulated day, depending on the grid size, and a single
CPU of the NEC-SX4B (comparable to the T90 in some other benchmarks with POM)
does it almost 10x faster.

Lech Lobocki

 --------------------------------
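
The closing remark in (6) about run time per simulated day can be checked
from the reduced-printing timings, which cover DAYS=0.25. The sketch below
does that conversion. It reads the NEC-SX4B times as belonging to the
251x61x21 column, which is how they are aligned in the table but is not
stated outright, and the dictionary labels are shorthand chosen here.

```python
DAYS = 0.25   # simulated time of each run, from the header in (6)

# (machine, grid) -> CPU seconds, reduced-printing runs in (6).
runs = {
    ("NEC-SX4B -C hopt", "251x61x21"): 55,
    ("PIII/2x733 -fast, 1 CPU", "65x49x21"): 80,
    ("PIII/2x733 -fast, 1 CPU", "251x61x21"): 456,
    ("PIII/2x733 -Mconcur, 2 CPUs", "251x61x21"): 297,
}

# Seconds of CPU time needed per simulated day.
per_day = {key: seconds / DAYS for key, seconds in runs.items()}

for (machine, grid), seconds in per_day.items():
    print("%-28s %-10s %6.0f s (%.1f min) per simulated day"
          % (machine, grid, seconds, seconds / 60.0))

ratio = (runs[("PIII/2x733 -fast, 1 CPU", "251x61x21")]
         / runs[("NEC-SX4B -C hopt", "251x61x21")])
print("NEC-SX4B vs single PIII CPU on 251x61x21: %.1fx faster" % ratio)
```

On these two grids the PC needs roughly 5 to 30 minutes per simulated day,
and the NEC is about 8 times faster on the larger grid.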