Princeton Ocean Model Benchmark Results on Various Computers

CONTENTS:

*** POM code ***
(1) Contribution by Leo Oey (lyo@splash.princeton.edu)
    (NEC, T90, SUN server, SGI Origin, DEC)
(2) Contribution by Dave Schwab (schwab@zeus.glerl.noaa.gov)
    (MSDOS PC, LINUX, DEC, HP)
(3) Contribution by Tal Ezer (ezer@splash.princeton.edu)
    (parallel code on T3E, DELL PC Windows vs LINUX)
    (Pentium 2, 3 & 4 vs SGI Origin)
(4) Contribution by Steve Piacsek (piacsek@new-jersey.nrlssc.navy.mil)
    (T3E, SGI Origin, HPF vs MPI)
(5) Contribution by Tommy Jensen (jensen@soest.hawaii.edu)
    (Cray SV1/J90, SGI Origin)

*** POM2K code ***
(6) Contribution by Lech Lobocki (llobocki@is.pw.edu.pl)
    (Visual Fortran, Win; pgf90, Linux, NEC)

-------------------------------------------------------------------
(1) Test Dates: Jan-Mar/1998

Table 1  Benchmarks w/a Small Problem

The Model:       Rectangular ocean basin w/closed walls on all 4 sides,
                 of constant depth=750m, and w/initial T/S=10deg/35psu
Forcing:         None
Dimensions:      320 x 130 x 21
Grid Sizes:      2km x 2km x 37.5m
Time Steps:      240s internal & 6s external
Simulation Time: 0.125 day
RAM Required:    ~ 170Mbytes (or 40Mwords)
Contact:         lyo@splash.princeton.edu or
                 (L.Oey, AOS, Sayre Hall, Princeton U., Princeton, NJ 08544)
------------------------------------------------------------------------------
Princeton Ocean Model Benchmark RESULTS

Machine              #CPUs  Auto.Parallel?   Execution Time   Conducted By
(Model# etc.)        used   (Compiler Opt.)  (seconds)        (Inst./Contact)
-------------------- -----  ---------------  ---------------- ---------------
NEC SX4              1      No               68               Bunmei Taguchi
(vector)                                                      (JAMSTEC)
..............................................................................
CRAY T90             1      No               69               Leo Oey
     C90             1      No               133              (Princeton U.)
(both vectors)
..............................................................................
SGI Origin 2000      1      No               513              SGI
(32 CPU w/4GB RAM)   2      Yes              284              (Liz Orona)
                     4      Yes              149
                     8      Yes              93
                     12     Yes              77
                     16     Yes              78
                     32     Yes              91
..............................................................................
Sun E450 Server      1      No               543              SUN
(300 MHz)            3      Yes              224              (John Strong)
Sun HPC6000 Server   3      Yes              247
(250 MHz)            6      Yes              156
                     8      Yes              158
                     10     Yes              153
                     16     Yes              139
                     8      Code-change*     100/84(4MbCache)
                     16     Code-change*     68/52(4MbCache)
                     *interchange J & K loops
Sun HPC10000 Server  8      Yes              207
(250 MHz)            16     Yes              175
..............................................................................
DEC 600MHz Alpha     1      N/A              477              Kyle Gilbertson
    533MHz           1      N/A              499              (Aspen Sys Inc)
    333MHz           1      N/A              893              Leo Oey
------------------------------------------------------------------------------

Table.2 SGI/SUN Comparison of 3-GigaBytes Model

             SGI            SUN
Model-->     Origin 2000    HPC 6000
----------------------------------------------------------
#CPUs
-----
  8          2:28           3:11
 16          1:56           2:33
----------------------------------------------------------

Table.3 More test comparisons from SGI w/bigger problems

        170MByte      3 GigaByte        10 GigaByte
        320x130x21    500x250x101       1000x500x101
        ----------    --------------    ------------
#CPUs   O2000         O2000    O2000    O2000         <--- Origin 2000
        195           195      250      250MHz
-----   -----         -----    -----    -----
  1     0:54          12:04    8:23     38:44
  2     0:34           6:23    4:55     21:20
  4     0:18           3:50    3:17     12:55
  8     0:13           2:28    2:22      8:25
 12     0:11           2:03    1:58      7:33
 16     0:11           1:56    1:49      7:01
 20     0:13           1:52    1:45      6:22
 24     0:13           1:47    1:47      6:03
 32     0:17           2:28    1:40      6:00
--------------------------------------------------------------------
(2) Dave Schwab

Dr. David J. Schwab
NOAA Great Lakes Environmental Research Laboratory
2205 Commonwealth Blvd.
Ann Arbor, MI 48105
734-741-2120
734-741-2055 (FAX)

I can share our experiences using the 20x20x24
(dti=5760,isplit=30,idays=10) pom.f example (not pom97.f) that was
available at gfdl.

POM benchmark: pom.f
-----------------------------------------------
MSDOS PC's, FPS Compiler, /Ox option
-----------------------------------------------
486 DX2/50                        900.0 seconds
486 DX2/66                        456.0
AMD 486/133                       273.6
P60                               191.0
P100                              126.0
P133                               99.6
P166                               70.0
AMD K6/233                         59.0
PII/300                            27.0
------------------------------------------------
Multiprocessor Linux PC, Portland Group Compiler
------------------------------------------------
Dual PII/333, PGF90                23.6
Dual PII/333, PGF90 -Mconcur       13.8
------------------------------------------------
Workstations
------------------------------------------------
DEC Station 5000/200              286.0
HP 9000/710                       148.0
SGI Indy/150                      108.2
Sun Sparc 4m                      108.0
SGI Indy/200                       79.0
HP 9000/715-100                    61.9
HP 9000/735-100                    59.0
Sun Ultra 2/200                    52.0
DEC Alpha 3000/600                 43.9
DEC Alpha 3000/800                 37.8
HP K200-2 processor                34.3
HP K200-4 processor                26.5
HP C160                            22.6
------------------------------------------------
Supercomputers
------------------------------------------------
Cray Y-MP4E/232                    11.4
Cray C90                            5.9
------------------------------------------------

Although the results with the Dual PII look encouraging, we found that
with the Portland Group Compiler on the Intel architecture, the speedup
from multiprocessors apparently was better for the 2-d part of the code
than for the 3-d part. With dti=960 and isplit=5 we obtained the
following times:

------------------------------------------------
HP C160                            85.6
Linux PC (Dual pII/333, PGF90)    130.7
------------------------------------------------

I have not tested pom97.f yet, but perhaps we should use this version
of the code as the basis for future benchmarks, with a provision for
separate 2-d and 3-d benchmarks.

***************************************************************
(3)
***** BENCHMARK calculations with parallel POMmpi on T3E    *****
***** for the SEAMOUNT case. CPU excludes setup & 3D output *****
***** but includes a few 2D printouts. T. Ezer, May 2000.   *****

(GFDL T3E: 128 PEs, 450 MHz each with 32 Mw mem.
(1w=64b)

Parameters: DTE=3.6s, ISPLIT=24, DAYS=0.1 (100 time-steps)

  IM    JM    KB   NPROCS   cpu(sec)
  64    64    32      4       104
  64    64    16      4        55
  64    64    16      8        41
  64    64    16     16        26
  64    64    16     32        21
 128    64    16      8        61
  64   128    16      8        78
 128   128    16      4       225
 128   128    16      8       127
 128   128    16     16        65
 128   128    16     32        44
 256   256    16      4       886
 256   256    16      8       480
 256   256    16     16       241
 256   256    16     32       133
 512   512    32     32       927
 512   512    16     16       917
 512   512    16     32       490
 512   512    16     64       245

Running Time of pom97.f - comparison of Windows PC, LINUX & SGI workstation.
----------------------------------------------
All runs with DTE=6 sec, ISPLIT=30, DAYS=1 (Seamount topography)

                                            Grid Points
Computer    Compiler                  (20x10x12)    (65x49x21)
--------    --------                  ----------    -----------
                                      MEM=0.3MW     MEM=2.8MW

T90         f90                       4s            56s

WinNT       Digital Visual Fortran    39s (x10)     1336s (x23)
(Dell PentiumII 333MHz)

GNU/LINUX   pgf90                     52s (x13)     1969s (x35)
(Dell PentiumIII 600MHz)              (-fast option cut 10%
                                       -Mconcur (2 process.) 60 & 50%)

SGI         f77                       96s (x24)     3080s (x55)
(IRIX5.3 IP22)                        (IRIX6.2 IP28 - x3 faster than IRIX5.3)

*******************************************************************************
CPU TIMING ON DIFFERENT MACHINES for pom2k code (including netCDF output)
---- comparing Linux based DELL PC and SGI Origin 3800 ----
*******************************************************************************
COMPILER    MACHINE       (IMxJMxKB)     days    dte   isplit   CPU

            (Pent2; 333MHz)              0.500                  655s

g77 -O3     DELL/Linux    (65x49x21)     0.025   6s    30       37s
                                         0.500                  262s
            (Pent3; 256KB; 600MHz)
            **** CPU = 474s x days + 25s ****
----------------------------------------------------------------------------
g77 -O3     DELL/Linux    (65x49x21)     0.025   6s    30       13s
                                         0.500                  63s
            (Pent4: 1GB; 2.2GHz)
            **** CPU = 105s x days + 10s ****
----------------------------------------------------------------------------
g77 -O3     DELL/Linux    (65x49x21)     0.025   6s    30       9s
                                         0.500                  56s
            (Pent4: 1GB; 2.8GHz)
            **** CPU = 99s x days + 7s ****
                          (260x196x21)   0.025   6s    30       64s
----------------------------------------------------------------------------
----------------------------------------------------------------------------
f90 -O3     SGI/Origin3K  (65x49x21)     0.025   6s    30       36s
                                         0.500                  83s
            (1GB; 600MHz; 1 processor)
            **** CPU = 99s x days + 34s ****
----------------------------------------------------------------------------
f90 -Ofast -apo           (65x49x21)     0.025   6s    30       54s
                                         0.500                  73s
            (4 processors auto paral.)
            **** CPU = 40s x days + 53s ****
----------------------------------------------------------------------------

***************************************************************
(4) Contribution by: Steve Piacsek, NRL.

Timings of the POM Model:

We have applied the model to the whole Mediterranean Basin, on a
spherical grid of 256x128, with 26 vertical levels. This resulted in an
average horizontal resolution of 14 km. The code was run on the SGI
Origin2000 (O2K), as well as on the Cray T3E. The timings for 15 time
steps, run on varying numbers of processors, are given below. Only the
main time marching loop, without any I/O, is included in these figures.

Scaling of HPF_POM and MPI_POM
(Wall time for 15 time steps - in seconds)

# proc    HPF/T3E    HPF/O2K    MPI/O2K
   4        58.6      398.0      273.0
   8        39.8      222.0      141.0
  16        25.4      104.0       76.0
  32        18.0       55.0       45.0
  64        10.9       30.0       23.0

***************************************************************
(5) Contribution by:

Tommy G. Jensen
International Pacific Research Center (IPRC)
School of Ocean & Earth Science & Technology (SOEST)
University of Hawaii          jensen@soest.hawaii.edu
2525 Correa Rd.               voice: (808) 956-5468
Honolulu, HI 96822, USA       fax:   (808) 956-9425

POM, Kuroshio Model. Grid: 206x209x32, 30 time steps (5 hours model time).
Diagnostic mode (T & S are fixed).

platform                    time (s)   speed up    Mflops
--------------------------------------------------------------------
Cray SV1-1,    12 CPUs        39.3       6.09      1312.9
Cray SV1-1,     8 CPUs        46.7       5.12      1104.9
Cray SV1-1,     4 CPUs        73.6       3.25       701.0
Cray SV1-1,     1 CPUs       239.3                  215.6

Cray J90se,    12 CPUs        64.6       7.75       798.7
Cray J90se,     8 CPUs        80.2       6.24       643.3
Cray J90se,     4 CPUs       136.6       3.67       347.0
Cray J90se,     1 CPUs       500.8                  103.0

Origin 2k/250, 16 CPUs       106.2       5.72       485.8
Origin 2k/250, 15 CPUs       101.5       5.99       508.3
Origin 2k/250, 14 CPUs       101.8       5.97       506.8
Origin 2k/250, 13 CPUs       101.9       5.96       506.3
Origin 2k/250, 12 CPUs       107.7       5.64       479.1
Origin 2k/250, 11 CPUs       111.8       5.43       461.5
Origin 2k/250, 10 CPUs       115.7       5.25       446.0
Origin 2k/250,  9 CPUs       123.5       4.92       417.8
Origin 2k/250,  8 CPUs       131.9       4.61       391.2
Origin 2k/250,  7 CPUs       145.3       4.18       355.1
Origin 2k/250,  6 CPUs       159.2       3.82       324.1
Origin 2k/250,  5 CPUs       182.3       3.33       283.0
Origin 2k/250,  4 CPUs       213.4       2.85       241.8
Origin 2k/250,  3 CPUs       276.1       2.20       186.9
Origin 2k/250,  2 CPUs       396.2       1.53       130.2
Origin 2k/250,  1 CPUs       607.6                   84.9

Origin 2k/195, 14 CPUs       114.9       7.81       449.0
Origin 2k/195, 13 CPUs       115.8       7.74       445.6
Origin 2k/195, 12 CPUs       122.1       7.34       422.6
Origin 2k/195, 11 CPUs       134.7       6.66       383.0
Origin 2k/195, 10 CPUs       141.9       6.32       363.6
Origin 2k/195,  8 CPUs       159.9       5.61       322.7
Origin 2k/195,  7 CPUs       204.6       4.38       252.2
Origin 2k/195,  6 CPUs       222.5       4.03       231.9
Origin 2k/195,  5 CPUs       258.2       3.47       199.8
Origin 2k/195,  4 CPUs       308.9       2.90       167.0
Origin 2k/195,  3 CPUs       422.4       2.12       122.2
Origin 2k/195,  2 CPUs       537.3       1.67        96.0
Origin 2k/195,  1 CPUs       896.8                   57.5

These parallel runs were done using automatic
parallelization/autotasking on the original pom97 code. The first two
series of runs were made by Matt Clark, SGI/CRAY. They show actual run
times. Runs on the SGI 2k were done at the International Pacific
Research Center (IPRC) by Takuji Waseda and Tommy Jensen. Mflops were
measured using hpm on the J90se. SGI rates are estimated based on those
numbers.
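
The speed up column in the table above is consistent with the ratio of the 1-CPU run time to the N-CPU run time (for example, 239.3/39.3 = 6.09 for the SV1-1 on 12 CPUs). The short Python sketch below is not part of the original benchmark; it reproduces that column for the SV1-1 rows and adds a parallel efficiency figure (speed up divided by CPU count). The function names are illustrative assumptions.

```python
def speedup(t_serial, t_parallel):
    """Ratio of the 1-CPU run time to the N-CPU run time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, ncpus):
    """Speed up divided by the number of CPUs (1.0 would be ideal scaling)."""
    return speedup(t_serial, t_parallel) / ncpus


def scaling_table(times):
    """Return (ncpus, speed up, efficiency) rows, rounded to 2 decimals.

    `times` maps CPU count -> run time in seconds and must include 1 CPU.
    """
    t1 = times[1]
    return [
        (n, round(speedup(t1, t), 2), round(efficiency(t1, t, n), 2))
        for n, t in sorted(times.items())
    ]


# Cray SV1-1 run times (seconds) copied from the table above.
sv1_times = {1: 239.3, 4: 73.6, 8: 46.7, 12: 39.3}

if __name__ == "__main__":
    print("CPUs  speed up  efficiency")
    for n, s, e in scaling_table(sv1_times):
        print(f"{n:4d}  {s:8.2f}  {e:10.2f}")
```

For the SV1-1 this gives an efficiency of roughly 0.81 on 4 CPUs and 0.51 on 12 CPUs, which is one way to read the flattening of the speed up column as CPUs are added.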

Times for parallel runs on the Origin 2k/250 (250 MHz) were obtained
with a single user, while run times on the Origin 2k/195 (195 MHz) may
be affected slightly by loads from other users.
-----------------------------------------------------------------------

***************************************************************
(6) Contribution by: Lech Lobocki
    Warsaw University of Technology, Poland
--------------------------------

POM2K, seamount problem, DTE=6 sec, ISPLIT=30, DAYS=0.25

                                                  (65x49x21)  (251x61x21)
PII/233, Compaq Visual Fortran 6.5, Win98           234 s       1140 s
PIII/2x733, Portland Group Fortran, Linux RH7.0:
    -fast                                            87 s        460 s
    -Mconcur, NCPUS=2                                54 s        305 s
NEC SX-4, 1CPU, f90 -C vsafe   113 s
NEC SX-4, 1CPU, f90 -C hopt    106 s

Comments: the test was done with the code taken as is, with as few
changes as were necessary to run the model - I had to disable the
netcdf output as I did not have it installed on every machine/compiler.
The printing diagnostics interval option prtd1 was used with its
original value 0.0125, which put some burden on CPU time (although the
screen output was redirected to a file). Giorgio Amati suggested that
the test be repeated for this reason. When prtd1 was changed to exceed
the simulation time, the changes were small for PC boxes (a few
seconds), but large on the NEC (a factor of two), which was not
surprising.
With the reduced printing (initial timestep printouts still remain),
the timing is now:

POM2K, seamount problem, DTE=6 sec, ISPLIT=30, DAYS=0.25

                                                  (65x49x21)  (251x61x21)
NEC-SX4B, 1 CPU, f90 -C vsafe   75 s
                     -C hopt    55 s
PIII/2x733, (133 MHz FSB, 133 MHz SDRAM)
Portland Group Fortran, Linux RH7.0:
    -fast (1 CPU)                                    80 s        456 s
    -Mconcur, NCPUS=2                                50 s        297 s
PIII/2x733, running VMWare virtual machine
with Win-98 SE under Linux RH7.0,
Compaq Visual Fortran 6.6 (previously DEC VF 5.0
and earlier yet Microsoft Powerstation 4.0),
'Release' settings (1 CPU):                          85 s        450 s

There was also no attempt to optimize speed by choosing the grid
dimensions, geometry, etc., nor to fit the particular machine
architecture, e.g. cache memory.

It appears that PCs run the POM at a rate of minutes to hours per
simulated day, depending on the grid size, and a single CPU of the
NEC-SX4B (comparable to the T90 in some other benchmarks with POM) does
it almost 10x faster.

Lech Lobocki
--------------------------------
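
The "**** CPU = a x days + b ****" lines in section (3) appear to be straight-line fits through the two timed runs of each machine (DAYS=0.025 and DAYS=0.5), with the slope read as cost per simulated day and the intercept as fixed start-up cost. The source does not say how those lines were obtained, so the Python sketch below is only an illustration under that assumption, and its function names are hypothetical. It recovers the Pentium III coefficients from the two timings listed for that machine.

```python
def fit_cpu_model(days_a, cpu_a, days_b, cpu_b):
    """Fit CPU = slope * days + intercept through two timed runs.

    Assumption: CPU time grows linearly with simulated days, so two runs
    of different length determine the line exactly.
    """
    slope = (cpu_b - cpu_a) / (days_b - days_a)
    intercept = cpu_a - slope * days_a
    return slope, intercept


def predict_cpu(slope, intercept, days):
    """Estimated CPU seconds for a run of `days` simulated days."""
    return slope * days + intercept


# DELL/Linux Pentium III 600MHz, g77 -O3, 65x49x21 grid (values from section 3):
# 0.025 simulated days took 37 s, 0.5 simulated days took 262 s.
slope, intercept = fit_cpu_model(0.025, 37.0, 0.500, 262.0)

if __name__ == "__main__":
    print(f"CPU = {slope:.0f}s x days + {intercept:.0f}s")
    print(f"1 simulated day: about {predict_cpu(slope, intercept, 1.0):.0f}s")
```

The fitted slope and intercept round to the 474s and 25s quoted for that machine, and extrapolating the line to one simulated day gives roughly 499 s, in line with the "minutes per simulated day" rate noted above for PCs on small grids.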