High Performance Linpack on Wand Quad + Ubuntu 12.04


Postby Wandboarder » Tue Nov 12, 2013 5:55 pm

I hope someone can help me with this :)

I am in the process of building a Wandboard array, but before the boards arrive I am trying to benchmark a single board with High Performance Linpack (HPL). I am running Ubuntu 12.04 LTS and have installed Open MPI v1.5 as well as libatlas-base-dev. I haven't worried about optimizing the BLAS libraries yet; I'll build a tuned ATLAS once I know everything works.

The setup I largely followed can be found here: Mini HowTo: Linpack HPL on Raspberry Pi (look towards the bottom).

So first things first: I installed the required packages, downloaded HPL-2.1, and copied the Make.UNKNOWN template.
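
In case it helps anyone following along, a rough sketch of those setup steps is below. The package names are the usual Ubuntu 12.04 ones and the download URL is the standard netlib location, but treat the exact names and paths as assumptions to verify on your own system:

Code: Select all
# compilers, Open MPI and the packaged ATLAS BLAS
sudo apt-get install build-essential gfortran openmpi-bin libopenmpi-dev libatlas-base-dev
# fetch and unpack HPL 2.1
wget http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
tar xzf hpl-2.1.tar.gz
cd hpl-2.1
# the make_generic script produces the Make.UNKNOWN template in setup/
(cd setup && sh make_generic)
cp setup/Make.UNKNOWN Make.wandboard

I then edited my copy (named Make.wandboard, to match the ARCH setting below) as follows: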

Code: Select all

#  -- High Performance Computing Linpack Benchmark (HPL)               
#     HPL - 2.1 - October 26, 2012                         
#     Antoine P. Petitet                                               
#     University of Tennessee, Knoxville                               
#     Innovative Computing Laboratory                                 
#     (C) Copyright 2000-2008 All Rights Reserved                       
#                                                                       
#  -- Copyright notice and Licensing terms:                             
#                                                                       
#  Redistribution  and  use in  source and binary forms, with or without
#  modification, are  permitted provided  that the following  conditions
#  are met:                                                             
#                                                                       
#  1. Redistributions  of  source  code  must retain the above copyright
#  notice, this list of conditions and the following disclaimer.       
#                                                                       
#  2. Redistributions in binary form must reproduce  the above copyright
#  notice, this list of conditions,  and the following disclaimer in the
#  documentation and/or other materials provided with the distribution.
#                                                                       
#  3. All  advertising  materials  mentioning  features  or  use of this
#  software must display the following acknowledgement:                 
#  This  product  includes  software  developed  at  the  University  of
#  Tennessee, Knoxville, Innovative Computing Laboratory.             
#                                                                       
#  4. The name of the  University,  the name of the  Laboratory,  or the
#  names  of  its  contributors  may  not  be used to endorse or promote
#  products  derived   from   this  software  without  specific  written
#  permission.                                                         
#                                                                       
#  -- Disclaimer:                                                       
#                                                                       
#  THIS  SOFTWARE  IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
#  ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,  INCLUDING,  BUT NOT
#  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
#  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY
#  OR  CONTRIBUTORS  BE  LIABLE FOR ANY  DIRECT,  INDIRECT,  INCIDENTAL,
#  SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL DAMAGES  (INCLUDING,  BUT NOT
#  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
#  DATA OR PROFITS; OR BUSINESS INTERRUPTION)  HOWEVER CAUSED AND ON ANY
#  THEORY OF LIABILITY, WHETHER IN CONTRACT,  STRICT LIABILITY,  OR TORT
#  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
#  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# ######################################################################

# ----------------------------------------------------------------------
# - shell --------------------------------------------------------------
# ----------------------------------------------------------------------
#
SHELL        = /bin/sh
#
CD           = cd
CP           = cp
LN_S         = ln -s
MKDIR        = mkdir
RM           = /bin/rm -f
TOUCH        = touch
#
# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
ARCH         = wandboard
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
TOPdir       = $(HOME)/HDD/hpl-2.1
INCdir       = $(TOPdir)/include
BINdir       = $(TOPdir)/bin/$(ARCH)
LIBdir       = $(TOPdir)/lib/$(ARCH)
#
HPLlib       = $(LIBdir)/libhpl.a
#
# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the  C  compiler where to find the Message Passing library
# header files,  MPlib  is defined  to be the name of  the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
MPdir        =
MPinc        =
MPlib        =
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the  C  compiler where to find the Linear Algebra  library
# header files,  LAlib  is defined  to be the name of  the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAdir        = /usr/lib/atlas-base/
LAinc        =
LAlib        = $(LAdir)/libf77blas.a $(LAdir)/libatlas.a

#
# ----------------------------------------------------------------------
# - F77 / C interface --------------------------------------------------
# ----------------------------------------------------------------------
# You can skip this section  if and only if  you are not planning to use
# a  BLAS  library featuring a Fortran 77 interface.  Otherwise,  it  is
# necessary  to  fill out the  F2CDEFS  variable  with  the  appropriate
# options.  **One and only one**  option should be chosen in **each** of
# the 3 following categories:
#
# 1) name space (How C calls a Fortran 77 routine)
#
# -DAdd_              : all lower case and a suffixed underscore  (Suns,
#                       Intel, ...),                           [default]
# -DNoChange          : all lower case (IBM RS6000),
# -DUpCase            : all upper case (Cray),
# -DAdd__             : the FORTRAN compiler in use is f2c.
#
# 2) C and Fortran 77 integer mapping
#
# -DF77_INTEGER=int   : Fortran 77 INTEGER is a C int,         [default]
# -DF77_INTEGER=long  : Fortran 77 INTEGER is a C long,
# -DF77_INTEGER=short : Fortran 77 INTEGER is a C short.
#
# 3) Fortran 77 string handling
#
# -DStringSunStyle    : The string address is passed at the string loca-
#                       tion on the stack, and the string length is then
#                       passed as  an  F77_INTEGER  after  all  explicit
#                       stack arguments,                       [default]
# -DStringStructPtr   : The address  of  a  structure  is  passed  by  a
#                       Fortran 77  string,  and the structure is of the
#                       form: struct {char *cp; F77_INTEGER len;},
# -DStringStructVal   : A structure is passed by value for each  Fortran
#                       77 string,  and  the  structure is  of the form:
#                       struct {char *cp; F77_INTEGER len;},
# -DStringCrayStyle   : Special option for  Cray  machines,  which  uses
#                       Cray  fcd  (fortran  character  descriptor)  for
#                       interoperation.
#
F2CDEFS      = -DAdd_ -DF77_INTEGER=int -DStringSunStyle
#
# ----------------------------------------------------------------------
# - HPL includes / libraries / specifics -------------------------------
# ----------------------------------------------------------------------
#
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS     = $(HPLlib) $(LAlib) $(MPlib)
#
# - Compile time options -----------------------------------------------
#
# -DHPL_COPY_L           force the copy of the panel L before bcast;
# -DHPL_CALL_CBLAS       call the cblas interface;
# -DHPL_CALL_VSIPL       call the vsip  library;
# -DHPL_DETAILED_TIMING  enable detailed timers;
#
# By default HPL will:
#    *) not copy L before broadcast,
#    *) call the BLAS Fortran 77 interface,
#    *) not display detailed timing information.
#
HPL_OPTS     =
#
# ----------------------------------------------------------------------
#
HPL_DEFS     = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
#
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
CC           = mpicc
CCNOOPT      = $(HPL_DEFS)
CCFLAGS      = $(HPL_DEFS) -mfpu=neon -mfloat-abi=softfp -funsafe-math-optimizations -ffast-math -O3
#
LINKER       = mpif77
LINKFLAGS    =
#
ARCHIVER     = ar
ARFLAGS      = r
RANLIB       = echo
#
# ----------------------------------------------------------------------


Once the build is finished, I use a very standard set of inputs in the HPL.dat file (which sits next to xhpl in bin/wandboard); this is just to run a quick test on one core. Here is the file:

Code: Select all
HPLinpack benchmark input file
HPL.out      output file name (if any)
8            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
500         Ns
1            # of NBs
128           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0                               Number of additional problem sizes for PTRANS
1200 10000 30000                values of N
0                               number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64        values of NB


Note that the product of P and Q determines the number of MPI processes HPL expects, i.e. 1 x 1 = 1 here.
I then test this by running the command

Code: Select all
./xhpl


and I get the following results

Code: Select all
================================================================================
HPLinpack 2.1  --  High-Performance Linpack benchmark  --   October 26, 2012
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :     500
NB     :     128
PMAP   : Row-major process mapping
P      :       1
Q      :       1
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4         500   128     1     1               0.20              4.165e-01
HPL_pdgesv() start time Tue Nov 12 17:40:12 2013

HPL_pdgesv() end time   Tue Nov 12 17:40:12 2013

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0061553 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================


So for one core, with no optimisation of the BLAS, I am seeing 416 MFLOPS. I assume that for the Cortex-A9 I should be seeing something closer to 1 GFLOPS?
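
As a rough sanity check on that expectation (back-of-the-envelope only; the figure of about one double-precision FLOP per cycle from the A9's VFP unit and the round 1 GHz clock are assumptions on my part):

Code: Select all
peak per core ~ clock x FLOPs/cycle = 1.0e9 x 1 = 1 GFLOPS
achieved      = 416 MFLOPS, i.e. roughly 40% of that rough peak

A few hundred MFLOPS against such a peak seems plausible for the stock, untuned ATLAS packages.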

Ok, so back to the problem. I want to run this on all 4 cores. I do this by setting Ps and Qs to 2 and 2 respectively, i.e. 2 x 2 = 4.
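
The relevant lines in HPL.dat then read:

Code: Select all
2            Ps
2            Qs
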
Then I need to run this in parallel using MPI. The final command is

Code: Select all
mpirun -np 4 ./xhpl


This is the output

Code: Select all
root@wandboard:~/HDD/hpl-2.1/bin/wandboard# mpirun -np 4 ./xhpl
HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
>>> Need at least 4 processes for these tests <<<

HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
>>> Illegal input in file HPL.dat. Exiting ... <<<

HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
>>> Need at least 4 processes for these tests <<<

HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
>>> Illegal input in file HPL.dat. Exiting ... <<<

HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
>>> Need at least 4 processes for these tests <<<

HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
>>> Illegal input in file HPL.dat. Exiting ... <<<

HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
>>> Need at least 4 processes for these tests <<<

HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
>>> Illegal input in file HPL.dat. Exiting ... <<<

--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------



So basically HPL doesn't see 4 cooperating processes. Note that all four errors come from "process # 0", as if four independent single-process jobs had been started. I have tried running the same mpirun command with a machinefile in which I entered the Wandboard's IP/hostname four times (one entry per core). I don't know whether anything beyond the hostnames has to be set up, so that may be an issue, but since I am running on 4 cores of a single CPU I should be able to leave the machinefile out.
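
For reference, the Open MPI hostfile way of saying "four slots on this one host" is a single line with a slot count. This is only a sketch, with "wandboard" standing in for whatever the board's hostname resolves to:

Code: Select all
# create a one-line machinefile: one host, four MPI slots
echo "wandboard slots=4" > machinefile
mpirun -np 4 -machinefile machinefile ./xhpl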

I have read quite a bit, and I think this may have to do with how the MPI libraries were linked when building HPL.
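
One quick way to test that theory would be to check whether mpirun and the MPI library the binary is linked against actually agree. A minimal sketch, assuming mpicc and mpirun come from the same Open MPI installation on the PATH:

Code: Select all
# tiny MPI sanity check: do 4 ranks actually see each other?
cat > rank_check.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
mpicc rank_check.c -o rank_check
mpirun -np 4 ./rank_check
# healthy:    "rank 0 of 4" ... "rank 3 of 4" (in some order)
# mismatched: "rank 0 of 1" printed four times, which is exactly the
#             pattern behind the "Need at least 4 processes" errors above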

Any help would be awesome!

Thanks
Wandboarder
 

Re: High Performance Linpack on Wand Quad + Ubuntu 12.04

Postby Tapani » Wed Nov 13, 2013 6:25 am

Never tried LINPACK on the WandBoard, so cannot help you with benchmarking all cores.
Regarding performance, my first "huh?" is the use of Ubuntu 12.04 for HPC.

There are two major binary formats for ARM: soft float (floating-point arguments passed in integer registers, or in the worst case fully emulated) and hard float (arguments passed in FPU registers). Our Ubuntu 12.04 unfortunately uses softfp. There is an Ubuntu 13.04 image floating around for the Wandboard that uses hard float (but lacks video acceleration).
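
The difference shows up directly in the compiler flags. A small illustration with generic GCC invocations (standard GCC options, not taken from any particular Wandboard image; note the softfp line matches the CCFLAGS in the makefile above):

Code: Select all
# softfp ABI: NEON/VFP instructions are generated, but floating-point
# arguments cross function boundaries in integer registers
gcc -mfpu=neon -mfloat-abi=softfp ...

# hard-float ABI: arguments stay in VFP registers across calls
# (requires a hard-float userland, e.g. the 13.04 image)
gcc -mfpu=neon -mfloat-abi=hard ...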

I don't know if that addresses any of your performance issues, though.
Tapani

Re: High Performance Linpack on Wand Quad + Ubuntu 12.04

Postby Wandboarder » Wed Nov 13, 2013 11:44 am

Thanks for your reply Tapani!

Yes, I am aware that Ubuntu 12.04 uses softfp and that the newer 13.10 is a hardfp system; unfortunately, when I started work on the board the latter was not yet available.
In any case, that shouldn't cause any issues with running on multiple cores; a soft/hard float mismatch would have shown itself during the configure and make steps.

I managed to sort out the problem by changing to MPICH2 and linking the correct libraries in the HPL makefile. It now runs on all 4 cores and I am getting some decent-ish results.
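
Roughly, the makefile changes were along these lines (from memory, so double-check the wrapper names and paths on your own system, e.g. with "mpicc -show"):

Code: Select all
# MPI section of Make.wandboard: left empty, since the compiler
# wrapper supplies the MPI include and link flags itself
MPdir        =
MPinc        =
MPlib        =
# point the compiler and linker at the MPICH2 wrappers instead of Open MPI's
CC           = mpicc.mpich2
LINKER       = mpif77.mpich2
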
I will change to 13.10 soon though as I am curious about the hardfp / softfp comparisons.

Thanks
Wandboarder
 

Re: High Performance Linpack on Wand Quad + Ubuntu 12.04

Postby spapadem » Tue May 06, 2014 8:37 am

Hello,
I'm probably a bit late to the party, but I'd like to ask Wandboarder how he worked around that last problem by switching to MPICH2. I tried installing mpich2, but the static libraries, e.g. libmpich.a, do not appear; I only see libmpich.so files and so on.
Any help on this matter would be greatly appreciated!

Thanks
spapadem
 

Re: High Performance Linpack on Wand Quad + Ubuntu 12.04

Postby Xandra » Tue Oct 14, 2014 12:15 pm

I tested this with kernel versions 3.11.x, 3.12.0-rc2 and 3.12.0-rc3 vanilla, and with the fixes from rmk/for-next and the RobertCNelson patch collection. I tried many different kernel configurations (different I/O schedulers, etc.) but the problem persisted.
I've also tried increasing the netdev watchdog timeout in the FEC driver, but that only increased the recovery time of the Ethernet port.
Xandra
 

Re: High Performance Linpack on Wand Quad + Ubuntu 12.04

Postby wenxuzheng » Fri Mar 02, 2018 3:45 pm

Hello, would you please help me with this? :D

I am trying to benchmark a single node with HPC Challenge. The machine reports itself as "Linux COMPUTE-1-45 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux", and the node has 2 CPUs with 40 cores in total.
I installed the required packages, and when I execute "make arch=Linux" the build completes correctly.

Here is the input file named "hpccinf.txt":
Code: Select all
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
8            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
80000         Ns
1            # of NBs
80           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
10            Ps
4            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0                            Number of additional problem sizes for PTRANS
1200 10000 30000           values of N
0                          number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64          values of NB


But when I run the resulting hpcc binary with the command "mpirun -np 40 ./hpcc", it outputs:
Code: Select all
HPL ERROR from process # 0, on line 440 of function HPL_pdinfo:
>>> Need at least 40 processes for these tests <<<

HPL ERROR from process # 0, on line 440 of function HPL_pdinfo:
>>> Need at least 40 processes for these tests <<<

HPL ERROR from process # 0, on line 440 of function HPL_pdinfo:
>>> Need at least 40 processes for these tests <<<

HPL ERROR from process # 0, on line 440 of function HPL_pdinfo:
>>> Need at least 40 processes for these tests <<<

HPL ERROR from process # 0, on line 440 of function HPL_pdinfo:
>>> Need at least 40 processes for these tests <<<

HPL ERROR from process # 0, on line 440 of function HPL_pdinfo:
>>> Need at least 40 processes for these tests <<<

HPL ERROR from process # 0, on line 440 of function HPL_pdinfo:
>>> Need at least 40 processes for these tests <<<

HPL ERROR from process # 0, on line 440 of function HPL_pdinfo:
>>> Need at least 40 processes for these tests <<<

HPL ERROR from process # 0, on line 440 of function HPL_pdinfo:
>>> Need at least 40 processes for these tests <<<

HPL ERROR from process # 0, on line 440 of function HPL_pdinfo:
>>> Need at least 40 processes for these tests <<<

HPL ERROR from process # 0, on line 440 of function HPL_pdinfo:
>>> Need at least 40 processes for these tests <<<

HPL ERROR from process # 0, on line 440 of function HPL_pdinfo:
>>> Need at least 40 processes for these tests <<<

HPL ERROR from process # 0, on line 440 of function HPL_pdinfo:
>>> Need at least 40 processes for these tests <<<

HPL ERROR from process # 0, on line 440 of function HPL_pdinfo:
>>> Need at least 40 processes for these tests <<<
......
......
this StarRandomAccess MAX_ENERGY_STATUS_JOULES is 262143.999939
0   33.2376   11.8474   56.0355   19.9735
1   27.7910   20.4224   46.8530   34.4302
time is 0.593153
this is PTRANS MAX_ENERGY_STATUS_JOULES 262143.999939
0   83.2165   49.9026   52.2176   31.3135
1   77.4271   70.3126   48.5848   44.1205
time is 1.593649
this StarSTREAM MAX_ENERGY_STATUS_JOULES is 262143.999939
0   62.0610   33.8760   54.6481   29.8296
1   57.3278   50.4441   50.4802   44.4188
time is 1.135648
this StarFFT MAX_ENERGY_STATUS_JOULES is 262143.999939
0   37.0407   26.2950   62.0549   44.0525
1   33.4549   26.1916   56.0476   43.8792
time is 0.596902
this StarRandomAccess MAX_ENERGY_STATUS_JOULES is 262143.999939
0   34.6433   17.3290   51.8108   25.9164
1   33.0316   29.5806   49.4004   44.2393
......
......


The omitted parts repeat the same pattern as above.

I have tried running the same mpirun command several times with different numbers of processes, but it output the same errors each time. I don't know if I have to set anything else up to solve the error.

I have read through this thread several times. Do I need to change any files or packages?

Any help on this matter would be greatly appreciated!
wenxuzheng
 

