APPENDIX 8

High Performance Fortran

Introduction

Very important, at the introduction of a new programming language standard, is nowadays that the language should permit efficient compilation and execution not only on conventional computers, but also on parallel and massively parallel systems. Somewhat simplified it could be stated that Fortran 90 is efficient on conventional computers and on vector processors, but less efficient on parallel processors. The reason for this is that in Fortran it does not exist any commands for partition into parallel processes, except for those that follow directly from the array concept, with its operators and functions.

A group based in Houston, High Performance Fortran Forum, has developed a proposal in the form of an extension to Fortran 90. The aim of this project High Performance Fortran or HPF is to offer a portable language which permits an efficient use on different parallel systems. The project has issued a final proposal in May 1993, and aims towards a de facto standard, and not formal standard from ANSI or ISO. In order to facilitate the introduction and general acceptance of HPF the group has also defined a subset based on Fortran 77 and just a few parts of Fortran 90. A number of manufacturers were involved in the group, and the proposal will hopefully get a fast acceptance, with many implementations. Those parts of the proposal that were controversial, and therefore requires further study, is available in a separate document "Journal of Development".

A very good book on HPF is the one by Koelbel et al.

The High Performance Fortran Forum has continued to develop the HPF language and are aiming to publish a new specification later this year. New features address asynchronous I/O, tasking, more general distributions, interfacing to ANSI C, reduction operations and more. It is likely that the core language will not be significantly expanded and that the specification will define extensions which are standardized but optional for a given implementation. There will also be a definition of a kernel of facilities which should be most efficient across a wide range of implementations.

The proposal is a Fortran-based language with facilities to control distribution of arrays onto distributed memory parallel computers and includes extensions for

The complete HPF is based on Fortran 90 and includes The Subset HPF is instead based on Fortran 77 in order to become implemented already in 1993 and includes The directives have been given such a form that it will be possible to use them as ordinary statements in a future version of Fortran. At present they will have to be written as comments.

CHPF$ directive	   ! Fortran 77 and fix form Fortran 90
!HPF$ directive	   ! Fortran 90 (both fix and free form)
Since the founding of the group HPFF, much has been accomplished and many implementations of HPF are now available, see the list of HPF Compilers.

Some links to further HPF information

Data storage

HPF has directives on how data ought to be stored. These directives do not influence the computational results, but they may influence the computational speed (hopefully increasing it). The reason for introducing the directives for data storage are the following two very simple observations: Using the data storage directives we can distribute the data on the various processors. Sometimes it is desirable first to align an array against a template, and then distribute the template on the processors. A first example can explain the situation.
	REAL A(1000), B(1000), C(1000), X(500), Y(0:501)
	INTEGER INX(1000)
!HPF$   PROCESSORS PROCS(10)
!HPF$   DISTRIBUTE A(BLOCK) ONTO PROCS
!HPF$   DISTRIBUTE B(BLOCK) ONTO PROCS
!HPF$   DISTRIBUTE INX(BLOCK) ONTO PROCS
!HPF$   DISTRIBUTE C(CYCLIC) ONTO PROCS
!HPF$   ALIGN X(I) WITH Y(I-1)

	A(I) = B(I)                             ! (1)
	X(I) = Y(I-1)                           ! (2)
	A(I) = C(I)                             ! (3)
	A(I) = A(I-1) + A(I) + A(I+1)           ! (4)
	C(I) = C(I-1) + C(I) + C(I+1)           ! (5)
	X(I) = Y(I)                             ! (6)
	A(I) = A(INX(I)) + B(INX(I))            ! (7)
We here work with 10 processors, and we have distributed the floating point vectors A and B and also the integer vector INX block wise over these processors. We then have the first one hundred elements from each one of the vectors on the first processor, the next one hundred on the next processor, and so on. The floating point vector C is however distributed cyclically, the first, eleventh, twenty-first etc. elements are on the first processor, the second, twelfth, twenty-second etc. elements are on the second, and so on.

Without instructing the system exactly how the two vectors X and Y are to be distributed, we require that the I-th element of the vector X is on the same processor as the (I-1)-th element of the vector Y.

The assignments in the cases (1) and (2) above can thus be done without any inter-processor communication, since all data all the time is on the same processor.

The assignment in case (3) will in most instances be between different processors, only in exceptional cases will the data be on the same processor.

The assignment in case (4) on the other hand will in most instances be between data on the same processor, except in the exceptional cases at block boundaries. The case (5) looks like case (4), but in this case the data is all the time on different processors.

For case (6) we only know that X(I) and Y(I-1) are on the same processor, but this gives no information whatsoever on how X(I) and Y(I) are distributed. At block wise storage both of them are normally on the same processor, at cyclic storage they are always on different processor (provided that the cycles are the same).

We do not know very much about the distribution in case (7). It is however easy to realize that A(INX(I)) and B(INX(I)) always are on the same processor, and that also A(I) and INX(I) are on the same processor. It is however not likely that these two sets will coincide.

Also the formal arguments in a function or a subroutine can be stored in different ways using directives. Assume that we have a subroutine with two arguments, both floating point vectors.
	SUBROUTINE CAROLUS (KARL, CHARLES)
	REAL, DIMENSION (:) :: KARL, CHARLES
!HPF$   INHERIT :: KARL
!HPF$   ALIGN WITH KARL :: CHARLES
!       ...
        END SUBROUTINE CAROLUS
This is called with
        PROGRAM APP83B
	REAL, DIMENSION (1718) :: GUSTAV, ADOLF
        INTERFACE
          SUBROUTINE CAROLUS (KARL, CHARLES)
  	  REAL, DIMENSION (:) :: KARL, CHARLES
          END SUBROUTINE CAROLUS
        END INTERFACE
!       ..
	CALL CAROLUS (GUSTAV, ADOLF)
        END PROGRAM APP83B
This means that the first formal argument KARL is distributed in the same way as the actual argument GUSTAV. The second formal argument CHARLES is distributed in the same way as the first formal argument, in this case in the same way as GUSTAV, and not as the second actual argument ADOLF.

Another example is when you require the four corner submatrices of the size N*N from a larger matrix of the size (N+1)*(N+1) . In this case we call the large matrix EARTH and the submatrices are called NW, NE, SW, and SE. In some cases the large matrix is not required to be used in any calculations, its only purpose is to assist at the distribution of the data in the submatrices. It is therefore a TEMPLATE, which is not stored.
!HPF$   TEMPLATE, DISTRIBUTE (BLOCK, BLOCK) :: EARTH (N+1, N+1)
	REAL, DIMENSION (N, N) :: NW, NE, SW, SE
!HPF$   ALIGN NW(I, J) WITH EARTH(I, J)
!HPF$   ALIGN NE(I, J) WITH EARTH(I, J+1)
!HPF$   ALIGN SW(I, J) WITH EARTH(I+1, J)
!HPF$   ALIGN SE(I, J) WITH EARTH(I+1, J+1)
Since a TEMPLATE does not represent any real storage, it can not be part of any COMMON. The alignment subscripts can be just a little more complicated than the old array indices in Fortran 66, they are of the general form m*i+n , where i is a variable which is permitted to appear only once in the expression, and m and n are constants (PARAMETER or numerical constant).
A somewhat more complicated directive is
!HPF$   ALIGN A(:) WITH D(:,*)
which means that a copy of the vector A is associated with each column of the matrix C. An example on this follows, where it is also shown that the HPF syntax permits several ways for expressing identical requirements.
!HPF$   TEMPLATE, D1(N), D2(N, N)
	REAL, DIMENSION (N, N) :: X, A, B, C, AR1, AR2, &
                                  P, Q, R, S
!HPF$   ALIGN X(:,*) WITH D1(:)
!HPF$   ALIGN (:,*) WITH D1 :: A, B, C, AR1, AR2
!HPF$   ALIGN WITH D2, DYNAMIC :: P, Q, R, S
The following is a more complete example, where an intrinsic function finds the present number of processors, which is used at the distribution of the arrays. In addition a usual (external) function is being used.
        PROGRAM APP83C
        IMPLICIT NONE

        INTERFACE
           FUNCTION F(X)
           REAL                :: F
           REAL, DIMENSION (:) :: X
           END FUNCTION F
        END INTERFACE 

        REAL, DIMENSION (1000)  :: X, Y, Z
        REAL, DIMENSION (5000)  :: V, W
        REAL :: TEMP
        REAL, DIMENSION (10,1000) :: A

!HPF$   PROCESSORS MPP(NUMBER_OF_PROCESSORS())

!HPF$   TEMPLATE T(1000), S(5000)
!HPF$   DISTRIBUTE T(BLOCK) ONTO MPP    ! Block size may
!HPF$   DISTRIBUTE S(CYCLIC) ONTO MPP   ! be specified
!HPF$   ALIGN WITH T :: X, Y, Z
!HPF$   ALIGN WITH S :: V, W

!HPF$   ALIGN A(*,:) WITH T(:)          ! Vector of columns

!	...
        TEMP = F(V)
!	...
        END PROGRAM APP83C

	REAL FUNCTION F(X)
        IMPLICIT NONE
	REAL, DIMENSION (:) :: X
!HPF$	INHERIT :: X            ! Distribute the formal argument
                                ! X as the real argument V
	REAL, DIMENSION (SIZE(X)) :: S
!HPF$	ALIGN WITH X :: S       ! Distribute the local vector S
                                ! as the formal argument X, or 
                                ! as the actual argument V
!	...
        F = SUM(S)              ! Only dummy example
	RETURN
	END FUNCTION F
You can specify different distributions along different dimensions. The following specifications
        REAL A(100,100), B(100,100), C(200)
!HPF$   DISTRIBUTE A(BLOCK,*), B(*,CYCLIC), C(BLOCK(5))
means that the first processor of a four processor computer stores the following array sections
	A(1:25, 1:100)
	B(1:100, 1:97:4)
	C(1:5), C(21:25), C(41:45), C(61,65), C(81:85),
	C(101:105), C(121:125), C(141:145), C(161,165),
	C(181:185)
It is also possible to distribute several dimensions in completely independent ways,
	REAL D(8,100,100)
!HPF$   DISTRIBUTE D(*,BLOCK, CYCLIC)
means that the first processor of a four processor computer, configured as a 2*2 matrix, stores the following array sections
	D(1:8, 1:50, 1:99:2)
In addition to the static directives discussed above, there are also the two dynamic directives REDISTRIBUTE and REALIGN, who permit an array to switch its distribution within the subroutine. At the use of a subroutine it does exist three different possibilities for the formal arguments. At the exit from a subroutine it is necessary that the effect of any redistribution is removed.

Execution

A new statement FORALL is introduced as an alternative to the DO-loop. The intent with the FORALL statement is that its content can be executed in any order, independent of the index! It therefore gives the possibility of a parallel implementation. Some rules are however introduced in order to avoid side effects. Some simple examples on FORALL follow.
	FORALL (I = 1:N, J = 1:N) H(I, J) = 1.0/REAL(I+J-1)
	FORALL (I = 1:N, J = 1:N, Y(I, J) .NE. 0.0) &
			X(I,J) = 1.0/Y(I,J)
	FORALL (I = 1:N) A(I,I+1:N) = 3.141592654 
The first of these define a Hilbert matrix of order N, the second inverts the elements of a matrix, avoiding division with zero. In the third example all elements above the main diagonal of the matrix are assigned the value of the mathematical constant pi.

In all the three statements above, FORALL can be considered as a double loop, which can be executed in arbitrary order. The general form of the FORALL statement is

	FORALL ( v1 = l1:u1:s1, ... , vn = ln:un:sn, mask ) &
		a(e1, ... , em) = right_hand_side
and is evaluated according to certain well specified rules, in principle all indices are evaluated first.

In addition there is FORALL construct, which has to be distinguished from the single line FORALL statement. The construct is interpreted in the same way as if its internal statements were written as FORALL statements in the same order. This restriction is important in order to achieve a unique computational result. Within a FORALL construct calls of functions or subroutines are permitted only if these subprograms are pure (without side effects). First we give a simple example of a FORALL construct.
	REAL, DIMENSION(N, N) :: A, B
	...
	FORALL (I = 2:N-1, J = 2:N-1)
	   A(I,J) = 0.25*(A(I,J-1)+A(I,J+1)+A(I-1,J)+A(I+1,J))
	   B(I,J) = A(I,J)
	END FORALL
When these statements have been executed the arrays A and B have identical values in the internal points, while B has kept its previous values on the boundaries.

In addition a directive INDEPENDENT has been introduced for both DO loops and FORALL constructs. This directive is placed immediately before the DO statement or FORALL construct, and is valid until the corresponding END DO (or the old form of terminating a DO loop) or END FORALL. The directive assures the system that this part of the program can be executed in an arbitrary order, including parallel, without any computational differences in the result (no semantic change).

In the example below it is thus assured that the integer vector P does not have any repeated values (which would have meant that last one wins at a normal sequential execution). A potential conflict at parallel execution is thus avoided. It is also implicitly assured that all values of P are within the permitted bounds of 1 and 200.

	REAL, DIMENSION(200) 	 :: A
	REAL, DIMENSION(100) 	 :: B
	INTEGER, DIMENSION(100)  :: P
	...
!HPF$   INDEPENDENT
	DO I = 1, 100
	   A(P(I)) = B(I)
	END DO
It is also possible to indicate that certain parts of a nested loop shall be considered as independent. In the example below the innermost loop is not independent since each element of A is assigned several times, for all values of I4.
	REAL, DIMENSION(:, :, :), ALLOCATABLE :: A, B, C
        ...
        ALLOCATE (A(N,N,N))
        ALLOCATE (B(N,N,N))
        ALLOCATE (C(N,N,N))
        ...
!HPF$   INDEPENDENT, NEW (I2)
	DO I1 = 1, N1
!HPF$     INDEPENDENT, NEW (I3)
	  DO I2 = 1, N2
!HPF$       INDEPENDENT, NEW (I4)
	    DO I3 = 1, N3
	      DO I4 = 1, N4     ! The innermost loop is NOT
                                ! independent !
		 A(I1, I2, I3) = A(I1, I2, I3) &
		   + B(I1, I2, I4) * C(I2, I3, I4)
	      END DO
	    END DO
	  END DO
	END DO
The NEW clauses are required, since the inner loop indices are assigned and used in different iterations of the outer loops.

A DO loop and a FORALL construct with the same content are equivalent if they both have been given the INDEPENDENT directive.

Intrinsic functions

In addition to all the intrinsic functions of Fortran 90 the language extension HPF introduces a few new, and modifies two of those already in Fortran 90.

To the functions MAXLOC and MINLOC have been added an optional parameter DIM, which informs along which dimension the maximum or minimum is to be taken. An example illustrates how it works on MAXLOC. Without the extra parameter the index for the largest element is returned. If the optional parameter DIM = 1 the row numbers for the largest element in each column are returned. If the optional parameter DIM = 2 the column numbers for the largest element in each row are returned. The corresponding applies to MINLOC. These new facilities are the same as those that already exist in Fortran 90 for MAXVAL and MINVAL.
                0	-5	 8	-3
        A =     3	 4	-1	 2
                1	 5	 6	-4
gives the following values
	MAXLOC(A)               = (/ 1, 3 /)
	MAXLOC(A, DIM = 1)      = (/ 2, 3, 1, 2 /)
	MAXLOC(A, DIM = 2)      = (/ 3, 2, 3 /)
The following completely new functions have been added. The inquiry functions are to be intrinsic, but the others may instead be available in a library as external functions.

Storage rules

For Fortran 90 there exist certain rules about storage of variables, especially for arrays who have to be stored in a well specified order, with the first index varying fastest. There are also special rules for the storage of variables that are in COMMON and/or EQUIVALENCE. These rules on sequential storage can conflict with the distribution requested by the HPF statements ALIGN, REALIGN, DISTRIBUTE and REDISTRIBUTE. The Fortran model works with linear storage with one storage unit for integers and ordinary floating point variables, and two storage units for double precision floating point numbers.

Fortran has storage association and sequence association. The Fortran 90 standard states in (14.6.3) that storage association is the association of two or more data objects that occurs when two or more storage sequences share or are aligned with one or more storage units, and in (12.4.1.4) that sequence association is the order that Fortran requires when an array is associated a formal argument. The rank and shape of the actual argument need not agree with the rank and shape of the dummy argument, but the number of elements in the dummy argument must not exceed the number of elements in the element sequence of the actual argument. If the dummy argument is assumed size, the number of elements in the dummy argument is exactly the number of elements in the element sequence.

Note that HPF has no problem with array parameters distributed over the processors, as long as both the actual and the dummy arguments have the same rank and shape. It is when the properties of Fortran, with respect to COMMON and EQUIVALENCE, are used too much, that we get into problems. If we use a subroutine that contains the following specifications

        SUBROUTINE HOME(X)
	DIMENSION X(20,10)
it can be called with CALL HOME(Y(2,1)) provided that the array X is specified SEQUENTIAL in the subroutine HOME and the array Y is also specified SEQUENTIAL in the calling program unit. The simple call CALL HOME(Y) is permitted if X and Y are both sequential, or if X and Y are dimensioned in exactly the same way.

A directive SEQUENCE has therefore been introduced in order to inform the system that a variable or a COMMON block shall be considered sequential. Using HPF the normal case is that COMMON blocks are considered non-sequential, while variables normally also are non-sequential (except in a few cases, as when the variable is part of a sequential COMMON block or it is an array with assumed dimension).

In order to get old programs to produce the same results with HPF the following is recommended:

HPF Subset

The following news from Fortran 90 are included in the HPF Subset: The following from the complete HPF are however not included in the HPF Subset: In addition there are certain restrictions in the subset on the permitted values of the argument DIM to MAXLOC and MINLOC, and there is no definition of extrinsic subprograms and no binding to Fortran 90.

HPF 2.0

The High Performance Fortran Forum is pleased to announce the availability of The High Performance Fortran Language Specification, version 2.0. HPF 2.0 contains: The HPF 2.0 language includes features that are expected of every HPF implementation within a year of release of the language specification. These include: The Approved Extensions include advanced features that meet specific needs, but are not likely to be supported in all initial compiler implementations. These include: The specification is available in several forms:


Last modified: 10 September 1998
boein@nsc.liu.se