Bug 31067 - MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
Summary: MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: fortran
Version: 4.3.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 31066
Blocks:
 
Reported: 2007-03-07 10:20 UTC by Francois-Xavier Coudert
Modified: 2012-09-21 18:50 UTC (History)
7 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2007-04-02 17:44:28


Attachments
Setting the correct rank in minloc (251 bytes, patch)
2007-03-07 21:29 UTC, Thomas Koenig
Details | Diff
gcc47-pr31067.patch (3.22 KB, patch)
2011-07-28 13:51 UTC, Jakub Jelinek
Details | Diff
gcc47-pr31067.patch (3.44 KB, patch)
2011-07-28 16:00 UTC, Jakub Jelinek
Details | Diff

Description Francois-Xavier Coudert 2007-03-07 10:20:28 UTC
[see http://www.polyhedron.co.uk/pb05/linux/f90bench_AMD.html for the original polyhedron benchmark results, an explanation of what the benchmark is and the source code]

Typical timings for the gas_dyn.f90 benchmark on my AMD64/linux system are:

* ifort -O3 -xW -ipo -static -V gas_dyn.f90 -o gas_dyn.intel
=> ./gas_dyn.intel  10.53s user 0.43s system 99% cpu 10.976 total

* gfortran -static -ftree-vectorize -march=opteron -ffast-math -funroll-loops -O3 gas_dyn.f90 -o gas_dyn.gfortran
=> ./gas_dyn.gfortran  15.92s user 0.05s system 99% cpu 15.969 total

Experimenting a bit with Intel options to understand why it is so fast, I found that:
  * disabling inlining doesn't change the execution time
  * disabling vectorization drops it to the same execution time as gfortran (roughly speaking)

Following an analysis by Tobias Burnus, and noting that 22.16% of the total time is spent in the MINLOC library routine, I modified the source by replacing a call to MINLOC by inline code:
--- gas_dyn.f90 2007-03-07 09:36:23.000000000 +0100
+++ gas_dyn.modified.f90        2007-03-07 10:44:14.000000000 +0100
@@ -234,12 +234,23 @@ end module ints
 !-----------------------------------------------
 !   L o c a l   V a r i a b l e s
 !-----------------------------------------------
-      INTEGER :: ISET(1)
-      REAL :: VSET, SSET
+      INTEGER :: ISET(1), I
+      REAL :: VSET, SSET, T
       REAL, DIMENSION (NODES) :: DTEMP
 !-----------------------------------------------
       DTEMP = DX/(ABS(VEL) + SOUND)
-      ISET = MINLOC (DTEMP)
+! FXC replace this:
+!      ISET = MINLOC (DTEMP)
+! by this:
+      ISET(1) = 0
+      T = HUGE(T)
+      DO I = 1, NODES
+        IF (DTEMP(I) < T) THEN
+          T = DTEMP(I)
+          ISET(1) = I
+        END IF
+      END DO
+! end of modification
       DT = DTEMP(ISET(1))
       VSET = VEL(ISET(1))
       SSET = SOUND(ISET(1))

This makes the code faster by 14%:
./gas_dyn.modified.gfortran  13.56s user 0.05s system 99% cpu 13.614 total

Maybe we should inline MINLOC when there is no mask and the argument is one-dimensional with stride 1?


PS: Other hot spots are:
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 29.13      4.18     4.18                             eos_ (gas_dyn.f90:410 @ 413386)
 14.22      6.22     2.04                             chozdt_ (gas_dyn.f90:241 @ 4152b3)

Both lines are whole-array operations, corresponding to:
    CS(:NODES) = SQRT(CGAMMA*PRES(:NODES)/DENS(:NODES))
and
    DTEMP = DX/(ABS(VEL) + SOUND)

I filed PR31066, which I think is a small reproducer for the two lines above.
Comment 1 Thomas Koenig 2007-03-07 11:27:51 UTC
(In reply to comment #0)

> Maybe we should inline MINLOC when there is no mask and the argument is
> one-dimensional with stride 1?

Definitely.  We do this for minval, and from glancing at
gfc_conv_intrinsic_minmaxval and gfc_conv_intrinsic_minmaxloc,
it should happen already.
Comment 2 Francois-Xavier Coudert 2007-03-07 12:18:10 UTC
(In reply to comment #1)
> We do this for minval, and from glancing at
> gfc_conv_intrinsic_minmaxval and gfc_conv_intrinsic_minmaxloc,
> it should happen already.

No, because we never get into gfc_conv_intrinsic_minmaxloc. We translate the expression directly into a function call by calling gfc_conv_intrinsic_funcall() at the head of gfc_conv_intrinsic_function(), instead of going down the list. I wonder how SUM and PRODUCT are inlined...
Comment 3 Thomas Koenig 2007-03-07 21:00:13 UTC
(In reply to comment #2)

> No, because we never get into gfc_conv_intrinsic_minmaxloc. We translate the
> expression directly into a function call by calling
> gfc_conv_intrinsic_funcall() at the head of gfc_conv_intrinsic_function(),
> instead of going down the list. I wonder how SUM and PRODUCT are inlined...

In  gfc_conv_intrinsic_function, expr->rank is 0 for minval
and 1 for minloc (which is bogus).  I wonder where this is
set...
Comment 4 fxcoudert@gmail.com 2007-03-07 21:09:44 UTC
Subject: Re:  MINLOC should sometimes be inlined (gas_dyn is sooooo sloooow)

> In  gfc_conv_intrinsic_function, expr->rank is 0 for minval
> and 1 for minloc (which is bogus).

It's not bogus. The MINLOC result is an array of rank 1; that's correct.
Comment 5 Thomas Koenig 2007-03-07 21:09:52 UTC
(In reply to comment #3)

> In  gfc_conv_intrinsic_function, expr->rank is 0 for minval
> and 1 for minloc (which is bogus).  I wonder where this is
> set...

To answer my own question:  This is set in gfc_resolve_minloc.
I'll try to give it a shot.
Comment 6 Thomas Koenig 2007-03-07 21:29:41 UTC
Created attachment 13165 [details]
Setting the correct rank in minloc

This makes minloc have rank 0, and allows for
inlining.  I guess we'll find out now whether the
inline code works :-)
Comment 7 Francois-Xavier Coudert 2007-03-08 05:50:25 UTC
(In reply to comment #6)
> This makes minloc have rank 0, and allows for
> inlining.

No, it's wrong. See F95 13.14.71: "Result Characteristics. The result is of type default integer.  If DIM is absent, the result is an array of rank one and of size equal to the rank of ARRAY; otherwise, the result is of rank n-1 and shape ..."

Note in particular the example for case (i), on the next page: "The value of MINLOC((/4,3,6,3/)) is [2]". If DIM is absent, the result is always an array of rank 1. Only if DIM is present, and the ARRAY is of rank 1, then MINLOC is a scalar: MINLOC((/4,3,6,3/),1) == 2
Comment 8 Thomas Koenig 2007-03-10 12:34:46 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > This makes minloc have rank 0, and allows for
> > inlining.
> 
> No, it's wrong. See F95 13.14.71: "Result Characteristics. The result is of
> type default integer.  If DIM is absent, the result is an array of rank one and
> of size equal to the rank of ARRAY; otherwise, the result is of rank n-1 and
> shape ..."

You're right.  Back to the drawing board.
Comment 9 Thomas Koenig 2007-03-11 19:43:23 UTC
I have looked at this some more.  Changing gfc_conv_intrinsic_function so that we call gfc_conv_intrinsic_minmaxloc is easy enough:

@@ -3481,7 +3481,9 @@ gfc_conv_intrinsic_function (gfc_se * se

   name = &expr->value.function.name[2];

-  if (expr->rank > 0 && !expr->inline_noncopying_intrinsic)
+  if (expr->rank > 0 && !expr->inline_noncopying_intrinsic
+      && ! (expr->rank == 1 && (isym->generic_id ==  GFC_ISYM_MINLOC
+                               || isym->generic_id == GFC_ISYM_MAXLOC)))
     {
       lib = gfc_is_intrinsic_libcall (expr);
       if (lib != 0)

If we do that, we hit the "if (se->ss)" condition at the top of that function, and we would have to handle scalarization of that one-trip loop.  I currently have no idea how to go about that.  Simply removing the condition doesn't work :-)

As a workaround, one could always use "minloc(...,dim=1)", then
we get the inline version.

Unassigning myself.
Comment 10 Paul Thomas 2007-03-12 19:04:31 UTC
(In reply to comment #9)

> 
> As a workaround, one could always use "minloc(...,dim=1)", then
> we get the inline version.
> 
Thomas, it's a bit kludgy, but why not add a constant expression = 1, if dim is not present?

Cheers

Paul

Comment 11 Thomas Koenig 2007-03-13 20:12:08 UTC
(In reply to comment #10)

> Thomas, it's a bit kludgy, but why not add a constant expression = 1, if dim is
> not present?

Hi Paul,

unless I'm mistaken, this would also change the rank of the
function to 0, which FX explained is wrong.

Do you have any idea what I could do to turn this into an array?
All the "interesting" gfc_conv_intrinsic_* functions have the
"if (se->ss)" statement on top.

Cheers,
      Thomas
Comment 12 Paul Thomas 2007-03-26 11:37:55 UTC
(In reply to comment #11)
> (In reply to comment #10)

> Do you have any idea what I could do to turn this into an array?
> All the "interesting" gfc_conv_intrinsic_* functions have the
> "if (se->ss)" statement on top.

I'll put my thinking cap on.  To first order, you have to follow the same route as array_transfer... I'll come back to you.

Paul
Comment 13 Paul Thomas 2007-03-26 12:43:49 UTC
(In reply to comment #11)
> (In reply to comment #10)
Thomas,

It does not look too bad:

Look at the tail end of array_transfer -

  gfc_trans_create_temp_array (&se->pre, &se->post, se->loop,
			       info, mold_type, false, true, false);

  /* Cast the pointer to the result.  */
  tmp = gfc_conv_descriptor_data_get (info->descriptor);

This will produce a temporary array of the right dimension, together with its descriptor.

For minmaxloc, mold_type will have to be replaced by the TREE_TYPE of the result (gfc_array_index_type, I suppose? or else you will have to use gfc_typenode_for_spec (&expr->ts), I think). tmp will now be a pointer to your result array.  This needs to appear fairly early on in minmaxloc, so that the array can be set to zero to initialize it and so that the location can be transferred to it.  The standard checks will have to be made that ss->loop exists etc.; just check some of the other array-valued in-line intrinsics.

Having done this, you will need to replace the line (~2049) following "remember where we are" with a loop over the n dimensions (note, we do not have to restrict ourselves to one dimension :-) ).

Something like:

  /* Remember where we are.  */
  for (n = 0; n < loop.dimen; n++)
    {
      pos = build_fold_indirect_ref (gfc_conv_array_data (info->descriptor));
      pos = gfc_build_array_ref (pos, build_int_cst (gfc_array_index_type, n));
      gfc_add_modify_expr (&ifblock, pos, loop.loopvar[n]);
    }
should bang the position into the result array, which is transferred at the end with
  se->expr = info->descriptor;

Good luck

Cheers

Paul
Comment 14 Thomas Koenig 2007-04-02 17:44:28 UTC
I'll give this another shot.

Maybe inlining isn't even necessary for good performance...
Comment 15 Thomas Koenig 2007-04-02 21:00:52 UTC
The library version doesn't do too badly compared
to the inline version:

$ cat benchmark-inline.f90
program main
  implicit none
  integer, dimension(1) :: n
  real, allocatable :: a(:)
  integer :: i

  allocate (a(100))
  call random_number(a)
  do i=1, 10000000
    n = minloc(a, dim=1)
  end do
end program main
$ cat benchmark-library.f90
program main
  implicit none
  integer, dimension(1) :: n
  real, allocatable :: a(:)
  integer :: i

  allocate (a(100))
  call random_number(a)
  do i=1, 10000000
    n = minloc(a)
  end do
end program main
$ gfortran -O3 -static benchmark-inline.f90 && ./a.out && time ./a.out

real    0m6.941s
user    0m5.232s
sys     0m0.016s
$ gfortran -O3 -static benchmark-library.f90 && ./a.out && time ./a.out

real    0m5.720s
user    0m5.472s
sys     0m0.004s
Comment 16 Janne Blomqvist 2007-05-18 21:15:58 UTC
The critical thing with inlining array intrinsics, IMHO, is to give the optimizer more data to work with, allowing it to get rid of temporary arrays, perform loop fusion or fission, etc.  So with a trivial benchmark like the one in comment #15 you don't see any difference, except that user code is potentially compiled at higher optimization than libgfortran.  For example, in gas_dyn/chozdt:

      REAL, DIMENSION (NODES) :: DTEMP
!-----------------------------------------------
! Profile for gfortran 4.3:
! CPU_CLK_UNHALTED L2_CACHE_MISS
! samp  %runtim  samp %tot
! 59887 22.4783  1484 10.9828   :      DTEMP = DX/(ABS(VEL) + SOUND)
! ifort 9.1 profile
! 40104 16.2034  1198  8.8166   :      DTEMP = DX/(ABS(VEL) + SOUND)

      DTEMP = DX/(ABS(VEL) + SOUND)
      ISET = MINLOC (DTEMP)
      DT = DTEMP(ISET(1))

If MINLOC were inlined, perhaps a sufficiently smart optimizer could convert this into the equivalent

REAL :: DTEMPMIN, T
INTEGER :: I

DTEMPMIN = HUGE(1.0)
DO I = 1, NODES
    T = DX(I)/(ABS(VEL(I)) + SOUND(I))
    IF (T < DTEMPMIN) THEN
        DTEMPMIN = T
    END IF
END DO
DT = DTEMPMIN

i.e. avoid the temporary array entirely. Yes, I guess this is quite a lot to ask.
Comment 17 Janne Blomqvist 2007-05-18 21:20:17 UTC
Or even better (duh):

REAL :: DTEMP

DT = HUGE(1.0)
DO I = 1, NODES
    DTEMP = DX(I)/(ABS(VEL(I)) + SOUND(I))
    IF (DTEMP < DT) THEN
        DT = DTEMP
    END IF
END DO
Comment 18 Thomas Koenig 2007-06-15 20:35:30 UTC
Too little time right now.

Unassigning myself.
Comment 19 Janne Blomqvist 2007-06-27 14:49:03 UTC
gfortran does inline most array intrinsics, but only if the result is a scalar. For most array intrinsics this isn't that much of a problem since usually one uses the variant that returns a scalar, but MINLOC is different in that usually one wants to use the version that returns an array. If one implements this I guess it would be straightforward to replicate the solution to many other array intrinsics as well.
Comment 20 Paul Thomas 2008-01-02 19:30:42 UTC
(In reply to comment #19)
> gfortran does inline most array intrinsics, but only if the result is a scalar.
> For most array intrinsics this isn't that much of a problem since usually one
> uses the variant that returns a scalar, but MINLOC is different in that usually
> one wants to use the version that returns an array. If one implements this I
> guess it would be straightforward to replicate the solution to many other array
> intrinsics as well.
> 
Janne,

In contemplating what to do with gfortran in the New Year, I have been mulling over in-lining of array intrinsics; MATMUL is one distinctly possible one, as are MINLOC and MAXLOC.

Cheers

Paul
Comment 21 Dominique d'Humieres 2008-01-02 20:27:46 UTC
> MATMUL is one distinctly possible one

Paul,
If you are interested, I have a variant of induct.f90 in which I have replaced three dot products by a matrix-vector product, for a total disaster on all compilers I have tried (for gfortran, from 66s to 196s).

Comment 22 Richard Biener 2009-07-03 10:00:46 UTC
One issue is that

     ISET = MINLOC (DTEMP)

will cause GCC to assume that DTEMP is clobbered.
Comment 23 Richard Biener 2009-07-03 12:19:37 UTC
We are not able to vectorize the loop in

program main
  implicit none
  integer, volatile, dimension(1) :: n
  real, allocatable :: a(:)
  integer :: i
  real :: t1, t2

  allocate (a(100))
  call random_number(a) ! negligible time
  !call cpu_time(t1)
  do i=1, 10000000
    n = minloc(a, dim=1)
  end do
  !call cpu_time(t2)
 print *, n !, t2-t1
end program main

because there are two reductions in that loop which I think the vectorizer
cannot handle:

<bb 7>:
  # pos.0_31 = PHI <pos.0_3(10), 0(6)>
  # limit.2_8 = PHI <limit.2_5(10), 3.4028234663852885981170418348451692544e+38(6)>
  # S.3_74 = PHI <S.3_33(10), pretmp.22_79(6)>
  D.1593_21 = S.3_74 + pretmp.22_77;
  limit.2_22 = (*D.1568_14)[D.1593_21];
  D.1595_23 = limit.2_22 < limit.2_8;
  D.1596_24 = pos.0_31 == 0;
  D.1597_27 = limit.2_8 == limit.2_22;
  D.1598_28 = D.1597_27 & D.1596_24;
  D.1599_29 = D.1595_23 | D.1598_28;
  pos.0_32 = S.3_74 + pretmp.28_90;
  pos.0_3 = [cond_expr] D.1599_29 ? pos.0_32 : pos.0_31;
  limit.2_5 = [cond_expr] D.1599_29 ? limit.2_22 : limit.2_8;
  S.3_33 = S.3_74 + 1;
  if (S.3_33 > pretmp.22_81)
    goto <bb 11>;
  else
    goto <bb 10>;

<bb 10>:
  goto <bb 7>;

we reduce over limit.2_5 and pos.0_3.  The Intel compiler vectorizes
the function just fine.
Comment 24 Tobias Burnus 2009-07-03 12:40:51 UTC
> One issue is that
>      ISET = MINLOC (DTEMP)
> will cause GCC to assume that DTEMP is clobbered.

The problem is that while "MINLOC" is pure, we cannot use DECL_PURE_P, because the result is passed by reference:
  (void)  minloc(&isset, DTEMP)
                 ^^^^^^--- result
Comment 25 Richard Biener 2009-07-03 12:57:08 UTC
Btw, the inlined minloc

              D.1570 = a.dim[0].lbound;
              D.1571 = a.dim[0].ubound;
              pos.0 = 0;
              {
                integer(kind=8) S.3;

                 ({ S.3 = D.1570;
                while (1)
                  {
                     ({ if (S.3 > D.1571) goto L.3;
                    offset.1 = 1 - D.1570;
                    if ((*D.1568)[S.3 + D.1569] < limit.2 || pos.0 == 0 && (*D.1568)[S.3 + D.1569] == limit.2)
                      {
                         ({ limit.2 = (*D.1568)[S.3 + D.1569];
                        pos.0 = S.3 + offset.1; }) void
                      }
                    S.3 = S.3 + 1; }) void
                  }
                L.3:; }) void

has a superfluous check || (pos.0 == 0 && (*D.1568)[S.3 + D.1569] == limit.2)
at least for flag_finite_math_only.  If the array cannot contain Inf or NaN
then it either has all elements == FLT_MAX, so pos will stay zero, or at
least one is less than FLT_MAX in which case pos will be adjusted anyway.
Comment 26 Tobias Burnus 2009-07-03 13:07:32 UTC
> has a superfluous check || (pos.0 == 0 && (*D.1568)[S.3 + D.1569] == limit.2)
> at least for flag_finite_math_only.  If the array cannot contain Inf or NaN
> then it either has all elements == FLT_MAX, so pos will stay zero, or at
> least one is less than FLT_MAX in which case pos will be adjusted anyway.

I have not checked whether the algorithm requires the check; NaN/Inf are possible, but maybe the check is still not needed.  And if it is, one could enclose it in an if (!flag_finite_math_only) condition.
Comment 27 Ira Rosen 2009-07-05 06:48:01 UTC
(In reply to comment #23)
> because there are two reductions in that loop which I think the vectorizer
> cannot handle:

Actually, the vectorizer can vectorize two reductions.  I think the problem is the cond_expr in the reduction:

>   pos.0_3 = [cond_expr] D.1599_29 ? pos.0_32 : pos.0_31;
>   limit.2_5 = [cond_expr] D.1599_29 ? limit.2_22 : limit.2_8;

I'll look into it.

Ira
Comment 28 Ira Rosen 2009-07-20 12:03:54 UTC
I've just committed a patch that adds support for cond_expr in reductions in nested cycles (http://gcc.gnu.org/ml/gcc-patches/2009-07/msg01124.html).

A cond_expr cannot be vectorized in a reduction of the innermost loop, because such a reduction changes the order of computation, and that cannot be done for cond_expr.

Ira
Comment 29 Jakub Jelinek 2009-07-24 07:57:36 UTC
Subject: Bug 31067

Author: jakub
Date: Fri Jul 24 07:57:13 2009
New Revision: 150041

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=150041
Log:
	PR fortran/40643
	PR fortran/31067
	* trans-intrinsic.c (gfc_conv_intrinsic_minmaxloc,
	gfc_conv_intrinsic_minmaxval): Handle Infinities and NaNs properly,
	optimize.
	* trans-array.c (gfc_trans_scalarized_loop_end): No longer static.
	* trans-array.h (gfc_trans_scalarized_loop_end): New prototype.

	* libgfortran.h (GFC_REAL_4_INFINITY, GFC_REAL_8_INFINITY,
	GFC_REAL_10_INFINITY, GFC_REAL_16_INFINITY, GFC_REAL_4_QUIET_NAN,
	GFC_REAL_8_QUIET_NAN, GFC_REAL_10_QUIET_NAN, GFC_REAL_16_QUIET_NAN):
	Define.
	* m4/iparm.m4 (atype_inf, atype_nan): Define.
	* m4/ifunction.m4: Formatting.
	* m4/iforeach.m4: Likewise.
	(START_FOREACH_FUNCTION): Initialize dest to all 1s, not all 0s.
	(START_FOREACH_BLOCK, FINISH_FOREACH_FUNCTION,
	FINISH_MASKED_FOREACH_FUNCTION): Run foreach block inside a loop
	until count[0] == extent[0].
	* m4/minval.m4: Formatting.  Handle NaNs and infinities.  Optimize.
	* m4/maxval.m4: Likewise.
	* m4/minloc0.m4: Likewise.
	* m4/maxloc0.m4: Likewise.
	* m4/minloc1.m4: Likewise.
	* m4/maxloc1.m4: Likewise.
	* generated/maxloc0_16_i16.c: Regenerated.
	* generated/maxloc0_16_i1.c: Likewise.
	* generated/maxloc0_16_i2.c: Likewise.
	* generated/maxloc0_16_i4.c: Likewise.
	* generated/maxloc0_16_i8.c: Likewise.
	* generated/maxloc0_16_r10.c: Likewise.
	* generated/maxloc0_16_r16.c: Likewise.
	* generated/maxloc0_16_r4.c: Likewise.
	* generated/maxloc0_16_r8.c: Likewise.
	* generated/maxloc0_4_i16.c: Likewise.
	* generated/maxloc0_4_i1.c: Likewise.
	* generated/maxloc0_4_i2.c: Likewise.
	* generated/maxloc0_4_i4.c: Likewise.
	* generated/maxloc0_4_i8.c: Likewise.
	* generated/maxloc0_4_r10.c: Likewise.
	* generated/maxloc0_4_r16.c: Likewise.
	* generated/maxloc0_4_r4.c: Likewise.
	* generated/maxloc0_4_r8.c: Likewise.
	* generated/maxloc0_8_i16.c: Likewise.
	* generated/maxloc0_8_i1.c: Likewise.
	* generated/maxloc0_8_i2.c: Likewise.
	* generated/maxloc0_8_i4.c: Likewise.
	* generated/maxloc0_8_i8.c: Likewise.
	* generated/maxloc0_8_r10.c: Likewise.
	* generated/maxloc0_8_r16.c: Likewise.
	* generated/maxloc0_8_r4.c: Likewise.
	* generated/maxloc0_8_r8.c: Likewise.
	* generated/maxloc1_16_i16.c: Likewise.
	* generated/maxloc1_16_i1.c: Likewise.
	* generated/maxloc1_16_i2.c: Likewise.
	* generated/maxloc1_16_i4.c: Likewise.
	* generated/maxloc1_16_i8.c: Likewise.
	* generated/maxloc1_16_r10.c: Likewise.
	* generated/maxloc1_16_r16.c: Likewise.
	* generated/maxloc1_16_r4.c: Likewise.
	* generated/maxloc1_16_r8.c: Likewise.
	* generated/maxloc1_4_i16.c: Likewise.
	* generated/maxloc1_4_i1.c: Likewise.
	* generated/maxloc1_4_i2.c: Likewise.
	* generated/maxloc1_4_i4.c: Likewise.
	* generated/maxloc1_4_i8.c: Likewise.
	* generated/maxloc1_4_r10.c: Likewise.
	* generated/maxloc1_4_r16.c: Likewise.
	* generated/maxloc1_4_r4.c: Likewise.
	* generated/maxloc1_4_r8.c: Likewise.
	* generated/maxloc1_8_i16.c: Likewise.
	* generated/maxloc1_8_i1.c: Likewise.
	* generated/maxloc1_8_i2.c: Likewise.
	* generated/maxloc1_8_i4.c: Likewise.
	* generated/maxloc1_8_i8.c: Likewise.
	* generated/maxloc1_8_r10.c: Likewise.
	* generated/maxloc1_8_r16.c: Likewise.
	* generated/maxloc1_8_r4.c: Likewise.
	* generated/maxloc1_8_r8.c: Likewise.
	* generated/maxval_i16.c: Likewise.
	* generated/maxval_i1.c: Likewise.
	* generated/maxval_i2.c: Likewise.
	* generated/maxval_i4.c: Likewise.
	* generated/maxval_i8.c: Likewise.
	* generated/maxval_r10.c: Likewise.
	* generated/maxval_r16.c: Likewise.
	* generated/maxval_r4.c: Likewise.
	* generated/maxval_r8.c: Likewise.
	* generated/minloc0_16_i16.c: Likewise.
	* generated/minloc0_16_i1.c: Likewise.
	* generated/minloc0_16_i2.c: Likewise.
	* generated/minloc0_16_i4.c: Likewise.
	* generated/minloc0_16_i8.c: Likewise.
	* generated/minloc0_16_r10.c: Likewise.
	* generated/minloc0_16_r16.c: Likewise.
	* generated/minloc0_16_r4.c: Likewise.
	* generated/minloc0_16_r8.c: Likewise.
	* generated/minloc0_4_i16.c: Likewise.
	* generated/minloc0_4_i1.c: Likewise.
	* generated/minloc0_4_i2.c: Likewise.
	* generated/minloc0_4_i4.c: Likewise.
	* generated/minloc0_4_i8.c: Likewise.
	* generated/minloc0_4_r10.c: Likewise.
	* generated/minloc0_4_r16.c: Likewise.
	* generated/minloc0_4_r4.c: Likewise.
	* generated/minloc0_4_r8.c: Likewise.
	* generated/minloc0_8_i16.c: Likewise.
	* generated/minloc0_8_i1.c: Likewise.
	* generated/minloc0_8_i2.c: Likewise.
	* generated/minloc0_8_i4.c: Likewise.
	* generated/minloc0_8_i8.c: Likewise.
	* generated/minloc0_8_r10.c: Likewise.
	* generated/minloc0_8_r16.c: Likewise.
	* generated/minloc0_8_r4.c: Likewise.
	* generated/minloc0_8_r8.c: Likewise.
	* generated/minloc1_16_i16.c: Likewise.
	* generated/minloc1_16_i1.c: Likewise.
	* generated/minloc1_16_i2.c: Likewise.
	* generated/minloc1_16_i4.c: Likewise.
	* generated/minloc1_16_i8.c: Likewise.
	* generated/minloc1_16_r10.c: Likewise.
	* generated/minloc1_16_r16.c: Likewise.
	* generated/minloc1_16_r4.c: Likewise.
	* generated/minloc1_16_r8.c: Likewise.
	* generated/minloc1_4_i16.c: Likewise.
	* generated/minloc1_4_i1.c: Likewise.
	* generated/minloc1_4_i2.c: Likewise.
	* generated/minloc1_4_i4.c: Likewise.
	* generated/minloc1_4_i8.c: Likewise.
	* generated/minloc1_4_r10.c: Likewise.
	* generated/minloc1_4_r16.c: Likewise.
	* generated/minloc1_4_r4.c: Likewise.
	* generated/minloc1_4_r8.c: Likewise.
	* generated/minloc1_8_i16.c: Likewise.
	* generated/minloc1_8_i1.c: Likewise.
	* generated/minloc1_8_i2.c: Likewise.
	* generated/minloc1_8_i4.c: Likewise.
	* generated/minloc1_8_i8.c: Likewise.
	* generated/minloc1_8_r10.c: Likewise.
	* generated/minloc1_8_r16.c: Likewise.
	* generated/minloc1_8_r4.c: Likewise.
	* generated/minloc1_8_r8.c: Likewise.
	* generated/minval_i16.c: Likewise.
	* generated/minval_i1.c: Likewise.
	* generated/minval_i2.c: Likewise.
	* generated/minval_i4.c: Likewise.
	* generated/minval_i8.c: Likewise.
	* generated/minval_r10.c: Likewise.
	* generated/minval_r16.c: Likewise.
	* generated/minval_r4.c: Likewise.
	* generated/minval_r8.c: Likewise.
	* generated/product_c10.c: Likewise.
	* generated/product_c16.c: Likewise.
	* generated/product_c4.c: Likewise.
	* generated/product_c8.c: Likewise.
	* generated/product_i16.c: Likewise.
	* generated/product_i1.c: Likewise.
	* generated/product_i2.c: Likewise.
	* generated/product_i4.c: Likewise.
	* generated/product_i8.c: Likewise.
	* generated/product_r10.c: Likewise.
	* generated/product_r16.c: Likewise.
	* generated/product_r4.c: Likewise.
	* generated/product_r8.c: Likewise.
	* generated/sum_c10.c: Likewise.
	* generated/sum_c16.c: Likewise.
	* generated/sum_c4.c: Likewise.
	* generated/sum_c8.c: Likewise.
	* generated/sum_i16.c: Likewise.
	* generated/sum_i1.c: Likewise.
	* generated/sum_i2.c: Likewise.
	* generated/sum_i4.c: Likewise.
	* generated/sum_i8.c: Likewise.
	* generated/sum_r10.c: Likewise.
	* generated/sum_r16.c: Likewise.
	* generated/sum_r4.c: Likewise.
	* generated/sum_r8.c: Likewise.

	* gfortran.dg/maxlocval_2.f90: New test.
	* gfortran.dg/maxlocval_3.f90: New test.
	* gfortran.dg/maxlocval_4.f90: New test.
	* gfortran.dg/minlocval_1.f90: New test.
	* gfortran.dg/minlocval_2.f90: New test.
	* gfortran.dg/minlocval_3.f90: New test.
	* gfortran.dg/minlocval_4.f90: New test.

Added:
    trunk/gcc/testsuite/gfortran.dg/maxlocval_2.f90
    trunk/gcc/testsuite/gfortran.dg/maxlocval_3.f90
    trunk/gcc/testsuite/gfortran.dg/maxlocval_4.f90
    trunk/gcc/testsuite/gfortran.dg/minlocval_1.f90
    trunk/gcc/testsuite/gfortran.dg/minlocval_2.f90
    trunk/gcc/testsuite/gfortran.dg/minlocval_3.f90
    trunk/gcc/testsuite/gfortran.dg/minlocval_4.f90
Modified:
    trunk/gcc/fortran/ChangeLog
    trunk/gcc/fortran/trans-array.c
    trunk/gcc/fortran/trans-array.h
    trunk/gcc/fortran/trans-intrinsic.c
    trunk/gcc/testsuite/ChangeLog
    trunk/libgfortran/ChangeLog
    trunk/libgfortran/generated/maxloc0_16_i1.c
    trunk/libgfortran/generated/maxloc0_16_i16.c
    trunk/libgfortran/generated/maxloc0_16_i2.c
    trunk/libgfortran/generated/maxloc0_16_i4.c
    trunk/libgfortran/generated/maxloc0_16_i8.c
    trunk/libgfortran/generated/maxloc0_16_r10.c
    trunk/libgfortran/generated/maxloc0_16_r16.c
    trunk/libgfortran/generated/maxloc0_16_r4.c
    trunk/libgfortran/generated/maxloc0_16_r8.c
    trunk/libgfortran/generated/maxloc0_4_i1.c
    trunk/libgfortran/generated/maxloc0_4_i16.c
    trunk/libgfortran/generated/maxloc0_4_i2.c
    trunk/libgfortran/generated/maxloc0_4_i4.c
    trunk/libgfortran/generated/maxloc0_4_i8.c
    trunk/libgfortran/generated/maxloc0_4_r10.c
    trunk/libgfortran/generated/maxloc0_4_r16.c
    trunk/libgfortran/generated/maxloc0_4_r4.c
    trunk/libgfortran/generated/maxloc0_4_r8.c
    trunk/libgfortran/generated/maxloc0_8_i1.c
    trunk/libgfortran/generated/maxloc0_8_i16.c
    trunk/libgfortran/generated/maxloc0_8_i2.c
    trunk/libgfortran/generated/maxloc0_8_i4.c
    trunk/libgfortran/generated/maxloc0_8_i8.c
    trunk/libgfortran/generated/maxloc0_8_r10.c
    trunk/libgfortran/generated/maxloc0_8_r16.c
    trunk/libgfortran/generated/maxloc0_8_r4.c
    trunk/libgfortran/generated/maxloc0_8_r8.c
    trunk/libgfortran/generated/maxloc1_16_i1.c
    trunk/libgfortran/generated/maxloc1_16_i16.c
    trunk/libgfortran/generated/maxloc1_16_i2.c
    trunk/libgfortran/generated/maxloc1_16_i4.c
    trunk/libgfortran/generated/maxloc1_16_i8.c
    trunk/libgfortran/generated/maxloc1_16_r10.c
    trunk/libgfortran/generated/maxloc1_16_r16.c
    trunk/libgfortran/generated/maxloc1_16_r4.c
    trunk/libgfortran/generated/maxloc1_16_r8.c
    trunk/libgfortran/generated/maxloc1_4_i1.c
    trunk/libgfortran/generated/maxloc1_4_i16.c
    trunk/libgfortran/generated/maxloc1_4_i2.c
    trunk/libgfortran/generated/maxloc1_4_i4.c
    trunk/libgfortran/generated/maxloc1_4_i8.c
    trunk/libgfortran/generated/maxloc1_4_r10.c
    trunk/libgfortran/generated/maxloc1_4_r16.c
    trunk/libgfortran/generated/maxloc1_4_r4.c
    trunk/libgfortran/generated/maxloc1_4_r8.c
    trunk/libgfortran/generated/maxloc1_8_i1.c
    trunk/libgfortran/generated/maxloc1_8_i16.c
    trunk/libgfortran/generated/maxloc1_8_i2.c
    trunk/libgfortran/generated/maxloc1_8_i4.c
    trunk/libgfortran/generated/maxloc1_8_i8.c
    trunk/libgfortran/generated/maxloc1_8_r10.c
    trunk/libgfortran/generated/maxloc1_8_r16.c
    trunk/libgfortran/generated/maxloc1_8_r4.c
    trunk/libgfortran/generated/maxloc1_8_r8.c
    trunk/libgfortran/generated/maxval_i1.c
    trunk/libgfortran/generated/maxval_i16.c
    trunk/libgfortran/generated/maxval_i2.c
    trunk/libgfortran/generated/maxval_i4.c
    trunk/libgfortran/generated/maxval_i8.c
    trunk/libgfortran/generated/maxval_r10.c
    trunk/libgfortran/generated/maxval_r16.c
    trunk/libgfortran/generated/maxval_r4.c
    trunk/libgfortran/generated/maxval_r8.c
    trunk/libgfortran/generated/minloc0_16_i1.c
    trunk/libgfortran/generated/minloc0_16_i16.c
    trunk/libgfortran/generated/minloc0_16_i2.c
    trunk/libgfortran/generated/minloc0_16_i4.c
    trunk/libgfortran/generated/minloc0_16_i8.c
    trunk/libgfortran/generated/minloc0_16_r10.c
    trunk/libgfortran/generated/minloc0_16_r16.c
    trunk/libgfortran/generated/minloc0_16_r4.c
    trunk/libgfortran/generated/minloc0_16_r8.c
    trunk/libgfortran/generated/minloc0_4_i1.c
    trunk/libgfortran/generated/minloc0_4_i16.c
    trunk/libgfortran/generated/minloc0_4_i2.c
    trunk/libgfortran/generated/minloc0_4_i4.c
    trunk/libgfortran/generated/minloc0_4_i8.c
    trunk/libgfortran/generated/minloc0_4_r10.c
    trunk/libgfortran/generated/minloc0_4_r16.c
    trunk/libgfortran/generated/minloc0_4_r4.c
    trunk/libgfortran/generated/minloc0_4_r8.c
    trunk/libgfortran/generated/minloc0_8_i1.c
    trunk/libgfortran/generated/minloc0_8_i16.c
    trunk/libgfortran/generated/minloc0_8_i2.c
    trunk/libgfortran/generated/minloc0_8_i4.c
    trunk/libgfortran/generated/minloc0_8_i8.c
    trunk/libgfortran/generated/minloc0_8_r10.c
    trunk/libgfortran/generated/minloc0_8_r16.c
    trunk/libgfortran/generated/minloc0_8_r4.c
    trunk/libgfortran/generated/minloc0_8_r8.c
    trunk/libgfortran/generated/minloc1_16_i1.c
    trunk/libgfortran/generated/minloc1_16_i16.c
    trunk/libgfortran/generated/minloc1_16_i2.c
    trunk/libgfortran/generated/minloc1_16_i4.c
    trunk/libgfortran/generated/minloc1_16_i8.c
    trunk/libgfortran/generated/minloc1_16_r10.c
    trunk/libgfortran/generated/minloc1_16_r16.c
    trunk/libgfortran/generated/minloc1_16_r4.c
    trunk/libgfortran/generated/minloc1_16_r8.c
    trunk/libgfortran/generated/minloc1_4_i1.c
    trunk/libgfortran/generated/minloc1_4_i16.c
    trunk/libgfortran/generated/minloc1_4_i2.c
    trunk/libgfortran/generated/minloc1_4_i4.c
    trunk/libgfortran/generated/minloc1_4_i8.c
    trunk/libgfortran/generated/minloc1_4_r10.c
    trunk/libgfortran/generated/minloc1_4_r16.c
    trunk/libgfortran/generated/minloc1_4_r4.c
    trunk/libgfortran/generated/minloc1_4_r8.c
    trunk/libgfortran/generated/minloc1_8_i1.c
    trunk/libgfortran/generated/minloc1_8_i16.c
    trunk/libgfortran/generated/minloc1_8_i2.c
    trunk/libgfortran/generated/minloc1_8_i4.c
    trunk/libgfortran/generated/minloc1_8_i8.c
    trunk/libgfortran/generated/minloc1_8_r10.c
    trunk/libgfortran/generated/minloc1_8_r16.c
    trunk/libgfortran/generated/minloc1_8_r4.c
    trunk/libgfortran/generated/minloc1_8_r8.c
    trunk/libgfortran/generated/minval_i1.c
    trunk/libgfortran/generated/minval_i16.c
    trunk/libgfortran/generated/minval_i2.c
    trunk/libgfortran/generated/minval_i4.c
    trunk/libgfortran/generated/minval_i8.c
    trunk/libgfortran/generated/minval_r10.c
    trunk/libgfortran/generated/minval_r16.c
    trunk/libgfortran/generated/minval_r4.c
    trunk/libgfortran/generated/minval_r8.c
    trunk/libgfortran/generated/product_c10.c
    trunk/libgfortran/generated/product_c16.c
    trunk/libgfortran/generated/product_c4.c
    trunk/libgfortran/generated/product_c8.c
    trunk/libgfortran/generated/product_i1.c
    trunk/libgfortran/generated/product_i16.c
    trunk/libgfortran/generated/product_i2.c
    trunk/libgfortran/generated/product_i4.c
    trunk/libgfortran/generated/product_i8.c
    trunk/libgfortran/generated/product_r10.c
    trunk/libgfortran/generated/product_r16.c
    trunk/libgfortran/generated/product_r4.c
    trunk/libgfortran/generated/product_r8.c
    trunk/libgfortran/generated/sum_c10.c
    trunk/libgfortran/generated/sum_c16.c
    trunk/libgfortran/generated/sum_c4.c
    trunk/libgfortran/generated/sum_c8.c
    trunk/libgfortran/generated/sum_i1.c
    trunk/libgfortran/generated/sum_i16.c
    trunk/libgfortran/generated/sum_i2.c
    trunk/libgfortran/generated/sum_i4.c
    trunk/libgfortran/generated/sum_i8.c
    trunk/libgfortran/generated/sum_r10.c
    trunk/libgfortran/generated/sum_r16.c
    trunk/libgfortran/generated/sum_r4.c
    trunk/libgfortran/generated/sum_r8.c
    trunk/libgfortran/libgfortran.h
    trunk/libgfortran/m4/iforeach.m4
    trunk/libgfortran/m4/ifunction.m4
    trunk/libgfortran/m4/iparm.m4
    trunk/libgfortran/m4/maxloc0.m4
    trunk/libgfortran/m4/maxloc1.m4
    trunk/libgfortran/m4/maxval.m4
    trunk/libgfortran/m4/minloc0.m4
    trunk/libgfortran/m4/minloc1.m4
    trunk/libgfortran/m4/minval.m4

Comment 30 Tobias Burnus 2009-07-24 08:19:38 UTC
Regarding the just committed inline version: It would be interesting to know whether it is vectorizable (with/without -ffinite-math-only [i.e. -ffast-math]).

Additionally, for size-1 result arrays, the function should be inlined except at -O0 and -Os. That especially affects {min,max}loc, since for {min,max}val this looks like a very special case.
Comment 31 Jakub Jelinek 2009-07-24 08:30:15 UTC
Vectorization questions I'll defer to Ira.

For !optimize I even had a change to forcibly use the function call instead of the inline version.  But it didn't really work, as there are only array versions of the library functions.  In the case of minloc/maxloc we could just set up a fake array descriptor for a one-element array and call the library functions (the *loc0 ones).  But the minval/maxval library functions only handle the DIM=N cases, not finding the minimum/maximum of the whole array.
Comment 32 Ira Rosen 2009-07-26 07:48:24 UTC
(In reply to comment #30)
> Regarding the just committed inline version: It would be interesting to know
> whether it is vectorizable (with/without -ffinite-math-only [i.e.
> -ffast-math]).

It depends on where it is inlined. It has to be vectorized as an outer loop (see my previous comment), so it needs another loop around it.

Ira

Comment 33 Tobias Burnus 2009-07-26 09:50:40 UTC
(In reply to comment #32)
> > Regarding the just committed inline version: It would be interesting to know
> > whether it is vectorizable (with/without -ffinite-math-only [i.e.
> > -ffast-math]).
> 
> It depends on where it is inlined. It has to be vectorized in outer loop (see
> my previous comment), so it needs another loop around it.

Using the example from comment 23 with
a) gfortran -O3 -ffast-math -march=native -ftree-vectorize -ftree-vectorizer-verbose=5
b) ifort -O3 -xHost -diag-enable all

ifort shows: test.f90(12): (col. 9) remark: LOOP WAS VECTORIZED.
and needs 1.476s.

gfortran shows: test.f90:12: note: not vectorized: unsupported use in stmt.
and needs 2.272s. (By comparison, 4.4 needs 3.688s.)
Comment 34 Ira Rosen 2009-07-27 08:36:16 UTC
(In reply to comment #33)
> Using the example from comment 23 with
...
> gfortran shows: test.f90:12: note: not vectorized: unsupported use in stmt.
> and needs 2.272s. (By comparison. 4.4 needs 3.688s.)

This is for the inner loop vectorization. For the outer loop we get:
tmp.f90:11: note: not vectorized: control flow in loop.
because of the if's. Maybe loop unswitching can help us. 
Vectorizable outer-loops look like this:

                        (pre-header)
                           |
                          header <---+
                           |         |
                          inner-loop |
                           |         |
                          tail ------+
                           |
                        (exit-bb)


Does ifort vectorize the exact same implementation of minloc?

Ira

Comment 35 Tobias Burnus 2009-07-27 09:18:18 UTC
(In reply to comment #34)
> Does ifort vectorize the exact same implementation of minloc?

I tried to convert the minloc implementation into Fortran loops - and the result is at http://users.physik.fu-berlin.de/~tburnus/tmp/vect-PR31067/maxloc2.f90

$ ifort -O3 -xHost -diag-enable all maxloc2.f90
maxloc2.f90(25): (col. 5) remark: LOOP WAS VECTORIZED.
[timing: 0m1.384s]

$ gfortran -O3 -ffast-math -march=native -ftree-vectorize -ftree-vectorizer-verbose=5 maxloc2.f90
maxloc2.f90:1: note: vectorized 0 loops in function.
[timing: 0m2.212s]

In case it helps: I put the ifort assembler output at:
http://users.physik.fu-berlin.de/~tburnus/tmp/vect-PR31067/maxloc2.s
Comment 36 Jakub Jelinek 2009-07-27 11:02:50 UTC
Here is the loop in C and vectorized by hand as well:
#include <emmintrin.h>

float arr[1024];

unsigned int
foo (unsigned int end)
{
  unsigned int pos = 1;
  unsigned int i;
  float limit = __FLT_MAX__;
  for (i = 0; i < end; i++)
    if (arr[i] < limit)
      {
        limit = arr[i];
        pos = i + 1;
      }
  return pos;
}

unsigned int
bar (unsigned int end)
{
  __m128 pos = (__m128) _mm_set1_epi32 (1);
  __m128 limit = _mm_set1_ps (__FLT_MAX__);
  __m128i curi = _mm_set_epi32 (4, 3, 2, 1);
  __m128i inc = _mm_set1_epi32 (4);
  unsigned int i = 0;
  if (end > 4)
    {
      for (; i < end - 4; i += 4)
        {
          __m128 val = _mm_loadu_ps (arr + i);
          __m128 mask = _mm_cmplt_ps (val, limit);
          limit = _mm_min_ps (limit, val);
          pos = _mm_andnot_ps (mask, pos);
          pos = _mm_or_ps (pos, _mm_and_ps (mask, (__m128) curi));
          curi = _mm_add_epi32 (curi, inc);
        }
      /* Reduction.  */
      __m128 tmp1 = _mm_movehl_ps (limit, limit);
      __m128 tmp2 = _mm_movehl_ps (pos, pos);
      __m128 mask = _mm_cmplt_ps (tmp1, limit);
      limit = _mm_min_ps (tmp1, limit);
      tmp2 = _mm_and_ps (mask, tmp2);
      pos = _mm_or_ps (tmp2, _mm_andnot_ps (mask, pos));
      tmp1 = _mm_shuffle_ps (limit, limit, _MM_SHUFFLE (1, 1, 1, 1));
      tmp2 = _mm_shuffle_ps (pos, pos, _MM_SHUFFLE (1, 1, 1, 1));
      mask = _mm_cmplt_ps (tmp1, limit);
      limit = _mm_min_ps (tmp1, limit);
      tmp2 = _mm_and_ps (mask, tmp2);
      pos = _mm_or_ps (tmp2, _mm_andnot_ps (mask, pos));
    }
  float limit_ = _mm_cvtss_f32 (limit);
  unsigned int pos_ = (unsigned int) _mm_cvtsi128_si32 ((__m128i) pos);
  for (; i < end; i++)
    if (arr[i] < limit_)
      {
        limit_ = arr[i];
        pos_ = i + 1;
      }
  return pos_;
}

int
main (void)
{
  unsigned int k;
  arr[0] = -1;
  arr[2] = -3;
  arr[8] = -5;
  arr[9] = -6;
  if (foo (32) != bar (32))
    __builtin_abort ();
  for (k = 10; k < 32; k++)
    {
      arr[k] = -k;
      if (foo (32) != bar (32))
        __builtin_abort ();
    }
  return 0;
}

I don't know how hard it would be to vectorize this in the vectorizer, but clearly icc manages to handle it.
The loop is:
<bb 4>:
  # pos_22 = PHI <pos_1(7), 1(3)>
  # i_23 = PHI <i_15(7), 0(3)>
  # limit_24 = PHI <limit_4(7), 3.4028234663852885981170418348451692544e+38(3)>
  limit_11 = arr[i_23];
  D.2700_12 = limit_11 < limit_24;
  pos_1 = [cond_expr] D.2700_12 ? i_23 : pos_22;
  limit_4 = [cond_expr] D.2700_12 ? limit_11 : limit_24;
  i_15 = i_23 + 1;
  D.2703_9 = (long unsigned int) i_15;
  if (D.2703_9 < end_10(D))
    goto <bb 7>;
  else
    goto <bb 8>;

<bb 7>:
  goto <bb 4>;
before vectorization.
Comment 37 Jakub Jelinek 2009-07-27 11:10:12 UTC
Oh, and on 64-bit arches with float, or 32-bit arches with double, there is another complication - the comparison has a different mode size from the cond_expr for pos.  For 32-bit pos and 64-bit double it could perhaps just do the computation in 64-bit integers (a vector of 2, resp. 4 for AVX); for the other case it would need to shuffle the max and compute the pos in 2 vectors, or e.g. the Fortran FE could, for the common case where the array size is smaller than 4GB, emit a likely loop using a 32-bit pos zero-extended to 64 bits at the end.
Comment 38 Ira Rosen 2009-07-27 12:44:44 UTC
I am not sure that kind of computation can be generated automatically, since in general the order of calculation of cond_exprs cannot be changed. 

However, the loop can be split:

  for (i = 0; i < end; i++)
    if (arr[i] < limit)
      limit = arr[i];

  for (i = 0; i < end; i++)
    if (arr[i] == limit)
      {
        pos = i + 1;
        break;
      }

making the first loop vectorizable (inner-most loop vectorization).

Ira
Comment 39 Tobias Burnus 2009-07-27 13:15:16 UTC
(In reply to comment #38)
> However, the loop can be split: [..]
> making the first loop vectorizable (inner-most loop vectorization).

OK. I tried it with a Fortran program:
http://users.physik.fu-berlin.de/~tburnus/tmp/vect-PR31067/maxloc.f90
http://users.physik.fu-berlin.de/~tburnus/tmp/vect-PR31067/maxloc2.f90
http://users.physik.fu-berlin.de/~tburnus/tmp/vect-PR31067/maxloc3.f90

maxloc.f90 is the program from comment 34 (maxloc.s = Intel assembler)
maxloc2.f90 models what gfortran does for maxloc (maxloc2.s = Intel assembler)
maxloc3.f90 models the same computation with the loop split

The splitting plus vectorization makes the calculation 5% faster - 0m2.152s (maxloc3) vs 0m2.260s (maxloc2). Still, that's 35% more than ifort needs.

For some reason, maxloc2 with -fno-tree-vectorize takes only 0m1.840s (identical to Intel's result for maxloc2/maxloc3), while for the original maxloc.f90 there is no performance effect, and for maxloc3 vectorization makes it faster.
Comment 40 Jakub Jelinek 2009-07-27 14:51:09 UTC
If one cond_expr computes a minimum or maximum and the other cond_exprs compute something based on the IV at the extremum, then I don't see why it couldn't be vectorized by computing the extremes of the odd/even elements and the corresponding IV-based values at those points, then merging them in the final step; similarly for bigger vectorization factors.
Comment 41 Ira Rosen 2009-07-28 08:12:25 UTC
That requires pattern recognition. MIN/MAX_EXPR are recognized by the first phiopt pass, so MIN/MAXLOC should also be recognized either there or in the vectorizer. (The phiopt pass transforms the if clause into a MIN/MAX_EXPR; the vectorizer gets a COND_EXPR after the if-conversion pass.)
Comment 42 Richard Biener 2011-07-25 15:39:54 UTC
With gas_dyn changed to use MINLOC (DTEMP, 1) we now inline the intrinsic
(but not with MINLOC (DTEMP), even though we know it'll be a single-element
array result ...).

We completely lack a way to fuse the loops though.  Inlining the intrinsic
gives a moderate 5% speedup.
Comment 43 Jakub Jelinek 2011-07-28 13:51:32 UTC
Created attachment 24856 [details]
gcc47-pr31067.patch

Patch to optimize a = minloc (b) for rank-1 b into a = minloc (b, dim = 1);
according to the standard, the latter function is supposed to return the first element of the array returned by the former function (which returns a rank-1, one-element array).  So, by instead initializing the (one-element) array with the scalar we can get it actually inlined.  Doing this during genericization looked much harder to me.
Comment 44 Jakub Jelinek 2011-07-28 16:00:54 UTC
Created attachment 24858 [details]
gcc47-pr31067.patch

Based on IRC discussions, this patch instead replaces
MINLOC (rank1)
with
(/ MINLOC (rank1, DIM=1) /)
so that it should work even with an allocatable LHS that should be reallocated on assignment etc.  As a bonus, even e.g.
if (any (minloc (rank1).ne.6))
etc. can be optimized.  So far tested just with dg.exp=m*.f90
Comment 45 Jakub Jelinek 2011-07-28 20:56:57 UTC
Author: jakub
Date: Thu Jul 28 20:56:50 2011
New Revision: 176897

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=176897
Log:
	PR fortran/31067
	* frontend-passes.c (optimize_minmaxloc): New function.
	(optimize_expr): Call it.

	* gfortran.dg/maxloc_2.f90: New test.
	* gfortran.dg/maxloc_3.f90: New test.
	* gfortran.dg/minloc_1.f90: New test.
	* gfortran.dg/minloc_2.f90: New test.
	* gfortran.dg/minloc_3.f90: New test.
	* gfortran.dg/minmaxloc_7.f90: New test.

Added:
    trunk/gcc/testsuite/gfortran.dg/maxloc_2.f90
    trunk/gcc/testsuite/gfortran.dg/maxloc_3.f90
    trunk/gcc/testsuite/gfortran.dg/minloc_1.f90
    trunk/gcc/testsuite/gfortran.dg/minloc_2.f90
    trunk/gcc/testsuite/gfortran.dg/minloc_3.f90
    trunk/gcc/testsuite/gfortran.dg/minmaxloc_7.f90
Modified:
    trunk/gcc/fortran/ChangeLog
    trunk/gcc/fortran/frontend-passes.c
    trunk/gcc/testsuite/ChangeLog
Comment 46 Tobias Burnus 2011-07-29 07:12:47 UTC
(In reply to comment #45)
[Commit to inline MINLOC/MAXLOC for a rank-1 array, which returns a single-element rank-1 array.]


On my ~5 year old Athlon64 x2, I get with "-Ofast -march=native" (and with or without -fwhole-program) a performance improvement of 3% (10.72s -> 10.41s).


The performance should further improve if one fuses the loops (cf. comment 42, but also comment 34 ff.) - and if one could move the memory allocation/freeing of the automatic array DTEMP out of the loop (after inlining). (Recall that with -fstack-arrays/-Ofast, automatic arrays are allocated on the stack.)


As already mentioned indirectly in comment 0 (via PR31066): If one uses reciprocal
approximation instructions and a Newton-Raphson step (namely: -mrecip), the performance improves a lot: 6.895s. By comparison, with ifort (11.1) -xHost -O3 the run time is 7.319s.