Bug 40589 - efficiency problem with V2HI add
Summary: efficiency problem with V2HI add
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 4.3.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-06-29 15:06 UTC by Tom de Vries
Modified: 2009-06-29 15:40 UTC

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Tom de Vries 2009-06-29 15:06:09 UTC
I am working on a port for the TriMedia processor family, and I was playing around with the following example (extracted from gcc.c-torture/execute/simd-2.c) to see how well our port takes advantage of the addv2hi3 operator of our tm3271 processor.

test.c:
...
typedef short __attribute__((vector_size (N))) vecint;
vecint i, j, k;
void f () {  k = i + j; }
...

test.c.016t.veclower, N=4. This looks good: the addv2hi3 has been used once.
...
  vector short int k.2;
  vector short int j.1;
  vector short int i.0;
  i.0 = i;
  j.1 = j;
  k.2 = i.0 + j.1;
  k = k.2;
}
...

test.c.016t.veclower, N=32. This also looks good: the addv2hi3 has been used 8 times.
...
  vector short unsigned int D.1445;
  vector short unsigned int D.1444;
  vector short unsigned int D.1443;
  vector short unsigned int D.1442;
  vector short unsigned int D.1441;
  vector short unsigned int D.1440;
  vector short unsigned int D.1439;
  vector short unsigned int D.1438;
  vector short unsigned int D.1437;
  vector short unsigned int D.1436;
  vector short unsigned int D.1435;
  vector short unsigned int D.1434;
  vector short unsigned int D.1433;
  vector short unsigned int D.1432;
  vector short unsigned int D.1431;
  vector short unsigned int D.1430;
  vector short unsigned int D.1429;
  vector short unsigned int D.1428;
  vector short unsigned int D.1427;
  vector short unsigned int D.1426;
  vector short unsigned int D.1425;
  vector short unsigned int D.1424;
  vector short unsigned int D.1423;
  vector short unsigned int D.1422;
  vector short int k.2;
  vector short int j.1;
  vector short int i.0;
  i.0 = i;
  j.1 = j;
  D.1422 = BIT_FIELD_REF <i.0, 32, 0>;
  D.1423 = BIT_FIELD_REF <j.1, 32, 0>;
  D.1424 = D.1422 + D.1423;
  D.1425 = BIT_FIELD_REF <i.0, 32, 32>;
  D.1426 = BIT_FIELD_REF <j.1, 32, 32>;
  D.1427 = D.1425 + D.1426;
  D.1428 = BIT_FIELD_REF <i.0, 32, 64>;
  D.1429 = BIT_FIELD_REF <j.1, 32, 64>;
  D.1430 = D.1428 + D.1429;
  D.1431 = BIT_FIELD_REF <i.0, 32, 96>;
  D.1432 = BIT_FIELD_REF <j.1, 32, 96>;
  D.1433 = D.1431 + D.1432;
  D.1434 = BIT_FIELD_REF <i.0, 32, 128>;
  D.1435 = BIT_FIELD_REF <j.1, 32, 128>;
  D.1436 = D.1434 + D.1435;
  D.1437 = BIT_FIELD_REF <i.0, 32, 160>;
  D.1438 = BIT_FIELD_REF <j.1, 32, 160>;
  D.1439 = D.1437 + D.1438;
  D.1440 = BIT_FIELD_REF <i.0, 32, 192>;
  D.1441 = BIT_FIELD_REF <j.1, 32, 192>;
  D.1442 = D.1440 + D.1441;
  D.1443 = BIT_FIELD_REF <i.0, 32, 224>;
  D.1444 = BIT_FIELD_REF <j.1, 32, 224>;
  D.1445 = D.1443 + D.1444;
  k.2 = {D.1424, D.1427, D.1430, D.1433, D.1436, D.1439, D.1442, D.1445};
  k = k.2;
...

test.c.016t.veclower, N=8. This does not look good: the addv2hi3 has not been used at all. Instead, the addsi3 has been used 4 times, where 2 uses of the addv2hi3 would have sufficed.
...
  short int D.1431;
  short int D.1430;
  short int D.1429;
  short int D.1428;
  short int D.1427;
  short int D.1426;
  short int D.1425;
  short int D.1424;
  short int D.1423;
  short int D.1422;
  short int D.1421;
  short int D.1420;
  vector short int k.2;
  vector short int j.1;
  vector short int i.0;
  i.0 = i;
  j.1 = j;
  D.1420 = BIT_FIELD_REF <i.0, 16, 0>;
  D.1421 = BIT_FIELD_REF <j.1, 16, 0>;
  D.1422 = D.1420 + D.1421;
  D.1423 = BIT_FIELD_REF <i.0, 16, 16>;
  D.1424 = BIT_FIELD_REF <j.1, 16, 16>;
  D.1425 = D.1423 + D.1424;
  D.1426 = BIT_FIELD_REF <i.0, 16, 32>;
  D.1427 = BIT_FIELD_REF <j.1, 16, 32>;
  D.1428 = D.1426 + D.1427;
  D.1429 = BIT_FIELD_REF <i.0, 16, 48>;
  D.1430 = BIT_FIELD_REF <j.1, 16, 48>;
  D.1431 = D.1429 + D.1430;
  k.2 = {D.1422, D.1425, D.1428, D.1431};
  k = k.2;
...
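
For comparison, the lowering one would hope for at N=8 can be written out by hand. This is only a sketch using the GCC vector extension (Clang accepts it as well); the names `v2hi`, `v4hi` and `add_v4hi_by_halves` are made up here, and `memcpy` stands in for the BIT_FIELD_REF extraction and the final constructor. It says nothing about what the TriMedia port would actually emit.

```c
#include <assert.h>
#include <string.h>

/* Illustrative type names; vector_size is given in bytes. */
typedef short v2hi __attribute__((vector_size(4)));  /* 2 x 16 bit */
typedef short v4hi __attribute__((vector_size(8)));  /* 4 x 16 bit */

/* Add two 8-byte vectors using two 4-byte vector additions,
   i.e. the N=8 case lowered to V2HI pieces instead of scalars. */
void add_v4hi_by_halves(const v4hi *a, const v4hi *b, v4hi *out)
{
    v2hi a_half[2], b_half[2], sum_half[2];

    assert(a && b && out);

    /* Split each operand into its low and high 32-bit half. */
    memcpy(a_half, a, sizeof *a);
    memcpy(b_half, b, sizeof *b);

    /* Two vector adds instead of four scalar adds. */
    sum_half[0] = a_half[0] + b_half[0];
    sum_half[1] = a_half[1] + b_half[1];

    /* Reassemble the 8-byte result. */
    memcpy(out, sum_half, sizeof *out);
}
```

Because the addition is element-wise, the result is the same as a plain `*a + *b` on the full vector; only the number of add operations differs.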

This grep counts the additions per value of N and shows that the problem only occurs for N=8 and N=16 (4 and 8 additions, where 2 and 4 would suffice):
...
$ for N in 4 8 16 32 64; do \
  rm -f *.c.* ; \
  cc1 test.c -quiet -march=tm3271 -O2 -DN=${N} \
      -fdump-rtl-all -fdump-tree-all \
  && grep -c '+' test.c.016t.veclower ; \
done
1
4
8
8
16
...

So why does the problem occur? Let's look at TYPE_MODE (type) in expand_vector_operations_1() for different values of N:
...
N=4  V2HI
N=8  DImode
N=16 TImode
N=32 BLKmode
N=64 BLKmode
...

For DImode and TImode we don't generate efficient code, because of the test on BLKmode:
...
  /* For very wide vectors, try using a smaller vector mode.  */
  compute_type = type;
  if (TYPE_MODE (type) == BLKmode && op)
...
in expand_vector_operations_1(). For my target, which has a native addv2hi3 operator, DImode and TImode can also be considered a 'wide vector'.

With this patch, addv2hi3 is also generated for N=8 and N=16:
...
Index: tree-vect-generic.c
===================================================================
--- tree-vect-generic.c (revision 14)
+++ tree-vect-generic.c (working copy)
@@ -462,7 +462,7 @@
 
   /* For very wide vectors, try using a smaller vector mode.  */
   compute_type = type;
-  if (TYPE_MODE (type) == BLKmode && op)
+  if (op)
     {
       tree vector_compute_type
         = type_for_widest_vector_mode (TYPE_MODE (TREE_TYPE (type)), op,
...
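
To make the effect of the BLKmode test concrete, here is a small stand-alone C model (not GCC code) of the number of `+` statements before and after the change above. It assumes only what the dumps and the mode table show for this target: 2-byte elements, a 4-byte V2HI as the widest supported vector add, integer modes (DImode, TImode) for 8- and 16-byte vectors, and BLKmode beyond that. All identifiers in it are hypothetical.

```c
#include <assert.h>

enum { ELEM_BYTES = 2, NATIVE_VEC_BYTES = 4 };  /* short, V2HI */

/* Hypothetical stand-in for what TYPE_MODE reports on this target. */
enum model_mode { MODEL_V2HI, MODEL_INT, MODEL_BLK };

static enum model_mode model_type_mode(int vector_bytes)
{
    if (vector_bytes == NATIVE_VEC_BYTES)
        return MODEL_V2HI;
    if (vector_bytes == 8 || vector_bytes == 16)
        return MODEL_INT;               /* DImode, TImode */
    return MODEL_BLK;
}

/* Additions emitted while the split is guarded by the BLKmode test. */
int count_adds_current(int vector_bytes)
{
    assert(vector_bytes >= NATIVE_VEC_BYTES
           && vector_bytes % NATIVE_VEC_BYTES == 0);

    switch (model_type_mode(vector_bytes)) {
    case MODEL_V2HI:
        return 1;                                   /* native add */
    case MODEL_BLK:
        return vector_bytes / NATIVE_VEC_BYTES;     /* split into V2HI */
    default:
        return vector_bytes / ELEM_BYTES;           /* one add per element */
    }
}

/* Additions emitted when every vector is split into V2HI pieces. */
int count_adds_patched(int vector_bytes)
{
    assert(vector_bytes >= NATIVE_VEC_BYTES
           && vector_bytes % NATIVE_VEC_BYTES == 0);
    return vector_bytes / NATIVE_VEC_BYTES;
}
```

For N = 4, 8, 16, 32, 64 the first function gives 1, 4, 8, 8, 16, matching the grep output, and the second gives 1, 2, 4, 8, 16.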


Furthermore, I think this patch (in the style of expmed.c:extract_bit_field_1()) could be useful:
...
Index: tree-vect-generic.c
===================================================================
--- tree-vect-generic.c (revision 14)
+++ tree-vect-generic.c (working copy)
@@ -35,6 +35,7 @@
 #include "tree-pass.h"
 #include "flags.h"
 #include "ggc.h"
+#include "target.h"
 
 
 /* Build a constant of type TYPE, made of VALUE's bits replicated
@@ -369,6 +370,7 @@
   for (; mode != VOIDmode; mode = GET_MODE_WIDER_MODE (mode))
     if (GET_MODE_INNER (mode) == inner_mode
         && GET_MODE_NUNITS (mode) > best_nunits
+       && targetm.vector_mode_supported_p (mode)
        && optab_handler (op, mode)->insn_code != CODE_FOR_nothing)
       best_mode = mode, best_nunits = GET_MODE_NUNITS (mode);
...
With it, an addv4hi3 is automatically skipped when V4HI is disabled in TARGET_VECTOR_MODE_SUPPORTED_P.
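
The effect of the extra check can be illustrated with a small mock of the search loop. This is a sketch, not the GCC implementation: `struct mock_mode`, its fields and `widest_vector_mode` are invented for the example, with `has_add_insn` standing in for the optab_handler test and `supported` for the target hook.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a vector machine mode with HImode elements. */
struct mock_mode {
    const char *name;
    int nunits;        /* number of elements in the mode */
    int has_add_insn;  /* models optab_handler (...) != CODE_FOR_nothing */
    int supported;     /* models targetm.vector_mode_supported_p (mode) */
};

/* Return the widest mode that has an add pattern, or NULL if none does.
   When check_supported is nonzero, modes the target has disabled are
   skipped, which is what the added line in the patch does. */
const struct mock_mode *
widest_vector_mode(const struct mock_mode *modes, size_t n, int check_supported)
{
    const struct mock_mode *best = NULL;
    int best_nunits = 0;
    size_t i;

    assert(modes != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (modes[i].nunits > best_nunits
            && (!check_supported || modes[i].supported)
            && modes[i].has_add_insn) {
            best = &modes[i];
            best_nunits = modes[i].nunits;
        }
    }
    return best;
}
```

Given a table with V2HI (add pattern present, supported) and V4HI (add pattern present, but disabled by the hook), the search returns V4HI without the check and V2HI with it.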
Comment 1 Richard Biener 2009-06-29 15:40:05 UTC
Good observations.  Patches should be sent to gcc-patches@gcc.gnu.org together
with a changelog entry following existing practice and a note on how you tested
the patch.  See gcc.gnu.org/contribute.html.