[Bug middle-end/40589] New: efficiency problem with V2HI add
Tom dot de dot Vries at Nxp dot com
gcc-bugzilla@gcc.gnu.org
Mon Jun 29 15:06:00 GMT 2009
I am working on a port for the TriMedia processor family, and I was playing
around with the following example (extracted from
gcc.c-torture/execute/simd-2.c) to see how well our port takes advantage of the
addv2hi3 operator of our tm3271 processor.
test.c:
...
typedef short __attribute__((vector_size (N))) vecint;
vecint i, j, k;
void f () { k = i + j; }
...
test.c.016t.veclower, N=4. This looks good: addv2hi3 has been used once.
...
vector short int k.2;
vector short int j.1;
vector short int i.0;
i.0 = i;
j.1 = j;
k.2 = i.0 + j.1;
k = k.2;
}
...
test.c.016t.veclower, N=32. This also looks good: addv2hi3 has been used 8 times.
...
vector short unsigned int D.1445;
vector short unsigned int D.1444;
vector short unsigned int D.1443;
vector short unsigned int D.1442;
vector short unsigned int D.1441;
vector short unsigned int D.1440;
vector short unsigned int D.1439;
vector short unsigned int D.1438;
vector short unsigned int D.1437;
vector short unsigned int D.1436;
vector short unsigned int D.1435;
vector short unsigned int D.1434;
vector short unsigned int D.1433;
vector short unsigned int D.1432;
vector short unsigned int D.1431;
vector short unsigned int D.1430;
vector short unsigned int D.1429;
vector short unsigned int D.1428;
vector short unsigned int D.1427;
vector short unsigned int D.1426;
vector short unsigned int D.1425;
vector short unsigned int D.1424;
vector short unsigned int D.1423;
vector short unsigned int D.1422;
vector short int k.2;
vector short int j.1;
vector short int i.0;
i.0 = i;
j.1 = j;
D.1422 = BIT_FIELD_REF <i.0, 32, 0>;
D.1423 = BIT_FIELD_REF <j.1, 32, 0>;
D.1424 = D.1422 + D.1423;
D.1425 = BIT_FIELD_REF <i.0, 32, 32>;
D.1426 = BIT_FIELD_REF <j.1, 32, 32>;
D.1427 = D.1425 + D.1426;
D.1428 = BIT_FIELD_REF <i.0, 32, 64>;
D.1429 = BIT_FIELD_REF <j.1, 32, 64>;
D.1430 = D.1428 + D.1429;
D.1431 = BIT_FIELD_REF <i.0, 32, 96>;
D.1432 = BIT_FIELD_REF <j.1, 32, 96>;
D.1433 = D.1431 + D.1432;
D.1434 = BIT_FIELD_REF <i.0, 32, 128>;
D.1435 = BIT_FIELD_REF <j.1, 32, 128>;
D.1436 = D.1434 + D.1435;
D.1437 = BIT_FIELD_REF <i.0, 32, 160>;
D.1438 = BIT_FIELD_REF <j.1, 32, 160>;
D.1439 = D.1437 + D.1438;
D.1440 = BIT_FIELD_REF <i.0, 32, 192>;
D.1441 = BIT_FIELD_REF <j.1, 32, 192>;
D.1442 = D.1440 + D.1441;
D.1443 = BIT_FIELD_REF <i.0, 32, 224>;
D.1444 = BIT_FIELD_REF <j.1, 32, 224>;
D.1445 = D.1443 + D.1444;
k.2 = {D.1424, D.1427, D.1430, D.1433, D.1436, D.1439, D.1442, D.1445};
k = k.2;
...
test.c.016t.veclower, N=8. This does not look good: addv2hi3 has not been
used. addsi3 has been used 4 times, whereas 2 uses of addv2hi3 would have
sufficed.
...
short int D.1431;
short int D.1430;
short int D.1429;
short int D.1428;
short int D.1427;
short int D.1426;
short int D.1425;
short int D.1424;
short int D.1423;
short int D.1422;
short int D.1421;
short int D.1420;
vector short int k.2;
vector short int j.1;
vector short int i.0;
i.0 = i;
j.1 = j;
D.1420 = BIT_FIELD_REF <i.0, 16, 0>;
D.1421 = BIT_FIELD_REF <j.1, 16, 0>;
D.1422 = D.1420 + D.1421;
D.1423 = BIT_FIELD_REF <i.0, 16, 16>;
D.1424 = BIT_FIELD_REF <j.1, 16, 16>;
D.1425 = D.1423 + D.1424;
D.1426 = BIT_FIELD_REF <i.0, 16, 32>;
D.1427 = BIT_FIELD_REF <j.1, 16, 32>;
D.1428 = D.1426 + D.1427;
D.1429 = BIT_FIELD_REF <i.0, 16, 48>;
D.1430 = BIT_FIELD_REF <j.1, 16, 48>;
D.1431 = D.1429 + D.1430;
k.2 = {D.1422, D.1425, D.1428, D.1431};
k = k.2;
...
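The lowering one would hope for at N=8 can be sketched by hand in C: split each 8-byte vector into two 4-byte halves, add the halves as vectors, and reassemble. This is only an illustration using the GCC vector extension. The names v2hi, v4hi and add_v4hi_via_v2hi are made up for the sketch, and it assumes a 16-bit short.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical type names: two and four 16-bit shorts respectively.  */
typedef short v2hi __attribute__((vector_size (4)));
typedef short v4hi __attribute__((vector_size (8)));

/* Add two 8-byte vectors using two 4-byte vector additions instead of
   four scalar additions.  */
static void
add_v4hi_via_v2hi (v4hi *dst, const v4hi *a, const v4hi *b)
{
  v2hi a_half[2], b_half[2], sum_half[2];

  assert (sizeof (v4hi) == 2 * sizeof (v2hi));

  /* Split each operand into its low and high V2HI halves.  */
  memcpy (a_half, a, sizeof *a);
  memcpy (b_half, b, sizeof *b);

  /* One vector addition per half; on a target with a native addv2hi3
     each of these could map to a single instruction.  */
  sum_half[0] = a_half[0] + b_half[0];
  sum_half[1] = a_half[1] + b_half[1];

  /* Reassemble the 8-byte result.  */
  memcpy (dst, sum_half, sizeof *dst);
}
```

Calling add_v4hi_via_v2hi (&k, &i, &j) with v4hi globals computes the same value as k = i + j in test.c with N=8.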
This grep counts the additions in each veclower dump and illustrates that the
problem only occurs for N=8 and N=16:
...
$ for N in 4 8 16 32 64; do \
rm -f *.c.* ; \
cc1 test.c -quiet -march=tm3271 -O2 -DN=${N} \
-fdump-rtl-all -fdump-tree-all \
&& grep -c '+' test.c.016t.veclower ; \
done
1
4
8
8
16
...
So why does the problem occur? Let's look at TYPE_MODE (type) in
expand_vector_operations_1() for different values of N:
...
N=4 V2HI
N=8 DImode
N=16 TImode
N=32 BLKmode
N=64 BLKmode
...
For DImode and TImode we don't generate efficient code, because of the test on
BLKmode:
...
/* For very wide vectors, try using a smaller vector mode. */
compute_type = type;
if (TYPE_MODE (type) == BLKmode && op)
...
in expand_vector_operations_1(). For my target, which has a native addv2hi3
operator, DImode and TImode can also be considered 'wide vectors'.
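As a rough illustration of the effect of that condition, the toy model below reproduces the addition counts from the grep output, with and without the BLKmode requirement. The enum, the function names and the mapping from N to mode are assumptions made for the sketch (the mapping simply copies the table above). It is not the code in tree-vect-generic.c.

```c
#include <assert.h>

/* Toy stand-ins for the modes listed in the table above.  */
enum toy_mode { TOY_V2HI, TOY_DI, TOY_TI, TOY_BLK };

/* Mode assigned to a vector of shorts that is N_BYTES wide.  */
static enum toy_mode
toy_type_mode (unsigned n_bytes)
{
  switch (n_bytes)
    {
    case 4:  return TOY_V2HI;
    case 8:  return TOY_DI;
    case 16: return TOY_TI;
    default: return TOY_BLK;
    }
}

/* Number of additions left after lowering.  REQUIRE_BLKMODE nonzero
   models the "TYPE_MODE (type) == BLKmode && op" test; zero models a
   test on "op" alone.  */
static unsigned
toy_add_count (unsigned n_bytes, int require_blkmode)
{
  enum toy_mode mode = toy_type_mode (n_bytes);

  assert (n_bytes % 4 == 0);

  if (mode == TOY_V2HI)
    return 1;                 /* natively supported, one addv2hi3 */
  if (mode == TOY_BLK || !require_blkmode)
    return n_bytes / 4;       /* split into V2HI-sized pieces */
  return n_bytes / 2;         /* element-wise, one add per short */
}
```

With the BLKmode requirement the model gives 1, 4, 8, 8, 16 for N = 4, 8, 16, 32, 64, matching the grep output. Without it the counts become 1, 2, 4, 8, 16.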
With the following patch, addv2hi3 is also generated for N=8 and N=16:
...
Index: tree-vect-generic.c
===================================================================
--- tree-vect-generic.c (revision 14)
+++ tree-vect-generic.c (working copy)
@@ -462,7 +462,7 @@
/* For very wide vectors, try using a smaller vector mode. */
compute_type = type;
- if (TYPE_MODE (type) == BLKmode && op)
+ if (op)
{
tree vector_compute_type
= type_for_widest_vector_mode (TYPE_MODE (TREE_TYPE (type)), op,
...
Furthermore, I think this patch (in the style of
expmed.c:extract_bit_field_1()) could be useful:
...
Index: tree-vect-generic.c
===================================================================
--- tree-vect-generic.c (revision 14)
+++ tree-vect-generic.c (working copy)
@@ -35,6 +35,7 @@
#include "tree-pass.h"
#include "flags.h"
#include "ggc.h"
+#include "target.h"
/* Build a constant of type TYPE, made of VALUE's bits replicated
@@ -369,6 +370,7 @@
for (; mode != VOIDmode; mode = GET_MODE_WIDER_MODE (mode))
if (GET_MODE_INNER (mode) == inner_mode
&& GET_MODE_NUNITS (mode) > best_nunits
+ && targetm.vector_mode_supported_p(mode)
&& optab_handler (op, mode)->insn_code != CODE_FOR_nothing)
best_mode = mode, best_nunits = GET_MODE_NUNITS (mode);
...
It automatically disables an addv4hi3 pattern if V4HI is disabled in
TARGET_VECTOR_MODE_SUPPORTED_P.
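The effect of the extra condition can be sketched with a small stand-alone model of the widest-mode search. The struct, its fields and toy_widest_mode are hypothetical. They stand in for the mode loop, optab_handler and targetm.vector_mode_supported_p and are not GCC interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one candidate vector mode.  */
struct toy_vec_mode
{
  const char *name;
  unsigned nunits;      /* number of elements in the mode */
  bool has_add_insn;    /* an add pattern exists for this mode */
  bool supported;       /* the target reports the mode as supported */
};

/* Return the candidate with the most elements that has an add pattern
   and, if CHECK_SUPPORTED is set, is also reported as supported.
   Returns NULL when no candidate qualifies.  */
static const struct toy_vec_mode *
toy_widest_mode (const struct toy_vec_mode *modes, size_t n,
                 bool check_supported)
{
  const struct toy_vec_mode *best = NULL;
  unsigned best_nunits = 0;

  assert (modes != NULL || n == 0);

  for (size_t i = 0; i < n; i++)
    if (modes[i].nunits > best_nunits
        && (!check_supported || modes[i].supported)
        && modes[i].has_add_insn)
      {
        best = &modes[i];
        best_nunits = modes[i].nunits;
      }
  return best;
}
```

Given a V2HI entry that is supported and a V4HI entry that has an add pattern but is not supported, the search picks V4HI without the check and V2HI with it.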
--
Summary: efficiency problem with V2HI add
Product: gcc
Version: 4.3.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: middle-end
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: Tom dot de dot Vries at Nxp dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40589