H8/300H code size issues (RTX_COSTS)

tm tm@mail.kloo.net
Tue Nov 12 14:11:00 GMT 2002


BTW, I've been looking at code size issues on the H8/300H, and I've
found some interesting things lately.

On one particular testcase, blowfish.i from stress-1.17 compiled
with -O2 -int32 -mh, the code size on the H8/300H is almost twice as
large as the SH. The code size problem does not appear to be
caused by a single particular function in the file - all
functions appear to be bloated:

                          file offsets
file                    H8/300    SH
ssh_blowfish_encrypt:      0       0
ssh_blowfish_decrypt:    88a     3e0
ssh_blwofish_set_key:   1114     7c0
ssh_blowfish_ctxsize:   1362     940
ssh_blowfish_init:      1374     960
ssh_blowfish_ecb:       139a     980
ssh_blowfish_cbc:       1626     b20
ssh_blowfish_cfb:       1a62     d80
EOF                     21a0    115e

It appears most of the problem is caused by a misoptimization.

The blowfish code appears to do a lot of byte shuffling of the
form:

 ( yr  ^= ((((S[0 * 256 + (((    yl   ) >> (24)) & (0x00ff)) ] + S[1 * 256
+ (((    yl   ) >> (16)) & (0x00ff)) ]) ^ S[2 * 256 + (((    yl   ) >>
(8)) & (0x00ff)) ]) + S[3 * 256 + (((    yl   ) & 0x00ff)) ])  ^ P[  1
])) ;  ( yl  ^= ((((S[0 * 256 + (((    yr   ) >> (24)) & (0x00ff)) ] + S[1
* 256 + (((    yr   ) >> (16)) & (0x00ff)) ]) ^ S[2 * 256 + (((    yr
) >> (8)) & (0x00ff)) ]) + S[3 * 256 + (((    yr   ) & 0x00ff)) ])  ^ P[
2 ])) ;

What happens is the expression:

S[(yl >> 24) & 0xff]

is converted to:

*(int *) ((char *)(S << 2) + ((yl >> 22) & 0x3fc)

by combine.

The RTX before combine looks like this:

(insn 33 31 34 0 0x4015d44c (parallel [
            (set (reg:SI 30)
                (lshiftrt:SI (reg/v:SI 20)
                    (const_int 16 [0x10])))
            (clobber (scratch:QI))
        ]) 81 {*h8300.md:1823} (nil)
    (expr_list:REG_UNUSED (scratch:QI)
        (expr_list:REG_EQUAL (lshiftrt:SI (reg/v:SI 20)
                (const_int 16 [0x10]))
            (nil))))

(insn 34 33 37 0 0x4015d44c (set (reg:SI 31)
        (and:SI (reg:SI 30)
            (const_int 255 [0xff]))) 50 {*h8300.md:1173} (insn_list 33
(nil))
    (expr_list:REG_DEAD (reg:SI 30)
        (nil)))

The RTX after combine looks like this:

(insn 34 33 37 0 0x4015d44c (parallel [
            (set (reg:SI 31)
                (lshiftrt:SI (reg/v:SI 20)
                    (const_int 14 [0xe])))
            (clobber (scratch:QI))
        ]) 81 {*h8300.md:1823} (nil)
    (expr_list:REG_UNUSED (scratch:QI)
        (nil)))

(insn 37 34 38 0 0x4015d44c (set (reg:SI 33)
        (and:SI (reg:SI 31)
            (const_int 1020 [0x3fc]))) 50 {*h8300.md:1173} (insn_list 34
(nil))
    (expr_list:REG_DEAD (reg:SI 31)
        (nil)))

This converts the ((x >> 24) & 0xff) which can easily be done by:

        mov.w   em,rm           ; 0D xx, 2 states
        mov.b   rmh,rml         ; 0C xx, 2 states
        extu.w  rm              ; 17 5x, 2 states
        extu.l  erm             ; 17 7x, 2 states

                                ; total: 8 bytes, 8 states

into a very horrible ((x >> 22) & 0x3fc) which is currently performed
as a loop of single-bit shifts:

        (shift by 8 elsewhere)
        mov.b   #0xe,r4l        ; FC 0E, 2 states
loop:
        shlr.l  er3             ; 11 33, 2 states
        add.b   #0xff,r4l       ; 8C FF, 2 states
        bne     loop            ; 46 00, 4 states
        and.l   #0x3fc,er3      ; 7A 63 00 00 03 FC, 6 states

                                ; total: 14 bytes,
                                ;        8 + (8 * 14) = 120 states!

I'm guessing the root problem is that RTX_COSTS for the H8/300H is not
properly defined.

The combiner checks if ((x >> 22) & 0x3fc) is cheaper than (((x >> 24) &
0xff) << 2). RTX_COSTS is broken, and returns 4 for most operations,
therefore it evaluates 8 < 12 and performs the substitution.

The two biggest problems on the H8/300H currently seem to be:

1. 32-bit shifts are still performed as loops of single-bit shifts.

2. RTX_COSTS is broken, which makes the combiner perform pessimal
   combinations, which makes code worse rather than better.

These two factors combined seem to cause a 2x code size increase
(and significant performance problems) on code which implement
shift-intensive algorithms such as cryptographic encryption and
decryption.

Toshi




More information about the Gcc-bugs mailing list