This is the mail archive of the mailing list for the GCC project.
Re: VLIW scheduling and delayed branch
- From: Hariharan Sandanagobalane <hariharans at picochip dot com>
- To: Thomas Sailer <sailer at ife dot ee dot ethz dot ch>
- Cc: gcc at gcc dot gnu dot org, jeffchen at magima dot com dot cn
- Date: Mon, 10 Dec 2007 10:17:49 +0000
- Subject: Re: VLIW scheduling and delayed branch
Thanks for your reply. A couple of questions below.
Thomas Sailer wrote:
>> Has anyone faced a similar problem before? Are there targets for which
>> both VLIW and DBR are enabled? Perhaps ia64?
> I did something similar a few months ago.
What was your target? Is the target code available in GCC mainline? If
not, could you send me your code?
> The problem is that haifa and the delayed branch scheduling passes don't
> really fit together. Delayed branch scheduling happily undoes all the
> The question is how much you gain by delayed branch scheduling. I don't
> have numbers, but it wasn't much in my case. And since your company name
> is picochip, you certainly value size more than speed?!
Yeah, we do. But in our architecture, a branch has to have a delay-slot
instruction anyway; in the absence of one, we put a "nop" there. If
GCC manages to move a VLIW bundle containing just a "single" instruction
into the delay slot, we benefit in both size and speed; otherwise, there
is no impact on either.
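To make the size/speed argument concrete, here is a toy cost model in Python. It is only a sketch under assumed simplifications, not a description of the real target: every bundle issues in one cycle, every instruction (nop included) occupies one instruction word, and a branch is always followed by exactly one delay-slot bundle. The names block_cost and Cost are made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical toy model (not GCC code). Assumptions: each VLIW bundle
# issues in one cycle, each instruction (including a nop) is one word,
# and a branch is followed by exactly one delay-slot bundle.

@dataclass
class Cost:
    cycles: int
    words: int

def block_cost(widths, fill_delay_slot):
    """Cost of bundles widths[0..n-1] (insns per bundle) ending in a branch.

    If fill_delay_slot is True and the last bundle before the branch holds
    a single instruction, that bundle moves into the delay slot; otherwise
    the slot is filled with a nop.
    """
    movable = fill_delay_slot and bool(widths) and widths[-1] == 1
    kept = widths[:-1] if movable else widths
    cycles = len(kept) + 2   # remaining bundles + branch + delay slot
    words = sum(kept) + 2    # remaining insns + branch + (moved insn or nop)
    return Cost(cycles, words)

print(block_cost([2, 1], fill_delay_slot=False))  # Cost(cycles=4, words=5)
print(block_cost([2, 1], fill_delay_slot=True))   # Cost(cycles=3, words=4)
```

Moving the single-instruction bundle saves one cycle and one word; if the last bundle is wider (e.g. `[1, 2]`), nothing moves and the cost is the same as with a nop, which is the "no impact" case.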
> I pursued two approaches. The first one was to insert "stop bit" pseudo
> insns into the RTL stream in machdep reorg, so I didn't have to rely on
> TImode insn flags during output. But then delayed branch scheduling just
> took one insn out of an insn group and put it into the delay slot,
> meaning there was usually no cycle gain at all, just larger code size
> (due to insn duplication).
This seems fairly straightforward to implement.
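Why the first approach gains no cycles can be shown with a similar toy model. The STOP marker stands in for the "stop bit" pseudo insns, and steal_last_insn is a hypothetical filler that, like the behaviour described above, picks one insn at a time without regard to group boundaries. This is a sketch, not GCC's reorg code, and it does not model the code-size growth from insn duplication.

```python
# Hypothetical toy model (not GCC code): a STOP pseudo-insn ends each
# issue group, and the delay-slot filler works one insn at a time.
STOP = "STOP"

def count_groups(stream):
    """Number of non-empty issue groups (i.e. cycles) in the stream."""
    groups, current = 0, 0
    for insn in stream:
        if insn == STOP:
            if current:
                groups += 1
            current = 0
        else:
            current += 1
    return groups + (1 if current else 0)

def steal_last_insn(stream):
    """Move the last real insn into the delay slot, ignoring group boundaries.

    Returns (remaining stream, delay-slot insn); the slot gets "nop" if
    there is no real insn to take.
    """
    for i in range(len(stream) - 1, -1, -1):
        if stream[i] != STOP:
            return stream[:i] + stream[i + 1:], stream[i]
    return list(stream), "nop"

two_wide = ["add", "mul", STOP, "ld", "sub", STOP]
rest, slot = steal_last_insn(two_wide)
print(count_groups(two_wide), count_groups(rest), slot)  # 2 2 sub

one_wide = ["add", "mul", STOP, "ld", STOP]
rest, slot = steal_last_insn(one_wide)
print(count_groups(one_wide), count_groups(rest), slot)  # 2 1 ld
```

Taking "sub" out of a two-insn group leaves that group in place, so the cycle count does not drop; only when the stolen insn was alone in its group does a cycle disappear.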
> The second approach was having lots of parallel insns (using match_parallel
> and a custom predicate). In machdep reorg, I then convert each insn bundle
> into a single parallel insn, and delayed branch scheduling does the right
> thing. This approach works fairly well for me, but there are a few
> complications. My output code is pretty hackish, as I didn't want to
> duplicate the code for outputting a single insn and for outputting the
> same insn as a component of a parallel insn group.
When do you un-parallel those instructions? And, how?
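For comparison, a toy model of the second approach described above: once each bundle is a single unit (a Python tuple here, standing in for the parallel insn built in machdep reorg), a filler that moves whole units takes an entire issue group, and the code before the branch loses a cycle. The starts_cycle flag loosely stands in for the per-insn new-cycle marking (in the spirit of the TImode insn flags mentioned above). All names are hypothetical, and the assumption that any bundle fits in the delay slot may not hold on a given target.

```python
# Hypothetical toy model (not GCC code) of "one unit per bundle".

def bundle(insns):
    """Collapse (name, starts_cycle) pairs into tuples, one per issue group.

    The first insn always opens a bundle, even if its flag is False.
    """
    bundles = []
    for name, starts_cycle in insns:
        if starts_cycle or not bundles:
            bundles.append((name,))
        else:
            bundles[-1] = bundles[-1] + (name,)
    return bundles

def fill_delay_slot(bundles):
    """Unit-granular filler: move the whole last bundle into the delay slot.

    Assumes the slot can hold any bundle; returns ("nop",) when empty.
    """
    if not bundles:
        return bundles, ("nop",)
    return bundles[:-1], bundles[-1]

insns = [("add", True), ("mul", False), ("ld", True), ("sub", False)]
body = bundle(insns)                 # [('add', 'mul'), ('ld', 'sub')]
rest, slot = fill_delay_slot(body)
print(len(body), len(rest), slot)    # 2 1 ('ld', 'sub')
```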