This is the mail archive of the gcc-help@gcc.gnu.org mailing list for the GCC project.



Re: generating unaligned vector load instructions?


On 19.09.2013 at 03:27, Tim Prince <n8tm@aol.com> wrote:

On 9/18/2013 7:01 PM, Norbert Lange wrote:
Hello Tim,

Can you specify which versions, maybe post the command line, or try compiling for 32-bit (-m32 switch)? Also, I don't understand the comment about splitting. To avoid misunderstanding: the generated code segfaults on my Athlon X2, so it's not a question of optimal code, but of working code.

I'm unable to generate the right instruction, and I don't know exactly why it should differ between versions (except for bugs, of course). I just want to know the right way to force unaligned loads, without inline assembly.

Btw: the code doesn't compile on gcc < 4.7, as I just realised - older versions can't multiply a vector by a scalar.
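
For reference, a minimal sketch of that version difference (made-up names, not the attached testvecs.c): gcc only accepts mixed vector/scalar arithmetic from 4.7 on, so older releases need the scalar splatted into a vector by hand.

#include <stdint.h>

typedef int32_t v4si __attribute__((vector_size(16)));

v4si scale(v4si v, int32_t s)
{
#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 7)
    return v * s;                       /* vector * scalar, accepted by gcc >= 4.7 */
#else
    return v * (v4si){ s, s, s, s };    /* older gcc needs the splat spelled out   */
#endif
}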
I wasn't even certain which of my gcc installations had 32-bit counterparts, but Red Hat 4.4.6 appeared to accept your code for -m64 and reject it for -m32. Intel icc, which shares a lot of stuff with the active gcc, rejected your code. Many people here advocate options such as -pedantic -Wall to increase the number of warnings, so you will get those warnings even where gcc accepts your code.

I thought the X2 could accept nearly all normal SSE2 code (the original Turion didn't), but I guess you want to test its limits. Now that you've revealed your actual target, someone might suggest a more appropriate arch option. Did you read about the errata for this instruction on your chip? http://support.amd.com/us/Processor_TechDocs/25759.pdf

Splitting unaligned 128-bit moves into separate 64-bit moves was a common tactic likely to improve performance on CPUs prior to AMD Barcelona and Intel Nehalem (not to mention avoiding bugs in the hardware implementation). It probably didn't hurt to split the instruction explicitly on a CPU where the hardware would split it anyway (I thought this might be true of the X2). Even with Intel Westmere there were situations where splitting might improve performance. So gcc can't be faulted if it makes that translation when you didn't tell it to compile for a more recent CPU, or when you specify a target which is known to have problems with certain instructions.


Thanks for your time and help, but I believe you are missing the main point.

The code in question generates an aligned load instruction, "movdqa", which will cause an alignment fault on ALL CPUs unless the data happens to land on a 16-byte boundary - and that's down to luck, since its alignment is 4. "movdqu" is the instruction that should be generated, and it works fine if I use inline assembly for the load - but that's precisely what I don't want. gcc simply produces wrong code (and consistently, no matter what I put into -march); this is not about tuning. The idea was to use the vector extensions and let gcc output the optimal scalar or vector code.

Well, I added a new version with a main routine, so this should allow running the code. With -msse2 the binary segfaults with the unaligned pointer, no matter what I do.
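
To make concrete what I mean by forcing an unaligned load without inline assembly, here is a minimal sketch (made-up names, not the attached testvecs.c; assuming x86 with -O2 -msse2). Both a typedef with reduced alignment and the SSE2 intrinsic should end up as movdqu:

#include <stdint.h>
#include <stdio.h>
#include <emmintrin.h>

typedef int32_t v4si __attribute__((vector_size(16)));
/* same vector type, but with alignment 1, so gcc must assume the
   pointer can be unaligned */
typedef v4si v4si_u __attribute__((aligned(1)));

static v4si load_via_typedef(const void *p)
{
    return *(const v4si_u *)p;                        /* should emit movdqu */
}

static v4si load_via_intrinsic(const void *p)
{
    return (v4si)_mm_loadu_si128((const __m128i *)p); /* SSE2 unaligned-load intrinsic */
}

int main(void)
{
    int32_t buf[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    /* buf + 1 is 4-byte aligned, almost certainly not 16-byte aligned */
    v4si a = load_via_typedef(buf + 1);
    v4si b = load_via_intrinsic(buf + 1);
    printf("%d %d\n", (int)a[0], (int)b[3]);
    return 0;
}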

Some other funny bits:

* Compiling for ARM correctly generates unaligned byte loads with this code (it doesn't have a vector ISA for ints), so it might be the x86 backend that loses the unaligned property somewhere.
* memcpy seems to be able to generate the "movdqu" instruction, but it's very fragile: using a pointer to the packed struct generates the single "movdqu" instruction, while correctly using a pointer to the member generates a scalar inline memcpy (sketch of the general shape below).
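
For reference, the general shape of the memcpy trick (a made-up sketch, not the packed-struct code from testvecs.c): copying into a local vector leaves the choice of load to gcc, and with -O2 -msse2 the copy tends to be folded into a single movdqu.

#include <stdint.h>
#include <string.h>

typedef int32_t v4si __attribute__((vector_size(16)));

v4si load_via_memcpy(const void *p)
{
    v4si v;
    memcpy(&v, p, sizeof v);   /* with -O2 -msse2 this usually becomes one movdqu */
    return v;
}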

Attachment: testvecs.c
Description: Binary data

