This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: [PATCH] fold builtin_tolower, builtin_toupper


On 10 July 2015 at 08:51, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Thu, Jul 09, 2015 at 03:46:08PM +0200, Richard Biener wrote:
>> On Thu, 9 Jul 2015, Bernhard Reutner-Fischer wrote:

[toupper/tolower patch withdrawn]

>> I don't think this can be correct for all locales which need not
>> have a lower-case character for all upper-case ones nor do
>> all letters having one need to be in the range of 'A' to 'Z'.
>>
>> Joseph will surely correct me if I am wrong.
>>
> That's correct, as this doesn't handle toupper('Ä') with an appropriate
> single-byte locale. You cannot even rely on the fact that if x<128 then the
> only conversion happens in the 'A'..'Z' range; there are locales where that
> doesn't hold and we need to check _NL_CTYPE_NONASCII_CASE. We don't
> export that, so you would need to check that while constructing a table with 256 entries.

I detest locales.
>
> Also your example is invalid as you used __builtin_tolower instead of
> tolower. As usual gcc builtins are slow, you will get better performance

You're of course right, libc usually has a map-lookup for the fast
path for these.
(tolower) (...)  comes to mind but doesn't matter here.
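
To illustrate what I mean by a map-lookup, here is a minimal sketch. It is not glibc's actual implementation; the names lower_map, init_lower_map and lower_buf are made up for this mail. The table is filled once from tolower() for the current locale, so it has to be rebuilt if the locale changes.

```c
#include <ctype.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper names, not part of any libc. */
static unsigned char lower_map[UCHAR_MAX + 1];

/* Fill the table once by asking the current locale about every byte value. */
void init_lower_map(void)
{
	int i;
	for (i = 0; i <= UCHAR_MAX; i++)
		lower_map[i] = (unsigned char)tolower(i);
}

/* One table lookup per byte instead of one call per byte. */
void lower_buf(char *c, size_t n)
{
	size_t i;
	for (i = 0; i < n; i++)
		c[i] = (char)lower_map[(unsigned char)c[i]];
}
```

Indexing through unsigned char keeps bytes above 127 from producing a negative index on targets where plain char is signed.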

> with following.
>
> #include <ctype.h>
> int foo(char *c)
> {
>  int i;
>  for(i=0;i<1000;i++)
>    c[i]=tolower(c[i]);
> }
>
>
> As for your example, the first problem is that it doesn't work with utf8 due
> to multibyte characters.

Yes, the app I saw doing that strcpy/tolower has a defined input of
ASCII A-Za-z0-9-, so I should not have used toupper in the example in
the first place.

>
> Second problem is that sse4.2 doesn't help at all as generating masks
> with it is quite slow. Using just sse2 is faster here.

The point of the PR was that a) loop-fusion is missing and b) nothing
is vectorized.
The quick sse4.2 example just used an extension my CPU happens to
support; it showed that the result would be smaller than before and
maybe even a tiny bit faster. ;)

>
> It could be possible to add such function to libc. For vectorization you

I think it would be better if GCC was able to fuse two or more loops
and grok to vectorize patterns like these.
As you point out, toupper is a bad example, a better one would perhaps
be something like the attached.

I guess that there is real-world code that does a
memcpy/memmove/str[n]cpy and then masks out some bits in the
destination, so this should be useful generally.
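
For reference, this is roughly the shape I would hope the fused result to have, written out by hand. It is only a sketch: the name tolower_strcpy_fused is made up, and it assumes ASCII-only input where setting bit 0x20 is a valid lower-casing of 'A'..'Z'.

```c
#include <stddef.h>

/* Hypothetical name; hand-fused form of a strcpy loop followed by an
 * ASCII tolower loop over the destination.  */
char *tolower_strcpy_fused(char *dest, const char *src)
{
	char *d = dest;
	/* Single pass: copy each byte and lower-case 'A'..'Z' on the way. */
	for (; *src; src++) {
		unsigned char ch = (unsigned char)*src;
		*d++ = (ch >= 'A' && ch <= 'Z') ? (char)(ch | 0x20) : (char)ch;
	}
	*d = '\0';
	return dest;
}
```

The loop body has no call and no dependence between iterations other than the terminating-NUL check, which is the part that still makes vectorization non-trivial.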

thanks for your comments, though!
cheers,
/* PR middle-end/66741 */
/* Manually expanded variant */
/* We were not fusing the 2 loops (strcpy and tolower) and we did not
 * vectorize the loop.  */
typedef __SIZE_TYPE__ size_t;
static __attribute__ ((noinline, noclone)) char *
tolower_strcpy_1(char *dest, const char *src) {
	char *d = dest, *s = (char *)src;
	while (*s) /* strcpy */
		*d++ = *s++;
	*d = '\0';
	d = dest;
	/* while (*d) should work as well but might be too complicated, so: */
	/* use same loop condition as above */
	s = (char *)src;
	while (*s) { /* ascii_tolower */
		int ch = *d;
		*d++ = ch >= 'A' && ch <= 'Z' ? ch | 0x20 : ch;
		s++;
	}
	return dest;
}
char *tolower_strcpy(char *dest, const char *src) {
	char *s = (char *)src;
	unsigned int len = 0;
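	/* Reject input outside '-'..'z' or longer than 255 bytes. */
	while (*s) {
		if (*s < '-' || *s > 'z' || ++len > 255)
			return (void*)0;
		s++; /* advance, otherwise the loop never terminates */
	}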
	return tolower_strcpy_1(dest, src);
}
#ifdef MAIN
#include <unistd.h>
#include <string.h>
#define N 128
int main(void) {
	unsigned long sum = 0;
	char src[N + 1], dest[N + 1];
	while (1) {
		int n = read(0, &src, N);
		if (n == 0)
			break;
		if (n < 0)
			return 1;
		src[n] = 0;
		sum |= (unsigned long)tolower_strcpy(dest, src);
//		write(1, dest, strlen(dest));
	}
	return sum == 42;
}
#endif
