Don't let search bots look at buglist.cgi

Andrew Haley <aph@redhat.com>
Mon May 16 13:39:00 GMT 2011


On 05/16/2011 02:10 PM, Richard Guenther wrote:
> On Mon, May 16, 2011 at 3:04 PM, Andrew Haley <aph@redhat.com> wrote:
>> On 05/16/2011 01:09 PM, Michael Matz wrote:
>>> Hi,
>>>
>>> On Mon, 16 May 2011, Andrew Haley wrote:
>>>
>>>> On 16/05/11 10:45, Richard Guenther wrote:
>>>>> On Fri, May 13, 2011 at 7:14 PM, Ian Lance Taylor <iant@google.com> wrote:
>>>>>> I noticed that buglist.cgi was taking quite a bit of CPU time.  I looked
>>>>>> at some of the long running instances, and they were coming from
>>>>>> searchbots.  I can't think of a good reason for this, so I have
>>>>>> committed this patch to the gcc.gnu.org robots.txt file to not let
>>>>>> searchbots search through lists of bugs.  I plan to make a similar
>>>>>> change on the sourceware.org and cygwin.com sides.  Please let me know
>>>>>> if this seems like a mistake.
>>>>>>
>>>>>> Does anybody have any experience with
>>>>>> http://code.google.com/p/bugzilla-sitemap/ ?  That might be a slightly
>>>>>> better approach.
>>>>>
>>>>> Shouldn't we keep searchbots away from Bugzilla completely?  Searchbots
>>>>> can crawl the gcc-bugs mailing list archives.
>>>>
>>>> I don't understand this.  Surely it is super-useful for Google etc. to
>>>> be able to search gcc's Bugzilla.
>>>
>>> gcc-bugs provides exactly the same information, and doesn't have to
>>> regenerate the full web page for each access to a bug report.
>>
>> It's not quite the same information, surely.  Wouldn't searchers be directed
>> to an email rather than the bug itself?
> 
> Yes, though there is a link in all mails.

Right, so we are contemplating a reduction in search quality in
exchange for a reduction in server load.  That is not an improvement
from the point of view of our users, and is therefore not the sort of
thing we should do unless the server load is so great that it impedes
our mission.

Andrew.
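
P.S. For concreteness, the robots.txt change Ian describes would presumably
look something like the following; the exact path is my guess at the
gcc.gnu.org layout, not a quote from the committed patch:

    User-agent: *
    Disallow: /bugzilla/buglist.cgi

The sitemap approach he mentions would go the other way: a tool like
bugzilla-sitemap might generate something along these lines (the bug URL
here is purely illustrative), so that individual bug pages stay crawlable
while bots are kept off the expensive query pages:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12345</loc>
      </url>
    </urlset>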


