Don't let search bots look at buglist.cgi

Submitter Ian Taylor
Date May 13, 2011, 5:14 p.m.
Message ID <mcry62a7cau.fsf@coign.corp.google.com>
Permalink /patch/95501/
State New

Comments

Ian Taylor - May 13, 2011, 5:14 p.m.
I noticed that buglist.cgi was taking quite a bit of CPU time.  I looked
at some of the long running instances, and they were coming from
searchbots.  I can't think of a good reason for this, so I have
committed this patch to the gcc.gnu.org robots.txt file to not let
searchbots search through lists of bugs.  I plan to make a similar
change on the sourceware.org and cygwin.com sides.  Please let me know
if this seems like a mistake.

Does anybody have any experience with
http://code.google.com/p/bugzilla-sitemap/ ?  That might be a slightly
better approach.

Ian
Richard Guenther - May 16, 2011, 9:45 a.m.
On Fri, May 13, 2011 at 7:14 PM, Ian Lance Taylor <iant@google.com> wrote:
> I noticed that buglist.cgi was taking quite a bit of CPU time.  I looked
> at some of the long running instances, and they were coming from
> searchbots.  I can't think of a good reason for this, so I have
> committed this patch to the gcc.gnu.org robots.txt file to not let
> searchbots search through lists of bugs.  I plan to make a similar
> change on the sourceware.org and cygwin.com sides.  Please let me know
> if this seems like a mistake.
>
> Does anybody have any experience with
> http://code.google.com/p/bugzilla-sitemap/ ?  That might be a slightly
> better approach.

Shouldn't we keep searchbots away from bugzilla completely?  Searchbots
can crawl the gcc-bugs mailing list archives.

Richard.

> Ian
>
>
Andrew Haley - May 16, 2011, 11:10 a.m.
On 16/05/11 10:45, Richard Guenther wrote:
> On Fri, May 13, 2011 at 7:14 PM, Ian Lance Taylor <iant@google.com> wrote:
>> I noticed that buglist.cgi was taking quite a bit of CPU time.  I looked
>> at some of the long running instances, and they were coming from
>> searchbots.  I can't think of a good reason for this, so I have
>> committed this patch to the gcc.gnu.org robots.txt file to not let
>> searchbots search through lists of bugs.  I plan to make a similar
>> change on the sourceware.org and cygwin.com sides.  Please let me know
>> if this seems like a mistake.
>>
>> Does anybody have any experience with
>> http://code.google.com/p/bugzilla-sitemap/ ?  That might be a slightly
>> better approach.
> 
> Shouldn't we keep searchbots away from bugzilla completely?  Searchbots
> can crawl the gcc-bugs mailing list archives.

I don't understand this.  Surely it is super-useful for Google etc. to
be able to search gcc's Bugzilla.

Andrew.
Michael Matz - May 16, 2011, 12:09 p.m.
Hi,

On Mon, 16 May 2011, Andrew Haley wrote:

> On 16/05/11 10:45, Richard Guenther wrote:
> > On Fri, May 13, 2011 at 7:14 PM, Ian Lance Taylor <iant@google.com> wrote:
> >> I noticed that buglist.cgi was taking quite a bit of CPU time.  I looked
> >> at some of the long running instances, and they were coming from
> >> searchbots.  I can't think of a good reason for this, so I have
> >> committed this patch to the gcc.gnu.org robots.txt file to not let
> >> searchbots search through lists of bugs.  I plan to make a similar
> >> change on the sourceware.org and cygwin.com sides.  Please let me know
> >> if this seems like a mistake.
> >>
> >> Does anybody have any experience with
> >> http://code.google.com/p/bugzilla-sitemap/ ?  That might be a slightly
> >> better approach.
> > 
> > Shouldn't we keep searchbots away from bugzilla completely?  Searchbots
> > can crawl the gcc-bugs mailing list archives.
> 
> I don't understand this.  Surely it is super-useful for Google etc. to
> be able to search gcc's Bugzilla.

gcc-bugs provides exactly the same information, and doesn't have to 
regenerate the full web page for each access to a bug report.


Ciao,
Michael.
Andrew Haley - May 16, 2011, 1:04 p.m.
On 05/16/2011 01:09 PM, Michael Matz wrote:
> Hi,
> 
> On Mon, 16 May 2011, Andrew Haley wrote:
> 
>> On 16/05/11 10:45, Richard Guenther wrote:
>>> On Fri, May 13, 2011 at 7:14 PM, Ian Lance Taylor <iant@google.com> wrote:
>>>> I noticed that buglist.cgi was taking quite a bit of CPU time.  I looked
>>>> at some of the long running instances, and they were coming from
>>>> searchbots.  I can't think of a good reason for this, so I have
>>>> committed this patch to the gcc.gnu.org robots.txt file to not let
>>>> searchbots search through lists of bugs.  I plan to make a similar
>>>> change on the sourceware.org and cygwin.com sides.  Please let me know
>>>> if this seems like a mistake.
>>>>
>>>> Does anybody have any experience with
>>>> http://code.google.com/p/bugzilla-sitemap/ ?  That might be a slightly
>>>> better approach.
>>>
>>> Shouldn't we keep searchbots away from bugzilla completely?  Searchbots
>>> can crawl the gcc-bugs mailing list archives.
>>
>> I don't understand this.  Surely it is super-useful for Google etc. to
>> be able to search gcc's Bugzilla.
> 
> gcc-bugs provides exactly the same information, and doesn't have to 
> regenerate the full web page for each access to a bug report.

It's not quite the same information, surely.  Wouldn't searchers be directed
to an email rather than the bug itself?

Andrew.
Andreas Schwab - May 16, 2011, 1:09 p.m.
Andrew Haley <aph@redhat.com> writes:

> It's not quite the same information, surely.  Wouldn't searchers be directed
> to an email rather than the bug itself?

The mail contains the bugzilla link, so they can easily get there if
needed.

Andreas.
Richard Guenther - May 16, 2011, 1:10 p.m.
On Mon, May 16, 2011 at 3:04 PM, Andrew Haley <aph@redhat.com> wrote:
> On 05/16/2011 01:09 PM, Michael Matz wrote:
>> Hi,
>>
>> On Mon, 16 May 2011, Andrew Haley wrote:
>>
>>> On 16/05/11 10:45, Richard Guenther wrote:
>>>> On Fri, May 13, 2011 at 7:14 PM, Ian Lance Taylor <iant@google.com> wrote:
>>>>> I noticed that buglist.cgi was taking quite a bit of CPU time.  I looked
>>>>> at some of the long running instances, and they were coming from
>>>>> searchbots.  I can't think of a good reason for this, so I have
>>>>> committed this patch to the gcc.gnu.org robots.txt file to not let
>>>>> searchbots search through lists of bugs.  I plan to make a similar
>>>>> change on the sourceware.org and cygwin.com sides.  Please let me know
>>>>> if this seems like a mistake.
>>>>>
>>>>> Does anybody have any experience with
>>>>> http://code.google.com/p/bugzilla-sitemap/ ?  That might be a slightly
>>>>> better approach.
>>>>
>>>> Shouldn't we keep searchbots away from bugzilla completely?  Searchbots
>>>> can crawl the gcc-bugs mailing list archives.
>>>
>>> I don't understand this.  Surely it is super-useful for Google etc. to
>>> be able to search gcc's Bugzilla.
>>
>> gcc-bugs provides exactly the same information, and doesn't have to
>> regenerate the full web page for each access to a bug report.
>
> It's not quite the same information, surely.  Wouldn't searchers be directed
> to an email rather than the bug itself?

Yes, though there is a link in all mails.

Richard.
Andrew Haley - May 16, 2011, 1:18 p.m.
On 05/16/2011 02:10 PM, Richard Guenther wrote:
> On Mon, May 16, 2011 at 3:04 PM, Andrew Haley <aph@redhat.com> wrote:
>> On 05/16/2011 01:09 PM, Michael Matz wrote:
>>> Hi,
>>>
>>> On Mon, 16 May 2011, Andrew Haley wrote:
>>>
>>>> On 16/05/11 10:45, Richard Guenther wrote:
>>>>> On Fri, May 13, 2011 at 7:14 PM, Ian Lance Taylor <iant@google.com> wrote:
>>>>>> I noticed that buglist.cgi was taking quite a bit of CPU time.  I looked
>>>>>> at some of the long running instances, and they were coming from
>>>>>> searchbots.  I can't think of a good reason for this, so I have
>>>>>> committed this patch to the gcc.gnu.org robots.txt file to not let
>>>>>> searchbots search through lists of bugs.  I plan to make a similar
>>>>>> change on the sourceware.org and cygwin.com sides.  Please let me know
>>>>>> if this seems like a mistake.
>>>>>>
>>>>>> Does anybody have any experience with
>>>>>> http://code.google.com/p/bugzilla-sitemap/ ?  That might be a slightly
>>>>>> better approach.
>>>>>
>>>>> Shouldn't we keep searchbots away from bugzilla completely?  Searchbots
>>>>> can crawl the gcc-bugs mailing list archives.
>>>>
>>>> I don't understand this.  Surely it is super-useful for Google etc. to
>>>> be able to search gcc's Bugzilla.
>>>
>>> gcc-bugs provides exactly the same information, and doesn't have to
>>> regenerate the full web page for each access to a bug report.
>>
>> It's not quite the same information, surely.  Wouldn't searchers be directed
>> to an email rather than the bug itself?
> 
> Yes, though there is a link in all mails.

Right, so we are contemplating a reduction in search quality in
exchange for a reduction in server load.  That is not an improvement
from the point of view of our users, and is therefore not the sort of
thing we should do unless the server load is so great that it impedes
our mission.

Andrew.
Michael Matz - May 16, 2011, 1:22 p.m.
Hi,

On Mon, 16 May 2011, Andrew Haley wrote:

> >> It's not quite the same information, surely.  Wouldn't searchers be 
> >> directed to an email rather than the bug itself?
> > 
> > Yes, though there is a link in all mails.
> 
> Right, so we are contemplating a reduction in search quality in exchange 
> for a reduction in server load.  That is not an improvement from the 
> point of view of our users, and is therefore not the sort of thing we 
> should do unless the server load is so great that it impedes our 
> mission.

It routinely is.  bugzilla performance is terrible most of the time for me 
(up to the point of five timeouts in sequence), svn speed is mediocre at 
best, and people with access to gcc.gnu.org often observe loads > 25, 
mostly due to I/O.


Ciao,
Michael.
Andrew Haley - May 16, 2011, 1:25 p.m.
On 05/16/2011 02:22 PM, Michael Matz wrote:
> Hi,
> 
> On Mon, 16 May 2011, Andrew Haley wrote:
> 
>>>> It's not quite the same information, surely.  Wouldn't searchers be 
>>>> directed to an email rather than the bug itself?
>>>
>>> Yes, though there is a link in all mails.
>>
>> Right, so we are contemplating a reduction in search quality in exchange 
>> for a reduction in server load.  That is not an improvement from the 
>> point of view of our users, and is therefore not the sort of thing we 
>> should do unless the server load is so great that it impedes our 
>> mission.
> 
> It routinely is.  bugzilla performance is terrible most of the time for me 
> (up to the point of five timeouts in sequence), svn speed is mediocre at 
> best, and people with access to gcc.gnu.org often observe loads > 25, 
> mostly due to I/O.

And how have you concluded that is due to web crawlers?

Andrew.
Michael Matz - May 16, 2011, 1:32 p.m.
Hi,

On Mon, 16 May 2011, Andrew Haley wrote:

> > It routinely is.  bugzilla performance is terrible most of the time 
> > for me (up to the point of five timeouts in sequence), svn speed is 
> > mediocre at best, and people with access to gcc.gnu.org often observe 
> > loads > 25, mostly due to I/O.
> 
> And how have you concluded that is due to web crawlers?

httpd being in the top-10 always, fiddling with bugzilla URLs?
(Note, I don't have access to gcc.gnu.org, I'm relaying info from multiple 
instances of discussion on #gcc and richi poking on it; that said, it 
still might not be web crawlers, that's right, but I'll happily accept 
_any_ load improvement on gcc.gnu.org, however unfounded it might seem)


Ciao,
Michael.
Andrew Haley - May 16, 2011, 1:34 p.m.
On 05/16/2011 02:32 PM, Michael Matz wrote:
> 
> On Mon, 16 May 2011, Andrew Haley wrote:
> 
>>> It routinely is.  bugzilla performance is terrible most of the time 
>>> for me (up to the point of five timeouts in sequence), svn speed is 
>>> mediocre at best, and people with access to gcc.gnu.org often observe 
>>> loads > 25, mostly due to I/O.
>>
>> And how have you concluded that is due to web crawlers?
> 
> httpd being in the top-10 always, fiddling with bugzilla URLs?
> (Note, I don't have access to gcc.gnu.org, I'm relaying info from multiple 
> instances of discussion on #gcc and richi poking on it; that said, it 
> still might not be web crawlers, that's right, but I'll happily accept 
> _any_ load improvement on gcc.gnu.org, however unfounded it might seem)

Well, we have to be sensible.  If blocking crawlers only results in a
small load reduction that isn't, IMHO, a good deal for our users.

Andrew.
Richard Guenther - May 16, 2011, 1:42 p.m.
On Mon, May 16, 2011 at 3:34 PM, Andrew Haley <aph@redhat.com> wrote:
> On 05/16/2011 02:32 PM, Michael Matz wrote:
>>
>> On Mon, 16 May 2011, Andrew Haley wrote:
>>
>>>> It routinely is.  bugzilla performance is terrible most of the time
>>>> for me (up to the point of five timeouts in sequence), svn speed is
>>>> mediocre at best, and people with access to gcc.gnu.org often observe
>>>> loads > 25, mostly due to I/O.
>>>
>>> And how have you concluded that is due to web crawlers?
>>
>> httpd being in the top-10 always, fiddling with bugzilla URLs?
>> (Note, I don't have access to gcc.gnu.org, I'm relaying info from multiple
>> instances of discussion on #gcc and richi poking on it; that said, it
>> still might not be web crawlers, that's right, but I'll happily accept
>> _any_ load improvement on gcc.gnu.org, however unfounded it might seem)
>
> Well, we have to be sensible.  If blocking crawlers only results in a
> small load reduction that isn't, IMHO, a good deal for our users.

For example, I also see

66.249.71.59 - - [16/May/2011:13:37:58 +0000] "GET
/viewcvs?view=revision&revision=169814 HTTP/1.1" 200 1334 "-"
"Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)" (35%) 2060117us

and viewvc is certainly even worse (from an I/O perspective).  I thought
we blocked all bot traffic from the viewvc stuff ...
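
To put numbers behind the question of how much load comes from crawlers, a
log line like the one above could be tallied per script and client kind.
The following is only a rough sketch: the regular expression is inferred
from that single sample line, the BOT_HINTS substrings are a guess at how
crawlers identify themselves, and time_by_client_kind is a made-up helper
name, not an existing tool on gcc.gnu.org.

```python
import re
from collections import defaultdict

# Pattern inferred from the one sample line in this thread (assumption):
# combined log format, an optional "(NN%)" field, then elapsed microseconds.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
    r'(?: \(\S+\))? (?P<micros>\d+)us$'
)

# Hypothetical substrings used to guess that a user agent is a crawler.
BOT_HINTS = ("bot", "crawler", "spider", "slurp")


def time_by_client_kind(lines):
    """Sum elapsed microseconds per (client kind, script path)."""
    totals = defaultdict(int)
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        agent = match.group("agent").lower()
        kind = "bot" if any(hint in agent for hint in BOT_HINTS) else "other"
        script = match.group("path").split("?", 1)[0]  # drop the query string
        totals[(kind, script)] += int(match.group("micros"))
    return dict(totals)


# The sample line from this message, joined back onto a single line.
SAMPLE = (
    '66.249.71.59 - - [16/May/2011:13:37:58 +0000] '
    '"GET /viewcvs?view=revision&revision=169814 HTTP/1.1" 200 1334 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)" (35%) 2060117us'
)

if __name__ == "__main__":
    for key, micros in sorted(time_by_client_kind([SAMPLE]).items()):
        print(key, micros / 1e6, "seconds")
```

Run over a full day of the access log, the totals would show whether the
"bot" rows for buglist.cgi or viewcvs are large enough to matter.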

Richard.

> Andrew.
>
Andrew Haley - May 16, 2011, 1:44 p.m.
On 05/16/2011 02:42 PM, Richard Guenther wrote:
> On Mon, May 16, 2011 at 3:34 PM, Andrew Haley <aph@redhat.com> wrote:
>> On 05/16/2011 02:32 PM, Michael Matz wrote:
>>>
>>> On Mon, 16 May 2011, Andrew Haley wrote:
>>>
>>>>> It routinely is.  bugzilla performance is terrible most of the time
>>>>> for me (up to the point of five timeouts in sequence), svn speed is
>>>>> mediocre at best, and people with access to gcc.gnu.org often observe
>>>>> loads > 25, mostly due to I/O.
>>>>
>>>> And how have you concluded that is due to web crawlers?
>>>
>>> httpd being in the top-10 always, fiddling with bugzilla URLs?
>>> (Note, I don't have access to gcc.gnu.org, I'm relaying info from multiple
>>> instances of discussion on #gcc and richi poking on it; that said, it
>>> still might not be web crawlers, that's right, but I'll happily accept
>>> _any_ load improvement on gcc.gnu.org, however unfounded it might seem)
>>
>> Well, we have to be sensible.  If blocking crawlers only results in a
>> small load reduction that isn't, IMHO, a good deal for our users.
> 
> For example, I also see
> 
> 66.249.71.59 - - [16/May/2011:13:37:58 +0000] "GET
> /viewcvs?view=revision&revision=169814 HTTP/1.1" 200 1334 "-"
> "Mozilla/5.0 (compatible; Googlebot/2.1;
> +http://www.google.com/bot.html)" (35%) 2060117us
> 
> and viewvc is certainly even worse (from an I/O perspective).  I thought
> we blocked all bot traffic from the viewvc stuff ...

It makes sense to block viewcvs, but I don't think it makes as
much sense to block the bugs themselves.  That's the part that
is useful to our users.

Andrew.
Ian Taylor - May 16, 2011, 8:35 p.m.
Richard Guenther <richard.guenther@gmail.com> writes:

> On Fri, May 13, 2011 at 7:14 PM, Ian Lance Taylor <iant@google.com> wrote:
>> I noticed that buglist.cgi was taking quite a bit of CPU time.  I looked
>> at some of the long running instances, and they were coming from
>> searchbots.  I can't think of a good reason for this, so I have
>> committed this patch to the gcc.gnu.org robots.txt file to not let
>> searchbots search through lists of bugs.  I plan to make a similar
>> change on the sourceware.org and cygwin.com sides.  Please let me know
>> if this seems like a mistake.
>>
>> Does anybody have any experience with
>> http://code.google.com/p/bugzilla-sitemap/ ?  That might be a slightly
>> better approach.
>
> Shouldn't we keep searchbots away from bugzilla completely?  Searchbots
> can crawl the gcc-bugs mailing list archives.

I don't see anything wrong with crawling bugzilla, though, and the
resulting links should be better.

Ian
Joseph S. Myers - May 16, 2011, 9:19 p.m.
On Mon, 16 May 2011, Ian Lance Taylor wrote:

> Richard Guenther <richard.guenther@gmail.com> writes:
> 
> > On Fri, May 13, 2011 at 7:14 PM, Ian Lance Taylor <iant@google.com> wrote:
> >> I noticed that buglist.cgi was taking quite a bit of CPU time.  I looked
> >> at some of the long running instances, and they were coming from
> >> searchbots.  I can't think of a good reason for this, so I have
> >> committed this patch to the gcc.gnu.org robots.txt file to not let
> >> searchbots search through lists of bugs.  I plan to make a similar
> >> change on the sourceware.org and cygwin.com sides.  Please let me know
> >> if this seems like a mistake.
> >>
> >> Does anybody have any experience with
> >> http://code.google.com/p/bugzilla-sitemap/ ?  That might be a slightly
> >> better approach.
> >
> > Shouldn't we keep searchbots away from bugzilla completely?  Searchbots
> > can crawl the gcc-bugs mailing list archives.
> 
> I don't see anything wrong with crawling bugzilla, though, and the
> resulting links should be better.

Indeed.  I think the individual bugs, and the GCC-specific help texts 
(such as describekeywords.cgi and describecomponents.cgi), should be 
indexed.

Patch

Index: robots.txt
===================================================================
RCS file: /cvs/gcc/wwwdocs/htdocs/robots.txt,v
retrieving revision 1.9
diff -u -r1.9 robots.txt
--- robots.txt	22 Sep 2009 19:19:30 -0000	1.9
+++ robots.txt	13 May 2011 17:08:33 -0000
@@ -5,4 +5,5 @@ 
 User-Agent: *
 Disallow: /viewcvs/
 Disallow: /cgi-bin/
+Disallow: /bugzilla/buglist.cgi
 Crawl-Delay: 60
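
The effect of the patched rules can be checked locally with Python's
standard urllib.robotparser. This is a sketch under assumptions, not part of
the committed change: ROBOTS_TXT is rebuilt from the visible diff hunk only
(the first lines of the real file are not shown above), the bug id is made
up, and describekeywords.cgi is assumed to live under /bugzilla/.

```python
from urllib.robotparser import RobotFileParser

# Rules reconstructed from the diff hunk above; the real file's first
# lines are not shown in the patch, so this is only an approximation.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /viewcvs/
Disallow: /cgi-bin/
Disallow: /bugzilla/buglist.cgi
Crawl-Delay: 60
"""


def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def allowed(parser, path, agent="Googlebot"):
    """Return True if `agent` may fetch `path` on gcc.gnu.org."""
    return parser.can_fetch(agent, "http://gcc.gnu.org" + path)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for path in (
        "/bugzilla/buglist.cgi?query_format=advanced",  # bug lists
        "/bugzilla/show_bug.cgi?id=12345",              # hypothetical bug id
        "/bugzilla/describekeywords.cgi",               # assumed location
        "/viewcvs/trunk/",
        "/viewcvs?view=revision&revision=169814",       # URL from the log line
    ):
        print(path, "allowed" if allowed(parser, path) else "blocked")
```

With this parser, bug lists are blocked while individual bugs and the help
pages stay crawlable. The last path is reported as allowed, because
"/viewcvs?..." does not begin with the "/viewcvs/" prefix. That may be why
Googlebot still shows up on viewcvs URLs in the log excerpt earlier in the
thread, though real crawlers may match rules differently from this parser.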