Message ID: mcry62a7cau.fsf@coign.corp.google.com
State: New
On Fri, May 13, 2011 at 7:14 PM, Ian Lance Taylor <iant@google.com> wrote:
> I noticed that buglist.cgi was taking quite a bit of CPU time. I looked
> at some of the long-running instances, and they were coming from
> searchbots. I can't think of a good reason for this, so I have
> committed this patch to the gcc.gnu.org robots.txt file to not let
> searchbots search through lists of bugs. I plan to make a similar
> change on the sourceware.org and cygwin.com sides. Please let me know
> if this seems like a mistake.
>
> Does anybody have any experience with
> http://code.google.com/p/bugzilla-sitemap/ ? That might be a slightly
> better approach.
>
> Ian

Shouldn't we keep searchbots away from bugzilla completely? Searchbots
can crawl the gcc-bugs mailinglist archives.

Richard.
On 16/05/11 10:45, Richard Guenther wrote:
> On Fri, May 13, 2011 at 7:14 PM, Ian Lance Taylor <iant@google.com> wrote:
>> I noticed that buglist.cgi was taking quite a bit of CPU time. [...]
>
> Shouldn't we keep searchbots away from bugzilla completely? Searchbots
> can crawl the gcc-bugs mailinglist archives.

I don't understand this. Surely it is super-useful for Google etc. to
be able to search gcc's Bugzilla.

Andrew.
Hi,

On Mon, 16 May 2011, Andrew Haley wrote:
> On 16/05/11 10:45, Richard Guenther wrote:
> > Shouldn't we keep searchbots away from bugzilla completely? Searchbots
> > can crawl the gcc-bugs mailinglist archives.
>
> I don't understand this. Surely it is super-useful for Google etc. to
> be able to search gcc's Bugzilla.

gcc-bugs provides exactly the same information, and doesn't have to
regenerate the full web page for each access to a bug report.

Ciao,
Michael.
On 05/16/2011 01:09 PM, Michael Matz wrote:
> On Mon, 16 May 2011, Andrew Haley wrote:
>> I don't understand this. Surely it is super-useful for Google etc. to
>> be able to search gcc's Bugzilla.
>
> gcc-bugs provides exactly the same information, and doesn't have to
> regenerate the full web page for each access to a bug report.

It's not quite the same information, surely. Wouldn't searchers be
directed to an email rather than the bug itself?

Andrew.
Andrew Haley <aph@redhat.com> writes:
> It's not quite the same information, surely. Wouldn't searchers be
> directed to an email rather than the bug itself?

The mail contains the bugzilla link, so they can easily get there if
needed.

Andreas.
On Mon, May 16, 2011 at 3:04 PM, Andrew Haley <aph@redhat.com> wrote:
> On 05/16/2011 01:09 PM, Michael Matz wrote:
>> gcc-bugs provides exactly the same information, and doesn't have to
>> regenerate the full web page for each access to a bug report.
>
> It's not quite the same information, surely. Wouldn't searchers be
> directed to an email rather than the bug itself?

Yes, though there is a link in all mails.

Richard.
On 05/16/2011 02:10 PM, Richard Guenther wrote:
> On Mon, May 16, 2011 at 3:04 PM, Andrew Haley <aph@redhat.com> wrote:
>> It's not quite the same information, surely. Wouldn't searchers be
>> directed to an email rather than the bug itself?
>
> Yes, though there is a link in all mails.

Right, so we are contemplating a reduction in search quality in exchange
for a reduction in server load. That is not an improvement from the
point of view of our users, and is therefore not the sort of thing we
should do unless the server load is so great that it impedes our
mission.

Andrew.
Hi,

On Mon, 16 May 2011, Andrew Haley wrote:
> > > It's not quite the same information, surely. Wouldn't searchers be
> > > directed to an email rather than the bug itself?
> >
> > Yes, though there is a link in all mails.
>
> Right, so we are contemplating a reduction in search quality in
> exchange for a reduction in server load. That is not an improvement
> from the point of view of our users, and is therefore not the sort of
> thing we should do unless the server load is so great that it impedes
> our mission.

It routinely is. Bugzilla performance is terrible most of the time for
me (up to the point of five timeouts in sequence), svn speed is mediocre
at best, and people with access to gcc.gnu.org often observe loads > 25,
mostly due to I/O.

Ciao,
Michael.
On 05/16/2011 02:22 PM, Michael Matz wrote:
> It routinely is. Bugzilla performance is terrible most of the time for
> me (up to the point of five timeouts in sequence), svn speed is
> mediocre at best, and people with access to gcc.gnu.org often observe
> loads > 25, mostly due to I/O.

And how have you concluded that is due to web crawlers?

Andrew.
Hi,

On Mon, 16 May 2011, Andrew Haley wrote:
> > It routinely is. Bugzilla performance is terrible most of the time
> > for me (up to the point of five timeouts in sequence), svn speed is
> > mediocre at best, and people with access to gcc.gnu.org often
> > observe loads > 25, mostly due to I/O.
>
> And how have you concluded that is due to web crawlers?

httpd being in the top-10 always, fiddling with bugzilla URLs? (Note, I
don't have access to gcc.gnu.org; I'm relaying info from multiple
instances of discussion on #gcc and richi poking on it. That said, it
still might not be web crawlers, but I'll happily accept _any_ load
improvement on gcc.gnu.org, however unfounded the measures might seem.)

Ciao,
Michael.
On 05/16/2011 02:32 PM, Michael Matz wrote:
> On Mon, 16 May 2011, Andrew Haley wrote:
>> And how have you concluded that is due to web crawlers?
>
> httpd being in the top-10 always, fiddling with bugzilla URLs?

Well, we have to be sensible. If blocking crawlers only results in a
small load reduction, that isn't, IMHO, a good deal for our users.

Andrew.
On Mon, May 16, 2011 at 3:34 PM, Andrew Haley <aph@redhat.com> wrote:
> Well, we have to be sensible. If blocking crawlers only results in a
> small load reduction, that isn't, IMHO, a good deal for our users.

I for example also see

66.249.71.59 - - [16/May/2011:13:37:58 +0000] "GET
/viewcvs?view=revision&revision=169814 HTTP/1.1" 200 1334 "-"
"Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)" (35%) 2060117us

and viewvc is certainly even worse (from an I/O perspective). I thought
we blocked all bot traffic from the viewvc stuff ...

Richard.
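Log entries like the one above can be scanned for crawler traffic with a small script. This is only a sketch: the regex assumes a combined-log-style line (the trailing `(35%) 2060117us` fields in the sample are site-specific and are ignored), and the list of bot markers is an illustrative assumption, not what gcc.gnu.org actually checks.

```python
import re

# Combined-log-style pattern: client IP, identd, user, timestamp,
# request line, status, size, referer, user agent. Site-specific
# trailing fields (timing, percentages) are simply not matched.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d+) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical marker list; real bot detection would be broader.
BOT_MARKERS = ("Googlebot", "bingbot", "Yandex", "Baiduspider")

def is_bot_hit(line):
    """Return (path, agent) if the line is a crawler request, else None."""
    m = LOG_RE.search(line)
    if not m:
        return None
    agent = m.group("agent")
    if any(marker in agent for marker in BOT_MARKERS):
        return m.group("path"), agent
    return None

# The Googlebot hit from the message above, rejoined onto one line.
sample = ('66.249.71.59 - - [16/May/2011:13:37:58 +0000] '
          '"GET /viewcvs?view=revision&revision=169814 HTTP/1.1" 200 1334 '
          '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
          '+http://www.google.com/bot.html)"')
print(is_bot_hit(sample))
```

Run over a full access log, a script like this would show what share of viewvc and bugzilla requests come from crawlers, which is the question being debated here.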
On 05/16/2011 02:42 PM, Richard Guenther wrote:
> I for example also see
>
> 66.249.71.59 - - [16/May/2011:13:37:58 +0000] "GET
> /viewcvs?view=revision&revision=169814 HTTP/1.1" 200 1334 "-"
> "Mozilla/5.0 (compatible; Googlebot/2.1;
> +http://www.google.com/bot.html)" (35%) 2060117us
>
> and viewvc is certainly even worse (from an I/O perspective). I
> thought we blocked all bot traffic from the viewvc stuff ...

It makes sense to block viewcvs, but I don't think it makes as much
sense to block the bugs themselves. That's the part that is useful to
our users.

Andrew.
Richard Guenther <richard.guenther@gmail.com> writes:
> On Fri, May 13, 2011 at 7:14 PM, Ian Lance Taylor <iant@google.com> wrote:
>> I noticed that buglist.cgi was taking quite a bit of CPU time. [...]
>
> Shouldn't we keep searchbots away from bugzilla completely? Searchbots
> can crawl the gcc-bugs mailinglist archives.

I don't see anything wrong with crawling bugzilla, though, and the
resulting links should be better.

Ian
On Mon, 16 May 2011, Ian Lance Taylor wrote:
> Richard Guenther <richard.guenther@gmail.com> writes:
> > Shouldn't we keep searchbots away from bugzilla completely?
> > Searchbots can crawl the gcc-bugs mailinglist archives.
>
> I don't see anything wrong with crawling bugzilla, though, and the
> resulting links should be better.

Indeed. I think the individual bugs, and the GCC-specific help texts
(such as describekeywords.cgi and describecomponents.cgi), should be
indexed.
Index: robots.txt
===================================================================
RCS file: /cvs/gcc/wwwdocs/htdocs/robots.txt,v
retrieving revision 1.9
diff -u -r1.9 robots.txt
--- robots.txt	22 Sep 2009 19:19:30 -0000	1.9
+++ robots.txt	13 May 2011 17:08:33 -0000
@@ -5,4 +5,5 @@
 User-Agent: *
 Disallow: /viewcvs/
 Disallow: /cgi-bin/
+Disallow: /bugzilla/buglist.cgi
 Crawl-Delay: 60
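As a sanity check, the rules from the patched file can be fed to Python's `urllib.robotparser` to confirm the intended effect: bug *lists* become off-limits while individual bug pages stay crawlable, which matches Ian's stated goal. This is a sketch; the `show_bug.cgi` URL below is just an illustrative path, and the robots.txt body is the post-patch rule set reconstructed from the diff above.

```python
from urllib import robotparser

# The relevant robots.txt rules after the patch above.
ROBOTS = """\
User-Agent: *
Disallow: /viewcvs/
Disallow: /cgi-bin/
Disallow: /bugzilla/buglist.cgi
Crawl-Delay: 60
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# buglist.cgi (with any query string) is now blocked for all agents,
# because Disallow rules are prefix matches on the path.
print(rp.can_fetch("Googlebot", "/bugzilla/buglist.cgi?product=gcc"))

# Individual bug reports remain fetchable, so search engines can still
# index the bugs themselves.
print(rp.can_fetch("Googlebot", "/bugzilla/show_bug.cgi?id=12345"))
```

Note that `Crawl-Delay` is not part of the original robots.txt standard; crawlers that honor it (and `urllib.robotparser` from Python 3.6 on, via `crawl_delay()`) will throttle to one request per 60 seconds, which addresses the load concern independently of what is disallowed.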