Message ID: BANLkTikwjjfsbuCacXdvd=5CyqYdEzzChA@mail.gmail.com
State: New
On Mon, May 16, 2011 at 10:27:44PM -0700, Ian Lance Taylor wrote:
> On Mon, May 16, 2011 at 6:42 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
> >>> httpd being in the top-10 always, fiddling with bugzilla URLs?
> >>> (Note, I don't have access to gcc.gnu.org, I'm relaying info from multiple
> >>> instances of discussion on #gcc and richi poking on it; that said, it
> >>> still might not be web crawlers, that's right, but I'll happily accept
> >>> _any_ load improvement on gcc.gnu.org, however unfounded they might seem)
>
> I think that simply blocking buglist.cgi has dropped bugzilla off the
> immediate radar.
> It also seems to have lowered the load, although I'm not sure if we
> are still keeping historical data.
>
> > I for example see also
> >
> > 66.249.71.59 - - [16/May/2011:13:37:58 +0000] "GET
> > /viewcvs?view=revision&revision=169814 HTTP/1.1" 200 1334 "-"
> > "Mozilla/5.0 (compatible; Googlebot/2.1;
> > +http://www.google.com/bot.html)" (35%) 2060117us
> >
> > and viewvc is certainly even worse (from an I/O perspective). I thought
> > we blocked all bot traffic from the viewvc stuff ...
>
> This is only happening at top level. I committed this patch to fix this.

Probably you know it much better than I do, but wouldn't it be a
possibility to allow only some of Google's crawlers (if all of them try
to crawl bugzilla)? As I read
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=1061943
it would be possible to block the crawlers Googlebot-Mobile,
Mediapartners-Google and AdsBot-Google (which seem to be independent
crawlers?) while allowing the main Googlebot. (Well, I don't know how
often each crawler actually appears on bugzilla...)

Axel
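[For reference, the per-crawler split Axel describes could be sketched in robots.txt as below. The agent names are taken from the Google help page he cites; whether each of these bots honors its own record (and the exact paths worth blocking) is an assumption for illustration, not something verified against gcc.gnu.org.]

```text
# Sketch: block the secondary Google crawlers from bugzilla entirely.
# A crawler uses the most specific User-agent record that matches it,
# so these records override the catch-all "*" record for these bots.
User-agent: Googlebot-Mobile
Disallow: /bugzilla/

User-agent: Mediapartners-Google
Disallow: /bugzilla/

User-agent: AdsBot-Google
Disallow: /bugzilla/

# Main Googlebot and everyone else keep the existing rules.
User-agent: *
Disallow: /viewcvs
Disallow: /cgi-bin/
Disallow: /bugzilla/buglist.cgi
Crawl-Delay: 60
```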
Hi,

On Mon, 16 May 2011, Ian Lance Taylor wrote:

> >>> httpd being in the top-10 always, fiddling with bugzilla URLs?
> >>> (Note, I don't have access to gcc.gnu.org, I'm relaying info from
> >>> multiple instances of discussion on #gcc and richi poking on it;
> >>> that said, it still might not be web crawlers, that's right, but
> >>> I'll happily accept
> >>> _any_ load improvement on gcc.gnu.org, however unfounded they might seem)
>
> I think that simply blocking buglist.cgi has dropped bugzilla off the
> immediate radar. It also seems to have lowered the load, although I'm
> not sure if we are still keeping historical data.

Btw. FWIW, I had a quick look at one of the httpd log files, and in seven
hours last Saturday (from 5:30 to 12:30) there were 435203 GET requests
overall, and 391319 of them came from our own MnoGoSearch engine; that's
90%. Granted, many of those are then in fact 304 (Not Modified) responses,
but still, perhaps the eagerness of our own crawler can be turned down a
bit.

Ciao,
Michael.
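[The per-agent share Michael quotes can be computed from an httpd access log in a few lines. The sketch below assumes Apache's Combined Log Format, where the user agent is the last quoted field; the sample lines are made up for illustration, not real gcc.gnu.org data.]

```python
import re
from collections import Counter

# Toy sample in Combined Log Format (the real log is assumed to look similar).
log_lines = [
    '127.0.0.1 - - [14/May/2011:05:31:02 +0000] "GET /bugzilla/show_bug.cgi?id=1 HTTP/1.1" 304 0 "-" "MnoGoSearch/3.3"',
    '66.249.71.59 - - [14/May/2011:05:31:03 +0000] "GET /viewcvs?view=revision HTTP/1.1" 200 1334 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '127.0.0.1 - - [14/May/2011:05:31:04 +0000] "GET /index.html HTTP/1.1" 304 0 "-" "MnoGoSearch/3.3"',
]

# The user agent is the last double-quoted field on the line.
agent_re = re.compile(r'"([^"]*)"$')

counts = Counter()
for line in log_lines:
    m = agent_re.search(line)
    if m:
        counts[m.group(1)] += 1

total = sum(counts.values())
for agent, n in counts.most_common():
    print(f"{agent}: {n}/{total} ({100 * n / total:.0f}%)")
```

Run over a real seven-hour log slice, the same loop would reproduce the 391319-of-435203 kind of breakdown quoted above.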
Index: robots.txt
===================================================================
RCS file: /cvs/gcc/wwwdocs/htdocs/robots.txt,v
retrieving revision 1.10
diff -u -r1.10 robots.txt
--- robots.txt	13 May 2011 17:09:11 -0000	1.10
+++ robots.txt	17 May 2011 05:19:11 -0000
@@ -2,8 +2,8 @@
 # for information about the file format.
 # Contact gcc@gcc.gnu.org for questions.
 
-User-Agent: *
-Disallow: /viewcvs/
+User-agent: *
+Disallow: /viewcvs
 Disallow: /cgi-bin/
 Disallow: /bugzilla/buglist.cgi
 Crawl-Delay: 60
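[The point of the hunk is that robots.txt Disallow rules are plain path prefixes: `Disallow: /viewcvs/` only matched URLs under the trailing-slash form, while `Disallow: /viewcvs` matches both `/viewcvs?...` at top level and everything under `/viewcvs/`. The behavior of the patched rules can be checked with Python's standard urllib.robotparser; the URLs below are examples from this thread:]

```python
import urllib.robotparser

# The robots.txt contents as of the patch above.
rules = """\
User-agent: *
Disallow: /viewcvs
Disallow: /cgi-bin/
Disallow: /bugzilla/buglist.cgi
Crawl-Delay: 60
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Top-level viewcvs queries (the traffic Richard observed) are now blocked,
# and so is everything under /viewcvs/, since the rule is a bare prefix.
print(rp.can_fetch("Googlebot", "http://gcc.gnu.org/viewcvs?view=revision&revision=169814"))
print(rp.can_fetch("Googlebot", "http://gcc.gnu.org/viewcvs/trunk/"))

# Individual bug pages stay crawlable; only buglist.cgi is disallowed.
print(rp.can_fetch("Googlebot", "http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12345"))
print(rp.can_fetch("Googlebot", "http://gcc.gnu.org/bugzilla/buglist.cgi?product=gcc"))
```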