Don't let search bots look at buglist.cgi

Message ID BANLkTikwjjfsbuCacXdvd=5CyqYdEzzChA@mail.gmail.com
State New

Commit Message

Ian Lance Taylor May 17, 2011, 5:27 a.m. UTC
On Mon, May 16, 2011 at 6:42 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
>>>
>>> httpd being in the top-10 always, fiddling with bugzilla URLs?
>>> (Note, I don't have access to gcc.gnu.org, I'm relaying info from multiple
>>> instances of discussion on #gcc and richi poking on it; that said, it
>>> still might not be web crawlers, that's right, but I'll happily accept
>>> _any_ load improvement on gcc.gnu.org, how unfounded they might seem)

I think that simply blocking buglist.cgi has dropped bugzilla off the
immediate radar.  It also seems to have lowered the load, although I'm
not sure if we are still keeping historical data.


> I for example see also
>
> 66.249.71.59 - - [16/May/2011:13:37:58 +0000] "GET
> /viewcvs?view=revision&revision=169814 HTTP/1.1" 200 1334 "-"
> "Mozilla/5.0 (compatible; Googlebot/2.1;
> +http://www.google.com/bot.html)" (35%) 2060117us
>
> and viewvc is certainly even worse (from an I/O perspective).  I thought
> we blocked all bot traffic from the viewvc stuff ...

This is only happening at top level.  I committed this patch to fix this.

Ian

Comments

Axel Freyn May 17, 2011, 8:12 a.m. UTC | #1
On Mon, May 16, 2011 at 10:27:44PM -0700, Ian Lance Taylor wrote:
> On Mon, May 16, 2011 at 6:42 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
> >>>
> >>> httpd being in the top-10 always, fiddling with bugzilla URLs?
> >>> (Note, I don't have access to gcc.gnu.org, I'm relaying info from multiple
> >>> instances of discussion on #gcc and richi poking on it; that said, it
> >>> still might not be web crawlers, that's right, but I'll happily accept
> >>> _any_ load improvement on gcc.gnu.org, how unfounded they might seem)
> 
> I think that simply blocking buglist.cgi has dropped bugzilla off the
> immediate radar.  It also seems to have lowered the load, although I'm
> not sure if we are still keeping historical data.
> 
> 
> > I for example see also
> >
> > 66.249.71.59 - - [16/May/2011:13:37:58 +0000] "GET
> > /viewcvs?view=revision&revision=169814 HTTP/1.1" 200 1334 "-"
> > "Mozilla/5.0 (compatible; Googlebot/2.1;
> > +http://www.google.com/bot.html)" (35%) 2060117us
> >
> > and viewvc is certainly even worse (from an I/O perspective).  I thought
> > we blocked all bot traffic from the viewvc stuff ...
> 
> This is only happening at top level.  I committed this patch to fix this.

You probably know this much better than I do, but would it be possible
to allow only some of Google's crawlers (if all of them are trying to
crawl bugzilla)?
As I read
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=1061943
it would be possible to block the crawlers Googlebot-Mobile,
Mediapartners-Google and AdsBot-Google (which seem to be independent
crawlers?) while allowing the main Googlebot.  (Well, I don't know how
often each of these crawlers shows up on bugzilla...)

Axel
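
The per-crawler idea above could be sketched roughly as follows. This is only an illustration, not a tested proposal for gcc.gnu.org: the robots.txt text is hypothetical, the bug URL is made up, and Python's urllib.robotparser is used as a stand-in for how a well-behaved crawler might read the rules. Real crawlers may interpret groups differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the three secondary Google crawlers are kept out
# of /bugzilla/ entirely, everyone else gets the rules from the patch.
ROBOTS_TXT = """\
User-agent: Googlebot-Mobile
User-agent: Mediapartners-Google
User-agent: AdsBot-Google
Disallow: /bugzilla/

User-agent: *
Disallow: /viewcvs
Disallow: /cgi-bin/
Disallow: /bugzilla/buglist.cgi
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Illustrative URL only; any path under /bugzilla/ other than buglist.cgi works.
bug_url = "/bugzilla/show_bug.cgi?id=1"

for agent in ("Googlebot", "Googlebot-Mobile", "AdsBot-Google"):
    print(agent, parser.can_fetch(agent, bug_url))
```

Note that a crawler matching one of the named groups uses only that group, so any rule from the `*` group that should also apply to it would have to be repeated there.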

Michael Matz May 17, 2011, 11:12 a.m. UTC | #2
Hi,

On Mon, 16 May 2011, Ian Lance Taylor wrote:

> >>> httpd being in the top-10 always, fiddling with bugzilla URLs? 
> >>> (Note, I don't have access to gcc.gnu.org, I'm relaying info from 
> >>> multiple instances of discussion on #gcc and richi poking on it; 
> >>> that said, it still might not be web crawlers, that's right, but 
> >>> I'll happily accept
> >>> _any_ load improvement on gcc.gnu.org, how unfounded they might seem)
> 
> I think that simply blocking buglist.cgi has dropped bugzilla off the 
> immediate radar. It also seems to have lowered the load, although I'm 
> not sure if we are still keeping historical data.

Btw. FWIW, I had a quick look at one of the httpd log files, and in seven 
hours last Saturday (from 5:30 to 12:30), there were overall 435203 GET 
requests, and 391319 of them came from our own MnoGoSearch engine; that's 
90%.  Granted, many of those are in fact 304 (Not Modified) responses, but 
still, perhaps the eagerness of our own crawler can be turned down a bit.


Ciao,
Michael.
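
A tally like the one above could be reproduced with a short script along these lines. This is a minimal sketch, assuming the log uses the combined-style format of the Googlebot line quoted earlier in the thread and that the crawler identifies itself with "MnoGoSearch" somewhere in its user-agent string. The function name, the regular expression, and the second and third sample lines are invented for illustration.

```python
import re

# Matches: "METHOD path protocol" status bytes "referer" "user agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) \S+ [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def get_share(lines, agent_substring):
    """Return (matching GETs, total GETs) for agents containing the substring."""
    matching = 0
    total = 0
    for line in lines:
        m = LOG_LINE.search(line)
        if m is None or m.group("method") != "GET":
            continue  # skip unparsable lines and non-GET requests
        total += 1
        if agent_substring.lower() in m.group("agent").lower():
            matching += 1
    return matching, total

sample = [
    # Log line quoted in the thread, joined onto one line.
    '66.249.71.59 - - [16/May/2011:13:37:58 +0000] '
    '"GET /viewcvs?view=revision&revision=169814 HTTP/1.1" 200 1334 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" '
    '(35%) 2060117us',
    # Invented lines; the MnoGoSearch agent string is an assumption.
    '192.0.2.1 - - [14/May/2011:05:30:01 +0000] '
    '"GET /ml/gcc/ HTTP/1.1" 304 - "-" "MnoGoSearch/3.x" (-%) 1500us',
    '192.0.2.1 - - [14/May/2011:05:30:02 +0000] '
    '"GET /ml/gcc-patches/ HTTP/1.1" 304 - "-" "MnoGoSearch/3.x" (-%) 1400us',
]

matching, total = get_share(sample, "mnogosearch")
print(f"{matching} of {total} GET requests ({matching / total:.0%})")
```

With the real figures from the message, 391319 / 435203 is about 0.899, which rounds to the 90% quoted.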

Patch

Index: robots.txt
===================================================================
RCS file: /cvs/gcc/wwwdocs/htdocs/robots.txt,v
retrieving revision 1.10
diff -u -r1.10 robots.txt
--- robots.txt	13 May 2011 17:09:11 -0000	1.10
+++ robots.txt	17 May 2011 05:19:11 -0000
@@ -2,8 +2,8 @@ 
 # for information about the file format.
 # Contact gcc@gcc.gnu.org for questions.
 
-User-Agent: *
-Disallow: /viewcvs/
+User-agent: *
+Disallow: /viewcvs
 Disallow: /cgi-bin/
 Disallow: /bugzilla/buglist.cgi
 Crawl-Delay: 60
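
Why dropping the trailing slash matters can be checked with a small sketch. It assumes robots.txt rules are plain path prefixes, which is how Python's urllib.robotparser treats them; the helper name parse_rules is made up, and only the two lines changed by the patch are modelled. The request path is the one from the Googlebot log line quoted in the thread.

```python
from urllib.robotparser import RobotFileParser

def parse_rules(text):
    """Build a parser from robots.txt text without fetching anything."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

# Only the two changed lines of the patch, before and after.
old_rules = parse_rules("User-Agent: *\nDisallow: /viewcvs/\n")
new_rules = parse_rules("User-agent: *\nDisallow: /viewcvs\n")

# Top-level request seen in the log: no slash after "viewcvs".
top_level = "/viewcvs?view=revision&revision=169814"
# Illustrative path below the directory, covered by both versions.
nested = "/viewcvs/trunk/"

for name, rules in (("old", old_rules), ("new", new_rules)):
    print(name, rules.can_fetch("Googlebot", top_level),
          rules.can_fetch("Googlebot", nested))
```

Under this prefix reading, the old rule leaves the top-level URL crawlable while the new one blocks it. The shorter prefix would also cover any other path that merely begins with /viewcvs.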