Don't let search bots look at buglist.cgi

Message ID	BANLkTikwjjfsbuCacXdvd=5CyqYdEzzChA@mail.gmail.com
State	New
Headers
Return-Path: <gcc-patches-return-292013-incoming=patchwork.ozlabs.org@gcc.gnu.org>
MIME-Version: 1.0
In-Reply-To: <BANLkTikJP8i6i+55qCKfD4YhfMyhJNLigg@mail.gmail.com>
References: <mcry62a7cau.fsf@coign.corp.google.com> <BANLkTim22-m_jqhHCiNiGKVs5km96t6M3w@mail.gmail.com> <4DD10623.40705@redhat.com> <Pine.LNX.4.64.1105161408460.1989@wotan.suse.de> <4DD120F3.9050100@redhat.com> <BANLkTinJeYEx4HNQP3+N3OJj-13Z-C0D9A@mail.gmail.com> <4DD1240F.8080809@redhat.com> <Pine.LNX.4.64.1105161519310.1989@wotan.suse.de> <4DD125C8.8090105@redhat.com> <Pine.LNX.4.64.1105161527360.1989@wotan.suse.de> <4DD12802.4040705@redhat.com> <BANLkTikJP8i6i+55qCKfD4YhfMyhJNLigg@mail.gmail.com>
Date: Mon, 16 May 2011 22:27:44 -0700
Message-ID: <BANLkTikwjjfsbuCacXdvd=5CyqYdEzzChA@mail.gmail.com>
Subject: Re: Don't let search bots look at buglist.cgi
From: Ian Lance Taylor <iant@google.com>
To: Richard Guenther <richard.guenther@gmail.com>
Cc: Andrew Haley <aph@redhat.com>, Michael Matz <matz@suse.de>, gcc-patches@gcc.gnu.org
Content-Type: multipart/mixed; boundary=000e0cd6afb221aed004a3720813
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
Sender: gcc-patches-owner@gcc.gnu.org

Commit Message

Ian Lance Taylor May 17, 2011, 5:27 a.m. UTC

On Mon, May 16, 2011 at 6:42 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
>>>
>>> httpd being in the top-10 always, fiddling with bugzilla URLs?
>>> (Note, I don't have access to gcc.gnu.org, I'm relaying info from multiple
>>> instances of discussion on #gcc and richi poking on it; that said, it
>>> still might not be web crawlers, that's right, but I'll happily accept
>>> _any_ load improvement on gcc.gnu.org, how unfounded they might seem)

I think that simply blocking buglist.cgi has dropped bugzilla off the
immediate radar.  It also seems to have lowered the load, although I'm
not sure if we are still keeping historical data.


> I for example see also
>
> 66.249.71.59 - - [16/May/2011:13:37:58 +0000] "GET
> /viewcvs?view=revision&revision=169814 HTTP/1.1" 200 1334 "-"
> "Mozilla/5.0 (compatible; Googlebot/2.1;
> +http://www.google.com/bot.html)" (35%) 2060117us
>
> and viewvc is certainly even worse (from an I/O perspective).  I thought
> we blocked all bot traffic from the viewvc stuff ...

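This is only happening at the top level: a URL such as
/viewcvs?view=revision&revision=169814 has no slash after "viewcvs",
so the existing "Disallow: /viewcvs/" rule does not match it.  I
committed the robots.txt patch in this message to fix this.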

Ian

Comments

Axel Freyn May 17, 2011, 8:12 a.m. UTC | #1

On Mon, May 16, 2011 at 10:27:44PM -0700, Ian Lance Taylor wrote:
> On Mon, May 16, 2011 at 6:42 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
> >>>
> >>> httpd being in the top-10 always, fiddling with bugzilla URLs?
> >>> (Note, I don't have access to gcc.gnu.org, I'm relaying info from multiple
> >>> instances of discussion on #gcc and richi poking on it; that said, it
> >>> still might not be web crawlers, that's right, but I'll happily accept
> >>> _any_ load improvement on gcc.gnu.org, how unfounded they might seem)
> 
> I think that simply blocking buglist.cgi has dropped bugzilla off the
> immediate radar.  It also seems to have lowered the load, although I'm
> not sure if we are still keeping historical data.
> 
> 
> > I for example see also
> >
> > 66.249.71.59 - - [16/May/2011:13:37:58 +0000] "GET
> > /viewcvs?view=revision&revision=169814 HTTP/1.1" 200 1334 "-"
> > "Mozilla/5.0 (compatible; Googlebot/2.1;
> > +http://www.google.com/bot.html)" (35%) 2060117us
> >
> > and viewvc is certainly even worse (from an I/O perspective).  I thought
> > we blocked all bot traffic from the viewvc stuff ...
> 
> This is only happening at top level.  I committed this patch to fix this.
You probably know this much better than I do, but would it be possible
to allow only some of the Google crawlers (if all of them try to crawl
bugzilla)?
As I read
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=1061943
it would be possible to block the crawlers Googlebot-Mobile,
Mediapartners-Google and AdsBot-Google (which seem to be independent
crawlers?) while allowing the main Googlebot.  (Well, I don't know how
often each crawler shows up on bugzilla...)

Axel
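
A rough sketch of how such per-crawler rules could look and behave, checked
with Python's standard urllib.robotparser.  The rule layout below is only an
assumption made for illustration (nobody in this thread proposed these exact
lines), the show_bug.cgi URL is just an example path, and real crawlers may
interpret robots.txt somewhat differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep the three secondary Google crawlers out of
# /bugzilla/ entirely, and give everyone else the general rules.
PER_CRAWLER_RULES = """\
User-agent: Googlebot-Mobile
User-agent: Mediapartners-Google
User-agent: AdsBot-Google
Disallow: /bugzilla/

User-agent: *
Disallow: /viewcvs
Disallow: /cgi-bin/
Disallow: /bugzilla/buglist.cgi
"""


def may_fetch(agent, url, rules=PER_CRAWLER_RULES):
    """Return True if `agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    bug_page = "/bugzilla/show_bug.cgi?id=1"  # illustrative URL only
    for agent in ("Googlebot", "Googlebot-Mobile", "AdsBot-Google"):
        print(agent, may_fetch(agent, bug_page))
```

With rules like these, the main Googlebot falls through to the "*" group and
can still index individual bug pages, while buglist.cgi stays off limits for
everyone.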

Michael Matz May 17, 2011, 11:12 a.m. UTC | #2

Hi,

On Mon, 16 May 2011, Ian Lance Taylor wrote:

> >>> httpd being in the top-10 always, fiddling with bugzilla URLs? 
> >>> (Note, I don't have access to gcc.gnu.org, I'm relaying info from 
> >>> multiple instances of discussion on #gcc and richi poking on it; 
> >>> that said, it still might not be web crawlers, that's right, but 
> >>> I'll happily accept
> >>> _any_ load improvement on gcc.gnu.org, how unfounded they might seem)
> 
> I think that simply blocking buglist.cgi has dropped bugzilla off the 
> immediate radar. It also seems to have lowered the load, although I'm 
> not sure if we are still keeping historical data.

Btw. FWIW, I had a quick look at one of the httpd log files, and in seven
hours last Saturday (from 5:30 to 12:30) there were 435203 GET requests
overall, 391319 of which came from our own MnoGoSearch engine; that's
90%.  Granted, many of those are in fact 304 (not modified) responses,
but still, perhaps the eagerness of our own crawler can be turned down
a bit.

Ciao,
Michael.
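
For anyone wanting to repeat this kind of count, here is a minimal sketch in
Python.  It assumes the log lines follow the layout of the Googlebot entry
quoted earlier in the thread (combined log format with extra trailing
fields); the function names and the "ExampleCrawler/1.0" agent string are
made up for the example and are not taken from the real gcc.gnu.org logs.

```python
import re
from collections import Counter

# host ident user [time] "METHOD path proto" status bytes "referer" "agent" ...
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def get_requests_by_agent(lines):
    """Count GET requests per user-agent string; unparsable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("method") == "GET":
            counts[match.group("agent")] += 1
    return counts


def share(counts, needle):
    """Fraction of counted requests whose user agent contains `needle`."""
    total = sum(counts.values())
    hits = sum(n for agent, n in counts.items() if needle in agent)
    return hits / total if total else 0.0


# Tiny sample: the first line is the entry quoted in this thread, the
# second is invented for illustration.
SAMPLE = [
    '66.249.71.59 - - [16/May/2011:13:37:58 +0000] '
    '"GET /viewcvs?view=revision&revision=169814 HTTP/1.1" 200 1334 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)" (35%) 2060117us',
    '10.0.0.1 - - [14/May/2011:05:30:00 +0000] '
    '"GET /index.html HTTP/1.0" 304 - "-" "ExampleCrawler/1.0"',
]
```

Calling share(get_requests_by_agent(SAMPLE), "Googlebot") gives the
Googlebot fraction of the sample.  The same arithmetic applied to the
figures above is 391319 / 435203, which is roughly 0.899.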

Patch

Index: robots.txt
===================================================================
RCS file: /cvs/gcc/wwwdocs/htdocs/robots.txt,v
retrieving revision 1.10
diff -u -r1.10 robots.txt
--- robots.txt	13 May 2011 17:09:11 -0000	1.10
+++ robots.txt	17 May 2011 05:19:11 -0000
@@ -2,8 +2,8 @@ 
 # for information about the file format.
 # Contact gcc@gcc.gnu.org for questions.
 
-User-Agent: *
-Disallow: /viewcvs/
+User-agent: *
+Disallow: /viewcvs
 Disallow: /cgi-bin/
 Disallow: /bugzilla/buglist.cgi
 Crawl-Delay: 60
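
The effect of dropping the trailing slash can be checked with a short Python
sketch using the standard urllib.robotparser module.  This is only an
illustration of prefix matching as that parser implements it, not a statement
about how any particular crawler behaves; the /viewcvs/trunk/ path is a
made-up example, while the query URL is the one from the log line quoted
above.

```python
from urllib.robotparser import RobotFileParser

# robots.txt rules before the patch (revision 1.10).
OLD_RULES = """\
User-Agent: *
Disallow: /viewcvs/
Disallow: /cgi-bin/
Disallow: /bugzilla/buglist.cgi
Crawl-Delay: 60
"""

# robots.txt rules after the patch.
NEW_RULES = """\
User-agent: *
Disallow: /viewcvs
Disallow: /cgi-bin/
Disallow: /bugzilla/buglist.cgi
Crawl-Delay: 60
"""


def allowed(rules, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    top_level = "/viewcvs?view=revision&revision=169814"
    below = "/viewcvs/trunk/"  # hypothetical path under /viewcvs/
    for name, rules in (("old", OLD_RULES), ("new", NEW_RULES)):
        print(name, allowed(rules, "Googlebot", top_level),
              allowed(rules, "Googlebot", below))
```

Under the old rules the top-level query URL is still fetchable because it
does not start with "/viewcvs/"; under the new rules both URLs are
disallowed.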
