diff mbox

[RFC] parsemail.py: Don't search for patches in replies

Message ID 1391726961-32072-1-git-send-email-markus.mayer@linaro.org
State Rejected
Headers show

Commit Message

Markus Mayer Feb. 6, 2014, 10:49 p.m. UTC
Make sure we don't attempt to search for a patch in a reply e-mail.
There are MUAs out there who leave the quoted e-mail intact without
prepending quote characters such as ">" at the beginning of each line.

When that happens, parse_patch() thinks the quoted patch is new. The
result are multiple database entries containing the same patch (one for
each such reply) when one would really expect a consolidated thread
containing the entire discussion and only one copy of the patch.

Signed-off-by: Markus Mayer <markus.mayer@linaro.org>
---
This problem is mainly caused when replies to patches are sent using
Outlook.

The approach below seems to work, although there is the downside that
it relies on English MUA settings. If a mail client translates "Re:" to
some other string such as "AW:" the proposed code will not detect that
the e-mail in question is a reply. (Although it wouldn't be any worse
than it is now.)

To avoid this, I tried using the presence of "In-Reply-To:" and
"References:" headers to detect a reply, but "git send-email" inserts
references into patches that aren't replies (e.g. v2 of a patch
referencing v1), which then leads to the opposite problem: mails being
categorized as replies when they are not.

So, checking for "Re:" still seems to be the better option. Please let
me know your thoughts.

 apps/patchwork/bin/parsemail.py |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Keller, Jacob E Feb. 6, 2014, 11:02 p.m. UTC | #1
Just adding that we have this problem occasionally. Would be nice of
Outlook wasn't such a broken email client. :(

Yea, I don't know of any better solution.. I don't think Outlook has any
other marker for reply addresses.

Thanks,
Jake

On Thu, 2014-02-06 at 14:49 -0800, Markus Mayer wrote:
> Make sure we don't attempt to search for a patch in a reply e-mail.
> There are MUAs out there who leave the quoted e-mail intact without
> prepending quote characters such as ">" at the beginning of each line.
> 
> When that happens, parse_patch() thinks the quoted patch is new. The
> result are multiple database entries containing the same patch (one for
> each such reply) when one would really expect a consolidated thread
> containing the entire discussion and only one copy of the patch.
> 
> Signed-off-by: Markus Mayer <markus.mayer@linaro.org>
> ---
> This problem is mainly caused when replies to patches are sent using
> Outlook.
> 
> The approach below seems to work, although there is the downside that
> it relies on English MUA settings. If a mail client translates "Re:" to
> some other string such as "AW:" the proposed code will not detect that
> the e-mail in question is a reply. (Although it wouldn't be any worse
> than it is now.)
> 
> To avoid this, I tried using the presence of "In-Reply-To:" and
> "References:" headers to detect a reply, but "git send-email" inserts
> references into patches that aren't replies (e.g. v2 of a patch
> referencing v1), which then leads to the opposite problem: mails being
> categorized as replies when they are not.
> 
> So, checking for "Re:" still seems to be the better option. Please let
> me know your thoughts.
> 
>  apps/patchwork/bin/parsemail.py |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/apps/patchwork/bin/parsemail.py b/apps/patchwork/bin/parsemail.py
> index b6eb97a..405fb69 100755
> --- a/apps/patchwork/bin/parsemail.py
> +++ b/apps/patchwork/bin/parsemail.py
> @@ -151,6 +151,8 @@ def find_content(project, mail):
>      patchbuf = None
>      commentbuf = ''
>      pullurl = None
> +    subject = mail.get('Subject')
> +    is_reply = (subject.lower().find("re:") == 0)
>  
>      for part in mail.walk():
>          if part.get_content_maintype() != 'text':
> @@ -185,8 +187,8 @@ def find_content(project, mail):
>      patch = None
>      comment = None
>  
> -    if pullurl or patchbuf:
> -        name = clean_subject(mail.get('Subject'), [project.linkname])
> +    if not is_reply and (pullurl or patchbuf):
> +        name = clean_subject(subject, [project.linkname])
>          patch = Patch(name = name, pull_url = pullurl, content = patchbuf,
>                      date = mail_date(mail), headers = mail_headers(mail))
>
Carl-Daniel Hailfinger Feb. 7, 2014, 10:06 p.m. UTC | #2
Am 06.02.2014 23:49 schrieb Markus Mayer:
> Make sure we don't attempt to search for a patch in a reply e-mail.
> There are MUAs out there who leave the quoted e-mail intact without
> prepending quote characters such as ">" at the beginning of each line.
>
> When that happens, parse_patch() thinks the quoted patch is new. The
> result are multiple database entries containing the same patch (one for
> each such reply) when one would really expect a consolidated thread
> containing the entire discussion and only one copy of the patch.
>
> Signed-off-by: Markus Mayer <markus.mayer@linaro.org>

AFAICS this will cause loss of patches which are sent as replies. This
would totally kill roughly 90% of the second iterations of patches sent
to the flashrom mailing list (and the flashrom project uses patchwork
heavily). A reply mail to a patch may contain a counterproposal. After
all, email has the Re: subject prefix exactly to denote replies, and of
course replies can contain patches. Manually removing Re: from the
subject removes the visual clue that some mail is a reply to another
one. You wouldn't strip the Re: from normal reply mails, and why would
you strip the Re: from replies containing patches?

Regards,
Carl-Daniel
Wolfram Sang Feb. 7, 2014, 10:37 p.m. UTC | #3
On Fri, Feb 07, 2014 at 11:06:39PM +0100, Carl-Daniel Hailfinger wrote:
> Am 06.02.2014 23:49 schrieb Markus Mayer:
> > Make sure we don't attempt to search for a patch in a reply e-mail.
> > There are MUAs out there who leave the quoted e-mail intact without
> > prepending quote characters such as ">" at the beginning of each line.
> >
> > When that happens, parse_patch() thinks the quoted patch is new. The
> > result are multiple database entries containing the same patch (one for
> > each such reply) when one would really expect a consolidated thread
> > containing the entire discussion and only one copy of the patch.
> >
> > Signed-off-by: Markus Mayer <markus.mayer@linaro.org>
> 
> AFAICS this will cause loss of patches which are sent as replies.

+1. This approach will cause more pain than gain. Move those users away
from Outlook, would be my recommendation.
Markus Mayer Feb. 7, 2014, 10:52 p.m. UTC | #4
On 7 February 2014 14:37, Wolfram Sang <wsa@the-dreams.de> wrote:
> On Fri, Feb 07, 2014 at 11:06:39PM +0100, Carl-Daniel Hailfinger wrote:
>> Am 06.02.2014 23:49 schrieb Markus Mayer:
>> > Make sure we don't attempt to search for a patch in a reply e-mail.
>> > There are MUAs out there who leave the quoted e-mail intact without
>> > prepending quote characters such as ">" at the beginning of each line.
>> >
>> > When that happens, parse_patch() thinks the quoted patch is new. The
>> > result are multiple database entries containing the same patch (one for
>> > each such reply) when one would really expect a consolidated thread
>> > containing the entire discussion and only one copy of the patch.
>> >
>> > Signed-off-by: Markus Mayer <markus.mayer@linaro.org>
>>
>> AFAICS this will cause loss of patches which are sent as replies.
>
> +1. This approach will cause more pain than gain. Move those users away
> from Outlook, would be my recommendation.

Have you ever worked for a 10,000+ employee organization?

And no, I don't particularly like this solution, either. But I
couldn't come up with a better one.

-Markus
Mauro Carvalho Chehab Feb. 7, 2014, 11:13 p.m. UTC | #5
Em Fri, 7 Feb 2014 14:52:52 -0800
Markus Mayer <markus.mayer@linaro.org> escreveu:

> On 7 February 2014 14:37, Wolfram Sang <wsa@the-dreams.de> wrote:
> > On Fri, Feb 07, 2014 at 11:06:39PM +0100, Carl-Daniel Hailfinger wrote:
> >> Am 06.02.2014 23:49 schrieb Markus Mayer:
> >> > Make sure we don't attempt to search for a patch in a reply e-mail.
> >> > There are MUAs out there who leave the quoted e-mail intact without
> >> > prepending quote characters such as ">" at the beginning of each line.
> >> >
> >> > When that happens, parse_patch() thinks the quoted patch is new. The
> >> > result are multiple database entries containing the same patch (one for
> >> > each such reply) when one would really expect a consolidated thread
> >> > containing the entire discussion and only one copy of the patch.
> >> >
> >> > Signed-off-by: Markus Mayer <markus.mayer@linaro.org>
> >>
> >> AFAICS this will cause loss of patches which are sent as replies.
> >
> > +1. This approach will cause more pain than gain. Move those users away
> > from Outlook, would be my recommendation.

+1. The replied patches are generally better versions than the
original one.

> Have you ever worked for a 10,000+ employee organization?
> 
> And no, I don't particularly like this solution, either. But I
> couldn't come up with a better one.

This has nothing to do with the number of employees, but if they're
using or not the right tools.

If contributors are using some MUA that adds space/tabs/whatever instead
of ">", the patches will already be mangled by such MUA.

Btw, at Kernel, there are thousands of contributions, and I get
a way more patches with whitespaces mangled than bad replies that
re-add the same patch.

Still, I even prefer to receive whitespace mangled patches,
as I can easily fix trivial whitespace mangling, or otherwise
reply to the author to fix it for me. This is a way better than
losing a good patch.

Regards,


Cheers,
Mauro
Markus Mayer Feb. 8, 2014, midnight UTC | #6
On 7 February 2014 15:13, Mauro Carvalho Chehab <mchehab@infradead.org> wrote:
> Em Fri, 7 Feb 2014 14:52:52 -0800
> Markus Mayer <markus.mayer@linaro.org> escreveu:
>
>> On 7 February 2014 14:37, Wolfram Sang <wsa@the-dreams.de> wrote:
>> > On Fri, Feb 07, 2014 at 11:06:39PM +0100, Carl-Daniel Hailfinger wrote:
>> >> Am 06.02.2014 23:49 schrieb Markus Mayer:
>> >> > Make sure we don't attempt to search for a patch in a reply e-mail.
>> >> > There are MUAs out there who leave the quoted e-mail intact without
>> >> > prepending quote characters such as ">" at the beginning of each line.
>> >> >
>> >> > When that happens, parse_patch() thinks the quoted patch is new. The
>> >> > result are multiple database entries containing the same patch (one for
>> >> > each such reply) when one would really expect a consolidated thread
>> >> > containing the entire discussion and only one copy of the patch.
>> >> >
>> >> > Signed-off-by: Markus Mayer <markus.mayer@linaro.org>
>> >>
>> >> AFAICS this will cause loss of patches which are sent as replies.
>> >
>> > +1. This approach will cause more pain than gain. Move those users away
>> > from Outlook, would be my recommendation.
>
> +1. The replied patches are generally better versions than the
> original one.
>
>> Have you ever worked for a 10,000+ employee organization?
>>
>> And no, I don't particularly like this solution, either. But I
>> couldn't come up with a better one.
>
> This has nothing to do with the number of employees, but if they're
> using or not the right tools.

It's just that I can't *make* people use a sane mailer. I can suggest
it and remind them, etc, but if they won't do it or forget there's
nothing I can do. Trust me that I tell anybody who'll listen not to
use Outlook for Open Source mailing lists. The more people there are
in an organization, however, the higher the likelihood of Outlook
being involved sooner or later. Plus there is the "corporate standard"
which programs to use.

> Still, I even prefer to receive whitespace mangled patches,
> as I can easily fix trivial whitespace mangling, or otherwise
> reply to the author to fix it for me. This is a way better than
> losing a good patch.

Agreed that losing a patch is not good.

I my instance I don't need to worry about genuine patches being sent
in replies, only comments. I won't lose a patch based on the proposed
change, but I end up with dozens of duplicate patches without it. (I
am trying to import a mailing list archive.) So, in my case it makes
sense, but I understand in others it may not.

Maybe this change could become optional so it can be switched on for
situations where it is helpful, but the default behaviour doesn't
change? For parsemail.py it could be a command line option or a config
file setting. I don't think parsemail.py is currently looking at the
config file, however.

Regards,
-Markus
Wolfram Sang Feb. 8, 2014, 9:29 a.m. UTC | #7
> >> Have you ever worked for a 10,000+ employee organization?

I know about such things, yet we shouldn't break patchwork which is
probably used by way more users. If your patch solves your issue, this
is great nonetheless.

> I my instance I don't need to worry about genuine patches being sent
> in replies, only comments.

So, if you are running your own instance, you could keep this as an
out-of-tree patch? I'd think so, but it is up to Jeremy, of course.

Regards,

   Wolfram
Jeremy Kerr Feb. 10, 2014, 1:20 a.m. UTC | #8
Hi Markus,

> Make sure we don't attempt to search for a patch in a reply e-mail.
> There are MUAs out there who leave the quoted e-mail intact without
> prepending quote characters such as ">" at the beginning of each line.

D'oh!

> When that happens, parse_patch() thinks the quoted patch is new. The
> result are multiple database entries containing the same patch (one for
> each such reply) when one would really expect a consolidated thread
> containing the entire discussion and only one copy of the patch.

OK, sounds like something that we might be able to address, but I don't
think this is the correct approach (as others have mentioned too). We
don't want to drop follow-ups, as they may contain patches that we need
to track.

Perhaps there's some other way we could identify these replies that
shouldn't be parsed as patches? Could you send an unmodified (mbox-form)
mail that we can take a look at?

Cheers,


Jeremy
Markus Mayer Feb. 11, 2014, 5:51 p.m. UTC | #9
On 8 February 2014 01:29, Wolfram Sang <wsa@the-dreams.de> wrote:
>
>> >> Have you ever worked for a 10,000+ employee organization?
>
> I know about such things, yet we shouldn't break patchwork which is
> probably used by way more users. If your patch solves your issue, this
> is great nonetheless.

Agreed, I don't want to break patchwork for everybody else. When I
sent in the patch, I wasn't sure if there was a use case for patches
in replies or not. That's why I asked for opinions. Now I know that
there is.

>> I my instance I don't need to worry about genuine patches being sent
>> in replies, only comments.
>
> So, if you are running your own instance, you could keep this as an
> out-of-tree patch? I'd think so, but it is up to Jeremy, of course.

I'll do that if there isn't a better solution to my issue. I mostly
figured that I'm probably not the only person to have run into this
issue, so I wanted to see what other users of patchwork have to say.
Maybe there's even a way to solve the problem for everyone.

Regards,
-Markus
Markus Mayer Feb. 11, 2014, 5:52 p.m. UTC | #10
On 9 February 2014 17:20, Jeremy Kerr <jk@ozlabs.org> wrote:
> Hi Markus,
>
>> Make sure we don't attempt to search for a patch in a reply e-mail.
>> There are MUAs out there who leave the quoted e-mail intact without
>> prepending quote characters such as ">" at the beginning of each line.
>
> D'oh!

Yeah, I know.

>> When that happens, parse_patch() thinks the quoted patch is new. The
>> result are multiple database entries containing the same patch (one for
>> each such reply) when one would really expect a consolidated thread
>> containing the entire discussion and only one copy of the patch.
>
> OK, sounds like something that we might be able to address, but I don't
> think this is the correct approach (as others have mentioned too). We
> don't want to drop follow-ups, as they may contain patches that we need
> to track.

I didn't particularly like the approach I sent, either. It's not at
all elegant. But in my case, I need something, as I end up with about
100 duplicate patches otherwise. Making it configurable ("Outlook
mode" vs. "patches in replies" mode) was the best idea I could come up
with. If there's a different way, all the better.

> Perhaps there's some other way we could identify these replies that
> shouldn't be parsed as patches? Could you send an unmodified (mbox-form)
> mail that we can take a look at?

Yes, absolutely. I'll send a conversation your way so you can have a
look at all the mail headers.

Thanks,
-Markus
diff mbox

Patch

diff --git a/apps/patchwork/bin/parsemail.py b/apps/patchwork/bin/parsemail.py
index b6eb97a..405fb69 100755
--- a/apps/patchwork/bin/parsemail.py
+++ b/apps/patchwork/bin/parsemail.py
@@ -151,6 +151,8 @@  def find_content(project, mail):
     patchbuf = None
     commentbuf = ''
     pullurl = None
+    subject = mail.get('Subject')
+    is_reply = (subject.lower().find("re:") == 0)
 
     for part in mail.walk():
         if part.get_content_maintype() != 'text':
@@ -185,8 +187,8 @@  def find_content(project, mail):
     patch = None
     comment = None
 
-    if pullurl or patchbuf:
-        name = clean_subject(mail.get('Subject'), [project.linkname])
+    if not is_reply and (pullurl or patchbuf):
+        name = clean_subject(subject, [project.linkname])
         patch = Patch(name = name, pull_url = pullurl, content = patchbuf,
                     date = mail_date(mail), headers = mail_headers(mail))