diff mbox series

parser: remove in-reply-to/references comments

Message ID CAF-EcGk7CeBpDWe_+1x-5+kjQ7gJYRzAYv4jhkXbsRoqKv1uoQ@mail.gmail.com
State Changes Requested
Headers show
Series parser: remove in-reply-to/references comments | expand

Commit Message

Raxel Gutierrez June 23, 2021, 4:51 p.m. UTC
When contributors using Gnus reply to a patch, the In-Reply-To &
References header fields include comments in the form described in
remove_rfc2822_comments spec. These comments can lead to threading
issues where users' replies don't get associated to the patch they
reply to because the added comment will be part of the message-id when
parsed. Following RFC2822, a comment is not supposed to affect threading
in any way, so we remove it. To use regex, we assume that comments are
not nested; further changes to more thoroughly match the RFC2822 grammar
remains for anyone interested.

Signed-off-by: Raxel Gutierrez <raxel@google.com>
Closes: #399
---
 patchwork/parser.py                           | 25 +++++++++++++++++--
 .../notes/issue-399-584c5be5b71dcf63.yaml     |  7 ++++++
 2 files changed, 30 insertions(+), 2 deletions(-)
 create mode 100644 releasenotes/notes/issue-399-584c5be5b71dcf63.yaml

Comments

Emily Shaffer June 24, 2021, 6:52 p.m. UTC | #1
On Wed, Jun 23, 2021 at 12:51:46PM -0400, Raxel Gutierrez wrote:
> 
> When contributors using Gnus reply to a patch, the In-Reply-To &
> References header fields include comments in the form described in
> remove_rfc2822_comments spec. These comments can lead to threading
> issues where users' replies don't get associated to the patch they
> reply to because the added comment will be part of the message-id when
> parsed. Following RFC2822, a comment is not supposed to affect threading
> in any way, so we remove it. To use regex, we assume that comments are
> not nested; further changes to more thoroughly match the RFC2822 grammar
> remains for anyone interested.

Hm, does the nesting level of the comments really matter? Or is the
issue that they may be multiline?

That is - it's pretty trivial to write a regex to match
(foo(bar)baz((quot)meh)), as long as you don't actually care about the
semantics of the nesting.

> 
> Signed-off-by: Raxel Gutierrez <raxel@google.com>
> Closes: #399
> ---
>  patchwork/parser.py                           | 25 +++++++++++++++++--
>  .../notes/issue-399-584c5be5b71dcf63.yaml     |  7 ++++++
>  2 files changed, 30 insertions(+), 2 deletions(-)
>  create mode 100644 releasenotes/notes/issue-399-584c5be5b71dcf63.yaml
> 
> diff --git a/patchwork/parser.py b/patchwork/parser.py
> index 61a8124..683ff55 100644
> --- a/patchwork/parser.py
> +++ b/patchwork/parser.py
> @@ -70,6 +70,27 @@ def normalise_space(value):
>      return whitespace_re.sub(' ', value).strip()
> 
> 
> +def remove_rfc2822_comments(header_contents):
> +    """Removes RFC2822 comments from header fields.
> +
> +    Gnus create reply emails with commments like In-Reply-To/References:
> +    <msg-id> (User's message of Sun, 01 Jan 2012 12:34:56 +0700) [comment].
> +    Patchwork parses the values of the In-Reply-To & References header fields
> +    with the comment included as part of their value. A side effect of the
> +    comment not being removed is that message-ids are mismatched. These
> +    comments do not provide useful information for processing patches
> +    because they are ignored for threading and not rendered by mail readers.
> +    """
> +
> +    # Captures comments in header fields.

Firstly, I'd like to point out for other reviewers that Raxel commented
the expression this way because I told him to - if you hate it, blame
me, not him ;)

> +    comment_pattern = re.compile(r"""
> +                                \(      # The opening parenthesis of comment
> +                                [^()]*  # The contents of the comment
I *think* this is the bit that's making it not support nesting.
"Match anything besides another open- or close-paren".

https://docs.python.org/3/library/re.html tells me that Python treats
'*' as greedy by default, so wouldn't "\(.*\)" handle nested comments?
Or is there an issue that you can have more that one, e.g.

  In-Reply-To: (danica's mail) abcd1-40-8d@mail.google.com (from gnus)

in which case greedy-matching would also obliterate the actual
message-id?

This actually brings to mind that I'd like to see an example of one such
problematic line in the commit message, if you've got one handy.

> +                                \)      # The closing parenthesis of comment
> +                                """, re.X)
> +    return re.sub(comment_pattern, '', header_contents)
> +
> +
>  def sanitise_header(header_contents, header_name=None):
>      """Clean and individual mail header.
> 
> @@ -483,13 +504,13 @@ def find_references(mail):
> 
>      if 'In-Reply-To' in mail:
>          for in_reply_to in mail.get_all('In-Reply-To'):
> -            r = clean_header(in_reply_to)
> +            r = remove_rfc2822_comments(clean_header(in_reply_to))
>              if r:
>                  refs.append(r)
> 
>      if 'References' in mail:
>          for references_header in mail.get_all('References'):
> -            h = clean_header(references_header)
> +            h = remove_rfc2822_comments(clean_header(references_header))
>              if not h:
>                  continue
>              references = h.split()
> diff --git a/releasenotes/notes/issue-399-584c5be5b71dcf63.yaml
> b/releasenotes/notes/issue-399-584c5be5b71dcf63.yaml
> new file mode 100644
> index 0000000..a3e2ec1
> --- /dev/null
> +++ b/releasenotes/notes/issue-399-584c5be5b71dcf63.yaml
> @@ -0,0 +1,7 @@
> +---
> +fixes:
> +  - |
> +    Gnus users' replies would not be associated to the patch they reply to
> +    because their In-Reply-To & References field generated explanatory
> +    comments that were added to the value of the message-id parsed.
> +    (`#399 <https://github.com/getpatchwork/patchwork/issues/399>`__)

Hum. Is there a test suite we can add a regression test to for this
specific kind of line format?


Sorry not to have left any lines on the Python style - I'm the last
person you should ask. ;) Please don't consider my lack of comment to be
an approval, there.

 - Emily
Daniel Axtens June 25, 2021, 3:40 a.m. UTC | #2
Hi,

Thanks for your contribution Raxel!

>> When contributors using Gnus reply to a patch, the In-Reply-To &
>> References header fields include comments in the form described in
>> remove_rfc2822_comments spec. These comments can lead to threading
>> issues where users' replies don't get associated to the patch they
>> reply to because the added comment will be part of the message-id when
>> parsed. Following RFC2822, a comment is not supposed to affect threading
>> in any way, so we remove it. To use regex, we assume that comments are
>> not nested; further changes to more thoroughly match the RFC2822 grammar
>> remains for anyone interested.
>
> Hm, does the nesting level of the comments really matter? Or is the
> issue that they may be multiline?
>
> That is - it's pretty trivial to write a regex to match
> (foo(bar)baz((quot)meh)), as long as you don't actually care about the
> semantics of the nesting.

https://datatracker.ietf.org/doc/html/rfc2822#section-3.2.3 suggests
that comments may nest and, if I am reading the spec correctly, that
they may be multi-line: comment can contain folding white-space and
folding white-space can contain CRLF.

However, given that this is the first time comments have mattered at
all, I'd be OK to ignore multi-line comments. I would like nested
comments to work though, unless it makes a real mess of things.

>> 
>> Signed-off-by: Raxel Gutierrez <raxel@google.com>
>> Closes: #399
>> ---
>>  patchwork/parser.py                           | 25 +++++++++++++++++--
>>  .../notes/issue-399-584c5be5b71dcf63.yaml     |  7 ++++++
>>  2 files changed, 30 insertions(+), 2 deletions(-)
>>  create mode 100644 releasenotes/notes/issue-399-584c5be5b71dcf63.yaml
>> 
>> diff --git a/patchwork/parser.py b/patchwork/parser.py
>> index 61a8124..683ff55 100644
>> --- a/patchwork/parser.py
>> +++ b/patchwork/parser.py
>> @@ -70,6 +70,27 @@ def normalise_space(value):
>>      return whitespace_re.sub(' ', value).strip()
>> 
>> 
>> +def remove_rfc2822_comments(header_contents):
>> +    """Removes RFC2822 comments from header fields.
>> +
>> +    Gnus create reply emails with commments like In-Reply-To/References:
>> +    <msg-id> (User's message of Sun, 01 Jan 2012 12:34:56 +0700) [comment].
>> +    Patchwork parses the values of the In-Reply-To & References header fields
>> +    with the comment included as part of their value. A side effect of the
>> +    comment not being removed is that message-ids are mismatched. These
>> +    comments do not provide useful information for processing patches
>> +    because they are ignored for threading and not rendered by mail readers.
>> +    """
>> +
>> +    # Captures comments in header fields.
>
> Firstly, I'd like to point out for other reviewers that Raxel commented
> the expression this way because I told him to - if you hate it, blame
> me, not him ;)

If `tox -e flake8` is happy, I am happy :)

>> +    comment_pattern = re.compile(r"""
>> +                                \(      # The opening parenthesis of comment
>> +                                [^()]*  # The contents of the comment
> I *think* this is the bit that's making it not support nesting.
> "Match anything besides another open- or close-paren".
>
> https://docs.python.org/3/library/re.html tells me that Python treats
> '*' as greedy by default, so wouldn't "\(.*\)" handle nested comments?
> Or is there an issue that you can have more that one, e.g.
>
>   In-Reply-To: (danica's mail) abcd1-40-8d@mail.google.com (from gnus)
>
> in which case greedy-matching would also obliterate the actual
> message-id?
>
> This actually brings to mind that I'd like to see an example of one such
> problematic line in the commit message, if you've got one handy.

I've asked on the issue
(https://github.com/getpatchwork/patchwork/issues/399) to see if we can
get some examples. Ostensibly emacs generates them, but I use
emacs+notmuch and I don't see them so I think it might be gnus specific.

Kind Regards,
Daniel
Daniel Axtens June 25, 2021, 9:09 a.m. UTC | #3
Daniel Axtens <dja@axtens.net> writes:
> However, given that this is the first time comments have mattered at
> all, I'd be OK to ignore multi-line comments. I would like nested
> comments to work though, unless it makes a real mess of things.

Ah so it turns out I'm totally wrong here: we've been given some sample
data on the GH issue (http://www.delorie.com/tmp/patchwork-399-1.txt)
and it contains a multi-line comment:

In-Reply-To: <4574b99b-edac-d8dc-9141-79c3109d2fcc@huawei.com> (message from
 liqingqing on Thu, 1 Apr 2021 16:51:45 +0800)

I don't know if Python's email module will fold multi-line headers
automatically - it's very possible it does - or if the regex will work
over multiple lines... I lose track of which regex engine does what!

Kind regards,
Daniel

>
>>> 
>>> Signed-off-by: Raxel Gutierrez <raxel@google.com>
>>> Closes: #399
>>> ---
>>>  patchwork/parser.py                           | 25 +++++++++++++++++--
>>>  .../notes/issue-399-584c5be5b71dcf63.yaml     |  7 ++++++
>>>  2 files changed, 30 insertions(+), 2 deletions(-)
>>>  create mode 100644 releasenotes/notes/issue-399-584c5be5b71dcf63.yaml
>>> 
>>> diff --git a/patchwork/parser.py b/patchwork/parser.py
>>> index 61a8124..683ff55 100644
>>> --- a/patchwork/parser.py
>>> +++ b/patchwork/parser.py
>>> @@ -70,6 +70,27 @@ def normalise_space(value):
>>>      return whitespace_re.sub(' ', value).strip()
>>> 
>>> 
>>> +def remove_rfc2822_comments(header_contents):
>>> +    """Removes RFC2822 comments from header fields.
>>> +
>>> +    Gnus create reply emails with commments like In-Reply-To/References:
>>> +    <msg-id> (User's message of Sun, 01 Jan 2012 12:34:56 +0700) [comment].
>>> +    Patchwork parses the values of the In-Reply-To & References header fields
>>> +    with the comment included as part of their value. A side effect of the
>>> +    comment not being removed is that message-ids are mismatched. These
>>> +    comments do not provide useful information for processing patches
>>> +    because they are ignored for threading and not rendered by mail readers.
>>> +    """
>>> +
>>> +    # Captures comments in header fields.
>>
>> Firstly, I'd like to point out for other reviewers that Raxel commented
>> the expression this way because I told him to - if you hate it, blame
>> me, not him ;)
>
> If `tox -e flake8` is happy, I am happy :)
>
>>> +    comment_pattern = re.compile(r"""
>>> +                                \(      # The opening parenthesis of comment
>>> +                                [^()]*  # The contents of the comment
>> I *think* this is the bit that's making it not support nesting.
>> "Match anything besides another open- or close-paren".
>>
>> https://docs.python.org/3/library/re.html tells me that Python treats
>> '*' as greedy by default, so wouldn't "\(.*\)" handle nested comments?
>> Or is there an issue that you can have more that one, e.g.
>>
>>   In-Reply-To: (danica's mail) abcd1-40-8d@mail.google.com (from gnus)
>>
>> in which case greedy-matching would also obliterate the actual
>> message-id?
>>
>> This actually brings to mind that I'd like to see an example of one such
>> problematic line in the commit message, if you've got one handy.
>
> I've asked on the issue
> (https://github.com/getpatchwork/patchwork/issues/399) to see if we can
> get some examples. Ostensibly emacs generates them, but I use
> emacs+notmuch and I don't see them so I think it might be gnus specific.
>
> Kind Regards,
> Daniel
Raxel Gutierrez June 30, 2021, 5 p.m. UTC | #4
Thanks for the feedback! I have enough information to write a v2 now,
but I'm focusing on adding new functionality to change state in
Patchwork. I'll likely be able to write v2 within the next 2 weeks.

On Thu, Jun 24, 2021 at 2:52 PM Emily Shaffer <emilyshaffer@google.com> wrote:
> Hm, does the nesting level of the comments really matter? Or is the
> issue that they may be multiline?
>
> That is - it's pretty trivial to write a regex to match
> (foo(bar)baz((quot)meh)), as long as you don't actually care about the
> semantics of the nesting.

There are odd examples
(https://datatracker.ietf.org/doc/html/rfc2822#page-47) where nesting
can occur.

> I *think* this is the bit that's making it not support nesting.
> "Match anything besides another open- or close-paren".
>
> https://docs.python.org/3/library/re.html tells me that Python treats
> '*' as greedy by default, so wouldn't "\(.*\)" handle nested comments?
> Or is there an issue that you can have more that one, e.g.
>
>   In-Reply-To: (danica's mail) abcd1-40-8d@mail.google.com (from gnus)
>
> in which case greedy-matching would also obliterate the actual
> message-id?

I believe multiple comments are possible, especially in the context of
references where parentheses could be attached to separate
message-ids. So, I think message-ids would be cut out in those cases
by  "\(.*\)".

> This actually brings to mind that I'd like to see an example of one such
> problematic line in the commit message, if you've got one handy.

My commit message references the spec for an example of an In-Reply-To
field being problematic because of a comment.

> Hum. Is there a test suite we can add a regression test to for this
> specific kind of line format?

I will be working on adding tests for v2. Luckily, what to consider
for a comment was still being addressed :)
Raxel Gutierrez June 30, 2021, 5:04 p.m. UTC | #5
On Fri, Jun 25, 2021 at 5:09 AM Daniel Axtens <dja@axtens.net> wrote:
> Ah so it turns out I'm totally wrong here: we've been given some sample
> data on the GH issue (http://www.delorie.com/tmp/patchwork-399-1.txt)
> and it contains a multi-line comment:
>
> In-Reply-To: <4574b99b-edac-d8dc-9141-79c3109d2fcc@huawei.com> (message from
>  liqingqing on Thu, 1 Apr 2021 16:51:45 +0800)
>
> I don't know if Python's email module will fold multi-line headers
> automatically - it's very possible it does - or if the regex will work
> over multiple lines... I lose track of which regex engine does what!

Thanks for the feedback and information! I will work on adding support
for both multi-line and nested comments :) Hopefully it doesn't get
too messy.

Best,
Raxel
diff mbox series

Patch

diff --git a/patchwork/parser.py b/patchwork/parser.py
index 61a8124..683ff55 100644
--- a/patchwork/parser.py
+++ b/patchwork/parser.py
@@ -70,6 +70,27 @@  def normalise_space(value):
     return whitespace_re.sub(' ', value).strip()


+def remove_rfc2822_comments(header_contents):
+    """Removes RFC2822 comments from header fields.
+
+    Gnus create reply emails with commments like In-Reply-To/References:
+    <msg-id> (User's message of Sun, 01 Jan 2012 12:34:56 +0700) [comment].
+    Patchwork parses the values of the In-Reply-To & References header fields
+    with the comment included as part of their value. A side effect of the
+    comment not being removed is that message-ids are mismatched. These
+    comments do not provide useful information for processing patches
+    because they are ignored for threading and not rendered by mail readers.
+    """
+
+    # Captures comments in header fields.
+    comment_pattern = re.compile(r"""
+                                \(      # The opening parenthesis of comment
+                                [^()]*  # The contents of the comment
+                                \)      # The closing parenthesis of comment
+                                """, re.X)
+    return re.sub(comment_pattern, '', header_contents)
+
+
 def sanitise_header(header_contents, header_name=None):
     """Clean and individual mail header.

@@ -483,13 +504,13 @@  def find_references(mail):

     if 'In-Reply-To' in mail:
         for in_reply_to in mail.get_all('In-Reply-To'):
-            r = clean_header(in_reply_to)
+            r = remove_rfc2822_comments(clean_header(in_reply_to))
             if r:
                 refs.append(r)

     if 'References' in mail:
         for references_header in mail.get_all('References'):
-            h = clean_header(references_header)
+            h = remove_rfc2822_comments(clean_header(references_header))
             if not h:
                 continue
             references = h.split()
diff --git a/releasenotes/notes/issue-399-584c5be5b71dcf63.yaml
b/releasenotes/notes/issue-399-584c5be5b71dcf63.yaml
new file mode 100644
index 0000000..a3e2ec1
--- /dev/null
+++ b/releasenotes/notes/issue-399-584c5be5b71dcf63.yaml
@@ -0,0 +1,7 @@ 
+---
+fixes:
+  - |
+    Gnus users' replies would not be associated to the patch they reply to
+    because their In-Reply-To & References field generated explanatory
+    comments that were added to the value of the message-id parsed.
+    (`#399 <https://github.com/getpatchwork/patchwork/issues/399>`__)