diff mbox series

Revert "parser: Ensure whitespace is stripped for longheaders"

Message ID 20190514061125.13637-1-dja@axtens.net
State Accepted
Headers show
Series Revert "parser: Ensure whitespace is stripped for longheaders" | expand

Commit Message

Daniel Axtens May 14, 2019, 6:11 a.m. UTC
This reverts commit 841f966b8d54b2f51ab1c498eed6e5391f2546a9.

In July 2018, we received a report of OzLabs patchwork mangling
emails that have subjects containing words with internal commas,
like "Insert DT binding for foo,bar" (#197).

Stephen took a look and came up with the comment this reverts. Quoting
the commit message:

    RFC2822 states that long headers can be wrapped using CRLF followed by
    WSP [1]. For example:

        Subject: Foo bar,
         baz

    Should be parsed as:

        Foo bar,baz

As it turns out, this is not the case. Journey with me to
section 2.2.3 of RFC 2822:

2.2.3. Long Header Fields

   Each header field is logically a single line of characters comprising
   the field name, the colon, and the field body.  For convenience
   however, and to deal with the 998/78 character limitations per line,
   the field body portion of a header field can be split into a multiple
   line representation; this is called "folding".  The general rule is
   that wherever this standard allows for folding white space (not
   simply WSP characters), a CRLF may be inserted before any WSP.  For
   example, the header field:

           Subject: This is a test

   can be represented as:

           Subject: This
            is a test

So the issue with the example in the reverted commit is that
there is no folding white space in "bar,baz", so it's not valid
to split it.

These are valid:
        Subject: Foo bar,baz
        Subject: Foo
	 bar,baz

but splitting "bar,baz" into "bar,\n baz" is not valid.

What then is correct unfolding behaviour? Quoting the RFC again:

   The process of moving from this folded multiple-line representation
   of a header field to its single line representation is called
   "unfolding". Unfolding is accomplished by simply removing any CRLF
   that is immediately followed by WSP.  Each header field should be
   treated in its unfolded form for further syntactic and semantic
   evaluation.

In other words, the unfolding rule requires you to strip the CRLF,
but it does not permit you to strip the WSP. Indeed, if "bar,\n baz"
is received, the correct unfolding is "bar, baz".

If you do strip the WSP, you end up mashing words together, such
as in https://patchwork.ozlabs.org/patch/1097852/

So revert the commit, restoring original behaviour, but keep a
corrected version of the test.

This presents a big question though: how did Rob's email up with
a mangled subject line?

To answer this question, you end up having to learn about OzLabs
Patchwork and how it differs from Patchwork the project.

OzLabs Patchwork (patchwork.ozlabs.org) is an installation of
Patchwork. Part of what makes it so useful for so many projects is
a little intervening layer that can massage some mail to make it
end up in the right project. Email that lands in the device tree
project is an example of email that goes through this process.
I only learned about this today and I haven't looked in any detail
at precisely what is done to the mail. The script is not part of the
Patchwork project.

This intervening filter is a Python script that runs - and this
is an important detail - in Python 2.7.

Ignoring all the details, the filter basically operates in a pipe
between the mail program and patchwork's parsemail, like

  (mail from system) | filter.py | parsemail

At it's very simplest, filter.py acts as follows:

 import email
 import sys
 mail = email.parse_from_file(sys.stdin)
 sys.stdout.write(mail.as_string())

Fascinatingly, if you take Rob's email from #197 and put it through
this process, you can see that it is getting mangled:

Before:
Subject: [PATCH v2 3/4] dt-bindings: sound: wm8994: document wlf,csnaddr-pd property

After:
Subject: [PATCH v2 3/4] dt-bindings: sound: wm8994: document wlf,
 csnaddr-pd property

You can see that python27 has incorrectly wrapped the header, breaking
where there is not a foldable space. Python3 does not have this issue.

To summarise:
 - part of the magic of OzLabs PW is a filter to make sure mail gets
   to the right place. This isn't part of the Patchwork project and
   so is usually invisible to patchwork developers.

 - the filter is written in python27. The email module in py27 has
   a bug that incorrectly breaks subjects around commas within words.

 - patchwork correctly unfolds those broken subjects with a space
   after the comma.

 - thr extra space was interpreted as a bug in patchwork, leading to
   a misinterpretation of the spec to strip out the whitespace that
   was believed to be in error.

 - that broke other wrapped subjects.

To solve this, revert the commit and I'll work with jk to get the
filter script into py3 compatibility. (Given that py27 sunsets in
~7mo, trying to fix it is not worth it.)

Closes: #197
CC: stable
---
 patchwork/parser.py                                | 1 -
 patchwork/tests/test_parser.py                     | 2 +-
 releasenotes/notes/issue-197-4f7594db1e4c9887.yaml | 9 +++++----
 3 files changed, 6 insertions(+), 6 deletions(-)

Comments

Stephen Finucane May 14, 2019, 2:57 p.m. UTC | #1
On Tue, 2019-05-14 at 16:11 +1000, Daniel Axtens wrote:
> This reverts commit 841f966b8d54b2f51ab1c498eed6e5391f2546a9.
> 
> In July 2018, we received a report of OzLabs patchwork mangling
> emails that have subjects containing words with internal commas,
> like "Insert DT binding for foo,bar" (#197).
> 
> Stephen took a look and came up with the comment this reverts. Quoting
> the commit message:
> 
>     RFC2822 states that long headers can be wrapped using CRLF followed by
>     WSP [1]. For example:
> 
>         Subject: Foo bar,
>          baz
> 
>     Should be parsed as:
> 
>         Foo bar,baz
> 
> As it turns out, this is not the case. Journey with me to
> section 2.2.3 of RFC 2822:
> 
> 2.2.3. Long Header Fields
> 
>    Each header field is logically a single line of characters comprising
>    the field name, the colon, and the field body.  For convenience
>    however, and to deal with the 998/78 character limitations per line,
>    the field body portion of a header field can be split into a multiple
>    line representation; this is called "folding".  The general rule is
>    that wherever this standard allows for folding white space (not
>    simply WSP characters), a CRLF may be inserted before any WSP.  For
>    example, the header field:
> 
>            Subject: This is a test
> 
>    can be represented as:
> 
>            Subject: This
>             is a test
> 
> So the issue with the example in the reverted commit is that
> there is no folding white space in "bar,baz", so it's not valid
> to split it.
> 
> These are valid:
>         Subject: Foo bar,baz
>         Subject: Foo
> 	 bar,baz
> 
> but splitting "bar,baz" into "bar,\n baz" is not valid.
> 
> What then is correct unfolding behaviour? Quoting the RFC again:
> 
>    The process of moving from this folded multiple-line representation
>    of a header field to its single line representation is called
>    "unfolding". Unfolding is accomplished by simply removing any CRLF
>    that is immediately followed by WSP.  Each header field should be
>    treated in its unfolded form for further syntactic and semantic
>    evaluation.
> 
> In other words, the unfolding rule requires you to strip the CRLF,
> but it does not permit you to strip the WSP. Indeed, if "bar,\n baz"
> is received, the correct unfolding is "bar, baz".
> 
> If you do strip the WSP, you end up mashing words together, such
> as in https://patchwork.ozlabs.org/patch/1097852/
> 
> So revert the commit, restoring original behaviour, but keep a
> corrected version of the test.
> 
> This presents a big question though: how did Rob's email up with
> a mangled subject line?
> 
> To answer this question, you end up having to learn about OzLabs
> Patchwork and how it differs from Patchwork the project.
> 
> OzLabs Patchwork (patchwork.ozlabs.org) is an installation of
> Patchwork. Part of what makes it so useful for so many projects is
> a little intervening layer that can massage some mail to make it
> end up in the right project. Email that lands in the device tree
> project is an example of email that goes through this process.
> I only learned about this today and I haven't looked in any detail
> at precisely what is done to the mail. The script is not part of the
> Patchwork project.
> 
> This intervening filter is a Python script that runs - and this
> is an important detail - in Python 2.7.
> 
> Ignoring all the details, the filter basically operates in a pipe
> between the mail program and patchwork's parsemail, like
> 
>   (mail from system) | filter.py | parsemail
> 
> At it's very simplest, filter.py acts as follows:
> 
>  import email
>  import sys
>  mail = email.parse_from_file(sys.stdin)
>  sys.stdout.write(mail.as_string())
> 
> Fascinatingly, if you take Rob's email from #197 and put it through
> this process, you can see that it is getting mangled:
> 
> Before:
> Subject: [PATCH v2 3/4] dt-bindings: sound: wm8994: document wlf,csnaddr-pd property
> 
> After:
> Subject: [PATCH v2 3/4] dt-bindings: sound: wm8994: document wlf,
>  csnaddr-pd property
> 
> You can see that python27 has incorrectly wrapped the header, breaking
> where there is not a foldable space. Python3 does not have this issue.
> 
> To summarise:
>  - part of the magic of OzLabs PW is a filter to make sure mail gets
>    to the right place. This isn't part of the Patchwork project and
>    so is usually invisible to patchwork developers.
> 
>  - the filter is written in python27. The email module in py27 has
>    a bug that incorrectly breaks subjects around commas within words.
> 
>  - patchwork correctly unfolds those broken subjects with a space
>    after the comma.
> 
>  - thr extra space was interpreted as a bug in patchwork, leading to
>    a misinterpretation of the spec to strip out the whitespace that
>    was believed to be in error.
> 
>  - that broke other wrapped subjects.
> 
> To solve this, revert the commit and I'll work with jk to get the
> filter script into py3 compatibility. (Given that py27 sunsets in
> ~7mo, trying to fix it is not worth it.)
> 
> Closes: #197
> CC: stable

Thanks for the detailed overview. I dropped the changes to the release
note and simply added a new one because this revert won't appear in
2.1.2, for example, so the release note for that shouldn't be changed.
Otherwise all good though so I've applied this to both master and
stable/2.1.

Stephen

> ---
>  patchwork/parser.py                                | 1 -
>  patchwork/tests/test_parser.py                     | 2 +-
>  releasenotes/notes/issue-197-4f7594db1e4c9887.yaml | 9 +++++----
>  3 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/patchwork/parser.py b/patchwork/parser.py
> index 712780a498c4..91e9920c8782 100644
> --- a/patchwork/parser.py
> +++ b/patchwork/parser.py
> @@ -47,7 +47,6 @@ class DuplicateMailError(Exception):
>  
>  
>  def normalise_space(value):
> -    value = ''.join(re.split(r'\n\s+', value))
>      whitespace_re = re.compile(r'\s+')
>      return whitespace_re.sub(' ', value).strip()
>  
> diff --git a/patchwork/tests/test_parser.py b/patchwork/tests/test_parser.py
> index ddbcf5b15a19..f18220298078 100644
> --- a/patchwork/tests/test_parser.py
> +++ b/patchwork/tests/test_parser.py
> @@ -838,7 +838,7 @@ class SubjectTest(TestCase):
>          self.assertEqual(clean_subject("[PATCH] meep \n meep"),
>                           ('meep meep', []))
>          self.assertEqual(clean_subject("[PATCH] meep,\n meep"),
> -                         ('meep,meep', []))
> +                         ('meep, meep', []))
>          self.assertEqual(clean_subject('[PATCH RFC] meep'),
>                           ('[RFC] meep', ['RFC']))
>          self.assertEqual(clean_subject('[PATCH,RFC] meep'),
> diff --git a/releasenotes/notes/issue-197-4f7594db1e4c9887.yaml b/releasenotes/notes/issue-197-4f7594db1e4c9887.yaml
> index 2777fbc2f85b..41b86c064b8a 100644
> --- a/releasenotes/notes/issue-197-4f7594db1e4c9887.yaml
> +++ b/releasenotes/notes/issue-197-4f7594db1e4c9887.yaml
> @@ -1,7 +1,8 @@
>  ---
>  fixes:
>    - |
> -    Long headers can be wrapped using CRLF followed by WSP (whitespace). This
> -    whitespace was not being stripped, resulting in errant whitespace being
> -    saved for the patch subject. This is resolved though existing patches and
> -    cover letters will need to be updated manually.
> +    Long headers can be wrapped using CRLF followed by WSP (whitespace). There
> +    was an incorrect fix that would lead to whitespace being stripped where it
> +    shouldn't be, resulting in words being stuck together (likethis). This is
> +    resolved, though existing patches and cover letters will need to be
> +    updated manually.
Daniel Axtens May 14, 2019, 4:23 p.m. UTC | #2
Stephen Finucane <stephen@that.guru> writes:

> On Tue, 2019-05-14 at 16:11 +1000, Daniel Axtens wrote:
>> This reverts commit 841f966b8d54b2f51ab1c498eed6e5391f2546a9.
>> 
>> In July 2018, we received a report of OzLabs patchwork mangling
>> emails that have subjects containing words with internal commas,
>> like "Insert DT binding for foo,bar" (#197).
>> 
>> Stephen took a look and came up with the comment this reverts. Quoting
>> the commit message:
>> 
>>     RFC2822 states that long headers can be wrapped using CRLF followed by
>>     WSP [1]. For example:
>> 
>>         Subject: Foo bar,
>>          baz
>> 
>>     Should be parsed as:
>> 
>>         Foo bar,baz
>> 
>> As it turns out, this is not the case. Journey with me to
>> section 2.2.3 of RFC 2822:
>> 
>> 2.2.3. Long Header Fields
>> 
>>    Each header field is logically a single line of characters comprising
>>    the field name, the colon, and the field body.  For convenience
>>    however, and to deal with the 998/78 character limitations per line,
>>    the field body portion of a header field can be split into a multiple
>>    line representation; this is called "folding".  The general rule is
>>    that wherever this standard allows for folding white space (not
>>    simply WSP characters), a CRLF may be inserted before any WSP.  For
>>    example, the header field:
>> 
>>            Subject: This is a test
>> 
>>    can be represented as:
>> 
>>            Subject: This
>>             is a test
>> 
>> So the issue with the example in the reverted commit is that
>> there is no folding white space in "bar,baz", so it's not valid
>> to split it.
>> 
>> These are valid:
>>         Subject: Foo bar,baz
>>         Subject: Foo
>> 	 bar,baz
>> 
>> but splitting "bar,baz" into "bar,\n baz" is not valid.
>> 
>> What then is correct unfolding behaviour? Quoting the RFC again:
>> 
>>    The process of moving from this folded multiple-line representation
>>    of a header field to its single line representation is called
>>    "unfolding". Unfolding is accomplished by simply removing any CRLF
>>    that is immediately followed by WSP.  Each header field should be
>>    treated in its unfolded form for further syntactic and semantic
>>    evaluation.
>> 
>> In other words, the unfolding rule requires you to strip the CRLF,
>> but it does not permit you to strip the WSP. Indeed, if "bar,\n baz"
>> is received, the correct unfolding is "bar, baz".
>> 
>> If you do strip the WSP, you end up mashing words together, such
>> as in https://patchwork.ozlabs.org/patch/1097852/
>> 
>> So revert the commit, restoring original behaviour, but keep a
>> corrected version of the test.
>> 
>> This presents a big question though: how did Rob's email up with
>> a mangled subject line?
>> 
>> To answer this question, you end up having to learn about OzLabs
>> Patchwork and how it differs from Patchwork the project.
>> 
>> OzLabs Patchwork (patchwork.ozlabs.org) is an installation of
>> Patchwork. Part of what makes it so useful for so many projects is
>> a little intervening layer that can massage some mail to make it
>> end up in the right project. Email that lands in the device tree
>> project is an example of email that goes through this process.
>> I only learned about this today and I haven't looked in any detail
>> at precisely what is done to the mail. The script is not part of the
>> Patchwork project.
>> 
>> This intervening filter is a Python script that runs - and this
>> is an important detail - in Python 2.7.
>> 
>> Ignoring all the details, the filter basically operates in a pipe
>> between the mail program and patchwork's parsemail, like
>> 
>>   (mail from system) | filter.py | parsemail
>> 
>> At it's very simplest, filter.py acts as follows:
>> 
>>  import email
>>  import sys
>>  mail = email.parse_from_file(sys.stdin)
>>  sys.stdout.write(mail.as_string())
>> 
>> Fascinatingly, if you take Rob's email from #197 and put it through
>> this process, you can see that it is getting mangled:
>> 
>> Before:
>> Subject: [PATCH v2 3/4] dt-bindings: sound: wm8994: document wlf,csnaddr-pd property
>> 
>> After:
>> Subject: [PATCH v2 3/4] dt-bindings: sound: wm8994: document wlf,
>>  csnaddr-pd property
>> 
>> You can see that python27 has incorrectly wrapped the header, breaking
>> where there is not a foldable space. Python3 does not have this issue.
>> 
>> To summarise:
>>  - part of the magic of OzLabs PW is a filter to make sure mail gets
>>    to the right place. This isn't part of the Patchwork project and
>>    so is usually invisible to patchwork developers.
>> 
>>  - the filter is written in python27. The email module in py27 has
>>    a bug that incorrectly breaks subjects around commas within words.
>> 
>>  - patchwork correctly unfolds those broken subjects with a space
>>    after the comma.
>> 
>>  - thr extra space was interpreted as a bug in patchwork, leading to
>>    a misinterpretation of the spec to strip out the whitespace that
>>    was believed to be in error.
>> 
>>  - that broke other wrapped subjects.
>> 
>> To solve this, revert the commit and I'll work with jk to get the
>> filter script into py3 compatibility. (Given that py27 sunsets in
>> ~7mo, trying to fix it is not worth it.)
>> 
>> Closes: #197
>> CC: stable
>
> Thanks for the detailed overview. I dropped the changes to the release
> note and simply added a new one because this revert won't appear in
> 2.1.2, for example, so the release note for that shouldn't be changed.
> Otherwise all good though so I've applied this to both master and
> stable/2.1.
>

Thanks - release notes are still something of a mystery to me so thanks
for fixing that up.

I'll talk to the OzLabs crew about whether they want a 2.1.3 with this
patch or if they're happy to just carry it for a while.

Regards,
Daniel


> Stephen
>
>> ---
>>  patchwork/parser.py                                | 1 -
>>  patchwork/tests/test_parser.py                     | 2 +-
>>  releasenotes/notes/issue-197-4f7594db1e4c9887.yaml | 9 +++++----
>>  3 files changed, 6 insertions(+), 6 deletions(-)
>> 
>> diff --git a/patchwork/parser.py b/patchwork/parser.py
>> index 712780a498c4..91e9920c8782 100644
>> --- a/patchwork/parser.py
>> +++ b/patchwork/parser.py
>> @@ -47,7 +47,6 @@ class DuplicateMailError(Exception):
>>  
>>  
>>  def normalise_space(value):
>> -    value = ''.join(re.split(r'\n\s+', value))
>>      whitespace_re = re.compile(r'\s+')
>>      return whitespace_re.sub(' ', value).strip()
>>  
>> diff --git a/patchwork/tests/test_parser.py b/patchwork/tests/test_parser.py
>> index ddbcf5b15a19..f18220298078 100644
>> --- a/patchwork/tests/test_parser.py
>> +++ b/patchwork/tests/test_parser.py
>> @@ -838,7 +838,7 @@ class SubjectTest(TestCase):
>>          self.assertEqual(clean_subject("[PATCH] meep \n meep"),
>>                           ('meep meep', []))
>>          self.assertEqual(clean_subject("[PATCH] meep,\n meep"),
>> -                         ('meep,meep', []))
>> +                         ('meep, meep', []))
>>          self.assertEqual(clean_subject('[PATCH RFC] meep'),
>>                           ('[RFC] meep', ['RFC']))
>>          self.assertEqual(clean_subject('[PATCH,RFC] meep'),
>> diff --git a/releasenotes/notes/issue-197-4f7594db1e4c9887.yaml b/releasenotes/notes/issue-197-4f7594db1e4c9887.yaml
>> index 2777fbc2f85b..41b86c064b8a 100644
>> --- a/releasenotes/notes/issue-197-4f7594db1e4c9887.yaml
>> +++ b/releasenotes/notes/issue-197-4f7594db1e4c9887.yaml
>> @@ -1,7 +1,8 @@
>>  ---
>>  fixes:
>>    - |
>> -    Long headers can be wrapped using CRLF followed by WSP (whitespace). This
>> -    whitespace was not being stripped, resulting in errant whitespace being
>> -    saved for the patch subject. This is resolved though existing patches and
>> -    cover letters will need to be updated manually.
>> +    Long headers can be wrapped using CRLF followed by WSP (whitespace). There
>> +    was an incorrect fix that would lead to whitespace being stripped where it
>> +    shouldn't be, resulting in words being stuck together (likethis). This is
>> +    resolved, though existing patches and cover letters will need to be
>> +    updated manually.
diff mbox series

Patch

diff --git a/patchwork/parser.py b/patchwork/parser.py
index 712780a498c4..91e9920c8782 100644
--- a/patchwork/parser.py
+++ b/patchwork/parser.py
@@ -47,7 +47,6 @@  class DuplicateMailError(Exception):
 
 
 def normalise_space(value):
-    value = ''.join(re.split(r'\n\s+', value))
     whitespace_re = re.compile(r'\s+')
     return whitespace_re.sub(' ', value).strip()
 
diff --git a/patchwork/tests/test_parser.py b/patchwork/tests/test_parser.py
index ddbcf5b15a19..f18220298078 100644
--- a/patchwork/tests/test_parser.py
+++ b/patchwork/tests/test_parser.py
@@ -838,7 +838,7 @@  class SubjectTest(TestCase):
         self.assertEqual(clean_subject("[PATCH] meep \n meep"),
                          ('meep meep', []))
         self.assertEqual(clean_subject("[PATCH] meep,\n meep"),
-                         ('meep,meep', []))
+                         ('meep, meep', []))
         self.assertEqual(clean_subject('[PATCH RFC] meep'),
                          ('[RFC] meep', ['RFC']))
         self.assertEqual(clean_subject('[PATCH,RFC] meep'),
diff --git a/releasenotes/notes/issue-197-4f7594db1e4c9887.yaml b/releasenotes/notes/issue-197-4f7594db1e4c9887.yaml
index 2777fbc2f85b..41b86c064b8a 100644
--- a/releasenotes/notes/issue-197-4f7594db1e4c9887.yaml
+++ b/releasenotes/notes/issue-197-4f7594db1e4c9887.yaml
@@ -1,7 +1,8 @@ 
 ---
 fixes:
   - |
-    Long headers can be wrapped using CRLF followed by WSP (whitespace). This
-    whitespace was not being stripped, resulting in errant whitespace being
-    saved for the patch subject. This is resolved though existing patches and
-    cover letters will need to be updated manually.
+    Long headers can be wrapped using CRLF followed by WSP (whitespace). There
+    was an incorrect fix that would lead to whitespace being stripped where it
+    shouldn't be, resulting in words being stuck together (likethis). This is
+    resolved, though existing patches and cover letters will need to be
+    updated manually.