[3/6] parser: parse headers containing invalid characters

Message ID	1474589177-10328-4-git-send-email-dja@axtens.net
State	Superseded
Headers	show Return-Path: <patchwork-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org> From: Daniel Axtens <dja@axtens.net> To: patchwork@lists.ozlabs.org Subject: [PATCH 3/6] parser: parse headers containing invalid characters Date: Fri, 23 Sep 2016 10:06:14 +1000 Message-Id: <1474589177-10328-4-git-send-email-dja@axtens.net> In-Reply-To: <1474589177-10328-1-git-send-email-dja@axtens.net> References: <1474589177-10328-1-git-send-email-dja@axtens.net> Precedence: list Cc: andrew.donnellan@au1.ibm.com MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: patchwork-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org Sender: "Patchwork" <patchwork-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org>

Message ID

1474589177-10328-4-git-send-email-dja@axtens.net

State

Superseded

Headers

From: Daniel Axtens <dja@axtens.net>
To: patchwork@lists.ozlabs.org
Subject: [PATCH 3/6] parser: parse headers containing invalid characters
Date: Fri, 23 Sep 2016 10:06:14 +1000
Message-Id: <1474589177-10328-4-git-send-email-dja@axtens.net>
In-Reply-To: <1474589177-10328-1-git-send-email-dja@axtens.net>
References: <1474589177-10328-1-git-send-email-dja@axtens.net>
Precedence: list
Cc: andrew.donnellan@au1.ibm.com
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: patchwork-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org
Sender: "Patchwork"
	<patchwork-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org>

Commit Message

Daniel Axtens Sept. 23, 2016, 12:06 a.m. UTC

If there is a non-ascii character in a header, parsing fails,
even on Py27.

This has huge Py2/Py3 complexities. The Py3 email package has tools
to handle this - we just need to use them. Py2, on the other hand,
needs a lot of hand-holding, as explained in the comments.

This is handy for mails with malformed headers containing weird
bytes.

Reported-by: Thomas Monjalon <thomas.monjalon@6wind.com>
Suggested-by: Stephen Finucane <stephenfinucane@hotmail.com>
Signed-off-by: Daniel Axtens <dja@axtens.net>

---

Many thanks to Thomas for his help debugging this, and to Stephen
for a much better patch.

This should probably go to a stable branch too. We'll need to start
some discussion about how to handle bug fixes for people not running
git mainline (like ozlabs.org and kernel.org).
---
 patchwork/parser.py | 65 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 60 insertions(+), 5 deletions(-)

Comments

Stephen Finucane Sept. 25, 2016, 3:09 p.m. UTC | #1

On 23 Sep 10:06, Daniel Axtens wrote:
> If there is a non-ascii character in a header, parsing fails,
> even on Py27.
> 
> This has huge Py2/Py3 complexities. The Py3 email package has tools
> to handle this - we just need to use them. Py2, on the other hand,
> needs a lot of hand-holding, as explained in the comments.
> 
> This is handy for mails with malformed headers containing weird
> bytes.
> 
> Reported-by: Thomas Monjalon <thomas.monjalon@6wind.com>
> Suggested-by: Stephen Finucane <stephenfinucane@hotmail.com>
> Signed-off-by: Daniel Axtens <dja@axtens.net>
> 
> ---
> 
> Many thanks to Thomas for his help debugging this, and to Stephen
> for a much better patch.
> 
> This should probably go to a stable branch too. We'll need to start
> some discussion about how to handle bug fixes for people not running
> git mainline (like ozlabs.org and kernel.org).
> ---
>  patchwork/parser.py | 65 ++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 60 insertions(+), 5 deletions(-)
> 
> diff --git a/patchwork/parser.py b/patchwork/parser.py
> index 3389e96c4f3e..1b4cab1eb1a8 100644
> --- a/patchwork/parser.py
> +++ b/patchwork/parser.py
> @@ -21,7 +21,7 @@
>  
>  import codecs
>  import datetime
> -from email.header import Header, decode_header
> +from email.header import decode_header, make_header
>  from email.utils import parsedate_tz, mktime_tz
>  from fnmatch import fnmatch
>  from functools import reduce
> @@ -155,10 +155,65 @@ def find_date(mail):
>  
>  
>  def find_headers(mail):
> -    return reduce(operator.__concat__,
> -                  ['%s: %s\n' % (k, Header(v, header_name=k,
> -                                           continuation_ws='\t').encode())
> -                   for (k, v) in list(mail.items())])
> +    # We have some Py2/Py3 issues here.
> +    #
> +    # Firstly, the email parser (before we get here)
> +    # Python 3: headers with weird chars are email.header.Header
> +    #           class, others as str
> +    # Python 2: every header is an str
> +    #
> +    # Secondly, the behaviour of decode_header:
> +    # Python 3: weird headers are labelled as unknown-8bit
> +    # Python 2: weird headers are not labelled differently, causing an
> +    #           encoding issue when Py2 goes to then encode it in
> +    #           make_header.
> +    #
> +    # Lastly, making matters worse, in Python2, unknown-8bit doesn't
> +    # seem to be supported as an input to make_header, so not only do
> +    # we have to detect dodgy headers, we have to fix them ourselves.
> +    #
> +    # We solve this by catching any Unicode errors, and then manually
> +    # handling any interesting headers. This will only happen under
> +    # Py2.
> +
> +    headers = {key: decode_header(value) for key, value in
> +               mail.items()}
> +
> +    strings = []
> +    for key, value in headers.items():
> +        try:
> +            header = make_header(value,
> +                                 header_name=key,
> +                                 continuation_ws='\t')
> +        except UnicodeDecodeError:
> +            # We should only get here under Python 2.
> +            # At least one of the parts cannot be encoded as ascii.
> +            # Find out which one and fix it somehow.
> +            new_value=[]
> +            for (part, coding) in value:
> +                if (coding is not None or
> +                    all([ord(x) in range(128) for x in part])):
> +
> +                    # we either have a coding hint or it's all ascii
> +                    # let it through unchanged.
> +                    # TODO: handle invalid data with a coding hint
> +                    new_value += [(part, coding)]
> +                else:
> +                    # We have random bytes that aren't properly coded.
> +                    # We should do the unknown-8bit coding ourselves.
> +                    # For now, we're just going to replace any dubious
> +                    # chars with ?. TODO: replace it with a proper QP
> +                    # unknown-8bit codec.
> +                    new_value += [(part.decode('ascii', errors='replace')
> +                                   .encode('ascii', errors='replace'),
> +                                   None)]
> +            header = make_header(new_value,
> +                                 header_name=key,
> +                                 continuation_ws='\t')
> +        finally:
> +            strings += ['%s: %s' % (key, header.encode())]
> +
> +    return '\n'.join(strings)

Excellent work: this works well and the tests are much appreciated :)
I spent some time reviewing this and I have one question: should we
extend this to other header parsing, such as 'Subject'? I ask because I
noticed that we have a 'clean_header' function which is already used to
handle some unicode headers and it should probably handle things like
invalid characters too. I did some hacking to see if it could slot in
place of the above changes:

     def find_headers(mail):
    +    return '\n'.join(['%s: %s' % (key, clean_header(value)) for key, value
    +                      in mail.items()])
         # We have some Py2/Py3 issues here.

On running this, this failed with a LookupError on Python 3 and the
UnicodeDecodeError on Python 2, so I tried to handle these:

     def clean_header(header):
         """Decode (possibly non-ascii) headers."""
         def decode(fragment):
    -        (frag_str, frag_encoding) = fragment
    +        frag_str, frag_encoding = fragment
             if frag_encoding:
    -            return frag_str.decode(frag_encoding)
    +            if frag_encoding != 'unknown-8bit':
    +                return frag_str.decode(frag_encoding)
    +            else:
    +                return frag_str.decode('ascii', errors='replace')
             elif isinstance(frag_str, six.binary_type):  # python 2
    -            return frag_str.decode()
    +            try:
    +                return frag_str.decode()
    +            except UnicodeDecodeError:
    +                return frag_str.decode('ascii', errors='replace')
    +
             return frag_str

         fragments = [decode(x) for x in decode_header(header)]

When I make this change, all tests pass. Perhaps we should go about
integrating your changes in this function and adding tests for things
like invalid subjects lines to make sure this does what it says on the
tin?

Stephen

Daniel Axtens Sept. 25, 2016, 10:20 p.m. UTC | #2

Hi Stephen,

> Excellent work: this works well and the tests are much appreciated :)
> I spent some time reviewing this and I have one question: should we
> extend this to other header parsing, such as 'Subject'? I ask because I
> noticed that we have a 'clean_header' function which is already used to
> handle some unicode headers and it should probably handle things like
> invalid characters too. I did some hacking to see if it could slot in
> place of the above changes:
>
>      def find_headers(mail):
>     +    return '\n'.join(['%s: %s' % (key, clean_header(value)) for key, value
>     +                      in mail.items()])
>          # We have some Py2/Py3 issues here.

Oooh, good catch. We might need to try to integreate make_header because
it deals with wrapping lines correctly, but that should be pretty easy.

> On running this, this failed with a LookupError on Python 3 and the
> UnicodeDecodeError on Python 2, so I tried to handle these:
>
>      def clean_header(header):
>          """Decode (possibly non-ascii) headers."""
>          def decode(fragment):
>     -        (frag_str, frag_encoding) = fragment
>     +        frag_str, frag_encoding = fragment
>              if frag_encoding:
>     -            return frag_str.decode(frag_encoding)
>     +            if frag_encoding != 'unknown-8bit':
>     +                return frag_str.decode(frag_encoding)
>     +            else:
>     +                return frag_str.decode('ascii', errors='replace')
>              elif isinstance(frag_str, six.binary_type):  # python 2
>     -            return frag_str.decode()
>     +            try:
>     +                return frag_str.decode()
>     +            except UnicodeDecodeError:
>     +                return frag_str.decode('ascii', errors='replace')
>     +
>              return frag_str
>
>          fragments = [decode(x) for x in decode_header(header)]
>
> When I make this change, all tests pass. Perhaps we should go about
> integrating your changes in this function and adding tests for things
> like invalid subjects lines to make sure this does what it says on the
> tin?

Yep. So conformant email clients should produce email headers that are
7-bit ascii, with UTF-8/other charsets encoded using quoted
printables. But there will certainly be those that don't, as we've
discovered. I'll add some tests for a variety of headers with invalid
characters and integrate your revision of this for v2.

Regards,
Daniel

> Stephen

Daniel Axtens Sept. 28, 2016, 3:27 a.m. UTC | #3

Hi Stephen,

>
> Excellent work: this works well and the tests are much appreciated :)
> I spent some time reviewing this and I have one question: should we
> extend this to other header parsing, such as 'Subject'? I ask because I
> noticed that we have a 'clean_header' function which is already used to
> handle some unicode headers and it should probably handle things like
> invalid characters too. I did some hacking to see if it could slot in
> place of the above changes:
>
>      def find_headers(mail):
>     +    return '\n'.join(['%s: %s' % (key, clean_header(value)) for key, value
>     +                      in mail.items()])
>          # We have some Py2/Py3 issues here.
>
> On running this, this failed with a LookupError on Python 3 and the
> UnicodeDecodeError on Python 2, so I tried to handle these:
>
>      def clean_header(header):
>          """Decode (possibly non-ascii) headers."""
>          def decode(fragment):
>     -        (frag_str, frag_encoding) = fragment
>     +        frag_str, frag_encoding = fragment
>              if frag_encoding:
>     -            return frag_str.decode(frag_encoding)
>     +            if frag_encoding != 'unknown-8bit':
>     +                return frag_str.decode(frag_encoding)
>     +            else:
>     +                return frag_str.decode('ascii', errors='replace')
>              elif isinstance(frag_str, six.binary_type):  # python 2
>     -            return frag_str.decode()
>     +            try:
>     +                return frag_str.decode()
>     +            except UnicodeDecodeError:
>     +                return frag_str.decode('ascii', errors='replace')
>     +
>              return frag_str
>
>          fragments = [decode(x) for x in decode_header(header)]
>
> When I make this change, all tests pass. Perhaps we should go about
> integrating your changes in this function and adding tests for things
> like invalid subjects lines to make sure this does what it says on the
> tin?

So it turns out that:

- we need the change to clean_header() you identified to properly handle
  invalid characters in fields that are parsed by clean_header()

- clean_header() returns a decoded (8-bit) string, without
  wrapping. find_headers() on the other hand, needs to return encoded
  (7-bit) wrapped header lines.

I'll probably add a helper function that deals with the decoding
properly, then I'll pipe that into both clean_header() and
find_header().
    
Regards,
Daniel

diff --git a/patchwork/parser.py b/patchwork/parser.py
index 3389e96c4f3e..1b4cab1eb1a8 100644
--- a/patchwork/parser.py
+++ b/patchwork/parser.py
@@ -21,7 +21,7 @@ 
 
 import codecs
 import datetime
-from email.header import Header, decode_header
+from email.header import decode_header, make_header
 from email.utils import parsedate_tz, mktime_tz
 from fnmatch import fnmatch
 from functools import reduce
@@ -155,10 +155,65 @@  def find_date(mail):
 
 
 def find_headers(mail):
-    return reduce(operator.__concat__,
-                  ['%s: %s\n' % (k, Header(v, header_name=k,
-                                           continuation_ws='\t').encode())
-                   for (k, v) in list(mail.items())])
+    # We have some Py2/Py3 issues here.
+    #
+    # Firstly, the email parser (before we get here)
+    # Python 3: headers with weird chars are email.header.Header
+    #           class, others as str
+    # Python 2: every header is an str
+    #
+    # Secondly, the behaviour of decode_header:
+    # Python 3: weird headers are labelled as unknown-8bit
+    # Python 2: weird headers are not labelled differently, causing an
+    #           encoding issue when Py2 goes to then encode it in
+    #           make_header.
+    #
+    # Lastly, making matters worse, in Python2, unknown-8bit doesn't
+    # seem to be supported as an input to make_header, so not only do
+    # we have to detect dodgy headers, we have to fix them ourselves.
+    #
+    # We solve this by catching any Unicode errors, and then manually
+    # handling any interesting headers. This will only happen under
+    # Py2.
+
+    headers = {key: decode_header(value) for key, value in
+               mail.items()}
+
+    strings = []
+    for key, value in headers.items():
+        try:
+            header = make_header(value,
+                                 header_name=key,
+                                 continuation_ws='\t')
+        except UnicodeDecodeError:
+            # We should only get here under Python 2.
+            # At least one of the parts cannot be encoded as ascii.
+            # Find out which one and fix it somehow.
+            new_value=[]
+            for (part, coding) in value:
+                if (coding is not None or
+                    all([ord(x) in range(128) for x in part])):
+
+                    # we either have a coding hint or it's all ascii
+                    # let it through unchanged.
+                    # TODO: handle invalid data with a coding hint
+                    new_value += [(part, coding)]
+                else:
+                    # We have random bytes that aren't properly coded.
+                    # We should do the unknown-8bit coding ourselves.
+                    # For now, we're just going to replace any dubious
+                    # chars with ?. TODO: replace it with a proper QP
+                    # unknown-8bit codec.
+                    new_value += [(part.decode('ascii', errors='replace')
+                                   .encode('ascii', errors='replace'),
+                                   None)]
+            header = make_header(new_value,
+                                 header_name=key,
+                                 continuation_ws='\t')
+        finally:
+            strings += ['%s: %s' % (key, header.encode())]
+
+    return '\n'.join(strings)
 
 
 def find_references(mail):