diff mbox series

[RFC,1/1] support/download: allow to use part of file in checksum checking

Message ID 20230727125123.927568-1-herve.codina@bootlin.com
State Rejected
Headers show
Series [RFC,1/1] support/download: allow to use part of file in checksum checking | expand

Commit Message

Herve Codina July 27, 2023, 12:51 p.m. UTC
The checksum checking is done on whole files only using the <pkg>.hash
checksum references.
Among the files checked, files related to licenses are checked.

Some packages do not contain any specific license files and, for
them, some source files are used. These source files contain license
information (usually comments at the beginning of the file).

Using the whole source file for checksum checking in this case can lead
to issues if a br2-external is present and applies some patches to this
source file.
Indeed, patching a package from a br2-external is allowed but in that
case the whole file checksum change and all checksum verification done
on that file fails. In particular 'make legal-info' fails.

Using only the license related part of a source file for checksum
checking solve this issue.
'make legal-info' will fail only if the license part is modify.

Introduce the possibility to have a lines range in <pkg>.hash and, if
present, compute checksum on the part defined by this lines range.

For instance, in <pkg>.hash:
sha256  xxxxxx  foo.c        <-- sha256 on the whole foo.c file
sha256  xxxxxx  foo.c  1,15  <-- sha256 on extraction from line 1 to 15

Signed-off-by: Herve Codina <herve.codina@bootlin.com>
---
 support/download/check-hash | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

Comments

Yann E. MORIN July 27, 2023, 3:59 p.m. UTC | #1
Hervé, All,

On 2023-07-27 14:51 +0200, Herve Codina spake thusly:
> The checksum checking is done on whole files only using the <pkg>.hash
> checksum references.
> Among the files checked, files related to licenses are checked.
> 
> Some packages do not contain any specific license files and, for
> them, some source files are used. These source files contain license
> information (usually comments at the beginning of the file).
> 
> Using the whole source file for checksum checking in this case can lead
> to issues if a br2-external is present and applies some patches to this

s/br2-external/global patch-dir/

> source file.
> Indeed, patching a package from a br2-external is allowed but in that

s/br2-external/global patch-dir/

> case the whole file checksum change and all checksum verification done
> on that file fails. In particular 'make legal-info' fails.
> 
> Using only the license related part of a source file for checksum
> checking solve this issue.
> 'make legal-info' will fail only if the license part is modify.

modified

> Introduce the possibility to have a lines range in <pkg>.hash and, if
> present, compute checksum on the part defined by this lines range.

I'm afraid this is fraught with limitations.

For example, I've seen source files where the license texts (yes,
plural) are spread out out in the source file, intermixed between
parts of the code.

If you patch such a file, you are certain that:
 1. its hash as a whole changes, as you noticed
 2. the license will not be on the lines you expect them to be, and will
    defeat your proposed scheme.

> For instance, in <pkg>.hash:
> sha256  xxxxxx  foo.c        <-- sha256 on the whole foo.c file
> sha256  xxxxxx  foo.c  1,15  <-- sha256 on extraction from line 1 to 15

That's hackish.

Instead, I would argue for either or both of the following:

 1. considering that the source file is used as a license file, it is
    not non-redistributable, so it is already redistributed as part of
    the legal-info artefacts, so I would suggest that we simply stop
    using source file as a license file;

 2. for such packages, add a post-extract hook that prepares the license
    file(s), and register those as _FOO_LICENSE_FILES instead. By using
    code, you can search for patterns rather than depend on constant
    line numbers.

Both are trivial enough, I believe.

Regards,
Yann E. MORIN.

> Signed-off-by: Herve Codina <herve.codina@bootlin.com>
> ---
>  support/download/check-hash | 24 +++++++++++++++++++-----
>  1 file changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/support/download/check-hash b/support/download/check-hash
> index 5a47f49bc3..a90c9ca58a 100755
> --- a/support/download/check-hash
> +++ b/support/download/check-hash
> @@ -45,10 +45,20 @@ fi
>  # $1: algo hash
>  # $2: known hash
>  # $3: file (full path)
> +# $4: lines ranges in the form s,e. If present, the checksum is compute on the
> +#     file extracted part from line number s to line number e included.
> +#     The first line in the file is the line number 1.
> +#     If not present, the whole file is used.
>  check_one_hash() {
>      _h="${1}"
>      _known="${2}"
>      _file="${3}"
> +    _r="${4}"
> +
> +    base_with_range=${base}
> +    if [ ${_r} ]; then
> +        base_with_range="${base_with_range}:${_r}"
> +    fi
>  
>      # Note: md5 is supported, but undocumented on purpose.
>      # Note: sha3 is not supported, since there is currently no implementation
> @@ -64,13 +74,17 @@ check_one_hash() {
>      esac
>  
>      # Do the hashes match?
> -    _hash=$( ${_h}sum "${_file}" |cut -d ' ' -f 1 )
> +    if [ ${_r} ]; then
> +        _hash=$( sed -n "${r} p" ${file} | ${_h}sum |cut -d ' ' -f 1 )
> +    else
> +        _hash=$( ${_h}sum "${_file}" |cut -d ' ' -f 1 )
> +    fi
>      if [ "${_hash}" = "${_known}" ]; then
> -        printf "%s: OK (%s: %s)\n" "${base}" "${_h}" "${_hash}"
> +        printf "%s: OK (%s: %s)\n" "${base_with_range}" "${_h}" "${_hash}"
>          return 0
>      fi
>  
> -    printf "ERROR: %s has wrong %s hash:\n" "${base}" "${_h}" >&2
> +    printf "ERROR: %s has wrong %s hash:\n" "${base_with_range}" "${_h}" >&2
>      printf "ERROR: expected: %s\n" "${_known}" >&2
>      printf "ERROR: got     : %s\n" "${_hash}" >&2
>      printf "ERROR: Incomplete download, or man-in-the-middle (MITM) attack\n" >&2
> @@ -80,7 +94,7 @@ check_one_hash() {
>  
>  # Do we know one or more hashes for that file?
>  nb_checks=0
> -while read t h f; do
> +while read t h f r; do
>      case "${t}" in
>          ''|'#'*)
>              # Skip comments and empty lines
> @@ -88,7 +102,7 @@ while read t h f; do
>              ;;
>          *)
>              if [ "${f}" = "${base}" ]; then
> -                check_one_hash "${t}" "${h}" "${file}"
> +                check_one_hash "${t}" "${h}" "${file}" "${r}"
>                  : $((nb_checks++))
>              fi
>              ;;
> -- 
> 2.41.0
>
Thomas Petazzoni July 27, 2023, 8:59 p.m. UTC | #2
On Thu, 27 Jul 2023 17:59:44 +0200
"Yann E. MORIN" <yann.morin.1998@free.fr> wrote:

>  2. for such packages, add a post-extract hook that prepares the license
>     file(s), and register those as _FOO_LICENSE_FILES instead. By using
>     code, you can search for patterns rather than depend on constant
>     line numbers.

Like:

# The Github repository doesn't contain the source code as-is: it
# contains a tarball with the kernel driver source code, and a
# self-extractible binary for the user-space parts. So we extract both
# below, and also extract the EULA text from the self-extractible binary
define GCNANO_BINARIES_EXTRACT_HELPER
        tar --strip-components=1 -xJf $(@D)/gcnano-driver-$(GCNANO_BINARIES_DRIVER_VERSION).tar.xz -C $(@D)
        awk 'BEGIN      { start = 0; } \
                /^EOEULA/  { start = 0; } \
                        { if (start) print; } \
                /<<EOEULA/ { start = 1; }' \
                $(@D)/gcnano-userland-multi-$(GCNANO_BINARIES_USERLAND_VERSION).bin > $(@D)/EULA
        cd $(@D) && sh gcnano-userland-multi-$(GCNANO_BINARIES_USERLAND_VERSION).bin --auto-accept
endef

GCNANO_BINARIES_POST_EXTRACT_HOOKS += GCNANO_BINARIES_EXTRACT_HELPER

(which among other things extracts the EULA file, that is used as a
license file)

Best regards,

Thomas
Herve Codina July 28, 2023, 7:04 a.m. UTC | #3
Hi Yann, Thomas,

On Thu, 27 Jul 2023 22:59:50 +0200
Thomas Petazzoni <thomas.petazzoni@bootlin.com> wrote:

> On Thu, 27 Jul 2023 17:59:44 +0200
> "Yann E. MORIN" <yann.morin.1998@free.fr> wrote:
> 
> >  2. for such packages, add a post-extract hook that prepares the license
> >     file(s), and register those as _FOO_LICENSE_FILES instead. By using
> >     code, you can search for patterns rather than depend on constant
> >     line numbers.  
> 
> Like:
> 
> # The Github repository doesn't contain the source code as-is: it
> # contains a tarball with the kernel driver source code, and a
> # self-extractible binary for the user-space parts. So we extract both
> # below, and also extract the EULA text from the self-extractible binary
> define GCNANO_BINARIES_EXTRACT_HELPER
>         tar --strip-components=1 -xJf $(@D)/gcnano-driver-$(GCNANO_BINARIES_DRIVER_VERSION).tar.xz -C $(@D)
>         awk 'BEGIN      { start = 0; } \
>                 /^EOEULA/  { start = 0; } \
>                         { if (start) print; } \
>                 /<<EOEULA/ { start = 1; }' \
>                 $(@D)/gcnano-userland-multi-$(GCNANO_BINARIES_USERLAND_VERSION).bin > $(@D)/EULA
>         cd $(@D) && sh gcnano-userland-multi-$(GCNANO_BINARIES_USERLAND_VERSION).bin --auto-accept
> endef
> 
> GCNANO_BINARIES_POST_EXTRACT_HOOKS += GCNANO_BINARIES_EXTRACT_HELPER
> 
> (which among other things extracts the EULA file, that is used as a
> license file)
> 
> Best regards,
> 
> Thomas

Thanks for your feedback.

I am going to look at the POST_EXTRACT_HOOKS to generate a license file.

Best regards,
Hervé
diff mbox series

Patch

diff --git a/support/download/check-hash b/support/download/check-hash
index 5a47f49bc3..a90c9ca58a 100755
--- a/support/download/check-hash
+++ b/support/download/check-hash
@@ -45,10 +45,20 @@  fi
 # $1: algo hash
 # $2: known hash
 # $3: file (full path)
+# $4: lines ranges in the form s,e. If present, the checksum is compute on the
+#     file extracted part from line number s to line number e included.
+#     The first line in the file is the line number 1.
+#     If not present, the whole file is used.
 check_one_hash() {
     _h="${1}"
     _known="${2}"
     _file="${3}"
+    _r="${4}"
+
+    base_with_range=${base}
+    if [ ${_r} ]; then
+        base_with_range="${base_with_range}:${_r}"
+    fi
 
     # Note: md5 is supported, but undocumented on purpose.
     # Note: sha3 is not supported, since there is currently no implementation
@@ -64,13 +74,17 @@  check_one_hash() {
     esac
 
     # Do the hashes match?
-    _hash=$( ${_h}sum "${_file}" |cut -d ' ' -f 1 )
+    if [ ${_r} ]; then
+        _hash=$( sed -n "${r} p" ${file} | ${_h}sum |cut -d ' ' -f 1 )
+    else
+        _hash=$( ${_h}sum "${_file}" |cut -d ' ' -f 1 )
+    fi
     if [ "${_hash}" = "${_known}" ]; then
-        printf "%s: OK (%s: %s)\n" "${base}" "${_h}" "${_hash}"
+        printf "%s: OK (%s: %s)\n" "${base_with_range}" "${_h}" "${_hash}"
         return 0
     fi
 
-    printf "ERROR: %s has wrong %s hash:\n" "${base}" "${_h}" >&2
+    printf "ERROR: %s has wrong %s hash:\n" "${base_with_range}" "${_h}" >&2
     printf "ERROR: expected: %s\n" "${_known}" >&2
     printf "ERROR: got     : %s\n" "${_hash}" >&2
     printf "ERROR: Incomplete download, or man-in-the-middle (MITM) attack\n" >&2
@@ -80,7 +94,7 @@  check_one_hash() {
 
 # Do we know one or more hashes for that file?
 nb_checks=0
-while read t h f; do
+while read t h f r; do
     case "${t}" in
         ''|'#'*)
             # Skip comments and empty lines
@@ -88,7 +102,7 @@  while read t h f; do
             ;;
         *)
             if [ "${f}" = "${base}" ]; then
-                check_one_hash "${t}" "${h}" "${file}"
+                check_one_hash "${t}" "${h}" "${file}" "${r}"
                 : $((nb_checks++))
             fi
             ;;