diff mbox series

[v2,04/12] support/download/post-process-helpers: add helper function for post process scripts

Message ID 20201219153525.1361175-5-thomas.petazzoni@bootlin.com
State Changes Requested
Headers show
Series [v2,01/12] support/download/dl-wrapper: add concept of download post-processing | expand

Commit Message

Thomas Petazzoni Dec. 19, 2020, 3:35 p.m. UTC
download post process scripts will often need to unpack the source
code tarball, do some operation, and then repack it. In order to help
with this, post-process-helpers provide an unpack() function and a
repack() function.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
---
 support/download/post-process-helpers | 30 +++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)
 create mode 100644 support/download/post-process-helpers

Comments

Yann E. MORIN Dec. 19, 2020, 5:39 p.m. UTC | #1
Thomas, All,

On 2020-12-19 16:35 +0100, Thomas Petazzoni spake thusly:
> download post process scripts will often need to unpack the source
> code tarball, do some operation, and then repack it. In order to help
> with this, post-process-helpers provide an unpack() function and a
> repack() function.

You forgot to explain in the commit log why we need a hard-coded, fixed
date.

However, I find it ugly that we resort to such an arbitrary constant. At
the very least, I'd like we take a memorable value, like 1970-01-01
00:00:00 +0000.

However, I am afraid that tar would refuse to extract files to that
date, whining about an "implausibly old time stamp" when extracting such
files...

But see below for a proposal...

> Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
> ---
>  support/download/post-process-helpers | 30 +++++++++++++++++++++++++++
>  1 file changed, 30 insertions(+)
>  create mode 100644 support/download/post-process-helpers
> 
> diff --git a/support/download/post-process-helpers b/support/download/post-process-helpers
> new file mode 100644
> index 0000000000..bed8df2577
> --- /dev/null
> +++ b/support/download/post-process-helpers
> @@ -0,0 +1,30 @@
> +
> +unpack() {
> +        dest="$1"
> +        tarball="$2"

This is a shell script, variables are global. But since we run it with
bash, we can declare such variable as local:

    local dest="${1}"
    local tarball="${2}"

> +        mkdir ${dest}
> +        tar -C ${dest} --strip-components=1 -xf ${tarball}

    local one_file

    one_file="$(find "${dest}" -type f |head -n1)"
    touch -r "${one_file}" "${dest}.timestamp"

> +}
> +
> +repack() {
> +        src="$1"
> +        tarball="$2"

Ditto: make them 'local' (this can be fixed when applying).

Also, I suppose that the 'dest' parameter of unpack is expected to be the
'src' parameter of repack, right?

> +        # Generate the archive, sort with the C locale so that it is reproducible.
> +        find "$(basename ${src})" -not -type d -print0 >files.list
> +        LC_ALL=C sort -z <files.list >files.list.sorted
> +
> +        # let's use a fixed hardcoded date to be reproducible
> +        date="2020-02-06 01:02:03 +0000"

Assuming that unpack's dest and repack's src are pointing to the same
directory:

        date="$(stat -c "${src}.timestamp")"

> +        # Create GNU-format tarballs, since that's the format of the tarballs on
> +        # sources.buildroot.org and used in the *.hash files
> +        tar cf new.tar --null --verbatim-files-from --numeric-owner --format=gnu \
> +            --owner=0 --group=0 --mtime="${date}" -T files.list.sorted

This will have to be changed according to my pending series to change to
the PAX format, whish is even more reproducible.

> +        gzip -6 -n <new.tar >new.tar.gz
> +        mv "${tarball}" "${tarball}".old
> +        mv new.tar.gz "${tarball}"
> +        rm "${tarball}".old
> +        rm -rf ${src}

Quote all expanded variable, even "${src}". And also rm "${src}.timestamp",
now.

Regards,
Yann E. MORIN.

> +}
> -- 
> 2.29.2
>
Yann E. MORIN Dec. 19, 2020, 8:29 p.m. UTC | #2
Thomas, All,

A few mre tweaks I forgot about...

On 2020-12-19 18:39 +0100, Yann E. MORIN spake thusly:
> On 2020-12-19 16:35 +0100, Thomas Petazzoni spake thusly:
[--SNIP--]
> > +        mkdir ${dest}

Using 'mkdir -p' would ensure that parent direcotries are created as
needed.

> 
>     one_file="$(find "${dest}" -type f |head -n1)"

Of course, the output should be sorted in the C locale, to ensure the
same file is always used as reference (and hence a better name for that
file):

    local ref_file
    ref_file="$(find "${dest}" -type f |LC_ALL=C sort |head -n1)"
    touch -r "${ref_file}" "${dest}.timestamp"

[--SNIP--]
> > +        # let's use a fixed hardcoded date to be reproducible
> > +        date="2020-02-06 01:02:03 +0000"
> Assuming that unpack's dest and repack's src are pointing to the same
> directory:
>         date="$(stat -c "${src}.timestamp")"

Meh.. I forgot the actual format:

    date="$(stat -c '%y' "${src}.timestamp")"

Regards,
Yann E. MORIN.
Yann E. MORIN Dec. 20, 2020, 8:40 a.m. UTC | #3
Thomas, All,

On 2020-12-19 16:35 +0100, Thomas Petazzoni spake thusly:
> download post process scripts will often need to unpack the source
> code tarball, do some operation, and then repack it. In order to help
> with this, post-process-helpers provide an unpack() function and a
> repack() function.

Sorry to come back yet once more on that one, but I may have found quite
an issue. Basically, your code does this:

    unpack() {
        tar -C ${dest} --strip-components=1 -xf ${tarball}
    }

    repack() {
        tar cf new.tar --null --verbatim-files-from --numeric-owner --format=gnu \
            --owner=0 --group=0 --mtime="${date}" -T files.list.sorted
        gzip -6 -n <new.tar >new.tar.gz
    }

So, it takes a tarball in whatever copression scheme and extract it, but
it will alwyas create a gzip-compressed tarball.

So, how would we hanfle a case where the upstream tarball is, say, a
.tar.bz2 ? For example;

    FOO_SOURCE = foo=1.2.3.tar.bz2
    FOO_SITE = http://foo.example.net/download
    FOO_SITE_METOD = wget
    $(eval $(go-package))

Besides the fact that the packager would see in DL_DIR a .tar.gz tarball
different from the .tar.bz2 he coded, the most problematic issue wil be
that we would have to list the .tar.gz inthe .hash file. This will be
very confusing to packagers and reviewers...

Furthermore, there is another discrepancy in the post-processing
scripts. For example, the go post-processor does, basically:

    if tarball already vendored:
        exit leaving the tarball as-is

    unpack
    vendor
    repack as .tar.gz

So some go tarballs will be left as they are upstream, while others will
be re-encoded. This is going to be quite confusing...

I think the post-processing script should always re-encode the tarball,
so that at least we have a consistent behaviour across the board. This
is pretty trivial to come up with, at the exepense of a bit of overhead
for those tarballs that are already vendored, but the majority of
packages will probably be git clones that need to be vendored, so it is
a small price to pay.

The big problem that's left, is how we handle the discrepancy between
the tarball listed in _SOURCE and the one listed in the .hash file...

I have to admit I do not have an easy solution for that for now...

Especially since with the pending series about changing the way we
assemble tarballs from git and svn: the tar format changes, and the
compression changes...

Regards,
Yann E. MORIN.
diff mbox series

Patch

diff --git a/support/download/post-process-helpers b/support/download/post-process-helpers
new file mode 100644
index 0000000000..bed8df2577
--- /dev/null
+++ b/support/download/post-process-helpers
@@ -0,0 +1,30 @@ 
+
+unpack() {
+        dest="$1"
+        tarball="$2"
+
+        mkdir ${dest}
+        tar -C ${dest} --strip-components=1 -xf ${tarball}
+}
+
+repack() {
+        src="$1"
+        tarball="$2"
+
+        # Generate the archive, sort with the C locale so that it is reproducible.
+        find "$(basename ${src})" -not -type d -print0 >files.list
+        LC_ALL=C sort -z <files.list >files.list.sorted
+
+        # let's use a fixed hardcoded date to be reproducible
+        date="2020-02-06 01:02:03 +0000"
+
+        # Create GNU-format tarballs, since that's the format of the tarballs on
+        # sources.buildroot.org and used in the *.hash files
+        tar cf new.tar --null --verbatim-files-from --numeric-owner --format=gnu \
+            --owner=0 --group=0 --mtime="${date}" -T files.list.sorted
+        gzip -6 -n <new.tar >new.tar.gz
+        mv "${tarball}" "${tarball}".old
+        mv new.tar.gz "${tarball}"
+        rm "${tarball}".old
+        rm -rf ${src}
+}