[1/1] support/download/git: Prioritize remote archive

Message ID 1471461016-9660-1-git-send-email-kamath.ben@gmail.com
State Changes Requested

Commit Message

Benjamin Kamath Aug. 17, 2016, 7:10 p.m. UTC
Attempt to do a remote archive since it shortcuts us past a few steps when
available. Additionally, if the git server has uploadArchive.allowUnreachable
set to true, then this method can also work on arbitrary sha1s, offering a huge
speed advantage over a full clone.

Signed-off-by: Benjamin Kamath <kamath.ben@gmail.com>
---
 support/download/git | 9 +++++++++
 1 file changed, 9 insertions(+)
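
For reference, a minimal sketch of the server-side setting the commit message relies on. It assumes a throwaway local repository standing in for the git server and a git recent enough to support `git -C`; the names `srv` and `sha1` are illustrative only and are not part of the patch.

```shell
# Throwaway repository standing in for the remote server (assumption).
srv=$(mktemp -d)
git init -q "${srv}"
echo hello >"${srv}/file"
git -C "${srv}" add file
git -C "${srv}" -c user.name=demo -c user.email=demo@example.com commit -q -m init
sha1=$(git -C "${srv}" rev-parse HEAD)

# By default the remote side only accepts ref names, so a bare sha1 is refused.
if git archive --remote="${srv}" "${sha1}" >/dev/null 2>&1; then
    echo "sha1 accepted"
else
    echo "sha1 refused"
fi

# Server-side setting that lifts the restriction.
git -C "${srv}" config uploadArchive.allowUnreachable true

# The same request should now produce an archive; list its contents.
git archive --remote="${srv}" --prefix=demo/ "${sha1}" | tar tf -
```

The setting lives in the configuration of the repository being served, so it is only useful when the server administrator has enabled it; otherwise the download has to fall back to a clone.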

Comments

Thomas Petazzoni Aug. 17, 2016, 8:39 p.m. UTC | #1
Hello,

On Wed, 17 Aug 2016 12:10:16 -0700, Benjamin Kamath wrote:
> Attempt to do a remote archive since it shortcuts us past a few steps when
> available. Additionally, if the git server has uploadArchive.allowUnreachable
> set to true, then this method can also work on arbitrary sha1s, offering a huge
> speed advantage over a full clone.
> 
> Signed-off-by: Benjamin Kamath <kamath.ben@gmail.com>
> ---
>  support/download/git | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/support/download/git b/support/download/git
> index 416cd1b..043a6de 100755
> --- a/support/download/git
> +++ b/support/download/git
> @@ -36,6 +36,15 @@ _git() {
>      eval ${GIT} "${@}"
>  }
>  
> +# Try a remote archive, since it is as fast as a shallow clone and can give us
> +# an archive directly. Also, if uploadArchive.allowUnreachable is set to true
> +# on the remote, this will also work for arbitrary sha1s, and will offer a
> +# considerable speedup over a full clone.
> +printf "Doing remote archive\n"
> +if _git archive --format=tar.gz --prefix=${basename}/ --remote=${repo} -o ${output} ${cset} 2>&1; then
> +    exit 0
> +fi

Are the tarballs produced using this method reproducible? We need the
tarballs produced here to be reproducible so that we can store hashes
for them in the package .hash file.

Thomas
Yann E. MORIN Aug. 17, 2016, 9:03 p.m. UTC | #2
Benjamin, All,

On 2016-08-17 12:10 -0700, Benjamin Kamath spake thusly:
> Attempt to do a remote archive since it shortcuts us past a few steps when
> available. Additionally, if the git server has uploadArchive.allowUnreachable
> set to true, then this method can also work on arbitrary sha1s, offering a huge
> speed advantage over a full clone.
> 
> Signed-off-by: Benjamin Kamath <kamath.ben@gmail.com>
> ---
>  support/download/git | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/support/download/git b/support/download/git
> index 416cd1b..043a6de 100755
> --- a/support/download/git
> +++ b/support/download/git
> @@ -36,6 +36,15 @@ _git() {
>      eval ${GIT} "${@}"
>  }
>  
> +# Try a remote archive, since it is as fast as a shallow clone and can give us
> +# an archive directly. Also, if uploadArchive.allowUnreachable is set to true
> +# on the remote, this will also work for arbitrary sha1s, and will offer a
> +# considerable speedup over a full clone.
> +printf "Doing remote archive\n"
> +if _git archive --format=tar.gz --prefix=${basename}/ --remote=${repo} -o ${output} ${cset} 2>&1; then
> +    exit 0
> +fi

NAK in the state.

If the package needs submodules, we can't ask the remote to generate
the archive for us, because git-archive does not know how to include
submodules.

So, maybe this would work:

    if [ ${recurse} -eq 0 ]; then
        if _git blabla remote archive; then
            exit 0
        fi
    fi
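
A slightly fuller sketch of that guard, under stated assumptions: `try_remote_archive` is a hypothetical helper name, and its arguments mirror the `recurse`, `basename`, `repo`, `output` and `cset` variables used in the patch. It is not the wrapper's actual code, only one way the suggestion could be spelled out.

```shell
# Hypothetical helper: attempt a remote archive only when submodules are
# not requested, and leave no partial output behind on failure.
try_remote_archive() {
    local recurse="$1" basename="$2" repo="$3" output="$4" cset="$5"

    # git-archive cannot include submodules, so skip it when they are needed.
    [ "${recurse}" -eq 0 ] || return 1

    git archive --format=tar.gz --prefix="${basename}/" \
        --remote="${repo}" -o "${output}" "${cset}" \
        || { rm -f "${output}"; return 1; }
}
```

A caller would run `try_remote_archive "${recurse}" "${basename}" "${repo}" "${output}" "${cset}" && exit 0` and carry on with the existing clone logic when it returns non-zero.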

Also, as stated by Thomas, we want to generate reproducible archives, so
that we can check the hashes of archives. We go to great lengths to
generate such archives locally, but I don't see a guarantee that the
remote archive would be reproducible.

Regards,
Yann E. MORIN.

>  # Try a shallow clone, since it is faster than a full clone - but that only
>  # works if the version is a ref (tag or branch). Before trying to do a shallow
>  # clone we check if ${cset} is in the list provided by git ls-remote. If not
> -- 
> 2.7.4
> 
> _______________________________________________
> buildroot mailing list
> buildroot@busybox.net
> http://lists.busybox.net/mailman/listinfo/buildroot
Benjamin Kamath Aug. 17, 2016, 9:06 p.m. UTC | #3
On Wed, Aug 17, 2016 at 1:39 PM, Thomas Petazzoni
<thomas.petazzoni@free-electrons.com> wrote:
>
> Hello,
>
>
> Are the tarballs produced using this method reproducible? We need the
> tarballs produced here to be reproducible so that we can store hashes
> for them in the package .hash file.
>

I've investigated this a little bit, and it does seem to create
reproducible tarballs. In my testing, the archives produced via git
archive --remote=... actually match checksums with those produced via
shallow clone, delete .git, tar + gzip. I'd possibly need to
investigate the git source a bit more.
Benjamin Kamath Aug. 17, 2016, 9:13 p.m. UTC | #4
On Wed, Aug 17, 2016 at 2:03 PM, Yann E. MORIN <yann.morin.1998@free.fr> wrote:
> Benjamin, All,
>
>>
>> +# Try a remote archive, since it is as fast as a shallow clone and can give us
>> +# an archive directly. Also, if uploadArchive.allowUnreachable is set to true
>> +# on the remote, this will also work for arbitrary sha1s, and will offer a
>> +# considerable speedup over a full clone.
>> +printf "Doing remote archive\n"
>> +if _git archive --format=tar.gz --prefix=${basename}/ --remote=${repo} -o ${output} ${cset} 2>&1; then
>> +    exit 0
>> +fi
>
> NAK in the state.
Is this related to the following paragraph or a separate issue?

>
> If the package needs submodules, we can't ask the remote to generate
> the archive for us, because git-archive does not know how to include
> submodules.
>
> So, maybe this would work:
>
>     if [ ${recurse} -eq 0 ]; then
>         if _git blabla remote archive; then
>             exit 0
>         fi
>     fi
Indeed, I hadn't thought about submodules. I think your suggestion
would be sufficient. After all,
it should fall back to the older behavior upon failure.

>
> Also, as stated by Thomas, we want to generate reproducible archives, so
> that we can check the hashes of archives. We go to great lengths to
> generate such archives locally, but I don't see a guarantee that the
> remote archive would be reproducible.

I'm quite certain the archive is reproducible but this requires a bit
more investigation
to prove.

>
> Regards,
> Yann E. MORIN.
>
>>  # Try a shallow clone, since it is faster than a full clone - but that only
>>  # works if the version is a ref (tag or branch). Before trying to do a shallow
>>  # clone we check if ${cset} is in the list provided by git ls-remote. If not
>> --
>> 2.7.4
>>
>
> --
> .-----------------.--------------------.------------------.--------------------.
> |  Yann E. MORIN  | Real-Time Embedded | /"\ ASCII RIBBON | Erics' conspiracy: |
> | +33 662 376 056 | Software  Designer | \ / CAMPAIGN     |  ___               |
> | +33 223 225 172 `------------.-------:  X  AGAINST      |  \e/  There is no  |
> | http://ymorin.is-a-geek.org/ | _/*\_ | / \ HTML MAIL    |   v   conspiracy.  |
> '------------------------------^-------^------------------^--------------------'
Yann E. MORIN Aug. 17, 2016, 9:31 p.m. UTC | #5
Benjamin, All,

On 2016-08-17 14:13 -0700, Benjamin Kamath spake thusly:
> On Wed, Aug 17, 2016 at 2:03 PM, Yann E. MORIN <yann.morin.1998@free.fr> wrote:
> > Benjamin, All,
> >
> >>
> >> +# Try a remote archive, since it is as fast as a shallow clone and can give us
> >> +# an archive directly. Also, if uploadArchive.allowUnreachable is set to true
> >> +# on the remote, this will also work for arbitrary sha1s, and will offer a
> >> +# considerable speedup over a full clone.
> >> +printf "Doing remote archive\n"
> >> +if _git archive --format=tar.gz --prefix=${basename}/ --remote=${repo} -o ${output} ${cset} 2>&1; then
> >> +    exit 0
> >> +fi
> >
> > NAK in the state.
> Is this related to the following paragraph or a separate issue?

It's "NAK in the state" because of what I explained below.

I'm OK for this feature if:
  - the submodule support is handled (at least as I suggest),
  - the reproducibility of archives is guaranteed.

> > If the package needs submodules, we can't ask the remote to generate
> > the archive for us, because git-archive does not know how to include
> > submodules.
> >
> > So, maybe this would work:
> >
> >     if [ ${recurse} -eq 0 ]; then
> >         if _git blabla remote archive; then
> >             exit 0
> >         fi
> >     fi
> Indeed, I hadn't thought about submodules. I think your suggestion
> would be sufficient. After all,
> it should fall back to the older behavior upon failure.
> 
> >
> > Also, as stated by Thomas, we want to generate reproducible archives, so
> > that we can check the hashes of archives. We go to great lengths to
> > generate such archives locally, but I don't see a guarantee that the
> > remote archive would be reproducible.
> 
> I'm quite certain the archive is reproducible but this requires a bit
> more investigation
> to prove.

Well, I had a quick look at archive.c in the git git tree (weird to
write that!), and I can conclusively state neither that they are
reproducible nor that they are not... :-/

There does not seem to be any call to sort() in there, nor are they
setting LC_COLLATE anywhere.

However, I've tried to generate two archives (locally) with different
collating rules (en_US.UTF-8 which does not differentiate between upper
and lower case, and C which does) and the two archives had the same sha1.

Inspecting the archives in both cases shows that the collating seems to
always be C, with Uppercase always before lowercase, with .files before
non-dot files, and so on...
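
The experiment described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the exact commands that were run: the repository is a throwaway one, the file names are made up to exercise case and dot-file ordering, `sha1sum` from coreutils is assumed to be available, and the en_US.UTF-8 locale may or may not be installed on a given machine.

```shell
# Throwaway repository with names that sort differently under C and
# en_US.UTF-8 collation (illustrative names only).
repo=$(mktemp -d)
git init -q "${repo}"
touch "${repo}/Zebra" "${repo}/apple" "${repo}/.hidden" "${repo}/Banana"
git -C "${repo}" add .
git -C "${repo}" -c user.name=demo -c user.email=demo@example.com commit -q -m init

# Generate the same archive under two different locales and hash each one.
sum_c=$(LC_ALL=C git -C "${repo}" archive --format=tar --prefix=demo/ HEAD | sha1sum)
sum_en=$(LC_ALL=en_US.UTF-8 git -C "${repo}" archive --format=tar --prefix=demo/ HEAD | sha1sum)

if [ "${sum_c}" = "${sum_en}" ]; then
    echo "same archive under both locales"
else
    echo "archives differ"
fi

# Show the member order actually stored in the archive.
git -C "${repo}" archive --format=tar --prefix=demo/ HEAD | tar tf -
```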

So, I think it is safe to assume that git-archive always generates
reproducible archives.

There. Solved that one for you! ;-)

Regards,
Yann E. MORIN.
Thomas Petazzoni Aug. 17, 2016, 9:54 p.m. UTC | #6
Hello,

On Wed, 17 Aug 2016 23:31:02 +0200, Yann E. MORIN wrote:

> So, I think it is safe to assume that git-archive always generates
> reproducible archives.

So the only remaining reason to not use git archive all the time is to
support submodules?

Thomas
Yann E. MORIN Aug. 17, 2016, 10 p.m. UTC | #7
On 2016-08-17 23:54 +0200, Thomas Petazzoni spake thusly:
> Hello,
> 
> On Wed, 17 Aug 2016 23:31:02 +0200, Yann E. MORIN wrote:
> 
> > So, I think it is safe to assume that git-archive always generates
> > reproducible archives.
> 
> So the only remaining reason to not use git archive all the time is to
> support submodules?

Yes.

When doing submodules, I pondered doing two code paths: one for
non-submodules, that would use git-archive, and one where we would do it
all manually as we do today.

But then I concluded that it was better to have a single code path.

Note that, before submodules, we were happily using git-archive locally,
not even forcing the collating rules.

Regards,
Yann E. MORIN.
Peter Korsgaard Aug. 22, 2016, 7:53 p.m. UTC | #8
>>>>> "Yann" == Yann E MORIN <yann.morin.1998@free.fr> writes:

Hi,

 > NAK in the state.

 > If the package needs submodules, we can't ask the remote to generate
 > the archive for us, because git-archive does not know how to include
 > submodules.

 > So, maybe this would work:

 >     if [ ${recurse} -eq 0 ]; then
 >         if _git blabla remote archive; then
 >             exit 0
 >         fi
 >     fi

Or alternatively, we look at the alternative approach for handling
submodules - E.G. splicing git archive outputs.

 > Also, as stated by Thomas, we want to generate reproducible archives, so
 > that we can check the hashes of archives. We go to great lengths to
 > generate such archives locally, but I don't see a guarantee that the
 > remote archive would be reproducible.

Normal 'git archive' output should be reproducible, E.G. that is what we
used until recently.
Yann E. MORIN Aug. 22, 2016, 8:55 p.m. UTC | #9
Peter, All,

On 2016-08-22 21:53 +0200, Peter Korsgaard spake thusly:
> >>>>> "Yann" == Yann E MORIN <yann.morin.1998@free.fr> writes:
>  > NAK in the state.
> 
>  > If the package needs submodules, we can't ask the remote to generate
>  > the archive for us, because git-archive does not know how to include
>  > submodules.
> 
>  > So, maybe this would work:
> 
>  >     if [ ${recurse} -eq 0 ]; then
>  >         if _git blabla remote archive; then
>  >             exit 0
>  >         fi
>  >     fi
> 
> Or alternatively, we look at the alternative approach for handling
> submodules - E.G. splicing git archive outputs.

And I think I already explained that this was not so trivial...

For example, I did this layout of git tree and submodules:

    foo/
    foo/.git/
    foo/foo             <- file with "FOO" in it
    foo/bar/
    foo/bar/.git
    foo/bar/bar         <- file with "BAR" in it
    foo/bar/buz/
    foo/bar/buz/.git
    foo/bar/buz/buz     <- file with "BUZ" in it

  - each git tree has a file named after the git tree and containing the
    name of the git tree in uppercase (just for fun and as a way to check
    what I did).
  - 'foo' is a git tree with a submodule 'bar'.
  - 'bar' is a git tree with a submodule 'buz'.
  - So, 'buz' is *not* a submodule of 'foo'

    $ git submodule foreach -q --recursive 'printf "name=${name} path=${path} toplevel=${toplevel}\n"'
    name=bar  path=bar toplevel=/home/ymorin/dev/buildroot/foo/git/foo
    name=buz  path=buz toplevel=/home/ymorin/dev/buildroot/foo/git/foo/bar

So it means we have no easy way to get the relative path to the
sub-submodules. We have to extract them:

    $ git submodule foreach -q --recursive "printf \"reldir=\${toplevel#$(git rev-parse --show-toplevel)}/\${path}\n\""
    reldir=/bar
    reldir=/bar/buz

And then for each of them, we shoe-horn that path as a --prefix to git
archive.

This does not make our git wrapper any simpler:
  - we still need to try a shallow clone and fallback to a full clone,
  - we still need to fetch the special refs,
  - we still need to do checkouts (thus non-bare clones) because
    submodules are only known with a working tree,
  - we still need to init and update submodules, recursively.

The only slight simplification would be with using git-archive instead
of a canned tar, but even then this git-archive command would be quite
complex (untested):

    $ git archive --prefix=${basename}/ --format=tar HEAD >"${output}.tmp"
    $ git submodule foreach -q --recursive \
        "git archive --prefix=${basename}\${toplevel#$(git rev-parse --show-toplevel)}/\${path}/ --format=tar HEAD" \
        >>"${output}.tmp"
    $ gzip -9 <"${output}.tmp" >"${output}"

Sorry, but this is totally unreadable... :-/

And this is only about replacing the *single* tar we have right now.
We'd still have to keep all the rest of the wrapper...

However, taking again my example git tree above:

    $ git archive --prefix=foo/ --format=tar HEAD >foo.tar
    $ ls -l foo.tar
    -rw-rw-r-- 1 ymorin ymorin 10240 Aug 22 22:37 foo.tar

    $ git submodule foreach -q --recursive "git archive --prefix=foo\${toplevel#$(git rev-parse --show-toplevel)}/\${path}/ --format=tar HEAD >>$(pwd)/foo.tar"
    $ ls -l foo.tar
    -rw-rw-r-- 1 ymorin ymorin 30720 Aug 22 22:37 foo.tar

So it seems the submodules were somewhat added to the archive, right?
Well, at least it seems the archive is ill-formed:

    $ tar tf foo.tar
    foo/
    foo/.gitmodules
    foo/bar/
    foo/foo

If I 'hexdump -Cv foo.tar' it looks like there is everything in there,
though... But git-archive generates a 'global pax header' (whatever that
is) by default. We can tell it not to, by using a special syntax when
specifying the tree-ish: using HEAD^{tree} instead of HEAD.
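
The commit versus tree difference can be checked with a small sketch. It assumes a throwaway repository and GNU grep; `repo`, `with_commit` and `with_tree` are illustrative names only.

```shell
# Throwaway repository with a single commit (assumption for the demo).
repo=$(mktemp -d)
git init -q "${repo}"
echo FOO >"${repo}/foo"
git -C "${repo}" add foo
git -C "${repo}" -c user.name=demo -c user.email=demo@example.com commit -q -m init

# Given a commit, the tar stream carries a pax global header entry that
# records the commit id; count the lines mentioning it.
with_commit=$(git -C "${repo}" archive --format=tar HEAD \
    | grep -ac pax_global_header || true)

# Given a bare tree, there is no commit to record, so no such entry.
with_tree=$(git -C "${repo}" archive --format=tar 'HEAD^{tree}' \
    | grep -ac pax_global_header || true)

echo "commit: ${with_commit} match(es), tree: ${with_tree} match(es)"
```

Note that, according to the git-archive documentation, an archive made from a tree uses the current time as the modification time of each file, whereas one made from a commit uses the commit time. That matters for reproducibility, so dropping the header this way is not free.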

No more luck at extracting the archive... :-(

So I'm not sure where to go from here.
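
One possible explanation for the truncated listing, offered as an assumption since the thread does not confirm it: a tar stream ends with an end-of-archive marker made of zero-filled blocks, and GNU tar stops reading at that marker, so archives appended with `>>` are ignored unless `--ignore-zeros` is given. A minimal sketch with plain tar, assuming GNU tar and made-up file names:

```shell
work=$(mktemp -d)
echo FOO >"${work}/foo"
echo BAR >"${work}/bar"

# Two independent archives, each with its own end-of-archive marker.
tar -C "${work}" -cf "${work}/a.tar" foo
tar -C "${work}" -cf "${work}/b.tar" bar

# Naive splice, similar to appending with ">>" in the experiment above.
cat "${work}/a.tar" "${work}/b.tar" >"${work}/joined.tar"

# Default listing stops at the first end-of-archive marker.
plain=$(tar -tf "${work}/joined.tar")

# --ignore-zeros keeps reading past zero-filled blocks.
all=$(tar --ignore-zeros -tf "${work}/joined.tar")

printf 'default listing:\n%s\n' "${plain}"
printf 'with --ignore-zeros:\n%s\n' "${all}"
```

Whether relying on `--ignore-zeros` at extraction time would be acceptable for the download wrapper is a separate question that this sketch does not answer.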

>  > Also, as stated by Thomas, we want to generate reproducible archives, so
>  > that we can check the hashes of archives. We go to great lengths to
>  > generate such archives locally, but I don't see a guarantee that the
>  > remote archive would be reproducible.
> 
> Normal 'git archive' output should be reproducible, E.G. that is what we
> used until recently.

Yet, we did notice that, at one point, github archives were *not*
reproducible...

Regards,
Yann E. MORIN.

Patch

diff --git a/support/download/git b/support/download/git
index 416cd1b..043a6de 100755
--- a/support/download/git
+++ b/support/download/git
@@ -36,6 +36,15 @@  _git() {
     eval ${GIT} "${@}"
 }
 
+# Try a remote archive, since it is as fast as a shallow clone and can give us
+# an archive directly. Also, if uploadArchive.allowUnreachable is set to true
+# on the remote, this will also work for arbitrary sha1s, and will offer a
+# considerable speedup over a full clone.
+printf "Doing remote archive\n"
+if _git archive --format=tar.gz --prefix=${basename}/ --remote=${repo} -o ${output} ${cset} 2>&1; then
+    exit 0
+fi
+
 # Try a shallow clone, since it is faster than a full clone - but that only
 # works if the version is a ref (tag or branch). Before trying to do a shallow
 # clone we check if ${cset} is in the list provided by git ls-remote. If not