gitlab-ci: Replace Docker with Kaniko

Message ID: 20240516165410.28800-3-cconte@redhat.com
State: New
Series: gitlab-ci: Replace Docker with Kaniko

Commit Message

Camilla Conte May 16, 2024, 4:52 p.m. UTC
Enables caching from the qemu-project repository.

Uses a dedicated "$NAME-cache" tag for caching, to address limitations.
See issue "when using --cache=true, kaniko fail to push cache layer [...]":
https://github.com/GoogleContainerTools/kaniko/issues/1459

Does not specify a context since no Dockerfile is using COPY or ADD instructions.

Does not enable reproducible builds as
that results in builds failing with an out of memory error.
See issue "Using --reproducible loads entire image into memory":
https://github.com/GoogleContainerTools/kaniko/issues/862

Previous attempts, for the records:
  - Alex Bennée: https://lore.kernel.org/qemu-devel/20230330101141.30199-12-alex.bennee@linaro.org/
  - Camilla Conte (me): https://lore.kernel.org/qemu-devel/20230531150824.32349-6-cconte@redhat.com/

Signed-off-by: Camilla Conte <cconte@redhat.com>
---
 .gitlab-ci.d/container-template.yml | 25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

Comments

Daniel P. Berrangé May 16, 2024, 6:24 p.m. UTC | #1
On Thu, May 16, 2024 at 05:52:43PM +0100, Camilla Conte wrote:
> Enables caching from the qemu-project repository.
> 
> Uses a dedicated "$NAME-cache" tag for caching, to address limitations.
> See issue "when using --cache=true, kaniko fail to push cache layer [...]":
> https://github.com/GoogleContainerTools/kaniko/issues/1459

After investigating, this is a result of a different design approach
for caching in kaniko.

In docker, any existing image can be leveraged as a cache source,
reusing whatever individual layers are present. IOW, there's no
difference between a cache and a final image; they're one and the
same thing.

In kaniko, the cache is a distinct object type. IIUC, it is not
populated with the individual layers; instead it has a custom
format for storing the cached content. Therefore the concept of
storing the cache at the same location as the final image is
completely inappropriate - you can't store two completely different
kinds of content in the same place.

That is also why you can't just pull the cache image(s) beforehand,
and also why it doesn't look like you can use multiple cache sources
with kaniko.
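
To make the design difference concrete, here are the two invocations
from this patch, abridged; docker points --cache-from at ordinary
image tags, while kaniko wants a repository of its own just for the
cache:

  # docker: any previously pushed image doubles as a cache source
  docker build --tag "$TAG" --cache-from "$TAG" --cache-from "$COMMON_TAG" ...
  # kaniko: cache layers live in a separate repository, distinct from "$TAG"
  /kaniko/executor --destination "$TAG" --cache=true --cache-repo="$CACHE_REPO" ...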

None of this is inherently a bad thing... except when it comes
to data storage. By using Kaniko we would, at minimum, be doubling
the amount of data storage we consume in the gitlab registry.

This is a potentially significant concern because GitLab does
technically have a limited storage quota, even with our free
OSS plan subscription.

Due to technical limitations, they've never been able to
actually enforce it thus far, but one day they probably will.
At which point we're doomed, because even with our current
Docker-in-Docker setup I believe we're exceeding our quota.
Thus the idea of doubling our container storage usage is pretty
unappealing.

We can avoid that by running without cache, but that has the
cost of increasing the job running time, since all containers
would be rebuilt on every pipeline. This will burn through
our Azure compute allowance more quickly (or our GitLab CI
credits if we had to switch away from Azure).

> Does not specify a context since no Dockerfile is using COPY or ADD instructions.
> 
> Does not enable reproducible builds as
> that results in builds failing with an out of memory error.
> See issue "Using --reproducible loads entire image into memory":
> https://github.com/GoogleContainerTools/kaniko/issues/862
> 
> Previous attempts, for the records:
>   - Alex Bennée: https://lore.kernel.org/qemu-devel/20230330101141.30199-12-alex.bennee@linaro.org/
>   - Camilla Conte (me): https://lore.kernel.org/qemu-devel/20230531150824.32349-6-cconte@redhat.com/
> 
> Signed-off-by: Camilla Conte <cconte@redhat.com>
> ---
>  .gitlab-ci.d/container-template.yml | 25 +++++++++++--------------
>  1 file changed, 11 insertions(+), 14 deletions(-)
> 
> diff --git a/.gitlab-ci.d/container-template.yml b/.gitlab-ci.d/container-template.yml
> index 4eec72f383..066f253dd5 100644
> --- a/.gitlab-ci.d/container-template.yml
> +++ b/.gitlab-ci.d/container-template.yml
> @@ -1,21 +1,18 @@
>  .container_job_template:
>    extends: .base_job_template
> -  image: docker:latest
>    stage: containers
> -  services:
> -    - docker:dind
> +  image:
> +    name: gcr.io/kaniko-project/executor:debug
> +    entrypoint: [""]
> +  variables:
> +    DOCKERFILE: "$CI_PROJECT_DIR/tests/docker/dockerfiles/$NAME.docker"
> +    CACHE_REPO: "$CI_REGISTRY/qemu-project/qemu/qemu/$NAME-cache"
>    before_script:
>      - export TAG="$CI_REGISTRY_IMAGE/qemu/$NAME:$QEMU_CI_CONTAINER_TAG"
> -    # Always ':latest' because we always use upstream as a common cache source
> -    - export COMMON_TAG="$CI_REGISTRY/qemu-project/qemu/qemu/$NAME:latest"
> -    - docker login $CI_REGISTRY -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD"
> -    - until docker info; do sleep 1; done
>    script:
>      - echo "TAG:$TAG"
> -    - echo "COMMON_TAG:$COMMON_TAG"
> -    - docker build --tag "$TAG" --cache-from "$TAG" --cache-from "$COMMON_TAG"
> -      --build-arg BUILDKIT_INLINE_CACHE=1
> -      -f "tests/docker/dockerfiles/$NAME.docker" "."
> -    - docker push "$TAG"
> -  after_script:
> -    - docker logout
> +    - /kaniko/executor
> +      --dockerfile "$DOCKERFILE"
> +      --destination "$TAG"
> +      --cache=true
> +      --cache-repo="$CACHE_REPO"

I'm surprised there is no need to provide the user/password
login credentials for the registry. Nonetheless I tested this
and it succeeded.

I guess gitlab somehow grants some magic authorization to any CI
job, which avoids the need for a manual login? I wonder why we
needed the 'docker login' step though? Perhaps because D-in-D
results in using an externally running docker daemon which didn't
inherit credentials from the job environment?
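
For reference, should explicit credentials ever turn out to be
needed, the workaround commonly shown for kaniko on GitLab is to
write a registry config by hand before invoking the executor. A
minimal sketch, using only the predefined CI_REGISTRY* variables
(untested here, since the implicit authorization already works):

  before_script:
    - export TAG="$CI_REGISTRY_IMAGE/qemu/$NAME:$QEMU_CI_CONTAINER_TAG"
    # Sketch: hand kaniko the registry credentials explicitly
    - mkdir -p /kaniko/.docker
    - echo "{\"auths\":{\"$CI_REGISTRY\":{\"auth\":\"$(echo -n "$CI_REGISTRY_USER:$CI_REGISTRY_PASSWORD" | base64 | tr -d '\n')\"}}}" > /kaniko/.docker/config.json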

Caching of course fails when I'm running jobs in my fork. IOW, if we
change container content in a fork and want to test it, it will be
doing a full build from scratch every time. This likely isn't the end
of the world because dockerfiles change infrequently, and when they
do, paying the price of a full rebuild is a time-limited problem until
a PULL request is sent and accepted.
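
Purely as an illustration (not something this patch does), the cache
repository could be derived from the job's own namespace, so that
fork pipelines at least reuse their own earlier builds. A sketch,
built only from variables the template already uses:

  variables:
    DOCKERFILE: "$CI_PROJECT_DIR/tests/docker/dockerfiles/$NAME.docker"
    # Hypothetical: every namespace caches into its own registry
    CACHE_REPO: "$CI_REGISTRY_IMAGE/qemu/$NAME-cache"

The obvious trade-off is that a fork would then never see upstream's
cache at all, so its first container build is still from scratch.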


TL;DR: functionally this patch is capable of working. The key downside
is that it doubles our storage usage. I'm not convinced Kaniko offers
a compelling enough benefit to justify this penalty.

With regards,
Daniel
Thomas Huth May 17, 2024, 6:24 a.m. UTC | #2
On 16/05/2024 20.24, Daniel P. Berrangé wrote:
> On Thu, May 16, 2024 at 05:52:43PM +0100, Camilla Conte wrote:
>> Enables caching from the qemu-project repository.
>>
>> Uses a dedicated "$NAME-cache" tag for caching, to address limitations.
>> See issue "when using --cache=true, kaniko fail to push cache layer [...]":
>> https://github.com/GoogleContainerTools/kaniko/issues/1459
...
> TL;DR: functionally this patch is capable of working. The key downside
> is that it doubles our storage usage. I'm not convinced Kaniko offers
> a compelling enough benefit to justify this penalty.

Will this patch fix the issues that we are currently seeing with the k8s 
runners not working in the upstream CI? If so, I think that would be enough 
benefit, wouldn't it?

  Thomas
Daniel P. Berrangé May 17, 2024, 7:37 a.m. UTC | #3
On Fri, May 17, 2024 at 08:24:44AM +0200, Thomas Huth wrote:
> On 16/05/2024 20.24, Daniel P. Berrangé wrote:
> > On Thu, May 16, 2024 at 05:52:43PM +0100, Camilla Conte wrote:
> > > Enables caching from the qemu-project repository.
> > > 
> > > Uses a dedicated "$NAME-cache" tag for caching, to address limitations.
> > > See issue "when using --cache=true, kaniko fail to push cache layer [...]":
> > > https://github.com/GoogleContainerTools/kaniko/issues/1459
> ...
> > TL;DR: functionally this patch is capable of working. The key downside
> > is that it doubles our storage usage. I'm not convinced Kaniko offers
> > a compelling enough benefit to justify this penalty.
> 
> Will this patch fix the issues that we are currently seeing with the k8s
> runners not working in the upstream CI? If so, I think that would be enough
> benefit, wouldn't it?

Paolo said on IRC that he has reverted the changes to the runner which
caused us problems. Docker in Docker is still a documented & supported
option for GitLab AFAICT, so I would hope we can keep using it as
before.

With regards,
Daniel
Daniel P. Berrangé May 17, 2024, 8:14 a.m. UTC | #4
On Thu, May 16, 2024 at 07:24:04PM +0100, Daniel P. Berrangé wrote:
> On Thu, May 16, 2024 at 05:52:43PM +0100, Camilla Conte wrote:
> > Enables caching from the qemu-project repository.
> > 
> > Uses a dedicated "$NAME-cache" tag for caching, to address limitations.
> > See issue "when using --cache=true, kaniko fail to push cache layer [...]":
> > https://github.com/GoogleContainerTools/kaniko/issues/1459
> 
> After investigating, this is a result of a different design approach
> for caching in kaniko.
> 
> In docker, it can leverage any existing image as a cache source,
> reusing individual layers that were present. IOW, there's no
> difference between a cache and a final image, they're one and the
> same thing
> 
> In kaniko, the cache is a distinct object type. IIUC, it is not
> populated with the individual layers, instead it has a custom
> format for storing the cached content. Therefore the concept of
> storing the cache at the same location as the final image, is
> completely inappropriate - you can't store two completely different
> kinds of content at the same place.
> 
> That is also why you can't just pull the cache image(s) beforehand,
> and also why it doesn't look like you can use multiple cache sources
> with kaniko.
> 
> None of this is inherently a bad thing... except when it comes
> to data storage. By using Kaniko we would, at minimum, be doubling
> the amount of data storage we consume in the gitlab registry.

Double is actually just the initial case. The cache stores layers
using docker tags whose names appear to be based on a hash of the "RUN"
command.

IOW, the first time we build a container we have double the usage.
When a dockerfile is updated changing a 'RUN' command, we now have
triple the storage usage for cache. Update the RUN command again,
and we now have quadruple the storage, and so on.

Kaniko does not appear to purge cache entries itself; it relies
on something else to do the cache purging.

GitLab has support for purging old docker tags, but I'm not an
admin on the QEMU project namespace, so I can't tell whether it
can be enabled or not. Many older projects have this permanently
disabled due to historical compat issues in gitlab after they
introduced the feature.

With regards,
Daniel
Camilla Conte May 20, 2024, 4:56 p.m. UTC | #5
On Fri, May 17, 2024 at 9:14 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, May 16, 2024 at 07:24:04PM +0100, Daniel P. Berrangé wrote:
> > On Thu, May 16, 2024 at 05:52:43PM +0100, Camilla Conte wrote:
> > > Enables caching from the qemu-project repository.
> > >
> > > Uses a dedicated "$NAME-cache" tag for caching, to address limitations.
> > > See issue "when using --cache=true, kaniko fail to push cache layer [...]":
> > > https://github.com/GoogleContainerTools/kaniko/issues/1459
> >
> > After investigating, this is a result of a different design approach
> > for caching in kaniko.
> >
> > In docker, it can leverage any existing image as a cache source,
> > reusing individual layers that were present. IOW, there's no
> > difference between a cache and a final image, they're one and the
> > same thing
> >
> > In kaniko, the cache is a distinct object type. IIUC, it is not
> > populated with the individual layers, instead it has a custom
> > format for storing the cached content. Therefore the concept of
> > storing the cache at the same location as the final image, is
> > completely inappropriate - you can't store two completely different
> > kinds of content at the same place.
> >
> > That is also why you can't just pull the cache image(s) beforehand,
> > and also why it doesn't look like you can use multiple cache sources
> > with kaniko.
> >
> > None of this is inherently a bad thing... except when it comes
> > to data storage. By using Kaniko we would, at minimum, be doubling
> > the amount of data storage we consume in the gitlab registry.
>
> Double is actually just the initial case. The cache stores layers
> using docker tags whose names appear to be based on a hash of the "RUN"
> command.
>
> IOW, the first time we build a container we have double the usage.
> When a dockerfile is updated changing a 'RUN' command, we now have
> triple the storage usage for cache. Update the RUN command again,
> and we now have quadruple the storage. etc.
>
> Kaniko does not appear to purge cache entries itself, and will rely
> on something else to do the cache purging.
>
> GitLab has support for purging old docker tags, but I'm not an
> admin on the QEMU project namespace, so can't tell if it can be
> enabled or not ? Many older projects have this permanently disabled
> due to historical compat issues in gitlab after they introduced the
> feature.

I'm pretty sure purging can be enabled. Gitlab itself proposes this
with a "set up cleanup" link on the registry page (1).
Can you recall what issues they were experiencing?

If this is the only issue blocking Kaniko adoption, and we can't solve
it by enabling the cleanup, I can write an additional step at the end
of the container build to explicitly remove old cache tags.

(1) https://gitlab.com/qemu-project/qemu/container_registry
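
For what it's worth, such a step could look roughly like the sketch
below, using GitLab's bulk tag delete API. $CLEANUP_TOKEN and
$CACHE_REPO_ID are hypothetical placeholders: a token with API access
and the numeric id of the "$NAME-cache" registry repository would
both have to be provisioned first.

  cleanup_cache_tags:
    stage: containers
    image:
      name: curlimages/curl:latest
      entrypoint: [""]
    script:
      # Keep a few recent cache tags, drop anything older than a month.
      # GitLab rate-limits this endpoint per registry repository.
      - >-
        curl --fail --request DELETE
        --header "PRIVATE-TOKEN: $CLEANUP_TOKEN"
        --data "name_regex_delete=.*" --data "keep_n=5" --data "older_than=1month"
        "$CI_API_V4_URL/projects/$CI_PROJECT_ID/registry/repositories/$CACHE_REPO_ID/tags"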

>
> With regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
Daniel P. Berrangé May 22, 2024, 10:46 a.m. UTC | #6
On Mon, May 20, 2024 at 05:56:46PM +0100, Camilla Conte wrote:
> On Fri, May 17, 2024 at 9:14 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Thu, May 16, 2024 at 07:24:04PM +0100, Daniel P. Berrangé wrote:
> > > On Thu, May 16, 2024 at 05:52:43PM +0100, Camilla Conte wrote:
> > > > Enables caching from the qemu-project repository.
> > > >
> > > > Uses a dedicated "$NAME-cache" tag for caching, to address limitations.
> > > > See issue "when using --cache=true, kaniko fail to push cache layer [...]":
> > > > https://github.com/GoogleContainerTools/kaniko/issues/1459
> > >
> > > After investigating, this is a result of a different design approach
> > > for caching in kaniko.
> > >
> > > In docker, it can leverage any existing image as a cache source,
> > > reusing individual layers that were present. IOW, there's no
> > > difference between a cache and a final image, they're one and the
> > > same thing
> > >
> > > In kaniko, the cache is a distinct object type. IIUC, it is not
> > > populated with the individual layers, instead it has a custom
> > > format for storing the cached content. Therefore the concept of
> > > storing the cache at the same location as the final image, is
> > > completely inappropriate - you can't store two completely different
> > > kinds of content at the same place.
> > >
> > > That is also why you can't just pull the cache image(s) beforehand,
> > > and also why it doesn't look like you can use multiple cache sources
> > > with kaniko.
> > >
> > > None of this is inherently a bad thing... except when it comes
> > > to data storage. By using Kaniko we would, at minimum, be doubling
> > > the amount of data storage we consume in the gitlab registry.
> >
> > Double is actually just the initial case. The cache stores layers
> > using docker tags whose names appear to be based on a hash of the "RUN"
> > command.
> >
> > IOW, the first time we build a container we have double the usage.
> > When a dockerfile is updated changing a 'RUN' command, we now have
> > triple the storage usage for cache. Update the RUN command again,
> > and we now have quadruple the storage. etc.
> >
> > Kaniko does not appear to purge cache entries itself, and will rely
> > on something else to do the cache purging.
> >
> > GitLab has support for purging old docker tags, but I'm not an
> > admin on the QEMU project namespace, so can't tell if it can be
> > enabled or not ? Many older projects have this permanently disabled
> > due to historical compat issues in gitlab after they introduced the
> > feature.
> 
> I'm pretty sure purging can be enabled. Gitlab itself proposes this
> with a "set up cleanup" link on the registry page (1).
> Can you recall what issues they were experiencing?

Looks like they may have finally fixed the issue in gitlab. They
had previously blocked cleanup on all repositories older than a
certain date.

> If this is the only issue blocking Kaniko adoption, and we can't solve
> it by enabling the cleanup, I can write an additional step at the end
> of the container build to explicitly remove old cache tags.

Cleanup stops the container usage growing without bound, but switching
to Kaniko will still double our long-term storage usage, which is
pretty undesirable IMHO.

With regards,
Daniel

Patch

diff --git a/.gitlab-ci.d/container-template.yml b/.gitlab-ci.d/container-template.yml
index 4eec72f383..066f253dd5 100644
--- a/.gitlab-ci.d/container-template.yml
+++ b/.gitlab-ci.d/container-template.yml
@@ -1,21 +1,18 @@ 
 .container_job_template:
   extends: .base_job_template
-  image: docker:latest
   stage: containers
-  services:
-    - docker:dind
+  image:
+    name: gcr.io/kaniko-project/executor:debug
+    entrypoint: [""]
+  variables:
+    DOCKERFILE: "$CI_PROJECT_DIR/tests/docker/dockerfiles/$NAME.docker"
+    CACHE_REPO: "$CI_REGISTRY/qemu-project/qemu/qemu/$NAME-cache"
   before_script:
     - export TAG="$CI_REGISTRY_IMAGE/qemu/$NAME:$QEMU_CI_CONTAINER_TAG"
-    # Always ':latest' because we always use upstream as a common cache source
-    - export COMMON_TAG="$CI_REGISTRY/qemu-project/qemu/qemu/$NAME:latest"
-    - docker login $CI_REGISTRY -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD"
-    - until docker info; do sleep 1; done
   script:
     - echo "TAG:$TAG"
-    - echo "COMMON_TAG:$COMMON_TAG"
-    - docker build --tag "$TAG" --cache-from "$TAG" --cache-from "$COMMON_TAG"
-      --build-arg BUILDKIT_INLINE_CACHE=1
-      -f "tests/docker/dockerfiles/$NAME.docker" "."
-    - docker push "$TAG"
-  after_script:
-    - docker logout
+    - /kaniko/executor
+      --dockerfile "$DOCKERFILE"
+      --destination "$TAG"
+      --cache=true
+      --cache-repo="$CACHE_REPO"