Message ID | 20211116163309.246602-1-thuth@redhat.com |
---|---|
State | New |
Series | gitlab-ci/cirrus: Increase timeout to 80 minutes |
On Tue, Nov 16, 2021 at 05:33:09PM +0100, Thomas Huth wrote:
> The jobs on Cirrus-CI sometimes get delayed quite a bit, waiting to
> be scheduled, so while the build test itself finishes within 60 minutes,
> the total run time of the jobs can be longer due to this waiting time.
> Thus let's increase the timeout on the gitlab side a little bit, so
> that these jobs are not marked as failing just because of the delay.

On a successful pipeline I see

  freebsd-11 - 28 minutes
  freebsd-12 - 57 minutes
  macos - 30 minutes

We know cirrus allows 2 concurrent jobs, so from that I infer
that the freebsd-12 job was queued for ~30 minutes waiting for
either the freebsd-11 or macos job to finish, and then it
ran in ~30 minutes, giving the ~60 minute total.

That's too close to the 60 minute gitlab default job timeout
for comfort - it can easily slip over 60 minutes by just a
small amount.

80 minutes will certainly help in the case where we
randomly take a little longer than 30 minutes to build,
and have 1 of the 3 jobs queued.

When we're running jobs on both master + staging, we can
have 2 jobs running and 4 more queued - 2 of those queued
might just finish in time, but 2 will definitely fail.
My patch will cut these extra jobs on master, so in the common
case we only ever get 1 queued, which should work well in
combination with your patch here. That should be good enough
for the qemu-project namespace, unless someone is triggering
pipelines for stable branch staging at the same time as
the master branch staging.

If we do want to worry about more than 2 queued jobs
again for that reason, we might consider putting
it up to 100 minutes. That would give us enough slack to
have 4 queued jobs behind two running jobs and have
them all succeed.

> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>  .gitlab-ci.d/cirrus.yml | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/.gitlab-ci.d/cirrus.yml b/.gitlab-ci.d/cirrus.yml
> index e7b25e7427..22d42585e4 100644
> --- a/.gitlab-ci.d/cirrus.yml
> +++ b/.gitlab-ci.d/cirrus.yml
> @@ -14,6 +14,7 @@
>    stage: build
>    image: registry.gitlab.com/libvirt/libvirt-ci/cirrus-run:master
>    needs: []
> +  timeout: 80m
>    allow_failure: true
>    script:
>      - source .gitlab-ci.d/cirrus/$NAME.vars

Whether 80 or 100 minutes, consider it

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

Regards,
Daniel
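Daniel's queueing arithmetic can be checked with a small model. This is a simplified sketch under stated assumptions, not a description of how Cirrus CI actually schedules work: every Cirrus job is assumed to be submitted at t=0, Cirrus is assumed to run at most 2 at once and to start queued jobs in submission order, and each GitLab wrapper job's clock is assumed to cover its Cirrus job's queue time plus build time. `gitlab_wall_clock` and `timed_out` are hypothetical helper names, and the 29 minute freebsd-12 build time is an assumption chosen to match the observed 57 minute total.

```python
import heapq


def gitlab_wall_clock(build_minutes, concurrency=2):
    """Return the minutes each GitLab-side job spends waiting (queue + build).

    Assumes all Cirrus jobs are submitted at t=0 and are started in
    submission order on whichever of `concurrency` slots frees up first.
    """
    free_at = [0] * concurrency  # min-heap: time at which each slot is free
    totals = []
    for build in build_minutes:
        start = heapq.heappop(free_at)  # earliest available slot
        finish = start + build
        heapq.heappush(free_at, finish)
        totals.append(finish)  # wrapper job's clock started at t=0
    return totals


def timed_out(build_minutes, timeout, concurrency=2):
    """Indices of jobs whose GitLab-side time would exceed `timeout`."""
    totals = gitlab_wall_clock(build_minutes, concurrency)
    return [i for i, total in enumerate(totals) if total > timeout]


if __name__ == "__main__":
    # freebsd-11, macos, freebsd-12 (29 min build is an assumption)
    print(gitlab_wall_clock([28, 30, 29]))   # [28, 30, 57]
    # master + staging: six ~30 minute jobs, two running, four queued
    print(gitlab_wall_clock([30] * 6))       # [30, 30, 60, 60, 90, 90]
    print(timed_out([30] * 6, timeout=80))   # [4, 5]
    print(timed_out([30] * 6, timeout=100))  # []
```

In this model an 80 minute limit still fails the two jobs at the back of a six-job queue, while 100 minutes lets all six finish, which matches the trade-off described above. Real pipelines submit jobs at slightly different times, so the numbers are only indicative.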
On 11/16/21 17:49, Daniel P. Berrangé wrote:
> On Tue, Nov 16, 2021 at 05:33:09PM +0100, Thomas Huth wrote:
>> The jobs on Cirrus-CI sometimes get delayed quite a bit, waiting to
>> be scheduled, so while the build test itself finishes within 60 minutes,
>> the total run time of the jobs can be longer due to this waiting time.
>> Thus let's increase the timeout on the gitlab side a little bit, so
>> that these jobs are not marked as failing just because of the delay.
>
> On a successful pipeline I see
>
>   freebsd-11 - 28 minutes
>   freebsd-12 - 57 minutes
>   macos - 30 minutes
>
> We know cirrus allows 2 concurrent jobs, so from that I infer
> that the freebsd-12 job was queued for ~30 minutes waiting for
> either the freebsd-11 or macos job to finish, and then it
> ran in ~30 minutes, giving the ~60 minute total.
>
> That's too close to the 60 minute gitlab default job timeout
> for comfort - it can easily slip over 60 minutes by just a
> small amount.
>
> 80 minutes will certainly help in the case where we
> randomly take a little longer than 30 minutes to build,
> and have 1 of the 3 jobs queued.
>
> When we're running jobs on both master + staging, we can
> have 2 jobs running and 4 more queued - 2 of those queued
> might just finish in time, but 2 will definitely fail.
> My patch will cut these extra jobs on master, so in the common
> case we only ever get 1 queued, which should work well in
> combination with your patch here. That should be good enough
> for the qemu-project namespace, unless someone is triggering
> pipelines for stable branch staging at the same time as
> the master branch staging.
>
> If we do want to worry about more than 2 queued jobs
> again for that reason, we might consider putting
> it up to 100 minutes. That would give us enough slack to
> have 4 queued jobs behind two running jobs and have
> them all succeed.
>
>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>> ---
>>  .gitlab-ci.d/cirrus.yml | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/.gitlab-ci.d/cirrus.yml b/.gitlab-ci.d/cirrus.yml
>> index e7b25e7427..22d42585e4 100644
>> --- a/.gitlab-ci.d/cirrus.yml
>> +++ b/.gitlab-ci.d/cirrus.yml
>> @@ -14,6 +14,7 @@
>>    stage: build
>>    image: registry.gitlab.com/libvirt/libvirt-ci/cirrus-run:master
>>    needs: []
>> +  timeout: 80m
>>    allow_failure: true
>>    script:
>>      - source .gitlab-ci.d/cirrus/$NAME.vars
>
> Whether 80 or 100 minutes, consider it
>
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

This pipeline took 1h51m09s:
https://gitlab.com/qemu-project/qemu/-/pipelines/409666733/builds
But Richard restarted unstable jobs, which probably added time
to the total.

IIRC from a maintainer perspective 1h15 is the upper limit.
80m fits, 100m is over.

Up to the project maintainers (personally I don't have any
objection, in particular if this reduces the failure rate).

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
On Tue, Nov 16, 2021 at 1:33 PM Thomas Huth <thuth@redhat.com> wrote:
>
> The jobs on Cirrus-CI sometimes get delayed quite a bit, waiting to
> be scheduled, so while the build test itself finishes within 60 minutes,
> the total run time of the jobs can be longer due to this waiting time.
> Thus let's increase the timeout on the gitlab side a little bit, so
> that these jobs are not marked as failing just because of the delay.
>
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>  .gitlab-ci.d/cirrus.yml | 1 +
>  1 file changed, 1 insertion(+)
>

Reviewed-by: Willian Rampazzo <willianr@redhat.com>
On 16/11/2021 18.09, Philippe Mathieu-Daudé wrote:
> On 11/16/21 17:49, Daniel P. Berrangé wrote:
>> On Tue, Nov 16, 2021 at 05:33:09PM +0100, Thomas Huth wrote:
>>> The jobs on Cirrus-CI sometimes get delayed quite a bit, waiting to
>>> be scheduled, so while the build test itself finishes within 60 minutes,
>>> the total run time of the jobs can be longer due to this waiting time.
>>> Thus let's increase the timeout on the gitlab side a little bit, so
>>> that these jobs are not marked as failing just because of the delay.
...
>>> diff --git a/.gitlab-ci.d/cirrus.yml b/.gitlab-ci.d/cirrus.yml
>>> index e7b25e7427..22d42585e4 100644
>>> --- a/.gitlab-ci.d/cirrus.yml
>>> +++ b/.gitlab-ci.d/cirrus.yml
>>> @@ -14,6 +14,7 @@
>>>    stage: build
>>>    image: registry.gitlab.com/libvirt/libvirt-ci/cirrus-run:master
>>>    needs: []
>>> +  timeout: 80m
>>>    allow_failure: true
>>>    script:
>>>      - source .gitlab-ci.d/cirrus/$NAME.vars
>>
>> Whether 80 or 100 minutes, consider it
>>
>> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
>
> This pipeline took 1h51m09s:
> https://gitlab.com/qemu-project/qemu/-/pipelines/409666733/builds
> But Richard restarted unstable jobs, which probably added time
> to the total.
>
> IIRC from a maintainer perspective 1h15 is the upper limit.
> 80m fits, 100m is over.

I think I agree ... I normally don't want to wait more than a little bit more than one
hour, so 100 minutes feels too long already. We already have some 70m timeouts in other
jobs, and one 80 minute timeout in .gitlab-ci.d/crossbuild-template.yml, so I'd say 80
minutes is really the upper boundary that we should use.

> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>

Thanks to all for your reviews!

 Thomas
On 11/16/21 6:22 PM, Thomas Huth wrote:
> On 16/11/2021 18.09, Philippe Mathieu-Daudé wrote:
>> On 11/16/21 17:49, Daniel P. Berrangé wrote:
>>> On Tue, Nov 16, 2021 at 05:33:09PM +0100, Thomas Huth wrote:
>>>> The jobs on Cirrus-CI sometimes get delayed quite a bit, waiting to
>>>> be scheduled, so while the build test itself finishes within 60 minutes,
>>>> the total run time of the jobs can be longer due to this waiting time.
>>>> Thus let's increase the timeout on the gitlab side a little bit, so
>>>> that these jobs are not marked as failing just because of the delay.
> ...
>>>> diff --git a/.gitlab-ci.d/cirrus.yml b/.gitlab-ci.d/cirrus.yml
>>>> index e7b25e7427..22d42585e4 100644
>>>> --- a/.gitlab-ci.d/cirrus.yml
>>>> +++ b/.gitlab-ci.d/cirrus.yml
>>>> @@ -14,6 +14,7 @@
>>>>    stage: build
>>>>    image: registry.gitlab.com/libvirt/libvirt-ci/cirrus-run:master
>>>>    needs: []
>>>> +  timeout: 80m
>>>>    allow_failure: true
>>>>    script:
>>>>      - source .gitlab-ci.d/cirrus/$NAME.vars
>>>
>>> Whether 80 or 100 minutes, consider it
>>>
>>> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
>>
>> This pipeline took 1h51m09s:
>> https://gitlab.com/qemu-project/qemu/-/pipelines/409666733/builds
>> But Richard restarted unstable jobs, which probably added time
>> to the total.
>>
>> IIRC from a maintainer perspective 1h15 is the upper limit.
>> 80m fits, 100m is over.
>
> I think I agree ... I normally don't want to wait more than a little bit more than one
> hour, so 100 minutes feels too long already. We already have some 70m timeouts in other
> jobs, and one 80 minute timeout in .gitlab-ci.d/crossbuild-template.yml, so I'd say 80
> minutes is really the upper boundary that we should use.

We are also talking apples and oranges:
Gitlab timeouts are on the amount of time the job runs.
Cirrus timeouts appear to be on the amount of time the job is queued.

If cirrus would just not start accounting until the thing runs we'd be fine.


r~
On Tue, Nov 16, 2021 at 06:36:50PM +0100, Richard Henderson wrote:
> On 11/16/21 6:22 PM, Thomas Huth wrote:
> > On 16/11/2021 18.09, Philippe Mathieu-Daudé wrote:
> > > On 11/16/21 17:49, Daniel P. Berrangé wrote:
> > > > On Tue, Nov 16, 2021 at 05:33:09PM +0100, Thomas Huth wrote:
> > > > > The jobs on Cirrus-CI sometimes get delayed quite a bit, waiting to
> > > > > be scheduled, so while the build test itself finishes within 60 minutes,
> > > > > the total run time of the jobs can be longer due to this waiting time.
> > > > > Thus let's increase the timeout on the gitlab side a little bit, so
> > > > > that these jobs are not marked as failing just because of the delay.
> > ...
> > > > > diff --git a/.gitlab-ci.d/cirrus.yml b/.gitlab-ci.d/cirrus.yml
> > > > > index e7b25e7427..22d42585e4 100644
> > > > > --- a/.gitlab-ci.d/cirrus.yml
> > > > > +++ b/.gitlab-ci.d/cirrus.yml
> > > > > @@ -14,6 +14,7 @@
> > > > >    stage: build
> > > > >    image: registry.gitlab.com/libvirt/libvirt-ci/cirrus-run:master
> > > > >    needs: []
> > > > > +  timeout: 80m
> > > > >    allow_failure: true
> > > > >    script:
> > > > >      - source .gitlab-ci.d/cirrus/$NAME.vars
> > > >
> > > > Whether 80 or 100 minutes, consider it
> > > >
> > > > Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
> > >
> > > This pipeline took 1h51m09s:
> > > https://gitlab.com/qemu-project/qemu/-/pipelines/409666733/builds
> > > But Richard restarted unstable jobs, which probably added time
> > > to the total.
> > >
> > > IIRC from a maintainer perspective 1h15 is the upper limit.
> > > 80m fits, 100m is over.
> >
> > I think I agree ... I normally don't want to wait more than a little bit
> > more than one hour, so 100 minutes feels too long already. We already
> > have some 70m timeouts in other jobs, and one 80 minute timeout in
> > .gitlab-ci.d/crossbuild-template.yml, so I'd say 80 minutes is really
> > the upper boundary that we should use.
>
> We are also talking apples and oranges:
> Gitlab timeouts are on the amount of time the job runs.
> Cirrus timeouts appear to be on the amount of time the job is queued.
>
> If cirrus would just not start accounting until the thing runs we'd be fine.

Unfortunately it isn't that easy. Our cirrus CI jobs are launched using
the cirrus-run tool, from a gitlab job. The timeouts we're usually
hitting are from the gitlab job which is sitting around waiting for
the cirrus job it launched to finish, so it can report back stats.

Cirrus CI does itself have a job timeout, but I'm not aware of us
hitting that typically, unless I'm misinterpreting something.

Regards,
Daniel
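To make the distinction between the two timeouts concrete, here is a toy model of which limit fires for a given job. It assumes, following Daniel's description, that the GitLab job running cirrus-run stays alive from submission until its Cirrus job finishes, so its clock covers queue time plus run time; it also assumes the Cirrus-side limit counts only run time. `CirrusJob` and `tripped_timeouts` are hypothetical names, and the Cirrus limit is passed in as a parameter rather than being a claim about Cirrus CI's actual default.

```python
from dataclasses import dataclass


@dataclass
class CirrusJob:
    queued_min: float  # minutes spent waiting for a free Cirrus slot
    run_min: float     # minutes the build actually executes


def tripped_timeouts(job, gitlab_timeout_min, cirrus_timeout_min):
    """Return which side's timeout would fire for `job` (hypothetical model)."""
    tripped = []
    # Assumed: the GitLab wrapper job waits through queueing *and* the build.
    if job.queued_min + job.run_min > gitlab_timeout_min:
        tripped.append("gitlab")
    # Assumed: the Cirrus-side limit only covers the build itself.
    if job.run_min > cirrus_timeout_min:
        tripped.append("cirrus")
    return tripped


if __name__ == "__main__":
    queued = CirrusJob(queued_min=30, run_min=32)
    print(tripped_timeouts(queued, gitlab_timeout_min=60, cirrus_timeout_min=60))  # ['gitlab']
    print(tripped_timeouts(queued, gitlab_timeout_min=80, cirrus_timeout_min=60))  # []
```

Under these assumptions a job whose build comfortably fits any per-build limit still fails on the GitLab side once queueing pushes its total past 60 minutes, which is the failure mode the `timeout: 80m` setting targets.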
On 16/11/2021 19.20, Daniel P. Berrangé wrote:
> On Tue, Nov 16, 2021 at 06:36:50PM +0100, Richard Henderson wrote:
>> On 11/16/21 6:22 PM, Thomas Huth wrote:
>>> On 16/11/2021 18.09, Philippe Mathieu-Daudé wrote:
>>>> On 11/16/21 17:49, Daniel P. Berrangé wrote:
>>>>> On Tue, Nov 16, 2021 at 05:33:09PM +0100, Thomas Huth wrote:
>>>>>> The jobs on Cirrus-CI sometimes get delayed quite a bit, waiting to
>>>>>> be scheduled, so while the build test itself finishes within 60 minutes,
>>>>>> the total run time of the jobs can be longer due to this waiting time.
>>>>>> Thus let's increase the timeout on the gitlab side a little bit, so
>>>>>> that these jobs are not marked as failing just because of the delay.
>>> ...
>>>>>> diff --git a/.gitlab-ci.d/cirrus.yml b/.gitlab-ci.d/cirrus.yml
>>>>>> index e7b25e7427..22d42585e4 100644
>>>>>> --- a/.gitlab-ci.d/cirrus.yml
>>>>>> +++ b/.gitlab-ci.d/cirrus.yml
>>>>>> @@ -14,6 +14,7 @@
>>>>>>    stage: build
>>>>>>    image: registry.gitlab.com/libvirt/libvirt-ci/cirrus-run:master
>>>>>>    needs: []
>>>>>> +  timeout: 80m
>>>>>>    allow_failure: true
>>>>>>    script:
>>>>>>      - source .gitlab-ci.d/cirrus/$NAME.vars
>>>>>
>>>>> Whether 80 or 100 minutes, consider it
>>>>>
>>>>> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
>>>>
>>>> This pipeline took 1h51m09s:
>>>> https://gitlab.com/qemu-project/qemu/-/pipelines/409666733/builds
>>>> But Richard restarted unstable jobs, which probably added time
>>>> to the total.
>>>>
>>>> IIRC from a maintainer perspective 1h15 is the upper limit.
>>>> 80m fits, 100m is over.
>>>
>>> I think I agree ... I normally don't want to wait more than a little bit
>>> more than one hour, so 100 minutes feels too long already. We already
>>> have some 70m timeouts in other jobs, and one 80 minute timeout in
>>> .gitlab-ci.d/crossbuild-template.yml, so I'd say 80 minutes is really
>>> the upper boundary that we should use.
>>
>> We are also talking apples and oranges:
>> Gitlab timeouts are on the amount of time the job runs.
>> Cirrus timeouts appear to be on the amount of time the job is queued.
>>
>> If cirrus would just not start accounting until the thing runs we'd be fine.
>
> Unfortunately it isn't that easy. Our cirrus CI jobs are launched using
> the cirrus-run tool, from a gitlab job. The timeouts we're usually
> hitting are from the gitlab job which is sitting around waiting for
> the cirrus job it launched to finish, so it can report back stats.
>
> Cirrus CI does itself have a job timeout, but I'm not aware of us
> hitting that typically, unless I'm misinterpreting something.

Right, the problem is the timeout on the gitlab-CI side, not the
timeout on the Cirrus-CI side. I've never seen us hitting the
timeout on the Cirrus side either.

 Thomas
diff --git a/.gitlab-ci.d/cirrus.yml b/.gitlab-ci.d/cirrus.yml
index e7b25e7427..22d42585e4 100644
--- a/.gitlab-ci.d/cirrus.yml
+++ b/.gitlab-ci.d/cirrus.yml
@@ -14,6 +14,7 @@
   stage: build
   image: registry.gitlab.com/libvirt/libvirt-ci/cirrus-run:master
   needs: []
+  timeout: 80m
   allow_failure: true
   script:
     - source .gitlab-ci.d/cirrus/$NAME.vars
The jobs on Cirrus-CI sometimes get delayed quite a bit, waiting to
be scheduled, so while the build test itself finishes within 60 minutes,
the total run time of the jobs can be longer due to this waiting time.
Thus let's increase the timeout on the gitlab side a little bit, so
that these jobs are not marked as failing just because of the delay.

Signed-off-by: Thomas Huth <thuth@redhat.com>
---
 .gitlab-ci.d/cirrus.yml | 1 +
 1 file changed, 1 insertion(+)