Message ID | 6a793a6dba4f052ca8bbc35edd63df601f46478b.1521146096.git.yann.morin.1998@free.fr |
---|---|
State | Accepted |
Commit | 7fb6e782542fc440c2da226ec4525236d0508b77 |
Headers | show |
Series | None | expand |
>>>>> "Yann" == Yann E MORIN <yann.morin.1998@free.fr> writes: Hi, > As part of the build, we run some instrumentation hooks to gather > statistics about the usage of the target/, staging/ and host/ > directories, so that we can generate reports for the user, that > shows: > - for each file, what package installed it, > - for each package,the size that it installed. > In so doing, we run a double md5 pass on all files of the affected > directories. These passes were mostly invisible when we were only > scanning target/, but has greatly increased in time now that we also > scan staging/ and host/ (but only in the corresponding _CMDS, of > course). > This md5 wsa mostly aimed at catching packages that would "cheat" with > mtime/atime/ctime somehow. They can't really cheat on md5, though [0]. > Timings however speak for themselves, with this defconfig (slightly > biggish-but-still-manageable build) [1]. > host/ 20965 files 1.2GiB > staging/ 4715 files 333MiB > target/ 1801 files 44MiB > All instrumentation steps, using md5: 19min 27s > All instrumentation steps, using mtime: 14min 45s > No instrumentation step at all: 14min 31s > So, using mtime is an almost-5min improvement, i.e. about 25% faster, > while removing all instrumentation steps does not gain that much more... > So, we switch to using mtime, because in the end that's still good-enough > for our use-case: generating some graphs. It is not mission-critical, and > if a graph is slightly off, that's not biggy. It can anyway be attributed > to a broken package's buildsystem, which should get fixed. > However, we lose the ability to track directories. Non-empty directories > can be tracked back by a bit of scripting, but empty directories are > simply not caught. If we were to also look for directories using mtime, > we would catch parents of installed files: > - /foo/bar/ exists > - a package installs /foo/bar/buz > - mtime of /foo/bar/ is changed to account for the nex file in it. Playing around with this, I noticed two other issues: - It doesn't work for packages using rsync to install, E.G. skeleton-init-common as rsync also sets the mtime to match the source files - It breaks for <pkg>-reinstall I don't think either of those are really big issues compared to the huge slowdown, but it is worth noticing. > +define step_pkg_size_inner > + cd $(2); \ > + find . \( -type f -o -type L \) \ > + -newer $($(PKG)_DIR)/.stamp_built \ > + -exec printf '$(1),%s\n' {} + \ > + >> $(BUILD_DIR)/packages-file-list$(3).txt What find version are you using? My fileutils find (and the busybox applet) use 'l' for symlinks, so I've changed it to that. Committed with that fixed (and a few tweaks to the commit message), thanks.
Peter, All, On 2018-03-18 15:14 +0100, Peter Korsgaard spake thusly: > >>>>> "Yann" == Yann E MORIN <yann.morin.1998@free.fr> writes: > > As part of the build, we run some instrumentation hooks to gather > > statistics about the usage of the target/, staging/ and host/ > > directories, so that we can generate reports for the user, that > > shows: > > - for each file, what package installed it, > > - for each package,the size that it installed. [--SNIP--] > > So, we switch to using mtime, because in the end that's still good-enough > > for our use-case: generating some graphs. It is not mission-critical, and > > if a graph is slightly off, that's not biggy. It can anyway be attributed > > to a broken package's buildsystem, which should get fixed. > > - /foo/bar/ exists > > - a package installs /foo/bar/buz > > - mtime of /foo/bar/ is changed to account for the nex file in it. > > Playing around with this, I noticed two other issues: > > - It doesn't work for packages using rsync to install, > E.G. skeleton-init-common as rsync also sets the mtime to match the > source files We could maybe tell rsycn not to do that, then? > - It breaks for <pkg>-reinstall Well, we can't guarantee anything except with a clean build from scratch anyway. > I don't think either of those are really big issues compared to the huge > slowdown, but it is worth noticing. Well, the -reinstall was already not working correctly, because the list pf files before/after would be alsmost the same, and the md5-diff would miss all the laready-installed files for the package. The rsync issue is new, but we can "fix" it in a later patch, then, for those packages like the skeletons, by using the --no-times option for example. However, if a third-party package internally uses rsync as its install method, we're screwed. But who would be insane enough to do that? ;-] Alternatively, we could use ctime instead of mtime, maybe? Or check both? > > +define step_pkg_size_inner > > + cd $(2); \ > > + find . \( -type f -o -type L \) \ > > + -newer $($(PKG)_DIR)/.stamp_built \ > > + -exec printf '$(1),%s\n' {} + \ > > + >> $(BUILD_DIR)/packages-file-list$(3).txt > > What find version are you using? My fileutils find (and the busybox > applet) use 'l' for symlinks, so I've changed it to that. Doh, the s/L/l/ is still uncomitted here. Dang... > Committed with that fixed (and a few tweaks to the commit message), > thanks. Thanks! :-) Regards, Yann E. MORIN.
>>>>> "Yann" == Yann E MORIN <yann.morin.1998@free.fr> writes: Hi, >> - It doesn't work for packages using rsync to install, >> E.G. skeleton-init-common as rsync also sets the mtime to match the >> source files > We could maybe tell rsycn not to do that, then? Yes, possibly. >> - It breaks for <pkg>-reinstall > Well, we can't guarantee anything except with a clean build from scratch > anyway. True. We could potentionally do a touch on the stamp file before running find, but it is somewhat icky. >> I don't think either of those are really big issues compared to the huge >> slowdown, but it is worth noticing. > Well, the -reinstall was already not working correctly, because the list > pf files before/after would be alsmost the same, and the md5-diff would > miss all the laready-installed files for the package. > The rsync issue is new, but we can "fix" it in a later patch, then, for > those packages like the skeletons, by using the --no-times option for > example. Yes. I guess we cannot use --no-times unconditionally in SYSTEM_RSYNC, as the mtime shouldn't be touched for source files so OVERRIDE_SRCDIR doesn't rebuild too much. > However, if a third-party package internally uses rsync as its install > method, we're screwed. But who would be insane enough to do that? ;-] And even so, the breakage is not so bad. > Alternatively, we could use ctime instead of mtime, maybe? Or check > both? ctime would presumably miss modifications to existing files, so that is no good. >> > +define step_pkg_size_inner >> > + cd $(2); \ >> > + find . \( -type f -o -type L \) \ >> > + -newer $($(PKG)_DIR)/.stamp_built \ >> > + -exec printf '$(1),%s\n' {} + \ >> > + >> $(BUILD_DIR)/packages-file-list$(3).txt >> >> What find version are you using? My fileutils find (and the busybox >> applet) use 'l' for symlinks, so I've changed it to that. > Doh, the s/L/l/ is still uncomitted here. Dang... ;)
On Thu, 2018-03-15 at 21:35 +0100, Yann E. MORIN wrote: > Timings however speak for themselves, with this defconfig (slightly > biggish-but-still-manageable build) [1]. > > host/ 20965 files 1.2GiB > staging/ 4715 files 333MiB > target/ 1801 files 44MiB > > All instrumentation steps, using md5: 19min 27s > All instrumentation steps, using mtime: 14min 45s > No instrumentation step at all: 14min 31s > > So, using mtime is an almost-5min improvement, i.e. about 25% faster, > while removing all instrumentation steps does not gain that much more... After fixing s/L/l/, here are the timings for a 15 minute build: configure 510.29 build 166.12 extract 45.53 other 43.88 hostinstall 35.43 check_host_rpath 33.37 check_bin_arch 28.30 targetinstall 23.37 stageinstall 15.97 step_pkg_size 9.49 Without ccache, the build step would be somewhat longer. > Now, we also change the way we handle symlinks. Previously, we would > hash the file pointed to by the symlink. Now, we only look at the mtime > of the symlink itself, which still detects modifications. This patch also fixes a problem the previous system had w.r.t. broken symlinks. They did not show up as owned by any packages. Now they are owned by the package that creates the link. It also means a package that replaced a file which was linked to does not also appear, incorrectly, to replace all the links as well.
Hello, On Mon, 19 Mar 2018 16:30:52 +0000, Trent Piepho wrote: > After fixing s/L/l/, here are the timings for a 15 minute build: > > configure 510.29 > build 166.12 > extract 45.53 > other 43.88 > hostinstall 35.43 > check_host_rpath 33.37 > check_bin_arch 28.30 > targetinstall 23.37 > stageinstall 15.97 > step_pkg_size 9.49 Thanks for those measurements. I still find it a bit annoying that we spend 1m10sec on instrumentation related topics on a 15 minutes build. The use of ccache makes the build time lower than it normally is, and therefore makes the instrumentation cost even more visible, but using ccache is a valid use case. Should we move the check_host_rpath and check_bin_arch checks as finalize hooks, instead of running them at the end of each package installation ? We can always get the package that installed the "bogus" file through the package-file-list*.txt files, no ? This would save 1 minute of build time on the above test. Best regards, Thomas
>>>>> "Trent" == Trent Piepho <tpiepho@impinj.com> writes: Hi, > After fixing s/L/l/, here are the timings for a 15 minute build: > configure 510.29 > build 166.12 > extract 45.53 > other 43.88 > hostinstall 35.43 > check_host_rpath 33.37 > check_bin_arch 28.30 > targetinstall 23.37 > stageinstall 15.97 > step_pkg_size 9.49 > Without ccache, the build step would be somewhat longer. Looks good! (well, the configure step time is horrible). >> Now, we also change the way we handle symlinks. Previously, we would >> hash the file pointed to by the symlink. Now, we only look at the mtime >> of the symlink itself, which still detects modifications. > This patch also fixes a problem the previous system had w.r.t. broken > symlinks. They did not show up as owned by any packages. Now they are > owned by the package that creates the link. It also means a package > that replaced a file which was linked to does not also appear, > incorrectly, to replace all the links as well. Ahh, ok - Good.
>>>>> "Thomas" == Thomas Petazzoni <thomas.petazzoni@bootlin.com> writes: > Hello, > On Mon, 19 Mar 2018 16:30:52 +0000, Trent Piepho wrote: >> After fixing s/L/l/, here are the timings for a 15 minute build: >> >> configure 510.29 >> build 166.12 >> extract 45.53 >> other 43.88 >> hostinstall 35.43 >> check_host_rpath 33.37 >> check_bin_arch 28.30 >> targetinstall 23.37 >> stageinstall 15.97 >> step_pkg_size 9.49 > Thanks for those measurements. I still find it a bit annoying that we > spend 1m10sec on instrumentation related topics on a 15 minutes build. > The use of ccache makes the build time lower than it normally is, and > therefore makes the instrumentation cost even more visible, but using > ccache is a valid use case. Agreed, Buildroot has always been about low overhead and fast builds. > Should we move the check_host_rpath and check_bin_arch checks as > finalize hooks, instead of running them at the end of each package > installation ? We can always get the package that installed the "bogus" > file through the package-file-list*.txt files, no ? I would think so as well. Anybody interested in giving it a shot?
On Mon, 2018-03-19 at 21:04 +0100, Peter Korsgaard wrote: > >> After fixing s/L/l/, here are the timings for a 15 minute build: > >> > >> configure 510.29 > >> build 166.12 > >> extract 45.53 > >> other 43.88 > >> hostinstall 35.43 > >> check_host_rpath 33.37 > >> check_bin_arch 28.30 > >> targetinstall 23.37 > >> stageinstall 15.97 > >> step_pkg_size 9.49 > > > Thanks for those measurements. I still find it a bit annoying that we > > spend 1m10sec on instrumentation related topics on a 15 minutes build. > > The use of ccache makes the build time lower than it normally is, and > > therefore makes the instrumentation cost even more visible, but using > > ccache is a valid use case. Those times were on a decently speced native Linux workstation. For a slightly different build in a dockerized container on a vsphere VM acting as a agent for a Bamboo CI process, I get this timings from a total build of 33 minutes: configure 786.48 build 656.64 step_pkg_size 94.67 extract 90.29 other 87.05 postimage 47.90 check_host_rpath 47.21 hostinstall 42.22 targetinstall 35.36 check_bin_arch 31.01 stageinstall 30.38 finalize 20.72 download 19.33 All sources were already downloaded, so the download time can be mostly attributed to re-checking the file hashes. ccache hit rate is > 99%, yet build time really is that slow even with ccache. High cost of random I/O on a VM I think. Instrumentation is a bit less than 10% of total build time. I'll also point out the interesting cost of step_pkg_size vs hostinstall. One would think running "find" over the host tree would be much faster than copying all those files into the host tree in first place. Yet it's not! Likely because the hostinstall step copies each file once, while step_pkg_size must stat each file about 60 times, once for each host package. If one had per-package install trees, then most of these per-package instrumentation steps could be much faster by operating only on the package's files instead of all previous packages' files as well.
Hi all, On Sun, Mar 18, 2018 at 05:33:45PM +0100, Peter Korsgaard wrote: > >>>>> "Yann" == Yann E MORIN <yann.morin.1998@free.fr> writes: > > Hi, > > >> - It doesn't work for packages using rsync to install, > >> E.G. skeleton-init-common as rsync also sets the mtime to match the > >> source files > > > We could maybe tell rsycn not to do that, then? > > Yes, possibly. > > >> - It breaks for <pkg>-reinstall > > > Well, we can't guarantee anything except with a clean build from scratch > > anyway. > > True. We could potentionally do a touch on the stamp file before running > find, but it is somewhat icky. > > >> I don't think either of those are really big issues compared to the huge > >> slowdown, but it is worth noticing. > > > Well, the -reinstall was already not working correctly, because the list > > pf files before/after would be alsmost the same, and the md5-diff would > > miss all the laready-installed files for the package. > > > The rsync issue is new, but we can "fix" it in a later patch, then, for > > those packages like the skeletons, by using the --no-times option for > > example. > > Yes. I guess we cannot use --no-times unconditionally in SYSTEM_RSYNC, > as the mtime shouldn't be touched for source files so OVERRIDE_SRCDIR > doesn't rebuild too much. > > > However, if a third-party package internally uses rsync as its install > > method, we're screwed. But who would be insane enough to do that? ;-] > > And even so, the breakage is not so bad. It really depends on what you use these files for. The original use case for the target list was rootfs size analysis. In the discussion I have seen comments like missing a few files is not that important here, but I disagree: if the missing file is 2MB large, it is a big problem. Another use in-tree is to check for check-uniq-files. While this is a non-critical feature, it's a pity if it would not detect problems because the lists are inaccurate. But there are out-of-tree uses too. The most obvious usage is simply to understand which package was responsible for a given file, even separate from size analysis. But there are also derived use cases. For example we are using the target list in order to extract some packages from the root filesystem. For example, instead of on the root filesystem (initramfs or NOR flash), they should end up on the NAND flash. A script gets as input the list of packages to extract this way, and uses the list to get the right associated files. I'm sure there are other use cases. The current timestamp-based approach not guaranteeing an accurate list is problematic for many such uses. And as you already mentioned, since we don't have full control over the build steps done in any given package, we don't know which timestamps they will use. There may be very good reasons to install certain files with their original timestamp and not the one from the build. Best regards, Thomas
Hello, On Thu, 22 Mar 2018 17:41:44 +0100, Thomas De Schampheleire wrote: > It really depends on what you use these files for. > > The original use case for the target list was rootfs size analysis. In the > discussion I have seen comments like missing a few files is not that important > here, but I disagree: if the missing file is 2MB large, it is a big problem. > > Another use in-tree is to check for check-uniq-files. While this is a > non-critical feature, it's a pity if it would not detect problems because the > lists are inaccurate. > > But there are out-of-tree uses too. The most obvious usage is simply to > understand which package was responsible for a given file, even separate from > size analysis. > > But there are also derived use cases. For example we are using the target list > in order to extract some packages from the root filesystem. For example, instead > of on the root filesystem (initramfs or NOR flash), they should end up on the > NAND flash. A script gets as input the list of packages to extract this way, and > uses the list to get the right associated files. > > I'm sure there are other use cases. > > The current timestamp-based approach not guaranteeing an accurate list is > problematic for many such uses. And as you already mentioned, since we don't have > full control over the build steps done in any given package, we don't know which > timestamps they will use. There may be very good reasons to install certain > files with their original timestamp and not the one from the build. These are all valid concerns, but what do you suggest ? The current approach of hashing all files clearly doesn't scale, as a significant amount of build time is now spent on hashing files. Thomas
On Thu, Mar 22, 2018 at 05:50:35PM +0100, Thomas Petazzoni wrote: > Hello, > > On Thu, 22 Mar 2018 17:41:44 +0100, Thomas De Schampheleire wrote: > > > It really depends on what you use these files for. > > > > The original use case for the target list was rootfs size analysis. In the > > discussion I have seen comments like missing a few files is not that important > > here, but I disagree: if the missing file is 2MB large, it is a big problem. > > > > Another use in-tree is to check for check-uniq-files. While this is a > > non-critical feature, it's a pity if it would not detect problems because the > > lists are inaccurate. > > > > But there are out-of-tree uses too. The most obvious usage is simply to > > understand which package was responsible for a given file, even separate from > > size analysis. > > > > But there are also derived use cases. For example we are using the target list > > in order to extract some packages from the root filesystem. For example, instead > > of on the root filesystem (initramfs or NOR flash), they should end up on the > > NAND flash. A script gets as input the list of packages to extract this way, and > > uses the list to get the right associated files. > > > > I'm sure there are other use cases. > > > > The current timestamp-based approach not guaranteeing an accurate list is > > problematic for many such uses. And as you already mentioned, since we don't have > > full control over the build steps done in any given package, we don't know which > > timestamps they will use. There may be very good reasons to install certain > > files with their original timestamp and not the one from the build. > > These are all valid concerns, but what do you suggest ? > > The current approach of hashing all files clearly doesn't scale, as a > significant amount of build time is now spent on hashing files. > I can only observe that previously, when we still only listed the target files, the impact did not seem to be that bad, and the concerns about impact on build time arose with the creation of staging and host lists. (I hope I caught this correctly from the discussions, I did not yet do measurements myself. I just saw several differences in the list files when applying this patch on top of 2018.02). So one possible alternative is to go back to a situation where only target files are listed, or make the different lists optional. Users that want the lists and are ready to accept build time impact, can enable it. Those that don't care about the lists and just want a fast build, can disable it. We'd loose the feature of check-uniq-files in case a list is not present, of course. Yet another alternative could be to have a different method depending on the list. Although I personally think that all lists should be accurate, if they are created. Another approach: just do a find without md5 (possibly depending on some option). If all you care about is an accurate list of who created a file but don't care that much about others possibly overwriting one, then a simple find is enough and normally quite fast. Best regards, Thomas
On Thu, 2018-03-22 at 18:11 +0100, Thomas De Schampheleire wrote: > > > > The current timestamp-based approach not guaranteeing an accurate list is > > > problematic for many such uses. And as you already mentioned, since we don't have > > > full control over the build steps done in any given package, we don't know which > > > timestamps they will use. There may be very good reasons to install certain > > > files with their original timestamp and not the one from the build. > > > > These are all valid concerns, but what do you suggest ? > > > > The current approach of hashing all files clearly doesn't scale, as a > > significant amount of build time is now spent on hashing files. > > The hash approach also fails to handle symlinks well. I think the answer is to install each package into a clean per-package install tree, then combine those trees. It will fix all the these problems, allow more parallelism, and quite possible be outright more efficient since it will scale with the number of files, while all currently presented approaches scale with the number of files multiplied by the number of packages. > I can only observe that previously, when we still only listed the target files, > the impact did not seem to be that bad, and the concerns about impact on build > time arose with the creation of staging and host lists. That was the big one, but here's a breakdown: step_pkg_size-stage 143.50 step_pkg_size-target 267.14 step_pkg_size-host 419.21 Target alone is still over four minutes. This was for a 12 minutes building + 14 minutes instrumentation build.
>>>>> "Thomas" == Thomas Petazzoni <thomas.petazzoni@bootlin.com> writes: Hi, > These are all valid concerns, but what do you suggest ? > The current approach of hashing all files clearly doesn't scale, as a > significant amount of build time is now spent on hashing files. Indeed. If the choice is between huge overhead vs potentially minor inaccuracies in these list, then I think the choice is made fast, even though the inaccuracies aren't nice - Especially if we have a plan for fixing the the inaccuracies going forward (E.G. toplevel parallel builds).
On 22-03-18 17:41, Thomas De Schampheleire wrote: > Hi all, > > On Sun, Mar 18, 2018 at 05:33:45PM +0100, Peter Korsgaard wrote: >>>>>>> "Yann" == Yann E MORIN <yann.morin.1998@free.fr> writes: [snip] > The current timestamp-based approach not guaranteeing an accurate list is > problematic for many such uses. And as you already mentioned, since we don't have > full control over the build steps done in any given package, we don't know which > timestamps they will use. There may be very good reasons to install certain > files with their original timestamp and not the one from the build. We could revert to the original before/after file list approach, and just use something like "stat -c '%N %Y %f'" instead of md5sum. Though I certainly like Trent's idea of installing each package in a separate directory, too. With PPS we're more or less there anyway. We'd just need to step up the adoption rate of PPS then :-) Regards, Arnout
Hello, On Fri, 23 Mar 2018 23:39:47 +0100, Arnout Vandecappelle wrote: > > The current timestamp-based approach not guaranteeing an accurate list is > > problematic for many such uses. And as you already mentioned, since we don't have > > full control over the build steps done in any given package, we don't know which > > timestamps they will use. There may be very good reasons to install certain > > files with their original timestamp and not the one from the build. > > We could revert to the original before/after file list approach, and just use > something like "stat -c '%N %Y %f'" instead of md5sum. > > Though I certainly like Trent's idea of installing each package in a separate > directory, too. With PPS we're more or less there anyway. We'd just need to step > up the adoption rate of PPS then :-) With the PPS patches I posted, we are not installing each package to a separate folder, but to a folder where all the package dependencies have already been installed. Of course, we could change to a model where the per-package staging and target folders really only contain what the package has installed, and have a separate per-package SDK, in which we do the "accumulation" of the dependencies. Best regards, Thomas
diff --git a/package/pkg-generic.mk b/package/pkg-generic.mk index 2a82025a04..293824c601 100644 --- a/package/pkg-generic.mk +++ b/package/pkg-generic.mk @@ -57,53 +57,26 @@ GLOBAL_INSTRUMENTATION_HOOKS += step_time # Hooks to collect statistics about installed files -define _step_pkg_size_get_file_list - (cd $(2) ; \ - ( \ - find . -xtype f -print0 | xargs -0 md5sum ; \ - find . -xtype d -print0 | xargs -0 -I{} printf 'directory {}\n'; \ - ) \ - ) | sort > $1 -endef - -# This hook will be called before the installation of a package. We store in -# a file named .br_filelist_before the list of files currently installed. -# Note that the MD5 is also stored, in order to identify if the files are -# overwritten. -# $(1): package name (ignored) -# $(2): base directory to search in -define step_pkg_size_start - $(call _step_pkg_size_get_file_list,$($(PKG)_DIR)/.br_filelist_before,$(2)) -endef - -# This hook will be called after the installation of a package. We store in -# a file named .br_filelist_after the list of files (and their MD5) currently -# installed. We then do a diff with the .br_filelist_before to compute the -# list of files installed by this package. # The suffix is typically empty for the target variant, for legacy backward # compatibility. -# $(1): package name (ignored) +# $(1): package name # $(2): base directory to search in # $(3): suffix of file (optional) -define step_pkg_size_end - $(call _step_pkg_size_get_file_list,$($(PKG)_DIR)/.br_filelist_after,$(2)) - comm -13 $($(PKG)_DIR)/.br_filelist_before $($(PKG)_DIR)/.br_filelist_after | \ - while read hash file ; do \ - echo "$(1),$${file}" ; \ - done >> $(BUILD_DIR)/packages-file-list$(3).txt - rm -f $($(PKG)_DIR)/.br_filelist_before $($(PKG)_DIR)/.br_filelist_after +define step_pkg_size_inner + cd $(2); \ + find . \( -type f -o -type L \) \ + -newer $($(PKG)_DIR)/.stamp_built \ + -exec printf '$(1),%s\n' {} + \ + >> $(BUILD_DIR)/packages-file-list$(3).txt endef define step_pkg_size $(if $(filter install-target,$(2)),\ - $(if $(filter start,$(1)),$(call step_pkg_size_start,$(3),$(TARGET_DIR))) \ - $(if $(filter end,$(1)),$(call step_pkg_size_end,$(3),$(TARGET_DIR)))) + $(if $(filter end,$(1)),$(call step_pkg_size_inner,$(3),$(TARGET_DIR)))) $(if $(filter install-staging,$(2)),\ - $(if $(filter start,$(1)),$(call step_pkg_size_start,$(3),$(STAGING_DIR),-staging)) \ - $(if $(filter end,$(1)),$(call step_pkg_size_end,$(3),$(STAGING_DIR),-staging))) + $(if $(filter end,$(1)),$(call step_pkg_size_inner,$(3),$(STAGING_DIR),-staging))) $(if $(filter install-host,$(2)),\ - $(if $(filter start,$(1)),$(call step_pkg_size_start,$(3),$(HOST_DIR),-host)) \ - $(if $(filter end,$(1)),$(call step_pkg_size_end,$(3),$(HOST_DIR),-host))) + $(if $(filter end,$(1)),$(call step_pkg_size_inner,$(3),$(HOST_DIR),-host))) endef GLOBAL_INSTRUMENTATION_HOOKS += step_pkg_size
As part of the build, we run some instrumentation hooks to gather statistics about the usage of the target/, staging/ and host/ directories, so that we can generate reports for the user, that shows: - for each file, what package installed it, - for each package,the size that it installed. In so doing, we run a double md5 pass on all files of the affected directories. These passes were mostly invisible when we were only scanning target/, but has greatly increased in time now that we also scan staging/ and host/ (but only in the corresponding _CMDS, of course). This md5 wsa mostly aimed at catching packages that would "cheat" with mtime/atime/ctime somehow. They can't really cheat on md5, though [0]. Timings however speak for themselves, with this defconfig (slightly biggish-but-still-manageable build) [1]. host/ 20965 files 1.2GiB staging/ 4715 files 333MiB target/ 1801 files 44MiB All instrumentation steps, using md5: 19min 27s All instrumentation steps, using mtime: 14min 45s No instrumentation step at all: 14min 31s So, using mtime is an almost-5min improvement, i.e. about 25% faster, while removing all instrumentation steps does not gain that much more... So, we switch to using mtime, because in the end that's still good-enough for our use-case: generating some graphs. It is not mission-critical, and if a graph is slightly off, that's not biggy. It can anyway be attributed to a broken package's buildsystem, which should get fixed. However, we lose the ability to track directories. Non-empty directories can be tracked back by a bit of scripting, but empty directories are simply not caught. If we were to also look for directories using mtime, we would catch parents of installed files: - /foo/bar/ exists - a package installs /foo/bar/buz - mtime of /foo/bar/ is changed to account for the nex file in it. So we do not track directories at all, and we lose empty directories. The existing tracking was mostly happenstance, with the original submission and comments not really accounting for a real use-case. Now, we also change the way we handle symlinks. Previously, we would hash the file pointed to by the symlink. Now, we only look at the mtime of the symlink itself, which still detects modifications. Eventually, this also means that we now no longer need to establish a list before the install step; we can now simply run after the install step. [0] Yeah, md5 is very weak, but we're not guarding against malicious attacks, just about careless modifications. [1] defconfig used for tests: BR2_arm=y BR2_cortex_a7=y BR2_TOOLCHAIN_EXTERNAL=y BR2_INIT_SYSTEMD=y BR2_PACKAGE_MESA3D=y BR2_PACKAGE_MESA3D_GALLIUM_DRIVER_ETNAVIV=y BR2_PACKAGE_MESA3D_GALLIUM_DRIVER_SWRAST=y BR2_PACKAGE_MESA3D_GALLIUM_DRIVER_VC4=y BR2_PACKAGE_MESA3D_GALLIUM_DRIVER_VIRGL=y BR2_PACKAGE_MESA3D_DRI_DRIVER_SWRAST=y BR2_PACKAGE_MESA3D_OSMESA=y BR2_PACKAGE_MESA3D_OPENGL_ES=y BR2_PACKAGE_SYSTEMD_JOURNAL_GATEWAY=y BR2_PACKAGE_SYSTEMD_BACKLIGHT=y BR2_PACKAGE_SYSTEMD_BINFMT=y BR2_PACKAGE_SYSTEMD_COREDUMP=y BR2_PACKAGE_SYSTEMD_FIRSTBOOT=y BR2_PACKAGE_SYSTEMD_HIBERNATE=y BR2_PACKAGE_SYSTEMD_IMPORTD=y BR2_PACKAGE_SYSTEMD_LOCALED=y BR2_PACKAGE_SYSTEMD_LOGIND=y BR2_PACKAGE_SYSTEMD_MACHINED=y BR2_PACKAGE_SYSTEMD_POLKIT=y BR2_PACKAGE_SYSTEMD_QUOTACHECK=y BR2_PACKAGE_SYSTEMD_RANDOMSEED=y BR2_PACKAGE_SYSTEMD_RFKILL=y BR2_PACKAGE_SYSTEMD_SMACK_SUPPORT=y BR2_PACKAGE_SYSTEMD_SYSUSERS=y BR2_PACKAGE_SYSTEMD_VCONSOLE=y Reported-by: Trent Piepho <tpiepho@impinj.com> Signed-off-by: "Yann E. MORIN" <yann.morin.1998@free.fr> Cc: Trent Piepho <tpiepho@impinj.com> Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com> Cc: Peter Korsgaard <peter@korsgaard.com> --- package/pkg-generic.mk | 47 ++++++++++------------------------------------- 1 file changed, 10 insertions(+), 37 deletions(-)