diff mbox

[1/1] size-stats: don't count hard links

Message ID 1476489930-10456-1-git-send-email-fhunleth@troodon-software.com
State Superseded
Headers show

Commit Message

Frank Hunleth Oct. 15, 2016, 12:05 a.m. UTC
This change adds inode tracking to the size-stats script so that hard
links don't cause files to be double counted. This has a significant
effect on the size computation for some packages. For example, git has
around a dozen hard links to a large file. Before this change, git would
weigh in at about 170 MB with the total filesystem size reported as
175 MB. The actual rootfs.ext2 size was around 16 MB. With the change,
the git package registers at 10.5 MB with a total filesystem size of
15.8 MB.

Signed-off-by: Frank Hunleth <fhunleth@troodon-software.com>
---
 support/scripts/size-stats | 27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)

Comments

Thomas Petazzoni Oct. 15, 2016, 6:17 a.m. UTC | #1
Hello,

On Fri, 14 Oct 2016 20:05:30 -0400, Frank Hunleth wrote:
> This change adds inode tracking to the size-stats script so that hard
> links don't cause files to be double counted. This has a significant
> effect on the size computation for some packages. For example, git has
> around a dozen hard links to a large file. Before this change, git would
> weigh in at about 170 MB with the total filesystem size reported as
> 175 MB. The actual rootfs.ext2 size was around 16 MB. With the change,
> the git package registers at 10.5 MB with a total filesystem size of
> 15.8 MB.
> 
> Signed-off-by: Frank Hunleth <fhunleth@troodon-software.com>

Thanks a lot for this change! Definitely this is something that needs
to be handled.

> -def add_file(filesdict, relpath, abspath, pkg):
> +def add_file(filesdict, seeninodes, relpath, abspath, pkg):
>      if not os.path.exists(abspath):
>          return
>      if os.path.islink(abspath):
>          return
> -    sz = os.stat(abspath).st_size
> +    if relpath in filesdict:
> +        return

I'm not sure why this test is being added, or at least why it's related
to the inode tracking.

> @@ -97,10 +113,11 @@ def build_package_size(filesdict, builddir):
>              if not frelpath in filesdict:
>                  print("WARNING: %s is not part of any package" % frelpath)
>                  pkg = "unknown"
> +                sz = os.path.getsize(fpath)

So for files not belonging to packages, we do not track inodes?

Maybe we should instead have our own filesize() helper function that
takes care of returning the right size if we have never seen this
inode, or 0 if we have already seen it. It could then be used in both
places.

Another concern is that some files will now be reported as having a 0
size, while it's not entirely correct. This does not matter at all for
the per-package graph or CSV file, but is a bit more annoying for the
per-file CSV file. Indeed, a user inspecting this CSV file will wonder
what those zero-size files are. So, another option is to divide the
size of the file by the number of hard-links, and spread the size over
the different hard-links. But it's also not very nice, as the size
reported in the CSV will not match the visible size of the file.

So, maybe we should just leave it like you propose, unless others have
a better idea about this.

Thanks!

Thomas
Frank Hunleth Oct. 15, 2016, 11:05 p.m. UTC | #2
On Sat, Oct 15, 2016 at 2:17 AM, Thomas Petazzoni
<thomas.petazzoni@free-electrons.com> wrote:
> Hello,
>
> On Fri, 14 Oct 2016 20:05:30 -0400, Frank Hunleth wrote:
>> This change adds inode tracking to the size-stats script so that hard
>> links don't cause files to be double counted. This has a significant
>> effect on the size computation for some packages. For example, git has
>> around a dozen hard links to a large file. Before this change, git would
>> weigh in at about 170 MB with the total filesystem size reported as
>> 175 MB. The actual rootfs.ext2 size was around 16 MB. With the change,
>> the git package registers at 10.5 MB with a total filesystem size of
>> 15.8 MB.
>>
>> Signed-off-by: Frank Hunleth <fhunleth@troodon-software.com>
>
> Thanks a lot for this change! Definitely this is something that needs
> to be handled.
>
>> -def add_file(filesdict, relpath, abspath, pkg):
>> +def add_file(filesdict, seeninodes, relpath, abspath, pkg):
>>      if not os.path.exists(abspath):
>>          return
>>      if os.path.islink(abspath):
>>          return
>> -    sz = os.stat(abspath).st_size
>> +    if relpath in filesdict:
>> +        return
>
> I'm not sure why this test is being added, or at least why it's related
> to the inode tracking.

I find that I often have duplicate files listed in
package-file-list.txt. I think that they're all due to rebuilding
packages. Without the hard link check, the duplicate files were
harmless in that they just resulted in a file being added to filesdict
more than once. With the hard link checking, an if statement was
needed to make sure that a file with a previously seen inode wouldn't
mistakenly be added as a hard link. Checking up front here seemed best
since it not only address the hard link case, but also saved a little
time handling duplicate entries.

>
>> @@ -97,10 +113,11 @@ def build_package_size(filesdict, builddir):
>>              if not frelpath in filesdict:
>>                  print("WARNING: %s is not part of any package" % frelpath)
>>                  pkg = "unknown"
>> +                sz = os.path.getsize(fpath)
>
> So for files not belonging to packages, we do not track inodes?

Good point. I hadn't considered that anyone would use hard links
outside of packages, but I suppose there's nothing preventing it.

>
> Maybe we should instead have our own filesize() helper function that
> takes care of returning the right size if we have never seen this
> inode, or 0 if we have already seen it. It could then be used in both
> places.
>
> Another concern is that some files will now be reported as having a 0
> size, while it's not entirely correct. This does not matter at all for
> the per-package graph or CSV file, but is a bit more annoying for the
> per-file CSV file. Indeed, a user inspecting this CSV file will wonder
> what those zero-size files are. So, another option is to divide the
> size of the file by the number of hard-links, and spread the size over
> the different hard-links. But it's also not very nice, as the size
> reported in the CSV will not match the visible size of the file.
>
> So, maybe we should just leave it like you propose, unless others have
> a better idea about this.

I had similar concerns about reporting 0 sized files, since it's not
that obvious and which file gets picked as the non-zero length file is
arbitrary. I hadn't thought of spreading the file size across all
links, but I'm not sure that would make things more clear. The only
idea I have is to add a comments field to the csv, add the file as 0
length, and have a comment "hard link to xyz." I pretty much only use
the package size stats, so I'll wait around for other suggestions too.

Frank

>
> Thanks!
>
> Thomas
> --
> Thomas Petazzoni, CTO, Free Electrons
> Embedded Linux and Kernel engineering
> http://free-electrons.com
diff mbox

Patch

diff --git a/support/scripts/size-stats b/support/scripts/size-stats
index 0ddcc07..858b2e0 100755
--- a/support/scripts/size-stats
+++ b/support/scripts/size-stats
@@ -41,17 +41,32 @@  colors = ['#e60004', '#009836', '#2e1d86', '#ffed00',
 # key, and as the value a tuple containing the name of the package to
 # which the file belongs and the size of the file.
 #
+# Hard links are pruned by tracking inodes so that files are not
+# double counted.
+#
 # filesdict: the dict to which  the file is added
+# seeninodes: the set of inodes that have been seen before
 # relpath: relative path of the file
 # fullpath: absolute path to the file
 # pkg: package to which the file belongs
 #
-def add_file(filesdict, relpath, abspath, pkg):
+def add_file(filesdict, seeninodes, relpath, abspath, pkg):
     if not os.path.exists(abspath):
         return
     if os.path.islink(abspath):
         return
-    sz = os.stat(abspath).st_size
+    if relpath in filesdict:
+        return
+
+    st = os.stat(abspath)
+    sz = st.st_size
+    ino = st.st_ino
+    if ino in seeninodes:
+        # hard link
+        sz = 0
+    else:
+        seeninodes.add(ino)
+
     filesdict[relpath] = (pkg, sz)
 
 #
@@ -64,13 +79,14 @@  def add_file(filesdict, relpath, abspath, pkg):
 #
 def build_package_dict(builddir):
     filesdict = {}
+    seeninodes = set()
     with open(os.path.join(builddir, "build", "packages-file-list.txt")) as filelistf:
         for l in filelistf.readlines():
             pkg, fpath = l.split(",", 1)
             # remove the initial './' in each file path
             fpath = fpath.strip()[2:]
             fullpath = os.path.join(builddir, "target", fpath)
-            add_file(filesdict, fpath, fullpath, pkg)
+            add_file(filesdict, seeninodes, fpath, fullpath, pkg)
     return filesdict
 
 #
@@ -97,10 +113,11 @@  def build_package_size(filesdict, builddir):
             if not frelpath in filesdict:
                 print("WARNING: %s is not part of any package" % frelpath)
                 pkg = "unknown"
+                sz = os.path.getsize(fpath)
             else:
-                pkg = filesdict[frelpath][0]
+                pkg, sz = filesdict[frelpath]
 
-            pkgsize[pkg] += os.path.getsize(fpath)
+            pkgsize[pkg] += sz
 
     return pkgsize