[v4,2/5] autobuild-run: initial implementation of get_reproducibility_failure_reason()
diff mbox series

Message ID 20190820145231.15507-2-itsatharva@gmail.com
State New
Headers show
Series
  • [v4,1/5] autobuild-run: check if reproducibile_results exists before checking its size
Related show

Commit Message

Atharva Lele Aug. 20, 2019, 2:52 p.m. UTC
Analyze the JSON formatted output from diffoscope and check if
the differences are due to a filesystem reproducibility issue
or a package reproducibility issue.

Also, discard the deltas because they might take up too much space.

Signed-off-by: Atharva Lele <itsatharva@gmail.com>
---
Changes v2 -> v4:
  - Change if-else to try-except
  - remove blank line
Changes v1 -> v2:
  - Refactor using subfunctions and local variables (suggested by Thomas)
  - Added comments (suggested by Thomas)
  - Use more pythonic loops (suggested by Thomas)
---
 scripts/autobuild-run | 88 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 88 insertions(+)

Comments

Arnout Vandecappelle Sept. 8, 2019, 5:06 p.m. UTC | #1
Hi Atharva,

 I'm very sorry that I'm so late reviewing this...

On 20/08/2019 16:52, Atharva Lele wrote:
> Analyze the JSON formatted output from diffoscope and check if
> the differences are due to a filesystem reproducibility issue
> or a package reproducibility issue.
> 
> Also, discard the deltas because they might take up too much space.
> 
> Signed-off-by: Atharva Lele <itsatharva@gmail.com>
> ---
> Changes v2 -> v4:
>   - Change if-else to try-except
>   - remove blank line
> Changes v1 -> v2:
>   - Refactor using subfunctions and local variables (suggested by Thomas)
>   - Added comments (suggested by Thomas)
>   - Use more pythonic loops (suggested by Thomas)
> ---
>  scripts/autobuild-run | 88 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 88 insertions(+)
> 
> diff --git a/scripts/autobuild-run b/scripts/autobuild-run
> index 99b57dd..384cf73 100755
> --- a/scripts/autobuild-run
> +++ b/scripts/autobuild-run
> @@ -131,6 +131,7 @@ import csv
>  import docopt
>  import errno
>  import hashlib
> +import json
>  import mmap
>  import multiprocessing
>  import os
> @@ -599,6 +600,93 @@ class Builder:
>          if reject_results():
>              return
>  
> +        def get_reproducibility_failure_reason(reproducible_results):

 This function really warrants a docstring. Especially because it does more than
the title says: it not only returns the reason, but it also updates the JSON
file with it and changes the way the diffs are stored.

> +            def split_delta(delta):
> +                # Take a delta and split it into added, deleted lines.
> +                added = []
> +                deleted = []
> +                for line in delta:
> +                    if line.startswith("+"):
> +                        added.append(line)
> +                    if line.startswith("-"):
> +                        deleted.append(line)
> +                return added, deleted

 A more pythonesque way to do this would be

added = [line for line in delta if line.startswith("+")]

> +
> +            def get_package(sourcef):
> +                # Returns which package the source file belongs to.
> +                with open(packages_file_list, "r") as packagef:

 It is strange to have packages_file_list defined at a higher scope and then
used within this function. I think it's better to move its definition inside
this function as well.

> +                    for line in packagef:
> +                        if sourcef in line:
> +                            package = line.split(',')[0]

 I think it is better to keep a list of packages for the JSON output, and use
the last one for the reason output. So here you would have:

 packages = [line.split(',', 1)[0] for line in packagef if sourcef in line]


> +
> +                try:
> +                    # Get package version
> +                    package_info = json.loads(subprocess.check_output(["make", "--no-print-directory",
> +                                                                       "O=%s" % self.outputdir,
> +                                                                       "-C", self.srcdir,
> +                                                                       "%s-show-info" % package]))
> +                    if "version" in package_info[package]:
> +                        version = package_info[package]["version"]
> +                        return [package, version]

 I don't know how useful it is to extract the package version... It is currently
part of the reason file mostly by accident, I think.

> +                    else:
> +                        return [package]
> +                except:
> +                    return ["not found"]
> +
> +            def cleanup(l):
> +                # Takes a list and removes data which is redundant (source2) or data
> +                # that might take up too much space (like huge diffs).
> +                if "unified_diff" in l:
> +                    l.pop("unified_diff")

 Condition can be avoided with

 l.pop("unified_diff", None)

(or maybe that's only python3? I'm not used to legacy python any more :-)

> +                if "source2" in l:
> +                    l.pop("source2")
> +
> +            packages_file_list = os.path.join(self.outputdir, "build", "packages-file-list.txt")
> +
> +            with open(reproducible_results, "r") as reproduciblef:
> +                json_data = json.load(reproduciblef)
> +
> +            if json_data["unified_diff"] == None:

 == None makes little sense - you should just use 'not' in that case. If you
really want to check that it's None, you should use 'is None'.  However, do you
really get a null entry in the json output? Isn't it that the "unified_diff"
section is missing in the json output? In that case, the
json_data["unified_diff"] would give a KeyError exception...

> +                # Remove the file list because it is not useful, i.e. it only shows
> +                # which files vary, and nothing more.
> +                if json_data["details"][0]["source1"] == "file list":
> +                    json_data["details"].pop(0)
> +
> +                # Iterate over details in the diffoscope output.
> +                for item in json_data["details"]:
> +                    diff_src = item["source1"]
> +                    item["package"] = get_package(diff_src)
> +
> +                    # In some cases, diffoscope uses multiple commands to get various
> +                    # diffs. Due to this, it generates a "details" key for those files
> +                    # instead of just storing the diff in the "unified_diff" key.
> +                    if item["unified_diff"] == None:
> +                        for item_details in item["details"]:
> +                            diff = item_details["unified_diff"].split("\n")
> +                            split_deltas = split_delta(diff)
> +                            item_details["added"] = split_deltas[0][:100]
> +                            item_details["deleted"] = split_deltas[1][:100]
> +                            cleanup(item_details)
> +                    else:
> +                        diff = item["unified_diff"].split("\n")
> +                        split_deltas = split_delta(diff)
> +                        item["added"] = split_deltas[0][:100]
> +                        item["deleted"] = split_deltas[1][:100]
> +                    cleanup(item)

 This cleanup has nothing to do with getting the failure reason. I think it
would be better to do it in a separate function (and a separate patch). Also,
I'm not really happy yet with this diff cleanup, because we loose all context
information. So maybe, for now, it's better to just use the --max-report-size
option.

> +                # We currently just set the reason from first non-reproducible package in the
> +                # dictionary.
> +                reason = json_data["details"][0]["package"]
> +
> +                # If there does exist a unified_diff directly for the .tar images, it is probably
> +                # a filesystem reproducibility issue.

 This comment should come after the else, or before the condition. Like this, it
looks like it belongs to the == None condition.

 Maybe it's cleaner to swap the condition and put this filesystem assumption first.


 Regards,
 Arnout

> +            else:
> +                reason = ["filesystem"]
> +
> +            with open(reproducible_results, "w") as reproduciblef:
> +                json.dump(json_data, reproduciblef, sort_keys=True, indent=4)
> +
> +            return reason
> +
>          def get_failure_reason():
>              # Output is a tuple (package, version), or None.
>              lastlines = decode_bytes(subprocess.Popen(
>
Thomas Petazzoni Sept. 8, 2019, 10:42 p.m. UTC | #2
On Sun, 8 Sep 2019 19:06:11 +0200
Arnout Vandecappelle <arnout@mind.be> wrote:

> > +                    if "version" in package_info[package]:
> > +                        version = package_info[package]["version"]
> > +                        return [package, version]  
> 
>  I don't know how useful it is to extract the package version... It is currently
> part of the reason file mostly by accident, I think.

I find it quite useful to have the version in the failure reason.
Thanks to that, when you search with ?reason=foo%, you can see
immediately in the list which versions of the package are/were affected
by a failure.

Thomas
Arnout Vandecappelle Sept. 9, 2019, 7:35 a.m. UTC | #3
On 09/09/2019 00:42, Thomas Petazzoni wrote:
> On Sun, 8 Sep 2019 19:06:11 +0200
> Arnout Vandecappelle <arnout@mind.be> wrote:
> 
>>> +                    if "version" in package_info[package]:
>>> +                        version = package_info[package]["version"]
>>> +                        return [package, version]  
>>
>>  I don't know how useful it is to extract the package version... It is currently
>> part of the reason file mostly by accident, I think.
> 
> I find it quite useful to have the version in the failure reason.
> Thanks to that, when you search with ?reason=foo%, you can see
> immediately in the list which versions of the package are/were affected
> by a failure.

 Yes, but:

1. I really think it is there by accident :-)

2. For reproducibility failures, I don't think it is that relevant.

In case a failure was introduced by a version bump, you anyway have to check by
hand when the version was bumped to be sure that that is really the cause.

 Regards,
 Arnout
Thomas Petazzoni Sept. 9, 2019, 7:45 a.m. UTC | #4
On Mon, 9 Sep 2019 09:35:34 +0200
Arnout Vandecappelle <arnout@mind.be> wrote:

>  Yes, but:
> 
> 1. I really think it is there by accident :-)
> 
> 2. For reproducibility failures, I don't think it is that relevant.
> 
> In case a failure was introduced by a version bump, you anyway have to check by
> hand when the version was bumped to be sure that that is really the cause.

Agreed.

Thomas
Atharva Lele Sept. 12, 2019, 12:47 p.m. UTC | #5
On Sun, Sep 8, 2019 at 10:36 PM Arnout Vandecappelle <arnout@mind.be> wrote:
>
>  Hi Atharva,
>
>  I'm very sorry that I'm so late reviewing this...
>
> On 20/08/2019 16:52, Atharva Lele wrote:
> > Analyze the JSON formatted output from diffoscope and check if
> > the differences are due to a filesystem reproducibility issue
> > or a package reproducibility issue.
> >
> > Also, discard the deltas because they might take up too much space.
> >
> > Signed-off-by: Atharva Lele <itsatharva@gmail.com>
> > ---
> > Changes v2 -> v4:
> >   - Change if-else to try-except
> >   - remove blank line
> > Changes v1 -> v2:
> >   - Refactor using subfunctions and local variables (suggested by Thomas)
> >   - Added comments (suggested by Thomas)
> >   - Use more pythonic loops (suggested by Thomas)
> > ---
> >  scripts/autobuild-run | 88 +++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 88 insertions(+)
> >
> > diff --git a/scripts/autobuild-run b/scripts/autobuild-run
> > index 99b57dd..384cf73 100755
> > --- a/scripts/autobuild-run
> > +++ b/scripts/autobuild-run
> > @@ -131,6 +131,7 @@ import csv
> >  import docopt
> >  import errno
> >  import hashlib
> > +import json
> >  import mmap
> >  import multiprocessing
> >  import os
> > @@ -599,6 +600,93 @@ class Builder:
> >          if reject_results():
> >              return
> >
> > +        def get_reproducibility_failure_reason(reproducible_results):
>
>  This function really warrants a docstring. Especially because it does more than
> the title says: it not only returns the reason, but it also updates the JSON
> file with it and changes the way the diffs are stored.
>

That's true.. I'll add it in a future revision.

>  A more pythonesque way to do this would be
>
> added = [line for line in delta if line.startswith("+")]
>

Man.. 3 months coding mostly in python and its still hard to code the
python way. ;)
Thanks!

> > +
> > +            def get_package(sourcef):
> > +                # Returns which package the source file belongs to.
> > +                with open(packages_file_list, "r") as packagef:
>
>  It is strange to have packages_file_list defined at a higher scope and then
> used within this function. I think it's better to move its definition inside
> this function as well.
>

Right, thanks!

> > +                    for line in packagef:
> > +                        if sourcef in line:
> > +                            package = line.split(',')[0]
>
>  I think it is better to keep a list of packages for the JSON output, and use
> the last one for the reason output. So here you would have:
>
>  packages = [line.split(',', 1)[0] for line in packagef if sourcef in line]
>

Alright, I'll switch it.

>
> > +
> > +                try:
> > +                    # Get package version
> > +                    package_info = json.loads(subprocess.check_output(["make", "--no-print-directory",
> > +                                                                       "O=%s" % self.outputdir,
> > +                                                                       "-C", self.srcdir,
> > +                                                                       "%s-show-info" % package]))
> > +                    if "version" in package_info[package]:
> > +                        version = package_info[package]["version"]
> > +                        return [package, version]
>
>  I don't know how useful it is to extract the package version... It is currently
> part of the reason file mostly by accident, I think.
>

I kept it purely because it was being reported on the autobuilder site
since before I started work. I read the comments from you and Thomas'
and I agree that we'd just need to manually check things. So it
doesn't make sense to go the extra efforts of extracting the version
here. I'll remove it for the next version of the patch.

> > +                    else:
> > +                        return [package]
> > +                except:
> > +                    return ["not found"]
> > +
> > +            def cleanup(l):
> > +                # Takes a list and removes data which is redundant (source2) or data
> > +                # that might take up too much space (like huge diffs).
> > +                if "unified_diff" in l:
> > +                    l.pop("unified_diff")
>
>  Condition can be avoided with
>
>  l.pop("unified_diff", None)
>
> (or maybe that's only python3? I'm not used to legacy python any more :-)
>

That is valid for dictionaries, even in python2. However, we are
passing a list element to cleanup() and list.pop() only takes in 1
argument i.e. the index.

> > +                if "source2" in l:
> > +                    l.pop("source2")
> > +
> > +            packages_file_list = os.path.join(self.outputdir, "build", "packages-file-list.txt")
> > +
> > +            with open(reproducible_results, "r") as reproduciblef:
> > +                json_data = json.load(reproduciblef)
> > +
> > +            if json_data["unified_diff"] == None:
>
>  == None makes little sense - you should just use 'not' in that case. If you
> really want to check that it's None, you should use 'is None'.  However, do you
> really get a null entry in the json output? Isn't it that the "unified_diff"
> section is missing in the json output? In that case, the
> json_data["unified_diff"] would give a KeyError exception...

I do get a null entry in the json output. unified_diff is not missing
but rather set to null. I should use 'is None' then. Thanks!

>
> > +                # Remove the file list because it is not useful, i.e. it only shows
> > +                # which files vary, and nothing more.
> > +                if json_data["details"][0]["source1"] == "file list":
> > +                    json_data["details"].pop(0)
> > +
> > +                # Iterate over details in the diffoscope output.
> > +                for item in json_data["details"]:
> > +                    diff_src = item["source1"]
> > +                    item["package"] = get_package(diff_src)
> > +
> > +                    # In some cases, diffoscope uses multiple commands to get various
> > +                    # diffs. Due to this, it generates a "details" key for those files
> > +                    # instead of just storing the diff in the "unified_diff" key.
> > +                    if item["unified_diff"] == None:
> > +                        for item_details in item["details"]:
> > +                            diff = item_details["unified_diff"].split("\n")
> > +                            split_deltas = split_delta(diff)
> > +                            item_details["added"] = split_deltas[0][:100]
> > +                            item_details["deleted"] = split_deltas[1][:100]
> > +                            cleanup(item_details)
> > +                    else:
> > +                        diff = item["unified_diff"].split("\n")
> > +                        split_deltas = split_delta(diff)
> > +                        item["added"] = split_deltas[0][:100]
> > +                        item["deleted"] = split_deltas[1][:100]
> > +                    cleanup(item)
>
>  This cleanup has nothing to do with getting the failure reason. I think it
> would be better to do it in a separate function (and a separate patch). Also,
> I'm not really happy yet with this diff cleanup, because we loose all context
> information. So maybe, for now, it's better to just use the --max-report-size
> option.
>

You're right, until we find a way to save context its better to use
--max-report-size. What about the issue that it truncates diff output?

> > +                # We currently just set the reason from first non-reproducible package in the
> > +                # dictionary.
> > +                reason = json_data["details"][0]["package"]
> > +
> > +                # If there does exist a unified_diff directly for the .tar images, it is probably
> > +                # a filesystem reproducibility issue.
>
>  This comment should come after the else, or before the condition. Like this, it
> looks like it belongs to the == None condition.
>
>  Maybe it's cleaner to swap the condition and put this filesystem assumption first.
>
Ah yes, it will definitely be cleaner to swap the conditions.

>
>  Regards,
>  Arnout
>
> > +            else:
> > +                reason = ["filesystem"]
> > +
> > +            with open(reproducible_results, "w") as reproduciblef:
> > +                json.dump(json_data, reproduciblef, sort_keys=True, indent=4)
> > +
> > +            return reason
> > +
> >          def get_failure_reason():
> >              # Output is a tuple (package, version), or None.
> >              lastlines = decode_bytes(subprocess.Popen(
> >

Thanks for the review! And sorry for replying a few days late..
Arnout Vandecappelle Sept. 14, 2019, 5:27 p.m. UTC | #6
On 12/09/2019 14:47, Atharva Lele wrote:
> On Sun, Sep 8, 2019 at 10:36 PM Arnout Vandecappelle <arnout@mind.be> wrote:
>>
>>  Hi Atharva,
>>
>>  I'm very sorry that I'm so late reviewing this...
>>
>> On 20/08/2019 16:52, Atharva Lele wrote:
[snip]
>>> +            def cleanup(l):
>>> +                # Takes a list and removes data which is redundant (source2) or data
>>> +                # that might take up too much space (like huge diffs).
>>> +                if "unified_diff" in l:
>>> +                    l.pop("unified_diff")
>>
>>  Condition can be avoided with
>>
>>  l.pop("unified_diff", None)
>>
>> (or maybe that's only python3? I'm not used to legacy python any more :-)
>>
> 
> That is valid for dictionaries, even in python2. However, we are
> passing a list element to cleanup() and list.pop() only takes in 1
> argument i.e. the index.

 Err, but if it is a list, it would give us a TypeError, because list.pop()
takes an index as argument, not a value...

[snip]
>>  This cleanup has nothing to do with getting the failure reason. I think it
>> would be better to do it in a separate function (and a separate patch). Also,
>> I'm not really happy yet with this diff cleanup, because we loose all context
>> information. So maybe, for now, it's better to just use the --max-report-size
>> option.
>>
> 
> You're right, until we find a way to save context its better to use
> --max-report-size. What about the issue that it truncates diff output?

 I've never seen how exactly it truncates. If you have a really large diff, do
you just get - lines? Or is it slightly more intelligent and do you get both -
and + lines?

 Regardless, I don't think the current solution is that much better, so for now
I'd keep using the max-report-size option.



 Regards,
 Arnout

[snip]

Patch
diff mbox series

diff --git a/scripts/autobuild-run b/scripts/autobuild-run
index 99b57dd..384cf73 100755
--- a/scripts/autobuild-run
+++ b/scripts/autobuild-run
@@ -131,6 +131,7 @@  import csv
 import docopt
 import errno
 import hashlib
+import json
 import mmap
 import multiprocessing
 import os
@@ -599,6 +600,93 @@  class Builder:
         if reject_results():
             return
 
+        def get_reproducibility_failure_reason(reproducible_results):
+            def split_delta(delta):
+                # Take a delta and split it into added, deleted lines.
+                added = []
+                deleted = []
+                for line in delta:
+                    if line.startswith("+"):
+                        added.append(line)
+                    if line.startswith("-"):
+                        deleted.append(line)
+                return added, deleted
+
+            def get_package(sourcef):
+                # Returns which package the source file belongs to.
+                with open(packages_file_list, "r") as packagef:
+                    for line in packagef:
+                        if sourcef in line:
+                            package = line.split(',')[0]
+
+                try:
+                    # Get package version
+                    package_info = json.loads(subprocess.check_output(["make", "--no-print-directory",
+                                                                       "O=%s" % self.outputdir,
+                                                                       "-C", self.srcdir,
+                                                                       "%s-show-info" % package]))
+                    if "version" in package_info[package]:
+                        version = package_info[package]["version"]
+                        return [package, version]
+                    else:
+                        return [package]
+                except:
+                    return ["not found"]
+
+            def cleanup(l):
+                # Takes a list and removes data which is redundant (source2) or data
+                # that might take up too much space (like huge diffs).
+                if "unified_diff" in l:
+                    l.pop("unified_diff")
+                if "source2" in l:
+                    l.pop("source2")
+
+            packages_file_list = os.path.join(self.outputdir, "build", "packages-file-list.txt")
+
+            with open(reproducible_results, "r") as reproduciblef:
+                json_data = json.load(reproduciblef)
+
+            if json_data["unified_diff"] == None:
+                # Remove the file list because it is not useful, i.e. it only shows
+                # which files vary, and nothing more.
+                if json_data["details"][0]["source1"] == "file list":
+                    json_data["details"].pop(0)
+
+                # Iterate over details in the diffoscope output.
+                for item in json_data["details"]:
+                    diff_src = item["source1"]
+                    item["package"] = get_package(diff_src)
+
+                    # In some cases, diffoscope uses multiple commands to get various
+                    # diffs. Due to this, it generates a "details" key for those files
+                    # instead of just storing the diff in the "unified_diff" key.
+                    if item["unified_diff"] == None:
+                        for item_details in item["details"]:
+                            diff = item_details["unified_diff"].split("\n")
+                            split_deltas = split_delta(diff)
+                            item_details["added"] = split_deltas[0][:100]
+                            item_details["deleted"] = split_deltas[1][:100]
+                            cleanup(item_details)
+                    else:
+                        diff = item["unified_diff"].split("\n")
+                        split_deltas = split_delta(diff)
+                        item["added"] = split_deltas[0][:100]
+                        item["deleted"] = split_deltas[1][:100]
+                    cleanup(item)
+                # We currently just set the reason from first non-reproducible package in the
+                # dictionary.
+                reason = json_data["details"][0]["package"]
+
+                # If there does exist a unified_diff directly for the .tar images, it is probably
+                # a filesystem reproducibility issue.
+            else:
+                reason = ["filesystem"]
+
+            with open(reproducible_results, "w") as reproduciblef:
+                json.dump(json_data, reproduciblef, sort_keys=True, indent=4)
+
+            return reason
+
         def get_failure_reason():
             # Output is a tuple (package, version), or None.
             lastlines = decode_bytes(subprocess.Popen(