diff mbox series

[v4,5/5] support/scripts/pkgstats: add CPE reporting

Message ID 1525978734-35706-6-git-send-email-matthew.weber@rockwellcollins.com
State Superseded
Headers show
Series CPE ID Support | expand

Commit Message

Matt Weber May 10, 2018, 6:58 p.m. UTC
Pkg status now includes CPE as an item reported in the html
output (stat summary and for each pkg)

Option added to allow analysis of a specific Buildroot target's
'make cpe-info' reports accuracy against CPE database.

Signed-off-by: Matthew Weber <matthew.weber@rockwellcollins.com>
---
Changes
v3 -> v4
 - Collapsed patch 5 and 6 together into this single patch

[Eric
 - added except handling around file io
 - fixed condition where buildroot isn't generating a CPE
   string as part of the infra and output that is the case.
   (eventually these probably could be fixed but there aren't
   many at this point)

[Ricardo
 - fixed patch naming and resolved flake8 issues
 - added except handling to have proper exits
 - cleaned up csv file header skippin
 - condensed partial cve string split
 - updated help txt as suggested
 - reworked output file requirement.  Removed -o as required but
   added check if provided when -c isn't used

v3
 - New patch
---
 support/scripts/pkg-stats | 188 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 170 insertions(+), 18 deletions(-)

Comments

Ricardo Martincoski May 16, 2018, 3:43 a.m. UTC | #1
Hello,

See below:
 - an important typo to fix;
 - some considerations about verbosity;
 - a possible trivial patch to split from this one;
 - some considerations about performance (to possibly run 2x or 4x faster, iff
   we can make some assumptions about the XML).

I hope other people can give more input about the last three items above.

On Thu, May 10, 2018 at 03:58 PM, Matt Weber wrote:

> support/scripts/pkgstats: add CPE reporting

nit: support/scripts/pkg-stats: add CPE reporting

[snip]
> +    def find_partial(self, cpe_str):
> +        print("CPE: Searching for partial [%s]" % cpe_str)

This ...

> +        for cpe in self.all_cpes['cpe-list']['cpe-item']:
> +            if cpe_str in cpe['cpe-23:cpe23-item']['@name']:
> +                return cpe['cpe-23:cpe23-item']['@name']
> +
> +    def find(self, cpe_str):
> +        print("CPE: Searching for [%s]" % cpe_str)

... and this make the script very verbose.

It generates 4000+ lines to stdout when generating the html.
Should we remove it from the final version?
Should we add a -v option for debug?

It seems to me it can be removed.

> +        for cpe in self.all_cpes['cpe-list']['cpe-item']:
> +            if cpe['cpe-23:cpe23-item']['@name'] == cpe_str:
> +                return cpe['cpe-23:cpe23-item']['@name']
> +
> +    def get_cpe_no_version(self, cpe):
> +        return "".join(cpe.split(":")[:5])

typo: it should be ":".join...
otherwise no partial is found for any package.

[snip]
>  def __main__():
>      args = parse_args()
>      if args.npackages and args.packages:
> -        print "ERROR: -n and -p are mutually exclusive"
> +        print("ERROR: -n and -p are mutually exclusive")

Ideally this (and similar changes below) should be a separate patch, early in
the series. Since the new patch would be trivial it would potentially be
reviewed/applied quickly. Also it would be less changes to carry/rework if new
iterations (due to other patches) are needed.

But well... this is the only line that would not be touched by this patch, so I
am OK with this.
Maybe others disagree.
Mentioning in the commit would be good, something like "Take the opportunity to
..."

>          sys.exit(1)
>      if args.packages:
>          package_list = args.packages.split(",")
>      else:
>          package_list = None
> -    print "Build package list ..."
> -    packages = get_pkglist(args.npackages, package_list)
> -    print "Getting package make info ..."
> -    package_init_make_info()
> -    print "Getting package details ..."
> -    for pkg in packages:
> -        pkg.set_infra()
> -        pkg.set_license()
> -        pkg.set_hash_info()
> -        pkg.set_patch_count()
> -        pkg.set_check_package_warnings()
> -        pkg.set_current_version()
> -    print "Calculate stats"
> -    stats = calculate_stats(packages)
> -    print "Write HTML"
> -    dump_html(packages, stats, args.output)
> +    cpe_dict = CPE()
> +    cpe_dict.get_xml_dict()
> +    if args.cpe_report:
> +        print("Performing Target CPE Report Analysis...")
> +        get_target_cpe_report(args.cpe_report, cpe_dict)
> +    elif args.output:

This is not common in other scripts in the tree. All checks between options are
done at the start of __main__.
But having two different code paths is not common either (in the scripts in the
tree), so it seems to me it makes sense here.
Maybe others disagree.

> +        print("Build package list ...")
> +        packages = get_pkglist(args.npackages, package_list)
> +        print("Getting package make info ...")
> +        package_init_make_info()
> +        print("Getting package details ...")
> +        for pkg in packages:
> +            pkg.set_infra()
> +            pkg.set_license()
> +            pkg.set_hash_info()
> +            pkg.set_patch_count()
> +            pkg.set_check_package_warnings()
> +            pkg.set_current_version()
> +            pkg.set_cpe_info(cpe_dict)
> +        print("Calculate stats")
> +        stats = calculate_stats(packages)
> +        print("Write HTML")
> +        dump_html(packages, stats, args.output)
> +    else:
> +        print("Please provide the -o HTML output file arg")
>  
>  
>  __main__()

About the performance:
17 minutes is not incredibly good but it is not terrible for 2000+ packages.

I created 2 *poorly-tested* patches for comparison purposes only. They seem
(again *poorly-tested*) to produce the same html output as v4 when called
without -c.

series v4 + typo fixed:
17 minutes
series v4 + typo fixed + comparison patch 1:
 8 minutes
series v4 + typo fixed + comparison patch 2:
 4 minutes

I am OK with first having the script functional and later, in another patch,
improving its performance while keeping the same functionality.
Maybe others disagree.
Maybe you like any patch below and find the time to fully test that it really
covers all cases and decide to merge to this one.
Maybe you foresee any use for other fields from the XML. In that case, both
comparison patches would not fit, as they assume we will always only need to
know cpe-23:cpe23-item/@name and that there will be always one, and only one,
cpe-23:cpe23-item/@name per cpe-item. Notice: I did not read the specs for the
CPE dictionary, I looked only to this script.


comparison patch 1:
Can we assume we will only need cpe-23:cpe23-item/@name and pre-process?
----->8-----
+++ b/support/scripts/pkg-stats
@@ -561,3 +561,3 @@ def dump_html(packages, stats, output):
 class CPE:
-    all_cpes = dict()
+    all_cpes = list()
 
@@ -569,4 +569,6 @@ class CPE:
             cpe_file = gzip.GzipFile(fileobj=StringIO(compressed_cpe_file.read())).read()
-            print("CPE: Converting xml manifest to dict...")
-            self.all_cpes = xmltodict.parse(cpe_file)
+            print("CPE: Converting xml manifest to list...")
+            raw_dict = xmltodict.parse(cpe_file)
+            for cpe in raw_dict['cpe-list']['cpe-item']:
+                self.all_cpes.append(cpe['cpe-23:cpe23-item']['@name'])
         except urllib2.HTTPError:
@@ -580,5 +582,5 @@ class CPE:
         print("CPE: Searching for partial [%s]" % cpe_str)
-        for cpe in self.all_cpes['cpe-list']['cpe-item']:
-            if cpe_str in cpe['cpe-23:cpe23-item']['@name']:
-                return cpe['cpe-23:cpe23-item']['@name']
+        for cpe in self.all_cpes:
+            if cpe_str in cpe:
+                return cpe
 
@@ -586,5 +588,4 @@ class CPE:
         print("CPE: Searching for [%s]" % cpe_str)
-        for cpe in self.all_cpes['cpe-list']['cpe-item']:
-            if cpe['cpe-23:cpe23-item']['@name'] == cpe_str:
-                return cpe['cpe-23:cpe23-item']['@name']
+        if cpe_str in self.all_cpes:
+            return cpe_str
----->8-----


comparison patch 2:
Note: I usually don't parse XML using Python, so I am not sure how future-proof
is the string that includes the namespace. It seems (from google search results)
people invented few hacks to disable namespaces in ElementTree. xmltodict has
process_namespaces disabled by default.
----->8-----
+++ b/support/scripts/pkg-stats
@@ -27,3 +27,3 @@ import sys
 import urllib2
-import xmltodict
+import xml.etree.ElementTree as ET
 import gzip
@@ -561,3 +561,3 @@ def dump_html(packages, stats, output):
 class CPE:
-    all_cpes = dict()
+    all_cpes = None
 
@@ -569,4 +569,5 @@ class CPE:
             cpe_file = gzip.GzipFile(fileobj=StringIO(compressed_cpe_file.read())).read()
-            print("CPE: Converting xml manifest to dict...")
-            self.all_cpes = xmltodict.parse(cpe_file)
+            print("CPE: Converting xml manifest to list...")
+            tree = ET.fromstring(cpe_file)
+            self.all_cpes = [i.get('name') for i in tree.iter('{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item')]
         except urllib2.HTTPError:
@@ -580,5 +581,5 @@ class CPE:
         print("CPE: Searching for partial [%s]" % cpe_str)
-        for cpe in self.all_cpes['cpe-list']['cpe-item']:
-            if cpe_str in cpe['cpe-23:cpe23-item']['@name']:
-                return cpe['cpe-23:cpe23-item']['@name']
+        for cpe in self.all_cpes:
+            if cpe.startswith(cpe_str):
+                return cpe
 
@@ -586,5 +587,5 @@ class CPE:
         print("CPE: Searching for [%s]" % cpe_str)
-        for cpe in self.all_cpes['cpe-list']['cpe-item']:
-            if cpe['cpe-23:cpe23-item']['@name'] == cpe_str:
-                return cpe['cpe-23:cpe23-item']['@name']
+        for cpe in self.all_cpes:
+            if cpe == cpe_str:
+                return cpe
----->8-----

Regards,
Ricardo
Arnout Vandecappelle May 16, 2018, 11:32 p.m. UTC | #2
Hi Matt, Ricardo,

 I haven't been following things very closely so I may miss something entirely,
however...

On 16-05-18 05:43, Ricardo Martincoski wrote:
> Hello,
> 
> See below:
>  - an important typo to fix;
>  - some considerations about verbosity;
>  - a possible trivial patch to split from this one;
>  - some considerations about performance (to possibly run 2x or 4x faster, iff
>    we can make some assumptions about the XML).

 I thought the performance was completely dominated by the network, but
apparently it isn't?

 Oh, now I see, the set_cpe_info is in fact O(#cpe_packages *
#buildroot_packages), so 17 minutes is just 1ms for the inner loop, which
doesn't sound entirely unreasonable.


> I hope other people can give more input about the last three items above.
> 
> On Thu, May 10, 2018 at 03:58 PM, Matt Weber wrote:
> 
>> support/scripts/pkgstats: add CPE reporting
> 
> nit: support/scripts/pkg-stats: add CPE reporting
> 
> [snip]
>> +    def find_partial(self, cpe_str):
>> +        print("CPE: Searching for partial [%s]" % cpe_str)
> 
> This ...
> 
>> +        for cpe in self.all_cpes['cpe-list']['cpe-item']:
>> +            if cpe_str in cpe['cpe-23:cpe23-item']['@name']:
>> +                return cpe['cpe-23:cpe23-item']['@name']
>> +
>> +    def find(self, cpe_str):
>> +        print("CPE: Searching for [%s]" % cpe_str)
> 
> ... and this make the script very verbose.
> 
> It generates 4000+ lines to stdout when generating the html.
> Should we remove it from the final version?
> Should we add a -v option for debug?
> 
> It seems to me it can be removed.

 Well, if it takes 17 minutes, it's nice to see some progress... But if you can
reduce it to 4 minutes the need isn't there so much.

[snip]

>> +    cpe_dict = CPE()
>> +    cpe_dict.get_xml_dict()
>> +    if args.cpe_report:
>> +        print("Performing Target CPE Report Analysis...")
>> +        get_target_cpe_report(args.cpe_report, cpe_dict)
>> +    elif args.output:
> 
> This is not common in other scripts in the tree. All checks between options are
> done at the start of __main__.
> But having two different code paths is not common either (in the scripts in the
> tree), so it seems to me it makes sense here.
> Maybe others disagree.

 The way I see it, there is almost no commonality between the script with the -c
option and the normal pkg-stats, so it makes no sense to have them in the same
script. I would split off the CPE class into a separate module that would be
imported into this script and a new cpe_report script.

> 
>> +        print("Build package list ...")
>> +        packages = get_pkglist(args.npackages, package_list)
>> +        print("Getting package make info ...")
>> +        package_init_make_info()
>> +        print("Getting package details ...")
>> +        for pkg in packages:
>> +            pkg.set_infra()
>> +            pkg.set_license()
>> +            pkg.set_hash_info()
>> +            pkg.set_patch_count()
>> +            pkg.set_check_package_warnings()
>> +            pkg.set_current_version()
>> +            pkg.set_cpe_info(cpe_dict)
>> +        print("Calculate stats")
>> +        stats = calculate_stats(packages)
>> +        print("Write HTML")
>> +        dump_html(packages, stats, args.output)
>> +    else:
>> +        print("Please provide the -o HTML output file arg")
>>  
>>  
>>  __main__()
> 
> About the performance:
> 17 minutes is not incredibly good but it is not terrible for 2000+ packages.

 IMO it is terrible...

> 
> I created 2 *poorly-tested* patches for comparison purposes only. They seem
> (again *poorly-tested*) to produce the same html output as v4 when called
> without -c.
> 
> series v4 + typo fixed:
> 17 minutes
> series v4 + typo fixed + comparison patch 1:
>  8 minutes
> series v4 + typo fixed + comparison patch 2:
>  4 minutes
> 
> I am OK with first having the script functional and later, in another patch,
> improving its performance while keeping the same functionality.
> Maybe others disagree.
> Maybe you like any patch below and find the time to fully test that it really
> covers all cases and decide to merge to this one.
> Maybe you foresee any use for other fields from the XML. In that case, both
> comparison patches would not fit, as they assume we will always only need to
> know cpe-23:cpe23-item/@name and that there will be always one, and only one,
> cpe-23:cpe23-item/@name per cpe-item. Notice: I did not read the specs for the
> CPE dictionary, I looked only to this script.
> 
> 
> comparison patch 1:
> Can we assume we will only need cpe-23:cpe23-item/@name and pre-process?

 We can easily keep both the dict and the list.

> ----->8-----
> +++ b/support/scripts/pkg-stats
> @@ -561,3 +561,3 @@ def dump_html(packages, stats, output):
>  class CPE:
> -    all_cpes = dict()
> +    all_cpes = list()

 This (both the dict() and the list()) is a bit unconventional. If two instances
of class CPE are created, they will refer to the same dict/list...  Normally
members are set in the __init__ constructor. Was this intentional?

 Personally, I'm a fan of the attrs module, which allows you to do this very
elegantly. Unfortunately, attrs is not in the standard library.

>  
> @@ -569,4 +569,6 @@ class CPE:
>              cpe_file = gzip.GzipFile(fileobj=StringIO(compressed_cpe_file.read())).read()
> -            print("CPE: Converting xml manifest to dict...")
> -            self.all_cpes = xmltodict.parse(cpe_file)
> +            print("CPE: Converting xml manifest to list...")
> +            raw_dict = xmltodict.parse(cpe_file)
> +            for cpe in raw_dict['cpe-list']['cpe-item']:
> +                self.all_cpes.append(cpe['cpe-23:cpe23-item']['@name'])
>          except urllib2.HTTPError:
> @@ -580,5 +582,5 @@ class CPE:
>          print("CPE: Searching for partial [%s]" % cpe_str)
> -        for cpe in self.all_cpes['cpe-list']['cpe-item']:
> -            if cpe_str in cpe['cpe-23:cpe23-item']['@name']:
> -                return cpe['cpe-23:cpe23-item']['@name']
> +        for cpe in self.all_cpes:
> +            if cpe_str in cpe:
> +                return cpe
>  
> @@ -586,5 +588,4 @@ class CPE:
>          print("CPE: Searching for [%s]" % cpe_str)
> -        for cpe in self.all_cpes['cpe-list']['cpe-item']:
> -            if cpe['cpe-23:cpe23-item']['@name'] == cpe_str:
> -                return cpe['cpe-23:cpe23-item']['@name']
> +        if cpe_str in self.all_cpes:
> +            return cpe_str
> ----->8-----
> 
> 
> comparison patch 2:
> Note: I usually don't parse XML using Python, so I am not sure how future-proof
> is the string that includes the namespace. It seems (from google search results)
> people invented few hacks to disable namespaces in ElementTree. xmltodict has
> process_namespaces disabled by default.
> ----->8-----
> +++ b/support/scripts/pkg-stats
> @@ -27,3 +27,3 @@ import sys
>  import urllib2
> -import xmltodict
> +import xml.etree.ElementTree as ET
>  import gzip
> @@ -561,3 +561,3 @@ def dump_html(packages, stats, output):
>  class CPE:
> -    all_cpes = dict()
> +    all_cpes = None

 This should still be list() IMO, so that it has a consistent interface also in
case of e.g. HTTPError.

>  
> @@ -569,4 +569,5 @@ class CPE:
>              cpe_file = gzip.GzipFile(fileobj=StringIO(compressed_cpe_file.read())).read()
> -            print("CPE: Converting xml manifest to dict...")
> -            self.all_cpes = xmltodict.parse(cpe_file)
> +            print("CPE: Converting xml manifest to list...")
> +            tree = ET.fromstring(cpe_file)
> +            self.all_cpes = [i.get('name') for i in tree.iter('{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item')]

 So after this you get basically the same as after comparison patch 1, right? So
the xmltodict takes 4 minutes? Or am I missing something?

 Oh, actually, the [... for ... iter...] is also more efficient than
for...: append() so that could be an effect here as well. But this part of the
code is only O(#cpe packages) so it shouldn't have that much impact.

>          except urllib2.HTTPError:package
> @@ -580,5 +581,5 @@ class CPE:
>          print("CPE: Searching for partial [%s]" % cpe_str)
> -        for cpe in self.all_cpes['cpe-list']['cpe-item']:
> -            if cpe_str in cpe['cpe-23:cpe23-item']['@name']:
> -                return cpe['cpe-23:cpe23-item']['@name']
> +        for cpe in self.all_cpes:
> +            if cpe.startswith(cpe_str):

 Originally it was 'in' instead of startswith(). Obviously startswith() will be
more efficient. And also more correct, I guess, or does the partial match not
need to be anchored at the beginning of the string? Well, since it always starts
with 'cpe:2.3:a:' a match somewhere in the middle seems unlikely...

> +                return cpe
>  
> @@ -586,5 +587,5 @@ class CPE:
>          print("CPE: Searching for [%s]" % cpe_str)
> -        for cpe in self.all_cpes['cpe-list']['cpe-item']:
> -            if cpe['cpe-23:cpe23-item']['@name'] == cpe_str:
> -                return cpe['cpe-23:cpe23-item']['@name']
> +        for cpe in self.all_cpes:
> +            if cpe == cpe_str:
> +                return cpe

 Looks to me like the lookup time can be cut in half again by merging the two
find functions:

    def find(self, cpe_str):
        partial_matches = []
        partial_cpe_str = self.get_cpe_no_version(cpe_str)
        for cpe in self.all_cpes:
            if cpe == cpe_str:
                return self.get_nvd_url(cpe_str)
            elif cpe.startswith(partial_cpe_str):
                partial_matches.append(cpe)
        if partial_matches:
            return 'Updated' # Or we could return the 'best' version?

 Although... that probably doesn't make much of a difference.

 What would probably make a difference, however, is to make two sets of cpe strings:

   all_cpes = set()
   all_cpes_no_version = set()

...
           print("CPE: Converting xml manifest to dict...")
           for cpe in raw_dict['cpe-list']['cpe-item']:
               cpe_str = cpe['cpe-23:cpe23-item']['@name']
               cpe_str_no_version = self.get_cpe_no_version(cpe_str)
               self.all_cpes.add(cpe_str)
               self.all_cpes_no_version.add(cpe_str_no_version)
...
    def find(self, cpe_str):
        if cpe_str in self.all_cpes:
            return self.get_cpe_no_version(cpe_str)
        cpe_str_no_version = self.get_cpe_no_version(cpe_str)
        if cpe_str_no_version in self.all_cpes_no_version:
            return 'Updated'

 Regards,
 Arnout

> ----->8-----
> 
> Regards,
> Ricardo
> 
> 
> 
> _______________________________________________
> buildroot mailing list
> buildroot@busybox.net
> http://lists.busybox.net/mailman/listinfo/buildroot
>
Arnout Vandecappelle May 16, 2018, 11:34 p.m. UTC | #3
On 10-05-18 20:58, Matt Weber wrote:
> +
> +        if pkgvar == "LINUX":
> +            Package.all_cpe_id[pkgvar] = "cpe:2.3:o:" + value + ":*:*:*:*:*:*:*"
> +        elif pkgvar == "LINUX_HEADERS":
> +            Package.all_cpe_id[pkgvar] = "cpe:2.3:o:" + value + ":*:*:*:*:*:*:*"

 A bit more idiomatic:

        if pkgvar in ("LINUX", "LINUX_HEADERS"):
            Package.all_cpe_id[pkgvar] = "cpe:2.3:o:" + value + ":*:*:*:*:*:*:*"

 Regards,
 Arnout

> +        else:
> +            Package.all_cpe_id[pkgvar] = "cpe:2.3:a:" + value + ":*:*:*:*:*:*:*"
Matt Weber May 17, 2018, 1:42 a.m. UTC | #4
Arnout, Ricardo,

On Wed, May 16, 2018 at 6:32 PM, Arnout Vandecappelle <arnout@mind.be> wrote:
>
>  Hi Matt, Ricardo,
>
>  I haven't been following things very closely so I may miss something entirely,
> however...
>
> On 16-05-18 05:43, Ricardo Martincoski wrote:
> > Hello,
> >
> > See below:
> >  - an important typo to fix;
> >  - some considerations about verbosity;
> >  - a possible trivial patch to split from this one;
> >  - some considerations about performance (to possibly run 2x or 4x faster, iff
> >    we can make some assumptions about the XML).

Ricardo, thanks for going through v4 and all the great feedback.  As
Arnout responded to your email, I'll add on below.


> >> +    def find(self, cpe_str):
> >> +        print("CPE: Searching for [%s]" % cpe_str)
> >
> > ... and this make the script very verbose.
> >
> > It generates 4000+ lines to stdout when generating the html.
> > Should we remove it from the final version?
> > Should we add a -v option for debug?
> >
> > It seems to me it can be removed.
>
>  Well, if it takes 17 minutes, it's nice to see some progress... But if you can
> reduce it to 4 minutes the need isn't there so much.

I believe the verbosity can be removed/reduced.  Both CPE activities
end with a report output of some sort.

> >> +    cpe_dict = CPE()
> >> +    cpe_dict.get_xml_dict()
> >> +    if args.cpe_report:
> >> +        print("Performing Target CPE Report Analysis...")
> >> +        get_target_cpe_report(args.cpe_report, cpe_dict)
> >> +    elif args.output:
> >
> > This is not common in other scripts in the tree. All checks between options are
> > done at the start of __main__.
> > But having two different code paths is not common either (in the scripts in the
> > tree), so it seems to me it makes sense here.
> > Maybe others disagree.
>
>  The way I see it, there is almost no commonality between the script with the -c
> option and the normal pkg-stats, so it makes no sense to have them in the same
> script. I would split off the CPE class into a separate module that would be
> imported into this script and a new cpe_report script.

I debated this.  Ricardo/Arnout, what naming and directory scheme
would you propose?

> > About the performance:
> > 17 minutes is not incredibly good but it is not terrible for 2000+ packages.
>
>  IMO it is terrible...

It's going to get longer once we have the "updated pkg available"
checking merged, so I'm up for optimizing now.

>
> >
> > I created 2 *poorly-tested* patches for comparison purposes only. They seem
> > (again *poorly-tested*) to produce the same html output as v4 when called
> > without -c.
> >
> > series v4 + typo fixed:
> > 17 minutes
> > series v4 + typo fixed + comparison patch 1:
> >  8 minutes
> > series v4 + typo fixed + comparison patch 2:
> >  4 minutes
> >
> > I am OK with first having the script functional and later, in another patch,
> > improving its performance while keeping the same functionality.
> > Maybe others disagree.
> > Maybe you like any patch below and find the time to fully test that it really
> > covers all cases and decide to merge to this one.
> > Maybe you foresee any use for other fields from the XML. In that case, both
> > comparison patches would not fit, as they assume we will always only need to
> > know cpe-23:cpe23-item/@name and that there will be always one, and only one,
> > cpe-23:cpe23-item/@name per cpe-item. Notice: I did not read the specs for the
> > CPE dictionary, I looked only to this script.
> >
> >
> > comparison patch 1:
> > Can we assume we will only need cpe-23:cpe23-item/@name and pre-process?
>
>  We can easily keep both the dict and the list.

For the validity check we'd only use the name entry but for building
the "update xml file" to make changes to an existing CPE (patchset
being developed now), we mine some additional detail out of the dict.
Like mentioned below we could build two sets that are pre-processed
and still have the dict available.

>
> > ----->8-----
> > +++ b/support/scripts/pkg-stats
> > @@ -561,3 +561,3 @@ def dump_html(packages, stats, output):
> >  class CPE:
> > -    all_cpes = dict()
> > +    all_cpes = list()
>
>  This (both the dict() and the list()) is a bit unconventional. If two instances
> of class CPE are created, they will refer to the same dict/list...  Normally
> members are set in the __init__ constructor. Was this intentional?

Nope, just maturity of my python knowledge.  How would that work
instead?  I can only see one instance of the class for the current use
case.  However those variables could be shared between instances, we'd
just need to manage if they're valid.

> >
> > comparison patch 2:
> > Note: I usually don't parse XML using Python, so I am not sure how future-proof
> > is the string that includes the namespace. It seems (from google search results)
> > people invented few hacks to disable namespaces in ElementTree. xmltodict has
> > process_namespaces disabled by default.
> > ----->8-----
[snip]
> >          except urllib2.HTTPError:package
> > @@ -580,5 +581,5 @@ class CPE:
> >          print("CPE: Searching for partial [%s]" % cpe_str)
> > -        for cpe in self.all_cpes['cpe-list']['cpe-item']:
> > -            if cpe_str in cpe['cpe-23:cpe23-item']['@name']:
> > -                return cpe['cpe-23:cpe23-item']['@name']
> > +        for cpe in self.all_cpes:
> > +            if cpe.startswith(cpe_str):
>
>  Originally it was 'in' instead of startswith(). Obviously startswith() will be
> more efficient. And also more correct, I guess, or does the partial match not
> need to be anchored at the beginning of the string? Well, since it always starts
> with 'cpe:2.3:a:' a match somewhere in the middle seems unlikely...

startswith() will work.

>
> > +                return cpe
> >
> > @@ -586,5 +587,5 @@ class CPE:
> >          print("CPE: Searching for [%s]" % cpe_str)
> > -        for cpe in self.all_cpes['cpe-list']['cpe-item']:
> > -            if cpe['cpe-23:cpe23-item']['@name'] == cpe_str:
> > -                return cpe['cpe-23:cpe23-item']['@name']
> > +        for cpe in self.all_cpes:
> > +            if cpe == cpe_str:
> > +                return cpe
>
>  Looks to me like the lookup time can be cut in half again by merging the two
> find functions:
>
>     def find(self, cpe_str):
>         partial_matches = []
>         partial_cpe_str = self.get_cpe_no_version(cpe_str)
>         for cpe in self.all_cpes:
>             if cpe == cpe_str:
>                 return self.get_nvd_url(cpe_str)
>             elif cpe.startswith(partial_cpe_str):
>                 partial_matches.append(cpe)
>         if partial_matches:
>             return 'Updated' # Or we could return the 'best' version?
>
>  Although... that probably doesn't make much of a difference.
>
>  What would probably make a difference, however, is to make two sets of cpe strings:
>
>    all_cpes = set()
>    all_cpes_no_version = set()
>
> ...
>            print("CPE: Converting xml manifest to dict...")
>            for cpe in raw_dict['cpe-list']['cpe-item']:
>                cpe_str = cpe['cpe-23:cpe23-item']['@name']
>                cpe_str_no_version = self.get_cpe_no_version(cpe_str)
>                self.all_cpes.add(cpe_str)
>                self.all_cpes_no_version.add(cpe_str_no_version)
> ...
>     def find(self, cpe_str):
>         if cpe_str in self.all_cpes:
>             return self.get_cpe_no_version(cpe_str)
>         cpe_str_no_version = self.get_cpe_no_version(cpe_str)
>         if cpe_str_no_version in self.all_cpes_no_version:
>             return 'Updated'
>

Agree, the pre-processed sets would be cleaner and more efficient.

I think I have enough good examples, should I make an attempt at
another version based on using sets and other patch#2 suggestions?

Matt
Matt Weber May 17, 2018, 1:42 a.m. UTC | #5
Arnout,

On Wed, May 16, 2018 at 6:34 PM, Arnout Vandecappelle <arnout@mind.be> wrote:
>
>
> On 10-05-18 20:58, Matt Weber wrote:
>> +
>> +        if pkgvar == "LINUX":
>> +            Package.all_cpe_id[pkgvar] = "cpe:2.3:o:" + value + ":*:*:*:*:*:*:*"
>> +        elif pkgvar == "LINUX_HEADERS":
>> +            Package.all_cpe_id[pkgvar] = "cpe:2.3:o:" + value + ":*:*:*:*:*:*:*"
>
>  A bit more idiomatic:
>
>         if pkgvar in ("LINUX", "LINUX_HEADERS"):
>             Package.all_cpe_id[pkgvar] = "cpe:2.3:o:" + value + ":*:*:*:*:*:*:*"
>

Noted, will update.

Matt
Ricardo Martincoski May 18, 2018, 3:07 a.m. UTC | #6
Hello,

On Wed, May 16, 2018 at 08:32 PM, Arnout Vandecappelle wrote:
> On 16-05-18 05:43, Ricardo Martincoski wrote:

[snip]
>> @@ -569,4 +569,5 @@ class CPE:
>>              cpe_file = gzip.GzipFile(fileobj=StringIO(compressed_cpe_file.read())).read()
>> -            print("CPE: Converting xml manifest to dict...")
>> -            self.all_cpes = xmltodict.parse(cpe_file)
>> +            print("CPE: Converting xml manifest to list...")
>> +            tree = ET.fromstring(cpe_file)
>> +            self.all_cpes = [i.get('name') for i in tree.iter('{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item')]
> 
>  So after this you get basically the same as after comparison patch 1, right? So
> the xmltodict takes 4 minutes? Or am I missing something?

No. I missed something important and jumped to wrong conclusions.

After adding some simple instrumentation code to display relative timestamps,
the main difference in performance is *not* related to the xml parser used but
it is related to the code used for find() and find_partial().

I didn't performed further testings but it seems related to the use of
startswith as you said.

patch 1:
0:00:00.001015 CPE: Fetching xml manifest...
0:00:03.924777 CPE: Unzipping xml manifest...
0:00:11.672462 CPE: Converting xml manifest to list...
0:00:11.672504 before xmltodict.parse
0:00:36.343417 before append
0:00:36.462400 list created
0:00:36.738042 Build package list ...
0:00:36.875742 Getting package make info ...
0:00:58.543116 Getting package details ...
0:01:00.016925 BR Infra Not building CPE for pkg: [UBOOT]
0:01:07.714094 BR Infra Not building CPE for pkg: [IMX_USB_LOADER]
...
0:08:00.615649 BR Infra Not building CPE for pkg: [INTLTOOL]
0:08:01.243667 BR Infra Not building CPE for pkg: [DOXYGEN]
0:08:02.035463 Calculate stats
0:08:02.042401 Write HTML

patch 2:
0:00:00.000889 CPE: Fetching xml manifest...
0:00:03.640856 CPE: Unzipping xml manifest...
0:00:14.569496 CPE: Converting xml manifest to list...
0:00:14.569541 before ET.fromstring
0:00:21.325842 before list comprehension
0:00:21.609946 list created
0:00:21.612443 Build package list ...
0:00:21.754223 Getting package make info ...
0:00:43.111196 Getting package details ...
0:00:43.828047 BR Infra Not building CPE for pkg: [UBOOT]
0:00:47.125995 BR Infra Not building CPE for pkg: [IMX_USB_LOADER]
...
0:03:46.279893 BR Infra Not building CPE for pkg: [INTLTOOL]
0:03:46.571266 BR Infra Not building CPE for pkg: [DOXYGEN]
0:03:46.892839 Calculate stats
0:03:46.895765 Write HTML

>  Oh, actually, the [... for ... iter...] is also more efficient than
> for...: append() so that could be an effect here as well. But this part of the
> code is only O(#cpe packages) so it shouldn't have that much impact.
> 
>>          except urllib2.HTTPError:package
>> @@ -580,5 +581,5 @@ class CPE:
>>          print("CPE: Searching for partial [%s]" % cpe_str)
>> -        for cpe in self.all_cpes['cpe-list']['cpe-item']:
>> -            if cpe_str in cpe['cpe-23:cpe23-item']['@name']:
>> -                return cpe['cpe-23:cpe23-item']['@name']
>> +        for cpe in self.all_cpes:
>> +            if cpe.startswith(cpe_str):
> 
>  Originally it was 'in' instead of startswith(). Obviously startswith() will be
> more efficient. And also more correct, I guess, or does the partial match not
[snip]

Regards,
Ricardo
Ricardo Martincoski May 18, 2018, 3:16 a.m. UTC | #7
Hello,

On Wed, May 16, 2018 at 10:42 PM, Matthew Weber wrote:
> On Wed, May 16, 2018 at 6:32 PM, Arnout Vandecappelle wrote:
>> On 16-05-18 05:43, Ricardo Martincoski wrote:

[snip]
>> >> +    cpe_dict = CPE()
>> >> +    cpe_dict.get_xml_dict()
>> >> +    if args.cpe_report:
>> >> +        print("Performing Target CPE Report Analysis...")
>> >> +        get_target_cpe_report(args.cpe_report, cpe_dict)
>> >> +    elif args.output:
>> >
>> > This is not common in other scripts in the tree. All checks between options are
>> > done at the start of __main__.
>> > But having two different code paths is not common either (in the scripts in the
>> > tree), so it seems to me it makes sense here.
>> > Maybe others disagree.
>>
>>  The way I see it, there is almost no commonality between the script with the -c
>> option and the normal pkg-stats, so it makes no sense to have them in the same
>> script. I would split off the CPE class into a separate module that would be
>> imported into this script and a new cpe_report script.
> 
> I debated this.  Ricardo/Arnout, what naming and directory scheme
> would you propose?

It seems you need a single shared module, so inspired by:

getdeveloperlib.py
get-developers -> import getdeveloperlib

you could have:

cpelib.py
pkg-stats -> import cpelib
cpe-report -> import cpelib

But cpe-report seems to be a user-facing script, so it would go to the utils/
directory. I am not sure what is the best way to solve the import for:

support/scripts/cpelib.py
support/scripts/pkg-stats -> import cpelib
utils/cpe-report -> import cpelib

Maybe a symlink utils/cpelib.py -> support/scripts/cpelib.py ? (not tested)
I never tried importing using relative paths without having __init__.py.


Regards,
Ricardo
Matt Weber May 18, 2018, 3:18 a.m. UTC | #8
Ricardo,

On Thu, May 17, 2018 at 10:07 PM, Ricardo Martincoski
<ricardo.martincoski@gmail.com> wrote:
>
> Hello,
>
> On Wed, May 16, 2018 at 08:32 PM, Arnout Vandecappelle wrote:
> > On 16-05-18 05:43, Ricardo Martincoski wrote:
>
> [snip]
> >> @@ -569,4 +569,5 @@ class CPE:
> >>              cpe_file = gzip.GzipFile(fileobj=StringIO(compressed_cpe_file.read())).read()
> >> -            print("CPE: Converting xml manifest to dict...")
> >> -            self.all_cpes = xmltodict.parse(cpe_file)
> >> +            print("CPE: Converting xml manifest to list...")
> >> +            tree = ET.fromstring(cpe_file)
> >> +            self.all_cpes = [i.get('name') for i in tree.iter('{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item')]
> >
> >  So after this you get basically the same as after comparison patch 1, right? So
> > the xmltodict takes 4 minutes? Or am I missing something?
>
> No. I missed something important and jumped to wrong conclusions.
>
> After adding some simple instrumentation code to display relative timestamps,
> the main difference in performance is *not* related to the xml parser used but
> it is related to the code used for find() and find_partial().
>
> I didn't performed further testings but it seems related to the use of
> startswith as you said.
>
> patch 1:
> 0:00:00.001015 CPE: Fetching xml manifest...
> 0:00:03.924777 CPE: Unzipping xml manifest...
> 0:00:11.672462 CPE: Converting xml manifest to list...
> 0:00:11.672504 before xmltodict.parse
> 0:00:36.343417 before append
> 0:00:36.462400 list created
> 0:00:36.738042 Build package list ...
> 0:00:36.875742 Getting package make info ...
> 0:00:58.543116 Getting package details ...
> 0:01:00.016925 BR Infra Not building CPE for pkg: [UBOOT]
> 0:01:07.714094 BR Infra Not building CPE for pkg: [IMX_USB_LOADER]
> ...
> 0:08:00.615649 BR Infra Not building CPE for pkg: [INTLTOOL]
> 0:08:01.243667 BR Infra Not building CPE for pkg: [DOXYGEN]
> 0:08:02.035463 Calculate stats
> 0:08:02.042401 Write HTML
>
> patch 2:
> 0:00:00.000889 CPE: Fetching xml manifest...
> 0:00:03.640856 CPE: Unzipping xml manifest...
> 0:00:14.569496 CPE: Converting xml manifest to list...
> 0:00:14.569541 before ET.fromstring
> 0:00:21.325842 before list comprehension
> 0:00:21.609946 list created
> 0:00:21.612443 Build package list ...
> 0:00:21.754223 Getting package make info ...
> 0:00:43.111196 Getting package details ...
> 0:00:43.828047 BR Infra Not building CPE for pkg: [UBOOT]
> 0:00:47.125995 BR Infra Not building CPE for pkg: [IMX_USB_LOADER]
> ...
> 0:03:46.279893 BR Infra Not building CPE for pkg: [INTLTOOL]
> 0:03:46.571266 BR Infra Not building CPE for pkg: [DOXYGEN]
> 0:03:46.892839 Calculate stats
> 0:03:46.895765 Write HTML
>
> >  Oh, actually, the [... for ... iter...] is also more efficient than
> > for...: append() so that could be an effect here as well. But this part of the
> > code is only O(#cpe packages) so it shouldn't have that much impact.
> >
> >>          except urllib2.HTTPError:package
> >> @@ -580,5 +581,5 @@ class CPE:
> >>          print("CPE: Searching for partial [%s]" % cpe_str)
> >> -        for cpe in self.all_cpes['cpe-list']['cpe-item']:
> >> -            if cpe_str in cpe['cpe-23:cpe23-item']['@name']:
> >> -                return cpe['cpe-23:cpe23-item']['@name']
> >> +        for cpe in self.all_cpes:
> >> +            if cpe.startswith(cpe_str):
> >
> >  Originally it was 'in' instead of startswith(). Obviously startswith() will be
> > more efficient. And also more correct, I guess, or does the partial match not

I just sent a revised v5 that's a hybrid of feedback so far.
http://patchwork.ozlabs.org/project/buildroot/list/?series=45124

Optimization wise, I did not switch to using .startswith however would
need to look now traversing a set with a for loop and using startswith
would be more efficient then doing a if "in" check.

Matt
Matt Weber May 18, 2018, 3:21 a.m. UTC | #9
Ricardo,

On Thu, May 17, 2018 at 10:16 PM, Ricardo Martincoski
<ricardo.martincoski@gmail.com> wrote:
> Hello,
>
> On Wed, May 16, 2018 at 10:42 PM, Matthew Weber wrote:
>> On Wed, May 16, 2018 at 6:32 PM, Arnout Vandecappelle wrote:
>>> On 16-05-18 05:43, Ricardo Martincoski wrote:
>
> [snip]
>>> >> +    cpe_dict = CPE()
>>> >> +    cpe_dict.get_xml_dict()
>>> >> +    if args.cpe_report:
>>> >> +        print("Performing Target CPE Report Analysis...")
>>> >> +        get_target_cpe_report(args.cpe_report, cpe_dict)
>>> >> +    elif args.output:
>>> >
>>> > This is not common in other scripts in the tree. All checks between options are
>>> > done at the start of __main__.
>>> > But having two different code paths is not common either (in the scripts in the
>>> > tree), so it seems to me it makes sense here.
>>> > Maybe others disagree.
>>>
>>>  The way I see it, there is almost no commonality between the script with the -c
>>> option and the normal pkg-stats, so it makes no sense to have them in the same
>>> script. I would split off the CPE class into a separate module that would be
>>> imported into this script and a new cpe_report script.
>>
>> I debated this.  Ricardo/Arnout, what naming and directory scheme
>> would you propose?
>
> It seems you need a single shared module, so inspired by:
>
> getdeveloperlib.py
> get-developers -> import getdeveloperlib
>
> you could have:
>
> cpelib.py
> pkg-stats -> import cpelib
> cpe-report -> import cpelib
>
> But cpe-report seems to be a user-facing script, so it would go to the utils/
> directory. I am not sure what is the best way to solve the import for:
>
> support/scripts/cpelib.py
> support/scripts/pkg-stats -> import cpelib
> utils/cpe-report -> import cpelib
>
> Maybe a symlink utils/cpelib.py -> support/scripts/cpelib.py ? (not tested)
> I never tried importing using relative paths without having __init__.py.
>
>

I caught Arnout on IRC today and my v5 patchset reflects his
suggestion.  Ended up calling the class/module file
support/scripts/cpedb.py and support/scripts/cpe-report for the
checker script.  I do see what you mean about a user facing script.
What do others think?

Matt
Ricardo Martincoski May 18, 2018, 3:44 a.m. UTC | #10
Hello,

On Fri, May 18, 2018 at 12:21 AM, Matthew Weber wrote:

> I caught Arnout on IRC today and my v5 patchset reflects his
> suggestion.  Ended up calling the class/module file
> support/scripts/cpedb.py and support/scripts/cpe-report for the

Good names IMO.

Just for reference, in the same machine with v5 takes 3m13,630s to run pkg-stats.
In the current master: 1m40,318s

> checker script.  I do see what you mean about a user facing script.
> What do others think?

Let's wait for more input.

But if for any reason you send a v6, I would suggest to:
Merge patch 7 back to 5.
Split patch 6 into one changing print to print(), and another one
adding CPE report.

Regards,
Ricardo
Matt Weber May 18, 2018, 1:07 p.m. UTC | #11
Ricardo,

On Thu, May 17, 2018 at 10:44 PM, Ricardo Martincoski
<ricardo.martincoski@gmail.com> wrote:
> Hello,
>
> On Fri, May 18, 2018 at 12:21 AM, Matthew Weber wrote:
>
>> I caught Arnout on IRC today and my v5 patchset reflects his
>> suggestion.  Ended up calling the class/module file
>> support/scripts/cpedb.py and support/scripts/cpe-report for the
>
> Good names IMO.
>
> Just for reference, in the same machine with v5 takes 3m13,630s to run pkg-stats.
> In the current master: 1m40,318s
>
>> checker script.  I do see what you mean about a user facing script.
>> What do others think?
>
> Let's wait for more input.
>
> But if for any reason you send a v6, I would suggest to:
> Merge patch 7 back to 5.
> Split patch 6 into one changing print to print(), and another one
> adding CPE report.
>

I just split the  print to print() out into a seperate
http://patchwork.ozlabs.org/patch/916300/.

I'll respin v6 after more feedback and fold the cpe-report in with the
cpedb (7 into 5).

Matt
diff mbox series

Patch

diff --git a/support/scripts/pkg-stats b/support/scripts/pkg-stats
index 43f7e8d..c28d397 100755
--- a/support/scripts/pkg-stats
+++ b/support/scripts/pkg-stats
@@ -24,14 +24,22 @@  from collections import defaultdict
 import re
 import subprocess
 import sys
+import urllib2
+import xmltodict
+import gzip
+from StringIO import StringIO
+import csv
 
 INFRA_RE = re.compile("\$\(eval \$\(([a-z-]*)-package\)\)")
 
+CPE_XML_URL = "https://static.nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.gz"
+
 
 class Package:
     all_licenses = list()
     all_license_files = list()
     all_versions = dict()
+    all_cpe_id = dict()
 
     def __init__(self, name, path):
         self.name = name
@@ -43,6 +51,8 @@  class Package:
         self.patch_count = 0
         self.warnings = 0
         self.current_version = None
+        self.cpe_id = None
+        self.has_cpe = False
 
     def pkgvar(self):
         return self.name.upper().replace("-", "_")
@@ -116,6 +126,25 @@  class Package:
                 self.warnings = int(m.group(1))
                 return
 
+    def set_cpe_info(self, cpe_dict):
+        """
+        Fills in the .has_cpe field
+        """
+        var = self.pkgvar()
+        if var in self.all_cpe_id:
+            self.cpe_id = self.all_cpe_id[var]
+        if self.cpe_id is None:
+            print("BR Infra Not building CPE for pkg: [%s]" % var)
+            return
+        result = cpe_dict.find(self.cpe_id)
+        if not result:
+            result = cpe_dict.find_partial(cpe_dict.get_cpe_no_version(self.cpe_id))
+            if result:
+                self.has_cpe = "Update"
+            # Unset case for has_cpe is assumed missing/does not exist
+        else:
+            self.has_cpe = cpe_dict.get_nvd_url(self.cpe_id)
+
     def __eq__(self, other):
         return self.path == other.path
 
@@ -254,6 +283,23 @@  def package_init_make_info():
 
         Package.all_versions[pkgvar] = value
 
+    # CPE ID
+    o = subprocess.check_output(["make", "BR2_HAVE_DOT_CONFIG=y",
+                                 "-s", "printvars", "VARS=%_CPE_ID"])
+    for l in o.splitlines():
+        # Get variable name and value
+        pkgvar, value = l.split("=")
+
+        # Strip _CPE_ID
+        pkgvar = pkgvar[:-7]
+
+        if pkgvar == "LINUX":
+            Package.all_cpe_id[pkgvar] = "cpe:2.3:o:" + value + ":*:*:*:*:*:*:*"
+        elif pkgvar == "LINUX_HEADERS":
+            Package.all_cpe_id[pkgvar] = "cpe:2.3:o:" + value + ":*:*:*:*:*:*:*"
+        else:
+            Package.all_cpe_id[pkgvar] = "cpe:2.3:a:" + value + ":*:*:*:*:*:*:*"
+
 
 def calculate_stats(packages):
     stats = defaultdict(int)
@@ -279,6 +325,12 @@  def calculate_stats(packages):
             stats["hash"] += 1
         else:
             stats["no-hash"] += 1
+        if pkg.has_cpe == "Update":
+            stats["update-cpe"] += 1
+        elif pkg.has_cpe:
+            stats["cpe"] += 1
+        else:
+            stats["no-cpe"] += 1
         stats["patches"] += pkg.patch_count
     return stats
 
@@ -422,6 +474,20 @@  def dump_html_pkg(f, pkg):
     f.write("  <td class=\"%s\">%d</td>\n" %
             (" ".join(td_class), pkg.warnings))
 
+    # CPE Valid
+    td_class = ["centered"]
+    if not pkg.has_cpe:
+        td_class.append("wrong")
+        f.write("  <td class=\"%s\">%s</td>\n" %
+                (" ".join(td_class), boolean_str(pkg.has_cpe)))
+    elif pkg.has_cpe == "Update":
+        td_class.append("wrong")
+        f.write("  <td class=\"%s\">Update</td>\n" %
+                (" ".join(td_class)))
+    else:
+        td_class.append("correct")
+        f.write("  <td class=\"%s\"><a href=\"%s\">%s</a></td>\n" %
+                (" ".join(td_class), pkg.has_cpe, boolean_str(pkg.has_cpe)))
     f.write(" </tr>\n")
 
 
@@ -437,6 +503,7 @@  def dump_html_all_pkgs(f, packages):
 <td class=\"centered\">Hash file</td>
 <td class=\"centered\">Current version</td>
 <td class=\"centered\">Warnings</td>
+<td class=\"centered\">CPE Valid</td>
 </tr>
 """)
     for pkg in sorted(packages):
@@ -463,6 +530,12 @@  def dump_html_stats(f, stats):
             stats["hash"])
     f.write(" <tr><td>Packages not having a hash file</td><td>%s</td></tr>\n" %
             stats["no-hash"])
+    f.write(" <tr><td>Packages having a registered CPE</td><td>%s</td></tr>\n" %
+            stats["cpe"])
+    f.write(" <tr><td>Packages needing CPE update</td><td>%s</td></tr>\n" %
+            stats["update-cpe"])
+    f.write(" <tr><td>Packages missing a registered CPE</td><td>%s</td></tr>\n" %
+            stats["no-cpe"])
     f.write(" <tr><td>Total number of patches</td><td>%s</td></tr>\n" %
             stats["patches"])
     f.write("</table>\n")
@@ -485,42 +558,121 @@  def dump_html(packages, stats, output):
         f.write(html_footer)
 
 
+class CPE:
+    all_cpes = dict()
+
+    def get_xml_dict(self):
+        print("CPE: Fetching xml manifest...")
+        try:
+            compressed_cpe_file = urllib2.urlopen(CPE_XML_URL)
+            print("CPE: Unzipping xml manifest...")
+            cpe_file = gzip.GzipFile(fileobj=StringIO(compressed_cpe_file.read())).read()
+            print("CPE: Converting xml manifest to dict...")
+            self.all_cpes = xmltodict.parse(cpe_file)
+        except urllib2.HTTPError:
+            print("CPE: HTTP Error: %s" % CPE_XML_URL)
+            sys.exit(1)
+        except urllib2.URLError:
+            print("CPE: URL Error: %s" % CPE_XML_URL)
+            sys.exit(1)
+
+    def find_partial(self, cpe_str):
+        print("CPE: Searching for partial [%s]" % cpe_str)
+        for cpe in self.all_cpes['cpe-list']['cpe-item']:
+            if cpe_str in cpe['cpe-23:cpe23-item']['@name']:
+                return cpe['cpe-23:cpe23-item']['@name']
+
+    def find(self, cpe_str):
+        print("CPE: Searching for [%s]" % cpe_str)
+        for cpe in self.all_cpes['cpe-list']['cpe-item']:
+            if cpe['cpe-23:cpe23-item']['@name'] == cpe_str:
+                return cpe['cpe-23:cpe23-item']['@name']
+
+    def get_cpe_no_version(self, cpe):
+        return "".join(cpe.split(":")[:5])
+
+    def get_nvd_url(self, cpe_str):
+        return "https://nvd.nist.gov/products/cpe/search/results?keyword=" + \
+                urllib2.quote(cpe_str) + \
+                "&status=FINAL&orderBy=CPEURI&namingFormat=2.3"
+
+
+def get_target_cpe_report(cpe_report_file, cpe_dict):
+    report_cpe_exact_match = ""
+    report_cpe_needing_update = ""
+    report_cpe_missing = ""
+
+    print("CPE: Checking for matches...")
+    try:
+        with open(cpe_report_file) as cpe_file:
+            cpe_list = csv.reader(cpe_file)
+            next(cpe_list)  # make cpe-info has a one line header
+            for cpe in cpe_list:
+                result = cpe_dict.find(cpe[0])
+                if not result:
+                    result = cpe_dict.find_partial(cpe_dict.get_cpe_no_version(cpe[0]))
+                    if not result:
+                        report_cpe_missing += cpe[0] + "\n"
+                    else:
+                        report_cpe_needing_update += cpe[0] + "\n"
+                else:
+                    report_cpe_exact_match += cpe[0] + "\n"
+    except (OSError, IOError) as e:
+        print("CPE: report csv file (%s): %s" % (e.errno, e.strerror))
+        sys.exit(1)
+
+    print("CPE: Found EXACT match:\n" + report_cpe_exact_match)
+    print("CPE: Found but REQUIRES UPDATE:\n" + report_cpe_needing_update)
+    print("CPE: Not found (proposing the following to be added):\n" + report_cpe_missing)
+
+
 def parse_args():
     parser = argparse.ArgumentParser()
-    parser.add_argument('-o', dest='output', action='store', required=True,
+    parser.add_argument('-o', dest='output', action='store',
                         help='HTML output file')
     parser.add_argument('-n', dest='npackages', type=int, action='store',
                         help='Number of packages')
     parser.add_argument('-p', dest='packages', action='store',
                         help='List of packages (comma separated)')
+    parser.add_argument('-c', dest='cpe_report', action='store',
+                        help='CPE Report generated by make cpe-info (csv format)')
     return parser.parse_args()
 
 
 def __main__():
     args = parse_args()
     if args.npackages and args.packages:
-        print "ERROR: -n and -p are mutually exclusive"
+        print("ERROR: -n and -p are mutually exclusive")
         sys.exit(1)
     if args.packages:
         package_list = args.packages.split(",")
     else:
         package_list = None
-    print "Build package list ..."
-    packages = get_pkglist(args.npackages, package_list)
-    print "Getting package make info ..."
-    package_init_make_info()
-    print "Getting package details ..."
-    for pkg in packages:
-        pkg.set_infra()
-        pkg.set_license()
-        pkg.set_hash_info()
-        pkg.set_patch_count()
-        pkg.set_check_package_warnings()
-        pkg.set_current_version()
-    print "Calculate stats"
-    stats = calculate_stats(packages)
-    print "Write HTML"
-    dump_html(packages, stats, args.output)
+    cpe_dict = CPE()
+    cpe_dict.get_xml_dict()
+    if args.cpe_report:
+        print("Performing Target CPE Report Analysis...")
+        get_target_cpe_report(args.cpe_report, cpe_dict)
+    elif args.output:
+        print("Build package list ...")
+        packages = get_pkglist(args.npackages, package_list)
+        print("Getting package make info ...")
+        package_init_make_info()
+        print("Getting package details ...")
+        for pkg in packages:
+            pkg.set_infra()
+            pkg.set_license()
+            pkg.set_hash_info()
+            pkg.set_patch_count()
+            pkg.set_check_package_warnings()
+            pkg.set_current_version()
+            pkg.set_cpe_info(cpe_dict)
+        print("Calculate stats")
+        stats = calculate_stats(packages)
+        print("Write HTML")
+        dump_html(packages, stats, args.output)
+    else:
+        print("Please provide the -o HTML output file arg")
 
 
 __main__()