support/scripts: use FKIE git tree

Message ID 20240318220420.356343-1-yann.morin.1998@free.fr
State Accepted
Series support/scripts: use FKIE git tree

Commit Message

Yann E. MORIN March 18, 2024, 10:04 p.m. UTC
Currently, we grab the per-year CVE feeds, in two passes: first, we grab
the meta files, and check whether something has changed since last we
downloaded it; second, we download the feed proper, unless the meta file
has not changed, in which case we use the locally cached feed.

However, it has appeared that the FKIE releases no longer provide the
meta files, which means that (once again), our daily reports are broken.

The obvious fix would be to drop the use of the meta file, and
unconditionally download the feeds. That's relatively trivial to do,
but the feeds are relatively big (even xz-compressed).

However, the CVE database from FKIE is available as a git tree. Git is
pretty good at only sending deltas when updating a local copy. The git
tree, moreover, contains each CVE as an individual file, so it is
relatively easy to scan and parse.

Switch to using a local git clone.

Slightly surprisingly (but not so much either), parsing the CVE files
is much faster with the git working copy than with the per-year feeds:
the per-year feeds are xz-compressed, and even if python is slow-ish at
scanning a directory and opening the files therein, it is still much
faster than decompressing xz files. The timing [0] goes from ~100s
before to ~10s now, about a tenfold improvement, over the whole package
set.

The drawback, however, is that the git tree is much bigger on-disk:
from ~55MiB for the per-year compressed feeds, to ~2.1GiB for the git
clone, i.e. the git objects (~366MiB) plus a working copy (~1.8GiB)...
Given very few people are going to use that, this is considered
acceptable...

In the end, with a bit of hacking [1], the two pkg-stats runs, before
and after this change, yield the same data (except for the date and
commit hash).

[0] hacking support/scripts/pkg-stats to display the time before/after
the CVE scan, and hacking support/scripts/cve.py to skip the download so
that only the CVE scan happens (and also because the meta files are no
longer available).

[1] sorting the CVE lists in json, sorting the json keys, and using the
commit from the FKIE git tree that was used for the current per-year
feeds.

Signed-off-by: Yann E. MORIN <yann.morin.1998@free.fr>
Cc: Arnout Vandecappelle (Essensium/Mind) <arnout@mind.be>
Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
---
 support/scripts/cve.py | 76 ++++++++++++++++--------------------------
 1 file changed, 29 insertions(+), 47 deletions(-)

Comments

Arnout Vandecappelle March 20, 2024, 8:23 p.m. UTC | #1
Hi Yann,

  Since this is quite urgent again due to pkg-stats being broken at the moment, 
I've applied it to master (mostly) as-is, but I have a bunch of ideas for 
improvements, below.

On 18/03/2024 23:04, Yann E. MORIN wrote:
> Currently, we grab the per-year CVE feeds, in two passes: first, we grab
> the meta files, and check whether something has changed since last we
> downloaded it; second, we download the feed proper, unless the meta file
> has not changed, in which case we use the locally cached feed.
> 
> However, it has appeared that the FKIE releases no longer provide the
> meta files, which means that (once again), our daily reports are broken.
> 
> The obvious fix would be to drop the use of the meta file, and
> unconditionally download the feeds. That's relatively trivial to do,
> but the feeds are relatively big (even xz-compressed).
> 
> However, the CVE database from FKIE is available as a git tree. Git is
> pretty good at only sending deltas when updating a local copy. The git
> tree, moreover, contains each CVE as an individual file, so it is
> relatively easy to scan and parse.
> 
> Switch to using a local git clone.
> 
> Slightly surprisingly (but not so much either), parsing the CVE files
> is much faster with the git working copy than with the per-year feeds:
> the per-year feeds are xz-compressed, and even if python is slow-ish at
> scanning a directory and opening the files therein, it is still much
> faster than decompressing xz files. The timing [0] goes from ~100s
> before to ~10s now, about a tenfold improvement, over the whole package
> set.
> 
> The drawback, however, is that the git tree is much bigger on-disk:
> from ~55MiB for the per-year compressed feeds, to ~2.1GiB for the git
> clone, i.e. the git objects (~366MiB) plus a working copy (~1.8GiB)...
> Given very few people are going to use that, this is considered
> acceptable...

  We could "solve" that by not keeping a working tree at all, just a bare 
repository, and use `git ls-tree --name-only -r origin/main` and `git cat-file` 
to extract the JSON files. It's probably much more efficient than os.walk as 
well (though os.walk isn't much of a bottleneck, I guess).
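
  For illustration, a minimal sketch of that idea (assuming a bare clone in
nvd_git_dir whose remote branch is origin/main; iter_cve_files is a made-up
name, not something in the patch):

      import json
      import subprocess

      def iter_cve_files(nvd_git_dir):
          # List every file known on the remote branch of the bare repository.
          ls = subprocess.run(
              ["git", "ls-tree", "--name-only", "-r", "origin/main"],
              cwd=nvd_git_dir, check=True, capture_output=True, text=True,
          )
          for path in ls.stdout.splitlines():
              if not path.endswith(".json"):
                  continue
              # Read the blob straight from the object store, no working tree.
              blob = subprocess.run(
                  ["git", "cat-file", "blob", f"origin/main:{path}"],
                  cwd=nvd_git_dir, check=True, capture_output=True,
              )
              yield json.loads(blob.stdout)

  One git fork per file would be slow over that many CVEs, though; a single
`git cat-file --batch` process fed all the paths on stdin would avoid that.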


> In the end, with a bit of hacking [1], the two pkg-stats runs, before
> and after this change, yield the same data (except for the date and
> commit hash).
> 
> [0] hacking support/scripts/pkg-stats to display the time before/after
> the CVE scan, and hacking support/scripts/cve.py to skip the download so
> that only the CVE scan happens (and also because the meta files are no
> longer available).
> 
> [1] sorting the CVE lists in json, sorting the json keys, and using the
> commit from the FKIE git tree that was used for the current per-year
> feeds.
> 
> Signed-off-by: Yann E. MORIN <yann.morin.1998@free.fr>
> Cc: Arnout Vandecappelle (Essensium/Mind) <arnout@mind.be>
> Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
> ---
>   support/scripts/cve.py | 76 ++++++++++++++++--------------------------
>   1 file changed, 29 insertions(+), 47 deletions(-)
> 
> diff --git a/support/scripts/cve.py b/support/scripts/cve.py
> index 7167ecbc6a..88c7fde577 100755
> --- a/support/scripts/cve.py
> +++ b/support/scripts/cve.py
> @@ -19,10 +19,9 @@
>   
>   import datetime
>   import os
> -import requests  # URL checking
>   import distutils.version
> -import lzma
> -import time
> +import json
> +import subprocess
>   import sys
>   import operator
>   
> @@ -41,7 +40,7 @@ except ImportError:
>   sys.path.append('utils/')
>   
>   NVD_START_YEAR = 1999
> -NVD_BASE_URL = "https://github.com/fkie-cad/nvd-json-data-feeds/releases/latest/download"
> +NVD_BASE_URL = "https://github.com/fkie-cad/nvd-json-data-feeds/"
>   
>   ops = {
>       '>=': operator.ge,
> @@ -81,41 +80,24 @@ class CVE:
>           self.nvd_cve = nvd_cve
>   
>       @staticmethod
> -    def download_nvd_year(nvd_path, year):
> -        metaf = "CVE-%s.meta" % year
> -        path_metaf = os.path.join(nvd_path, metaf)
> -        jsonf_xz = "CVE-%s.json.xz" % year
> -        path_jsonf_xz = os.path.join(nvd_path, jsonf_xz)
> -
> -        # If the database file is less than a day old, we assume the NVD data
> -        # locally available is recent enough.
> -        if os.path.exists(path_jsonf_xz) and os.stat(path_jsonf_xz).st_mtime >= time.time() - 86400:
> -            return path_jsonf_xz
> -
> -        # If not, we download the meta file
> -        url = "%s/%s" % (NVD_BASE_URL, metaf)
> -        print("Getting %s" % url)
> -        page_meta = requests.get(url)
> -        page_meta.raise_for_status()
> -
> -        # If the meta file already existed, we compare the existing
> -        # one with the data newly downloaded. If they are different,
> -        # we need to re-download the database.
> -        # If the database does not exist locally, we need to redownload it in
> -        # any case.
> -        if os.path.exists(path_metaf) and os.path.exists(path_jsonf_xz):
> -            meta_known = open(path_metaf, "r").read()
> -            if page_meta.text == meta_known:
> -                return path_jsonf_xz
> -
> -        # Grab the compressed JSON NVD, and write files to disk
> -        url = "%s/%s" % (NVD_BASE_URL, jsonf_xz)
> -        print("Getting %s" % url)
> -        page_json = requests.get(url)
> -        page_json.raise_for_status()
> -        open(path_jsonf_xz, "wb").write(page_json.content)
> -        open(path_metaf, "w").write(page_meta.text)
> -        return path_jsonf_xz
> +    def download_nvd(nvd_git_dir):
> +        print(f"Updating from {NVD_BASE_URL}")
> +        if os.path.exists(nvd_git_dir):

  It would be nice if we could automatically recover broken clones. I had hoped 
that we could reuse (part of) support/download/git...
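
  For example, something like this hypothetical recover-by-recloning sketch
(not what support/download/git does, just the simplest possible fallback;
NVD_BASE_URL is the constant from the patch):

      import os
      import shutil
      import subprocess

      NVD_BASE_URL = "https://github.com/fkie-cad/nvd-json-data-feeds/"

      def update_nvd(nvd_git_dir):
          if os.path.exists(nvd_git_dir):
              try:
                  subprocess.check_call(["git", "pull"], cwd=nvd_git_dir)
                  return
              except subprocess.CalledProcessError:
                  # Broken clone (or failed pull): wipe it and fall
                  # through to a fresh clone below.
                  shutil.rmtree(nvd_git_dir)
          os.makedirs(nvd_git_dir)
          subprocess.check_call(["git", "clone", NVD_BASE_URL, nvd_git_dir])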

> +            subprocess.check_call(
> +                ["git", "pull"],
> +                cwd=nvd_git_dir,
> +                stdout=subprocess.DEVNULL,
> +                stderr=subprocess.DEVNULL,
> +            )
> +        else:
> +            # Create the directory and its parents; git
> +            # happily clones into an empty directory.
> +            os.makedirs(nvd_git_dir)
> +            subprocess.check_call(
> +                ["git", "clone", NVD_BASE_URL, nvd_git_dir],
> +                stdout=subprocess.DEVNULL,
> +                stderr=subprocess.DEVNULL,
> +            )
>   
>       @staticmethod
>       def sort_id(cve_ids):
> @@ -131,15 +113,15 @@ class CVE:
>           feeds since NVD_START_YEAR. If the files are missing or outdated in
>           nvd_dir, a fresh copy will be downloaded, and kept in .json.gz
>           """
> +        nvd_git_dir = os.path.join(nvd_dir, "git")
> +        CVE.download_nvd(nvd_git_dir)
>           for year in range(NVD_START_YEAR, datetime.datetime.now().year + 1):

  There's no real need to keep this iteration over years; we can just os.walk 
from the top level (skipping the .git directory by deleting it from dirnames). 
But with git ls-tree it's even better, of course.
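
  A sketch of that pruning, for reference (mutating dirnames in place is how
os.walk is told not to descend into a directory):

      import os

      def iter_json_files(nvd_git_dir):
          for dirpath, dirnames, filenames in os.walk(nvd_git_dir):
              # Remove .git in place so os.walk never descends into it.
              if ".git" in dirnames:
                  dirnames.remove(".git")
              for filename in filenames:
                  if filename.endswith(".json"):
                      yield os.path.join(dirpath, filename)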

> -            filename = CVE.download_nvd_year(nvd_dir, year)
> -            try:
> -                content = ijson.items(lzma.LZMAFile(filename), 'cve_items.item')

  Since ijson is no longer used, we don't need the complicated import any more. 
I removed it.
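
  (That is, presumably the classic fallback pattern, along these lines,
schematically; cve.py already imports sys:

      try:
          import ijson
      except ImportError:
          sys.path.append('utils/')  # fall back to the bundled copy
          import ijson

which is indeed pointless once plain json.load is enough.)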


  Regards,
  Arnout

> -            except:  # noqa: E722
> -                print("ERROR: cannot read %s. Please remove the file then rerun this script" % filename)
> -                raise
> -            for cve in content:
> -                yield cls(cve)
> +            for dirpath, _, filenames in os.walk(os.path.join(nvd_git_dir, f"CVE-{year}")):
> +                for filename in filenames:
> +                    if filename[-5:] != ".json":
> +                        continue
> +                    with open(os.path.join(dirpath, filename), "rb") as f:
> +                        yield cls(json.load(f))
>   
>       def each_product(self):
>           """Iterate over each product section of this cve"""
Peter Korsgaard March 23, 2024, 12:10 p.m. UTC | #2
>>>>> "Yann" == Yann E MORIN <yann.morin.1998@free.fr> writes:

 > Currently, we grab the per-year CVE feeds, in two passes: first, we grab
 > the meta files, and check whether something has changed since last we
 > downloaded it; second, we download the feed proper, unless the meta file
 > has not changed, in which case we use the locally cached feed.

 > However, it has appeared that the FKIE releases no longer provide the
 > meta files, which means that (once again), our daily reports are broken.

 > The obvious fix would be to drop the use of the meta file, and
 > unconditionally download the feeds. That's relatively trivial to do,
 > but the feeds are relatively big (even xz-compressed).

 > However, the CVE database from FKIE is available as a git tree. Git is
 > pretty good at only sending deltas when updating a local copy. The git
 > tree, moreover, contains each CVE as an individual file, so it is
 > relatively easy to scan and parse.

 > Switch to using a local git clone.

 > Slightly surprisingly (but not so much either), parsing the CVE files
 > is much faster with the git working copy than with the per-year feeds:
 > the per-year feeds are xz-compressed, and even if python is slow-ish at
 > scanning a directory and opening the files therein, it is still much
 > faster than decompressing xz files. The timing [0] goes from ~100s
 > before to ~10s now, about a tenfold improvement, over the whole package
 > set.

 > The drawback, however, is that the git tree is much bigger on-disk:
 > from ~55MiB for the per-year compressed feeds, to ~2.1GiB for the git
 > clone, i.e. the git objects (~366MiB) plus a working copy (~1.8GiB)...
 > Given very few people are going to use that, this is considered
 > acceptable...

 > In the end, with a bit of hacking [1], the two pkg-stats runs, before
 > and after this change, yield the same data (except for the date and
 > commit hash).

 > [0] hacking support/scripts/pkg-stats to display the time before/after
 > the CVE scan, and hacking support/scripts/cve.py to skip the download so
 > that only the CVE scan happens (and also because the meta files are no
 > longer available).

 > [1] sorting the CVE lists in json, sorting the json keys, and using the
 > commit from the FKIE git tree that was used for the current per-year
 > feeds.

 > Signed-off-by: Yann E. MORIN <yann.morin.1998@free.fr>
 > Cc: Arnout Vandecappelle (Essensium/Mind) <arnout@mind.be>
 > Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com>

Committed to 2024.02.x, thanks.

Patch

diff --git a/support/scripts/cve.py b/support/scripts/cve.py
index 7167ecbc6a..88c7fde577 100755
--- a/support/scripts/cve.py
+++ b/support/scripts/cve.py
@@ -19,10 +19,9 @@ 
 
 import datetime
 import os
-import requests  # URL checking
 import distutils.version
-import lzma
-import time
+import json
+import subprocess
 import sys
 import operator
 
@@ -41,7 +40,7 @@  except ImportError:
 sys.path.append('utils/')
 
 NVD_START_YEAR = 1999
-NVD_BASE_URL = "https://github.com/fkie-cad/nvd-json-data-feeds/releases/latest/download"
+NVD_BASE_URL = "https://github.com/fkie-cad/nvd-json-data-feeds/"
 
 ops = {
     '>=': operator.ge,
@@ -81,41 +80,24 @@  class CVE:
         self.nvd_cve = nvd_cve
 
     @staticmethod
-    def download_nvd_year(nvd_path, year):
-        metaf = "CVE-%s.meta" % year
-        path_metaf = os.path.join(nvd_path, metaf)
-        jsonf_xz = "CVE-%s.json.xz" % year
-        path_jsonf_xz = os.path.join(nvd_path, jsonf_xz)
-
-        # If the database file is less than a day old, we assume the NVD data
-        # locally available is recent enough.
-        if os.path.exists(path_jsonf_xz) and os.stat(path_jsonf_xz).st_mtime >= time.time() - 86400:
-            return path_jsonf_xz
-
-        # If not, we download the meta file
-        url = "%s/%s" % (NVD_BASE_URL, metaf)
-        print("Getting %s" % url)
-        page_meta = requests.get(url)
-        page_meta.raise_for_status()
-
-        # If the meta file already existed, we compare the existing
-        # one with the data newly downloaded. If they are different,
-        # we need to re-download the database.
-        # If the database does not exist locally, we need to redownload it in
-        # any case.
-        if os.path.exists(path_metaf) and os.path.exists(path_jsonf_xz):
-            meta_known = open(path_metaf, "r").read()
-            if page_meta.text == meta_known:
-                return path_jsonf_xz
-
-        # Grab the compressed JSON NVD, and write files to disk
-        url = "%s/%s" % (NVD_BASE_URL, jsonf_xz)
-        print("Getting %s" % url)
-        page_json = requests.get(url)
-        page_json.raise_for_status()
-        open(path_jsonf_xz, "wb").write(page_json.content)
-        open(path_metaf, "w").write(page_meta.text)
-        return path_jsonf_xz
+    def download_nvd(nvd_git_dir):
+        print(f"Updating from {NVD_BASE_URL}")
+        if os.path.exists(nvd_git_dir):
+            subprocess.check_call(
+                ["git", "pull"],
+                cwd=nvd_git_dir,
+                stdout=subprocess.DEVNULL,
+                stderr=subprocess.DEVNULL,
+            )
+        else:
+            # Create the directory and its parents; git
+            # happily clones into an empty directory.
+            os.makedirs(nvd_git_dir)
+            subprocess.check_call(
+                ["git", "clone", NVD_BASE_URL, nvd_git_dir],
+                stdout=subprocess.DEVNULL,
+                stderr=subprocess.DEVNULL,
+            )
 
     @staticmethod
     def sort_id(cve_ids):
@@ -131,15 +113,15 @@  class CVE:
         feeds since NVD_START_YEAR. If the files are missing or outdated in
         nvd_dir, a fresh copy will be downloaded, and kept in .json.gz
         """
+        nvd_git_dir = os.path.join(nvd_dir, "git")
+        CVE.download_nvd(nvd_git_dir)
         for year in range(NVD_START_YEAR, datetime.datetime.now().year + 1):
-            filename = CVE.download_nvd_year(nvd_dir, year)
-            try:
-                content = ijson.items(lzma.LZMAFile(filename), 'cve_items.item')
-            except:  # noqa: E722
-                print("ERROR: cannot read %s. Please remove the file then rerun this script" % filename)
-                raise
-            for cve in content:
-                yield cls(cve)
+            for dirpath, _, filenames in os.walk(os.path.join(nvd_git_dir, f"CVE-{year}")):
+                for filename in filenames:
+                    if filename[-5:] != ".json":
+                        continue
+                    with open(os.path.join(dirpath, filename), "rb") as f:
+                        yield cls(json.load(f))
 
     def each_product(self):
         """Iterate over each product section of this cve"""