
[v3,5/8] support/scripts/nvd_api_v2.py: new helper class

Message ID 20230812192842.135682-5-dalang@gmx.at
State Superseded
Series [v3,1/8] support/scripts/pkg-stats: fix typos

Commit Message

Daniel Lang Aug. 12, 2023, 7:28 p.m. UTC
The current NVD data feeds used for CVE and CPE checking will be retired
by 2023-12-05 [0]. Both have to be switched to the new v2 API. Since
fetching data from both sources works the same, a common base class is
used to handle the API interaction.
To store the data locally, a sqlite database is used.

[0]: https://nvd.nist.gov/General/News/change-timeline

Signed-off-by: Daniel Lang <dalang@gmx.at>
---
 DEVELOPERS                    |   1 +
 support/scripts/nvd_api_v2.py | 138 ++++++++++++++++++++++++++++++++++
 2 files changed, 139 insertions(+)
 create mode 100755 support/scripts/nvd_api_v2.py

Comments

Arnout Vandecappelle Aug. 30, 2023, 9:37 p.m. UTC | #1
Hi Daniel,

  Since this is new code, I'm going to give some coding style comments that 
differ a bit from existing code you see in pkg-stats.

  I may actually end up applying the patch without those coding style changes, 
because the new NVD API is more important than any coding style concerns... 
We'll see how far I get.

  This is also not a complete review of this patch because it's almost time for 
me to go to bed.

On 12/08/2023 21:28, Daniel Lang wrote:
> The current NVD data feeds used for CVE and CPE checking will be retired
> by 2023-12-05 [0]. Both have to be switched to the new v2 API. Since
> fetching data from both sources workes the same, a common base class is

  workes -> works

> used to handle the API interaction.
> To store the data locally a sqlite database is used.

  Maybe explain a bit more that the new API doesn't allow downloading files 
anymore, so we need to invent our own file format for persisting the data, and 
a sqlite db is a convenient format that is always available in Python.

> 
> [0]: https://nvd.nist.gov/General/News/change-timeline
> 
> Signed-off-by: Daniel Lang <dalang@gmx.at>
> ---
>   DEVELOPERS                    |   1 +
>   support/scripts/nvd_api_v2.py | 138 ++++++++++++++++++++++++++++++++++
>   2 files changed, 139 insertions(+)
>   create mode 100755 support/scripts/nvd_api_v2.py
> 
> diff --git a/DEVELOPERS b/DEVELOPERS
> index 6ffa3ee693..81f809a4c0 100644
> --- a/DEVELOPERS
> +++ b/DEVELOPERS
> @@ -668,6 +668,7 @@ F:	package/paho-mqtt-cpp/
>   F:	package/pangomm/
>   F:	package/pangomm2_46/
>   F:	package/sam-ba/
> +F:	support/scripts/nvd_api_v2.py
>   
>   N:	Damien Lanson <damien@kal-host.com>
>   F:	package/libvdpau/
> diff --git a/support/scripts/nvd_api_v2.py b/support/scripts/nvd_api_v2.py
> new file mode 100755
> index 0000000000..3fdf32596f
> --- /dev/null
> +++ b/support/scripts/nvd_api_v2.py
> @@ -0,0 +1,138 @@
> +#!/usr/bin/env python3
> +
> +from datetime import datetime, timezone
> +import os

  Personally, I think it's better to move away from the os module and use 
pathlib.Path instead everywhere a path is used. We don't do that anywhere in 
Buildroot at the moment, but I think we should.
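
  For illustration, a rough sketch of what the pathlib equivalents of the 
paths used below could look like (hypothetical, not part of the patch; the 
nvd_path value is made up):

    from pathlib import Path

    NVD_API_VERSION = '2.0'

    nvd_path = Path('dl/nvd')                               # assumed location
    db_file = nvd_path / f'cpe-{NVD_API_VERSION}.sqlite'    # os.path.join()
    db_file_tmp = db_file.with_name(db_file.name + '.tmp')  # '%s.tmp' % ...
    if db_file_tmp.exists():                                # os.path.exists()
        db_file_tmp.unlink()                                # os.remove()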


> +import requests
> +import shutil
> +import sqlite3
> +import time
> +
> +NVD_API_VERSION = '2.0'
> +NVD_API_BASE_URL = 'https://services.nvd.nist.gov/rest/json'
> +
> +
> +class NVD_API:
> +    """
> +    A helper class that fetches data from an NVD API and
> +    helps manage a sqlite database.
> +    """
> +    def __init__(self, nvd_path, service, database_file_prefix):
> +        """
> +        Initialize a new NVD API endpoint with service
> +        as the URL postfix.

  Could you add documentation for the arguments? I was initially confused by 
nvd_path, it sounded like the path part of the URL...

> +        """
> +        self.nvd_path = nvd_path
> +        self.service = service
> +        self.url = '%s/%s/%s' % (NVD_API_BASE_URL, service.lower(), NVD_API_VERSION)

  Prefer to use format strings:

         self.url = f'{NVD_API_BASE_URL}/{service.lower()}/{NVD_API_VERSION}'

> +        self.db_file = os.path.join(nvd_path, '%s-%s.sqlite' % (database_file_prefix, NVD_API_VERSION))
> +        self.db_file_tmp = '%s.tmp' % self.db_file

  I haven't looked in too much detail at how the tmp file is used, but I wonder 
if it is really needed. sqlite should be able to manage access to the database 
so that it stays consistent. If we make sure to use transactions if more than 
one row needs to be stored consistently (which I doubt is really needed anyway), 
then we should always be able to read consistent data from the database. And if 
we re-fetch information from the NVD database, then we should be able to update 
the existing row if it actually was already downloaded before.

  If I'm missing something here, please add a detailed explanation to the 
beginning of this file or class to understand why the tmp file is needed.
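
  For the record, a rough sketch of that alternative (the table name and 
schema here are invented for illustration): each page would be stored inside 
a transaction, so readers never see a half-written update:

    import sqlite3

    conn = sqlite3.connect('cpe-2.0.sqlite')
    conn.execute('CREATE TABLE IF NOT EXISTS cves (id TEXT PRIMARY KEY, data TEXT)')

    # 'with conn:' commits on success and rolls back on an exception.
    with conn:
        conn.execute('INSERT OR REPLACE INTO cves VALUES (?, ?)',
                     ('CVE-2023-0001', '{...}'))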

> +
> +    def init_db(self):
> +        """
> +        Needs to be implemented by derived classes.
> +        Used to make sure that database tables exist.
> +        """
> +        pass
> +
> +    def save_to_db(self, start_index, total_results, content):
> +        """
> +        Needs to be implemented by derived classes.
> +        Used to save the data given by a single API request
> +        to the database.
> +        """
> +        pass
> +
> +    def cleanup_db(self):

  The name of the function doesn't sound very good to me - it makes me think 
you're deleting the db file itself, not the tmp file.

> +        """
> +        Clean up any files that were left by previously
> +        failed runs.
> +        """
> +        if os.path.exists(self.db_file_tmp):
> +            os.remove(self.db_file_tmp)
> +
> +    def open_db(self, tmp=False):
> +        """
> +        Open and return a connection to the sqlite database.
> +        """
> +        if tmp:
> +            return sqlite3.connect(self.db_file_tmp)
> +        return sqlite3.connect(self.db_file)

  I think putting this in a function is making things more complicated than 
calling sqlite3.connect directly from the caller.

> +
> +    def download(self, last_update):
> +        """
> +        Download all entries from NVD since last_update (if not None).
> +        For each downloaded page save_to_db is called together with
> +        progress information.
> +        NVD rate limiting allows 5 requests per 30 seconds or one every
> +        6 seconds.

  Could we maybe define those numbers as class variables and calculate the sleep 
etc. times that are used below based on those, instead of having the constants 5 and 6?
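
  Something along these lines, say (just a sketch, the names are invented):

    class NVD_API:
        # Public NVD rate limit: 5 requests per rolling 30-second window.
        RATE_LIMIT_REQUESTS = 5
        RATE_LIMIT_WINDOW_S = 30
        # Sleeping this long between requests stays under the limit.
        REQUEST_INTERVAL_S = RATE_LIMIT_WINDOW_S / RATE_LIMIT_REQUESTS

  and then use time.sleep(self.REQUEST_INTERVAL_S) below instead of time.sleep(6).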

> +        """
> +        args = {}
> +        start_index = 0
> +        total_results = 0
> +        results_per_page = 0
> +
> +        print('Downloading new %s' % self.service)
> +
> +        if (last_update is not None):
> +            args['lastModStartDate'] = last_update.isoformat()
> +            args['lastModEndDate'] = datetime.now(tz=timezone.utc).isoformat()
> +
> +        while True:
> +            args['startIndex'] = start_index
> +
> +            for attempt in range(5):
> +                try:
> +                    page = requests.get(self.url, params=args)
> +                    page.raise_for_status()
> +                    content = page.json()
> +                except Exception:

  This will also catch unrecoverable errors, e.g. DNS lookup failures, which 
leads to an endless loop without any feedback. Would it be possible to instead 
catch the exact exception that requests raises when we hit rate limiting?
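
  Something like this, perhaps (a sketch: it retries on timeouts and on HTTP 
errors, e.g. the 403 that the NVD service seems to return when rate limiting, 
and lets everything else, like requests.ConnectionError, propagate):

    import time
    import requests

    url = 'https://services.nvd.nist.gov/rest/json/cves/2.0'
    args = {'startIndex': 0}
    content = None
    for attempt in range(5):
        try:
            page = requests.get(url, params=args, timeout=30)
            page.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
            content = page.json()
        except (requests.Timeout, requests.HTTPError):
            time.sleep(6)  # back off before retrying
        else:
            break
    # content is still None here if all five attempts failed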

> +                    time.sleep(6)
> +                else:
> +                    break
> +
> +            if content is None:
> +                # Nothing was downloaded
> +                return False
> +
> +            results_per_page = content['resultsPerPage']
> +            total_results = content['totalResults']
> +            start_index = content['startIndex']
> +
> +            start_index += results_per_page
> +
> +            if self.save_to_db(start_index, total_results, content) is not True:

  Instead of "is not True" just do "if not self.save_to_db".

  I'm not sure what the start_index and total_results parameters are supposed to 
be used for. Are they supposed to be persisted? If yes, then IMHO this should be 
done by this class, in a separate table defined by this class.


> +                return False
> +
> +            self.connection.commit()
> +
> +            if start_index >= total_results:
> +                return True
> +
> +            # Otherwise rate limiting will be hit.
> +            time.sleep(6)

  If the rate limit really is "5 requests per 30 seconds", then we can actually 
do 3 requests back-to-back without any sleep without hitting the rate limit, 
right? Assuming you do call the script pretty regularly (like, what we do on 
a.b.o), the new results should be just a few pages, so we can avoid the sleep 
entirely...

> +
> +    def check_for_updates(self):
> +        """
> +        Check if the database file exists and if the last
> +        update was more than 24 hours ago.
> +        """
> +        self.cleanup_db()
> +        last_update = None
> +        if os.path.exists(self.db_file):
> +            last_update = os.stat(self.db_file).st_mtime

  I don't think the mtime is a reliable way to determine the last update time. 
Instead, IMHO it's better to add a table in the database that keeps track of the 
last update. This would be a table that is added by the base class itself before 
calling init_db(). After the download has completed entirely, this timestamp 
would be updated. So if it's interrupted in the middle, the next try will 
re-download with the previous timestamp. Also, the timestamp should be updated 
with the timestamp that was used in the lastModEndDate (which can be quite a bit 
earlier than the time at which the download actually finishes). Perhaps even 
subtract one second from it to make sure we don't miss any updates that were 
entered at the exact time we did the previous download.
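
  A rough sketch of that idea (the table and column names are invented here):

    import sqlite3
    from datetime import datetime, timedelta, timezone

    conn = sqlite3.connect('cpe-2.0.sqlite')
    conn.execute('CREATE TABLE IF NOT EXISTS meta (last_update TEXT)')

    # After the whole download succeeded, persist the lastModEndDate that
    # was used for the requests, minus one second as a safety margin:
    end_date = datetime.now(tz=timezone.utc) - timedelta(seconds=1)
    with conn:
        conn.execute('DELETE FROM meta')
        conn.execute('INSERT INTO meta VALUES (?)', (end_date.isoformat(),))

    # On the next run, read it back as the new lastModStartDate:
    row = conn.execute('SELECT last_update FROM meta').fetchone()
    last_update = datetime.fromisoformat(row[0]) if row else None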


> +            if last_update >= time.time() - 86400:
> +                return []
> +            # NVD uses UTC timestamps
> +            last_update = datetime.fromtimestamp(last_update, tz=timezone.utc)
> +            shutil.copy2(self.db_file, self.db_file_tmp)
> +
> +        self.connection = self.open_db(True)

  self.connection = None should be added to __init__ to "declare" the attribute.

  Also, db_connection would be a better name.
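
  I.e. something like this (sketch):

    def __init__(self, nvd_path, service, database_file_prefix):
        ...
        # "Declare" the attribute here; the connection is only opened
        # later, in check_for_updates().
        self.db_connection = None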


  Now it's time for bed :-)

  Regards,
  Arnout

> +        self.init_db()
> +
> +        success = self.download(last_update)
> +        self.connection.close()
> +        if success:
> +            shutil.move(self.db_file_tmp, self.db_file)
> +        else:
> +            print("Update failed!")
> +            os.remove(self.db_file_tmp)
Daniel Lang Aug. 31, 2023, 8:18 p.m. UTC | #2
Hello Arnout,

thanks for the detailed comments.
I will try to address them and send a new version quickly.

On 30.08.23 23:37, Arnout Vandecappelle wrote:
>  Hi Daniel,
>
>  Since this is new code, I'm going to give some coding style comments that differ a bit from existing code you see in pkg-stats.
>
>  I may actually end up applying the patch without those coding style changes, because the new NVD API is more important than any coding style concerns... We'll see how far I get.
>
>  This is also not a complete review of this patch because it's almost time for me to go to bed.
>

[-SNIP-]

>> +        """
>> +        self.nvd_path = nvd_path
>> +        self.service = service
>> +        self.url = '%s/%s/%s' % (NVD_API_BASE_URL, service.lower(), NVD_API_VERSION)
>
>  Prefer to use format strings:
>
>         self.url = f'{NVD_API_BASE_URL}/{service.lower()}/{NVD_API_VERSION}'

This means that pkg-stats needs python >= 3.6.
I couldn't find anything about the minimum version that we currently require.

>
>> +        self.db_file = os.path.join(nvd_path, '%s-%s.sqlite' % (database_file_prefix, NVD_API_VERSION))
>> +        self.db_file_tmp = '%s.tmp' % self.db_file
>
>  I haven't looked in too much detail at how the tmp file is used, but I wonder if it is really needed. sqlite should be able to manage access to the database so that it stays consistent. If we make sure to use transactions if more than one row needs to be stored consistently (which I doubt is really needed anyway), then we should always be able to read consistent data from the database. And if we re-fetch information from the NVD database, then we should be able to update the existing row if it actually was already downloaded before.
>
>  If I'm missing something here, please add a detailed explanation to the beginning of this file or class to understand why the tmp file is needed.

The idea of the tmp file is that changes (inserts, updates) are only
done on the tmp file. If the update fails mid-process, we have an
unmodified version.
Inserting directly into the database would update the modification date,
which is used to calculate the last update.
Considering your suggestion further down to add a metadata table
that stores the last update date, this becomes less relevant and
the tmp handling could probably be dropped.

>
>> +
>> +    def init_db(self):
>> +        """
>> +        Needs to be implemented by derived classes.
>> +        Used to make sure that database tables exist.
>> +        """
>> +        pass

[-SNIP-]

>> +
>> +    def download(self, last_update):
>> +        """
>> +        Download all entries from NVD since last_update (if not None).
>> +        For each downloaded page save_to_db is called together with
>> +        progress information.
>> +        NVD rate limiting allows 5 requests per 30 seconds or one every
>> +        6 seconds.
>
>  Could we maybe define those numbers as class variables and calculate the sleep etc. times that are used below based on those, instead of having the constants 5 and 6?

Sure, for the code the only relevant number is 6.
But I will add a variable that is defined as 30 / 5.

>
>> +        """
>> +        args = {}
>> +        start_index = 0
>> +        total_results = 0
>> +        results_per_page = 0
>> +
>> +        print('Downloading new %s' % self.service)
>> +
>> +        if (last_update is not None):
>> +            args['lastModStartDate'] = last_update.isoformat()
>> +            args['lastModEndDate'] = datetime.now(tz=timezone.utc).isoformat()
>> +
>> +        while True:
>> +            args['startIndex'] = start_index
>> +
>> +            for attempt in range(5):
>> +                try:
>> +                    page = requests.get(self.url, params=args)
>> +                    page.raise_for_status()
>> +                    content = page.json()
>> +                except Exception:
>
>  This will also catch unrecoverable errors, e.g. DNS lookup failures, which leads to an endless loop without any feedback. Would it be possible to instead catch the exact exception that requests raises when we hit rate limiting?

If I remember correctly I encountered different errors while
downloading the complete set. But I will rerun it and see if
the errors can be narrowed down.

>
>> +                    time.sleep(6)
>> +                else:
>> +                    break
>> +
>> +            if content is None:
>> +                # Nothing was downloaded
>> +                return False
>> +
>> +            results_per_page = content['resultsPerPage']
>> +            total_results = content['totalResults']
>> +            start_index = content['startIndex']
>> +
>> +            start_index += results_per_page
>> +
>> +            if self.save_to_db(start_index, total_results, content) is not True:
>
>  Instead of "is not True" just do "if not self.save_to_db".
>
>  I'm not sure what the start_index and total_results parameters are supposed to be used for. Are they supposed to be persisted? If yes, then IMHO this should be done by this class, in a separate table defined by this class.

start_index and total_results are used by the derived classes
to report the progress like [001000/200000].
I didn't want to do it in this class to be more flexible.
I decided that CVE progress is always reported with 6 digits
and CPE with 7 digits.
Does that make sense?

>
>
>> +                return False
>> +
>> +            self.connection.commit()
>> +
>> +            if start_index >= total_results:
>> +                return True
>> +
>> +            # Otherwise rate limiting will be hit.
>> +            time.sleep(6)
>
>  If the rate limit really is "5 requests per 30 seconds", then we can actually do 3 requests back-to-back without any sleep without hitting the rate limit, right? Assuming you do call the script pretty regularly (like, what we do on a.b.o), the new results should be just a few pages, so we can avoid the sleep entirely...

The official wording [0] is "The public rate limit is 5 requests
in a rolling 30 second window; the rate limit with an API key is
50 requests in a rolling 30 seconds window. [...] However,
it is still recommended that your application sleeps for
several seconds between requests so that legitimate requests
are not denied, and all requests are responded to in sequence."

My understanding of this statement is that a sleep should always
be added.
In my experience testing this, timeouts/errors occur even with
the 6 second sleep.

>
>> +
>> +    def check_for_updates(self):
>> +        """
>> +        Check if the database file exists and if the last
>> +        update was more than 24 hours ago.
>> +        """
>> +        self.cleanup_db()
>> +        last_update = None
>> +        if os.path.exists(self.db_file):
>> +            last_update = os.stat(self.db_file).st_mtime
>
>  I don't think the mtime is a reliable way to determine the last update time. Instead, IMHO it's better to add a table in the database that keeps track of the last update. This would be a table that is added by the base class itself before calling init_db(). After the download has completed entirely, this timestamp would be updated. So if it's interrupted in the middle, the next try will re-download with the previous timestamp. Also, the timestamp should be updated with the timestamp that was used in the lastModEndDate (which can be quite a bit earlier than the time at which the download actually finishes). Perhaps even subtract one second from it to make sure we don't miss any updates that were entered at the exact time we did the previous download.

Good point.
As addressed above this would also make the tmp file irrelevant.

>
>
>> +            if last_update >= time.time() - 86400:
>> +                return []
>> +            # NVD uses UTC timestamps
>> +            last_update = datetime.fromtimestamp(last_update, tz=timezone.utc)
>> +            shutil.copy2(self.db_file, self.db_file_tmp)
>> +
>> +        self.connection = self.open_db(True)
>
>  self.connection = None should be added to __init__ to "declare" the attribute.
>
>  Also, db_connection would be a better name.
>
>
>  Now it's time for bed :-)
>
>  Regards,
>  Arnout
>
>> +        self.init_db()
>> +
>> +        success = self.download(last_update)
>> +        self.connection.close()
>> +        if success:
>> +            shutil.move(self.db_file_tmp, self.db_file)
>> +        else:
>> +            print("Update failed!")
>> +            os.remove(self.db_file_tmp)

Regards,
Daniel

[0]: https://nvd.nist.gov/developers/start-here
Arnout Vandecappelle Sept. 1, 2023, 7:03 a.m. UTC | #3
On 31/08/2023 22:18, Daniel Lang wrote:
> Hello Arnout,
> 
> thanks for the detailed comments.
> I will try to address them and send a new version quickly.
> 
> On 30.08.23 23:37, Arnout Vandecappelle wrote:
>>   Hi Daniel,
>>
>>   Since this is new code, I'm going to give some coding style comments that differ a bit from existing code you see in pkg-stats.
>>
>>   I may actually end up applying the patch without those coding style changes, because the new NVD API is more important than any coding style concerns... We'll see how far I get.
>>
>>   This is also not a complete review of this patch because it's almost time for me to go to bed.
>>
> 
> [-SNIP-]
> 
>>> +        """
>>> +        self.nvd_path = nvd_path
>>> +        self.service = service
>>> +        self.url = '%s/%s/%s' % (NVD_API_BASE_URL, service.lower(), NVD_API_VERSION)
>>
>>   Prefer to use format strings:
>>
>>          self.url = f'{NVD_API_BASE_URL}/{service.lower()}/{NVD_API_VERSION}'
> 
> This means that pkg-stats needs python >= 3.6.

  I think a.b.o runs the script inside the buildroot/base container, 
which has 3.9.2. Thomas?

  Anyway, pkg-stats already uses f-strings in many places.

> I couldn't find anything about the minimum version that we currently require.

[snip]
>>> +            results_per_page = content['resultsPerPage']
>>> +            total_results = content['totalResults']
>>> +            start_index = content['startIndex']
>>> +
>>> +            start_index += results_per_page
>>> +
>>> +            if self.save_to_db(start_index, total_results, content) is not True:
>>
>>   Instead of "is not True" just do "if not self.save_to_db".
>>
>>   I'm not sure what the start_index and total_results parameters are supposed to be used for. Are they supposed to be persisted? If yes, then IMHO this should be done by this class, in a separate table defined by this class.
> 
> start_index and total_results are used by the derived classes
> to report the progress like [001000/200000].
> I didn't want to do it in this class to be more flexible.

  Er, sounds like this would be the same in all derived classes though? But I 
haven't looked at the rest yet so I could be wrong.

> I decided that CVE progress is always reported with 6 digits
> and CPE with 7 digits.

  Ah, that's the flexibility you mean.

  I would count the number of digits in total_results and use that. You can do 
nested interpolation in an f-string, like:

f'[{start_index:0{len(str(total_results))}}/{total_results}]'
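
  Quick illustration of what that produces:

    start_index, total_results = 1000, 200000
    print(f'[{start_index:0{len(str(total_results))}}/{total_results}]')
    # prints: [001000/200000]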


> Does that make sense?
> 
>>
>>
>>> +                return False
>>> +
>>> +            self.connection.commit()
>>> +
>>> +            if start_index >= total_results:
>>> +                return True
>>> +
>>> +            # Otherwise rate limiting will be hit.
>>> +            time.sleep(6)
>>
>>   If the rate limit really is "5 requests per 30 seconds", then we can actually do 3 requests back-to-back without any sleep without hitting the rate limit, right? Assuming you do call the script pretty regularly (like, what we do on a.b.o), the new results should be just a few pages, so we can avoid the sleep entirely...

  In your experience, is it true that we only need to download a few pages if 
you update every day? Or maybe it's even less than one page - in that case it's 
not really relevant.


> The official wording [0] is "The public rate limit is 5 requests
> in a rolling 30 second window; the rate limit with an API key is
> 50 requests in a rolling 30 seconds window. [...] However,
> it is still recommended that your application sleeps for
> several seconds between requests so that legitimate requests
> are not denied, and all requests are responded to in sequence."
> 
> My understanding of this statement is that a sleep should always
> be added.
> In my experience testing this, timeouts/errors occur even with
> the 6 second sleep.

  My concern is just that a pkg-stats run which could be instantaneous (because 
we have _mostly_ cached results) will now take 6 seconds more than needed.

  But we really shouldn't make the code more complicated to cater to that 
specific issue, so you can probably ignore my comment.


  Regards,
  Arnout


>>> +
>>> +    def check_for_updates(self):
>>> +        """
>>> +        Check if the database file exists and if the last
>>> +        update was more than 24 hours ago.
>>> +        """
>>> +        self.cleanup_db()
>>> +        last_update = None
>>> +        if os.path.exists(self.db_file):
>>> +            last_update = os.stat(self.db_file).st_mtime
>>
>>   I don't think the mtime is a reliable way to determine the last update time. Instead, IMHO it's better to add a table in the database that keeps track of the last update. This would be a table that is added by the base class itself before calling init_db(). After the download has completed entirely, this timestamp would be updated. So if it's interrupted in the middle, the next try will re-download with the previous timestamp. Also, the timestamp should be updated with the timestamp that was used in the lastModEndDate (which can be quite a bit earlier than the time at which the download actually finishes). Perhaps even subtract one second from it to make sure we don't miss any updates that were entered at the exact time we did the previous download.
> 
> Good point.
> As addressed above this would also make the tmp file irrelevant.
> 
>>
>>
>>> +            if last_update >= time.time() - 86400:
>>> +                return []
>>> +            # NVD uses UTC timestamps
>>> +            last_update = datetime.fromtimestamp(last_update, tz=timezone.utc)
>>> +            shutil.copy2(self.db_file, self.db_file_tmp)
>>> +
>>> +        self.connection = self.open_db(True)
>>
>>   self.connection = None should be added to __init__ to "declare" the attribute.
>>
>>   Also, db_connection would be a better name.
>>
>>
>>   Now it's time for bed :-)
>>
>>   Regards,
>>   Arnout
>>
>>> +        self.init_db()
>>> +
>>> +        success = self.download(last_update)
>>> +        self.connection.close()
>>> +        if success:
>>> +            shutil.move(self.db_file_tmp, self.db_file)
>>> +        else:
>>> +            print("Update failed!")
>>> +            os.remove(self.db_file_tmp)
> 
> Regards,
> Daniel
> 
> [0]: https://nvd.nist.gov/developers/start-here
>
Arnout Vandecappelle Sept. 1, 2023, 7:10 a.m. UTC | #4
On 12/08/2023 21:28, Daniel Lang wrote:
[snip]
> +    def check_for_updates(self):
> +        """
> +        Check if the database file exists and if the last
> +        update was more than 24 hours ago.
> +        """
> +        self.cleanup_db()
> +        last_update = None
> +        if os.path.exists(self.db_file):
> +            last_update = os.stat(self.db_file).st_mtime
> +            if last_update >= time.time() - 86400:
> +                return []

  Actually, is this check really needed?

  If it is, it really should be something smaller than 86400. If the script runs 
daily, we really do want to update the database every day. If we use 86400, then 
it will sometimes not update, because it runs just one second earlier than the 
day before.

  But I don't think it's needed. AFAIU, the check_for_updates() would be called 
only once for a run of pkg-stats (well, once for CVE and once for CPE). So it's 
only if you run pkg-stats several times in a row that it's needed to avoid 
re-fetching. You normally don't run pkg-stats several times in a row. But even 
so, it will just hit the rate limit, wait for 6 seconds, and all is well again.

  Maybe your experiments show that that's not good either, e.g. because the rate 
limit will hit harder and harder if you run pkg-stats all the time. In that 
case, a time check is useful, but I'd drastically reduce it, to e.g. 2 hours.

  Regards,
  Arnout


> +            # NVD uses UTC timestamps
> +            last_update = datetime.fromtimestamp(last_update, tz=timezone.utc)
> +            shutil.copy2(self.db_file, self.db_file_tmp)
> +
> +        self.connection = self.open_db(True)
> +        self.init_db()
> +
> +        success = self.download(last_update)
> +        self.connection.close()
> +        if success:
> +            shutil.move(self.db_file_tmp, self.db_file)
> +        else:
> +            print("Update failed!")
> +            os.remove(self.db_file_tmp)
Thomas Petazzoni Sept. 1, 2023, 8:10 a.m. UTC | #5
On Fri, 1 Sep 2023 09:03:45 +0200
Arnout Vandecappelle <arnout@mind.be> wrote:

>   I think a.b.o runs the script inside the buildroot/base container, 
> which has 3.9.2. Thomas?

No, the script is not run in the buildroot/base container. The machine
on which I run the script has Python 3.8.10.

Best regards,

Thomas
Daniel Lang Sept. 1, 2023, 11:08 a.m. UTC | #6
On 01.09.23 09:03, Arnout Vandecappelle wrote:
>
>
> On 31/08/2023 22:18, Daniel Lang wrote:
>> Hello Arnout,
>>
>> thanks for the detailed comments.
>> I will try to address them and send a new version quickly.
>>
>> On 30.08.23 23:37, Arnout Vandecappelle wrote:
>>>   Hi Daniel,
>>>
>>>   Since this is new code, I'm going to give some coding style comments that differ a bit from existing code you see in pkg-stats.
>>>
>>>   I may actually end up applying the patch without those coding style changes, because the new NVD API is more important than any coding style concerns... We'll see how far I get.
>>>
>>>   This is also not a complete review of this patch because it's almost time for me to go to bed.
>>>
>>

[-SNIP-]


>>>> +                return False
>>>> +
>>>> +            self.connection.commit()
>>>> +
>>>> +            if start_index >= total_results:
>>>> +                return True
>>>> +
>>>> +            # Otherwise rate limiting will be hit.
>>>> +            time.sleep(6)
>>>
>>>   If the rate limit really is "5 requests per 30 seconds", then we can actually do 3 requests back-to-back without any sleep without hitting the rate limit, right? Assuming you do call the script pretty regularly (like, what we do on a.b.o), the new results should be just a few pages, so we can avoid the sleep entirely...
>
>  In your experience, is it true that we only need to download a few pages if you update every day? Or maybe it's even less than one page - in that case it's not really relevant.

214 CVEs were updated in the last 24h.
978 CVEs were updated in the last week.
Not sure those are representative numbers, but autobuilders
should never have to fetch more than 1-2 pages.

Patch

diff --git a/DEVELOPERS b/DEVELOPERS
index 6ffa3ee693..81f809a4c0 100644
--- a/DEVELOPERS
+++ b/DEVELOPERS
@@ -668,6 +668,7 @@  F:	package/paho-mqtt-cpp/
 F:	package/pangomm/
 F:	package/pangomm2_46/
 F:	package/sam-ba/
+F:	support/scripts/nvd_api_v2.py
 
 N:	Damien Lanson <damien@kal-host.com>
 F:	package/libvdpau/
diff --git a/support/scripts/nvd_api_v2.py b/support/scripts/nvd_api_v2.py
new file mode 100755
index 0000000000..3fdf32596f
--- /dev/null
+++ b/support/scripts/nvd_api_v2.py
@@ -0,0 +1,138 @@ 
+#!/usr/bin/env python3
+
+from datetime import datetime, timezone
+import os
+import requests
+import shutil
+import sqlite3
+import time
+
+NVD_API_VERSION = '2.0'
+NVD_API_BASE_URL = 'https://services.nvd.nist.gov/rest/json'
+
+
+class NVD_API:
+    """
+    A helper class that fetches data from an NVD API and
+    helps manage a sqlite database.
+    """
+    def __init__(self, nvd_path, service, database_file_prefix):
+        """
+        Initialize a new NVD API endpoint with service
+        as the URL postfix.
+        """
+        self.nvd_path = nvd_path
+        self.service = service
+        self.url = '%s/%s/%s' % (NVD_API_BASE_URL, service.lower(), NVD_API_VERSION)
+        self.db_file = os.path.join(nvd_path, '%s-%s.sqlite' % (database_file_prefix, NVD_API_VERSION))
+        self.db_file_tmp = '%s.tmp' % self.db_file
+
+    def init_db(self):
+        """
+        Needs to be implemented by derived classes.
+        Used to make sure that database tables exist.
+        """
+        pass
+
+    def save_to_db(self, start_index, total_results, content):
+        """
+        Needs to be implemented by derived classes.
+        Used to save the data given by a single API request
+        to the database.
+        """
+        pass
+
+    def cleanup_db(self):
+        """
+        Clean up any files that were left by previously
+        failed runs.
+        """
+        if os.path.exists(self.db_file_tmp):
+            os.remove(self.db_file_tmp)
+
+    def open_db(self, tmp=False):
+        """
+        Open and return a connection to the sqlite database.
+        """
+        if tmp:
+            return sqlite3.connect(self.db_file_tmp)
+        return sqlite3.connect(self.db_file)
+
+    def download(self, last_update):
+        """
+        Download all entries from NVD since last_update (if not None).
+        For each downloaded page save_to_db is called together with
+        progress information.
+        NVD rate limiting allows 5 requests per 30 seconds or one every
+        6 seconds.
+        """
+        args = {}
+        start_index = 0
+        total_results = 0
+        results_per_page = 0
+
+        print('Downloading new %s' % self.service)
+
+        if (last_update is not None):
+            args['lastModStartDate'] = last_update.isoformat()
+            args['lastModEndDate'] = datetime.now(tz=timezone.utc).isoformat()
+
+        while True:
+            args['startIndex'] = start_index
+
+            for attempt in range(5):
+                try:
+                    page = requests.get(self.url, params=args)
+                    page.raise_for_status()
+                    content = page.json()
+                except Exception:
+                    time.sleep(6)
+                else:
+                    break
+
+            if content is None:
+                # Nothing was downloaded
+                return False
+
+            results_per_page = content['resultsPerPage']
+            total_results = content['totalResults']
+            start_index = content['startIndex']
+
+            start_index += results_per_page
+
+            if self.save_to_db(start_index, total_results, content) is not True:
+                return False
+
+            self.connection.commit()
+
+            if start_index >= total_results:
+                return True
+
+            # Otherwise rate limiting will be hit.
+            time.sleep(6)
+
+    def check_for_updates(self):
+        """
+        Check if the database file exists and if the last
+        update was more than 24 hours ago.
+        """
+        self.cleanup_db()
+        last_update = None
+        if os.path.exists(self.db_file):
+            last_update = os.stat(self.db_file).st_mtime
+            if last_update >= time.time() - 86400:
+                return []
+            # NVD uses UTC timestamps
+            last_update = datetime.fromtimestamp(last_update, tz=timezone.utc)
+            shutil.copy2(self.db_file, self.db_file_tmp)
+
+        self.connection = self.open_db(True)
+        self.init_db()
+
+        success = self.download(last_update)
+        self.connection.close()
+        if success:
+            shutil.move(self.db_file_tmp, self.db_file)
+        else:
+            print("Update failed!")
+            os.remove(self.db_file_tmp)