Message ID | a6a18c80f4da8160fd36529a8a4bed22c39a96cb.1546898693.git.yann.morin.1998@free.fr |
---|---|
State | Changes Requested |
Headers | show |
Series | [01/19] infra/pkg-generic: display MESSAGE before running PRE_HOOKS | expand |
El lun., 7 ene. 2019 a las 23:06, Yann E. MORIN (<yann.morin.1998@free.fr>) escribió: > > The existing format for the packages-files lists has two drawbacks: > > - it is not very resilient against filenames with weird characters, > like \n or \b or spaces... > > - it is not easily expandable, partly because of the above. > > AS such, introduce a new format for those files, that solves both > issues. > > First, we must find a way that unambiguously separate fields. There is > one single byte that can never ever occur in filenames or package names, > i.e. the NULL character. So, we use that as a field separator. > > Second, we must find a way to unambiguously separate records. Except for > \0, any character may occur in filenames, but the other existing field > we currently have is the package name, which we do know does not contain > any weird byte (e.g. it's basically limited to [[:alnum:]_-]). Thus, we > can't use a single character as record separator. A solution is to use > \0\n as the record separator. > > Thirdly, we must ensure that filenames never mess up with our > separators. By making the filename the first field, we can be sure that > it is properly terminated by a field separator, and that any leading \n > does not interfere with a previous field separator to form a spurious > record separator. > > So, the new format is now (without spaces): > > filename \0 package-name \0\n > > Update the parser accordingly. > > Signed-off-by: "Yann E. MORIN" <yann.morin.1998@free.fr> > --- > package/pkg-generic.mk | 8 ++++++-- > support/scripts/brpkgutil.py | 28 ++++++++++++++++++++++++---- > 2 files changed, 30 insertions(+), 6 deletions(-) > > diff --git a/package/pkg-generic.mk b/package/pkg-generic.mk > index 7daea190a6..d261b5bf76 100644 > --- a/package/pkg-generic.mk > +++ b/package/pkg-generic.mk > @@ -59,6 +59,10 @@ GLOBAL_INSTRUMENTATION_HOOKS += step_time > > # The suffix is typically empty for the target variant, for legacy backward > # compatibility. > +# Files are record-formatted, with \0\n as record separator, and \0 as > +# field separator. A record is made of these fields: > +# - file path > +# - package name > # $(1): package name > # $(2): base directory to search in > # $(3): suffix of file (optional) > @@ -66,8 +70,8 @@ define step_pkg_size_inner > cd $(2); \ > find . \( -type f -o -type l \) \ > -newer $@_before \ > - -exec printf '$(1),%s\n' {} + \ > - >> $(BUILD_DIR)/packages-file-list$(3).txt > + |sed -r -e 's/$$/\x00$(1)\x00/' \ > + >> $(BUILD_DIR)/packages-file-list$(3).txt > endef > > define step_pkg_size > diff --git a/support/scripts/brpkgutil.py b/support/scripts/brpkgutil.py > index d15b18845b..f6ef4b3dca 100644 > --- a/support/scripts/brpkgutil.py > +++ b/support/scripts/brpkgutil.py > @@ -5,6 +5,26 @@ import sys > import subprocess > > > +# Read the binary-opened file object f with \0\n separated records (aka lines). > +# Highly inspired by: > +# https://stackoverflow.com/questions/19600475/how-to-read-records-terminated-by-custom-separator-from-file-in-python > +def _readlines0n(f): > + buf = b'' > + while True: > + newbuf = f.read(1048576) I would find 1024 * 1024 more readable. > + if not newbuf: > + if buf: > + yield buf > + return > + if buf is None: > + buf = b'' > + buf += newbuf > + lines = buf.split(b'\x00\n') > + for line in lines[:-1]: > + yield line > + buf = lines[-1] > + > + > # Iterate on all records of the packages-file-list file passed as filename > # Returns an iterator over a list of dictionaries. Each dictionary contains > # these keys (others maybe added in the future): > @@ -12,11 +32,11 @@ import subprocess > # 'pkg': the last package that installed that file > def parse_pkg_file_list(path): > with open(path, 'rb') as f: I now understand why you read as binary. > - for rec in f.readlines(): > - l = rec.split(',0') I still think this ',0' was wrong. > + for rec in _readlines0n(f): > + srec = rec.split(b'\x00') > d = { > - 'file': l[0], > - 'pkg': l[1], > + 'file': srec[0], > + 'pkg': srec[1], and I now see how the swap in a previous commit could go unnoticed in your testing :-) /Thomas
Thomas, All, On 2019-01-08 16:07 +0100, Thomas De Schampheleire spake thusly: > El lun., 7 ene. 2019 a las 23:06, Yann E. MORIN > (<yann.morin.1998@free.fr>) escribió: > > The existing format for the packages-files lists has two drawbacks: [--SNIP--] > > AS such, introduce a new format for those files, that solves both > > issues. [--SNIP--] > > +# Read the binary-opened file object f with \0\n separated records (aka lines). > > +# Highly inspired by: > > +# https://stackoverflow.com/questions/19600475/how-to-read-records-terminated-by-custom-separator-from-file-in-python > > +def _readlines0n(f): > > + buf = b'' > > + while True: > > + newbuf = f.read(1048576) > I would find 1024 * 1024 more readable. We definitely have a different eyesight, as I find this easier to grok! ;-) I should note that we absolutely do not care about the size we buffer. We could very well read it whole, or use smaller chunks (the original inspiration read 4kiB blocks). But I can switch to 1024*1024. [--SNIP--] > > # Iterate on all records of the packages-file-list file passed as filename > > # Returns an iterator over a list of dictionaries. Each dictionary contains > > # these keys (others maybe added in the future): > > @@ -12,11 +32,11 @@ import subprocess > > # 'pkg': the last package that installed that file > > def parse_pkg_file_list(path): > > with open(path, 'rb') as f: > I now understand why you read as binary. Even though here we really want to read a binary file, we already wanted to in the previous patch, because filenames can be any binary sequence. > > - for rec in f.readlines(): > > - l = rec.split(',0') > I still think this ',0' was wrong. Yes it was, I'll fix it (well, I already fixed it now). > > + for rec in _readlines0n(f): > > + srec = rec.split(b'\x00') > > d = { > > - 'file': l[0], > > - 'pkg': l[1], > > + 'file': srec[0], > > + 'pkg': srec[1], > and I now see how the swap in a previous commit could go unnoticed in > your testing :-) Yeah. Thanks for spotting it. I'll still fix the interim patches to be consistent. Thank you! :-) Regards, Yann E. MORIN.
diff --git a/package/pkg-generic.mk b/package/pkg-generic.mk index 7daea190a6..d261b5bf76 100644 --- a/package/pkg-generic.mk +++ b/package/pkg-generic.mk @@ -59,6 +59,10 @@ GLOBAL_INSTRUMENTATION_HOOKS += step_time # The suffix is typically empty for the target variant, for legacy backward # compatibility. +# Files are record-formatted, with \0\n as record separator, and \0 as +# field separator. A record is made of these fields: +# - file path +# - package name # $(1): package name # $(2): base directory to search in # $(3): suffix of file (optional) @@ -66,8 +70,8 @@ define step_pkg_size_inner cd $(2); \ find . \( -type f -o -type l \) \ -newer $@_before \ - -exec printf '$(1),%s\n' {} + \ - >> $(BUILD_DIR)/packages-file-list$(3).txt + |sed -r -e 's/$$/\x00$(1)\x00/' \ + >> $(BUILD_DIR)/packages-file-list$(3).txt endef define step_pkg_size diff --git a/support/scripts/brpkgutil.py b/support/scripts/brpkgutil.py index d15b18845b..f6ef4b3dca 100644 --- a/support/scripts/brpkgutil.py +++ b/support/scripts/brpkgutil.py @@ -5,6 +5,26 @@ import sys import subprocess +# Read the binary-opened file object f with \0\n separated records (aka lines). +# Highly inspired by: +# https://stackoverflow.com/questions/19600475/how-to-read-records-terminated-by-custom-separator-from-file-in-python +def _readlines0n(f): + buf = b'' + while True: + newbuf = f.read(1048576) + if not newbuf: + if buf: + yield buf + return + if buf is None: + buf = b'' + buf += newbuf + lines = buf.split(b'\x00\n') + for line in lines[:-1]: + yield line + buf = lines[-1] + + # Iterate on all records of the packages-file-list file passed as filename # Returns an iterator over a list of dictionaries. Each dictionary contains # these keys (others maybe added in the future): @@ -12,11 +32,11 @@ import subprocess # 'pkg': the last package that installed that file def parse_pkg_file_list(path): with open(path, 'rb') as f: - for rec in f.readlines(): - l = rec.split(',0') + for rec in _readlines0n(f): + srec = rec.split(b'\x00') d = { - 'file': l[0], - 'pkg': l[1], + 'file': srec[0], + 'pkg': srec[1], } yield d
The existing format for the packages-files lists has two drawbacks: - it is not very resilient against filenames with weird characters, like \n or \b or spaces... - it is not easily expandable, partly because of the above. AS such, introduce a new format for those files, that solves both issues. First, we must find a way that unambiguously separate fields. There is one single byte that can never ever occur in filenames or package names, i.e. the NULL character. So, we use that as a field separator. Second, we must find a way to unambiguously separate records. Except for \0, any character may occur in filenames, but the other existing field we currently have is the package name, which we do know does not contain any weird byte (e.g. it's basically limited to [[:alnum:]_-]). Thus, we can't use a single character as record separator. A solution is to use \0\n as the record separator. Thirdly, we must ensure that filenames never mess up with our separators. By making the filename the first field, we can be sure that it is properly terminated by a field separator, and that any leading \n does not interfere with a previous field separator to form a spurious record separator. So, the new format is now (without spaces): filename \0 package-name \0\n Update the parser accordingly. Signed-off-by: "Yann E. MORIN" <yann.morin.1998@free.fr> --- package/pkg-generic.mk | 8 ++++++-- support/scripts/brpkgutil.py | 28 ++++++++++++++++++++++++---- 2 files changed, 30 insertions(+), 6 deletions(-)