manual: Clarify the documentation of strverscmp [BZ #20524]

Message ID	3147398a-fa03-af37-f0d3-230e46182b51@redhat.com
State	New
Headers	show Return-Path: <libc-alpha-return-73007-incoming=patchwork.ozlabs.org@sourceware.org> DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:subject:to:references:cc:from:message-id:date :mime-version:in-reply-to:content-type; q=dns; s=default; b=xlnf VZvT3p+Q8rRnAeg2znnE5d520fu3E1RzidyY61vF92imVkFC2R1cBJs6Rpm/HtxZ vjXr1op69o42okgdD136VBnerZK8I5xMphllrsz8bVYbE+7hOuxpjfAvy7WQDXUK RI4CONfsmNwTXrPhUvOfoHVWkMlJmXXSKW3AnrM= Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk Sender: libc-alpha-owner@sourceware.org Subject: Re: [PATCH] manual: Clarify the documentation of strverscmp [BZ #20524] To: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> References: <20160829202145.A2548403DC4A9@oldenburg.str.redhat.com> <CALxWeYpXBMo6+d3QTtbz1rM2Jv0Gn+G5OTNDV8Sze06Gp9hnrw@mail.gmail.com> <44b7d954-354c-2407-3079-fd511dd1f510@redhat.com> <929dbcd3-6984-c17f-1193-2e2ac8aeaf14@gmail.com> Cc: libc-alpha <libc-alpha@sourceware.org> From: Florian Weimer <fweimer@redhat.com> Message-ID: <3147398a-fa03-af37-f0d3-230e46182b51@redhat.com> Date: Wed, 31 Aug 2016 13:56:20 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0 MIME-Version: 1.0 In-Reply-To: <929dbcd3-6984-c17f-1193-2e2ac8aeaf14@gmail.com> Content-Type: multipart/mixed; boundary="------------955E289A7A0356B8C26DFA1A"

Message ID

3147398a-fa03-af37-f0d3-230e46182b51@redhat.com

State

New

Headers

DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id
	:list-unsubscribe:list-subscribe:list-archive:list-post
	:list-help:sender:subject:to:references:cc:from:message-id:date
	:mime-version:in-reply-to:content-type; q=dns; s=default; b=xlnf
	VZvT3p+Q8rRnAeg2znnE5d520fu3E1RzidyY61vF92imVkFC2R1cBJs6Rpm/HtxZ
	vjXr1op69o42okgdD136VBnerZK8I5xMphllrsz8bVYbE+7hOuxpjfAvy7WQDXUK
	RI4CONfsmNwTXrPhUvOfoHVWkMlJmXXSKW3AnrM=
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
Sender: libc-alpha-owner@sourceware.org
Subject: Re: [PATCH] manual: Clarify the documentation of strverscmp [BZ
	#20524]
To: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
References: <20160829202145.A2548403DC4A9@oldenburg.str.redhat.com>
	<CALxWeYpXBMo6+d3QTtbz1rM2Jv0Gn+G5OTNDV8Sze06Gp9hnrw@mail.gmail.com>
	<44b7d954-354c-2407-3079-fd511dd1f510@redhat.com>
	<929dbcd3-6984-c17f-1193-2e2ac8aeaf14@gmail.com>
Cc: libc-alpha <libc-alpha@sourceware.org>
From: Florian Weimer <fweimer@redhat.com>
Message-ID: <3147398a-fa03-af37-f0d3-230e46182b51@redhat.com>
Date: Wed, 31 Aug 2016 13:56:20 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
	Thunderbird/45.2.0
MIME-Version: 1.0
In-Reply-To: <929dbcd3-6984-c17f-1193-2e2ac8aeaf14@gmail.com>
Content-Type: multipart/mixed;
	boundary="------------955E289A7A0356B8C26DFA1A"

Commit Message

Florian Weimer Aug. 31, 2016, 11:56 a.m. UTC

On 08/30/2016 08:59 PM, Michael Kerrisk (man-pages) wrote:

>> the other sequence is truncated to be of the same (extended) length, and
>> these two sequences are compared lexicographically.  In this last case,
>
> s/last// ?

Maybe “in the last case”?

>> the sequence comparison determines the result of the function because
>> the extension character (or some character before it) is necessarily
>> different from the character at the same offset in the other input
>> string.

>>>> +@item
>>>> +If both digit sequences have an equal, positive number of leading zeros,
>>>
>>> Why is the word "positive" here?
>>
>> It's used as a synonym for “non-zero” (to discriminate this case from
>> the previous one).
>
> Ahh -- okay. Somehow that wording threw me.
>
> Maybe:
>
> "If both digit sequences start with a zero and have an equal number
> of leading zeros..."

Thanks.

New patch attached.

Perhaps we should just say the comparison happens in an 
implementation-defined manner and is only consistent within a single 
process (just like strcoll).  It would simplify matters and would us 
allow to fix the algorithm. :)

Florian

Comments

Florian Weimer Sept. 8, 2016, 1:10 p.m. UTC | #1

On 08/31/2016 01:56 PM, Florian Weimer wrote:

> Perhaps we should just say the comparison happens in an
> implementation-defined manner and is only consistent within a single
> process (just like strcoll).  It would simplify matters and would us
> allow to fix the algorithm. :)

Any further comments?

I was only half-joking.  Maybe we should indeed leave the exact sorting 
order unspecified.

Florian

Michael Kerrisk \(man-pages\) Sept. 11, 2016, 3:35 p.m. UTC | #2

On Thu, Sep 8, 2016 at 2:10 PM, Florian Weimer <fweimer@redhat.com> wrote:
> On 08/31/2016 01:56 PM, Florian Weimer wrote:
>
>> Perhaps we should just say the comparison happens in an
>> implementation-defined manner and is only consistent within a single
>> process (just like strcoll).  It would simplify matters and would us
>> allow to fix the algorithm. :)
>
>
> Any further comments?

Not really any from my part, except:

> I was only half-joking.  Maybe we should indeed leave the exact sorting
> order unspecified.

I'm not sure you can get away with that. I mean, there might be
applications out there that depend on the current weird behavior(?).

Cheers,

Michael

Florian Weimer Sept. 21, 2016, 1:43 p.m. UTC | #3

On 08/31/2016 01:56 PM, Florian Weimer wrote:

> Perhaps we should just say the comparison happens in an
> implementation-defined manner and is only consistent within a single
> process (just like strcoll).  It would simplify matters and would us
> allow to fix the algorithm. :)

I have committed the posted version (that is, the one that describes the 
behavior in greater detail and stresses that it is unusual).

Thanks,
Florian

manual: Clarify the documentation of strverscmp [BZ #20524]

2016-08-31  Florian Weimer  <fweimer@redhat.com>

	[BZ #20524]
	* manual/string.texi (String/Array Comparison): Clarify the
	strverscmp behavior.

diff --git a/manual/string.texi b/manual/string.texi
index bce81a7..1986357 100644
--- a/manual/string.texi
+++ b/manual/string.texi
@@ -1374,46 +1374,75 @@  The @code{strverscmp} function compares the string @var{s1} against
 @var{s2}, considering them as holding indices/version numbers.  The
 return value follows the same conventions as found in the
 @code{strcmp} function.  In fact, if @var{s1} and @var{s2} contain no
-digits, @code{strverscmp} behaves like @code{strcmp}.
+digits, @code{strverscmp} behaves like @code{strcmp}
+(in the sense that the sign of the result is the same).
 
-Basically, we compare strings normally (byte by byte), until
-we find a digit in each string - then we enter a special comparison
-mode, where each sequence of digits is taken as a whole.  If we reach the
-end of these two parts without noticing a difference, we return to the
-standard comparison mode.  There are two types of numeric parts:
-"integral" and "fractional" (those  begin with a '0').  The types
-of the numeric parts affect the way we sort them:
+The comparison algorithm which the @code{strverscmp} function implements
+differs slightly from other version-comparison algorithms.  The
+implementation is based on a finite-state machine, whose behavior is
+approximated below.
 
 @itemize @bullet
 @item
-integral/integral: we compare values as you would expect.
+The input strings are each split into sequences of non-digits and
+digits.  These sequences can be empty at the beginning and end of the
+string.  Digits are determined by the @code{isdigit} function and are
+thus subject to the current locale.
 
 @item
-fractional/integral: the fractional part is less than the integral one.
-Again, no surprise.
+Comparison starts with a (possibly empty) non-digit sequence.  The first
+non-equal sequences of non-digits or digits determines the outcome of
+the comparison.
 
 @item
-fractional/fractional: the things become a bit more complex.
-If the common prefix contains only leading zeroes, the longest part is less
-than the other one; else the comparison behaves normally.
+Corresponding non-digit sequences in both strings are compared
+lexicographically if their lengths are equal.  If the lengths differ,
+the shorter non-digit sequence is extended with the input string
+character immediately following it (which may be the null terminator),
+the other sequence is truncated to be of the same (extended) length, and
+these two sequences are compared lexicographically.  In the last case,
+the sequence comparison determines the result of the function because
+the extension character (or some character before it) is necessarily
+different from the character at the same offset in the other input
+string.
+
+@item
+For two sequences of digits, the number of leading zeros is counted (which
+can be zero).  If the count differs, the string with more leading zeros
+in the digit sequence is considered smaller than the other string.
+
+@item
+If the two sequences of digits have no leading zeros, they are compared
+as integers, that is, the string with the longer digit sequence is
+deemed larger, and if both sequences are of equal length, they are
+compared lexicographically.
+
+@item
+If both digit sequences start with a zero and have an equal number of
+leading zeros, they are compared lexicographically if their lengths are
+the same.  If the lengths differ, the shorter sequence is extended with
+the following character in its input string, and the other sequence is
+truncated to the same length, and both sequences are compared
+lexicographically (similar to the non-digit sequence case above).
 @end itemize
 
+The treatment of leading zeros and the tie-breaking extension characters
+(which in effect propagate across non-digit/digit sequence boundaries)
+differs from other version-comparison algorithms.
+
 @smallexample
 strverscmp ("no digit", "no digit")
     @result{} 0    /* @r{same behavior as strcmp.} */
 strverscmp ("item#99", "item#100")
     @result{} <0   /* @r{same prefix, but 99 < 100.} */
 strverscmp ("alpha1", "alpha001")
-    @result{} >0   /* @r{fractional part inferior to integral one.} */
+    @result{} >0   /* @r{different number of leading zeros (0 and 2).} */
 strverscmp ("part1_f012", "part1_f01")
-    @result{} >0   /* @r{two fractional parts.} */
+    @result{} >0   /* @r{lexicographical comparison with leading zeros.} */
 strverscmp ("foo.009", "foo.0")
-    @result{} <0   /* @r{idem, but with leading zeroes only.} */
+    @result{} <0   /* @r{different number of leading zeros (2 and 1).} */
 @end smallexample
 
-This function is especially useful when dealing with filename sorting,
-because filenames frequently hold indices/version numbers.
-
 @code{strverscmp} is a GNU extension.
 @end deftypefun

manual: Clarify the documentation of strverscmp [BZ #20524]

Commit Message

Comments

Patch