diff mbox

glob: Simplify the interface for the GLOB_ALTDIRFUNC callback gl_readdir

Message ID 570BD76D.5070907@redhat.com
State New
Headers show

Commit Message

Florian Weimer April 11, 2016, 4:57 p.m. UTC
On 04/08/2016 08:44 PM, Roland McGrath wrote:
> This is a potentially breaking ABI and API change.  The documentation does
> not say anything specific about what the gl_readdir function's protocol is.
> It just says "an alternative implementation of readdir".  The way I'd read
> that, and the reality of the status quo ante, is that a trivial wrapper
> around readdir will behave identically to not using GLOB_ALTDIRFUNC at all.
> That means that d_ino==0 results must be ignored.
>
> IMHO this means we need symbol versioning to make a change to the treatment
> of d_ino.

I'm trying to get clarification if getdents on Linux can ever return 
zero d_ino values.  It's going to be difficult to work this out due to 
the VFS layer.  My hunch is that inode 0 is more likely to mean “I can't 
tell you the inode number right now here”, and not “please skip this entry”.

In a sense, with glob64, supplying readdir64 as the callback does not 
actually change behavior if the d_ino == 0 is left out because that 
readdir64 implementation would not be POSIX-compliant (d_ino has to be 
the file serial number if the member exists).

I think we have to filter in readdir in order to comply with POSIX if 
the alleged historic race (d_ino is 0 during removal) can happen on 
Linux or Hurd.

> It's a sufficiently conservative change to start ignoring d_namlen,
> so you could do that separately.

Indeed, this should be a different conversation.  I tried to word the 
new documentation in a way that leaves the door open for that.

Florian

Comments

Roland McGrath April 11, 2016, 9:19 p.m. UTC | #1
> I'm trying to get clarification if getdents on Linux can ever return 
> zero d_ino values.  It's going to be difficult to work this out due to 
> the VFS layer.  My hunch is that inode 0 is more likely to mean  I can't 
> tell you the inode number right now here , and not  please skip this entry .

I don't know where your hunch comes from, but I don't believe it for a
second.  The d_ino==0 convention has been universal in Unix since before
Linux existed.  Any filesystem that returns d_ino==0 entries for getdents
will have those entries completely ignored by userland, so I doubt anyone
has implemented a filesystem that way.

> In a sense, with glob64, supplying readdir64 as the callback does not 
> actually change behavior if the d_ino == 0 is left out because that 
> readdir64 implementation would not be POSIX-compliant (d_ino has to be 
> the file serial number if the member exists).
>
> I think we have to filter in readdir in order to comply with POSIX if 
> the alleged historic race (d_ino is 0 during removal) can happen on 
> Linux or Hurd.

It was never a race.  In filesystems where it happened, it was the case
that removing an entry often consisted of nothing but clearing its d_ino
field in the on-disk directory.  Only some subsequent operation on the same
directory, like creating a new entry, might overwrite deleted entries or
cause a compaction.  And originally, the on-disk directory format was just
read directly using 'read' (there was no separate getdirentries/getdents
system call), so it was a userland responsibility to ignore the deleted
entries.

Hmm.  Perhaps no readdir function has actually returned d_ino==0 entries
since before 1003.1-1988.  The GNU implementations have indeed always
filtered out such entries.  But code using readdir has always had the
checks.

I don't have the ancient 1003.1 editions in front of me right now (can
probably check next week when I'm back home).  I'm not sure whether there
was ever a standard for readdir that said anything about d_ino different
from what the current POSIX.1 says (which indeed disallows d_ino==0
entries).  My recollection is that early versions of 1003.1 did not specify
d_ino/d_fileno at all.

> +An implementation of @code{gl_readdir} should initialize the
> +@code{struct dirent} object to zero, up to the @code{d_name} member.
> +The @code{d_ino} member must be set to a non-zero value, otherwise
> +@code{glob} may ignore the returned directory entry.  

This wording is awkward.  It seems to contradict itself, as it first
implies that d_ino should be zero and then says it must be nonzero.  I'm
not convinced this "zero the object" advice is the way to go anyway.  If
there is struct padding in struct dirent, it doesn't matter that it be
zero.  If it has d_namlen, it (now) doesn't matter what it contains at all.
Perhaps it's better to say just which fields glob examines and that the
protocol requires only that those fields be set.

Regardless of what we settle on as best practice, I think we want a little
code example here that demonstrates it.

> +The @code{struct dirent} object can be overwritten by a subsequent call
> +to the @code{gl_readdir} callback function on the same directory stream
> +created by the @code{gl_opendir} callback function.  It should be
> +deallocated by the @code{gl_closedir} callback function.

I think it would be better to word this in terms of what glob does, rather
than what the callback should do.  e.g.

@code{glob} examines the @code{struct dirent} object only immediately after
each call to your @code{gl_readdir} callback function and does not store
the pointer returned.  That pointer is not expected to remain valid after a
subsequent call to the @code{gl_readdir} or @code{gl_closedir} callback.  A
common way to implement these callbacks it to reuse the same buffer each
time the @code{gl_readdir} function is called and free the buffer in the
@code{gl_closedir} function.


The code changes look fine to me.


Thanks,
Roland
Paul Eggert April 11, 2016, 10:27 p.m. UTC | #2
On 04/11/2016 02:19 PM, Roland McGrath wrote:
> I'm not sure whether there
> was ever a standard for readdir that said anything about d_ino different
> from what the current POSIX.1 says (which indeed disallows d_ino==0
> entries).

I don't think there was either.

> My recollection is that early versions of 1003.1 did not specify
> d_ino/d_fileno at all.

Yes, that's right. I just checked, and 1003.1-1988 and 1003.1b-1993 both 
specify only d_name.
Florian Weimer April 12, 2016, 12:16 p.m. UTC | #3
On 04/12/2016 12:27 AM, Paul Eggert wrote:
> On 04/11/2016 02:19 PM, Roland McGrath wrote:
>> I'm not sure whether there
>> was ever a standard for readdir that said anything about d_ino different
>> from what the current POSIX.1 says (which indeed disallows d_ino==0
>> entries).
>
> I don't think there was either.

Can we remove the d_ino checking in glob from glibc at least (in a 
future patch)?

Thanks,
Florian
Paul Eggert April 13, 2016, 12:01 a.m. UTC | #4
On 04/12/2016 05:16 AM, Florian Weimer wrote:
> Can we remove the d_ino checking in glob from glibc at least (in a 
> future patch)? 
I have no objection.
diff mbox

Patch

glob: Simplify the interface for the GLOB_ALTDIRFUNC callback gl_readdir

Previously, application code had to set up d_ino and d_namlen members
if the platform supported them, involving conditional compilation.
DT_UNKNOWN is zero, so zero-initialization of struct dirent will set
the d_type member to a conservative value.

Changing the behavior with regards to d_ino is left to a future
cleanup.

2016-04-11  Florian Weimer  <fweimer@redhat.com>

	glob: Simplify and document the interface for the GLOB_ALTDIRFUNC
	callback function gl_readdir.
	* posix/glob.c (NAMELEN, CONVERT_D_NAMLEN): Remove.
	(CONVERT_DIRENT_DIRENT64): Use strcpy instead of memcpy.
	(glob_in_dir): Remove len.  Use strdup instead of malloc and
	memcpy to copy the name.
	* posix/bug-glob2.c (my_readdir): Clear struct dirent object.  Set
	d_ino to 1.
	* posix/tst-gnuglob.c (my_readdir): Likewise.
	* manual/pattern.texi (Calling Glob): Document requirements for
	implementations of the gl_readdir callback function.

diff --git a/manual/pattern.texi b/manual/pattern.texi
index d1b9275..fa5ecd7 100644
--- a/manual/pattern.texi
+++ b/manual/pattern.texi
@@ -237,6 +237,19 @@  function used to read the contents of a directory.  It is used if the
 @code{GLOB_ALTDIRFUNC} bit is set in the flag parameter.  The type of
 this field is @w{@code{struct dirent *(*) (void *)}}.
 
+An implementation of @code{gl_readdir} should initialize the
+@code{struct dirent} object to zero, up to the @code{d_name} member.
+The @code{d_ino} member must be set to a non-zero value, otherwise
+@code{glob} may ignore the returned directory entry.  As an
+optimization, it may set the @code{d_type} member if the file type of
+the entry is known.  The @code{d_name} member must be null-terminated;
+the entire string is used by @code{glob}.
+
+The @code{struct dirent} object can be overwritten by a subsequent call
+to the @code{gl_readdir} callback function on the same directory stream
+created by the @code{gl_opendir} callback function.  It should be
+deallocated by the @code{gl_closedir} callback function.
+
 This is a GNU extension.
 
 @item gl_opendir
diff --git a/posix/bug-glob2.c b/posix/bug-glob2.c
index ddf5ec9..8c377b6 100644
--- a/posix/bug-glob2.c
+++ b/posix/bug-glob2.c
@@ -193,7 +193,8 @@  my_readdir (void *gdir)
       return NULL;
     }
 
-  dir->d.d_ino = dir->idx;
+  memset (&dir->d, 0, offsetof (struct dirent, d_name));
+  dir->d.d_ino = 1;		/* glob should not skip this entry.  */
 
 #ifdef _DIRENT_HAVE_D_TYPE
   dir->d.d_type = filesystem[dir->idx].type;
@@ -202,13 +203,11 @@  my_readdir (void *gdir)
   strcpy (dir->d.d_name, filesystem[dir->idx].name);
 
 #ifdef _DIRENT_HAVE_D_TYPE
-  PRINTF ("my_readdir ({ level: %d, idx: %ld }) = { d_ino: %ld, d_type: %d, d_name: \"%s\" }\n",
-	  dir->level, (long int) dir->idx, dir->d.d_ino, dir->d.d_type,
-	  dir->d.d_name);
+  PRINTF ("my_readdir ({ level: %d, idx: %ld }) = { d_type: %d, d_name: \"%s\" }\n",
+	  dir->level, (long int) dir->idx, dir->d.d_type, dir->d.d_name);
 #else
-  PRINTF ("my_readdir ({ level: %d, idx: %ld }) = { d_ino: %ld, d_name: \"%s\" }\n",
-	  dir->level, (long int) dir->idx, dir->d.d_ino,
-	  dir->d.d_name);
+  PRINTF ("my_readdir ({ level: %d, idx: %ld }) = { d_name: \"%s\" }\n",
+	  dir->level, (long int) dir->idx, dir->d.d_name);
 #endif
 
   ++dir->idx;
diff --git a/posix/glob.c b/posix/glob.c
index 0c04c3c..9ae76ac 100644
--- a/posix/glob.c
+++ b/posix/glob.c
@@ -57,10 +57,8 @@ 
 
 #if defined HAVE_DIRENT_H || defined __GNU_LIBRARY__
 # include <dirent.h>
-# define NAMLEN(dirent) strlen((dirent)->d_name)
 #else
 # define dirent direct
-# define NAMLEN(dirent) (dirent)->d_namlen
 # ifdef HAVE_SYS_NDIR_H
 #  include <sys/ndir.h>
 # endif
@@ -76,12 +74,6 @@ 
 #endif
 
 
-/* In GNU systems, <dirent.h> defines this macro for us.  */
-#ifdef _D_NAMLEN
-# undef NAMLEN
-# define NAMLEN(d) _D_NAMLEN(d)
-#endif
-
 /* When used in the GNU libc the symbol _DIRENT_HAVE_D_TYPE is available
    if the `d_type' member for `struct dirent' is available.
    HAVE_STRUCT_DIRENT_D_TYPE plays the same role in GNULIB.  */
@@ -105,12 +97,6 @@ 
 
 /* If the system has the `struct dirent64' type we use it internally.  */
 #if defined _LIBC && !defined COMPILE_GLOB64
-# if defined HAVE_DIRENT_H || defined __GNU_LIBRARY__
-#  define CONVERT_D_NAMLEN(d64, d32)
-# else
-#  define CONVERT_D_NAMLEN(d64, d32) \
-  (d64)->d_namlen = (d32)->d_namlen;
-# endif
 
 # if (defined POSIX || defined WINDOWS32) && !defined __GNU_LIBRARY__
 #  define CONVERT_D_INO(d64, d32)
@@ -127,8 +113,7 @@ 
 # endif
 
 # define CONVERT_DIRENT_DIRENT64(d64, d32) \
-  memcpy ((d64)->d_name, (d32)->d_name, NAMLEN (d32) + 1);		      \
-  CONVERT_D_NAMLEN (d64, d32)						      \
+  strcpy ((d64)->d_name, (d32)->d_name);				      \
   CONVERT_D_INO (d64, d32)						      \
   CONVERT_D_TYPE (d64, d32)
 #endif
@@ -1554,7 +1539,6 @@  glob_in_dir (const char *pattern, const char *directory, int flags,
 	  while (1)
 	    {
 	      const char *name;
-	      size_t len;
 #if defined _LIBC && !defined COMPILE_GLOB64
 	      struct dirent64 *d;
 	      union
@@ -1622,12 +1606,10 @@  glob_in_dir (const char *pattern, const char *directory, int flags,
 			  names = newnames;
 			  cur = 0;
 			}
-		      len = NAMLEN (d);
-		      names->name[cur] = (char *) malloc (len + 1);
+		      names->name[cur] = strdup (d->d_name);
 		      if (names->name[cur] == NULL)
 			goto memory_error;
-		      *((char *) mempcpy (names->name[cur++], name, len))
-			= '\0';
+		      ++cur;
 		      ++nfound;
 		    }
 		}
diff --git a/posix/tst-gnuglob.c b/posix/tst-gnuglob.c
index 992b997..184e030 100644
--- a/posix/tst-gnuglob.c
+++ b/posix/tst-gnuglob.c
@@ -211,7 +211,8 @@  my_readdir (void *gdir)
       return NULL;
     }
 
-  dir->d.d_ino = dir->idx;
+  memset (&dir->d, 0, offsetof (struct dirent, d_name));
+  dir->d.d_ino = 1;		/* glob should not skip this entry.  */
 
 #ifdef _DIRENT_HAVE_D_TYPE
   dir->d.d_type = filesystem[dir->idx].type;
@@ -220,13 +221,11 @@  my_readdir (void *gdir)
   strcpy (dir->d.d_name, filesystem[dir->idx].name);
 
 #ifdef _DIRENT_HAVE_D_TYPE
-  PRINTF ("my_readdir ({ level: %d, idx: %ld }) = { d_ino: %ld, d_type: %d, d_name: \"%s\" }\n",
-	  dir->level, (long int) dir->idx, dir->d.d_ino, dir->d.d_type,
-	  dir->d.d_name);
+  PRINTF ("my_readdir ({ level: %d, idx: %ld }) = { d_type: %d, d_name: \"%s\" }\n",
+	  dir->level, (long int) dir->idx, dir->d.d_type, dir->d.d_name);
 #else
-  PRINTF ("my_readdir ({ level: %d, idx: %ld }) = { d_ino: %ld, d_name: \"%s\" }\n",
-	  dir->level, (long int) dir->idx, dir->d.d_ino,
-	  dir->d.d_name);
+  PRINTF ("my_readdir ({ level: %d, idx: %ld }) = { d_name: \"%s\" }\n",
+	  dir->level, (long int) dir->idx, dir->d.d_name);
 #endif
 
   ++dir->idx;