Patchwork [Fortran] PR 50016: Slow Fortran I/O on Windows and flushing/_commit

login
register
mail settings
Submitter Tobias Burnus
Date Oct. 17, 2011, 12:49 p.m.
Message ID <4E9C2445.2000007@net-b.de>
Download mbox | patch
Permalink /patch/120204/
State New
Headers show

Comments

Tobias Burnus - Oct. 17, 2011, 12:49 p.m.
This patch adds a call to _commit() on _WIN32 for the FLUSH subroutine 
and the FLUSH statement. It removes the _commit from gfortran's buf_flush.

Background:
* gfortran internally buffers the I/O, but it calls the nonbuffering 
open/write/read functions (and not, e.g., fopen/fwrite/fread). On 
Unix/Darwin/Linux system, the changes become immediately visible to all 
processes after a "write()".
* On Windows, there seems to be a file-descriptor specific system buffer 
such that the data only becomes available to other processes after 
either a "_commit" or after closing the file.
* The Windows _commit() is a combination of a (system) buffer flush - as 
a "write()" already implies on POSIX systems, but it also ensures that 
the data ends up on the disk, which on POSIX systems is done via fsync().

Currently, _commit() is also called for the internal buf_flush, which 
can happen very often and thus delays the I/O a lot (cf. PR 50016). With 
this patch, it is only called when the user explicitly called FLUSH(). 
Contrary to non-_WIN32 systems, this also flushes to the hard disk, 
which causes some slow down. However, as the user (should) only rarely 
call the function, it should not pose a real performance issue.

The patch should ensure that the values can be read via other 
file-descriptors within the same program or from other programs; on the 
same time, it prevents excessive _commit() calling as it is done with 
the current code.

An alternative would be not to call _commit() at all, leaving it to the 
user. That would match the fsync() result on POSIX systems, but I think 
having a file not available in the current status for other processes 
causes more confusion that it helps. Thus, I think for FLUSH, one should 
really make sure other processes get the right data.

The patch was build and regtested on x86-64-linux (where it should be a 
noop).
OK for the trunk?

Can someone test it on MinGW/MinGW-64?

Tobias

PS: For the discussion, see 
http://gcc.gnu.org/ml/fortran/2011-10/msg00079.html

PPS: I regard this patch as rather independent of INQUIRE even if the 
issue first occurred with INQUIRE (PR 44698). Thus, it is independent 
from whether one calls "stat()" for an open file or whether one uses a 
different method like seek(SEEK_END) + tell() to obtain the file size.
Janne Blomqvist - Oct. 17, 2011, 3:30 p.m.
On Mon, Oct 17, 2011 at 15:49, Tobias Burnus <burnus@net-b.de> wrote:
> This patch adds a call to _commit() on _WIN32 for the FLUSH subroutine and
> the FLUSH statement. It removes the _commit from gfortran's buf_flush.

Like I argued in this message

http://gcc.gnu.org/ml/fortran/2011-10/msg00094.html ,

I think this is a gross mistake. libgfortran should not require
_commit nor fsync in any situation. Those calls are useful for writing
databases and other applications which must make data integrity
guarantees, and are prepared to pay the performance cost associated
with it. It's absolutely not something a language support library
should do unless the language spec explicitly requires such data
integrity guarantees.

> Background:
> * gfortran internally buffers the I/O, but it calls the nonbuffering
> open/write/read functions (and not, e.g., fopen/fwrite/fread). On
> Unix/Darwin/Linux system, the changes become immediately visible to all
> processes after a "write()".
> * On Windows, there seems to be a file-descriptor specific system buffer
> such that the data only becomes available to other processes after either a
> "_commit" or after closing the file.
> * The Windows _commit() is a combination of a (system) buffer flush - as a
> "write()" already implies on POSIX systems, but it also ensures that the
> data ends up on the disk, which on POSIX systems is done via fsync().

I admit I'm somewhat confused by this issue, and despite extensive
googling I haven't been able to come up with a clear explanation for
the behavior seen in PR 44698. That write() would be buffered on
windows makes no sense to me, and I have found no documentation
supporting this view. The closest what I found was in the
documentation for WriteFile and WriteFileEX (which are the win32
native calls, write() is a wrapper around one of these (the EX version
if available, presumably)):

http://msdn.microsoft.com/en-us/library/windows/desktop/aa365748%28v=vs.85%29.aspx

"When writing to a file, the last write time is not fully updated
until all handles used for writing have been closed. Therefore, to
ensure an accurate last write time, close the file handle immediately
after writing to the file."

This suggests that metadata updates are not immediately visible to
other processes. Or at least it makes sense that it would update the
size at the same time it updates the last write time. Which would
explain why stat("/path/to/file") would show an old size, as the size
hasn't necessarily yet been updated to the directory, although the
file data itself is already transferred to the OS. And, while I
haven't checked it, it would make sense that fstat(fd,...) would
return the up to date info, as that would just need to check the data
via the process local file descriptor table rather than looking up the
metadata via the directory.

That is, I guess that the implementation is something along the lines
of these calls not being buffered (thus no data lost if the process
crashes after WriteFile(EX)), but the kernel maintains a per-handle
metadata cache which is flushed to the filesystem during batched
metadata updates (and close(), _commit(), or process ending), at which
point it also becomes visible to other processes. Or something like
that.

That being said, this is just me speculating. If someone knows better,
please feel free to share.

And, while I'm at it, this kind of "relaxed consistency" is not
unheard of in the unix world either. Consider NFS, where data and
metadata may not be flushed to the server until fsync() or close() is
called, or the attribute cache timeout forces the writeout(?), and
thus it's possible for clients to have an inconsistent view of a file.

In both cases the remedy is the same; if this kind of consistency
matters, the user should close the file or fsync()/_commit() before
expecting that the OS metadata is consistent. I think that's a better
option than sprinkling _commit() all over the library.

So I would rather prefer my own patch from the URL above. Also, I
think it would be nice if we could get this fix into 4.6.2..
Tobias Burnus - Oct. 17, 2011, 4:03 p.m.
Hi Janne,

On 10/17/2011 05:30 PM, Janne Blomqvist wrote:
> On Mon, Oct 17, 2011 at 15:49, Tobias Burnus<burnus@net-b.de>  wrote:
>> This patch adds a call to _commit() on _WIN32 for the FLUSH subroutine and
>> the FLUSH statement. It removes the _commit from gfortran's buf_flush.
> Like I argued in this message http://gcc.gnu.org/ml/fortran/2011-10/msg00094.html, I think this is a gross mistake.
[...]

And I think it is a mistake to not make the data available to other 
processes as it is indicated by the Fortran 2008 standard:

"Execution of a FLUSH statement causes data written to an external le 
to be available to other processes, or causes data placed in an external 
file by means other than Fortran to be available to a READ statement. 
These actions are processor dependent."

Thus, I think it makes sense for FLUSH to call _commit on Windows.

If you don't want to have a slow down: Simply do not call FLUSH.

> libgfortran should not require _commit nor fsync in any situation. Those calls are useful for writing databases and other applications which must make data integrity guarantees, and are prepared to pay the performance cost associated with it. It's absolutely not something a language support library should do unless the language spec explicitly requires such data integrity guarantees.

Well, Fortran does not need to write the data to the file, however, the 
purpose of FLUSH is that I can, e.g., run execute_command_line with the 
file the program just has written. It will work on Unix/Linux but not on 
MinGW/MinGW-w64 without a _commit (or without closing the file).

> That write() would be buffered on windows makes no sense to me

Why shouldn't it be buffed? Typical Windows programs open files with an 
exclusive lock and as Windows never had the pipes and many small 
programs as Unix did, having a per-file-descriptor buffer is easier to 
implement, avoids multi-thread issues and is potentially faster. If a 
program wants to make the data available, it can just _commit it or 
close the file handle - that way one also has a perfect data integrity.

> And, while I'm at it, this kind of "relaxed consistency" is not
> unheard of in the unix world either. Consider NFS, where data and
> metadata may not be flushed to the server until fsync() or close() is
> called, or the attribute cache timeout forces the writeout(?), and
> thus it's possible for clients to have an inconsistent view of a file.

Well, most of the time it works well on the same system: If I call 
execute_command_line, the data is up to date. The issue with NFS only 
occurs if I want to access the data remotely, which is another issue. If 
one wants to do that, one can use a parallel access with, e.g., HDF5 or 
MPIv2 or the Coarray TS (to be written and implemented).

> In both cases the remedy is the same; if this kind of consistency matters, the user should close the file or fsync()/_commit() before expecting that the OS metadata is consistent. I think that's a better option than sprinkling _commit() all over the library.

No, for the required consistency, FLUSH is enough (including calling 
_commit on MinGW/MinGW-w64). It makes sure that if the program crashes, 
the data is still there, it makes the data available for other processes.

Only if one wants to have complete integrity, one can call fsync. 
However, with NFS, Lustre et al., I am not 100% sure that the data is 
immediately available on all other clients after fsync returned.

> So I would rather prefer my own patch from the URL above. Also, I
> think it would be nice if we could get this fix into 4.6.2..

I also would like to see this fixed for 4.6.2. However, a prerequisite 
is that we agree on how to implement it.

Regarding your patch: I think it does not solve the FLUSH issue. For the 
file size itself, I think the patch is okay, but frankly, I think for 
the performance it does not really matter which approach is taken. And I 
do not like the test-suite part of your patch.

Tobias
Janne Blomqvist - Oct. 17, 2011, 8:52 p.m.
On Mon, Oct 17, 2011 at 19:03, Tobias Burnus <burnus@net-b.de> wrote:
> Hi Janne,
>
> On 10/17/2011 05:30 PM, Janne Blomqvist wrote:
>>
>> On Mon, Oct 17, 2011 at 15:49, Tobias Burnus<burnus@net-b.de>  wrote:
>>>
>>> This patch adds a call to _commit() on _WIN32 for the FLUSH subroutine
>>> and
>>> the FLUSH statement. It removes the _commit from gfortran's buf_flush.
>>
>> Like I argued in this message
>> http://gcc.gnu.org/ml/fortran/2011-10/msg00094.html, I think this is a gross
>> mistake.
>
> [...]
>
> And I think it is a mistake to not make the data available to other
> processes as it is indicated by the Fortran 2008 standard:
>
> "Execution of a FLUSH statement causes data written to an external le to be
> available to other processes, or causes data placed in an external file by
> means other than Fortran to be available to a READ statement. These actions
> are processor dependent."
>
> Thus, I think it makes sense for FLUSH to call _commit on Windows.

I'm not actually sure we can draw such conclusions. What we know is
that metadata updates to the directory are delayed. It wouldn't
surprise me if opening a file (which presumably is a atomic operation
in order to avoid race conditions just like on POSIX) forces the
kernel to sync metadata of other handles to the same file.

> If you don't want to have a slow down: Simply do not call FLUSH.

There are legitimate reasons flush which do not entail storing dirty
data to stable storage. Which is why things like fflush() and fsync()
are separate.

>> libgfortran should not require _commit nor fsync in any situation. Those
>> calls are useful for writing databases and other applications which must
>> make data integrity guarantees, and are prepared to pay the performance cost
>> associated with it. It's absolutely not something a language support library
>> should do unless the language spec explicitly requires such data integrity
>> guarantees.
>
> Well, Fortran does not need to write the data to the file, however, the
> purpose of FLUSH is that I can, e.g., run execute_command_line with the file
> the program just has written. It will work on Unix/Linux but not on
> MinGW/MinGW-w64 without a _commit (or without closing the file).

Do we have any evidence of this failing in reality?

>> That write() would be buffered on windows makes no sense to me
>
> Why shouldn't it be buffed?

Because language runtimes already provide buffered IO so there's no
reason for what is ostensibly a syscall to provide an extra layer of
user space buffering. Secondly, AFAIK once a write is finished, the
data is there even if the application crashes, which suggests that the
data is transferred to the kernel page cache just like on other OS'es.

To clarify, I was talking about a per-process user space buffer, like
C stdio or indeed the internal buffering in libgfortran. Of course,
like any remotely modern OS Windows does aggressive buffering of files
in the kernel using whatever memory is available.

> Typical Windows programs open files with an
> exclusive lock and as Windows never had the pipes and many small programs as
> Unix did, having a per-file-descriptor buffer is easier to implement, avoids
> multi-thread issues and is potentially faster.

The fact that Windows, like many other OS'es, supports things like
shared libraries and shared file mappings, strongly suggests that it
has a unified page cache for file data, just like Linux. Also, an OS
that would waste memory on caching the same file multiple times would
not be performance competitive in today's world.

>> And, while I'm at it, this kind of "relaxed consistency" is not
>> unheard of in the unix world either. Consider NFS, where data and
>> metadata may not be flushed to the server until fsync() or close() is
>> called, or the attribute cache timeout forces the writeout(?), and
>> thus it's possible for clients to have an inconsistent view of a file.
>
> Well, most of the time it works well on the same system:

Yes, most of the time it works just fine. If we're going to do
something with massive performance implications, IMHO we'd better make
sure it works always and not just most of the time.

>> In both cases the remedy is the same; if this kind of consistency matters,
>> the user should close the file or fsync()/_commit() before expecting that
>> the OS metadata is consistent. I think that's a better option than
>> sprinkling _commit() all over the library.
>
> No, for the required consistency, FLUSH is enough (including calling _commit
> on MinGW/MinGW-w64). It makes sure that if the program crashes, the data is
> still there, it makes the data available for other processes.

Like I argued above, I'd be VERY surprised if written data is actually
lost if the application crashes without first doing a _commit. Also, I
find it likely that opening a file will actually provide the correct
up-to-date metadata even though the directory might contain slightly
stale metadata.

> Only if one wants to have complete integrity, one can call fsync. However,
> with NFS, Lustre et al., I am not 100% sure that the data is immediately
> available on all other clients after fsync returned.

With NFS, no. fsync() just guarantees that data is committed on the
server; other clients will notice it only when their attribute caches
timeout and they refetch the attributes for the file.

For Lustre, yes, since Lustre implements a full cache coherency protocol.

> Regarding your patch: I think it does not solve the FLUSH issue.

Why not?

> For the
> file size itself, I think the patch is okay, but frankly, I think for the
> performance it does not really matter which approach is taken.

I do think performance matters. The performance of fsync()/_commit()
is limited by the rotation speed of the HD; with a normal 7.2k SATA
disk one can do maybe 100 fsync()'s per second. A terribly small
number, that is, if one compares to the performance of the CPU and
memory on a modern PC. For comparison, I recall I was able to do
something like 10**6 or was it even 10**7 simple syscalls per second.

> And I do not
> like the test-suite part of your patch.

?? Well, with the rest of the patch the FLUSH is unnecessary, but I'm
sure it won't do any great harm if it's kept.
Janne Blomqvist - Oct. 19, 2011, 3:24 p.m.
On Mon, Oct 17, 2011 at 23:52, Janne Blomqvist
<blomqvist.janne@gmail.com> wrote:
> On Mon, Oct 17, 2011 at 19:03, Tobias Burnus <burnus@net-b.de> wrote:
>> Hi Janne,
>>
>> On 10/17/2011 05:30 PM, Janne Blomqvist wrote:
>>>
>>> On Mon, Oct 17, 2011 at 15:49, Tobias Burnus<burnus@net-b.de>  wrote:
>>>>
>>>> This patch adds a call to _commit() on _WIN32 for the FLUSH subroutine
>>>> and
>>>> the FLUSH statement. It removes the _commit from gfortran's buf_flush.
>>>
>>> Like I argued in this message
>>> http://gcc.gnu.org/ml/fortran/2011-10/msg00094.html, I think this is a gross
>>> mistake.
>>
>> [...]
>>
>> And I think it is a mistake to not make the data available to other
>> processes as it is indicated by the Fortran 2008 standard:
>>
>> "Execution of a FLUSH statement causes data written to an external le to be
>> available to other processes, or causes data placed in an external file by
>> means other than Fortran to be available to a READ statement. These actions
>> are processor dependent."
>>
>> Thus, I think it makes sense for FLUSH to call _commit on Windows.
>
> I'm not actually sure we can draw such conclusions. What we know is
> that metadata updates to the directory are delayed. It wouldn't
> surprise me if opening a file (which presumably is a atomic operation
> in order to avoid race conditions just like on POSIX) forces the
> kernel to sync metadata of other handles to the same file.

I did some further googling, and

http://stackoverflow.com/questions/2883691/fflush-on-stdout

seems to suggest that the my explanation above is roughly what is
happening. That is, opening and closing the file in another program,
"type" in the link above, flushes the metadata to the directory.  Also
from  the MSDN link referenced there

"The only guarantee about a file timestamp is that the file time is
correctly reflected when the handle that makes the change is closed. "

(which would suggest that the same hold for other file metadata as
well, such as the size).


Explanation of file caching in Windows:
http://msdn.microsoft.com/en-us/library/aa364218%28v=VS.85%29.aspx

Some MS presentation about how to make apps behave nicely with SMB:
http://mschnlnine.vo.llnwd.net/d1/pdc08/PPTX/ES23.pptx

Under the heading of "Platform Support for Metadata Caching":

"Metadata caching is best effort and there
are very limited consistency guarantees

Metadata caches expire after a fixed time
"

(Though it's not entirely clear if the above "platform support" means
only SMB or Windows filesystem semantics in general.)

So, I think the picture is essentially:

- write() (which is a wrapper around WriteFile(EX)) transfers data and
metadata to the system cache (kernel page cache in unix terminology).

- However, the metadata is written to the directory lazily, allowing
applications that do stat() (or the equivalent Win32 API call(s)) on a
pathname to see stale data. This, incidentally, is what gfortran is
doing and what caused the issue that led to the introduction of
_commit in the first place.

- Closing the file, or _commit() (a wrapper around
FlushFileBuffers()), or (per the stackoverflow link above)
opening/closing the file in another process will force the directory
flush immediately.

Some investigation into how other language support libraries handle this:

- For MS-DOS compatibility (back when OS file caching wasn't that
advanced, one presumes), the MS C runtime allows the user to link in
an extra object COMMODE.OBJ which makes fflush() also call _commit()
recreating the MS-DOS behavior. By default, however, this is not done.
Also, there are some nonstandard flags that can be passed to fopen()
to indicate that one wants the MS-DOS behavior.

- For the MS C++ compiler, the same COMMODE.OBJ linking can be done,
and there is some nonstandard extension allowing one to get the fd
from a C++ stream and then call _commit on it. By default flush() on a
stream does not call _commit/FlushFileBuffers, it just does a
WriteFile().

- For .NET, there is the FileStream.Flush() method, which flushes the
user-space buffer without calling _commit/FlushFileBuffers. In the
latest version of .NET, there is a new FileStream.Flush(bool) method
which can be used to call FlushFileBuffers. In previous .NET versions,
people used PInvoke (the native code calling interface) to call
FlushFileBuffers if needed.

- For the runtimes provided with GCC, grepping the source tree shows
that libgfortran is the only occurence of _commit. In libjava there is
a FlushFileBuffers call, but it's #if 0'ed away.

That is, except for libgfortran, no other language runtime by default
makes an effort to synchronize the metadata.

In conclusion, I still think that my previous patch which got rid of
the _commit and the reliance on using stat() via pathname is the
correct approach. "dir" might show stale data for the file size, but
this seems to be the norm on Windows, and opening the file in another
process will give the correct data. So in practice I don't think there
will be any problems from getting rid of _commit.

Patch

2011-10-17  Tobias Burnus  <burnus@net-b.de>

	PR fortran/50016
	* intrinsic.texi (FLUSH): Document that it calls _commit
	on MinGW(-w64).

2011-10-17  Tobias Burnus  <burnus@net-b.de>

	PR fortran/50016
	* io/file_pos.c (st_flush): Call _commit on MinGW(-w64).
	* io/intrinsics.c (flush_i4, flush_i8): Ditto.
	* io/unix.c (flush_all_units_1, flush_all_units): Ditto.
	(buf_flush): Remove _commit call.

diff --git a/gcc/fortran/intrinsic.texi b/gcc/fortran/intrinsic.texi
index 11f87a5..1a0e930 100644
--- a/gcc/fortran/intrinsic.texi
+++ b/gcc/fortran/intrinsic.texi
@@ -4775,7 +4775,10 @@  that the data is committed to disk.
 On POSIX systems, you can request that all data is transferred  to  the
 storage device by calling the @code{fsync} function, with the POSIX file
 descriptor of the I/O unit as argument (retrieved with GNU intrinsic
-@code{FNUM}). The following example shows how:
+@code{FNUM}). The following example shows how. On Windows (MinGW and
+MinGW-w64), the @code{FLUSH} subroutine and @code{FLUSH} statement always
+call @code{_commit} to ensure that other file descriptors see the current
+file status; @code{_commit} additionally flushes the data to the disk.
 
 @smallexample
   ! Declare the interface for POSIX fsync function
diff --git a/libgfortran/io/file_pos.c b/libgfortran/io/file_pos.c
index b034f38..4efdc1f 100644
--- a/libgfortran/io/file_pos.c
+++ b/libgfortran/io/file_pos.c
@@ -452,6 +452,10 @@  st_flush (st_parameter_filepos *fpp)
         fbuf_flush (u, u->mode);
 
       sflush (u->s);
+#ifdef _WIN32
+      /* Without _commit, changes are not visible to other file descriptors.  */
+      _commit (u->s->fd);
+#endif
       unlock_unit (u);
     }
   else
diff --git a/libgfortran/io/intrinsics.c b/libgfortran/io/intrinsics.c
index f48bd77..eaf2e17 100644
--- a/libgfortran/io/intrinsics.c
+++ b/libgfortran/io/intrinsics.c
@@ -207,6 +207,11 @@  flush_i4 (GFC_INTEGER_4 *unit)
       if (us != NULL)
 	{
 	  sflush (us->s);
+#ifdef _WIN32
+	  /* Without _commit, changes are not visible
+	     to other file descriptors.  */
+	  _commit (u->s->fd);
+#endif
 	  unlock_unit (us);
 	}
     }
@@ -230,6 +235,11 @@  flush_i8 (GFC_INTEGER_8 *unit)
       if (us != NULL)
 	{
 	  sflush (us->s);
+#ifdef _WIN32
+	  /* Without _commit, changes are not visible
+	     to other file descriptors.  */
+	  _commit (u->s->fd);
+#endif
 	  unlock_unit (us);
 	}
     }
diff --git a/libgfortran/io/unix.c b/libgfortran/io/unix.c
index 25cb559..a88e83b 100644
--- a/libgfortran/io/unix.c
+++ b/libgfortran/io/unix.c
@@ -443,10 +443,6 @@  buf_flush (unix_stream * s)
   if (s->ndirty != 0)
     return -1;
 
-#ifdef _WIN32
-  _commit (s->fd);
-#endif
-
   return 0;
 }
 
@@ -1531,7 +1527,14 @@  flush_all_units_1 (gfc_unit *u, int min_unit)
 	  if (__gthread_mutex_trylock (&u->lock))
 	    return u;
 	  if (u->s)
-	    sflush (u->s);
+	    {
+	      sflush (u->s);
+#ifdef _WIN32
+	      /* Without _commit, changes are not visible to other
+		 file descriptors.  */
+	      _commit (u->s->fd);
+#endif
+	    }
 	  __gthread_mutex_unlock (&u->lock);
 	}
       u = u->right;
@@ -1562,6 +1565,11 @@  flush_all_units (void)
       if (u->closed == 0)
 	{
 	  sflush (u->s);
+#ifdef _WIN32
+	  /* Without _commit, changes are not visible to other
+	     file descriptors.  */
+	  _commit (u->s->fd);
+#endif
 	  __gthread_mutex_lock (&unit_lock);
 	  __gthread_mutex_unlock (&u->lock);
 	  (void) predec_waiting_locked (u);