[Cuis-dev] Questions triggered by #forceChangesToDisk

Andres Valloud ten at smallinteger.com
Tue Sep 17 00:28:21 PDT 2019


Hey Phil, this is neat :).  Let's play the VM development game a bit, I 
think it's helpful to at least give an idea of what it's like.  The same 
principles can be applied to any program, and IME the results are good.

On 9/16/19 23:10, Phil B wrote:
> Andres,
> 
> On Mon, Sep 16, 2019 at 10:32 PM Andres Valloud via Cuis-dev 
> <cuis-dev at lists.cuis.st> wrote:
> 
>     Phil,
> 
>     Regarding your observations on forceChangesToDisk, first a side
>     comment.
>        That there is a method or even a primitive called
>     'blah-blah-flush-blah' does not mean the function fflush()
> 
>     https://pubs.opengroup.org/onlinepubs/9699919799/functions/fflush.html
> 
>     or equivalent is actually called.
> 
> 
> Good point.  Just because it seems logical that's what it would/should 
> be doing doesn't constitute proof.

I've been burned by this sooooo many times... when I hear that tone of 
"well everybody knows that...", it sounds like "famous last words" to 
me.  So, I'd rather not guess anymore and be correct instead :).

> So let's start with a sanity check: <flush appears to be flushing>

Yes, establishing a baseline of expectation here.  Also prevents silly 
problems from wasting tons of time looking in the wrong place.

> Taking a half step back, I ask the question: what's it supposed to be 
> doing?  According to 
> https://github.com/Geal/Squeak-VM/blob/master/platforms/Mac%20OS/vm/Documentation/3.2.2%20Release%20Notes.rtf 
> the file flush primitive was added in (classic) VM 3.0.5 and 'now 
> actually flushes the file via an OS call' as of 3.0.6.

Interesting: how do the dates of the primFlush: method (circa 2001) and 
the 3.0.6 VM correlate?  Is this a case of the Smalltalk image hacks 
going stale while the VM changes underneath them?  As it turns out, both 
primFlush: and that VM are essentially contemporaneous, because the VM is 
from about 2002.

Why would primFlush: retain the comment about xyzOS not doing flush when 
the VM release notes insist that flushing now flushes?  Note also the 
reference to CodeWarrior 5.3, but according to this:

https://en.wikipedia.org/wiki/CodeWarrior

that only ran on Windows and Mac.  Also, surely CodeWarrior usage is 
super obsolete by now.  What's going on in here?...

Side comment: you know, back then there was a POSIX / Single Unix 
Specification, so all that was necessary was to write the VM to POSIX 
(mostly, on Windows you have to do a bit of work for that).  However, 
maybe it can be excused because POSIX / SUS was rather new at the time.

https://en.wikipedia.org/wiki/POSIX

Maybe at the time there wasn't a decent SDK on Mac... no idea.  Either 
way, that's not true today.

> So that gives an 
> indication that it at least *should* correspond to something at the OS 
> level rather than a home grown approximation of flush.

Ok, apparently the intention was there.

> When I look at FilePlugin>>primitiveFileFlush it calls sqFileFlush: with 
> the file as a parameter.  Just one problem: the only implementor I can 
> find is FileSimulatorPlugin which in turn just sends #flush which gets 
> us right back to #primitiveFileFlush (and besides we're not running in 
> the simulator so this isn't applicable)  Not finding anything else 
> interesting in the image, I grep the VM source tree and find sqFileFlush in:
> platforms/Cross/plugins/FilePlugin/sqFilePluginBasicPrims.c (which calls 
> fflush(getFile(f)) and has an interesting comment[1])

Ok, so apparently it *is* calling fflush(), good.

Also, it looks like fflush() is not being called via the FFI, that's 
also good (recall in POSIX essentially everything that looks like a 
function can be a macro, and you can't call macros from an FFI).
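
To make the macro point concrete, here is a minimal sketch of the usual 
workaround when something has to be reachable from an FFI: wrap it in a 
real function with external linkage.  The names ffi_fflush and ffi_errno 
are made up for illustration and do not come from the VM sources.  errno 
is a handy example because it commonly is a macro (in glibc it expands to 
a function call), so there is no plain symbol for an FFI to bind to.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical wrappers.  Each one is a real function with external
   linkage, so an FFI can look the symbol up no matter how the headers
   choose to define fflush or errno. */

int ffi_fflush(FILE *stream)
{
    /* This sketch only handles one specific stream. */
    assert(stream != NULL);

    /* Compiled by the C compiler, so this works whether fflush is a
       plain function or is also provided as a macro. */
    return fflush(stream);
}

int ffi_errno(void)
{
    /* errno may be a macro; the wrapper hides that from the caller. */
    return errno;
}
```

Since the VM calls fflush() from C code in the plugin, the compiler 
resolves all of this and the question never comes up.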

> [1] "fflush() can fail for the same reasons write() can so errors
> must be checked but sqFileFlush() must support being called on
> readonly files for historical reasons so EBADF is ignored"... so
> there's one example of how it could fail but for this particular
> failure case it is ignored in the C code

That's interesting.  I'd verify with a 5 line C program whether fflush() 
really fails with EBADF when given a file open for reading only.  POSIX 
says EBADF is returned when the file handle isn't valid, but what if you 
do pass in a valid file handle?  Shouldn't fflush() be a no-op then?
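
A minimal sketch of that check, assuming a hosted C environment with 
errno support.  The helper name flush_read_only and its return convention 
are mine, not anything from the VM.  It opens a file for reading only, 
calls fflush() on the stream, and hands back the return value and errno.  
It deliberately does not assume which outcome is "right", since finding 
that out per platform is the point of the experiment.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: open `path` for reading only and call fflush()
   on the resulting stream.

   Returns 1 if the file could not be opened at all; otherwise returns
   whatever fflush() returned (0 on success, EOF on failure).  *err
   receives the errno value seen right after the failing call, or 0 if
   fflush() succeeded. */
int flush_read_only(const char *path, int *err)
{
    assert(path != NULL && err != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL) {
        *err = errno;
        return 1;
    }

    errno = 0;
    int rc = fflush(f);
    *err = (rc == 0) ? 0 : errno;   /* capture before fclose() can change it */

    fclose(f);
    return rc;
}
```

Calling this from main() and printing the result together with 
strerror(*err) on each platform of interest shows whether the EBADF 
special case is still needed there.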

> platforms/win32/plugins/FilePlugin/sqWin32FilePrims.c (which 
> calls FlushFileBuffers(FILE_HANDLE(f)) and ignores the return code)

Ignoring return codes is not good at all.  In particular, you can get 
into all sorts of problems by doing that in MSDN land.
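
For contrast, here is a sketch of what not ignoring the result might look 
like on the fflush() side, following the behavior the quoted comment 
describes: errors are reported, except EBADF, which is tolerated.  The 
name checked_flush and its return convention are assumptions for 
illustration, this is not the actual sqFileFlush() code.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical sketch.  Returns 0 if fflush() succeeded or failed with
   EBADF (tolerated, as in the quoted comment); otherwise returns the
   errno value that fflush() left behind. */
int checked_flush(FILE *stream)
{
    assert(stream != NULL);

    errno = 0;
    if (fflush(stream) == 0)
        return 0;

    int err = errno;    /* save it before any other call can clobber it */
    if (err == EBADF)
        return 0;

    /* A failure that did not set errno is reported as a generic I/O
       error rather than silently turned into success. */
    return err != 0 ? err : EIO;
}
```

The caller then gets a value it can turn into a primitive failure, 
instead of a flush that always appears to work.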

> It looks like it will never fail on Windows (regardless of the fact that 
> the call might have)

In MSDN, any time you see "call GetLastError() to see what happened", 
effectively that means any of these circumstances can happen:

https://docs.microsoft.com/en-us/windows/win32/debug/system-error-codes

Note the numbers go up to 15999.  Ok fine they are not all used, but 
still.  And look at this text:

"System Error Codes are very broad: each one can occur in one of many 
hundreds of locations in the system. Consequently, the descriptions of 
these codes cannot be very specific. Use of these codes requires some 
amount of investigation and analysis. You need to note both the 
programmatic and the runtime context in which these errors occur. 
Because these codes are defined in WinError.h for anyone to use, 
sometimes the codes are returned by non-system software. And sometimes 
the code is returned by a function deep in the stack and far removed 
from your code that is handling the error."

In practice, this means "no MSDN function documentation page will list 
what errors can occur when calling it", which means "anything goes". 
This is very much unlike POSIX, i.e. very much unhelpful.

For instance, why is it that I need to care that ReadFile() and 
WriteFile() may fail with ERROR_NO_SYSTEM_RESOURCES when attempting an 
I/O operation at least 64 MB - 32 KB + 16 bytes in size (this figure is 
undocumented), but only when that I/O occurs on mapped drives, and even 
if the mapped drive is local to the machine?

Because e.g.: loading or saving the image fails.

Ah, right.  So now that *ONE* error condition needs special handling. 
Great.  Only 15998 possible values to go.

> but can fail everywhere else depending on the rules 
> of the particular OS.  On Linux, over a dozen possible error codes are 
> given (one of them is the ignored EBADF case) as well as a note that a 
> variety of additional errors can occur depending on the particular 
> object the file descriptor represents.  So reasons: many.

Yes, however, there are O(10) possible errors, not O(10^4).  It's a huge 
difference with the MSDN world.  I'd rather write against the Unix 
subsystem / standard C library on Windows.

> I believe the #forceChangesToDisk hack had a different objective.  The 
> other hack(s) are dealing with flush failure, #forceChangesToDisk 
> appears to predate flush support and/or to deal with the reality at the 
> time that flush alone often wasn't a complete solution.

Ok, so if that's true, then we're dealing with bit rot.
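
One possible reading of "flush alone often wasn't a complete solution", 
offered as an assumption rather than something established in this 
thread: fflush() only hands the stdio buffer over to the OS, and POSIX 
has a separate call, fsync(), to ask the OS to push the data out to the 
storage device.  A sketch combining the two, with a made-up function 
name:

```c
#define _POSIX_C_SOURCE 200809L   /* for fileno() and fsync() */

#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical sketch: flush the stdio buffer, then ask the OS to commit
   the file to storage.  Returns 0 on success, otherwise the errno value
   of the step that failed. */
int flush_to_disk(FILE *stream)
{
    assert(stream != NULL);

    if (fflush(stream) != 0)    /* user-space buffer -> OS */
        return errno;

    int fd = fileno(stream);
    if (fd == -1)
        return errno;

    if (fsync(fd) != 0)         /* OS caches -> storage device */
        return errno;

    return 0;
}
```

If something like this is what the image-side hack was standing in for, 
then it belongs in the VM next to the flush primitive, with its errors 
checked, rather than in Smalltalk code.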

This shows that it is incredibly important to be completely thorough 
when the code is first written, because that is when a good 
understanding of the entire problem is in somebody's head.  If you are 
not thorough today, someone else will have to recreate your state of 
mind tomorrow.  Overall, everybody goes slower.

> I'm as baffled by that cryptic comment as you.

Might as well delete it, then.  It serves no good purpose if it can't be 
tied to anything concrete.

> I did a little more general search trying to find something, anything 
> that might point in a direction that leads to clarity... nothing so far.

The reference in POSIX says fflush() flushes, and the MSDN reference 
says FlushFileBuffers() flushes.  If that covers all platforms, the 
comment needs to go because misbehavior means it's not your problem and 
you can file a bug against the spec.  Provided, of course, that you are 
as certain as you can be that the relevant API is being used correctly, and 
that you can recreate the problem in a small, standalone C program, 
which you will attach to the bug report :).

Andres.

