[Cuis-dev] [Vm-dev] Cuis on a RISC

Boris Shingarov shingarov at labware.com
Mon Jun 20 08:56:21 PDT 2022


Hi Eliot,

> Boris, Tom & Leon would you care to comment?

Of course.  I hadn't joined this thread earlier simply because last
week I was 100% busy with Camp Smalltalk and the post-Camp hackathon.
(BTW, with Slavik's two talks set in stone in the schedule, I kind of
assumed I would see you there too... it was a huge disappointment when
I learned neither you nor Slavik were there!)

> Ryan Macnak did a 32-bit MIPS simulator; it's still in VMMaker.oscog,
> but it needs a little love to make it simulate fully.

I try to differentiate between the simulator (MIPSSimulator) and the
compiler (CogMIPSELCompiler).  On the simulator side, I think the
solution is not to write more simulators as more ISAs become important,
but to connect to the work into which those ISAs' communities have
already poured millions of engineer-hours.  Look, even x86, which has
been *the* main Cog target for decades, is not simulated faithfully by
the Alien (one example which killed me back in the day: EIP is wrong
after a SEGV).  There is no hope of our getting correct simulation of
RISC-V, OpenPOWER, AArch64 and MIPS with the resources we have, much
less under the 2020s' definition of correct (at least co-simulation
against a reference implementation?)  And it would be an act of
inverse vandalism in the first place.

On the other hand, ULD has been connecting to any CPU, simulated or
real, for nine years now.  ULD-is-Good is a thin layer impersonating
the Alien plugin API over ULD, so "Cog simulation" can continue over
real hardware in the same way it's been done previously.
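
To make the shape concrete, here is a hedged sketch (everything on the
ULD side is my invention for illustration, and the selector only
approximates the Cog processor-alien protocol): each simulator step
becomes a GDB RSP exchange with the stub next to the real CPU.

   singleStepInMemory: aMemory minimumAddress: minAddr readOnlyBelow: roLimit
      "Answer as the Alien plugin would, but by asking the real CPU
       over GDB RSP.  'rsp' is a hypothetical wrapper around the TCP
       socket."
      | reply |
      reply := rsp sendPacket: 's'.      "RSP single-step request"
      (reply beginsWith: 'T0b')          "stop reply: signal 11, SEGV"
         ifTrue: [^self signalSegvTrap].
      ^self fetchRegisters               "RSP 'g' packet refreshes state"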

The problem with ULD is speed.  Every SEGV costs a round trip over the
GDB RSP protocol, and RSP runs over a TCP socket.  In my measurements,
native Cog code traps after 27 instructions on average.  This is
catastrophic: booting up to the Reader's REPL prompt takes 20 minutes
on a real ARM chip.
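
A back-of-envelope check shows the order of magnitude (the round-trip
latency is my assumption, not a measurement):

   "Assume ~0.5 ms per RSP round trip over local TCP, and one trap
    per 27 native instructions:"
   27 / 0.0005            "=> 54000.0 native instructions per second"
   "Twenty minutes at that rate is only ~65 million instructions:"
   54000 * (20 * 60)      "=> 64800000"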

The cure for that might be what I call "simprod semihosting": a mixed
simulated/production mode.  A "usual" Cog VM is EITHER simulated (all
jumps from JITted code to the runtime trap), OR production (all such jumps
successfully land on gcc-compiled code).  A "semihosted" Cog VM is an
arbitrary mix of both: in other words you are free to decide which
addresses are fake and which are mapped to gcc-emitted text.
Two ingredients make this possible: (1) a linker script convinces the
OS kernel to load things at the same addresses where they are in
simulation, eliminating any difference between simulated and production
address space layout; and (2) the simulation environment thinks it's
just a normal simulation: ULD-is-Good maps it to the process running
on the real chip transparently.
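
In Smalltalk terms, the decision point looks roughly like this (a
minimal sketch with invented names; the real mechanism lives in
ULD-is-Good):

   handleTrapAt: address
      "Semihosted dispatch.  The linker script guarantees gcc-emitted
       text sits at the same address in production as in simulation,
       so a plain address test picks the world in which to continue.
       fakeAddressMap and uld are hypothetical."
      ^(fakeAddressMap includesKey: address)
         ifTrue: [simulator perform: (fakeAddressMap at: address)]
         ifFalse: [uld resumeAt: address]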

In 2016, Kurt Kilpela and I implemented a semihosted Modtalk VM and
got it to run the ANSITester, so we know the approach scales beyond
proof-of-concept.  Shortly before the Second VM Hackathon I did an
example of how this could be done for Cog:

https://github.com/shingarov/opensmalltalk-vm/commit/8d6740f81929905bfe5efed691f223ac69bff70b

This example migrates one fake address across the SEGV boundary.
The final goal here would be to get rid of #cCode:inSmalltalk: altogether,
so there is truly ONE Cog codebase.  At the time this didn't have much
resonance in the community, so we left it where it is.
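
For readers outside VMMaker: #cCode:inSmalltalk: is the Slang idiom
that forks every such spot into a C variant and a simulation-only
variant.  A generic instance of the pattern (not any particular Cog
method):

   self cCode: 'fprintf(stderr, "oops\n")'     "what the translated VM runs"
        inSmalltalk: [Transcript show: 'oops'] "what the simulation runs"

Every such send is a place where the two paths can silently diverge;
with semihosting both worlds execute the same text, so the idiom
becomes unnecessary.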

> RV64GC

Not sure why "C" ("code density" extension somewhat like Thumb on ARM)
is relevant here.  As to G... to me it feels a bit like running Linux
without the S extension (meaning no MMU).  Since Project SOAR
(aka RISC-III), this has been about building a Smalltalk processor.
Jecel, Mario, myself and a bunch of other VM hackers are working towards
the J extension (hardware support for Smalltalk).  The main focus here
seems to center around multicore, from I/D cache synchronization, to
cost-free read barriers for forwarders (removing the huge complexity of
object memory designs like Spur), to other such issues.  Unfortunately
Smalltalk-80 is the main blocker here, as it chisels in stone the very
assumptions we are trying to break from.
Let's consider #collect: as one good example.
A given Smalltalk-80 class library fixes the representation of each
concrete Collection subclass; e.g. OrderedCollection is backed by an
Array.  What the #collect: method then really expresses is how to
overcome the impedance mismatch between the concept of map (mapping a
block over a collection) and the concrete representation in terms of
an Array and a sequential processor.  So the whole design is
inherently single-threaded BY SILENT ASSUMPTION.
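
Concretely, the classic implementation reads roughly like this
(paraphrased from memory, not a verbatim quote of any particular
image):

   collect: aBlock
      "The 1-to-size loop and the in-order writes into newCollection
       are exactly the silent sequential assumptions in question."
      | newCollection |
      newCollection := self species new: self size.
      1 to: self size do: [:index |
         newCollection at: index put: (aBlock value: (self at: index))].
      ^newCollection
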
Luckily in Cuis we have Luciano's Domains, where functoriality is
expressed by a universal construction (as opposed to a representation).
But it will take quite some research to replace the Smalltalk-80 base
class hierarchy with Domains (I admit I don't yet have a clear
understanding of how this will work).

> I don't know if they're interested in finishing and/or collaborating
> with someone to finish, to produce a production RISC-V JIT.

It should be, err... I wouldn't go as far as saying "simple", but it
definitely isn't science-intensive.  We demonstrated everything needed,
two FAST VM Hackathons ago.
(1) One missing piece is Slanging the algebraic form of instruction
encoding.  (The Petrich ArchC Assembler produces those algebraic ASTs
in order to make register allocation, instruction selection, scheduling
and encoding passes commute).  Someone with Slang familiarity will have
no difficulty adding C translation for them.
(2) For the Hackathon I had removed a few sends of #cCode:inSmalltalk:,
because I only cared about ULD (we have neither RISC-V nor OpenPOWER
simulator anyway).  If we want to go back to "classical" Cog with two
address space layouts, these changes would have to be reverted.
(3) There are some stupid ISA idiosyncrasies in my code because I started
studying the Cog compiler by looking at 32-bit ARM, so I blindly
transliterated some ARMisms.  For example, on 32-bit ARM, LR is R14,
and my Call and Ret just emulate that instead of using the ISA's own
conventions; see the sketch below.  There are a few other similarly
idiotic places in the Hackathon code.
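
For illustration, the natural RISC-V shape for Call, free of that
emulation, would be roughly this (a hedged sketch: jal:offset: is a
hypothetical encoding helper, and real code must fall back to
AUIPC+JALR when the ±1 MiB JAL range is exceeded):

   concretizeCall
      "Emit JAL ra, offset: the ISA's own link-register convention
       (x1) rather than an emulated ARM R14."
      | offset |
      offset := (operands at: 0) - address.
      self assert: (offset between: -1048576 and: 1048574).
      self machineCodeAt: 0 put: (self jal: 1 offset: offset). "rd = x1 = ra"
      ^machineCodeSize := 4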

> Boris Shingarov has done one but hasn't contributed it back because
> he's interested in auto-generating the JIT backend (the mapping of the
> JIT's abstract instruction set to the processor's concrete
> instructions) from a formal processor description. And, at least when
> last we talked, he was interested in interpreting the description when
> generating the code rather than generating the mapping methods
> (such as concretizeCall, concretizeMoveRR, etc) from the specification.

Yes.  There are a number of reasons why I think "the Second VM
Hackathon" JIT just fundamentally doesn't go far enough:

Speaking of synthesis from spec: the normative RISC-V standard is the
Sail2 source, not the English prose.  The center of difficulty here is
the memory model.  RISC-V is a weakly-consistent architecture, which
means that optimizations such as out-of-order execution result in
software-visible race conditions.  Morisset's dissertation,
https://fzn.fr/students/morisset-phd.pdf
explains why the compiler is fundamentally affected by weak consistency,
and what to do about it.  The elephant in the room here is the classical
problem of pointer aliasing.  But this time around we are not getting
away with the usual naïve approaches ("just be careful" etc.); we are
forced to use separation logic or its equivalents (optics, bidirectional
computing etc.).
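
The canonical illustration is the store-buffering litmus test, written
here in Smalltalk-ish pseudocode only to show the shape (observing the
weak behaviour for real takes RV64 machine code on two harts):

   "x := 0.  y := 0.
    hart 0:  x := 1.  r1 := y.
    hart 1:  y := 1.  r2 := x.
    Under RVWMO the outcome r1 = 0 AND r2 = 0 is allowed: each hart's
    store may still sit in its store buffer when the other hart's load
    executes.  Only a FENCE rw,rw between the store and the load on
    each side rules it out; and whether the JIT must emit that fence
    depends on aliasing facts, hence separation logic."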

The other fundamental shortcoming of "the Second VM Hackathon" JIT is
what you mentioned back in Cambridge: proving at the lowest level (i.e.
that of concretizing one RTL instruction in isolation) is not that
interesting.  I tend to listen to my teachers very carefully, so I've
been working towards having the prover traverse the chain of calls
through the Cogit down to 'concretize*'.  This is much more difficult
than 'concretize*' alone: the latter is just straight-line code (no
loops, no jumps), whereas up in the Cogit we see scary things like
pointer-chasing in a loop, so now we are facing questions of
[non-]termination, of a linked pointer structure [not] stepping on
itself, etc.
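
The kind of Cogit-level code I mean, in hedged paraphrase (names are
illustrative, not the verbatim source), is a pointer-chasing walk:

   "To verify this loop the prover owes us a termination measure (the
    chain is finite) and a separation argument (no node is reachable
    twice), neither of which arises for straight-line concretize*."
   | instruction |
   instruction := firstInstruction.
   [instruction notNil] whileTrue:
      [instruction concretizeAt: instruction address.
       instruction := instruction nextInstruction]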

The nice surprise is that the principles behind today's state-of-the-art
verified compilers — standard today in the compiler research community —
are the same familiar and loved quantum-gravity ideas we studied decades
ago.  I used to think knots were nonsense invented out of sheer
desperation trying to reconcile Einstein with Heisenberg.  Now we have
André Joyal and John C. Baez constructing practical compilers out of
them, because Boole's "mental action" turns out to be the same thing as
Heisenberg's "measurement" and the same thing as Freeman–Pfenning's
"refinement", which we can write in Smalltalk as #| (pronounced "such
as") sent to a base type, thus

Integer | [ :x | x > 10 ]

evaluates to the set of all integers greater than 10.  Unlike classes,
these things compose, forming an algebra of types.  It is stunning how
far one can go with them expending only trivial effort.
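
For instance, assuming a refinement itself answers #| (which is what
"forming an algebra" suggests), composition is just another send:

Integer | [ :x | x > 10 ] | [ :x | x even ]

denotes the even integers greater than 10; the refinement of a
refinement is again a refinement, which is precisely the
compositionality that classes lack.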

So, I spent much of the time since our Hackathon bringing this
foundational formalism into Smalltalk.  Now things like propagating
Floyd-Hoare verification conditions (shall I say 'method contracts'?)
along the call chain, branch sensitivity, synthesizing unknown
intermediate contracts, etc. etc., are completely trivial.  The time
is approaching for a second stab at synthesizing the Cog from the rich
multitude of #assert: sends spread through the codebase.  If people
are interested, we can schedule a Zoom call to work on this.
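
To make 'method contracts' concrete, a toy shape (mine, not a quote
from the Cog source): an #assert: at entry is a Hoare precondition the
prover must discharge at every send site, and an #assert: before the
return is a postcondition every sender may then assume.

   divide: dividend by: divisor
      | quotient |
      self assert: divisor ~= 0.        "precondition"
      quotient := dividend // divisor.
      self assert: quotient * divisor + (dividend \\ divisor) = dividend.
      ^quotient                          "the second assert is the postcondition"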

Boris


