[Cuis-dev] Thoughts about symbols
Andres Valloud
ten at smallinteger.com
Sun Dec 1 02:14:25 PST 2024
>> So in fact, this is not a sequence of characters but rather a sequence
>> of code points --- that is, integer indices into the Latin-1 table
>> which in turn gives you characters.
>
> No. Let's play Workspace here (result of evaluation in double quotes):
String is a *byte* variable subclass. It stores bytes. That the at:
primitive manufactures a character object for you from each of those
byte entries is something else. And as you can see, there is already an
implied encoding.
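A minimal Workspace sketch of both directions (basicAt: is the standard
byte-level accessor, so this should hold in a current Cuis image):

| s |
s := 'A' copy.
s at: 1.         "$A: the at: primitive manufactures a Character"
s basicAt: 1.    "65: the byte that is actually stored"
s at: 1 put: $B.
s basicAt: 1.    "66: at:put: converted the Character back into a byte"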
>> So already, the superclass name CharacterSequence is not really an
>> indication of what is actually going on.
>
> CharacterSequence subinstances are sequences of Characters.
How do you store an instance of Character into the indexed slots of a
byte variable class?
>> Sending at: to a (byte) String gives you a character. But this
>> immediate object is only bits and has no idea Latin-1 is involved.
>
> No. Not "only bits". It is an instance of Character, has Character
> protocol.
The bits themselves, which are the only data that give each character
its identity, have no idea. This is true because immediates have no
instance variables. So, *something else* must be supplying the map
between 65 and $A.
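You can check that the character carries nothing but its bits.
Assuming the classic Character value: constructor, and immediate
characters as in current Cuis:

(Character value: 65) == $A.  "true: equal bits, same immediate object"
$A codePoint.                 "65: the only state the immediate carries"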
>> So whose job is to map small integers disguised as characters into
>> Latin-1 characters?
>
> I don't understand the question. There's nothing disguised as something
> else here.
Oh yes there is. Who said that 65 was $A? Or who said that whatever
number was chosen at the time was the left-pointing arrow? And who said
that 164 is $ñ? Everybody knows that alt+164 gives you ñ on any
DOS-descended machine set to the default OEM code page. Or is it 241,
as in Cuis' current Latin-1 / Unicode encoding?
In other words, what magic mapped the immediate character that is the
small integer 65 with some tag bits modified into the glyph for $A?
>> See how the class Character is not really a character but rather a
>> code point into Latin-1?
>
> I can't make sense of that assertion. So, no, I don't "see it".
Maybe now it's clearer?
>> Strangely, Character>>codePoint is there but "blah blah blah".
>
> Exercise left to the reader.
I respectfully refuse to clean up after update 6177, which somehow
passed code review in the following state.
Character>>codePoint
	"Unicode codePoint is blah blah blah"
	"
	self assert: $A codePoint hex = '16r41'.
	self assert: $€ codePoint hex = '16r20AC'.
	"
	<primitive: 171>
	^self primitiveFailed
Personal mastery and all that. Well then?
>> Meanwhile, UnicodeString (which has the same superclass as String) is
>> not a sequence of code points because now the encoding is UTF-8.
>
> As the "Workspace" I pasted above shows, an UnicodeString is a sequence
> of Characters. And each Character can answer its #codePoint (which is an
> integer).
It appears that way, yes. However... who said that asking the
left-arrow assignment character for its codePoint should give 28? Nice
Unicode FS (File Separator) control character, there.
>> So really what should happen here is that instead of Character you
>> should have a class called CodePoint, which you then decode using some
>> code point table to get the actual character. Or, the default
>> encoding is Unicode. But then what do you do with Latin-1?
>>
>> This "encoding" refers to the representable characters, *not* to how a
>> sequence of code points is represented (UCS2, UTF-8, UTF-16, etc).
>> Those two are not the same.
>
> ??
There are two different families of concepts that have been conflated here.
1. Ignoring the whole Latin-1 thing and pretending Cuis only does the
strict Unicode standard, this means that String is a sequence of code
points into the Unicode character set whose values range between 0 and
255. The mapping between said code points and character objects is done
by the VM. And further, there is some magic by which the code point
stored inside each character instance is interpreted to configure the
operation "grab the bits, dereference the Unicode code point list,
answer whatever that table says". So, the class Character is hiding
this mapping inside of it, such that really instances of Character are
just code points into Unicode, rather than the characters themselves (as
far as raw storage is concerned).
If you prefer, in C it is not the same to write characters[k] or
characters+k. Here, the representation of Character stores only k, and
the mapping into Unicode characters is fused (i.e. hardcoded) into it.
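To make the distinction concrete, a hypothetical unfused design could
look like this (CodePoint, characterUsing: and characterAt: are
illustrative names, not existing Cuis protocol):

Object subclass: #CodePoint
	instanceVariableNames: 'value'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Hypothetical-Encodings'

CodePoint>>characterUsing: anEncodingTable
	"The dereference is explicit; nothing here assumes Unicode or Latin-1."
	^anEncodingTable characterAt: value

Character, by contrast, stores only k, with the table dereference
chosen for you once and for all.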
2. So, with that, we can see strings are stored as sequences of code
points. String is "space efficient" if you like. But at least you can
see each code point individually stored in a uniform way. UnicodeString
encodes said code points as UTF-8, so at: and at:put: now have to work
around variable width for each code point (and let's ignore the other
complexities of denormalization etc).
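For reference, the variable width follows mechanically from the UTF-8
code point ranges (1 byte below 16r80, 2 below 16r800, 3 below
16r10000, 4 otherwise); a quick Workspace check:

| utf8ByteCount |
utf8ByteCount := [:codePoint |
	codePoint < 16r80
		ifTrue: [1]
		ifFalse: [codePoint < 16r800
			ifTrue: [2]
			ifFalse: [codePoint < 16r10000 ifTrue: [3] ifFalse: [4]]]].
utf8ByteCount value: $A codePoint.  "1"
utf8ByteCount value: $¢ codePoint.  "2"
utf8ByteCount value: $€ codePoint.  "3"

So at: on UTF-8 storage cannot turn an index into a byte offset with a
multiplication; it has to walk the bytes.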
But who said that UTF-8 was the only reasonable encoding? There are
other quite reasonable alternatives. So, here we see the second
conflation: UnicodeString, very much unlike String, has fused into it a
storage strategy that encodes code points into something that is not
simply an array of (e.g.) 4 byte entities.
When you look at all this, the class name "CharacterSequence" starts
looking just a bit strange.
> I don't see the point of continuing this exercise.
Ok, but the image is badly broken as will be shown below. I would not
want to have to undo the damage that defect can cause. Ignore at your
peril :).
What triggered this conversation was the observation that the current
string implementation strategy has side effects. Since the storage of
code points is fused with their interpretation, well then, what are we going to
do with symbols?
Currently we have an expedient way out: make a symbol subclass for each
string class. But this leads to duplication. The reason for this
duplication is that storage and interpretation are fused. So really, to
avoid all this mess, there should be a hierarchy of byte storage
classes along these lines:

SequenceOfCodePoints
	OneBytePerCodePoint
	UTFEight
And then, there would be another class hierarchy like this,

SequenceOfUnicodeCharacters
	String
	Symbol
whose instances have an instance variable pointing to the sequence of
code points in question. In other words, the old-as-dust pattern of
composition and delegation. This is better because then, who cares what
the storage encoding is? And, like the abstract class name suggests,
the interpretation of code points in this hierarchy is that of Unicode.
So really, in this world the class Character is a UnicodeCharacter,
because the interpretation of the code point is fused (i.e., hardcoded).
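A rough sketch of the shape I mean, with hypothetical selectors
throughout (I am not assuming Character codePoint: exists as a
constructor; read it as "manufacture the character for this Unicode
code point"):

SequenceOfUnicodeCharacters>>at: index
	"Storage is delegated; the Unicode interpretation of the integer lives here."
	^Character codePoint: (codePoints at: index)

SequenceOfUnicodeCharacters>>at: index put: aCharacter
	^codePoints at: index put: aCharacter codePoint

Whether codePoints is a OneBytePerCodePoint or a UTFEight is invisible
at this level, which is the whole point.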
So, why does this matter in the big picture? Because ST-80 has the same
implementation design flaws elsewhere, anyone contributing to the system
copies the flavor of the system, and in the long run this makes even
pink plane changes more difficult. For example, would you like to
change the storage strategy of a set so that it uses hash buckets
instead of open addressing linear probing? Great --- make an entire new
hierarchy for that and duplicate all the protocol. Why? Because the
interpretation of the storage is fused with the behavior into a single
class.
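With composition (again, all hypothetical names), the whole Set
protocol is written once and the strategy becomes a configuration
choice:

ComposedSet>>add: anObject
	^storage add: anObject

ComposedSet>>includes: anObject
	^storage includes: anObject

"same protocol, different storage:"
ComposedSet new storage: OpenAddressingStorage new.
ComposedSet new storage: HashBucketStorage new.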
Yes, I know, the simplicity and lack of experience of the 1970s (and
let's ignore all we learned since then). But look at the class category
where String is. Do you see SymbolSet? Why is that there, exactly?
By the way, SymbolSet>>rehash is broken because it's not aware of
UnicodeSymbol. Look at this lovely failure.
| haha |
haha := 'abc¢' asUnicodeString asSymbol.
Symbol rehash.
Symbol findInterned: haha :: == haha
That's very bad: this leaks symbols and will lead to problems that
are very hard to debug. Images where this problem manifests are
effectively trashed and need to be rebuilt from scratch --- and this is
why I would not want to have to deal with this.
IIRC, when I originally wrote the new symbol table code, UnicodeSymbol
did not exist. I find it really hard to believe that I would ignore the
presence of UnicodeSymbol. So, I suspect this issue appeared during
integration, and if so this is yet another consequence of duplication.
Andres.
-------------- next part --------------
'From Cuis7.1 [latest update: #6770] on 1 December 2024 at 1:41:12 am'!
!SymbolSet methodsFor: 'private' stamp: 'sqr 12/1/2024 01:40:24'!
rehashSymbolClass: aClass
	aClass allInstances do:
		[:symbol | self basicInternNew: symbol withHash: symbol hash]! !

!SymbolSet methodsFor: 'lookup' stamp: 'sqr 12/1/2024 01:40:51'!
rehash
	| newBuckets |
	newBuckets := self newBuckets: self newBucketCount sized: self initialBucketSize.
	self buckets: newBuckets.
	self rehashSymbolClass: Symbol.
	self rehashSymbolClass: UnicodeSymbol! !