[Cuis-dev] Thoughts about symbols

Andres Valloud ten at smallinteger.com
Sun Dec 1 02:14:25 PST 2024


>> So in fact, this is not a sequence of characters but rather a sequence 
>> of code points --- that is, integer indices into the Latin-1 table 
>> which in turn gives you characters. 
> 
> No. Let's play Workspace here (result of evaluation in double quotes):

String is a *byte* variable subclass.  It stores bytes.  That some at: 
primitive manufactures a character object for you from each of those 
byte entries is something else.  And as you can see, there is already 
an implied encoding.
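
A quick Workspace check makes the distinction visible (results in 
double quotes; this assumes the stock basicAt: behavior for 
byte-indexed classes):

| s |
s := 'A'.
s class isBytes. "true: the indexed slots hold bytes"
s basicAt: 1. "65, the raw byte in slot 1"
s at: 1. "$A, a Character manufactured by the at: primitive"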

>> So already, the superclass name CharacterSequence is not really an 
>> indication of what is actually going on.
> 
> CharacterSequence subinstances are sequences of Characters.

How do you store an instance of Character into the indexed slots of a 
byte variable class?
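
You can't, of course; the best at:put: can do is extract a byte from 
the Character and store that (again assuming stock behavior):

| s |
s := String new: 1.
s at: 1 put: $A.
s basicAt: 1. "65: what landed in the slot is a byte, not the Character"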

>> Sending at: to a (byte) String gives you a character.  But this 
>> immediate object is only bits and has no idea Latin-1 is involved. 
> 
> No. Not "only bits". It is an instance of Character, has Character 
> protocol.

The bits themselves, which are the only data that give each character 
its identity, have no idea.  This is true because immediates have no 
instance variables.  So, *something else* must be supplying the map 
between 65 and $A.
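
That something else is the Character class itself, together with the 
VM: the only state a Character carries is its numeric value, and the 
rest is behavior.  For instance (Workspace again, results in double 
quotes):

$A asInteger. "65"
Character value: 65. "$A"
(Character value: 65) == $A. "true: same bits, same immediate, no other state"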

>> So whose job is to map small integers disguised as characters into 
>> Latin-1 characters? 
> 
> I don't understand the question. There's nothing disguised as something 
> else here.

Oh yes there is.  Who said that 65 was $A?  Or who said that whatever 
number was chosen at the time was the left-pointing arrow?  And who 
said that 164 is $ñ?  Everybody knows that alt+164 gives you ñ on any 
DOS descendant machine set to the default OEM code page.  Or is it 
241, as in Cuis' current Latin-1 / Unicode encoding?

In other words, what magic mapped the immediate character that is the 
small integer 65 with some tag bits modified into the glyph for $A?
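
A two-line check shows the number alone decides nothing; only the 
fused Latin-1 / Unicode interpretation does:

(Character value: 164) = $ñ. "false: under Latin-1, 164 is ¤, not ñ"
(Character value: 241) = $ñ. "true: 241 is ñ under Latin-1 / Unicode"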

>> See how the class Character is not really a character but rather a 
>> code point into Latin-1? 
> 
> I can't make sense of that assertion. So, no, I don't "see it".

Maybe now it's clearer?

>> Strangely, Character>>codePoint is there but "blah blah blah".
> 
> Exercise left to the reader.

I respectfully refuse to clean up after update 6177, which somehow 
passed code review in the following state.

Character>>codePoint
	"Unicode codePoint is blah blah blah"
	"
	self assert: $A codePoint hex = '16r41'.
	self assert: $€ codePoint hex = '16r20AC'.
	"
	<primitive: 171>
	^self primitiveFailed

Personal mastery and all that.  Well then?

>> Meanwhile, UnicodeString (which has the same superclass as String) is 
>> not a sequence of code points because now the encoding is UTF-8. 
> 
> As the "Workspace" I pasted above shows, an UnicodeString is a sequence 
> of Characters. And each Character can answer its #codePoint (which is an 
> integer).

It appears that way, yes.  However... who said that asking the left 
arrow assignment character for its codePoint should give 28?  Nice 
Unicode FS character, there.

>> So really what should happen here is that instead of Character you 
>> should have a class called CodePoint, which you then decode using some 
>> code point table to get the actual character.  Or, the default 
>> encoding is Unicode.  But then what do you do with Latin-1?
>>
>> This "encoding" refers to the representable characters, *not* to how a 
>> sequence of code points is represented (UCS2, UTF-8, UTF-16, etc). 
>> Those two are not the same.
> 
> ??

There are two different families of concepts that have been conflated here.

1.  Ignoring the whole Latin-1 thing and pretending Cuis only does 
the strict Unicode standard, this means that String is a sequence of 
code points into the Unicode character set, with values ranging 
between 0 and 255.  The mapping between said code points and 
character objects is done by the VM.  And further, there is some 
magic by which the code point stored inside each character instance 
is interpreted to carry out the operation "grab the bits, dereference 
the Unicode code point list, answer whatever that table says".  So, 
the class Character is hiding this mapping inside of it, such that 
instances of Character really are just code points into Unicode, 
rather than the characters themselves (as far as raw storage is 
concerned).

If you prefer, in C it is not the same to write characters[k] as it 
is to write characters + k.  Here, the representation of Character 
stores only k, and the mapping into Unicode characters is fused 
(i.e. hardcoded) into it.

2.  So, with that, we can see strings are stored as sequences of code 
points.  String is "space efficient" if you like.  But at least you can 
see each code point individually stored in a uniform way.  UnicodeString 
encodes said code points as UTF-8, so at: and at:put: now have to work 
around variable width for each code point (and let's ignore the other 
complexities of denormalization etc).
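
To make the variable width point concrete, here is a rough sketch of 
what answering the k-th code point out of UTF-8 encoded storage 
entails.  This is not Cuis' actual code; the selector and the bytes 
instance variable are made up for illustration, and malformed input 
is ignored.  The point is that there is no direct indexing, only 
scanning from the start.

codePointAt: index
	"Sketch only: walk the UTF-8 encoded bytes until the index-th
	code point is reached, then decode and answer it as an integer."
	| position count byte extraBytes value |
	position := 1.
	count := 0.
	[position <= bytes size] whileTrue: [
		byte := bytes at: position.
		extraBytes := byte < 16r80
			ifTrue: [0]
			ifFalse: [byte < 16rE0
				ifTrue: [1]
				ifFalse: [byte < 16rF0 ifTrue: [2] ifFalse: [3]]].
		count := count + 1.
		count = index ifTrue: [
			value := extraBytes = 0
				ifTrue: [byte]
				ifFalse: [byte bitAnd: (16r3F bitShift: extraBytes negated)].
			1 to: extraBytes do: [:offset |
				value := (value bitShift: 6)
					bitOr: ((bytes at: position + offset) bitAnd: 16r3F)].
			^value].
		position := position + 1 + extraBytes].
	self error: 'code point index out of bounds'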

But who said that UTF-8 was the only reasonable encoding?  There are 
other quite reasonable alternatives.  So, here we see the second 
conflation: UnicodeString, very much unlike String, has fused into it a 
storage strategy that encodes code points into something that is not 
simply an array of (e.g.) 4 byte entities.

When you look at all this, the class name "CharacterSequence" starts 
looking just a bit strange.

> I don't see the point of continuing this exercise.

Ok, but the image is badly broken as will be shown below.  I would not 
want to have to undo the damage that defect can cause.  Ignore at your 
peril :).

What triggered this conversation was the observation that the current 
string implementation strategy has side effects.  Since storage of code 
points is fused with its interpretation, well then, what are we going to 
do with symbols?

Currently we have an expedient way out: make a symbol subclass for 
each string class.  But this leads to duplication.  The reason for 
this duplication is that storage and interpretation are fused.  So 
really, to avoid all this mess, there should be a hierarchy of byte 
storage classes along these lines.

	SequenceOfCodePoints
		OneBytePerCodePoint
		UTFEight

And then, there would be another class hierarchy like this,

	SequenceOfUnicodeCharacters
		String
			Symbol

whose instances have an instance variable pointing to the sequence of 
code points in question.  In other words, the old-as-dust pattern of 
composition and delegation.  This is better because then, who cares what 
the storage encoding is?  And, like the abstract class name suggests, 
the interpretation of code points in this hierarchy is that of Unicode. 
So really, in this world the class Character is a UnicodeCharacter, 
because the interpretation of the code point is fused (i.e., hardcoded).
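
As a very rough sketch of the shape I mean (everything below is 
hypothetical, including the codePointAt: protocol the storage classes 
would answer to, and assuming Character value: accepts any code 
point):

Object subclass: #SequenceOfUnicodeCharacters
	instanceVariableNames: 'codePoints'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Illustration'

at: index
	"Ask the storage for a raw code point; the Unicode
	interpretation lives here, not in the storage class."
	^Character value: (codePoints codePointAt: index)

at: index put: aCharacter
	"Symmetrically, hand the storage nothing but a code point."
	codePoints codePointAt: index put: aCharacter codePoint.
	^aCharacter

With that split, OneBytePerCodePoint versus UTFEight is purely a 
storage decision, and Symbol can be a single subclass of String 
instead of one symbol class per encoding.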

So, why does this matter in the big picture?  Because ST-80 has the same 
implementation design flaws elsewhere, anyone contributing to the system 
copies the flavor of the system, and in the long run this makes even 
pink plane changes more difficult.  For example, would you like to 
change the storage strategy of a set so that it uses hash buckets 
instead of open addressing with linear probing?  Great --- make an 
entirely new hierarchy for that and duplicate all the protocol. 
Why?  Because the
interpretation of the storage is fused with the behavior into a single 
class.
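
A hypothetical sketch of the same split applied to sets, with the 
probing strategy delegated to a storage object rather than fused in:

Object subclass: #PolicySet
	instanceVariableNames: 'storage'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Illustration'

add: anObject
	"Whether this means linear probing or hash buckets is the
	storage object's business."
	^storage add: anObject

includes: anObject
	^storage includes: anObject

Swapping open addressing for buckets then means writing a new storage 
class, not duplicating the entire Set protocol in a parallel 
hierarchy.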

Yes, I know, the simplicity and lack of experience of the 1970s (and 
let's ignore all we learned since then).  But look at the class category 
where String is.  Do you see SymbolSet?  Why is that there, exactly?

By the way, SymbolSet>>rehash is broken because it's not aware of 
UnicodeSymbol.  Look at this lovely failure.

| haha |
haha := 'abc¢' asUnicodeString asSymbol.
Symbol rehash.
Symbol findInterned: haha :: == haha

That's very bad, this is leaking symbols and will lead to problems that 
are very hard to debug.  Images where this problem manifests are 
effectively trashed and need to be rebuilt from scratch --- and this is 
why I would not want to have to deal with this.
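
For contrast, the same dance with a plain byte symbol presumably 
survives the rehash, since the stock code does walk Symbol:

| okay |
okay := 'abc' asSymbol.
Symbol rehash.
Symbol findInterned: okay :: == okay "true, if only UnicodeSymbol is missed"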

IIRC, when I originally wrote the new symbol table code, UnicodeSymbol 
did not exist.  I find it really hard to believe that I would ignore the 
presence of UnicodeSymbol.  So, I suspect this issue appeared during 
integration, and if so this is yet another consequence of duplication.

Andres.
-------------- next part --------------
'From Cuis7.1 [latest update: #6770] on 1 December 2024 at 1:41:12 am'!

!SymbolSet methodsFor: 'private' stamp: 'sqr 12/1/2024 01:40:24'!
rehashSymbolClass: aClass

	aClass allInstances do:
		[:symbol | self basicInternNew: symbol withHash: symbol hash]! !


!SymbolSet methodsFor: 'lookup' stamp: 'sqr 12/1/2024 01:40:51'!
rehash

	| newBuckets |
	newBuckets := self newBuckets: self newBucketCount sized: self initialBucketSize.
	self buckets: newBuckets.
	self rehashSymbolClass: Symbol.
	self rehashSymbolClass: UnicodeSymbol! !


