[Cuis-dev] Thoughts about symbols
Andres Valloud
ten at smallinteger.com
Sat Nov 30 09:00:57 PST 2024
So I see a variable byte subclass String (and corresponding Symbol),
which appear to be byte-oriented storage sequences of characters. But
not really, because the encoding is Latin-1. So in fact, this is not a
sequence of characters but rather a sequence of code points --- that is,
integer indices into the Latin-1 table, which in turn gives you
characters. So already, the superclass name CharacterSequence is not
really an indication of what is actually going on.
Sending at: to a (byte) String gives you a character. But this
immediate object is only bits and has no idea Latin-1 is involved. So
whose job is it to map small integers disguised as characters into
Latin-1 characters? See how the class Character is not really a
character but rather a code point into Latin-1? Strangely,
Character>>codePoint is there but "blah blah blah".
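For instance, something like this Workspace snippet (assuming the usual
Character value:, String with:, at: and codePoint selectors are around)
shows the point: the "character" you get back is just the stored small
integer wearing a Character costume.

   | s c |
   s := String with: (Character value: 233).  "233 is the Latin-1 (and Unicode) code point for e-acute"
   c := s at: 1.                               "answers an immediate Character"
   c codePoint                                 "=> 233; the same small integer, no Latin-1 table in sight"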
Meanwhile, UnicodeString (which has the same superclass as String) is
not a sequence of code points because now the encoding is UTF-8. So
here we see that while String is a sequence of code points into Latin-1,
UnicodeString is (presumably) a sequence of Unicode code points into the
Unicode character set, and these code points have been encoded using
UTF-8. Nothing stops anybody from having an array of Character objects
here.
So back to the questions then.
* Why is "sequence of characters" related to whether said sequence is
"Unicode"?
This question is trying to draw attention to the fact that a sequence of
characters implies either that these character objects already carry an
encoding (i.e. Character represents Unicode characters whose code point
is the small integer value of the character in question), or that some
encoding is left implicit. Nevertheless, why does the notion of a
"character" imply "Unicode"? Here, clearly it doesn't, because
Characters represent Latin-1.
So really what should happen here is that instead of Character you
should have a class called CodePoint, which you then decode using some
code point table to get the actual character. Or, the default encoding
is Unicode. But then what do you do with Latin-1?
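Just to make that concrete, here is a hypothetical sketch (no such class
exists in Cuis, and the characterUsing: / characterAt: messages are made
up for illustration):

   Object subclass: #CodePoint
       instanceVariableNames: 'value'
       classVariableNames: ''
       poolDictionaries: ''
       category: 'Hypothetical-Strings'

   CodePoint >> characterUsing: aCodePointTable
       "Decode the receiver into an actual character using the given
       table (Latin-1, Unicode, or whatever coded character set the
       string uses). Purely illustrative."
       ^ aCodePointTable characterAt: value

Strings would then be sequences of code points, and the table would
carry the Latin-1 versus Unicode decision instead of the Character
class.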
This "encoding" refers to the representable characters, *not* to how a
sequence of code points is represented (UCS2, UTF-8, UTF-16, etc).
Those two are not the same.
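For example, the single code point 16r00E9 (e-acute) is representable
in both Latin-1 and Unicode, yet its stored bytes depend entirely on the
serialization chosen:

   Latin-1:          E9
   UCS-2 / UTF-16:   00 E9
   UTF-8:            C3 A9

The UTF-8 pair is just arithmetic on the code point (the two-byte form
110xxxxx 10xxxxxx), for example in a Workspace:

   | cp |
   cp := 16rE9.
   { 16rC0 bitOr: (cp bitShift: -6).    "=> 195, i.e. 16rC3"
     16r80 bitOr: (cp bitAnd: 16r3F) }  "=> 169, i.e. 16rA9"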
* What does "Unicode" mean: characters, or code points?
Here, it appears it is "code points encoded in UTF-8". Again, note the
conflated concepts.
* Suppose you only care about image representation. Do you need a "byte
string" class and a "small integer string" class for the sake of
efficiency? Or is that an issue of encoding that should only be visible
externally?
Well, here we have a "byte string" and a "UTF-8 encoded string". There
is no "one small integer [code point] per character" class.
I do not completely understand the benefit of having the internal
representation of strings be UTF-8 encoded, other than "it's
convenient because external dependencies like it better". Sure, except
UTF-8 is not the universal encoding of strings on every platform.
Sooner or later somebody pays the price of the conversion.
* And what do those "strings" store, characters or code points?
Code points. There's no storing actual characters here. So why is that
class called CharacterSequence?
* Should the image support denormalized Unicode code point sequences, or
should it prioritize efficient at:put:?
Who knows --- but UnicodeString>>at:put: doesn't look that efficient at
first glance.
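That is not surprising: with a variable-width encoding in the slots, at:
has to find where the requested code point starts, which in the worst
case means scanning from the beginning. A back-of-the-envelope sketch of
what any such lookup has to do (purely illustrative, not Cuis's actual
code; utf8Bytes and decodeCodePointStartingAt: are made-up helpers):

   codePointAt: index
       "Walk the UTF-8 bytes counting lead bytes (anything that is not
       a 2r10xxxxxx continuation byte) until the index-th code point."
       | count byteIndex |
       count := 0.
       byteIndex := 0.
       [count < index] whileTrue: [
           byteIndex := byteIndex + 1.
           ((self utf8Bytes at: byteIndex) bitAnd: 16rC0) = 16r80
               ifFalse: [count := count + 1]].
       ^ self decodeCodePointStartingAt: byteIndex

And at:put: is worse, because storing a code point with a different byte
length means shifting or reallocating the rest of the bytes.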
* Who handles string encoding, and when?
IMO that's the VM's job; the notion of UTF-8 etc. doesn't need to enter
the image (except via FFI).
Andres.
On 11/30/24 7:32 AM, Juan Vuletich via Cuis-dev wrote:
> On 11/29/2024 7:22 PM, Andres Valloud via Cuis-dev wrote:
>> Why is "sequence of characters" related to whether said sequence is
>> "Unicode"? What does "Unicode" mean: characters, or code points?
>> Suppose you only care about image representation. Do you need a "byte
>> string" class and a "small integer string" class for the sake of
>> efficiency? Or is that an issue of encoding that should only be
>> visible externally? And what do those "strings" store, characters or
>> code points? Should the image support denormalized Unicode code point
>> sequences, or should it prioritize efficient at:put:? Who handles
>> string encoding, and when?
>>
>> On 11/29/24 5:14 AM, Luciano Notarfrancesco via Cuis-dev wrote:
>>>
>>> Yes, that’s true, the current design is not wrong. There’s a bit of
>>> code duplication, and that led me to think about unifying and
>>> generalizing it, and I wanted to know your thoughts. I guess the
>>> alternative design would make more sense if the code was triplicated
>>> (if there were 3 kinds of symbols), but I don’t see that coming. A
>>> third alternative would be to move some methods to CharacterSequence,
>>> but I’m not convinced that’s better either.
>>
>
> Please take a look at https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev/tree/master/Documentation/Presentations/2022-11-UnicodeSupportInCuisSmalltalk .
> Two things that have changed since that paper are that now there is a
> single immediate Unicode-wide Character
> class, and UnicodeStrings now support #at:put:. Class Character and
> classes in the CharacterSequence hierarchy have meaningful class
> comments that may help.
>
> I believe that, and maybe playing a bit in a Workspace, could give
> better answers to most questions here than email.
>
> If you still have doubts, feel free to ask.
>
> Thanks,
>