[Cuis-dev] Thoughts about symbols
Andres Valloud
ten at smallinteger.com
Sat Nov 30 09:00:57 PST 2024
So I see a variable byte subclass String (and corresponding Symbol),
which appear to be byte-oriented storage sequences of characters. But
not really, because the encoding is Latin-1. So in fact, this is not a
sequence of characters but rather a sequence of code points --- that is,
integer indices into the Latin-1 table, which in turn gives you
characters. So already, the superclass name CharacterSequence is not
really an indication of what is actually going on.
Sending at: to a (byte) String gives you a character. But this
immediate object is only bits and has no idea Latin-1 is involved. So
whose job is it to map small integers disguised as characters into
Latin-1 characters? See how the class Character is not really a
character but rather a code point into Latin-1? Strangely,
Character>>codePoint is there but "blah blah blah".
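For instance, something like this Workspace snippet (assuming the usual
Character value:, String with:, at: and codePoint selectors are around)
shows the point: the "character" you get back is just the stored small
integer wearing a Character costume.

   | s c |
   s := String with: (Character value: 233).  "233 is the Latin-1 (and Unicode) code point for e-acute"
   c := s at: 1.                               "answers an immediate Character"
   c codePoint                                 "=> 233; the same small integer, no Latin-1 table in sight"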
Meanwhile, UnicodeString (which has the same superclass as String) is
not a sequence of code points because now the encoding is UTF-8. So
here we see that while String is a sequence of code points into Latin-1,
UnicodeString is (presumably) a sequence of Unicode code points into the
Unicode character set, and these code points have been encoded using
UTF-8. Nothing stops anybody from having an array of Character objects
here.
So back to the questions then.
* Why is "sequence of characters" related to whether said sequence is
"Unicode"?
This question is trying to draw attention to the fact that a sequence of
characters implies either that these character objects already carry an
encoding (i.e. Character represents Unicode characters whose code point
is the small integer value of the character in question), or that some
encoding is left implicit. Nevertheless, why does the notion of a
"character" imply "Unicode"? Here, clearly it doesn't, because
Characters represent Latin-1.
So really what should happen here is that instead of Character you
should have a class called CodePoint, which you then decode using some
code point table to get the actual character. Or, the default encoding
is Unicode. But then what do you do with Latin-1?
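Just to make that concrete, here is a hypothetical sketch (no such class
exists in Cuis, and the characterUsing: / characterAt: messages are made
up for illustration):

   Object subclass: #CodePoint
       instanceVariableNames: 'value'
       classVariableNames: ''
       poolDictionaries: ''
       category: 'Hypothetical-Strings'

   CodePoint >> characterUsing: aCodePointTable
       "Decode the receiver into an actual character using the given
       table (Latin-1, Unicode, or whatever coded character set the
       string uses). Purely illustrative."
       ^ aCodePointTable characterAt: value

Strings would then be sequences of code points, and the table would
carry the Latin-1 versus Unicode decision instead of the Character
class.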
This "encoding" refers to the representable characters, *not* to how a
sequence of code points is represented (UCS2, UTF-8, UTF-16, etc).
Those two are not the same.
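For example, the single code point 16r00E9 (e-acute) is representable
in both Latin-1 and Unicode, yet its stored bytes depend entirely on the
serialization chosen:

   Latin-1:          E9
   UCS-2 / UTF-16:   00 E9
   UTF-8:            C3 A9

The UTF-8 pair is just arithmetic on the code point (the two-byte form
110xxxxx 10xxxxxx), for example in a Workspace:

   | cp |
   cp := 16rE9.
   { 16rC0 bitOr: (cp bitShift: -6).    "=> 195, i.e. 16rC3"
     16r80 bitOr: (cp bitAnd: 16r3F) }  "=> 169, i.e. 16rA9"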
* What does "Unicode" mean: characters, or code points?
Here, it appears it is "code points encoded in UTF-8". Again, note the
conflated concepts.
* Suppose you only care about image representation. Do you need a "byte
string" class and a "small integer string" class for the sake of
efficiency? Or is that an issue of encoding that should only be visible
externally?
Well, here we have a "byte string" and a "UTF-8 encoded string". There
is no "one small integer [code point] per character" class.
I do not completely understand the benefit of having the internal
representation of strings be UTF-8 encoded, other than "it's
convenient because external dependencies like it better". Sure, except
UTF-8 is not the universal encoding of strings on every platform.
Sooner or later somebody pays the price of the conversion.
* And what do those "strings" store, characters or code points?
Code points. There's no storing actual characters here. So why is that
class called CharacterSequence?
* Should the image support denormalized Unicode code point sequences, or
should it prioritize efficient at:put:?
Who knows --- but UnicodeString>>at:put: doesn't look that efficient at
first glance.
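That is not surprising: with a variable-width encoding in the slots, at:
has to find where the requested code point starts, which in the worst
case means scanning from the beginning. A back-of-the-envelope sketch of
what any such lookup has to do (purely illustrative, not Cuis's actual
code; utf8Bytes and decodeCodePointStartingAt: are made-up helpers):

   codePointAt: index
       "Walk the UTF-8 bytes counting lead bytes (anything that is not
       a 2r10xxxxxx continuation byte) until the index-th code point."
       | count byteIndex |
       count := 0.
       byteIndex := 0.
       [count < index] whileTrue: [
           byteIndex := byteIndex + 1.
           ((self utf8Bytes at: byteIndex) bitAnd: 16rC0) = 16r80
               ifFalse: [count := count + 1]].
       ^ self decodeCodePointStartingAt: byteIndex

And at:put: is worse, because storing a code point with a different byte
length means shifting or reallocating the rest of the bytes.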
* Who handles string encoding, and when?
IMO that's the VM's job; the notion of UTF-8 etc. doesn't need to enter
the image (except via FFI).
Andres.
On 11/30/24 7:32 AM, Juan Vuletich via Cuis-dev wrote:
> On 11/29/2024 7:22 PM, Andres Valloud via Cuis-dev wrote:
>> Why is "sequence of characters" related to whether said sequence is
>> "Unicode"? What does "Unicode" mean: characters, or code points?
>> Suppose you only care about image representation. Do you need a "byte
>> string" class and a "small integer string" class for the sake of
>> efficiency? Or is that an issue of encoding that should only be
>> visible externally? And what do those "strings" store, characters or
>> code points? Should the image support denormalized Unicode code point
>> sequences, or should it prioritize efficient at:put:? Who handles
>> string encoding, and when?
>>
>> On 11/29/24 5:14 AM, Luciano Notarfrancesco via Cuis-dev wrote:
>>>
>>> Yes, that’s true, the current design is not wrong. There’s a bit of
>>> code duplication, and that led me to think about unifying and
>>> generalizing it, and I wanted to know your thoughts. I guess the
>>> alternative design would make more sense if the code was triplicated
>>> (if there were 3 kinds of symbols), but I don’t see that coming. A
>>> third alternative would be to move some methods to CharacterSequence,
>>> but I’m not convinced that’s better either.
>>
>
> Please take a look at https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev/tree/master/Documentation/Presentations/2022-11-UnicodeSupportInCuisSmalltalk .
> Two things that have changed since that paper are that now there is a
> single immediate Unicode-wide Character
> class, and UnicodeStrings now support #at:put:. Class Character and
> classes in the CharacterSequence hierarchy have meaningful class
> comments that may help.
>
> I believe that, and maybe playing a bit in a Workspace, could give
> better answers to most questions here than email.
>
> If you still have doubts, feel free to ask.
>
> Thanks,
>