[Cuis-dev] Thoughts about symbols

Sat Nov 30 18:24:33 PST 2024

On 11/30/2024 12:00 PM, Andres Valloud via Cuis-dev wrote:
> So I see a variable byte subclass String (and corresponding Symbol), 
> which appear to be byte oriented storage sequences of characters.  But 
> not really, because the encoding is Latin-1. 

No. That was an outdated comment. I pushed an update to fix it earlier 
today.

> So in fact, this is not a sequence of characters but rather a sequence 
> of code points --- that is, integer indices into the Latin-1 table 
> which in turn gives you characters. 

No. Let's play Workspace here (result of evaluation in double quotes):

'Hello' class. "String" .
'Hello' first. "$H" .
'Hello' first class. "Character" .
'Hello' first codePoint. "72" .
'α ∈ [0 .. 2π)' class. "UnicodeString" .
'α ∈ [0 .. 2π)' first. "$α" .
'α ∈ [0 .. 2π)' first class. "Character" .
'α ∈ [0 .. 2π)' first codePoint. "945" .

> So already, the superclass name CharacterSequence is not really an 
> indication of what is actually going on.

CharacterSequence subinstances are sequences of Characters.

> Sending at: to a (byte) String gives you a character.  But this 
> immediate object is only bits and has no idea Latin-1 is involved. 

No. Not "only bits". It is an instance of Character and has Character protocol.

> So whose job is to map small integers disguised as characters into 
> Latin-1 characters? 

I don't understand the question. There's nothing disguised as something 
else here.

> See how the class Character is not really a character but rather a 
> code point into Latin-1? 

I can't make sense of that assertion. So, no, I don't "see it".

> Strangely, Character>>codePoint is there but "blah blah blah".

Exercise left to the reader.

>
> Meanwhile, UnicodeString (which has the same superclass as String) is 
> not a sequence of code points because now the encoding is UTF-8. 

As the "Workspace" I pasted above shows, a UnicodeString is a sequence 
of Characters. And each Character can answer its #codePoint (which is an 
integer).
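
To make the difference between the Characters and their storage concrete, 
here is a small Workspace sketch in the same style as above. It assumes a 
current Cuis image in which UnicodeString understands #asUtf8Bytes and 
answers the UTF-8 encoded bytes; that selector is not shown elsewhere in 
this message, so treat it as an assumption and check it in your image.

```smalltalk
"Indexing answers Characters, and each Character answers its code point."
('α ∈ [0 .. 2π)' at: 3) class.       "Character"
('α ∈ [0 .. 2π)' at: 3) codePoint.   "8712"

"#size counts Characters, not bytes."
'α ∈ [0 .. 2π)' size.                "13"

"Assumption: #asUtf8Bytes answers the UTF-8 encoding of the receiver."
'α ∈ [0 .. 2π)' asUtf8Bytes size.    "17"
```

The last two results differ because α and π take two bytes each in UTF-8 
and ∈ takes three, while the other ten Characters take one byte each. The 
encoding only affects how the sequence is stored, not what its elements are.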

> So here we see that while string is a sequence of code points into 
> Latin-1, UnicodeString is (presumably) a sequence of Unicode code 
> points into the Unicode character set, and these code points have been 
> encoded using UTF-8. 

I hope what Strings and UnicodeStrings are is now clear.

> Nothing stops anybody from having an array of Character objects here.

Of course. You can make an Array of whatever you want.

>
> So back to the questions then.
>
> * Why is "sequence of characters" related to whether said sequence is 
> "Unicode"?
>
> This question is trying to draw attention to the fact that a sequence 
> of characters implies either that these character objects already come 
> with an encoding (i.e. Character represents Unicode characters whose 
> code point is the small integer equivalent of the character in 
> question), or that some encoding is implied.  Nevertheless, why does 
> the notion of a "character" imply "Unicode"?  Here, clearly it doesn't 
> because Characters represent Latin-1.

According to Wikipedia (Unicode): "Version 16.0 of the standard 
defines 154998 characters..." Those Characters are what instances of 
Character in Cuis represent. As simple as that. The first 128 can be 
said to also be ASCII. The first 256 can be said to also be Latin-1.
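
That subset relation can be checked from the code points alone. A small 
Workspace sketch, using only #codePoint as in the examples earlier in this 
message; the limits 128 and 256 are simply the sizes of the ASCII and 
Latin-1 repertoires, and the particular Characters are picked for 
illustration:

```smalltalk
$H codePoint.          "72"
$H codePoint < 128.    "true: $H is also an ASCII character"

$ñ codePoint.          "241"
$ñ codePoint < 128.    "false: not ASCII"
$ñ codePoint < 256.    "true: $ñ is also a Latin-1 character"

$α codePoint.          "945"
$α codePoint < 256.    "false: neither ASCII nor Latin-1"
```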

>
> So really what should happen here is that instead of Character you 
> should have a class called CodePoint, which you then decode using some 
> code point table to get the actual character.  Or, the default 
> encoding is Unicode.  But then what do you do with Latin-1?
>
> This "encoding" refers to the representable characters, *not* to how a 
> sequence of code points is represented (UCS2, UTF-8, UTF-16, etc). 
> Those two are not the same.

??

>
> * What does "Unicode" mean: characters, or code points?
>
> Here, it appears it is "code points encoded in UTF-8".  Again, note 
> the conflated concepts.

I guess you mean what "Unicode" means in the class name "UnicodeString". 
The answer is that UnicodeString is a sequence of Unicode Characters. 
Unicode Characters are what the Unicode standard and the Wikipedia page call 
"Unicode Characters". That's all.

I don't see the point of continuing this exercise. Please read the paper 
I pointed at before: 
https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev/blob/master/Documentation/Presentations/2022-11-UnicodeSupportInCuisSmalltalk/2022-11-UnicodeSupportInCuisSmalltalk.pdf

Maybe experimenting with actual instances of these classes and browsing 
their protocol will help all this make sense to you. Or maybe not. Who knows.

-- 
Juan Vuletich
cuis.st
github.com/jvuletich
researchgate.net/profile/Juan-Vuletich
independent.academia.edu/JuanVuletich
patents.justia.com/inventor/juan-manuel-vuletich
linkedin.com/in/juan-vuletich-75611b3
twitter.com/JuanVuletich