[Cuis-dev] Thoughts about symbols

Andres Valloud ten at smallinteger.com
Sun Dec 1 02:54:15 PST 2024


So I read the updated comments, and they read in part:

==============================
A String is an indexed collection of Characters. In Cuis, Characters are 
Unicode Code Points. In an instance of String, all the Characters must 
be in the first 255 CodePoints, the Latin-1 set. See also UnicodeString.
==============================

You're saying it yourself: characters are code points.  String code 
points must be in Latin-1.  To be sure, this must mean "Basic Latin" 
plus "Latin-1 Supplement" in Unicode (U+0000 through U+00FF), and *not* 
ISO-8859-1 (the "Latin alphabet no. 1" standard, which is where search 
engines send you if you look up "Latin-1").

I say this because if "Latin-1" is to be interpreted as "ISO-8859-1", 
then the comment is not true because ISO-8859-1 leaves a lot of code 
points undefined (the ranges 0x00-0x1F and 0x7F-0x9F, per the Wikipedia 
page).  I do not see anything stopping the storage of the zero code 
point into an instance of String, for example.

I really think looking at String as "sequence of unsigned byte code 
points" is much better.  The characters you get out of that should be 
Unicode in all cases for the sake of simplicity.
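
To make the "sequence of unsigned byte code points" reading concrete, 
here is a minimal sketch.  It is in Python rather than Smalltalk only so 
that it can be run outside an image, and the helper name chars_of is 
made up for illustration; it is not a claim about how Cuis implements 
String.

```python
def chars_of(byte_string: bytes) -> list:
    """Treat each unsigned byte as a Unicode code point; return the characters."""
    return [chr(b) for b in byte_string]

# The bytes 72 101 108 108 111 read as code points spell "Hello".
hello = bytes([72, 101, 108, 108, 111])
print("".join(chars_of(hello)))

# Nothing stops a zero byte (or any other value in 0..255) from being stored.
with_zero = bytes([0, 233])
print([ord(c) for c in chars_of(with_zero)])
```

Under this reading the storage holds nothing but integers in 0..255, 
and the only table involved in producing characters is Unicode itself.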

Per the relevant Wikipedia pages, I believe that Latin-1 (meaning 
ISO-8859-1) and Unicode (meaning Basic Latin plus Latin-1 Supplement) 
match wherever Latin-1 is defined.  However, I didn't check this 
exhaustively.
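
For what it is worth, the exhaustive check is short.  The sketch below 
is Python, not Smalltalk, and rests on two assumptions: that the byte 
values to which ISO/IEC 8859-1 assigns graphic characters are 0x20-0x7E 
and 0xA0-0xFF, and that Python's "latin-1" codec is a faithful rendering 
of the standard over those values.  The names DEFINED and mismatches are 
mine.

```python
# Byte values to which ISO/IEC 8859-1 assigns graphic characters
# (assumption: 0x20-0x7E and 0xA0-0xFF; the control ranges are excluded).
DEFINED = list(range(0x20, 0x7F)) + list(range(0xA0, 0x100))

def mismatches():
    """Return byte values whose Latin-1 decoding differs from the
    Unicode character with the same code point."""
    return [b for b in DEFINED if bytes([b]).decode("latin-1") != chr(b)]

print(len(DEFINED))   # how many byte values were checked
print(mismatches())   # an empty list means the two agree wherever checked
```

An empty result supports the claim only as far as that codec can be 
trusted; it says nothing about the byte values the standard leaves 
undefined.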

On 11/30/24 6:24 PM, Juan Vuletich via Cuis-dev wrote:
> On 11/30/2024 12:00 PM, Andres Valloud via Cuis-dev wrote:
>> So I see a variable byte subclass String (and corresponding Symbol), 
>> which appear to be byte oriented storage sequences of characters.  But 
>> not really, because the encoding is Latin-1. 
> 
> No. That was an outdated comment. I pushed an update to fix it earlier 
> today.
> 
>> So in fact, this is not a sequence of characters but rather a sequence 
>> of code points --- that is, integer indices into the Latin-1 table 
>> which in turn gives you characters. 
> 
> No. Let's play Workspace here (result of evaluation in double quotes):
> 
> 'Hello' class. "String" .
> 'Hello' first. "$H" .
> 'Hello' first class. "Character" .
> 'Hello' first codePoint. "72" .
> 'α ∈ [0 .. 2π)' class. "UnicodeString" .
> 'α ∈ [0 .. 2π)' first. "$α" .
> 'α ∈ [0 .. 2π)' first class. "Character" .
> 'α ∈ [0 .. 2π)' first codePoint. "945" .
> 
>> So already, the superclass name CharacterSequence is not really an 
>> indication of what is actually going on.
> 
> CharacterSequence subinstances are sequences of Characters.
> 
>> Sending at: to a (byte) String gives you a character.  But this 
>> immediate object is only bits and has no idea Latin-1 is involved. 
> 
> No. Not "only bits". It is an instance of Character, has Character 
> protocol.
> 
>> So whose job is to map small integers disguised as characters into 
>> Latin-1 characters? 
> 
> I don't understand the question. There's nothing disguised as something 
> else here.
> 
>> See how the class Character is not really a character but rather a 
>> code point into Latin-1? 
> 
> I can't make sense of that assertion. So, no, I don't "see it".
> 
>> Strangely, Character>>codePoint is there but "blah blah blah".
> 
> Exercise left to the reader.
> 
>>
>> Meanwhile, UnicodeString (which has the same superclass as String) is 
>> not a sequence of code points because now the encoding is UTF-8. 
> 
> As the "Workspace" I pasted above shows, a UnicodeString is a sequence 
> of Characters. And each Character can answer its #codePoint (which is an 
> integer).
> 
>> So here we see that while string is a sequence of code points into 
>> Latin-1, UnicodeString is (presumably) a sequence of Unicode code 
>> points into the Unicode character set, and these code points have been 
>> encoded using UTF-8. 
> 
> I hope what Strings and UnicodeStrings are is now clear.
> 
>> Nothing stops anybody from having an array of Character objects here.
> 
> Of course. You can make an Array of whatever you want.
> 
>>
>> So back to the questions then.
>>
>> * Why is "sequence of characters" related to whether said sequence is 
>> "Unicode"?
>>
>> This question is trying to draw attention to the fact that a sequence 
>> of characters implies either that these character objects already come 
>> with an encoding (i.e. Character represents Unicode characters whose 
>> code point is the small integer equivalent of the character in 
>> question), or that some encoding is implied.  Nevertheless, why does 
>> the notion of a "character" imply "Unicode"?  Here, clearly it doesn't 
>> because Characters represent Latin-1.
> 
> According to Wikipedia (Unicode): "Version 16.0 of the standard[A] 
> defines 154998 characters..." Those Characters are what instances of 
> Character in Cuis represent. As simple as that. The first 128 can be 
> said to also be ASCII. The first 256 can be said to also be Latin-1.
> 
>>
>> So really what should happen here is that instead of Character you 
>> should have a class called CodePoint, which you then decode using some 
>> code point table to get the actual character.  Or, the default 
>> encoding is Unicode.  But then what do you do with Latin-1?
>>
>> This "encoding" refers to the representable characters, *not* to how a 
>> sequence of code points is represented (UCS2, UTF-8, UTF-16, etc). 
>> Those two are not the same.
> 
> ??
> 
>>
>> * What does "Unicode" mean: characters, or code points?
>>
>> Here, it appears it is "code points encoded in UTF-8".  Again, note 
>> the conflated concepts.
> 
> I guess you mean what "Unicode" means in the class name "UnicodeString". 
> The answer is that UnicodeString is a sequence of Unicode Characters. 
> Unicode Characters are what the Unicode standard and Wikipedia page call 
> "Unicode Characters". That's all.
> 
> I don't see the point of continuing this exercise. Please read the paper 
> I pointed at before:
> https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev/blob/master/Documentation/Presentations/2022-11-UnicodeSupportInCuisSmalltalk/2022-11-UnicodeSupportInCuisSmalltalk.pdf
> 
> Maybe experimenting with actual instances of these classes and browsing 
> their protocol will help all this make sense to you. Or maybe not. Who 
> knows.
> 


