[Cuis-dev] Non-ASCII characters in source code files Was: TrueType font import problems

Juan Vuletich juan at cuis.st
Thu Sep 14 10:38:30 PDT 2023


Hi Bernhard,

On 9/14/2023 10:16 AM, Bernhard Pieber via Cuis-dev wrote:
> Hi Juan,
>
> Thanks for the explanation. Did I understand correctly that old ISO-8859-15-encoded source files should be converted to UTF-8 before they are loaded in Unicode-enabled Cuis? Or does Cuis somehow do that automatically if possible?

Almost all the code is pure ASCII. Nothing to be done here.

The non-ASCII parts of old files are usually invalid as UTF-8, so they 
can't be mistaken for UTF-8 content. Cuis converts these on the fly to 
the new UnicodeString objects. Then, when you save, the new file is 
UTF-8. So, the conversion is done automatically. There is a very small 
risk that some old non-ASCII content happens to be valid UTF-8 and gets 
misread, leading to wrong code. But that risk is really small. A bit of 
checking after loading should be enough.
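
To make the idea concrete, here is a minimal sketch in Python, not Cuis 
code. It assumes a whole-buffer fallback (try UTF-8 first, otherwise 
treat the bytes as ISO-8859-15); the function names are hypothetical and 
the actual Cuis implementation may work differently.

```python
def decode_source_bytes(raw: bytes) -> str:
    """Decode source bytes, preferring UTF-8.

    Assumption for this sketch: if the buffer is not valid UTF-8,
    the whole buffer is treated as legacy ISO-8859-15.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso8859_15")


def save_source(text: str) -> bytes:
    """Saving always writes UTF-8, whatever the original encoding was."""
    return text.encode("utf-8")


# An old file holding 'A¢€' in ISO-8859-15 is the bytes 41 A2 A4.
# A2 is not a valid UTF-8 lead byte, so the fallback is taken.
legacy = b"A\xa2\xa4"
text = decode_source_bytes(legacy)   # 'A¢€'
resaved = save_source(text)          # b'A\xc2\xa2\xe2\x82\xac'

# The small risk: the ISO-8859-15 text 'Ã©' is the bytes C3 A9,
# which also happen to be valid UTF-8 for 'é'.
ambiguous = decode_source_bytes(b"\xc3\xa9")   # 'é', not 'Ã©'
```

Pure ASCII files decode identically either way, which is why almost all 
existing code needs no attention.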

> IIUC the files on https://github.com/Cuis-Smalltalk/StyledTextEditor should have been converted to UTF-8 already. If yes, I still don't understand why the string literal in RTFExporting.pck.st was changed by resaving the package file? (I did not use my repo for the test.)

That's what I just described above: the non-ASCII instances are 
converted automatically on load, and the file is then saved in UTF-8 
format.

> How did you do the conversion? Did you use some external tool? (I could not find any code for this in Cuis except from UnicodeString>>#fromBytesStream: but there are no senders.)

No. I just loaded, saved, and checked that everything looked OK. It did. 
For instance, #nextUtf8BytesAndCodePointInto:into: ends up calling 
#utf8BytesAndCodePointFor:byte2:byte3:byte4:into:into: . Check the 
comments in these methods. I had hoped these would be informative enough.
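
For readers without an image at hand, the following Python sketch shows 
the general shape of decoding one UTF-8 sequence (up to four bytes) into 
a code point, and of rejecting bytes that are not valid UTF-8. It 
follows the standard UTF-8 rules and is not a transcription of the Cuis 
methods named above; the function name is hypothetical.

```python
def next_code_point(data: bytes, i: int = 0):
    """Return (code_point, byte_count) for the UTF-8 sequence at data[i],
    or None if the bytes there are not valid UTF-8."""
    b1 = data[i]
    if b1 < 0x80:                       # single-byte ASCII
        return b1, 1
    if 0xC2 <= b1 <= 0xDF:              # lead byte of a 2-byte sequence
        n, cp, lowest = 2, b1 & 0x1F, 0x80
    elif 0xE0 <= b1 <= 0xEF:            # lead byte of a 3-byte sequence
        n, cp, lowest = 3, b1 & 0x0F, 0x800
    elif 0xF0 <= b1 <= 0xF4:            # lead byte of a 4-byte sequence
        n, cp, lowest = 4, b1 & 0x07, 0x10000
    else:
        return None                     # stray continuation or invalid lead byte
    tail = data[i + 1 : i + n]
    if len(tail) != n - 1:
        return None                     # sequence is truncated
    for b in tail:
        if b & 0xC0 != 0x80:
            return None                 # not a continuation byte
        cp = (cp << 6) | (b & 0x3F)
    # Reject overlong forms, UTF-16 surrogates and values past U+10FFFF.
    if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return None
    return cp, n
```

With this, the UTF-8 bytes of the Euro sign, E2 82 AC, give (0x20AC, 3), 
while the ISO-8859-15 Euro byte A4 on its own is rejected. That 
rejection is what makes old non-ASCII content recognizable as not being 
UTF-8.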

> Regarding the method iso8859s15ToRTFEncoding, I am pretty sure this is the correct string from the comment (Test for Cent and Euro characters):
> self assert: 'A¢€' iso8859s15ToRTFEncoding = 'A\u162?\u8364?'

Cool.
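
As a cross-check of that expected string, here is a small Python sketch 
of the escaping rule, not the Smalltalk method itself. It assumes the 
usual RTF \uN? convention, where N is a signed 16-bit decimal value and 
"?" is the fallback character. The function name is hypothetical, and 
escaping of backslash and braces is deliberately left out.

```python
def to_rtf_encoding(text: str) -> str:
    """Escape non-ASCII characters as RTF \\uN? sequences.

    Sketch only: ASCII passes through unchanged, so '\\', '{' and '}'
    are NOT escaped here, although real RTF output needs that too.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        # Work on UTF-16 code units so characters outside the BMP
        # become two \u escapes (a surrogate pair).
        units = ch.encode("utf-16-be")
        for k in range(0, len(units), 2):
            unit = int.from_bytes(units[k:k + 2], "big")
            if unit > 32767:
                unit -= 65536           # RTF expects a signed 16-bit value
            out.append(f"\\u{unit}?")
    return "".join(out)


assert to_rtf_encoding("A¢€") == "A\\u162?\\u8364?"
```

Because the input here is already a Unicode string, the same logic would 
apply regardless of the encoding the text originally came from.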

> Instead of #iso8859s15ToRTFEncoding a new method #toRTFEncoding or #asRTF polymorphic to String and UnicodeString is probably needed, right?

Yes. That sounds like a good idea. Still, I'd check recent RTF 
documentation. I'd be really surprised if it doesn't handle UTF-8 
encoding as part of the standard. If it does, maybe all of that can 
simply be removed and replaced with UTF-8 stuff.

> Cheers,
> Bernhard
>

Cheers,

-- 
Juan Vuletich
cuis.st
github.com/jvuletich
researchgate.net/profile/Juan-Vuletich
independent.academia.edu/JuanVuletich
patents.justia.com/inventor/juan-manuel-vuletich
linkedin.com/in/juan-vuletich-75611b3
twitter.com/JuanVuletich


