[Cuis-dev] Non-ASCII characters in source code files Was: TrueType font import problems
Juan Vuletich
juan at cuis.st
Thu Sep 14 10:38:30 PDT 2023
Hi Bernhard,
On 9/14/2023 10:16 AM, Bernhard Pieber via Cuis-dev wrote:
> Hi Juan,
>
> Thanks for the explanation. Did I understand correctly that old ISO-8859-15-encoded source files should be converted to UTF-8 before they are loaded in Unicode-enabled Cuis? Or does Cuis somehow do that automatically if possible?
Almost all the code is pure ASCII, so there is nothing to be done there.
For the non-ASCII parts, the old bytes are usually "invalid UTF-8", so
they can't be mistaken for UTF-8 content. Cuis converts these on the fly
to the new UnicodeString objects. Then, when you save, the new file is
UTF-8. So the conversion is done automatically. There is a very small
risk of some old non-ASCII stuff being mistaken for UTF-8, leading to
wrong code. But it is really small. A bit of checking should be enough.
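
To illustrate the idea (this is not the Cuis code, just a minimal Python
sketch under stated assumptions): try strict UTF-8 first, and only if that
fails treat the bytes as ISO-8859-15. The function name
decode_legacy_source is hypothetical, and the sketch decodes a whole
buffer at once, which may not match the granularity Cuis actually uses.

```python
def decode_legacy_source(raw: bytes) -> str:
    """Hypothetical helper: strict UTF-8 first, ISO-8859-15 as fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Old single-byte content: every byte maps to exactly one character.
        return raw.decode("iso-8859-15")


# Cent and Euro in ISO-8859-15 are the bytes A2 and A4. A2 is a UTF-8
# continuation byte with no lead byte before it, so strict UTF-8 rejects it.
old_bytes = "'A¢€'".encode("iso-8859-15")
text = decode_legacy_source(old_bytes)
resaved = text.encode("utf-8")  # what a later save would write

# The rare ambiguous case: the two ISO-8859-15 characters "Ã©" are the
# bytes C3 A9, which also happen to be valid UTF-8 for a single "é".
ambiguous = decode_legacy_source("Ã©".encode("iso-8859-15"))
```

In the ambiguous case the fallback never runs, so the result is "é"
instead of "Ã©". That is the kind of spot worth checking by eye after
conversion.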
> IIUC the files on https://github.com/Cuis-Smalltalk/StyledTextEditor should have been converted to UTF-8 already. If yes, I still don't understand why the string literal in RTFExporting.pck.st was changed by resaving the package file? (I did not use my repo for the test.)
That's what I just described above: the instances are auto-converted on
load, and then saved in UTF-8 format.
> How did you do the conversion? Did you use some external tool? (I could not find any code for this in Cuis except from UnicodeString>>#fromBytesStream: but there are no senders.)
No. I just loaded, saved, and checked that everything looked ok. It did.
For instance, #nextUtf8BytesAndCodePointInto:into: ends up calling
#utf8BytesAndCodePointFor:byte2:byte3:byte4:into:into: . Check the
comments in these methods. I had hoped these would be informative enough.
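
For readers without an image at hand, here is a rough Python sketch of
what decoding one code point from up to four UTF-8 bytes involves. It is
not a transcription of the Cuis methods; the name next_code_point and its
return convention are assumptions made for illustration, and the check is
simplified (it does not reject overlong three- and four-byte forms or
surrogate values).

```python
def next_code_point(data: bytes, pos: int = 0):
    """Return (code_point, byte_count), or None if the bytes at pos
    do not form a lead byte followed by the right continuation bytes."""
    b1 = data[pos]
    if b1 < 0x80:                      # plain ASCII, one byte
        return b1, 1
    if 0xC2 <= b1 <= 0xDF:             # lead byte of a 2-byte sequence
        need, cp = 1, b1 & 0x1F
    elif 0xE0 <= b1 <= 0xEF:           # lead byte of a 3-byte sequence
        need, cp = 2, b1 & 0x0F
    elif 0xF0 <= b1 <= 0xF4:           # lead byte of a 4-byte sequence
        need, cp = 3, b1 & 0x07
    else:                              # stray continuation or invalid lead
        return None
    tail = data[pos + 1 : pos + 1 + need]
    if len(tail) != need or any((b & 0xC0) != 0x80 for b in tail):
        return None
    for b in tail:                     # each continuation adds 6 bits
        cp = (cp << 6) | (b & 0x3F)
    return cp, need + 1
```

With this, the UTF-8 bytes E2 82 AC decode to code point 8364 (the Euro
sign), while the ISO-8859-15 bytes A2 A4 are rejected at the first byte,
which is what allows a loader to fall back to the old encoding.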
> Regarding the method iso8859s15ToRTFEncoding, I am pretty sure this is the correct string from the comment (Test for Cent and Euro characters):
> self assert: 'A¢€' iso8859s15ToRTFEncoding = 'A\u162?\u8364?'
Cool.
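
A small Python sketch of the mapping that assertion describes, assuming
the input is already a Unicode string: ASCII passes through, and anything
else becomes \uN? with N the decimal code point. The name to_rtf_encoding
is hypothetical, and the sketch leaves out two things a real exporter
would need: escaping of backslash and braces, and code points above
32767, which need extra handling in RTF.

```python
def to_rtf_encoding(text: str) -> str:
    """Hypothetical helper: escape non-ASCII characters as RTF \\uN?."""
    parts = []
    for ch in text:
        code_point = ord(ch)
        if code_point < 128:
            parts.append(ch)                    # ASCII is written as is
        else:
            parts.append(f"\\u{code_point}?")   # "?" is the fallback char
    return "".join(parts)


# Cent is U+00A2 (162) and Euro is U+20AC (8364).
assert to_rtf_encoding("A¢€") == "A\\u162?\\u8364?"
```

Working from code points rather than from ISO-8859-15 bytes is what would
let one method serve both String and UnicodeString.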
> Instead of #iso8859s15ToRTFEncoding a new method #toRTFEncoding or #asRTF polymorphic to String and UnicodeString is probably needed, right?
Yes. That sounds like a good idea. Still, I'd check recent RTF
documentation. I'd be really surprised if they don't handle UTF-8
encoding as part of the standard. If they do, maybe all that can simply
be removed, and just replaced with UTF-8 stuff.
> Cheers,
> Bernhard
>
Cheers,
--
Juan Vuletich
cuis.st
github.com/jvuletich
researchgate.net/profile/Juan-Vuletich
independent.academia.edu/JuanVuletich
patents.justia.com/inventor/juan-manuel-vuletich
linkedin.com/in/juan-vuletich-75611b3
twitter.com/JuanVuletich