<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <meta content="text/html; charset=UTF-8" http-equiv="Content-Type"> </head> <body bgcolor="#ffffff" text="#000000"> Hi Ezequiel,<br> <br> On top of that, <a class="moz-txt-link-freetext" href="https://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values">https://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values</a> defines 3 general categories for "separators", but code point U+00A0 "NO-BREAK SPACE" is marked as a "Separator, Space" (Zs), just like U+0020, although it does not 'separate' words in the common sense...<br> <br> It would be great to have a better classification according to UnicodeData.txt (see Character class >> #initialize). I think a set of new testing methods to classify characters as "whitespace", "separator", "non-drawable", etc., would be a good addition, if you are in the mood.<br> <br> WRT #isSeparator, I think that, unless we rename it to something like #isSeparatorInSmalltalkCode, it is best to keep the "is whitespace, separator, non-zero width" semantics it already has. Adding additional codepoints that satisfy these criteria is ok though.<br> <br> Thanks,<br> <br> On 5/8/2024 12:31 PM, Ezequiel Birman via Cuis-dev wrote: <blockquote cite="mid:CAOo=t4eykm+fDcpU1dfjxO8x2pZspPs_RqTDOOcbftAr3HCKOQ@mail.gmail.com" type="cite"> <div dir="ltr"> <div>And of course I forgot there are a lot more visible separators, like the middle dots in ancient Roman texts, Phoenician and Aegean scripts... Currently `isSeparator` is being used during parsing, case conversions, trimming, etc. Sometimes meaning blank, i.e. 
non-drawable, and sometimes meaning any word separator whether drawable or not.<br> <br> </div> I'll add an isBlank or isDrawable for my use case, but let me know what you think about adding Unicode space-like separators to isSeparator.<br> <br> <div>-- <br> </div> <div>Eze<br> </div> </div> <br> <div class="gmail_quote"> <div dir="ltr" class="gmail_attr">On Wed, 8 May 2024 at 15:37, Ezequiel Birman &lt;<a moz-do-not-send="true" href="mailto:ebirman77@gmail.com">ebirman77@gmail.com</a>&gt; wrote:<br> </div> <blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> <div dir="ltr"> <div>Lately I've started tinkering with text morphs and I was wondering about UnicodeCodePoint >> #isSeparator. I needed to (in)validate non-drawable codepoints including control sequences, but the current implementation doesn't include the codepoints for thin space, hair space, em space, etc. Is it on purpose? For what it's worth, I gathered all the non-drawable codepoints (maybe some are still missing):<br> <br> ^ `#(32 9 10 13 12 160 8192 8193 8194 8195 8196 8197 8198 8199 8200 8201 8202 8203 8239 8287 12288)` statePointsTo: value<br> </div> <div><br> </div> <div>Also, I learned that there is one separator that *is* drawable: the Ogham space mark. It should probably be included too, unless I am misunderstanding the semantics of isSeparator.<br> <br> I should have added comments describing the codepoints, will do asap.<br> <br> -- <br> </div> <div>Eze<br> </div> </div> </blockquote> </div> </blockquote> <br> <br> <pre class="moz-signature" cols="72">-- Juan Vuletich cuis.st github.com/jvuletich researchgate.net/profile/Juan-Vuletich independent.academia.edu/JuanVuletich patents.justia.com/inventor/juan-manuel-vuletich linkedin.com/in/juan-vuletich-75611b3 twitter.com/JuanVuletich</pre> </body> </html>