<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Hi Ezequiel,<br>
<br>
On top of that,
<a class="moz-txt-link-freetext" href="https://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values">https://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values</a>
defines 3 general categories for "separators", but code point U+00A0
"NO-BREAK SPACE" is marked as a "Separator, Space" (Zs), just
like U+0020, although it does not 'separate' words in the common
sense...<br>
<br>
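To make that concrete, here is a minimal sketch, in Python rather than Smalltalk only because its standard unicodedata module exposes the General Category directly. The helper names (general_category, is_space_separator, group_by_category) are made up for this illustration and are not Cuis methods; the code points are the ones from your list plus U+1680 OGHAM SPACE MARK.<br>
<pre>
```python
import unicodedata

# Code points from the quoted list, plus U+1680 OGHAM SPACE MARK.
CANDIDATES = [32, 9, 10, 13, 12, 160, 8192, 8193, 8194, 8195, 8196, 8197,
              8198, 8199, 8200, 8201, 8202, 8203, 8239, 8287, 12288, 0x1680]


def general_category(code_point):
    """Return the two-letter Unicode General Category, e.g. 'Zs'."""
    return unicodedata.category(chr(code_point))


def is_space_separator(code_point):
    """True when the General Category is Zs (Separator, Space)."""
    return general_category(code_point) == "Zs"


def group_by_category(code_points):
    """Map each General Category to the code points that carry it."""
    groups = {}
    for cp in code_points:
        groups.setdefault(general_category(cp), []).append(cp)
    return groups


if __name__ == "__main__":
    # Print one line per category, code points in U+XXXX notation.
    for cat, cps in sorted(group_by_category(CANDIDATES).items()):
        print(cat, " ".join("U+%04X" % cp for cp in cps))
```
</pre>
With the Unicode data bundled in current Python 3 releases, tab, LF, CR and FF come out as Cc, 8203 (ZERO WIDTH SPACE) as Cf, and everything else in the list, including U+00A0 and the Ogham space mark, as Zs.<br>
<br>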
It would be great to have a better classification according to
UnicodeData.txt (see Character class &gt;&gt; #initialize). I think
a set of new testing methods to classify characters as "whitespace",
"separator", "non drawable", etc., would be a good addition, if you
are in the mood.<br>
<br>
WRT #isSeparator, I think that, unless we rename it to something
like #isSeparatorInSmalltalkCode, it is best to keep the "is
whitespace, separator, non-zero width" semantics it already has.
Adding additional codepoints that satisfy these criteria is ok
though.<br>
<br>
Thanks,<br>
<br>
On 5/8/2024 12:31 PM, Ezequiel Birman via Cuis-dev wrote:
<blockquote
cite="mid:CAOo=t4eykm+fDcpU1dfjxO8x2pZspPs_RqTDOOcbftAr3HCKOQ@mail.gmail.com"
type="cite">
<div dir="ltr">
<div>And of course I forgot there are a lot more visible
separators, like the middle dots in ancient Roman texts,
Phoenician and Aegean scripts... Currently `isSeparator` is
being used during parsing, case conversions, trimming, etc.,
sometimes meaning blank, i.e. non-drawable, and sometimes
meaning any word separator, whether drawable or not.<br>
<br>
</div>
I'll add an isBlank or isDrawable for my use case, but let me
know what you think about adding Unicode space-like separators
to isSeparator.<br>
<br>
<div>-- <br>
</div>
<div>Eze<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, 8 May 2024 at 15:37,
Ezequiel Birman &lt;<a moz-do-not-send="true"
href="mailto:ebirman77@gmail.com">ebirman77@gmail.com</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin: 0px 0px 0px
0.8ex; border-left: 1px solid rgb(204, 204, 204);
padding-left: 1ex;">
<div dir="ltr">
<div>Lately I've started tinkering with text morphs and I
was wondering about UnicodeCodePoint &gt;&gt; #isSeparator. I
needed to (in)validate non-drawable codepoints including
control sequences, but the current implementation doesn't
include the codepoints for thin space, hair space, em
space, etc. Is it on purpose? For what it's worth, I gathered
all the non-drawable codepoints (maybe some are still
missing):<br>
<br>
^ `#(32 9 10 13 12 160 8192 8193 8194 8195 8196 8197 8198
8199 8200 8201 8202 8203 8239 8287 12288)` statePointsTo:
value<br>
</div>
<div><br>
</div>
<div>Also, I learned that there is one separator that *is*
drawable: The Ogham space mark. Probably, it should be
included too, unless I am misunderstanding the semantics
of isSeparator.<br>
<br>
I should have added comments describing each codepoint;
will do ASAP.<br>
<br>
-- <br>
</div>
<div>Eze<br>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br>
<br>
<pre class="moz-signature" cols="72">--
Juan Vuletich
cuis.st
github.com/jvuletich
researchgate.net/profile/Juan-Vuletich
independent.academia.edu/JuanVuletich
patents.justia.com/inventor/juan-manuel-vuletich
linkedin.com/in/juan-vuletich-75611b3
twitter.com/JuanVuletich</pre>
</body>
</html>