<div dir="ltr"><div>I *think* what you're asking is 'regexes are pretty powerful, why aren't languages with relatively simple syntax parsed using them instead of more complex/powerful parsers? And how do languages like Smalltalk get by without them?' I'll talk about the first question further down. As to the second, the types of tasks most Smalltalkers are doing isn't the text munging that regexes excel at. For those of us that do, we do use regexes, Petit, OMeta etc... but we're a pretty minuscule segment of the community so there's not often a lot of discussion about it.[1] Perl is probably the best language to look at to see how far you can take regexes (people curse and praise Perl for this) It is a well studied field as regexes came out of formal computer science... it's only how they're often used that can make them appear half-baked ;-)</div><div><br></div><div>I suspect you'll find the how (you do it) *is* the why (you don't often want to) for language parsing. You can absolutely define many languages purely in terms of regexes. But as the language starts to grow, the regexes become increasingly complex and brittle. You might say, 'well that's easy enough to fix... I'll just break it up into multiple regexes'... and so you do that. Now performance tanks because you're performing multiple passes over the same text with different regexes to parse it. It also tends to result in a rats nest of control flow (barely) holding it all together. What you end up with is terse and tedious to debug code that also doesn't perform well. So what you end up accomplishing is a variant of Greenspun's tenth rule applied to parsers.[2]</div><div><div><br></div><div>It's an interesting area and one of the reasons I enjoy and use OMeta as much as I do: it is *so* quick and easy to define a parser using it that it can effectively be used for ad hoc parsing tasks where I previously would have used regexes.[3] One thing OMeta supports, that I don't believe Ohm does, that is apropos: just as regexes allow you to inject parsing capabilities into otherwise non-parser code, OMeta's semantic actions allow you to easily inject non-parsing code into your parsers. It's an interesting and useful inversion.</div></div><div><br></div><div>As far as where to look for more information, I'd just start following the threads on <a href="https://en.wikipedia.org/wiki/Regular_expression">https://en.wikipedia.org/wiki/Regular_expression</a> as almost every time I started to think 'yeah, but what about X...' there was a reference to it on the page.</div><div><br></div><div>Hope this is useful,</div><div>Phil</div><div><br></div><div>[1] And if that is primarily/exclusively what you do, you'll probably be driven to other solutions purely from a performance standpoint.</div><div><br></div><div>[2] A quick way to see this for yourself is to define a simple .ini-like config file format that you parse with a regex: start with a simple 'variable = value' key-value syntax per line. Now allow multiple definitions per line. Now add support for comment lines. Allow comments on the variable definition lines. Add support for multi-line comments. Add some simple control flow. Allow definitions to reference previous definitions etc. etc... stop when you've had enough. How are you feeling about regexes now? ;-) (I can't tell you how many times in the 80's and 90's I used some program that started life using a pile of regexes. Almost inevitably some future release notes would contain a comment along the lines of "now using a 'real' parser for scripts/config files")</div><div><br></div><div>[3] I still use regexes too, but try to limit their scope to where they are best suited IMO: quick and dirty data munging where you either need to extract a small bit of text from a larger whole or where the text has a relatively simple and uniform structure to it.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Sep 25, 2019 at 3:09 PM Casey Ransberger via Cuis-dev <<a href="mailto:cuis-dev@lists.cuis.st">cuis-dev@lists.cuis.st</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Mariano,<br>
<br>
I’m familiar with PetitParser, OMeta/Ohm as well. <br>
<br>
What I’m after is more like, “has anyone studied ad-hoc parsers in depth?”<br>
<br>
I have a sneaking suspicion that when the grammar is sufficiently trivial, an ad-hoc parser ends up being the smaller, simpler code base versus a Chomskyan setup with a parser generator. But, how do I confirm that this gut-feeling is correct?<br>
<br>
I can point at anecdotal evidence. Smalltalks, Lisps, and Forths don’t seem to need the — heh — Chomskyan overhead. Again, the grammars are relatively simple, and that’s probably the “why.” But I’m looking for a formal inquiry into the “how.”<br>
<br>
Maybe there isn’t one and I have to do it myself. Cuis would be a great subject, but I’m lazy and not at a college, so I look for research papers. What I’m hoping I can find is research into ad-hoc parsers (with a focus on the scanners they employ.)<br>
<br>
Anyway I think I’ve burned enough bandwidth on something only tangentially relevant to the list, so I’ll bow out now. <br>
<br>
Thanks for indulging me, people of Cuis :)<br>
<br>
—Casey<br>
<br>
> On Sep 24, 2019, at 4:54 AM, Mariano Montone via Cuis-dev <<a href="mailto:cuis-dev@lists.cuis.st" target="_blank">cuis-dev@lists.cuis.st</a>> wrote:<br>
> <br>
> Hello,<br>
> <br>
>> El 24/9/19 a las 00:16, Casey Ransberger via Cuis-dev escribió:<br>
>> First, I should say I’m aware that there are external packages that do regular expressions. I’m interested in how we get by without them. <br>
>> <br>
>> Found myself marveling again tonight that the Smalltalk parsers of the world generally haven’t used them. That really astounded me at first, as I had been raised on lex/flex for lexical analysis. Later I learned about parsing expression grammars and saw those used for both scanning and parsing.<br>
>> <br>
>> I still don’t know how to write a scanner without either of those tools, I’m afraid. I’m also honestly not terribly familiar with how the various Smalltalks go about this.<br>
>> <br>
>> For context: I’ve run across a case where I want to build a scanner in a language that lacks any regular expression support at all, and thought I might look to Cuis for a good example of how to go about it. Sorry that this is only tangentially on-topic.<br>
>> <br>
>> Questions: <br>
>> <br>
>> * What classes / methods should I look at to see the magic?<br>
> <br>
> Smalltalk uses all sorts of magic in different areas, but I don't think<br>
> there's anything special about how it handles parsing. The Scanner and<br>
> Parser classes are implemented manually.<br>
>> <br>
>> * Is there a formal name for the kind of pattern matching that Cuis does when tokenizing input? Asking because I’d like to track down research papers as a matter of habit.<br>
> <br>
> If you are willing to reasearch, then I would look at Parsing Expression<br>
> Grammars. Smalltalk implemention of PEGs is PetitParser from Lukas<br>
> Renggli. PetitParser combines PEGs with a combinators api; it is<br>
> somewhat special, not easily found in other programming languages. The<br>
> PetitParser package for Cuis works very nicely. And this is cool<br>
> Smalltalk stuff using them: <a href="http://scg.unibe.ch/research/helvetia" rel="noreferrer" target="_blank">http://scg.unibe.ch/research/helvetia</a><br>
> <br>
> Cheers,<br>
> <br>
> Mariano<br>
> -- <br>
> Cuis-dev mailing list<br>
> <a href="mailto:Cuis-dev@lists.cuis.st" target="_blank">Cuis-dev@lists.cuis.st</a><br>
> <a href="https://lists.cuis.st/mailman/listinfo/cuis-dev" rel="noreferrer" target="_blank">https://lists.cuis.st/mailman/listinfo/cuis-dev</a><br>
<br>
-- <br>
Cuis-dev mailing list<br>
<a href="mailto:Cuis-dev@lists.cuis.st" target="_blank">Cuis-dev@lists.cuis.st</a><br>
<a href="https://lists.cuis.st/mailman/listinfo/cuis-dev" rel="noreferrer" target="_blank">https://lists.cuis.st/mailman/listinfo/cuis-dev</a><br>
</blockquote></div>