[Cuis-dev] No regex in parser

Thierry Goubier thierry.goubier at gmail.com
Wed Sep 25 23:32:04 PDT 2019


Hi Boris,

Le jeu. 26 sept. 2019 à 03:26, Boris Shingarov via Cuis-dev
<cuis-dev at lists.cuis.st> a écrit :
>
> > I *think* what you're asking is 'regexes are pretty powerful, why aren't languages with relatively simple syntax parsed using them instead of more complex/powerful parsers?
>
> No-no-no, wait.  There is *complexity*, and then there are *levels* of the Chomsky hierarchy.  The two are orthogonal.  You can have an extremely simple LL(1) grammar -- a trivial one-liner, in fact -- which you can't express using regexes, even in a trillion lines (because if you could express it with a regex, it by definition wouldn't be LL(1)).  So, you need to be careful when thinking of context-free as "more complex" than regular -- it can get you in a world of confusion.

Still, you have to admit that: chomsky type 3 are regular expressions
and hence finite state automata, so requiring a single character in
memory (i.e. very simple stuff), and that type 2 (context free
grammars of which LL(1) is a subset) require at least a stack, hence
are inherently more complex.

Now, since this is dependent on the language itself, within a class,
yes, you have more or less complex languages.

What I like, especially with CFGs, is how you can write very simple
grammars that can generates a infinite set of trees that can each be
very large (i.e. recursive definitions). And when you know that you
can enrich that view with things like probabilistic grammars and
parsing... And when you know that grammars can be learned
automatically (i.e. machine learning)...

So much to do and learn. Ok, that's my job, so that's fine :)

Thierry


>
> -----"Cuis-dev" <cuis-dev-bounces at lists.cuis.st> wrote: -----
> To: "Discussion of Cuis Smalltalk" <cuis-dev at lists.cuis.st>
> From: "Phil B via Cuis-dev"
> Sent by: "Cuis-dev"
> Date: 09/25/2019 05:16PM
> Cc: "Phil B" <pbpublist at gmail.com>
> Subject: Re: [Cuis-dev] No regex in parser
>
> I *think* what you're asking is 'regexes are pretty powerful, why aren't languages with relatively simple syntax parsed using them instead of more complex/powerful parsers?  And how do languages like Smalltalk get by without them?'  I'll talk about the first question further down.  As to the second, the types of tasks most Smalltalkers are doing isn't the text munging that regexes excel at.  For those of us that do, we do use regexes, Petit, OMeta etc... but we're a pretty minuscule segment of the community so there's not often a lot of discussion about it.[1] Perl is probably the best language to look at to see how far you can take regexes (people curse and praise Perl for this)  It is a well studied field as regexes came out of formal computer science... it's only how they're often used that can make them appear half-baked ;-)
>
> I suspect you'll find the how (you do it) *is* the why (you don't often want to) for language parsing.  You can absolutely define many languages purely in terms of regexes.  But as the language starts to grow, the regexes become increasingly complex and brittle.  You might say, 'well that's easy enough to fix... I'll just break it up into multiple regexes'... and so you do that.  Now performance tanks because you're performing multiple passes over the same text with different regexes to parse it.  It also tends to result in a rats nest of control flow (barely) holding it all together.  What you end up with is terse and tedious to debug code that also doesn't perform well.  So what you end up accomplishing is a variant of Greenspun's tenth rule applied to parsers.[2]
>
> It's an interesting area and one of the reasons I enjoy and use OMeta as much as I do: it is *so* quick and easy to define a parser using it that it can effectively be used for ad hoc parsing tasks where I previously would have used regexes.[3]  One thing OMeta supports, that I don't believe Ohm does, that is apropos: just as regexes allow you to inject parsing capabilities into otherwise non-parser code, OMeta's semantic actions allow you to easily inject non-parsing code into your parsers.  It's an interesting and useful inversion.
>
> As far as where to look for more information, I'd just start following the threads on https://en.wikipedia.org/wiki/Regular_expression as almost every time I started to think 'yeah, but what about X...' there was a reference to it on the page.
>
> Hope this is useful,
> Phil
>
> [1] And if that is primarily/exclusively what you do, you'll probably be driven to other solutions purely from a performance standpoint.
>
> [2] A quick way to see this for yourself is to define a simple .ini-like config file format that you parse with a regex: start with a simple 'variable = value' key-value syntax per line.  Now allow multiple definitions per line.  Now add support for comment lines.  Allow comments on the variable definition lines.  Add support for multi-line comments.  Add some simple control flow. Allow definitions to reference previous definitions etc. etc... stop when you've had enough.  How are you feeling about regexes now? ;-) (I can't tell you how many times in the 80's and 90's I used some program that started life using a pile of regexes.  Almost inevitably some future release notes would contain a comment along the lines of "now using a 'real' parser for scripts/config files")
>
> [3] I still use regexes too, but try to limit their scope to where they are best suited IMO: quick and dirty data munging where you either need to extract a small bit of text from a larger whole or where the text has a relatively simple and uniform structure to it.
>
> On Wed, Sep 25, 2019 at 3:09 PM Casey Ransberger via Cuis-dev <cuis-dev at lists.cuis.st> wrote:
>>
>> Hi Mariano,
>>
>> I’m familiar with PetitParser, OMeta/Ohm as well.
>>
>> What I’m after is more like, “has anyone studied ad-hoc parsers in depth?”
>>
>> I have a sneaking suspicion that when the grammar is sufficiently trivial, an ad-hoc parser ends up being the smaller, simpler code base versus a Chomskyan setup with a parser generator. But, how do I confirm that this gut-feeling is correct?
>>
>> I can point at anecdotal evidence. Smalltalks, Lisps, and Forths don’t seem to need the — heh — Chomskyan overhead. Again, the grammars are relatively simple, and that’s probably the “why.” But I’m looking for a formal inquiry into the “how.”
>>
>> Maybe there isn’t one and I have to do it myself. Cuis would be a great subject, but I’m lazy and not at a college, so I look for research papers. What I’m hoping I can find is research into ad-hoc parsers (with a focus on the scanners they employ.)
>>
>> Anyway I think I’ve burned enough bandwidth on something only tangentially relevant to the list, so I’ll bow out now.
>>
>> Thanks for indulging me, people of Cuis :)
>>
>> —Casey
>>
>> > On Sep 24, 2019, at 4:54 AM, Mariano Montone via Cuis-dev <cuis-dev at lists.cuis.st> wrote:
>> >
>> > Hello,
>> >
>> >> El 24/9/19 a las 00:16, Casey Ransberger via Cuis-dev escribió:
>> >> First, I should say I’m aware that there are external packages that do regular expressions. I’m interested in how we get by without them.
>> >>
>> >> Found myself marveling again tonight that the Smalltalk parsers of the world generally haven’t used them. That really astounded me at first, as I had been raised on lex/flex for lexical analysis. Later I learned about parsing expression grammars and saw those used for both scanning and parsing.
>> >>
>> >> I still don’t know how to write a scanner without either of those tools, I’m afraid. I’m also honestly not terribly familiar with how the various Smalltalks go about this.
>> >>
>> >> For context: I’ve run across a case where I want to build a scanner in a language that lacks any regular expression support at all, and thought I might look to Cuis for a good example of how to go about it. Sorry that this is only tangentially on-topic.
>> >>
>> >> Questions:
>> >>
>> >> * What classes / methods should I look at to see the magic?
>> >
>> > Smalltalk uses all sorts of magic in different areas, but I don't think
>> > there's anything special about how it handles parsing. The Scanner and
>> > Parser classes are implemented manually.
>> >>
>> >> * Is there a formal name for the kind of pattern matching that Cuis does when tokenizing input? Asking because I’d like to track down research papers as a matter of habit.
>> >
>> > If you are willing to reasearch, then I would look at Parsing Expression
>> > Grammars. Smalltalk implemention of PEGs is PetitParser from Lukas
>> > Renggli. PetitParser combines PEGs with a combinators api; it is
>> > somewhat special, not easily found in other programming languages. The
>> > PetitParser package for Cuis works very nicely. And this is cool
>> > Smalltalk stuff using them: http://scg.unibe.ch/research/helvetia
>> >
>> > Cheers,
>> >
>> > Mariano
>> > --
>> > Cuis-dev mailing list
>> > Cuis-dev at lists.cuis.st
>> > https://lists.cuis.st/mailman/listinfo/cuis-dev
>>
>> --
>> Cuis-dev mailing list
>> Cuis-dev at lists.cuis.st
>> https://lists.cuis.st/mailman/listinfo/cuis-dev
>
> --
> Cuis-dev mailing list
> Cuis-dev at lists.cuis.st
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.cuis.st_mailman_listinfo_cuis-2Ddev&d=DwICAg&c=sPZ6DeHLiehUHQWKIrsNwWp3t7snrE-az24ztT0w7Jc&r=ecC5uu6ubGhPt6qQ8xWcSQh1QUJ8B1-CG4B9kRM0nd4&m=K0HLt6GHSLXj5pUivJrkqqHgJRI_Vs4MxIER3TdWwo4&s=ljZafTCb7iU1Q6v_KUSN6IiKp9Clw4eCC_1TbV2qigQ&e=
> --
> Cuis-dev mailing list
> Cuis-dev at lists.cuis.st
> https://lists.cuis.st/mailman/listinfo/cuis-dev


More information about the Cuis-dev mailing list