[Cuis-dev] No regex in parser

Phil B pbpublist at gmail.com
Wed Sep 25 14:15:49 PDT 2019


I *think* what you're asking is 'regexes are pretty powerful, why aren't
languages with relatively simple syntax parsed using them instead of more
complex/powerful parsers?  And how do languages like Smalltalk get by
without them?'  I'll talk about the first question further down.  As to the
second, the types of tasks most Smalltalkers are doing isn't the text
munging that regexes excel at.  For those of us that do, we do use regexes,
Petit, OMeta etc... but we're a pretty minuscule segment of the community
so there's not often a lot of discussion about it.[1] Perl is probably the
best language to look at to see how far you can take regexes (people curse
and praise Perl for this)  It is a well studied field as regexes came out
of formal computer science... it's only how they're often used that can
make them appear half-baked ;-)

I suspect you'll find the how (you do it) *is* the why (you don't often
want to) for language parsing.  You can absolutely define many languages
purely in terms of regexes.  But as the language starts to grow, the
regexes become increasingly complex and brittle.  You might say, 'well
that's easy enough to fix... I'll just break it up into multiple
regexes'... and so you do that.  Now performance tanks because you're
performing multiple passes over the same text with different regexes to
parse it.  It also tends to result in a rats nest of control flow (barely)
holding it all together.  What you end up with is terse and tedious to
debug code that also doesn't perform well.  So what you end up
accomplishing is a variant of Greenspun's tenth rule applied to parsers.[2]

It's an interesting area and one of the reasons I enjoy and use OMeta as
much as I do: it is *so* quick and easy to define a parser using it that it
can effectively be used for ad hoc parsing tasks where I previously would
have used regexes.[3]  One thing OMeta supports, that I don't believe Ohm
does, that is apropos: just as regexes allow you to inject parsing
capabilities into otherwise non-parser code, OMeta's semantic actions allow
you to easily inject non-parsing code into your parsers.  It's an
interesting and useful inversion.

As far as where to look for more information, I'd just start following the
threads on https://en.wikipedia.org/wiki/Regular_expression as almost every
time I started to think 'yeah, but what about X...' there was a reference
to it on the page.

Hope this is useful,
Phil

[1] And if that is primarily/exclusively what you do, you'll probably be
driven to other solutions purely from a performance standpoint.

[2] A quick way to see this for yourself is to define a simple .ini-like
config file format that you parse with a regex: start with a simple
'variable = value' key-value syntax per line.  Now allow multiple
definitions per line.  Now add support for comment lines.  Allow comments
on the variable definition lines.  Add support for multi-line comments.
Add some simple control flow. Allow definitions to reference previous
definitions etc. etc... stop when you've had enough.  How are you feeling
about regexes now? ;-) (I can't tell you how many times in the 80's and
90's I used some program that started life using a pile of regexes.  Almost
inevitably some future release notes would contain a comment along the
lines of "now using a 'real' parser for scripts/config files")

[3] I still use regexes too, but try to limit their scope to where they are
best suited IMO: quick and dirty data munging where you either need to
extract a small bit of text from a larger whole or where the text has a
relatively simple and uniform structure to it.

On Wed, Sep 25, 2019 at 3:09 PM Casey Ransberger via Cuis-dev <
cuis-dev at lists.cuis.st> wrote:

> Hi Mariano,
>
> I’m familiar with PetitParser, OMeta/Ohm as well.
>
> What I’m after is more like, “has anyone studied ad-hoc parsers in depth?”
>
> I have a sneaking suspicion that when the grammar is sufficiently trivial,
> an ad-hoc parser ends up being the smaller, simpler code base versus a
> Chomskyan setup with a parser generator. But, how do I confirm that this
> gut-feeling is correct?
>
> I can point at anecdotal evidence. Smalltalks, Lisps, and Forths don’t
> seem to need the — heh — Chomskyan overhead. Again, the grammars are
> relatively simple, and that’s probably the “why.” But I’m looking for a
> formal inquiry into the “how.”
>
> Maybe there isn’t one and I have to do it myself. Cuis would be a great
> subject, but I’m lazy and not at a college, so I look for research papers.
> What I’m hoping I can find is research into ad-hoc parsers (with a focus on
> the scanners they employ.)
>
> Anyway I think I’ve burned enough bandwidth on something only tangentially
> relevant to the list, so I’ll bow out now.
>
> Thanks for indulging me, people of Cuis :)
>
> —Casey
>
> > On Sep 24, 2019, at 4:54 AM, Mariano Montone via Cuis-dev <
> cuis-dev at lists.cuis.st> wrote:
> >
> > Hello,
> >
> >> El 24/9/19 a las 00:16, Casey Ransberger via Cuis-dev escribió:
> >> First, I should say I’m aware that there are external packages that do
> regular expressions. I’m interested in how we get by without them.
> >>
> >> Found myself marveling again tonight that the Smalltalk parsers of the
> world generally haven’t used them. That really astounded me at first, as I
> had been raised on lex/flex for lexical analysis. Later I learned about
> parsing expression grammars and saw those used for both scanning and
> parsing.
> >>
> >> I still don’t know how to write a scanner without either of those
> tools, I’m afraid. I’m also honestly not terribly familiar with how the
> various Smalltalks go about this.
> >>
> >> For context: I’ve run across a case where I want to build a scanner in
> a language that lacks any regular expression support at all, and thought I
> might look to Cuis for a good example of how to go about it. Sorry that
> this is only tangentially on-topic.
> >>
> >> Questions:
> >>
> >> * What classes / methods should I look at to see the magic?
> >
> > Smalltalk uses all sorts of magic in different areas, but I don't think
> > there's anything special about how it handles parsing. The Scanner and
> > Parser classes are implemented manually.
> >>
> >> * Is there a formal name for the kind of pattern matching that Cuis
> does when tokenizing input? Asking because I’d like to track down research
> papers as a matter of habit.
> >
> > If you are willing to reasearch, then I would look at Parsing Expression
> > Grammars. Smalltalk implemention of PEGs is PetitParser from Lukas
> > Renggli. PetitParser combines PEGs with a combinators api; it is
> > somewhat special, not easily found in other programming languages. The
> > PetitParser package for Cuis works very nicely. And this is cool
> > Smalltalk stuff using them: http://scg.unibe.ch/research/helvetia
> >
> > Cheers,
> >
> > Mariano
> > --
> > Cuis-dev mailing list
> > Cuis-dev at lists.cuis.st
> > https://lists.cuis.st/mailman/listinfo/cuis-dev
>
> --
> Cuis-dev mailing list
> Cuis-dev at lists.cuis.st
> https://lists.cuis.st/mailman/listinfo/cuis-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.cuis.st/mailman/archives/cuis-dev/attachments/20190925/595a00bd/attachment.htm>


More information about the Cuis-dev mailing list