Beautiful work! I'm not even gonna wonder if any of it was AI-generated, because the code is clearly crafted meticulously by an experienced C engineer, very readable, and shorter than I expected.
zombot · 2026-07-01 09:52:39 UTC
So many parser combinators operate on bytes assuming ASCII input only. I'd be more interested in a parser combinator lib that has UTF-8 decoding already abstracted away, operating on `wchar_t`, or even polymorphic input stream element types.
lokeg · 2026-07-01 10:12:04 UTC
Isn't working with the UTF-8 stream sufficient? Especially if you only have ASCII keywords/operators/brackets, I feel an ASCII parser should work with UTF-8 out of the box.
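A minimal sketch of why that tends to hold: in UTF-8, every byte of a multi-byte sequence is 0x80 or above, so a byte that compares equal to an ASCII delimiter is always that delimiter. `count_fields` is a hypothetical helper made up for illustration, not something from the library under discussion.

```c
#include <stddef.h>

/* Hypothetical helper: counts comma-separated fields in a NUL-terminated
 * string by scanning raw bytes, with no UTF-8 decoding at all. */
static size_t count_fields(const char *s) {
    size_t fields = 1;
    for (; *s != '\0'; s++) {
        /* Lead and continuation bytes of multi-byte UTF-8 sequences are
         * all >= 0x80, so this comparison can only match a real ','. */
        if (*s == ',') {
            fields++;
        }
    }
    return fields;
}
```

The same scan would not be safe for encodings such as UTF-16 or Shift-JIS, where an ASCII-valued byte can appear inside a longer character.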
t-3 · 2026-07-01 11:17:09 UTC
Yeah, a parser has no need to understand what a string or glyph is, let alone ASCII or UTF-8. The point is to take a stream of arbitrary data and process it into something that can be reasoned about. Unless you know your input stream is regular in some way, processing it at the finest level of granularity (usually bytes) is probably the only thing to do.
Joker_vD · 2026-07-01 11:33:39 UTC
I'd rather not. Most of the time, you don't need it, and when you do, it's for a very small part of the input. And `wchar_t` is an abomination (it's UTF-32 on Linux, UTF-16 on Windows, and all of that is allowed); you probably really want `char32_t`, and again, not for the whole of the input; streaming such data a single rune/codepoint at a time is probably fine as well for most uses.
On the other hand, if your parser combinators process char-by-char, then maintaining a small "is this valid UTF-8 so far" context on the side should be pretty simple, so providing it would be a useful option. But actually decoding? Please don't.
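Something like the following is what I have in mind for that side context; treat it as a rough sketch, not a proposal for the library's API. All names (`utf8_check`, `utf8_check_feed`, `utf8_valid`) are hypothetical. It validates without decoding: it only tracks how many continuation bytes are still expected and the allowed range for the next one, which is enough to reject overlong forms, surrogates, and anything above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical side context; three bytes of state, no decoding. */
typedef struct {
    uint8_t need; /* continuation bytes still expected */
    uint8_t lo;   /* allowed range for the next continuation byte */
    uint8_t hi;
} utf8_check;

static void utf8_check_init(utf8_check *c) {
    c->need = 0;
    c->lo = 0x80;
    c->hi = 0xBF;
}

/* Feed one byte. Returns false as soon as the input cannot be valid
 * UTF-8; after that the context must be re-initialised before reuse. */
static bool utf8_check_feed(utf8_check *c, uint8_t b) {
    if (c->need > 0) {
        if (b < c->lo || b > c->hi) return false;
        c->need--;
        c->lo = 0x80; /* only the first continuation byte is restricted */
        c->hi = 0xBF;
        return true;
    }
    if (b <= 0x7F) return true;                      /* ASCII */
    if (b >= 0xC2 && b <= 0xDF) { c->need = 1; return true; }
    if (b >= 0xE0 && b <= 0xEF) {
        c->need = 2;
        if (b == 0xE0) c->lo = 0xA0;                 /* no overlong forms */
        if (b == 0xED) c->hi = 0x9F;                 /* no surrogates */
        return true;
    }
    if (b >= 0xF0 && b <= 0xF4) {
        c->need = 3;
        if (b == 0xF0) c->lo = 0x90;                 /* no overlong forms */
        if (b == 0xF4) c->hi = 0x8F;                 /* nothing above U+10FFFF */
        return true;
    }
    return false; /* stray continuation byte, 0xC0, 0xC1, or 0xF5..0xFF */
}

/* True when the stream does not end in the middle of a sequence. */
static bool utf8_check_done(const utf8_check *c) {
    return c->need == 0;
}

/* Convenience wrapper: validate a whole buffer in one call. */
static bool utf8_valid(const char *s, size_t n) {
    utf8_check c;
    utf8_check_init(&c);
    for (size_t i = 0; i < n; i++) {
        if (!utf8_check_feed(&c, (uint8_t)s[i])) return false;
    }
    return utf8_check_done(&c);
}
```

A char-by-char combinator could call `utf8_check_feed` on each byte it consumes and check `utf8_check_done` at token boundaries, leaving the actual parsing byte-oriented.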
RossBencina · 2026-07-01 09:53:57 UTC
Now all it needs is a parser in 'examples/' that parses EBNF grammars and emits a parser in terms of these combinators.
eqvinox · 2026-07-01 12:25:59 UTC
> Flex or Bison generated code is also hard to maintain plus it complicates builds.
This is, in all honesty, a solved problem in any reasonable build system. (And I have little patience left for people making life hard for themselves through their own choices.)