jasode 15 days ago

Like the other sibling comments mentioned, I too was confused about "100x faster than regex" and what the actual product was about.

After digging around their website, I found this blog post which explains it better:

https://blog.nezaboodka.com/post/2019/594-using-nevod-for-te...

So my summary would be:

1) it works "faster" than regex in the specific scenario of treating text as natural-language entities (e.g. higher conceptual abstractions such as qualifiers, variations, etc.). If one were to reconstruct Nevod's rules using pure traditional regex (a very complicated regex), executing that regex would be slower because regex is more of a "dumb" character-sequence matching engine rather than a higher-level natural-language parser.

2) it's currently an unreleased "text search engine" that presumably will be licensed for you to integrate into your own software. The text matching engine is currently only used in their proprietary database engine. Whether the engine is a library one statically links in, like SQLite, or a separate runtime like Elasticsearch that you make API calls to, I don't know.

I notice the CEO is Yury Chetyrko and the submitter is ychetyrko, so maybe he can explain in more detail what exactly Nevod is.

  • bazzargh 15 days ago

    My reading of that is that they compared apples and oranges. They didn't use NLP there at all (stemming, parts of speech tagging, etc); they just relaxed handling of whitespace. The Nevod 'equivalent' was a less general expression to make it seem more maintainable.

    The Nevod example translates to (ruby):

        pattern = Regexp.new( "(?<Name>ejection fraction|LVEF)( by visual inspection)?
                               (?<Qualifier>(is|of)( (at least|about|greater than|less than|equal to))?)
                               (?<Value>[0-9]+(-[0-9]+)?|normal|moderate|severe)".gsub(/\s+/, "\\s+"), 
                             Regexp::IGNORECASE)
    
        ["ejection fraction is at least 70-75",
        "ejection  fraction of about 20",
        "ejection fraction  of 60",
        "ejection  fraction of greater than 65",
        "ejection fraction of 55",
        "ejection fraction by visual  inspection is 65",
        "LVEF is normal"].each do |line|
          puts line
          puts pattern.match(line).inspect
        end
    
    The only trick I used was to substitute literal whitespace in the regex with a whitespace pattern, so that the typed regex was more readable.
    • mikecapizzi 15 days ago

      If you're looking for a tool that allows you to incorporate legitimate NLP approaches, you should have a look at `odin`. Here's a paper https://doi.org/10.1093/database/bay098 showing its usage in the medical domain.

      And the code is open-sourced as part of the `processors` library out of the CLULab at the University of Arizona: https://github.com/clulab/processors

      The most detailed (though not completely up-to-date) documentation is probably in the manual here: https://arxiv.org/abs/1509.07513

      I'm using it at my current job to build an analysis tool for customer-agent phone calls.

      It allows you to build rules that match on different levels of abstraction: tokens, pos-tags, dependency paths. You can even match tokens based on word similarity (as measured by cosine similarity of word vectors).

      And these rules can "cascade" (i.e. build off of each other). So you can find an entity or event in rule 1 and then look for how that interacts with another matched entity or event in a later rule.

  • Cu3PO42 15 days ago

    That sounds very much like what Rust's RegEx engine does. I understand it extracts longer literals (when available) to find a starting point for where a match might be according to [1].

    [1] https://blog.burntsushi.net/ripgrep/#literal-optimizations

    • glangdale 15 days ago

      This is not a new technique with the Rust regex engine. We had a considerably more comprehensive literal 'factoring' approach in Hyperscan about a decade earlier (which also satisfied a lifelong ambition of mine; specifically, misusing a network-flow algorithm on a graph for something).

      The multiple literal implementation in that matcher is also a partial lift from Hyperscan's "Teddy" small group literal matcher (not salty about that as "burntsushi" has been very clear about inspiration). I do wish they'd pull in the rest of the algorithm at some point - the bits that allow merging of Teddy literals into buckets to reduce false positive rates...

      • burntsushi 14 days ago

        While we're wishing for things, I'd love to see you write up how some of the algorithms (like Teddy) work. Hyperscan is clearly a treasure trove of them, but the effort required to extract its secrets is quite large! For example, I've spent quite a bit of time perusing its code (and your excellent mailing list posts), and I still don't think I could describe its execution flow at a high level.

        As far as literals go, I was doing that well before I heard about Hyperscan. I got the idea from hints in the literature. (On mobile or else I'd provide a link.)

        • glangdale 14 days ago

          We have a paper accepted to NSDI, so this will hopefully shed some light on things. A conference paper is, sadly, too small to fit more than a few subsystems so a writeup of Teddy isn't included.

          As I don't work for Intel any more, it's unlikely that I will put in the effort for anything significantly more comprehensive. I am considering building another regex matcher, but it wouldn't take the encyclopedic-list-of-optimizations approach. Paul Terra refers to Hyperscan as the "Dwarf Fortress of regular expression matchers". I have some new ideas I'd like to try out, in any case.

          I wasn't suggesting that your use of literal factoring comes from Hyperscan, only the Teddy matcher. Literal factoring seems an old technique and I don't know the genuinely first cite of that.

          • burntsushi 14 days ago

            Gotcha. Thanks for the clarification! I look forward to reading your paper! I'll hold out hope for the folks working on Hyperscan at Intel to do some more in-depth write-ups.

      • Cu3PO42 14 days ago

        I didn't mean to insinuate that it was. The Rust RegEx engine was simply the first that came to mind for me. What I did intend to say was that if this is indeed the optimisation that Nevod claims it has, it is not the first to do it.

        However I did take the time to read up on Hyperscan and would like to thank you for contributing to what looks like an excellent tool.

Cynddl 15 days ago

> Nevod is a language and technology that provide pattern-based text search. Nevod is specially aimed to rapidly reveal entities and their relationships in texts written in the natural language. This patent pending technology is unique on the market.

What is this? A website, a tool? Can I run it locally? Is it free software, open source, closed source? Where is the information about the patent?

Also, you mention it's easier (why?) and a hundred times faster, but I don't see a comparison with traditional regular expressions anywhere.

  • OskarS 15 days ago

    When it comes to any kind of programming library:ish thing, there's no bigger turn-off than "patent pending technology".

    No thanks. Off you go.

  • codazoda 15 days ago

    The patent kills it for me even though it may just be for defense (I didn't look and it sounds like they didn't say).

    • airstrike 15 days ago

      Well, it will eventually expire so we can revisit this then

  • dmix 15 days ago

    You’d think they’d at the very minimum back up the 100x claim on their website. It’s repeated on the homepage.

Cu3PO42 15 days ago

The claims seem dubious to me as well. What exactly does faster mean here? Since a speedup of two orders of magnitude is mentioned I'm assuming it refers to matching speed.

Is it faster than PCRE? Is it faster than some other engine or is the claim that it's faster than any available RegEx engine? I had a quick look at the reference and this looks like it will accept context-free languages (since recursion is allowed). I strongly doubt that a CFG parser is magically faster than any RegEx engine.

> It is hundreds of times faster than the well-known regular expressions, and the speed doesn’t degrade linearly when you add patterns.

The author seems to believe that regex matching time is linear in the size of the pattern? Once compiled to a DFA, the pattern only has negligible influence on the matching time.

EDIT: I've been trying to figure out what kind of parser this generates. It might be an LR(k) parser? It breaks on the following (admittedly contrived) example:

  #S = E + End;
  E = {E + "+" + E, E + "*" + E, T};
  T = {[1+]Alpha, N + "()"};
  N = {P, Q};
  P = {"a" + P, "ab"};
  Q = [1+]"a";
with an input of "aaaaaaab()+beta*c"
  • MaxBarraclough 14 days ago

    > Once compiled to a DFA, the pattern only has negligible influence on the matching time.

    With the right optimisations, certain longer patterns can be much faster to scan for, on account of the Boyer-Moore approach.

    If I asked you to search through a book for 10 consecutive pages of the letter 'Q', you wouldn't need to check every page. The same optimisation can be applied to regex. (Not that most regex implementations bother to do it, though.)
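
    Roughly, the Horspool flavour of that skip idea looks like this (a toy Python sketch of my own, not taken from any real regex engine):

        # Toy sketch of the Boyer-Moore/Horspool skip idea (illustrative only,
        # not how any real regex engine implements it).
        def horspool_find(text, pattern):
            m, n = len(pattern), len(text)
            if m == 0:
                return 0
            # How far we may shift when a given character ends the current window;
            # characters that never occur in the pattern allow a full shift of m.
            shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
            i = m - 1
            while i < n:
                if text[i - m + 1:i + 1] == pattern:
                    return i - m + 1             # found a match
                i += shift.get(text[i], m)       # often skips whole "pages"
            return -1

        # Looking for ten consecutive 'Q's barely touches the surrounding text.
        print(horspool_find("x" * 1000 + "Q" * 10 + "x" * 1000, "Q" * 10))  # 1000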

    • Cu3PO42 14 days ago

      You're right. What I meant to say is that for any (formal) regular expression, we can compile it to a DFA and then match any string in time linear in the length of the string.

      Certainly the pattern length matters, both for pre-processing (compiling to a DFA) and for the runtime of a Boyer-Moore approach. However, as you mentioned, in the Boyer-Moore average case of Θ(n/m) a longer pattern is faster, rather than slower as the page on Nevod seems to imply.

jehna1 15 days ago

Here's a somewhat related project that aims to make regular expressions more verbose:

https://github.com/VerbalExpressions/JSVerbalExpressions

Here's a simple example for matching URLs:

    const tester = VerEx()
      .startOfLine()
      .then('http')
      .maybe('s')
      .then('://')
      .maybe('www.')
      .anythingBut(' ')
      .endOfLine();
The project has been ported to many different languages and it outputs a normal regular expression that you can use to match your text.
Drup 15 days ago

This is cute! It's basically a nice syntax for full parsing of extended regular expressions. It would make for a very nice tool. Far too often, regular expression tools only implement "matching" (i.e., extract a list of strings). Full parsing gives you the complete parse tree, and thus preserves more structural information, especially under repetitions.

I'm very much in support of anything that uses regex parsing instead of matching. Matching is nearly always the wrong tool and causes more bugs than anything. The only reason it's used is that the engines are easier to implement. I recently wrote a small paper on how to retrofit parsing on top of an engine that only gives you matching (https://gabriel.radanne.net/papers/tyre/tyre_paper.pdf).
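
Here is a tiny Python illustration of what plain matching throws away under repetition (using the stdlib re module as the matching-only engine):

    import re

    # Under "matching", a capture group inside a repetition only remembers
    # its last iteration, so the structure of the repeated part is lost.
    m = re.fullmatch(r"(?:(\w+),)+", "foo,bar,baz,")
    print(m.group(1))                              # 'baz' -- only the last one

    # Recovering the pieces needs a second scan, which full parsing avoids.
    print(re.findall(r"(\w+),", "foo,bar,baz,"))   # ['foo', 'bar', 'baz']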

Given the amount of prior art in the academic community, patenting this is probably worthless, but eh ...

  • perlgeek 15 days ago

    The Perl 6 grammar engine does full parsing, and gives you the parse tree that matches the structure of the rules.

    • Drup 15 days ago

      Well, except Perl doesn't parse regular grammars (it parses much more) and is far from being "one pass" (since the complexity guarantees are not valid anymore) ....

      • tux1968 15 days ago

        Regular grammars are a strict subset of what Perl 6 will parse, no? If you stick to that subset it will parse in a single pass.

        • Drup 15 days ago

          Except Perl 6 doesn't enforce it, so you have to guess yourself if you are really in the regular case. Additionally, PCRE used to be exponential for certain "bad cases", some of which were (truly) regular expressions.

          My point is simply that Perl doesn't give you any guarantee except "we will try to parse it". Maybe you will hit the right case for the right version of Perl, who knows? The Web is full of DDoS attacks based on exponential PCREs.

          With actual regular expressions, the complexity is guaranteed.

          • perlgeek 15 days ago

            > My point is simply that Perl doesn't give you any guarantee except "we will try to parse it".

            Perl 6 gives you pretty good control over where to use backtracking and where not. It parses regular languages with an NFA or DFA, and can switch to a backtracking engine.

            More details in https://www.apress.com/us/book/9781484232279 (sorry for the plug, wrote it myself).

          • pawelmurias 15 days ago

            The way regular expressions work in Perl doesn't randomly degrade across Perl versions. For a given regex you can determine what its complexity would be across versions, just like you would for an arbitrary chunk of Perl code. It's true you don't always get a complexity guarantee for free like in a DFA implementation of formal regular expressions.

          • b2gills 12 days ago

            Actually Perl6 does enforce it if you tell it to. The following fails to match.

                'abc' ~~ m:ratchet/ .* c /
            
            The `.*` gobbles everything up, which means there is nothing left for `c`

            When writing grammars, use `token` which turns on `:ratchet` for you.

          • cestith 15 days ago

            PCRE isn't Perl.

  • centerOfMass 15 days ago

    This might be the real value from this thread.

    Any more resources on this? Is this the same as "Parsing expression grammar"?

    • Drup 15 days ago

      No, parsing expression grammars (PEGs) are not regular grammars at all.

      For more resources, I think the related work section of my paper has the main ones. The "Kleene meets Church" project has lots of very good publications on the topic: https://di.ku.dk/kmc/publications/

feanaro 15 days ago

I see two novel aspects of this language:

1. the ability to easily break patterns into named subpatterns which can be referenced later on

2. the `@` operator which gives you the ability to talk about things inside these subpatterns

These seem like worthwhile additions which would make regex more manageable. I don't see any reason why the whole of regex would need to be abandoned for some completely new, potentially proprietary technology, though.

It also reminds me of parser combinators (in the form they are popular in Haskell, for instance).

glangdale 15 days ago

It looks like there are some interesting things in this project beyond regex, but the way that it skates over the fact that there are already automata-based multiple pattern matchers is pretty rank.

When we started developing automata-based pattern matching s/w in 2006 - 2006! - we were well aware that it wasn't really a new thing. 12 years of work on Hyperscan (https://github.com/intel/hyperscan), 1 acquisition and 1 open source release later - as well as multiple other projects doing similar things (re2, Rust regex, Parabix, icgrep, Rosie Pattern Language, ...) has only added to this crowded field. All this has taught me many things, but the thing that I keep returning to is that "a lot of people apparently don't know about prior art, or assiduously pretend not to". It will be interesting to see what they try to patent.

JeffRosenberg 15 days ago

> Pattern definition language is simple and clear, thus very easy to learn

Once you get into more complicated expressions, I don't see this as much easier or simpler than regexes. For example, this expression from the tutorial looks as much like gibberish to me as an equivalent regex:

    Domain = Word + [1+] ("." + Word + [0+] {Word, "_", "-"});
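
For comparison, here is my rough guess at an equivalent regex in Python, assuming Nevod's Word means a run of letters and digits (an approximation of mine, not an official translation):

    import re

    # Rough regex rendering of the Nevod pattern above, assuming `Word`
    # means a run of letters/digits. My own approximation only.
    domain = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+[A-Za-z0-9_-]*)+")

    print(bool(domain.fullmatch("blog.nezaboodka.com")))  # True
    print(bool(domain.fullmatch("not a domain")))         # False
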
That said, I can see some possible value in pattern matching based on text tokens, rather than individual characters. I'm sure there is a subset of pattern matching problems that could be solved more simply using this.
  • brudgers 15 days ago

    In the end, automata are automata, and anything that tries to do what regexes do winds up looking a lot like regexes -- which look a lot like regular expressions (as a representation of a finite state machine). The difference between "regex alternatives" and regexes is that the alternatives tend to be more verbose and less well documented... and less likely to elicit good answers on StackOverflow... and perhaps less likely to generate good questions there.

    I think the hard part of pattern matching is reasoning about pattern matching. The obscurity of Regex notation is mostly a function of unfamiliarity with the concepts. [:word:]+ is not easier to reason about than \w+ and "\w+" is much better documented than "[:word:]+" or "Word + [1+]."

    The other problem with learning regexes is that regex notation is someone else's code. There's always the attraction of fixing it. I've Dunning-Kruger'ed it myself. Fortunately, making my new, more sensible, superduper regex notation complete required RTFM'ing... and once I'd read the manual, I realized I'd already fixed regex notation by fixing the absence of knowledge in my head. Plus I could talk to other people about pattern matching using the common language of pattern matching.

  • JadeNB 15 days ago

    > That said, I can see some possible value in pattern matching based on text tokens, rather than individual characters.

    It just sounds like Perl's extended regular expressions.

BeeOnRope 15 days ago

Any details and/or reproducible benchmarks backing up the "100s of times faster" claim?

ychetyrko 15 days ago

First of all, thanks everyone for the valuable feedback, criticism, suggestions, etc. Truly appreciate that!

And thanks for their patience to all the people who are playing with Nevod right now and getting errors. We have unexpectedly high interest, and the number of visitors to our playground is very high. Meanwhile, please keep in mind that it's a preview of the technology.

Let me clarify a few things. The speed of Nevod is based on two things:

1. Nevod operates on words, not individual characters. From my 30 years of coding experience, I would say that roughly 95% of text-processing rules are word-based. So it covers most tasks.

2. Nevod matches MULTIPLE patterns against document in ONE PASS. Patterns/expressions are indexed by a state machine and filtered effectively during matching.

So, at a RULE DECK of 1000 patterns, we got a 300x improvement compared to RegExp. This is the kind of task where you need to, say, highlight syntax, massively verify and route incoming requests in the cloud, perform massive text validations, track certain phrases and relations in massive human communication, etc.

With a growing number of patterns, the difference between Nevod and RegExp gets larger and larger and goes far beyond 300 times. So, I kind of disagree with the moderators removing the word "hundreds" from the headline. :)

We will publish benchmark results, and we will also release "negrep" (Nevod GREP), a command-line tool for Windows, Linux, and Mac, so everyone will be able to run "negrep", play with it, and verify the benchmark themselves. Generally, Nevod technology will be available in the form of a library, a command-line tool, and a service.

  • glangdale 15 days ago

    > 2. Nevod matches MULTIPLE patterns against document in ONE PASS. Patterns/expressions are indexed by a state machine and filtered effectively during matching.

    This is a good idea. It's such a good idea that we did it in 2006, sold a company to Intel based around it in 2013, and open-sourced the resulting library (https://github.com/intel/hyperscan) in 2015.

    All multi-pattern systems get a huge improvement over one-at-a-time regex. It would be embarrassing if they did not. Our rule of thumb for Hyperscan was that it should be about 5-10x faster than libpcre on single patterns and the gap should get progressively bigger.

    I don't doubt that there is quite a bit of interesting work (entirely outside of regex - the word-based focus looks interesting) going on in this project, but the fact that you couldn't be bothered to Google around for a sensible baseline is not encouraging.

    Also, which bit of your system do you think is novel enough to patent? You may find considerable prior art out there, which you don't really seem to be aware of based on how much emphasis you're placing on the not-entirely-novel idea of using a state machine.

    • ychetyrko 14 days ago

      Geoff, for me it's a real honor that you, the well-known creator of Hyperscan, paid attention to what we are doing. We truly respect all the prior work in this field (Hyperscan, RE2, Rust, etc.), so different benchmarks will be available.

      However, we all need to keep in mind that Nevod operates on words, not characters, and that Nevod targets large rule decks (1000-100000 patterns). So, we don't compete in the field of individual pattern matching or streaming matching (to some extent, the goals of Hyperscan and Nevod are different). In other words, a lot depends on the target goals, which are usually reflected in the kind of patterns and in their number.

      Regarding the patent: sorry, for legal reasons I can't expose details at the moment.

      Thank you for your valuable comments.

      • glangdale 14 days ago

        I don't think I'm all that well-known. :-)

        Staying away from streaming is a good idea for a lot of designs. We all lost a lot of our hair making streaming work in Hyperscan, and it was often the thing suppressing easier optimizations. Or at least we would have to have a separate approach for streaming and block mode; not exactly great s/w engineering.

        Word-based approaches are interesting. So is large scale. Hyperscan does embarrassingly badly on natural language stuff at scale due to some baked in assumptions about relative frequencies of literal factors and the literal matchers tend to fall apart on really huge literal sets in this domain. So there's definitely an opportunity here.

        I would strongly encourage you to produce some repeatable benchmarks (at least, things where we can see what the task is for something like RE2 or Hyperscan). We spent many years in stealth mode with a lot of superlatives and NDA-only Powerpoint decks and I don't think it did ourselves any favors. It means that you wind up being the only guardian of your own honesty, which is ... tricky ("What the hell is a medium-complexity pattern, anyhow?").

        • ychetyrko 12 days ago

          Based on what you are saying, I'm concluding that we are on the right path. Again, thanks a lot for all the valuable feedback and advice!

    • nostrademons 15 days ago

      Isn't this basically what regexp-based lexer generators (eg. lex, 1975) do? Stdlib regexps need to run multiple passes because the library itself has no idea whether it's being used with one pattern or multiple patterns, but any lexer generator that has all the patterns at once (i.e. all of them) can merge the DFAs generated and match the input in a single pass.

      Hell, I've been known to do quick & dirty versions of this technique by concatenating all the patterns together with "|" before passing them to Pattern.compile. It takes all of one line of code. Backtracking regexp engines don't really benefit from this, but DFA-based regexp engines like RE2 can get massive speedups.
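
      In Python, that one-liner trick looks roughly like this (made-up rules; and as noted, the stdlib re engine backtracks, so the big speedups need a DFA-based engine like RE2):

          import re

          # Three independent "rules" folded into one compiled pattern; the
          # index of the group that matched tells us which rule fired.
          rules = [r"\d{4}-\d{2}-\d{2}", r"[a-z.]+@[a-z.]+", r"https?://\S+"]
          combined = re.compile("|".join("(%s)" % r for r in rules))

          for text in ["2019-02-05", "someone@example.com", "plain words"]:
              m = combined.search(text)
              print(text, "->", f"rule {m.lastindex}" if m else "no match")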

      • glangdale 14 days ago

        The idea of compiling everything together into a single DFA is very old. However, this doesn't always work - it will result in a combinatorial explosion in many cases (informally, if the patterns contain a lot of states that overlap with other potential states you might be in from other patterns or elsewhere in that pattern). So DFA merging is a dark art and many people spend whole careers taping things to the side of the basic DFA algorithm to handle this.

        This doesn't seem to come up with most lexers because the idea of using some crazy overlapping tokens doesn't sound great - i.e. you're probably not going to want to allow your tokens to overlap that much. So it's not a problem for lex, but it might be a problem if you're scanning 10,000 text analytics patterns simultaneously.

        Concatenating patterns is fine for RE2 - and RE2 can survive this - to a point. However, RE2 just falls back to a 'survivable' Thompson NFA regex when determinization fails and has no other strategies.

        For backtracking engines, concatenating patterns with "|" is equivalent to just trying one pattern, then the next, etc.

  • theamk 15 days ago

    > 2. Nevod matches MULTIPLE patterns against document in ONE PASS. Patterns/expressions are indexed by a state machine and filtered effectively during matching.

    Regexps do that too. Specifically, there is a whole class of "lexers"/"tokenizers" which are regexp-based and which specialize in matching thousands of patterns against the text in one pass.

    (You can kinda approximate it with regular regexps by doing multiple rules at once, each in its own group, and seeing which groups are non-empty, as in "(rule1)|(rule2)|(rule3)" and so on. If your regexp library is good, it will match this in time independent of the number of rules. But real lexers will have nicer syntax.)
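
    For instance, Python's stdlib re supports a named-group version of this directly; here is a toy one-pass lexer (hypothetical token rules, just to show the mechanics):

        import re

        # Toy rule set; each named group is one "rule" and the whole text is
        # scanned in a single pass.
        token_spec = [
            ("NUMBER", r"\d+"),
            ("NAME",   r"[A-Za-z_]\w*"),
            ("OP",     r"[+\-*/=]"),
            ("SKIP",   r"\s+"),
        ]
        lexer = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in token_spec))

        for m in lexer.finditer("total = 3 * width + 7"):
            if m.lastgroup != "SKIP":
                print(m.lastgroup, m.group())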

  • ilaksh 15 days ago

    I see a lot of criticism but not a lot of praise. For me creating something so comprehensive and also easy to use is an achievement. And in my opinion this is much better than Regex for most cases because of the ease of use and readability. I think the main pushback you are seeing is just because people are invested in Regex.

    • theamk 15 days ago

      Um.. no? He is seeing pushback because one of his main claims (100x faster) is very likely factually wrong. And also because he is not even acknowledging all other prior art in this space.

  • burntsushi 15 days ago

    > Nevod matches MULTIPLE patterns against document in ONE PASS.

    Any non-backtracking regex implementation will do the same.

    > So, at a RULE DECK of 1000 patterns, we got a 300x improvement compared to RegExp.

    There is no generic "regex." There are many implementations of regex. So which regex implementation are you comparing it with?

  • bdcravens 15 days ago

    You have provided a lot of detail here, but it seems you aren't addressing the concerns about patents.

    • ychetyrko 15 days ago

      The patent application was filed for defensive reasons. Also, a patent sends a marketing message that we are serious about the technology we created.

      • umvi 15 days ago

        > Also, a patent sends a marketing message that we are serious about the technology we created.

        That's the message it sends to business people maybe, but your target audience here is hackers and tinkerers and it sends a completely different message (namely: "keep this technology out of your projects or you'll get into legal trouble down the road").

bakpakin 15 days ago

Pretty cool tool, the NLP focus is surprising but I'm not sure if it is a marketing ploy or fundamental to the technology. The nice syntax is also an improvement over things like LPeg, which seem to obscure the grammar by forcing it to be written in Lua. Seems like it could overlap in functionality with Rebol/RED parse, which I believe has been posted on here before, LPeg, or just PEGs and Packrat parsing in general.

https://www.red-lang.org/2013/11/041-introducing-parse.html

Red's parse engine is implemented as a PEG, so it is probably a lot slower, but it also supports pattern matching over whole words. Also, the database aspect of Nevod seems interesting and potentially offers a huge speedup. But as far as I can tell, the syntax looks like a PEG with some implicit rules that separate words. As far as speed goes, I definitely believe that Nevod could be very fast, but I haven't seen any numbers.

Shameless plug: I've implemented something similar with PEGs for Janet, a lisp I have been working on for a while. I make no claims about its speed, but the PEG interpreter is written in tight C, so it shouldn't be too slow. The peg module in Janet works with a DSL that looks like a lispy EBNF and results in a recursive parser compiled to bytecode for the interpreter.

https://janet-lang.org/peg.html

You can also mess with the language in a browser on the home page.

https://janet-lang.org/

janoc 15 days ago

"Nevod is a language and patent-pending technology for pattern-based text search."

Yawn. I will stay with my patent-free regexps (or whatever other tech I may need) rather than rely on something that will let you put a gun to my head if I ever wish to sell my product.

No thanks. Software patents are evil crap.

rusk 15 days ago

Now you've got 3 problems!

Zelphyr 15 days ago

One of the best and most powerful alternatives to RegEx I've ever seen is the PARSE function used in the Rebol and Red programming languages.

    page: read http://hq.rebol.net
    parse page [thru <title> copy title to </title>]
    print title
    The REBOL Developer Network
ggm 15 days ago

Is this another syntax (of the mechanism's written form) vs semantics (of how it actually operates) confusion moment?

If Nevod builds a different textual model and applies what you "say" to it differently from regex's underlying model, that's where the speedup comes from. How you say what you want in pattern matching is just a syntactic matter: I like regex, coming from the UNIX philosophy, because of the syntactic fluidity for saying things. What actually happens when I say a|b|c|d is that it builds a DFA, and that's not so bad; but if I do individual /a/p /b/p /c/p patterns in sed (same engine, but no DFA builder), it's slow.

So is Nevod a new syntax and a DSL tied to language, or is it a new syntax and a generalized text matching model with some real semantic shift from regex?

asah 15 days ago

You lost me at "patent pending"

tianshuo 15 days ago

You could easily do something like this using open-source parser generators, like Pegjs (https://github.com/pegjs/pegjs), and own the final code yourself.

  • aidenn0 15 days ago

    [edit] After reading my comment, it sounds like I don't like packrat parsers. I actually love them, and when they are available for the language I'm using they are my first choice, but the first rule of engineering is that everything has its trade-offs, so...

    I'm not familiar with Pegjs, but other PEG parsers I've seen tend to use the packrat algorithm, which is suboptimal for regular languages, because it memoizes parses to speed up backtracking, and regular languages do not need backtracking.

    For example, if you were to write a recursive-descent parser for JSON and convert it to a packrat parser, you would often find the packrat parser is slower.

    Now, extended regexes include backtracking, and that's where packrat parsers can soundly defeat recursive-descent parsers: super-linear time parses can become linear time. This makes packrat parsers a wonderful "default choice", but if constant factors are important and your language is regular, you will want to look beyond packrat parsers.

peteforde 15 days ago

Sounds exciting if it's real, but extraordinary claims require extraordinary evidence.

OP (@ychetyrko), are you involved with this project? If so, announcing this before you have usable code that people can use without licensing encumbrance might have been an opportunity lost. Much more detail is required.

speedplane 14 days ago

Higher level text searching has been around for decades. In fact lawyers use this all the time. For example:

motion w/3 (denied or deny!)

This will match any sentence with the word "motion" within three words of the word "denied" or any other word that starts with "deny" (e.g., "denying"). Often, these systems use regex underneath with fancy libraries built on top to determine word separation, sentence separation, etc.

The truth is that regex is really great for character level searching, but if you commonly do word or sentence level searching there are a variety of solutions already available.
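
To make that concrete, here is a rough Python sketch of how a "w/3" operator can be lowered onto a word-level regex (my own approximation, one direction only, not how any real legal-search product implements it):

    import re

    # "motion w/3 deny!"  ->  "motion", at most three intervening words,
    # then a word starting with "deny" (only the motion-before-deny
    # direction, for brevity).
    proximity = re.compile(r"\bmotion\b(?:\W+\w+){0,3}?\W+deny\w*", re.IGNORECASE)

    print(bool(proximity.search("The motion was summarily denied.")))         # True
    print(bool(proximity.search("Motion for continuance: denying relief.")))  # True
    print(bool(proximity.search("The motion raises several points; none were denied.")))  # False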

runxel 15 days ago

> […] works in every language

> Operates on words, not characters.

Yeah, well. How shall I tell you?

stevefan1999 14 days ago

Sigh. https://swtch.com/~rsc/regexp/regexp1.html

Sadly, RegEx has evolved far away from the original regular expressions we learnt in school, and it is certainly less NFA-like. This makes it harder to execute at high speed; e.g. backtracking makes it more context-sensitive, etc.

mimixco 15 days ago

This could be interesting but without any information about licensing, cost, or available platforms, most people are going to ignore it.

meruru 15 days ago

I learned about Lua patterns while trying out OpenBSD httpd, and I quite liked them as a regex alternative. They're easy to learn and cover most of my needs: https://man.openbsd.org/patterns.7

dzink 15 days ago

If they want to actually succeed in making this ubiquitous, it cannot be a Grammarly-like plug-in that sends text back to the motherland. It has to be standalone and locally hosted. Otherwise it's just another MITM- / spying-prone library.

amitport 13 days ago

Please add implementation details. I once made a search engine that could support a similar feature set. It's possible that we used the same core tricks, but I have no way to evaluate this.

jermaustin1 15 days ago

I tried to use the URL matching pattern from the reference, and I got the following error:

Error: Compilation error: Pattern capture is not supported in this version of the pattern matching engine, pattern 'Url'

chaitanya 15 days ago

This might be a good tool but what is patent worthy about it?

bratao 15 days ago

It reminded me how much I miss LPeg (Lua parsing expression grammars). It was very fast and much nicer to work with than regex.

JoelMcCracken 15 days ago

I have often thought that an EBNF-grammar-like system would work better in many situations than regular expressions.

ungamed 15 days ago

The syntax reminds me very much of the under-appreciated pyparsing library.

maxk42 15 days ago

Patent-pending?

No thanks.

tentakull 15 days ago

lol, spare me. Only reinvent what's broken, and work towards standardization.

beiller 15 days ago

But can it parse XML!?