HarHarVeryFunny 14 days ago

The explanation is a bit lengthy, but the idea is simple.

Consider the normal naive way to search for a string in a document. First you search for the next instance of the first letter of your search string, then try to match the entire string at that position. If the match fails then you advance your search position by one character and repeat (looking for next occurrence of first letter, etc).
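
In code, that naive approach looks something like this (a minimal Haskell sketch, to match the article's language; the name naiveSearch is mine):

  import Data.List (isPrefixOf, tails)
  import Data.Maybe (listToMaybe)

  -- Try the whole pattern at every position and return the first
  -- index where it matches: O(pattern * document) in the worst case.
  naiveSearch :: String -> String -> Maybe Int
  naiveSearch pat doc =
    listToMaybe [ i | (i, window) <- zip [0 ..] (tails doc)
                    , pat `isPrefixOf` window ]

  -- naiveSearch "explo" "an explosion" == Just 3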

The Knuth-Morris-Pratt algorithm improves upon this naive approach by usually advancing by more than one character after a failed match, thereby speeding up the search. It does this by taking advantage of its knowledge of the search string and the position at which the match failed.

To get the idea, consider searching for the string "explosion" .. first we find the next "e", then try matching the rest of the word. Say we match "explo" then fail (perhaps the document had "explode"), so now we want to start over and find the next "e" in the document... What KMP would do here is note that the document matched the first 5 letters "explo", none of which (other than 1st letter) are an "e", so it can advance by 5 characters, not just 1, to start looking for the next "e".

The amounts it can advance at each failed match position are pre-calculated to be efficient.
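
For instance, here's a deliberately naive Haskell sketch of that pre-calculation (quadratic, for clarity; real KMP computes the same table in linear time, and the names border and shiftAt are mine):

  import Data.List (inits, isSuffixOf)

  -- Longest proper border of s: the longest prefix of s (other than
  -- s itself) that is also a suffix of s.
  border :: String -> Int
  border s = maximum (0 : [ length b | b <- init (inits s), b `isSuffixOf` s ])

  -- After matching the first k pattern characters (k >= 1) and then
  -- failing, the pattern can safely slide forward by k minus the
  -- border length of the part already matched.
  shiftAt :: String -> Int -> Int
  shiftAt pat k = k - border (take k pat)

  -- shiftAt "explosion" 5 == 5: "explo" has no self-overlap, so after
  -- failing there we can advance by all 5 characters.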

  • hulium 14 days ago

    That's the idea, but what makes it a bit more complicated is that a string can contain a prefix of itself, like "mama" in the article, or in the worst case "aaaaaaa".
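
    For instance, with the naive border function sketched upthread:

      -- >>> [ border (take k "mama") | k <- [1..4] ]
      -- [0,0,1,2]          -- "ma" is both a prefix and a suffix of "mama"
      -- >>> [ border (take k "aaaaaaa") | k <- [1..7] ]
      -- [0,1,2,3,4,5,6]    -- maximal self-overlap: every shift is just 1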

    • HarHarVeryFunny 14 days ago

      Yes, so for example if you were searching for "elephant" and matched "ele", then you could only advance by 2, since the 2nd "e" might be the beginning of "elephant"; but if you matched "elep" then you could advance by 4, since no suffix of "elep" ("lep", "ep", "p") is a prefix of "elephant".

      But still, it all comes down to how far can you advance at any given match failure position.
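
      In terms of the sketch upthread:

        -- >>> shiftAt "elephant" 3   -- failed after matching "ele"
        -- 2                          -- the trailing "e" may start a new match
        -- >>> shiftAt "elephant" 4   -- failed after matching "elep"
        -- 4                          -- no suffix of "elep" is a prefix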

  • itronitron 14 days ago

    >> The amounts it can advance at each failed match position are pre-calculated to be efficient.

    Wouldn't knowing that require having already run the search?

    • throw_pm23 14 days ago

      No, check the example given. To know how much we can advance, we only need to examine the search string "explosion" in advance.
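
      For example, the standard linear-time table construction reads only the pattern; here's a Haskell sketch using a lazily self-referential array (my code; the article's Haskell takes a different, tree-based route):

        import Data.Array

        -- failureTable p ! k = length of the longest proper border of
        -- the first k characters of p. One left-to-right pass over the
        -- pattern alone; the document is never consulted.
        failureTable :: String -> Array Int Int
        failureTable p = table
          where
            n  = length p
            pa = listArray (0, n - 1) p
            table = listArray (1, n) (0 : [ extend i | i <- [2 .. n] ])
            -- Extend the border of the previous prefix by one character,
            -- falling back through shorter borders on a mismatch.
            extend i = go (table ! (i - 1))
              where
                go k | k > 0 && pa ! k /= pa ! (i - 1) = go (table ! k)
                     | pa ! k == pa ! (i - 1)          = k + 1
                     | otherwise                       = 0

        -- elems (failureTable "explosion") == [0,0,0,0,0,0,0,0,0]
        -- elems (failureTable "mama")      == [0,0,1,2]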

  • gorkempacaci 14 days ago

    That was a great explanation. Thanks.

Terr_ 14 days ago

Another fun one is the Aho-Corasick algorithm [0] which is great (i.e. more efficient) when you have multiple different target strings you want to find and you don't want to do multiple scans of the document.

Like KMP, it involves a preliminary step of analyzing the search term(s), and from them it builds a directed graph representing rules for what to do while consuming the stream of document-characters.

To find all occurrences, it's something like O(document_length + all_target_words_combined_length + number_of_hits).
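
For contrast, here is the naive multi-pattern scan that the automaton replaces, rescanning the document once per target (a Haskell sketch; the name naiveMulti is mine):

  import Data.List (isPrefixOf, tails)

  -- O(patterns * document * pattern_length): one full scan per target,
  -- exactly the repeated work Aho-Corasick's single pass avoids.
  naiveMulti :: [String] -> String -> [(String, Int)]
  naiveMulti pats doc =
    [ (p, i) | p <- pats
             , (i, window) <- zip [0 ..] (tails doc)
             , p `isPrefixOf` window ]

  -- naiveMulti ["he","she","his","hers"] "ushers"
  --   == [("he",2),("she",1),("hers",2)]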

[0] https://en.wikipedia.org/wiki/Aho-Corasick_algorithm

  • byronvickers 14 days ago

    Thanks for this - I need to solve exactly this problem on a toy project I've been working on; you've saved me the time to hunt down the appropriate algorithm!

    • Terr_ 13 days ago

      I hope it helps, but if you are just looking for text strings, you might want to benchmark your local regex libraries first. Among other optimizations, they might use that algorithm under the hood for long (foo|bar|baz|...) expressions.
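
      Something like this, assuming the regex-tdfa package (any Haskell regex library with an =~ operator reads much the same):

        import Text.Regex.TDFA ((=~))

        -- One alternation instead of several separate scans; a good
        -- engine may compile this into an Aho-Corasick-like automaton.
        containsAny :: String -> Bool
        containsAny doc = doc =~ "foo|bar|baz"

        -- containsAny "a bar of chocolate" == True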

mehulashah 16 days ago

This starts off well, but the Haskell makes it hard for me to understand. Perhaps I should learn Haskell?

  • nine_k 14 days ago

    Learning some Haskell is very educational; it changes the way you think about programming, including everyday code (much as a Lisp, Erlang, SQL, or Smalltalk does).

    These examples use pretty minimal features of Haskell, mostly expressions, function definitions with argument pattern matching, and algebraic type definitions. These would take an hour or two to get acquainted with. Nothing fancy is used in the code, in particular, nothing outright monadic :)
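
    For a taste of just those features (a toy example; the article's own Tree may be shaped differently):

      -- An algebraic type definition plus a function defined by
      -- pattern matching on its constructors.
      data Tree a = Nil | Node (Tree a) a (Tree a)

      depth :: Tree a -> Int
      depth Nil          = 0
      depth (Node l _ r) = 1 + max (depth l) (depth r)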

    For comparison, you can check out a Lisp version at the end. It's much longer, and Lisp (well, Racket here) is usually a very expressive language.

    • foobarian 14 days ago

      If Haskell seems too intimidating, I think Standard ML is easier to wrap one's head around, as it has fewer features. But it still delivers that insight into the FP way of thinking.

    • lispm 14 days ago

      Looks problematic. Is a string in Haskell a list? I would think that in Racket a string is a vector. FIRST and REST operations then would have very different implementations.

      • nine_k 14 days ago

        Much like in Rust or C++, there are several ways to represent sequences of characters in Haskell. The default way is a List, which is like a classic Lisp list, a chain of cons cells.

      • pitkali 14 days ago

        Yes, the String is a list of characters. The Haskell code uses head and tail on it. In Racket, they provided string-first and string-rest (at the end) to replace those.
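
        Concretely:

          -- String is just a synonym for a list of characters:
          --   type String = [Char]
          -- so "mama" is 'm' : 'a' : 'm' : 'a' : [], and
          --   head "mama" == 'm'     -- O(1)
          --   tail "mama" == "ama"   -- O(1), shares the tail, no copying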

        • lispm 14 days ago

          Which makes it very different in Racket. STRING-REST creates a new string with the contents copied from the original string, minus one character.

  • teo_zero 14 days ago

    I had the same reaction! I was excited about this article before realizing that the "only elementary functional programming techniques" promised in the abstract actually mean a full programming language used by 1% of programmers.

  • mrkeen 14 days ago

    Probably!

      -- The any function determines if some element of the input list satisfies the given predicate:
      any f     [] = False
      any f (x:xt) = f x || any f xt
    
    In the meantime here's a more palatable version of 'any' for the 99% of programmers who are put off by the confusing Haskell:

        // Returns whether any elements of this stream match the provided predicate. May not evaluate the predicate on all elements if not necessary for determining the result. If the stream is empty then false is returned and the predicate is not evaluated.
    
        public final boolean anyMatch(Predicate<? super P_OUT> predicate) {
            return evaluate(MatchOps.makeRef(predicate, MatchOps.MatchKind.ANY));
        }
    
        final <R> R evaluate(TerminalOp<E_OUT, R> terminalOp) {
            assert getOutputShape() == terminalOp.inputShape();
            if (linkedOrConsumed)
                throw new IllegalStateException(MSG_STREAM_LINKED);
            linkedOrConsumed = true;
    
            return isParallel()
                   ? terminalOp.evaluateParallel(this, sourceSpliterator(terminalOp.getOpFlags()))
                   : terminalOp.evaluateSequential(this, sourceSpliterator(terminalOp.getOpFlags()));
        }
    
        public static <T> TerminalOp<T, Boolean> makeRef(Predicate<? super T> predicate,
                MatchKind matchKind) {
            Objects.requireNonNull(predicate);
            Objects.requireNonNull(matchKind);
            class MatchSink extends BooleanTerminalSink<T> {
                MatchSink() {
                    super(matchKind);
                }
    
                @Override
                public void accept(T t) {
                    if (!stop && predicate.test(t) == matchKind.stopOnPredicateMatches) {
                        stop = true;
                        value = matchKind.shortCircuitResult;
                    }
                }
            }
    
            return new MatchOp<>(StreamShape.REFERENCE, matchKind, MatchSink::new);
        }
    • mbwgh 14 days ago

      Thanks. I believe with the additional null checks, this version is safer to use, too!

      • poorlyknit 13 days ago

        There's no null in Haskell to check for. Go check it out!

    • poorlyknit 13 days ago

      This would be even more palatable if it involved some more design patterns.

  • ks2048 14 days ago

    I recently began learning Haskell via Graham Hutton’s YouTube series. Recommended!

  • necovek 14 days ago

    It would help if it was clear what tokens are part of the language or standard library, and what tokens are implementation of this algorithm (as in, some syntax highlighting would be nice :).

    • mrkeen 14 days ago

      I think in this case everything is defined in the article, except for List, which is either an empty list [] or a non-empty list x:xs.

      Anything lowercase is a function/variable, and will appear somewhere on the left-hand side of an equals sign.

      Anything uppercase is a Type/constructor. In this case I think they've only used a Tree, which they defined themselves (either a Nil or a Node).

      any and scanl would typically be imports, but they've defined these in full.

      They also defined init themselves, unrelated to the one in the stdlib.
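
      For reference, the stdlib versions behave like this:

        -- >>> any even [1,3,4]
        -- True
        -- >>> scanl (+) 0 [1,2,3]   -- foldl, but keeping every
        -- [0,1,3,6]                 -- intermediate accumulator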

  • epgui 14 days ago

    I mean this is the Journal of Functional Programming... That being said, I wish more people had your response. Curiosity and openness are beautiful things.

hzay 14 days ago

Nice. I tried visualizing this algorithm here - https://visuallyexplain.pages.dev/kmp/algorithm-(wip) . The sidebar has a bunch of completed algos -- DFS, BFS, binary heap queries, Dijkstra's, union find, topological sort, quicksort, etc.

I abandoned this project without completing KMP. If anyone is interested in this sort of algorithm explanations, I'd love to collaborate / finish this project.

  • klabb3 14 days ago

    Memories from algorithm class.. I recall going in and seeing on the course plan the intimidating words: dynamic programming, linear programming, weighted graph search, etc. But string algorithms looked much easier on a surface level – how hard can it be? Turns out, pretty hard. Who’d have thought?

andrewp123 14 days ago

It’s really easy to come up with KMP. Just write an algorithm that searches for a substring p in string s. Make your algorithm efficient by sliding two pointers down s to find the biggest possible match, often called “sliding window”.

You want to grow the window as big as possible to match the substring. The data structure you naturally come up with to do checks efficiently here is the one you use in KMP.

tome 14 days ago

Is this post showing a bug in HN? It says it was posted 10 hours ago, but Algolia says it was posted 2 days ago

https://hn.algolia.com/?dateRange=pastWeek&page=0&prefix=tru...

And mehulashah's comment that the thread claims was posted 7 hours ago was also posted two days ago, according to Algolia

https://hn.algolia.com/?dateRange=pastWeek&page=0&prefix=tru...

I'm aware of the second chance pool, but I don't think it can be that

https://news.ycombinator.com/item?id=26998308

  • defrost 14 days ago

    This K-M-P post is on page 2 of the pool list .. reverse chrono sorted by time of first post:

    https://news.ycombinator.com/pool

    My recent post: https://news.ycombinator.com/item?id=40037466

    shows with the second chance repost time (12 hours ago), but was actually first posted ~ 24 hours ago.

    It sank w/out notice; then this morning I woke to find it suddenly active and "recently" posted.

    That's the second chance pool effect for you :-)

    • tome 14 days ago

      I'm not surprised the pool can make the article reappear but I'm a bit surprised it adjusts the submission date of the article and I'm very surprised that the comment dates have changed.

      • defrost 14 days ago

        There must be multiple time fields, with different view contexts reading different fields.

        My submissions page shows the first submission time, while the "actual" submission's comments page shows the second post time.

        There's a fair bit of lispy magic going on around here; dang and other mods can merge comments on separate submissions, view comment voting history, and generally do a wide range of back-end housekeeping | forensic | anti-troll operations and views .. very probably nothing really changes .. just our end-user perception of the HN world is filtered by transformation.

dreamcompiler 14 days ago

Good article. I've implemented Boyer-Moore several times and it's ridiculously fast on typical English text, especially with long search strings.

I haven't implemented KMP but I might try after reading this.

  • rurban 14 days ago

    KMP is so slow, it's not worth it. https://rurban.github.io/smart/results/all/englishTexts.html

    BM beats it easily, not to mention the best two-way search algorithms, as implemented in musl.

    Thierry Lecroq's site usually has graphical descriptions of all the string search algos; just the latest are missing. https://www-igm.univ-mlv.fr/~lecroq/string/index.html

    • klabb3 14 days ago

      That’s funny. I remember the wake-up after doing all the complexity analysis and such in academia: once you’re engineering for performance, it’s often very different from theory. For instance, a lot of heap allocations and the resulting memory fragmentation can really tank performance on real hardware. Naive linked lists are a good example of something that looks super useful in CS class but turns out to be quite niche in practice, and quite differently implemented too.

      • taeric 13 days ago

        To be fair, it is largely the Big O analysis that is very different. And that is easily understood as a case of Goodhart's Law. It remains a good way of comparing the growth of algorithms, of course; but a full analysis would include more than only the top end growth numbers.

mrbonner 13 days ago

And some interviewers ask us to come up with it in a 45-minute coding session!

amelius 13 days ago

What's the fastest known algorithm?

And for these related problems:

- fastest regexp search

- fastest search, matching minimum edit distance

  • burntsushi 13 days ago

    https://github.com/BurntSushi/rebar

    For regex, you can't really distill it down to one single fastest algorithm.

    It's somewhat similar even for substring search. But certainly, the fastest algorithms are going to be the ones that make use of SIMD in some way.

rugina 14 days ago

We, humankind, managed a good optimisation for this problem by putting spaces between words. When trying these algorithms to search for a word in a string of text, I was surprised how little they improved over just skipping to the next word.

  • dwallin 12 days ago

    I think you would be surprised by how much of a performance hit that would be compared to the existing state of the art. The thing you are missing is that the human visual system evaluates the existence of spaces in parallel; a single-threaded algorithm would need to check every character to see if it's a space, in addition to checking the letters of the search string. Also, that fails to generalize to languages without spaces, search strings with spaces, etc.

    Also if you look at algorithms like Boyer-Moore, they effectively DO skip spaces, but do so in a manner that is language / content agnostic. (https://www-igm.univ-mlv.fr/~lecroq/string/node14.html)
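
    A sketch of that skipping, using Horspool's simplification of the Boyer-Moore bad-character rule (my code, and only the simplified rule, not full Boyer-Moore):

      import qualified Data.Map.Strict as M

      -- On each attempt, look at the document character under the end of
      -- the window. If it occurs earlier in the pattern, align the window
      -- with its rightmost such occurrence; if it doesn't occur at all
      -- (e.g. a space, when searching for a word), skip the entire
      -- pattern length at once.
      horspool :: String -> String -> Maybe Int
      horspool pat doc = go 0
        where
          m = length pat
          n = length doc
          -- rightmost occurrence wins: M.fromList keeps later entries
          shifts = M.fromList (zip (init pat) [m - 1, m - 2 .. 1])
          go i
            | i + m > n                  = Nothing
            | take m (drop i doc) == pat = Just i
            | otherwise = go (i + M.findWithDefault m (doc !! (i + m - 1)) shifts)

      -- horspool "sion" "an explosion" == Just 8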

  • jrpelkonen 14 days ago

    By “improving” you mean changing the problem definition: Word search is a completely different problem.

mehulashah 14 days ago

Sounds like I’m learning Haskell soon.