josephg 6 days ago

This is really great to hear. The different internal string representations really age a language:

- C sort of assumes all strings are ASCII

- Java, C#, Obj-C, Javascript and Python2 were all written when it was assumed Unicode would have no more than 65536 characters. They all use 2 byte encodings, which has become the worst of all worlds. ASCII text is twice the size it needs to be, you still need to handle the mismatch between array length and codepoint length, and errors are harder to find because there are very few common characters that actually cross the 2 byte boundary - although the popularity of emoji has changed this.

- Go, Dart, Rust and Zig all return to single byte encodings. They all use UTF-8 internally, and have different functions to interact with the string as a list of bytes, and the string as a list of unicode codepoints. (In Go's case it's because Rob Pike was behind both Go and UTF8.)

It's delightful to see Swift stepping away from Obj-C's mistakes and improving things internally. Although in Obj-C's defence, NSString has always hidden its internal representation, which did a lot of work to protect developers from the footguns.

One of the big ironies of all this is how well C has stood the test of time. C strings interoperate beautifully with UTF8 - all that's missing is some libc helper methods to count & iterate through UTF8 codepoints. strlen / strncpy / strcmp / etc all work perfectly when dealing with UTF8. The only change is that you need to supply lengths in bytes not characters.
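
A minimal sketch of those missing helpers, written in Python rather than C for brevity and assuming the input is already valid UTF-8. The names `is_continuation`, `count_codepoints` and `iter_codepoints` are made up for illustration; they are not part of any libc:

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def count_codepoints(data: bytes) -> int:
    """Count code points in valid UTF-8 by counting the non-continuation bytes."""
    return sum(1 for b in data if not is_continuation(b))


def iter_codepoints(data: bytes):
    """Yield each code point of valid UTF-8 as its own slice of bytes."""
    start = 0
    for i in range(1, len(data) + 1):
        # A code point ends where the next lead byte starts, or at the end of the data.
        if i == len(data) or not is_continuation(data[i]):
            yield data[start:i]
            start = i


text = "na\u00efve \U0001D11E".encode("utf-8")  # "naïve 𝄞"
byte_length = len(text)                  # what strlen would report: 11
codepoint_length = count_codepoints(text)  # 7
```

The byte length is the number you hand to strncpy and friends; the code point count needs the extra pass over the data.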

  • josteink 6 days ago

    > One of the big ironies of all this is how well C has stood the test of time. C strings interoperate beautifully with UTF8

    I wouldn't say that.

    I'd rather claim that UTF-8 has been engineered to be ASCII pass-through compatible because of the lacking Unicode support in most C-codebases out there.

    I mean... If you compare Unicode support in platforms and OSes where this was clearly a thought (Windows, C#, Java) to platforms with the more naive approach (Linux, C, PHP, etc), you will see a very clear picture of which side has the most unicode bugs and encoding errors.

    And I say that as a Linux-guy.

    C is terrible for text-processing. UTF-8 was designed as it was because of that.

    Trying to paint C as the best choice here because someone found a working solution they could also apply to terrible C-code is clearly seeing things backwards.

    • the_mitsuhiko 6 days ago

      > If you compare Unicode support in platforms and OSes where this was clearly a thought (Windows, C#, Java)

      Their insistence on UCS2, however, gave us the worst outcome of all. We have forever lost everything north of 21 bits (code points above U+10FFFF) for characters due to UTF16. This misdesign creates a bigger problem than C’s char type.

      • mikeash 6 days ago

        UCS-2 was all there was at the time, though. Blame Unicode for thinking 65,536 code points would be enough.

        • gpderetta 5 days ago

          That's not a very good excuse though: UTF-8 was invented, by Thompson and Pike, specifically because existing solutions were not deemed adequate for their needs.

          "Reasonable people adapt themselves to the world. Unreasonable people attempt to adapt the world to themselves. All progress, therefore, depends on unreasonable people"

          • mikeash 5 days ago

            When Java and friends were being designed, it looked like 16 bits would be adequate. They didn’t insist on UCS-2 over something better, they chose what looked like the best option at the time. It later turned out not to be. UTF-16 was created to fix that problem without having to completely redo their strings from scratch.

            Likewise, C was created in a time when it looked like 7 or 8 bits would suffice. That turned out to be wrong. UTF-8 was invented to solve that problem without redoing C strings from scratch.

    • slavik81 6 days ago

      Windows is the platform where I see most of my Unicode problems. Here's an example of one of them:

      hello.py:

          print("こ")
      
      Ubuntu:

          $ python3 --version
          Python 3.5.2
      
          $ python3 hello.py
          こ
      
          $ python3 hello.py > out.txt
          $ cat out.txt
          こ
      
      Windows:

          C:\Users\u\tmp>python --version
          Python 3.7.1
      
          C:\Users\u\tmp>python hello.py
          ?
      
          C:\Users\u\tmp>python hello.py > out.txt
          Traceback (most recent call last):
            File "hello.py", line 1, in <module>
              print("?")
            File "C:\Users\u\AppData\Local\Programs\Python\Python37-32\lib\encodings\cp1252.py", line 19, in encode
              return codecs.charmap_encode(input,self.errors,encoding_table)[0]
          UnicodeEncodeError: 'charmap' codec can't encode character '\u3053' in position 0: character maps to <undefined>
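
      A small sketch of what goes wrong in the redirected case, assuming the redirected stdout falls back to the cp1252 code page named in the traceback. Wrapping a byte stream with an explicit UTF-8 encoding (here an in-memory `io.BytesIO` stands in for the real stdout) avoids depending on the locale default:

```python
import io

ch = "\u3053"  # こ


def encodes_as(text: str, encoding: str) -> bool:
    """Report whether `text` can be encoded with the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# cp1252 has no mapping for U+3053, which is the failure shown in the traceback.
cp1252_ok = encodes_as(ch, "cp1252")
utf8_ok = encodes_as(ch, "utf-8")

# Writing through a wrapper with an explicit encoding works regardless of locale.
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding="utf-8", newline="\n")
print(ch, file=out)
out.flush()
raw = buf.getvalue()  # the UTF-8 bytes that would land in out.txt
```

      Setting the PYTHONIOENCODING environment variable to utf-8 before running the script should have a similar effect on the real stdout.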
      • sametmax 6 days ago

        The cmd console is terrible. It supports only a limited set of Unicode code points and, worse, an even more limited set of fonts.

        The python unidecode pypi package is your best friend if you have to deal with it.

        • josteink 6 days ago

          > The cmd console is terrible.

          Agreed. And it's not really as much a native Windows app, as it is an emulator for the "good old" times of MS-DOS, which has received the necessary adjustments to keep chugging along, but little else.

          Given that, you have probably picked the single weakest link, Unicode-wise, that you can find anywhere in the entire Windows universe.

          It's really a completely obsoleted, abandoned mess. The fact that we had to wait until Windows 10 before you could actually resize the window freely should tell you at what level you should put your expectations.

          If you want to live your life in the command-line on Windows (for python or whatever else), do yourself a favour and get any other terminal.

          There are really quite a few to choose from. CMDer, ConEmu and MinTTY spring to mind.

          • jacobush 6 days ago

            What about this then? https://stackoverflow.com/a/47843552

            Claims unicode support is decent in Windows 8 and up.

            • sametmax 6 days ago

              Your post clearly states that not only do you have to manually set up your console in each session on each machine to get anything remotely workable, but you also need to limit yourself to a subset of Unicode.

          • sametmax 6 days ago

            Cmder is very decent.

    • TorKlingberg 6 days ago

      You are right that UTF-8 was created to pass safely through C code that isn't Unicode aware. But that turned out to be the right design everywhere. Every system that went with UTF-16 or UCS-2 is either carefully switching to UTF-8, or suffering from various encoding bugs and inefficiencies.

      • quietbritishjim 6 days ago

        It is not true that UTF8 is the right design "everywhere". Perhaps it is the right design when you have strings that are stored and later read by various programs, but it's not necessarily right for in-memory strings.

        Python 3 internally uses ASCII, UCS2 or UCS4 for its strings depending on which is most space efficient but still capable of representing the string. It can do that because Python strings are immutable* and because it is impossible for user code to see what the encoding is (to see the bytes of a string you must explicitly convert it, at which point you specify the destination encoding and Python ensures the translation is correct). If you join a UCS2 string with a UCS4 string then the result is automatically UCS4 and there's no way to tell from user code (except by memory usage)!

        There's a good reason for this: indexing into a UTF8 string takes O(n) time because you must parse all the bytes before that point in order to count the number of characters. Iterating over a string by index would either take O(n^2) time or you would have to use some sort of awkward string iterator. If you do not provide any way to index by character (rather than byte position) then I would argue that your data structure is not really a "string". I suppose an alternative implementation would be to use UTF-8 but have an auxiliary data structure that maps between character indices and byte positions.

        * Immutability matters because if you had a mutable string in ASCII and you changed one character to be a code point >127 then you'd need to copy the string before making the modification, which if possible at all would be O(n) for what you would expect to be O(1).
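
        A rough sketch of that alternative implementation, assuming valid UTF-8 and the simplest possible auxiliary structure: one byte offset per code point. `build_offsets` and `char_at` are hypothetical names, and a real implementation would more likely sample every Nth offset to save memory:

```python
from typing import List


def build_offsets(data: bytes) -> List[int]:
    """Byte offset of every code point start in valid UTF-8 (one O(n) pass)."""
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


def char_at(data: bytes, offsets: List[int], index: int) -> str:
    """Decode the code point at character position `index` via the offset table."""
    start = offsets[index]
    end = offsets[index + 1] if index + 1 < len(offsets) else len(data)
    return data[start:end].decode("utf-8")


data = "a\u00e9\u3053\U0001D11E".encode("utf-8")  # 1-, 2-, 3- and 4-byte code points
offsets = build_offsets(data)
third = char_at(data, offsets, 2)
```

        Building the table costs O(n) once; after that each lookup is a list index plus a short decode.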

        • bradleyjg 6 days ago

          I agree with much of what you said here, but 'character', and therefore indexing by character, isn't a well behaved concept in the unicode world. Normalization helps somewhat, though it's another expense at string creation time. However, even after normalization there are plenty of graphemes that don't have a precomposed character.

          Our fundamental ideas about how a string should work and what kinds of things we might want to do with one are deeply influenced by the languages spoken by the pioneers of the computer age. Short of someone from a very different language tradition not being exposed to these implicit assumptions in their formative programming years I don't think we are ever going to know what a different path would look like.

          • jfk13 6 days ago

            > Normalization helps somewhat, though it's another expense at string creation time

            And needs to be handled with care...there are edge cases where it can bite you. For example, if Unicode is being used internally to process data in a legacy CJK encoding, normalisation may lose distinctions that are needed for accurate round-trip conversion.

            Another surprise "gotcha" is that simply concatenating two already-normalised strings may give you a result that is not normalised.
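
            A minimal illustration of that concatenation gotcha using Python's standard `unicodedata` module. The helper name `is_nfc` is made up for this sketch:

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """True if NFC normalisation leaves the string unchanged."""
    return unicodedata.normalize("NFC", s) == s


left = "e"        # already NFC on its own
right = "\u0301"  # COMBINING ACUTE ACCENT, also already NFC on its own
joined = left + right

# NFC composes the pair into the single precomposed code point U+00E9.
recomposed = unicodedata.normalize("NFC", joined)
```

            Both inputs pass the check individually, yet the concatenation does not, so the result has to be normalised again.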

        • jstimpfle 5 days ago

          > There's a good reason for this: indexing into a UTF8 string takes O(n) time

          Here's the thing: You simply don't do that, ever. It's a meaningless operation. Most you would do is iterate over a string.

          • slededit 5 days ago

            Bookmarks into a string are one such example of the need for this. No need to O(n) iterate over the string again when the bookmarking logic has already done so.

            Computed indices into strings are a no-go but that is far from the only use case.

            • jstimpfle 5 days ago

              But you can index UTF-8 strings with precomputed bookmark indices (byte offsets) just fine. Point is, to precompute them, you would surely have iterated the string before.

              What I mean is that "extract the 42nd codepoint/glyph/whatever from this UTF-8 string" is a pointless operation for free-form strings because in a free-form string the character position is meaningless.

              (Basically all non-ASCII UTF-8 is pretty much a black box. You can't do serious computation with general Unicode because it's so complex and, as a consequence, ill-defined in practice).

              • slededit 5 days ago

                This is an advanced trick, but I've successfully done this as part of a binary search to find the glyph at the visual mid-point of the string. UTF-8 is self-synchronizing so you can detect if you cut a code point in half. You still need to handle things like ligatures which even UTF-32 won't speak to directly.

                Doesn't work well with BIDI languages, or Mongolian. Seriously if you haven't tested with Mongolian your code is probably wrong.
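
                A sketch of the self-synchronizing part only, assuming valid UTF-8: given an arbitrary byte offset (say, the midpoint picked by a binary search), back up to the start of the code point it landed in. `align_to_codepoint` is a hypothetical helper name, and this deliberately ignores ligatures, BIDI and everything else mentioned above:

```python
def align_to_codepoint(data: bytes, pos: int) -> int:
    """Move `pos` back to the nearest code point start at or before it."""
    # Continuation bytes look like 10xxxxxx; step left until we leave them.
    while 0 < pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


data = "a\U0001D11Eb".encode("utf-8")  # 'a', a 4-byte code point, 'b'
mid = len(data) // 2                   # lands inside the 4-byte sequence
aligned = align_to_codepoint(data, mid)
```

                At most three steps are ever needed, since a UTF-8 sequence has at most three continuation bytes.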

            • masklinn 5 days ago

              > Bookmarks into a string are one such example of the need for this. No need to O(n) iterate over the string again when the bookmarking logic has already done so.

              You can do that just fine with UTF8, both Swift and Rust allow it.

        • masklinn 6 days ago

          > Python 3 internally uses ASCII, UCS2 or UCS4 for its strings depending on which is most space efficient but still capable of representing the string.

          UTF8 is as efficient as ASCII in the ASCII range, and more than or as efficient as UCS4 beyond the BMP.

          The only iffy point is the U+0800 ~ U+FFFF range, Samaritan script to the end of the BMP, where UTF8 needs 3 bytes per code point against UCS2's 2. That mostly affects extremely content-dense (mostly text, little to no markup) Asian and native scripts, as well as a few African ones.
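
          The sizes are easy to check from Python. A quick sketch, with `encoded_sizes` as a made-up helper and UTF-16 / UTF-32 standing in for UCS2 / UCS4:

```python
def encoded_sizes(ch: str):
    """Bytes needed for one code point in UTF-8, UTF-16 and UTF-32."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # -le avoids counting a byte order mark
        len(ch.encode("utf-32-le")),
    )


ascii_sizes = encoded_sizes("A")             # ASCII range
latin_sizes = encoded_sizes("\u00e9")        # below U+0800
bmp_sizes = encoded_sizes("\u3053")          # U+0800 to U+FFFF
astral_sizes = encoded_sizes("\U0001D11E")   # beyond the BMP
```

          Only the third case has UTF-8 coming out larger than the 2-byte encoding.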

    • yoklov 6 days ago

      While I agree that it's not really a measure of C standing the test of time, I kind of doubt that in an alternate universe where 'C interop' / 'ASCII compatibility' weren't a design consideration, we'd have ended up with something much better (or TBH much different) than UTF-8.

    • jcelerier 6 days ago

      > I mean... If you compare Unicode support in platforms and OSes where this was clearly a thought (Windows, C#, Java) to platforms with the more naive approach (Linux, C, PHP, etc), you will see a very clear picture of which side has the most unicode bugs and encoding errors.

      so... windows ? unicode has always worked fine for me on linux, and has always been a complete pain in windows - as a software user and even worse as a developer. The only problem until ~ 6-7 years ago was libre unicode fonts with complete coverage but they are now pretty good.

  • kibwen 6 days ago

    > C strings interoperate beautifully with UTF8

    Reversing cause and effect, methinks: UTF-8 interoperates beautifully with ASCII. :P

  • deathanatos 6 days ago

    > One of the big ironies of all this is how well C has stood the test of time. C strings interoperate beautifully with UTF8 - all that's missing is some libc helper methods to count & iterate through UTF8 codepoints. strlen / strncpy / strcmp / etc all work perfectly when dealing with UTF8. The only change is that you need to supply lengths in bytes not characters.

    That's exaggerating the truth, I think. Programmers can and do presume strlen's result is things like the length of the string in characters (and the typename "char" does not help), or the number of terminal columns wide a string is. E.g., I'm pretty sure the mysql client can still, to this day, not properly format a table, and IIRC, it is written in C. strcmp() does a byte-for-byte comparison of strings, which in Unicode may very well be meaningless. strncpy() has its own footguns w/ leaving the buffer unterminated. (But yes, as long as your indexes into the string make "sense" — aligned to code points that will make sense at the destination — it works fine.)

    If you have to write any of the helper methods to count & iterate through UTF8 codepoints, you have to watch out for the footgun that char is sometimes signed, sometimes not, so attempting to get the high bits to determine things like "is this a continuation byte?" with >> or & is implementation defined, depending on the specifics of the platform. (>> on negatives, which char can be, and often is, gives an implementation-defined result, and in practice it sign-extends, so the shifted value is not the small positive number you were expecting; & on a signed int doesn't make much sense, and it's implementation defined what it does in C (it operates on the bitwise representation of whatever the underlying signed integer representation is, which is implementation defined). Yes, yes, all the world is two's complement. The sign extension is much more likely to bite you.)

    unsigned char * is better for both raw binary buffers and dealing w/ buffers of UTF-8 data.

    (Honestly, I'm not even sure that C requires the character literal 'A' to correspond to an ASCII "A". But all the world is ASCII…)

    • uranusjr 6 days ago

      > unsigned char * is better for both raw binary buffers and dealing w/ buffers of UTF-8 data.

      Or, even better, uint8_t. Fundamental types don’t have guaranteed fixed sizes, only minimums. Explicitly asking for 8 bits only can save sanity in edge cases in a long running code base where no one is sure what other people are doing anymore.

      • maccard 6 days ago

        > Or, even better, uint8_t. Fundamental types don’t have guaranteed fixed sizes

        This is overly pedantic IMO. Practically speaking, a char will always be 8 bits, unless you're working in embedded (even then, only some ancient DSP devices IIRC have a non-8-bit char). If you're writing code for POSIX, a char will always be 8 bits, and the same goes for Windows^. Almost every embedded device also works with an 8-bit char, and anything that doesn't you probably know about in advance, and probably aren't handling Unicode text on.

        ^ Citation needed

        • mikeash 6 days ago

          Char is also the smallest data type the language can have, so if char isn’t 8 bits then uint8_t won’t be available and your code won’t build anyway.

          That’s preferable to building and misbehaving, but if you got that far without realizing you’re dealing with an unusually-sized char then you’re probably doomed anyway.

        • dagenix 6 days ago

          > This is overly pedantic IMO. Practically speaking, a char will always be 8bits ... [List of some cases where it isn't]

          If you want an 8-bit unsigned type, uint8_t is that type. If it's available, going with it isn't "overly pedantic", not going with it is irresponsible.

          What does "overly pedantic" even mean here? Whenever I see that phrase, it's often a case of someone who is wrong trying to shame someone who is right for the crime of being right.

    • kevin_thibedeau 6 days ago

      > Honestly, I'm not even sure that C requires the character literal 'A' to correspond to an ASCII "A".

      It doesn't. Nor do chars have to be 8-bits.

    • xeeeeeeeeeeenu 6 days ago

      >(Honestly, I'm not even sure that C requires the character literal 'A' to correspond to an ASCII "A". But all the world is ASCII…)

      EBCDIC is still alive and well on IBM mainframes.

  • masklinn 6 days ago

    > - Java, C#, Obj-C, Javascript and Python2 were all written when it was assumed Unicode would have no more than 65536 characters.

    Python 2 had "narrow builds" (2-byte unichar) and "wide builds" (4 byte unichar). Wide was generally the default on unices.

    > - Go, Dart, Rust and Zig all return to single byte encodings. They all use UTF-8 internally, and have different functions to interact with the string as a list of bytes, and the string as a list of unicode codepoints. (In Go's case it's because Rob Pike was behind both Go and UTF8.)

    Dunno about Dart and Zig, but AFAIK Go is closer to Ruby's old model of "IDK lol". A number of string-processing functions assume (and sometimes even assert) UTF8, but that's not a guarantee at all[0]

    > It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.

    For Rust on the other hand, having non-UTF8 byte sequences in an str is one of the language's UB (hence str::from_utf8_unchecked existing and being unsafe).

    > C strings interoperate beautifully with UTF8 - all that's missing is some libc helper methods to count & iterate through UTF8 codepoints. strlen / strncpy / strcmp / etc all work perfectly when dealing with UTF8.

    C strings don't "interoperate with UTF8": they treat everything as a nul-terminated bag of bytes and that's it. Which means they won't properly process valid UTF8 sequences containing NUL, and they will improperly process invalid UTF8 sequences. Ignoring the entire thing is not interoperability, you can not actually operate on text using the C stdlib. All you can do is treat it as an opaque blob and move it from one place to the other mostly unmolested.

    [0] https://blog.golang.org/strings

  • jstimpfle 6 days ago

    > C sort of assumes all strings are ASCII

    How so? Concerning runtime values, you can do whatever you want in C, which means UTF-8 support in C is basically free. Note that UTF-8 was explicitly designed to have this kind of upward compatibility.

    Concerning string literals, just keep your source files encoded in UTF-8 and put in quotes the UTF-8 string you want to process at runtime. I'm not aware how the spec handles source code files that are not ASCII, but in practice it works, and furthermore you can always opt for hex escapes.

  • kjeetgill 6 days ago

    As of Java 9, strings use a one-byte UTF-8 array instead of UTF-16 chars internally.

    See: http://openjdk.java.net/jeps/254

    • dwaite 6 days ago

      It is a Latin-1 byte array, not a UTF-8 array. This is because Java exposes some details like the length of the UTF-16 character array through the string api.

      • TazeTSchnitzel 6 days ago

        (And Latin-1 is a strict subset of Unicode, which means all the direct code point lookups are superfast, literally just a zero extension.)
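
        A tiny sketch of why that lookup is so cheap: in Latin-1 the byte value is the Unicode code point, so decoding is just widening each byte. `latin1_to_codepoints` is a made-up name for illustration:

```python
def latin1_to_codepoints(data: bytes) -> list:
    """Each Latin-1 byte value is already the Unicode code point."""
    return list(data)


all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
codepoints = latin1_to_codepoints(b"caf\xe9")  # "café" in Latin-1
```

        No table and no branching are involved, unlike decoding UTF-8.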

      • kjeetgill 3 days ago

        Ah, my bad. I stand corrected.

  • ken 6 days ago

    > Although in Obj-C's defence, NSString has always hidden its internal representation, which did a lot of work to protect you from this.

    Technically the docs never said what the internal representation was, but it was not very well hidden at all. As the article points out, NSString presents an API with "constant-time access to UTF-16 code units".

    • josephg 6 days ago

      The foot gun happens when you assume that the string representation's length is equal to the number of unicode characters. This is true in C if all your test data uses ASCII characters. It's true in Java / C# / etc if you only test using ASCII + any characters in the Basic Multilingual Plane. And that's most but not all of the characters used in the world today.

      In Javascript, all the standard APIs make the wrong assumptions about unicode. Unless you think about it and actually test using exotic characters, it's very common to make mistakes where certain characters would crash your software. (I'm talking about you, iMessage).

      I ran into a bug just the other day where emoji in a comment in a rust file with RLS caused a weird rendering error in VSC.

      In javascript String#length and naive array indexing is basically always wrong, and as far as I know there's no official API that does the right thing. But because it's mostly right, people use String#length and don't think about it.

        $ node
        > "𝄞".length
        2
        > "𝄞"[0]
        '�' // oops!
      
      In Objective-C, even though large NSStrings use a 2 byte encoding under the hood, the API that NSString provides encourages you to do the right thing by default a lot more than javascript does. We could talk about whether the language or the documentation is responsible, but either way the difference translates into fewer bugs.
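
      For comparison, a Python sketch of the same mismatch. Python's `len` counts code points, and `utf16_units` (a made-up helper) counts UTF-16 code units, which is the quantity String#length reports in the node session above:

```python
def utf16_units(s: str) -> int:
    """Number of UTF-16 code units needed to store the string."""
    return len(s.encode("utf-16-le")) // 2


clef = "\U0001D11E"  # 𝄞, a code point outside the BMP

codepoints = len(clef)            # 1
code_units = utf16_units(clef)    # 2, a surrogate pair
utf8_bytes = len(clef.encode("utf-8"))  # 4
```

      Three different "lengths" for one visible character, which is why tests that only use ASCII never notice.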
      • dwaite 6 days ago

        Unicode is complex enough it is often best just to not think of it in terms of a character count unless you are doing text layout (in which case you care about grapheme clusters, not characters).

        • josephg 6 days ago

          I take your point, but it really depends on the application.

          I care about this having worked on realtime collaborative editors. In this space it definitely does make sense to describe string offsets with unicode codepoints. The problem with counting the number of grapheme clusters is that that number can change with every revision of the unicode spec, and you can get different answers on different platforms. Or the same platform, but a different version of the OS. Does a zero width space offset the position count or not? Do you get the same answer on Windows, MacOS and through whatever unix you happen to use on AWS? I shudder just thinking about it.

          In contrast, counting unicode codepoints is simple, consistent and well defined. Counting string lengths that way is also implemented in the standard library of just about all languages.

      • jcelerier 6 days ago

        > '�' // oops!

        why oops ? what do you expect ? do you also '[0]' random image data and expect it to be the first pixel's red component ?

        • nemetroid 6 days ago

          In Python and Haskell, indexing a Unicode string yields a Unicode codepoint. In Rust, there is no index operator for Strings.

    • pjscott 6 days ago

      So... that's either UTF-16, an eight-bit encoding like ASCII or Latin-1, a variable-width encoding with an offset-translation table (as mentioned in the article), or a string with a different encoding but a constant upper bound on its size. (Or a few other things that seem less practical.)

      The first two are used by CFStringRef (bridged to NSString), the third by Swift strings that have been hit with UTF-16 offset lookups, and the fourth by Objective C tagged pointer strings (on 64-bit devices).

      Yeah, I guess that NSString has mostly exhausted the possibilities for ways to fulfill their API obligations.

  • sametmax 6 days ago

    That's one of the main reasons we got Python 3. The string story is not as good as the Swift one, but in 3.6, it's one of the best I could work with. Especially for a language that works on so many platforms, for so many use cases including web and scripting in i18n corporate environments.

    Bottom line, don't promote python 2. At this stage, it's legacy. We maintain existing money makers with it, but all new projects have no reason to be in 2.

  • mpweiher 6 days ago

    > ..Obj-C... assumed Unicode...use 2 byte encodings

    That's actually not entirely accurate. NSString in Foundation was (and is, sort of) a class cluster that is quite flexible and agnostic about the actual encoding of the representations. I remember this quite clearly, because I fleshed out some of the missing NSString pieces in libFoundation, which was a fairly clean/simple implementation of the OPENSTEP Foundation spec.

    What locked in the representation was CoreFoundation, Apple's C re-implementation of Foundation for the Carbon crowd, who apparently wouldn't countenance even linking against Objective-C, even if hidden behind a C facade.

    So instead of taking the tried-and-true, fast and flexible Foundation and adding some wrappers, the existing Foundation had to be abandoned and rebuilt on top of a newly-written much inferior object-oriented C library.

    With that, we got much worse performance[1], much reduced flexibility in terms of representation and locking away of flexibility. Just as an example, the binary plist format is quite capable, but almost all of that capability is lost because it's hidden behind a monolithic C API.

    Now there are some issues in the API, for example just a single "length" attribute that clashes with NSData's length and makes it...challenging...to present both Byte and String faces at the same time, making for a lot of otherwise unnecessary copying and conversion.

    [1] https://groups.google.com/group/comp.sys.next.advocacy/brows...

  • bluejekyll 6 days ago

    > C strings interoperate beautifully with UTF8

    Isn’t this exactly the design goal of UTF8? To be backward compatible with ASCII strings?

  • nwellnhof 5 days ago

    > Java, C#, Obj-C, Javascript and Python2 were all written when it was assumed Unicode would have no more than 65536 characters.

    That's not true for C# and Python2. While both were first released in 2000 and Unicode code points beyond the BMP first appeared with Unicode 3.1.0 in 2001, the foundation for additional Unicode planes was laid as early as 1996 with Unicode 2.0.0.

  • benibela 6 days ago

    Delphi/Pascal had the best strings

    Single byte storage, so it works well with utf-8.

    Byte length field, so it knows the length of every string and it can store null bytes.

    An additional null byte past the end and pointer shifting, so every Pascal string is also a C pchar string.

    Reference-counted copy-on-write, so you can copy it in constant time for reading and treat it as if the string was fully copied for writing.

    Low-level memory management, so it can store 2 GB long strings. And in most situations you do not need a string builder, because strings are freed immediately and very fast when the refcount drops to zero without staying around as garbage.

    Behaves as a value type with the null pointer being the empty string, so you can never get a null pointer exception. Any string operation returns always a valid string.

    (Optionally) index checked with the length, so you can never get a buffer overflow.

    Unfortunately the newer FreePascal/Delphi versions made it all very confusing by adding an encoding field. In the past you could just assume all strings are UTF-8 as a code style rule. Now each string has an individual encoding. It is converted automatically from one encoding to another, but not always, so in any function that needs to access the characters, you kind of need to check if it is called with an utf-8 string, or latin1 string, or some other encoding.

    • masklinn 6 days ago

      > Delphi/Pascal had the best strings

      Delphi/Pascal had fucking terrible strings.

      > Single byte storage, so it works well with utf-8.

      No awareness of encoding, so easy to break your text if you try to do anything other than pipe it through.

      > Byte length field, so it knows the length of every string and it can store null bytes.

      Single-byte length field, so your strings can't exceed 255 bytes, and you need a pointer indirection just to get the length of your string.

      > An additional null byte past the end and pointer shifting, so every Pascal string is also a C pchar string.

      Scratch that, 254.

      > Reference-counted copy-on-write, so you can copy it in constant time for reading and treat it as if the string was fully copied for writing.

      Rc'd cow, so you never know whether your write is going to be free or incur a complete copy.

      > Behaves as a value type with the null pointer being the empty string, so you can never get a null pointer exception. Any string operation returns always a valid string.

      The null pointer is the empty string, you don't get to know whether you got no input or got an empty input.

      > (Optionally) index checked with the length, so you can never get a buffer overflow.

      (Optionally) you're completely unable to implement generic string operations because the string type carries a length and the language gives no way to be generic over that.

      • vintagedave 5 days ago

        With all due respect,

        > No awareness of encoding > Single-byte length field > Scratch that, 254.

        is completely wrong. Or rather, hasn't been right for several decades.

        A modern Delphi String has unlimited length (well, 32-bit value, a multi-gigabyte string is okay for 1996, I think), carries encoding (see my comment in this same thread), etc.

        COW is another argument, but one that seems to have won out over time. Many string implementations use many tricks to achieve something similar, or go with COW directly.

      • benibela 5 days ago

        Now you are confusing everything. Delphi has three single byte string types since 1996: (ansi)string, shortstring and string[length], and the last two are mostly there for backward compatibility.

        >No awareness of encoding, so easy to break your text if you try to do anything other than pipe it through.

        Which is good, when you have data with an unknown encoding. Unix file names for example, you can create a file that has an invalid utf-8 file name, and many tools that want to store the file name in a string with an encoding, just cannot access that file.
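
        For what it's worth, a sketch of how Python gets around this particular problem while still keeping an encoding-aware string type: the `surrogateescape` error handler (PEP 383) smuggles undecodable bytes through as lone surrogates so they round-trip. The file name below is invented for the example:

```python
raw_name = b"report-\xff.txt"  # hypothetical file name; 0xFF is never valid in UTF-8


def strict_decode_fails(data: bytes) -> bool:
    """True if the bytes are rejected by a strict UTF-8 decode."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
name = raw_name.decode("utf-8", errors="surrogateescape")
round_tripped = name.encode("utf-8", errors="surrogateescape")
```

        The decoded string is not valid Unicode text, but the original bytes come back unchanged, so the file stays reachable.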

        >Single-byte length field, so your strings can't exceed 255 bytes, and you need a pointer indirection just to get the length of your string.

        Ansistrings have a 32-bit length. I have not used Delphi on a 64 bit system, but in FreePascal they have a 64-bit length there.

        shortstrings have a 255 max length, but since they are short they are mostly stored on the stack and you do not need any pointer indirection.
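
        To make the 255-byte limit concrete, here is a rough Python sketch of a length-prefixed layout in the style of a shortstring: one length byte followed by the payload. The helper names are hypothetical and this is not FreePascal's or Delphi's actual implementation:

```python
def to_shortstring(data: bytes) -> bytes:
    """Encode as one length byte followed by the payload (hypothetical helper)."""
    if len(data) > 255:
        raise ValueError("a single length byte cannot describe more than 255 bytes")
    return bytes([len(data)]) + data


def from_shortstring(buf: bytes) -> bytes:
    """Read the length byte, then take exactly that many payload bytes."""
    length = buf[0]
    return buf[1:1 + length]


# The length lives in the prefix, so an embedded NUL byte is just data.
encoded = to_shortstring(b"hi\x00there")
decoded = from_shortstring(encoded)
```

        Because the length is stored up front, reading it is a single byte load and no scan for a terminator is needed.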

        >Rc'd cow, so you never know whether your write is going to be free or incur a complete copy.

        Shortstrings are not rc'd cow and are always copied. Since they are short and on the stack that copying is fast.

        No write on long strings is free, since you can get a cache miss. And you only make a copy if some other function still has a reference to the string. How could that happen? The function should only keep the string around if it still needs the old value, and then it would need to make a copy anyways. No cow always leads to more copying.
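
        A simplified Python sketch of the copy-on-write rule described here: a write copies the buffer only while another holder still shares it. `CowString` and its methods are hypothetical names for illustration, and the sketch never decrements the count when a holder is discarded, which a real implementation would have to do:

```python
class CowString:
    """Toy reference-counted copy-on-write byte string (illustrative only)."""

    def __init__(self, data: bytes = b""):
        self._buf = bytearray(data)
        self._owners = [1]  # shared, mutable reference count

    def copy(self) -> "CowString":
        # Constant time: share the buffer and bump the shared count.
        other = CowString.__new__(CowString)
        other._buf = self._buf
        other._owners = self._owners
        self._owners[0] += 1
        return other

    def append(self, more: bytes) -> None:
        if self._owners[0] > 1:
            # Someone else still holds the old value: detach with a full copy.
            self._owners[0] -= 1
            self._buf = bytearray(self._buf)
            self._owners = [1]
        # Sole owner at this point, so mutate in place.
        self._buf += more

    def value(self) -> bytes:
        return bytes(self._buf)


a = CowString(b"abc")
b = a.copy()      # shares storage with a
b.append(b"d")    # shared, so b detaches first
a.append(b"!")    # a is the sole owner again, so no copy is made
```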

        >The null pointer is the empty string, you don't get to know whether you got no input or got an empty input.

        If you want no input, you can use a pointer to a string.

        >(Optionally) you're completely unable to implement generic string operations because the string type carries a length and the language gives no way to be generic over that.

        Ansistrings do not have a length in the type, nor the basic shortstring. Nowadays you would define all string operations for ansistrings (=utf8strings).

    • vintagedave 6 days ago

      > Unfortunately the newer FreePascal/Delphi versions made it all very confusing by adding an encoding field

      I can clear this up. This change is really useful and important if you are using strings.

      A string of bytes has attached metadata saying what it is. Is it ANSI of some sort, or UTF8, or...? Is it a specific encoding, such as Windows-1252? Without that data, all you have are bytes, and you don't know how to interpret them.

      Thus, RawByteString (bytes); UTF8String (UTF8); and ANSI strings with the encoding, plus UnicodeString which is native Unicode on whichever platform (eg, on Windows it matches Windows UTF16.)

      This data is essential to convert to and from different string types. I don't know where conversion does "not always" happen - can't think of anywhere offhand. But if you ever run into issues, there are RTL functions for conversion. Check out the TEncoding class: http://docwiki.embarcadero.com/Libraries/Tokyo/en/System.Sys...

      > In the past you could just assume all strings are UTF-8 as code style rule.

      This was an incorrect assumption, because before the encoding metadata was added, you would have been using an AnsiString there, and that, by definition ('Ansi'), is not UTF8. These days, if you have a UTF8 string, you can place it in a UTF8String type. Correctness enforced by libraries is much better than a coding convention that a certain type contains a subtly different payload. That way lies horror. Metadata and strong typing is much safer.

      --

      I do agree with you that Delphi has the best strings :) Copy on write and embedded length both seem a real win, after twenty years of use, not to mention great string twiddling methods.

      I'm looking at adding string_view support to the strings currently (for C++17 support); one thing it highlights is how much more powerful the inbuilt String types are, and how much string_view is a workaround for a problem in C++'s string design which other string libraries - not just ours, but ours is IMO very good - do not suffer from.

    • jchb 6 days ago

      So now Swift has the best strings :) Or at least will have, when these changes are released.

      > Single byte storage, so it works well with utf-8.

      Check, underlying storage is a byte array with the utf-8 encoding.

      > Byte length field, so it knows the length of every string and it can store null bytes.

      Check.

      > An additional null byte past the end and pointer shifting, so every Pascal string is also a C pchar string.

      Check, getting a C string from a Swift string is O(1).

      > Reference-counted copy-on-write, so you can copy it in constant time for reading and treat it as if the string was fully copied for writing.

      Check.

      > Low-level memory management, so it can store 2 GB long strings.

      Check, just tested with a 4 GB string on macOS.

      > And in most situations you do not need a string builder, because strings are freed immediately and very fast when the refcounts zeros without staying around as garbage.

      Check, Swift objects are reference counted, storage of intermediary strings will be freed as soon as they go out of scope.

      > Behaves as a value type with the null pointer being the empty string, so you can never get a null pointer exception.

      Check, String is a value type.

      > (Optionally) index checked with the length, so you can never get a buffer overflow.

      Check, this is the default - trap on out-of-bounds access rather than an illegal memory access. If you really want, you can disable it by compiling with -Ounchecked.

      > Unfortunately the newer FreePascal/Delphi versions made it all very confusing by adding an encoding field. In the past you could just assume all strings are UTF-8 as code style rule

      Check. String operations are performed on grapheme clusters (rather than for example UTF-8 code points), which is generally the right level of abstraction to work with. There are "views" for accessing specific encodings.

  • needusername 6 days ago

    > Java ... They all use 2 byte encodings

    No, not anything that is OpenJDK 9+ based which uses 1 byte where possible.

    > They all use UTF-8 internally

    Which means a lot of functions now have linear instead of constant asymptotic complexity.

    • masklinn 6 days ago

      > Which means a lot of functions now have linear instead of constant asymptotic complexity.

      They already do if they do proper text manipulation as unicode itself is variable length and has to be stream-processed. O(1) access to codepoints is not actually useful, and most languages don't even provide it since they don't internally encode to UTF-32.
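
      A short Python sketch of why constant-time code point access buys little: one user-perceived character may span several code points, and the count differs again per encoding. The sample strings are my own choice for illustration:

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: two code points, one visible character.
decomposed = "e\u0301"
# NFC normalization folds the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Three emoji joined by ZERO WIDTH JOINER, usually drawn as one glyph.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

code_points = len(family)
utf8_bytes = len(family.encode("utf-8"))
utf16_units = len(family.encode("utf-16-le")) // 2

print(len(decomposed), len(composed), code_points, utf8_bytes, utf16_units)
```

      Finding where one such cluster ends and the next begins means scanning the text, whatever the storage width.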

  • WalterBright 6 days ago

    D has string, wstring, and dstring types, corresponding to UTF-8, UTF-16, and UTF-32 encodings. But people normally just stick with string.

    It's been a huge win for D.

  • paulddraper 6 days ago

    > C strings interoperate beautifully with UTF8

    C strings interoperate terribly with almost any encoding, including ASCII and UTF8.

    C strings cannot contain the NUL character.

    • unwind 6 days ago

      But NUL is not used by UTF-8 (by design, of course) so that cannot be a problem.

      • kibwen 5 days ago

        Any implementation of strings that cannot contain 0x0 is not a conformant implementation of UTF-8. It's not a conformant implementation of ASCII either, for that matter. C strings are neither (not that they claim to be).

      • taejo 6 days ago

        U+0000 is a unicode character, which is encoded as 0x00 in UTF-8

      • masklinn 6 days ago

        Not sure why you'd think that, NUL is a perfectly valid, supported and usable character in unicode, and thus UTF8.
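
        A small Python sketch of the mismatch, using `ctypes` to look at the same bytes the way a NUL-terminated C API would. The sample string is made up:

```python
import ctypes

text = "a\x00b"                  # three code points, the middle one is U+0000
encoded = text.encode("utf-8")   # U+0000 encodes as the single byte 0x00

# Copy the bytes into a C char array (ctypes appends a terminating NUL).
buf = ctypes.create_string_buffer(encoded)

seen_by_c = buf.value            # what NUL-terminated code sees: stops at first 0x00
all_bytes = buf.raw              # the full array, including the terminator

print(encoded, seen_by_c, all_bytes)
```

        The text is valid UTF-8, but anything that treats it as a C string silently truncates it after the first byte.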

  • jcelerier 6 days ago

    > although the popularity of emoji has changed this.

    are they that popular outside of chat apps ? how many emojis on this very page ?

    • masklinn 6 days ago

      > are they that popular outside of chat apps ?

      Yes.

      > how many emojis on this very page ?

      HN literally strips out arbitrary codepoints — including emoji — from comments. So zero. Because HN forbids them.

iknowstuff 6 days ago

Swift's String implementation is just porn to me. They had a few misguided attempts in Swift 1 through 3, but their final design is truly marvelous and points programmers who have no idea about encodings towards the right solutions, like by simply counting grapheme clusters correctly, avoiding cloning thanks to views and not allowing for direct string[subscript] access without deliberately stepping down into utf8 or utf16 codepoint representations.

The fact that Swift is aiming for ABI stability, something neither C++ nor Rust have because we all rely on C for FFI, is very interesting.

  • eridius 6 days ago

    Unfortunately, since Swift's String implementation operates on grapheme clusters by default, a lot of parsing code people write is actually subtly broken in the presence of combining characters. As a trivial example, let's say the input is a comma-delimited string (say, a line of CSV without quotes). The obvious way to split this is

      let fields = line.split(separator: ",")
    
    But given the input "foo,\u{301}bar" (which looks like foo,́bar), this won't split correctly and you'll end up with a single field that contains a comma. The correct way to split this is

      let fields = line.unicodeScalars.split(separator: ",").map(Substring.init)
    
    This will get you the correct 2 fields (at the cost of an intermediate array, as there is no lazy split).
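
    For contrast, a Python sketch on the same input. Python strings are sequences of code points rather than grapheme clusters, so a plain split behaves like the `unicodeScalars` version above and leaves the combining accent at the start of the second field:

```python
import unicodedata

line = "foo,\u0301bar"        # comma followed by COMBINING ACUTE ACCENT

# str.split compares code points, so the bare comma still matches.
fields = line.split(",")

# The second field now starts with a lone combining character.
leading = fields[1][0]
is_combining = unicodedata.combining(leading) != 0

print(len(fields), is_combining)
```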
    • masklinn 6 days ago

      The original split looks correct to me. If you're assuming CSV is a textual format, a comma with a combining acute accent is not the same thing as a comma: if I'm asking for a split on "e", I don't want a split on "é" whether in its precomposed form or not, and I especially do not want lone combining characters floating around as a result.

      If you're assuming CSV is a binary format, you should split on code units before the textual decoding.

      • eridius 5 days ago

        CSV is a machine-readable format. I've never heard of a machine-readable format that treats combining characters as significant when parsing syntax. Certainly the CSV format doesn't.

        Trivial proof: If it did, then it would be literally impossible to represent a field starting with a combining character.

        Also, the set of combining characters has changed over time. Machine-readable formats in general do not change as Unicode does. A CSV that parses today should not fail to parse next year because a field starts with a codepoint that today is unused and next year has been assigned to a combining character.

    • Twisell 6 days ago

      Have you opened a ticket about that? This is more due to the split implementation than to the format itself. Room for improvement.

      • eridius 5 days ago

        This is not an issue with the split implementation. Split is behaving correctly (according to the defined semantics of String). You'd get the exact same issue if you said str.index(of: ",") instead; in both cases, the "," is a Character, not a UnicodeScalar, and "," != ",\u{301}".

    • dbaupp 6 days ago

      "Parsing" CSV using split (whether on graphemes or on scalars) isn't correct at all, though, due to quoting.

      • eridius 5 days ago

        This is why I said "without quotes". The same issue happens if you write a parsing routine that handles quotes too of course, it was just a lot simpler to demonstrate it with a single split() line than a whole parsing loop.

  • kjeetgill 6 days ago

    Views are ... Hit and miss. I think Java has flip-flopped on that one. The issue is cases where you parse a config to extract a single property name and that keeps the whole thing around.

    • masklinn 6 days ago

      The problem with Java is that its views were implicit: a given `String` could be either the owner of its entire data, or a reference to a slice in another string. This led to cheap sub-stringing, but significant memory leaks as you'd carry around a short substring and it would carry along the entire string it originated from (possibly gigabytes of it).

      In Swift, string views are a different type from strings. For the most part you can perform similar operations on them but they are not the same type, so a `Substring` clearly tells you that it's linked to arbitrary-size baggage, while a `String` tells you it owns its data. You can operate generically over both using string protocols, or you can specifically ask for one or the other.
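
      The same view-versus-owner split can be sketched in Python with `memoryview`, used here only as an analogy to `Substring` and `String`. The config contents are made up:

```python
# A large buffer from which only a four-byte value is wanted.
config = bytearray(b"name=demo;" + b"x" * 1_000_000)

# A view is cheap to create, but it references the whole underlying buffer.
view = memoryview(config)[5:9]

# An explicit copy owns just its own four bytes.
owned = bytes(view)

print(view.obj is config, len(view.obj), len(owned))
```

      As long as `view` is alive the full megabyte stays reachable through `view.obj`, while `owned` can outlive the buffer at the cost of one small copy. The distinct types make that choice visible at the call site.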

    • dwaite 6 days ago

      String slices and views in Swift have a distinct type from String. Implicit property types can bring this problem back, but it is at least harder to do by accident than in Java.

  • woolvalley 6 days ago

    And as a result, string performance is relatively horrible in Swift when you don't need all of that, and you have to drop down to something approaching a C string to avoid it.

ComputerGuru 6 days ago

Wait. A programming language written this side of 2010 stored strings in a bastardized sometimes-ASCII sometimes-UTF-16?

blink

Why on earth not just use utf8 from the start? Surely no micro-optimization made possible by their choice could be worth such a convoluted design?

  • josephg 6 days ago

    There's a huge performance benefit to using a complex string type like this in user-facing applications. The reason is that most strings are really small - like, a few bytes small. When strings are smaller than pointers, allocating them on the heap is silly and inefficient.

    So NSString uses a bunch of wacky encodings internally to pack very small strings into the 8 byte pointer that would otherwise be used to point to an object on the heap.

    I suspect that microoptimizations like this have done a lot of the work in making iOS outperform Android clock-for-clock. Although I suppose right now that's less of a big deal given how fast their A-series chips are as well.

    • saagarjha 6 days ago

      > So NSString uses a bunch of wacky encodings internally to pack very small strings into the 8 byte pointer that would otherwise be used to point to an object on the heap.

      Nitpick: these pointers are odd, so they can't point to a valid object on the heap (malloc is 16-byte aligned on macOS). So it makes sense to reuse them by tagging them and packing a string (or number, or date) into the remaining bits to save on an allocation.
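
      A rough Python sketch of the tagging idea: set the low bit so the word can never be mistaken for an aligned heap pointer, then pack the length and up to seven ASCII bytes into the rest of a 64-bit word. The bit layout here is invented for illustration and is not NSString's actual encoding:

```python
TAG_BIT = 0x1     # odd value: cannot be an aligned heap address
MAX_LEN = 7       # one byte for tag + length, seven bytes of payload


def pack_small(s: str):
    """Return a tagged 64-bit word, or None if s must live on the heap."""
    try:
        data = s.encode("ascii")
    except UnicodeEncodeError:
        return None
    if len(data) > MAX_LEN:
        return None
    word = TAG_BIT | (len(data) << 1)     # bits 1-3 hold the length
    for i, byte in enumerate(data):
        word |= byte << (8 * (i + 1))     # payload starts at the second byte
    return word


def unpack_small(word: int) -> str:
    """Inverse of pack_small for a tagged word."""
    length = (word >> 1) & 0x7
    return bytes((word >> (8 * (i + 1))) & 0xFF for i in range(length)).decode("ascii")


tagged = pack_small("hello")
restored = unpack_small(tagged)
```

      Strings that do not fit (too long, or not ASCII in this toy layout) return `None`, standing in for the fallback to a normal heap allocation.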

    • kevin_thibedeau 6 days ago

      Most of iOS is native code. That's why it beats Android.

  • ken 6 days ago

    The article says: Chinese text is "over 3x faster than before", and "ASCII benefits even more, despite the old model having a dedicated storage representation and fast paths for ASCII-only strings". That sounds like a pretty good reason to me.

  • masklinn 6 days ago

    Because Swift was built on the foundations (heh) of and to cheaply interact with Objective-C, and Obj-C is of that misguided generation of languages which believed UCS2 was a good idea, and later had to back-define it into UTF-16.

  • ubernostrum 6 days ago

    Python (as of version 3.3) does something similar. Source code is assumed UTF-8 by default, and all other string creation requires explicit conversion from bytes. So when creating and storing a string object, Python looks at the widest code point that will be in the string, and chooses the narrowest encoding -- latin-1, UCS-2, or UCS-4 -- that can store the string as fixed-width units.

    This gets all the benefits people usually want from "just use UTF-8", and then some -- strings containing only code points in the latin-1 range (not just the ASCII range) take one byte per code point -- and also keeps the property of code units being fixed width no matter what's in the string. Which means programmers don't have to deal with leaky abstractions that are all too common in languages that expose "Unicode" strings which are really byte sequences in some particular encoding of Unicode.

    • rectang 6 days ago

      The tradeoff is that at unpredictable moments, memory requirements for string content can quadruple.
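
      Both the width selection and this tradeoff can be observed with `sys.getsizeof`. This is a sketch that assumes CPython 3.3 or later; exact object sizes are an implementation detail, so it only measures growth per code point:

```python
import sys


def bytes_per_code_point(ch: str) -> int:
    """Storage growth per repeated code point (CPython-specific measurement)."""
    return (sys.getsizeof(ch * 2000) - sys.getsizeof(ch * 1000)) // 1000


widths = {
    "ascii": bytes_per_code_point("a"),
    "latin-1": bytes_per_code_point("\u00e9"),
    "bmp": bytes_per_code_point("\u4e2d"),
    "astral": bytes_per_code_point("\U0001F600"),
}

# Appending a single astral code point forces the whole string to 4-byte units.
base = "a" * 1000
grown = base + "\U0001F600"
ratio = sys.getsizeof(grown) / sys.getsizeof(base)

print(widths, round(ratio, 2))
```

      On CPython I would expect widths of 1, 1, 2 and 4 bytes, and a ratio a little under 4 for the single appended emoji.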

      Python is inexorably committed to idioms which assume fixed width characters -- there's no persuading the community to use e.g. functions to obtain substrings rather than array indexes. So this is an understandable design decision.

      • ubernostrum 6 days ago

        > assume fixed width characters

        Python strings are not iterables of characters. They're iterables of Unicode code points. This is why leaking the internal storage up to the programmer is problematic; prior to 3.3, you'd routinely see artifacts of the internal storage (like surrogate pairs) which broke the "strings are iterables of code points" abstraction.

        > e.g. functions to obtain substrings rather than array indexes

        Strings are iterables of code points. Indexing into a string yields the code point at the requested index. While I'd like to have an abstraction for sequences of graphemes, strings-as-code-points is not the worst thing that a language could do. And all the "just use this thing that does exactly the same thing with a different name because I want indexing/length but also want to insist people don't call them that" is frankly pointless.

        • rectang 6 days ago

          > And all the "just use this thing that does exactly the same thing with a different name because I want indexing/length but also want to insist people don't call them that" is frankly pointless.

          Array index syntax over variable width data is problematic: either deceptively expensive -- O(n) for what looks like an O(1) operation -- or wrong. I suspect in that we agree.

          As for the alternative, I'm talking about Tom Christiansen's argument here: https://bugs.python.org/msg142041

          To paraphrase Tom's examples, this usage of array indexes is more idiomatic...

              s = "for finding the biggest of all the strings"
              x_at = s.index("big")
              y_at = s.index("the", x_at)
              some = s[x_at:y_at]
              print("GOT", some)
          
          ... than this, which the Python community would never adopt:

              import re
              s = "for finding the biggest of all the strings"
              some = re.search("(big.*?)the", s).group(1)
              print("GOT", some)
          
          The first involves more logical actions indexing into the string. If those operations are both O(1) and correct, then that's not a serious problem, which is what justifies the Python 3.3+ design.

          However, in terms of language design for handling Unicode strings, I prefer the tradeoffs of the second idiom: a single O(n) operation which is relatively easy to anticipate and plan for, rather than unpredictable memory blowups.

          • ubernostrum 5 days ago

            > Array index syntax over variable width data is problematic

            Which is why Python uses a solution that ensures fixed-width data. There's never a need to worry if a code point will extend over multiple code units of the internal storage model, because the way Python now handles strings ensures that won't happen.

            I really think a lot of your problem with this is not actually with the string type, but with the existence of a string type. You want to talk in terms of bytes and indexes into arrays of bytes and iterating over bytes. But that's fundamentally not what a well-implemented string type should ever be, and Python has a bytes type for you (rather than a kinda-string-ish-sometimes type that's actually bytes and will blow up if you try to work with it as a string) if you really want to go there.

            • rectang 4 days ago

              No, I believe that we need a string type to encapsulate Unicode, but that it should encourage use of stream processing idioms and discourage random access idioms.

          • masklinn 6 days ago

            For what that's worth, Swift has string indexes, they're just opaque and only obtainable through searching or seeking operations.

            Rust also has string indexes, they are not opaque, are byte indices into the backing buffer, and will straight panic if falling within a codepoint.
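
            A Python sketch of why a byte index has to land on a code point boundary. It works on UTF-8 `bytes` as an analogy and is not Rust's implementation; `is_char_boundary` here is my own helper, named after the Rust method that performs a similar check:

```python
def is_char_boundary(data: bytes, index: int) -> bool:
    """True if index does not fall inside a multi-byte UTF-8 sequence."""
    if index == 0 or index == len(data):
        return True
    if index < 0 or index > len(data):
        return False
    # UTF-8 continuation bytes have the bit pattern 0b10xxxxxx.
    return (data[index] & 0xC0) != 0x80


data = "h\u00e9llo".encode("utf-8")   # b"h\xc3\xa9llo": U+00E9 takes two bytes

# Cutting at index 2 splits the two-byte sequence, so decoding fails.
try:
    data[:2].decode("utf-8")
    mid_slice_ok = True
except UnicodeDecodeError:
    mid_slice_ok = False

# Cutting at index 3 lands on a boundary and decodes cleanly.
prefix = data[:3].decode("utf-8")
```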

  • dep_b 6 days ago

    Probably backwards compatibility with NSString

  • akvadrako 6 days ago

    UTF-8 is very inefficient for some character sets - of course it shouldn't be the only option.