Conflating pointers with arrays: C's biggest mistake? (2009)

260 points by etrevino 6 years ago

xroche 6 years ago

This is IMHO by far NOT C's biggest mistake. Not even close. A typical compiler will even warn you when you do something stupid with arrays in function definitions (-Wsizeof-array-argument is the default nowadays).
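
For readers who have not met that warning: a parameter written as an array is adjusted to a pointer, so `sizeof` inside the function no longer sees the array. The sketch below shows the effect and one common workaround (passing a pointer to the whole array). The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* A parameter declared as `int arr[10]` is adjusted to `int *arr`,
   so sizeof reports the size of a pointer, not of ten ints. */
size_t size_through_pointer(int *arr)
{
    return sizeof arr;
}

/* A pointer to the whole array keeps the element count in the type. */
size_t size_through_array_pointer(int (*arr)[10])
{
    return sizeof *arr;
}

/* Small demonstration of both facts. */
void sizeof_demo(void)
{
    int a[10] = {0};
    assert(size_through_pointer(a) == sizeof(int *));
    assert(size_through_array_pointer(&a) == 10 * sizeof(int));
}
```

The second form costs nothing at run time, but it only accepts arrays of exactly ten elements, which is part of why it is rarely used.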

On the other hand, UBE (undefined or unspecified behavior) is probably the nastiest thing that can bite you in C.

I have been programming in C for a very, very long time, and I am still getting hit by UBE from time to time, because, eh, you tend to forget "this case".

Last time, it took me a while to realize the bug in the following code snippet from a colleague (not the actual code, but the idea is there):

struct ip_weight { in_addr_t ip; uint64_t weight; };

const struct ip_weight ipw1 = {0x7F000001, 1};
const struct ip_weight ipw2 = {0x7F000001, 1};

const uint32_t hash1 = hash_function(&ipw1, sizeof(ipw1));
const uint32_t hash2 = hash_function(&ipw2, sizeof(ipw2));

The bug: hash1 and hash2 are not the same. For those who are fluent in C UBE, this is obvious, and you'll probably smile. But even for veterans, you tend to miss that after a long day of work.

This, my friends, is part of the real mistakes in C: leaving too many UBE. The result is coding in a minefield.

[You probably found the bug, right? If not: the obvious issue is that 'struct ip_weight' needs padding for the second field. And while all omitted fields are by the standard initialized to 0 when you declare a structure on the stack, the padding value is undefined; and gcc typically leaves padding with dirty stack content.]

jcelerier 6 years ago

> [You probably found the bug, right ? If not: the obvious issue is that 'struct ip_weight' needs padding for the second field.
No, the bug is thinking that hashing random bytes in your memory is correct. Why wouldn't you make a correct hash function for your struct?!
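
A minimal sketch of what such a per-struct hash could look like, assuming a 32-bit FNV-1a mix. The original `hash_function` is not shown in the thread, so `ip_weight_hash` and `fnv1a_mix` are hypothetical names, and `uint32_t` stands in for `in_addr_t` to keep the example free of POSIX headers. Only field values are fed to the hash, so padding never enters the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ip_weight {
    uint32_t ip;      /* stand-in for in_addr_t (assumption) */
    uint64_t weight;
};

/* Mix the low `nbytes` bytes of `v` into a 32-bit FNV-1a state,
   least significant byte first, independent of host endianness. */
static uint32_t fnv1a_mix(uint32_t h, uint64_t v, unsigned nbytes)
{
    for (unsigned i = 0; i < nbytes; i++) {
        h ^= (uint32_t)((v >> (8 * i)) & 0xFFu);
        h *= 16777619u;               /* 32-bit FNV prime */
    }
    return h;
}

/* Hash the logical value of the struct: its fields, never its padding. */
uint32_t ip_weight_hash(const struct ip_weight *p)
{
    assert(p != NULL);
    uint32_t h = 2166136261u;         /* 32-bit FNV offset basis */
    h = fnv1a_mix(h, p->ip, 4);
    h = fnv1a_mix(h, p->weight, 8);
    return h;
}
```

With this, two objects initialized as `{0x7F000001, 1}` hash to the same value no matter what the stack held before.
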
- kjeetgill 6 years ago
  
  I really have to second this one. It's still a rough point against C, but if padding isn't treated as "semantically unreachable" idiomatically, that IS the real gotcha.
  The idiom here is to take the address of the struct and read the width of its whole footprint in memory, not just the field data. It's a weak idiom that breaks under a common use case.
- Too 6 years ago
  
  Maybe this is the biggest mistake of C: allowing users to access underlying raw memory so easily and misleading people with "convenient" functions like memcmp, etc.?
  Most "UB bugs" stem from users who think they know that a struct or data type will be laid out in a certain sequence in memory.
  
  tonysdg 6 years ago
  
  Unless I'm mistaken, the low-level access to memory is one of the defining features of C. It's basically designed to be human-readable assembly (which is just human-readable machine code).
  If anything, I'd blame compilers here -- IMO, they should automatically throw at least a warning any time they need to pad/rearrange a struct to make it explicitly clear to developers what's happening.
  
  Too 6 years ago
  
  Access to low level memory such as registers is always explicitly requested with the volatile keyword. All other memory is implementation details. C is far from being a human readable assembly and has never been, except accidentally.
  Even the old language spec from 1989 talks about an "abstract machine with the expressions being evaluated as specified by the semantics". One of the first chapters, 1.2 Scope, makes this very clear, and then 2.1.2.3 Program execution clarifies it even further. Another thing to note is that the spec never says variables are stored on the stack; instead it's called automatic storage (see "Storage durations of objects"). The word stack is not mentioned even once in the whole spec.
  Today with modern CPUs all this is even more important, if you think you are operating on a global memory array you are doing it wrong.
WalterBright 6 years ago

By 'mistake' I am considering the context of the times in which C was developed. Most of the UBE in C is there because of (1) the cost of mitigating it and (2) specifying it would impede portability.
Buffer overflows are UBE, too. But the way I proposed the fix is a pretty much cost-free solution, and it's optional.
Redefining C so that struct padding is always 0'd is an expensive solution, and rarely needed.
- blub 6 years ago
  
  I am curious why you say cost-free.
  Using std::vector/array/string::at would literally eliminate buffer overflows and yet programmers aren't generally using this style.
  I would love it if I could prove to my colleagues that mandatory bounds-checking would not result in a noticeable performance loss, but my gut feeling is that it's not so. Interestingly Rust (and I guess D) does just that and seems to be getting away with it.
  However, in the C UB thread the author of a Rust crate mentioned that a case for using unsafe is exactly this: avoiding the performance loss of bounds checking.
  
  WalterBright 6 years ago
  
  Turning the bounds checking on/off is done with a compiler switch. There indeed is a cost for leaving it on. Most users, however, regard the cost as worth paying for the protection.
  It's still up to you, the programmer. With C, though, you have no choice. No checking for you!
  I meant cost-free in the sense that one way or another in C code you wind up passing the length anyway, or in the case of 0 terminated strings, you wind up recomputing it if you don't pass it.
  
  pjmlp 6 years ago
  
  Regarding whether it is worthwhile to pay for the protection, I would quote Hoare on his experience with Algol compilers in production, from his Turing Award lecture.
  "Many years later we asked our customers whether they wished us to provide an option to switch off these checks in the interests of efficiency on production runs. Unanimously, they urged us not to--they already knew how frequently subscript errors occur on production runs where failure to detect them could be disastrous. I note with fear and horror that even in 1980, language designers and users have not learned this lesson. In any respectable branch of engineering, failure to observe such elementary precautions would have long been against the law."
  
  jcelerier 6 years ago
  
  I will reply with another Hoare quote, from a paper from 15 years later:
  > Fortunately, the problem of program correctness has turned out to be far less serious than predicted. A recent analysis by Mackenzie has shown that of several thousand deaths so far reliably attributed to dependence on computers, only ten or so can be explained by errors in the software: most of these were due to a couple of instances of incorrect dosage calculations in the treatment of cancer by radiation.
  https://www.gwern.net/docs/math/1996-hoare.pdf
  
  pjmlp 6 years ago
  
  Except that "far less serious than predicted" is not the same as non-existent, and that quote in no way talks about bounds checking, which is the issue being discussed here.
  In real life, we use safety gloves, seatbelts, elbow and knee protection, helmets, chainsaw cover, gun lock, ....
  Naturally there are those that think such protections are only for children and an accident will never occur to them, until the day they happen to be part of the statistic.
  I long for the day that lawsuits for buggy software become a regular activity, only then the industry will actually care to change.
  
  jibal 6 years ago
  
  "Except that "far less serious than predicted" is not the same as non-existent"
  Pure strawman. The statement was that program correctness is far less serious than predicted, not totally benign.
  "and that quote in no way talks about bounds checking, which is the issue being discussed here"
  Um, programs that exceed array bounds are not correct, so yes, it does talk about them.
  [further strawmen not worth addressing]
  
  pjmlp 6 years ago
  
  Sure, you made it clear where we stand regarding quality.
  
  jibal 6 years ago
  
  "I am curious why you say cost-free."
  What's the cost? It's an optional feature.
  "Using std::vector/array/string::at would literally eliminate buffer overflows and yet programmers aren't generally using this style."
  C programmers don't use that because it doesn't exist in C.
  "I would love it if I could prove to my colleagues that mandatory bounds-checking would not result in a noticeable performance loss"
  Did you read his article? There's nothing in it about mandatory bounds-checking.
  
  blub 6 years ago
  
  If the feature is not used in production, it's not exactly improving safety any more than ASan, which is already available for C. So it has to be mandatory (e.g. by team/company decision) and then one has to look at the costs of enabling it.
  
  jibal 6 years ago
  
  No, that's completely wrong.
blub 6 years ago

Interestingly, the idea of hashing over the bytes of a data structure would be possible in C++, but it is non-idiomatic and instead one would use hash functions for each type and in the case of a struct the individual members would be accessed to build the hash value.
The idea of using bytes is error-prone, but now that you mentioned it, pretty typical of the C mindset.
Of course C++ also has some of these cultural biases. I think they're an important reason why unsafe code continues to be written.
- pjmlp 6 years ago
  
  In C++, it depends on which tribe you came from.
  The safe programming tribe, in which I include myself, usually refugees from Wirth languages, makes use of C++ abstractions and type safety to deal with unsafety, including type driven development.
  Meaning heavy use of templates, type wrappers, STL (or standard library if you prefer) data types, pre-processor only for #include and yes some meta-programming as well.
  Then there is the tribe of C refugees, who had a C++ compiler forced on them by a platform SDK and usually eschew anything standard. They might write some C++-like code for interop with the SDK APIs, and that's it.
  Naturally there are a few tribes in-between, but these are the two major groups.
- bluecalm 6 years ago
  
  I object to calling reinterpreting the bytes of one object as an object of another type a C mindset. Yes, this is done sometimes if you really need it (compression being one example), but it's not really something you write every day in C, and doing it to calculate a hash is just lazy.
  
  amluto 6 years ago
  
  Except that hashing a block of memory is likely to be much faster than hashing individual fields once there are more than a couple fields.
  IMO the right solution would be a special annotation on a struct that says “I want the logical value of this struct to uniquely determine the bytes of the struct’s in-memory representation.”
  Of course, adding such an attribute without nasty edge cases may be tricky.
  
  Tarean 6 years ago
  
  But this mostly matters for arrays, and then an alternative optimization might be to store the data as a structure of arrays. That doesn't need the padding and is more SIMD-friendly.
  
  blub 6 years ago
  
  The C mindset is one of always looking for optimisations and thinking at the bit & byte level. The above solution flows naturally...
- jibal 6 years ago
  
  "Interestingly, the idea of hashing over the bytes of a data structure would be possible in C++, but it is non-idiomatic"
  Because it's not conformant. Those possible C++ programs are broken.
  
  blub 6 years ago
  
  I was making a point about language culture and decision making.
  Conformance is a property of the code; the programmer's decision process is the really interesting thing here to me, and I was arguing that such a solution is not unusual for C, but it would be for C++, where a programmer would avoid it not because it's not conformant, but because it looks odd. pjmlp further clarified that it depends on the type of C++ programmer - I agree.
  
  jibal 6 years ago
  
  Non conformant code doesn't become an idiom, because it's broken. If you want to wrap that in a cultural framework, go ahead, but it rather misses the point.
  > I was arguing that such a solution is not unusual for C
  You may have asserted that (funny how people call their unsupported assertions "arguments"), but you're wrong if so. Again, it's broken, in C just as much as C++. You wrote
  > instead one would use hash functions for each type and in the case of a struct the individual members would be accessed to build the hash value.
  But this is just as much true in C as C++, because hashing pad bytes is wrong. In fact, C programmers are probably more aware of this than C++ programmers.
beefhash 6 years ago

Bonus mess: If hash_function operates on bytes, but the input is cast to uint8_t* instead of (unsigned) char*, this is also a violation of aliasing rules and the compiler can technically just do whatever.
- im3w1l 6 years ago
  
  But uint8_t will most of the time be typedefed to unsigned char in stdint.h. So your code should be fine right now but may or may not become undefined in the future.
- xroche 6 years ago
  
  Hummm, are you sure ? C Standard seems to allow signed/unsigned variants for aliasing rules ("a type that is the signed or unsigned type corresponding to the effective type of the object, a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object")
- isaachier 6 years ago
  
  Probably not a big deal assuming uint8_t is a typedef for unsigned char.
  
  beefhash 6 years ago
  
  The problem is that this is an assumption, not a guarantee. If somewhere, somewhen, some toolchain decides "actually, let's define this to __u8 and do fancy stuff with this compiler-internal type", your code breaks in the most mysterious way possible.
  
  v_lisivka 6 years ago
  
  Type uint8_t is guaranteed to have 8 bits, no padding bits, 2s complement. You are talking about uint_least8_t or int_fast8_t.
  Quote:
  Exact-width integer types
  The typedef name intN_t designates a signed integer type with width N, no padding bits, and a two's-complement representation. Thus, int8_t denotes a signed integer type with a width of exactly 8 bits. The typedef name uintN_t designates an unsigned integer type with width N. Thus, uint24_t denotes an unsigned integer type with a width of exactly 24 bits.
  
  Mindless2112 6 years ago
  
  But what about unsigned char? It is guaranteed to have CHAR_BIT bits, where CHAR_BIT is at least 8. If CHAR_BIT is greater than 8, then uint8_t cannot be typedef'd to unsigned char.
  I've stopped programming in C if I can help it. It's a crazy language with support for crazy machines.
  
  TheCoelacanth 6 years ago
  
  If CHAR_BIT is greater than 8, then uint8_t cannot be defined because CHAR_BIT is defined as the width of the smallest object that is not a bit-field. If uint8_t exists, it must be the same width as unsigned char.
  
  beefhash 6 years ago
  
  It must have the same width as unsigned char in that scenario, but that does not make it unsigned char. § 6.5 of C11 only allows casting to "a character type" (and a few cases irrelevant in this context), not "a type of CHAR_BIT width". There's no requirement that uint8_t is a typedef for unsigned char.
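
A sketch of how the two positions in this sub-thread can be made concrete in C11, assuming `uint8_t` exists on the target. `uint8_is_character_type` asks the compiler whether `uint8_t` names one of the character types on this implementation, and `count_nonzero_bytes` shows the byte access that is permitted everywhere: through `unsigned char`. Both function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 1 if uint8_t is one of the three character types on this
   implementation, 0 if it is some other 8-bit integer type. */
int uint8_is_character_type(void)
{
    return _Generic((uint8_t)0,
                    unsigned char: 1,
                    signed char: 1,
                    char: 1,
                    default: 0);
}

/* Inspecting an object representation through unsigned char is
   allowed for any object type, so no aliasing question arises. */
size_t count_nonzero_bytes(const void *obj, size_t size)
{
    assert(obj != NULL);
    const unsigned char *bytes = obj;
    size_t n = 0;
    for (size_t i = 0; i < size; i++)
        n += (bytes[i] != 0);
    return n;
}
```

Code that insists on `uint8_t *` for byte access could put the same `_Generic` expression inside a `_Static_assert`, turning the assumption into a compile error on a toolchain where it does not hold.
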
wruza 6 years ago

This wouldn’t be obvious if met in the wild, but once you point out that the bug exists, it is clear. The problem here is that we trust written code (its author) and rarely go into deep analysis at each line.
But like you said, you have to be aware of this effect at least, and it is not possible to eliminate that by simply clearing all variables to zero. See,
```
  p = (struct ip_weight *)some_used_mem;
  p->ip = ip;
  p->weight = weight;
```
At which point should imaginary p->garbage be set to zero? At cast? But that may again do an unexpected thing, since casts usually do not modify data. The entire struct abstraction seems leaky as hell, but that’s the price of not dealing with asm directly.
These examples show that it is not C itself that is hard, but low-level semantics. You have to somehow deal with struct {x, y} and the fact that y has to be aligned at the same time. And have different platforms in mind. Maybe it is platforms that should be fixed? Maybe, but these are hardware with other issues that may be even harder to get right.
I think C is okay, really (apart from compilers that push ub to the limit). Type systems in successors try to hide its rough edges, but at the end of the day you end up with the semantics of the compiler (c++, rust), which a regular guy has to understand anyway; it’s trading one complexity for another. C++ folks are often seen treating it as magic, simply not doing what they’re not sure about. The good part is that some languages force you to write correct code, no matter how much knowledge you have. But NewC could e.g. instead force one to create that ‘auto garbage’ to make it clear (why not, safety measures are inconvenient irl too).
I have no strong conclusion, but at least let’s think of all non-cs people who make their 64KB arduinos drive around and blink leds.
- bluecalm 6 years ago
  
  You don't need to set those bytes to 0 at all, why would you? The problem is that inside the hash function you do incorrect type punning, aka interpreting random bytes as numbers. Just read the fields and use their values for hashing.
sehugg 6 years ago

Clang has -Wpadded to catch these kinds of bugs, FWIW. (and at least on my system it inits locals' padding with zero)
- xroche 6 years ago
  
  -Wpadded causes interesting... results when including glibc's headers.
hzhou321 6 years ago

While other comments suggest the solution is to implement the hash function based on field values, that throws away the simple, efficient, and general implementation of the original memory-based hash function. But if we understand the true source of the problem, isn't the obvious solution to redefine the structure as two 64-bit fields, or to add explicit padding bytes so one can explicitly zero them when necessary?
The reason for undefined behaviors is to avoid over engineering. In a capable engineer's eye, it is beautiful.
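
A sketch of the explicit-padding variant, assuming C11 for `_Static_assert`. The `reserved` field, the constructor `ip_weight_make`, and the byte-wise FNV-1a `hash_bytes` are all hypothetical, since the original `hash_function` is not shown in the thread. Because every byte of the struct now belongs to a named field, a memory-based hash becomes deterministic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ip_weight_explicit {
    uint32_t ip;        /* stand-in for in_addr_t (assumption) */
    uint32_t reserved;  /* the former padding, now a named field */
    uint64_t weight;
};

/* Compilation fails if the compiler still inserts hidden padding. */
_Static_assert(sizeof(struct ip_weight_explicit) ==
                   2 * sizeof(uint32_t) + sizeof(uint64_t),
               "struct ip_weight_explicit has implicit padding");

/* Single place where objects are built, so `reserved` is always 0. */
struct ip_weight_explicit ip_weight_make(uint32_t ip, uint64_t weight)
{
    struct ip_weight_explicit w = { ip, 0, weight };
    return w;
}

/* Byte-wise 32-bit FNV-1a over an object representation. */
uint32_t hash_bytes(const void *data, size_t size)
{
    assert(data != NULL);
    const unsigned char *p = data;
    uint32_t h = 2166136261u;         /* FNV offset basis */
    for (size_t i = 0; i < size; i++) {
        h ^= p[i];
        h *= 16777619u;               /* FNV prime */
    }
    return h;
}
```

The trade-off is that the result now depends on host byte order, which matters only if hashes are stored or sent to other machines.
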
- kahlonel 6 years ago
  
  That, or declare them as packed.
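
For reference, a sketch of the packed variant. Packing is not part of standard C; the `__attribute__((packed))` spelling below is the GCC/Clang extension and is an assumption about the toolchain. `uint32_t` again stands in for `in_addr_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: lay the fields out with no padding at all. */
struct __attribute__((packed)) ip_weight_packed {
    uint32_t ip;
    uint64_t weight;
};

/* With packing, the struct is exactly as large as its fields. */
void packed_demo(void)
{
    assert(sizeof(struct ip_weight_packed) ==
           sizeof(uint32_t) + sizeof(uint64_t));
    assert(offsetof(struct ip_weight_packed, weight) == sizeof(uint32_t));
}
```

The price is that `weight` is no longer naturally aligned, so the compiler may emit slower accesses, and taking a plain `uint64_t *` to it is unsafe on targets that fault on unaligned loads.
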
bluetomcat 6 years ago

C requires you to understand its conceptual execution and memory layout model in order to write safe code. That is, how the call stack works, the different types of storage, that each type has alignment requirements, and I'm not even mentioning threading issues.
No amount of syntax sugar on top will prevent you from writing unsafe code, unless that basic model is thrown away.
- Too 6 years ago
  
  Quite the contrary, it's when people think they understand the memory layout, the callstack, the alignment requirements, etc and abuse that knowledge to "optimize" their code or out of sheer laziness you get all the problems of C. Just ignore how things are stored in memory and treat it like Java and you will be just fine. Only difference should be that you have to think about memory allocation and object lifetime/ownership.
- pjmlp 6 years ago
  
  Except that languages even older than C made those issues easier to track down.
- bluecalm 6 years ago
  
  You don't need to know anything about alignment to write normal safe C code. Only when you dive deep into performance optimization do you start to worry about those things, but hopefully by then you care enough to check what the language spec allows.
  Seriously, just don't cast object of one type to another type and you can forget about alignment until you need to optimize.
  
  MaulingMonkey 6 years ago
  
  Compiler generated SSE, and multithreading primitives, etc. will break on unaligned access even on relatively unalignment-tolerant x86 chips - to say nothing of stricter ARM CPUs.
  You need to keep this in mind basically any time you're going from byte buffers to more complicated types. Basically anything touching allocation or IO needs to be aware of it, IMO, even before optimization.
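
One common way to go from a byte buffer to a wider type without assuming alignment is to copy the bytes with `memcpy` instead of casting the pointer. A minimal sketch follows; `load_u32` is a made-up helper name, and the value is read in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a uint32_t from any byte address, aligned or not.
   A cast such as *(const uint32_t *)buf would be undefined
   if buf is not suitably aligned for uint32_t. */
uint32_t load_u32(const unsigned char *buf)
{
    assert(buf != NULL);
    uint32_t v;
    memcpy(&v, buf, sizeof v);
    return v;
}
```

Compilers generally turn the fixed-size `memcpy` into a single load where the target allows it, so the safe form usually costs nothing.
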
pornel 6 years ago

Rust solved this particular class of problems nicely with `#[derive(Hash)]` without having to define what the padding bytes are.
jibal 6 years ago

I think Walter gave a very good argument for why it's the biggest mistake, and no warning will help with the consequences that he pointed out.
That C has UBE is not a mistake, it's fundamental to the language design, which allows for unrestricted access to the bare metal. If you want a different sort of language, use Java.
kahlonel 6 years ago

I don't think there is any UB case that applies to the given snippet, or am I missing something? Treating unpacked structs as byte buffers is asking for trouble.
bluecalm 6 years ago

To run into UB here you need to read the struct's bytes as something else and then do calculations on them. UB or not, it is asking for trouble. Why not just read the fields of the struct and use them to compute a hash?
jasonkostempski 6 years ago

I don't know much about C. Are there code analysis tools that would have caught the issue?
- xroche 6 years ago
  
  valgrind (--tool=memcheck) would probably have detected the issue, yes.
senatorobama 6 years ago

Why is struct ip_weight padded.. is it to make it word aligned on 64-bit platforms?
- blattimwind 6 years ago
  
  in_addr_t is a 32 bit integer. On 64 bit archs this will usually result in a 32 bit padding for word alignment.
- bluecalm 6 years ago
  
  Padding is needed for two reasons: an 8-byte element needs to start at an address that is a multiple of 8 bytes, and the whole struct's size needs to be a multiple of the alignment of its biggest member. The latter is needed for alignment once those structs sit next to each other, for example in an array.
- pandaman 6 years ago
  
  To align 'weight' member on its size.
  
  senatorobama 6 years ago
  
  so the pad will be before the 'weight' member in memory?
  
  tardo99 6 years ago
  
  Yes. See here: https://stackoverflow.com/questions/4306186/structure-paddin...
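
The layout can also be checked directly with `offsetof` and `sizeof`. The sketch below uses `uint32_t` as a stand-in for `in_addr_t` and hypothetical helper names. On common 64-bit ABIs such as x86-64 Linux the gap before `weight` is 4 bytes and the struct is 16 bytes, but the code only asserts what holds on every implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ip_weight {
    uint32_t ip;      /* stand-in for in_addr_t (assumption) */
    uint64_t weight;
};

/* Bytes between the end of `ip` and the start of `weight`. */
size_t gap_before_weight(void)
{
    return offsetof(struct ip_weight, weight) - sizeof(uint32_t);
}

/* All bytes of the struct that belong to no field. */
size_t total_padding(void)
{
    return sizeof(struct ip_weight) - sizeof(uint32_t) - sizeof(uint64_t);
}

void layout_demo(void)
{
    /* `weight` starts on a multiple of its own alignment. */
    assert(offsetof(struct ip_weight, weight) % _Alignof(uint64_t) == 0);
    /* Any gap before `weight` is part of the total padding. */
    assert(gap_before_weight() <= total_padding());
}
```
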

drfuchs 6 years ago

The proposal here is way too vague. And if you flesh it out, things start to fall apart: If nul-termination of strings is gone, does that mean that the fat pointers need to be three words long, so they have a "capacity" as well as a "current length"? If not, how do you manage to get string variable on the stack if its length might change? Or in a struct? How does concatenation work such that you can avoid horrible performance (think Java's String vs. StringBuffer)? On the other hand, if the fat pointers have a length and capacity, how do I get a fat pointer to a substring that's in the middle of a given string?

Similar questions apply to general arrays, as well. Also: Am I able to take the address of an element of an array? Will that be a fat pointer too? How about a pointer to a sequence of elements? Can I do arithmetic on these pointers? If not, am I forced to pass around fat array pointers as well as index values when I want to call functions to operate on pieces of the array? How would you write Quicksort? Heapsort? And this doesn't even start to address questions like "how can I write an arena-allocation scheme when I need one"?

In short, the reason that this sort of thing hasn't appeared in C is not because nobody has thought about it, nor because the C folks are too hide-bound to accept a good idea, but rather because it's not clear that there's a real, workable, limited, concise, solution that doesn't warp the language far off into Java/C#-land. It would be great if there were, but this isn't it.

WalterBright 6 years ago

I happen to know the idea does work, and has been working in D for 18 years now.
> If nul-termination of strings is gone, does that mean that the fat pointers need to be three words long, so they have a "capacity" as well as a "current length"?
No. You'll still have the same issues with how memory is allocated and resized. But, once the memory is allocated, you have a safe and reliable way to access the memory without buffer overflows.
> If not, how do you manage to get string variable on the stack if its length might change? Or in a struct? How does concatenation work such that you can avoid horrible performance (think Java's String vs. StringBuffer)?
As I mentioned, it does not address allocating memory. However, it does offer one performance advantage in not having to call strlen to determine the size of the data.
> On the other hand, if the fat pointers have a length and capacity, how do I get a fat pointer to a substring that's in the middle of a given string?
In D, we call those slices. They look like this:
```
    T[] array = ...
    T[] slice = array[lower .. upper];
```
The compiler can insert checks that the slice[] lies within the bounds of array[].
> Am I able to take the address of an element of an array?
Yes: `T* p = &array[3];`
> Will that be a fat pointer too?
No, it'll be a regular pointer. To get a fat pointer, i.e. a slice:
```
    slice = array[lower .. upper];
```
> How about a pointer to a sequence of elements?
Not sure what you mean. You can get a pointer or a slice of a dynamic array.
> Can I do arithmetic on these pointers?
Yes, via the slice method outlined above.
> If not, am I forced to pass around fat array pointers as well as index values when I want to call functions to operate on pieces of the array?
No, just the slice.
> How would you write Quicksort? Heapsort?
Show me your pointer version and I'll show you an array version.
> And this doesn't even start to address questions like "how can I write an arena-allocation scheme when I need one"?
The arena will likely be an array, right? Then return slices of it.
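
To make the slice idea concrete for C readers, here is a rough emulation in today's C, not the syntax proposed in the article. `struct int_slice`, `slice_of`, `slice_at` and `slice_quicksort` are hypothetical names, and the bounds checks are plain `assert` calls, so they vanish when compiled with `NDEBUG`. The quicksort recurses on sub-slices rather than on pointer pairs.

```c
#include <assert.h>
#include <stddef.h>

/* A "fat pointer": start address plus element count. */
struct int_slice {
    int   *ptr;
    size_t len;
};

/* Sub-slice [lower, upper), checked against the parent's bounds. */
struct int_slice slice_of(struct int_slice s, size_t lower, size_t upper)
{
    assert(lower <= upper && upper <= s.len);
    struct int_slice r = { s.ptr + lower, upper - lower };
    return r;
}

/* Bounds-checked element access. */
int *slice_at(struct int_slice s, size_t i)
{
    assert(i < s.len);
    return &s.ptr[i];
}

/* Quicksort (Lomuto partition, last element as pivot). */
void slice_quicksort(struct int_slice s)
{
    if (s.len < 2)
        return;
    int pivot = *slice_at(s, s.len - 1);
    size_t store = 0;
    for (size_t i = 0; i + 1 < s.len; i++) {
        if (*slice_at(s, i) < pivot) {
            int tmp = *slice_at(s, i);
            *slice_at(s, i) = *slice_at(s, store);
            *slice_at(s, store) = tmp;
            store++;
        }
    }
    /* Move the pivot into its final position. */
    int tmp = *slice_at(s, store);
    *slice_at(s, store) = *slice_at(s, s.len - 1);
    *slice_at(s, s.len - 1) = tmp;

    slice_quicksort(slice_of(s, 0, store));
    slice_quicksort(slice_of(s, store + 1, s.len));
}
```

A caller wraps an existing array once, for example `struct int_slice s = { a, 6 };`, and from then on the length travels with the pointer.
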
- drfuchs 6 years ago
  
  How exactly then would one declare a stack variable that is a string that is initialized to "abc", and then later has "1234" concatenated to it, so it's now "abc1234", without using the heap in any way? If the answer is "you can't do that", then that's fine for Java/C#/D, but not for C.
  
  Too 6 years ago
  
  You can't do that in C either, unless: 1) the "abc" string happens to be the last variable on the stack (good luck with that), or 2) you pre-allocate the string to fit a bit extra, which is not exactly a "solution".
  Maybe option 1 is feasible but not really practical; I can only see it being used in extremely low-level stuff with a non-standard compiler and going through tons of hoops like pre-creating loop variables used later in the function and disabling the optimizer.
  
  drfuchs 6 years ago
  
  char foo[9]; strcpy(foo,"abc"); strcat(foo,"1234");
  And yes, that's specifically written to be awful and unsafe, but there are circumstances where you need to be close to the metal and carefully resort to more complicated variations of such things. That's what C is fairly uniquely appropriate for.
  
  WalterBright 6 years ago
  
  Your example uses three strlen's. It also is at high risk for buffer overflows. Whenever I review code like this, sure as shootin', there's an error in it in the lengths somewhere. Here's the same code using dynamic arrays:
  char foo[9]; foo[] = "abc"; foo[3..3+5] = "1234";
  No unchecked buffer overflows, and no calls to strlen. The +5 puts the terminating 0 on.
  
  drfuchs 6 years ago
  
  You seem to be confused about the implementation of C's standard string library. There are certainly not 3 strlen calls underneath a strcpy() plus a strcat() call.
  But what's the mention of "terminating 0"? The article says that terminating zeros should not be needed under its proposal; and that's what I was saying didn't make sense. [Added:] So, if you didn't just happen to know that the global string variable foo contains a string that was 3 characters long, how would you concatenate "1234" to it? I don't see any way without either double-fat pointers, or terminating NUL.
  
  WalterBright 6 years ago
  
  strcpy(s1,s2) does a strlen on s2.
  strcat(s1,s2) does a strlen on s1 and s2.
  Now, two of the strlen's can be replaced with byte-by-byte copies checking for 0 for each, but that tends to lose the efficiency that a memcpy would bring, so you're pretty much suffering from it anyway.
  BTW, here's the strcat I wrote eons ago:
  https://github.com/DigitalMars/dmc/blob/master/src/CORE32/ST...
  It does do two strlen's (the repne scasb instructions). With the improvements in CPUs since there are probably better ways to write it, but that was pretty good for its day.
  Here's strcpy:
  https://github.com/DigitalMars/dmc/blob/master/src/CORE32/ST...
  which does the test-every-byte method. I think Steve Russell wrote it, but I'm not sure.
  If there's anything efficiently implemented in a C compiler, it's memcpy. Being able to implement string processing in terms of memcpy leverages that very nicely. strcat() and strcpy() don't leverage it.
  Which do you think is faster (s2 is 1024 bytes long)?
  strcpy(s1, s2);
  memcpy(s1, s2, 1024);
  I've dramatically speeded up a lot of my code and other peoples' by replacing the strxxx functions with memcpy. It's low hanging fruit and one of the first things I look for.
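
A sketch of that replacement in plain C, applied to the `foo[9]` example from earlier in the thread. It assumes the caller already tracks both lengths; `append_known_len` is a hypothetical helper, not a standard function. It does one bounds check and one `memcpy`, and never scans for a terminator.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Append src[0 .. src_len) to a buffer that currently holds dst_len
   characters. Keeps the result 0-terminated for C library interop.
   Returns the new length, or SIZE_MAX if the data would not fit. */
size_t append_known_len(char *dst, size_t dst_cap, size_t dst_len,
                        const char *src, size_t src_len)
{
    /* Need dst_len + src_len + 1 <= dst_cap, written without overflow. */
    if (dst_len > dst_cap || src_len >= dst_cap - dst_len)
        return SIZE_MAX;
    memcpy(dst + dst_len, src, src_len);
    dst[dst_len + src_len] = '\0';
    return dst_len + src_len;
}
```

Usage mirrors the strcpy/strcat pair: `n = append_known_len(foo, sizeof foo, 0, "abc", 3);` followed by `n = append_known_len(foo, sizeof foo, n, "1234", 4);`.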
  
  WalterBright 6 years ago
  
  To illustrate the advantage of memcpy, consider this inner loop possible implementation:
  mov EAX,[ESI]
  mov EBX,4[ESI]
  mov ECX,8[ESI]
  mov EDX,12[ESI]
  mov [EDI],EAX
  mov 4[EDI],EBX
  mov 8[EDI],ECX
  mov 12[EDI],EDX
  add ESI,16
  add EDI,16
  Modern processors can likely do the 4 loads and 4 stores in parallel. That can't be done with 0 terminated strings, as you have to check every byte for 0. Even worse, you have to take care not to seg fault by reading too far past the 0, as there may not be any valid memory there.
  
  gpderetta 6 years ago
  
  Well, you can use vectorized string operations that allow for null terminated strings.
  I do agree with the general idea that null terminated strings are a mistake though.
  
  shakna 6 years ago
  
  I'm usually one to lean on your knowledge, but looking at glibc's strcpy [0], I don't see a strlen. I do see two macros, CHECK_BOUNDS_LOW and CHECK_BOUNDS_HIGH, which get defined about here [1].
  Am I missing something?
  [0] https://github.com/lattera/glibc/blob/master/string/strcpy.c
  [1] https://github.com/lattera/glibc/blob/master/sysdeps/generic...
  
  WalterBright 6 years ago
  
  Good question. I preface this by the fact that I am not any expert on gcc compiler internals, I am assuming how I'd do it.
  1. This implementation tests every byte, as discussed in other posts here. That makes it slow.
  2. This implementation is likely not used - the gcc compiler probably has an internal code sequence it emits for a strcpy.
  
  MikeWey 6 years ago
  
  Seeing that the implementation relies on CHECK_BOUND_(HIGH/LOW) it's using bounded pointers, which are implemented as structs holding 3 fields. The value, the upper bound and the lower bound. That's one extra field compared to D arrays.
  The function still checks every byte for \0.
  And if I'm reading it correctly, it only checks the upper bound after copying the data. And checking it before copying would require a call to strlen.
  
  iforgotpassword 6 years ago
  
  I would've assumed glibc has a bunch of different implementations and then picks the fastest one at runtime depending on CPU capabilities, but maybe they just scrapped that since GCC has built-in versions of most string functions anyways, since it can often know or at least guess the characteristics of the input and output (length, alignment) and emit different code depending on that.
  
  vardump 6 years ago
  
  > There are certainly not 3 strlen calls underneath a strcpy() plus a strcat() call.
  No, but strcpy does need to check each char for a NUL byte as it copies the string. And strcat will need to redo the same check on the original strcpy'd string plus the new one.
  So "abc" length is effectively checked twice and "1234" once.
  So there's some truth to the matter, even though they're not "true" strlen calls.
  
  WalterBright 6 years ago
  
  The "terminating 0" is when you need it, for example when interfacing with the C standard library. D, for example, adds a terminating 0 to string literals so that they'll work with C code. For code that uses slices, the 0 is not necessary.
  
  drfuchs 6 years ago
  
  So, if you don't just happen to know that the global variable "char foo[9]" currently contains a string that is 3 characters long, how would you concatenate "1234" to it in place? I don't see any way without either double-fat pointers, or terminating NUL.
  A non-dynamic array of char is just supposed to be a simplistic example of the sort of thing that one has occasion to want to do in C and that doesn't seem to fit into the proposed model without going into "old C NUL-termination" mode, or a "keep track of the string's length yourself" scheme, either of which would seem to ruin the whole thing. Thus my claim that this single feature would be hard to graft onto C in a useful, upward-compatible way. It's fine to have a language where all strings are dynamically allocated on the heap, or have an immutable known length, but that's a non-starter in the existing C universe.
  The point isn't that doing things D's way isn't great; the point is that there's no reasonable way to put this feature into C. Every reasonable approach to string (and pointer) safety ends up being a new language: C#, Swift, Java, etc.
  
  jibal 6 years ago
  
  " I don't see any way without either double-fat pointers, or terminating NUL."
  You don't seem to have followed the discussion or the paper. The array length is known; a slice is a "fat pointer", but not "double-fat".
  "The point isn't that doing things D's way isn't great; the point is that there's no reasonable way to put this feature into C."
  You're plainly wrong; the proposal in the article does exactly that.
  
  WalterBright 6 years ago
  
  You're trying to require this to manage memory allocation. It does not address memory allocation any more than C pointers do. Memory allocation remains up to the programmer with this proposal.
  
  Too 6 years ago
  
  Ok, so pre-allocating then. That's sort of equivalent to new StringBuffer(9), except java uses the heap.
  Safe types don't stop you from doing this. You do need another length field though, one for the allocated size and one for the used size. In C++, std::string already has this feature with the reserve/capacity functions. The STL is also heap based, but it is possible with some effort to pass it a stack allocator. Now C++ isn't exactly the best reference when it comes to these things either; I'm just saying that conceptually fat pointers don't stop you from doing these things. See WalterBright's reply for a better example.
  
  bumholio 6 years ago
  
  > there are circumstances where you need to be close to the metal and carefully resort to more complicated variations of such things. That's what C is fairly uniquely appropriate for.
  You can still do that in a C dialect with fat pointers. You can have strings with 5 byte chars and use 0x1337 as string terminator if that's what your metal needs.
  The point is, you don't have to, the compiler provides you with a sane array implementation that is adequate for 99% higher level algorithmic tasks.
  
  WalterBright 6 years ago
  
  Your example does save on code size (but not performance, due to the 3 strlen()'s). But where it won't save on code size is when, say, one wants to retrieve just the filename from filename.ext. One has to strlen it, allocate memory, copy the memory, then remember to free it exactly once.
  With dynamic arrays, just return a slice.
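  A rough C approximation of "just return a slice", assuming a hypothetical `struct str_slice` (pointer plus length) standing in for the proposed array type. `file_stem` and `slice_of` are names made up for this sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical non-owning view: pointer plus length, no terminator needed. */
struct str_slice {
    const char *ptr;
    size_t len;
};

/* Returns the part of `name` before the last '.', or all of it if there is
   no '.'. Nothing is allocated or copied, so there is nothing to free. */
static struct str_slice file_stem(struct str_slice name)
{
    size_t i = name.len;
    while (i > 0) {
        i--;
        if (name.ptr[i] == '.') {
            name.len = i;   /* shorten the view; the bytes stay where they are */
            break;
        }
    }
    return name;
}

/* Helper for the demo: wrap a NUL-terminated C string in a slice. */
static struct str_slice slice_of(const char *s)
{
    struct str_slice r = { s, strlen(s) };
    return r;
}
```

  The result is not NUL-terminated, so printing it needs an explicit length, for example `printf("%.*s", (int)s.len, s.ptr)`.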
  
  kyberias 6 years ago
  
  You say strlen() but you probably mean "code has to iterate all the string elements".
  This doesn't happen 3 times, though:
  > char foo[9]; strcpy(foo,"abc"); strcat(foo,"1234");
  Strcpy has to iterate through all the elements because it copies them. This would happen regardless. It doesn't do strlen.
  Strcat has to find the end of the destination string, so it has to iterate (or call strlen). Then it's just strcpy again.
  Instead of 3 strlens, there is 1.
  Do you not understand how C string/arrays work, or why do you insist on 3 strlens?
  
  tjoff 6 years ago
  
  > Strcpy has to iterate through all the elements because it copies them. This would happen regardless. It doesn't do strlen.
  Iterating over all the elements and copying a fixed size are two very different things.
  strcpy has to read, byte for byte, and check it for null (which is exactly what strlen does). A "real" copy would just blindly copy a chunk of memory with no other processing on it. The speed difference is huge.
  
  WalterBright 6 years ago
  
  You're right. I apologize for using strlen metaphorically rather than literally.
  
  jibal 6 years ago
  
  > You say strlen() but you probably mean "code has to iterate all the string elements".
  Yes, but that's much more expensive than a memcpy (of the two source strings) or just knowing the length (of the string in the target buffer).
  > Strcpy has to iterate through all the elements because it copies them. This would happen regardless.
  No, it wouldn't; memcpy is generally a lot faster than strcpy.
  > Do you not understand how C string/arrays work
  Do you not understand that he's written a few C compilers, and designed and implemented a language that is known for its runtime compatibility with C?
  > or why do you insist on 3 strlens?
  Um, you already noted that he really means "code has to iterate all the string elements". What he should have said is that you have to find 3 NULs. That's an expensive operation even when you're copying the string while finding it.
  
  WalterBright 6 years ago
  
  See my other reply to this above.
  
  dmitrygr 6 years ago
  
  char str[16] = "ABC";
  
  v_lisivka 6 years ago
  

      #include <stdio.h>
      #include <string.h>

      int main() {
          const char* s1="abc";
          const char* s2="1234";
          char str[strlen (s1) + strlen (s2) + 1];
          strcpy (str, s1);
          strcat (str, s2);
          puts(str);
          return 0;
      }

      $ gcc -std=c99 -Wall -Werror ./tst.c -o tst && ./tst
      abc1234
  
  WalterBright 6 years ago
  
  Off the top of my head (in D syntax):

      char[3] s = "abc";
      char* p = cast(char*)malloc(s.length + 4);
      assert(p != null);
      char[] a = p[0 .. s.length + 4];
      a[0 .. 3] = s[];
      a[3 .. 3+4] = "1234";

  It's more verbose than necessary, but I wanted to illustrate the idea. Note how the allocation is turned into a dynamic array.
  Note that my proposal is not for a new memory allocation scheme for C, just a way to map data onto arrays.
  
  drfuchs 6 years ago
  
  You don't seem to understand heap vs. stack. The call to malloc() does a heap allocation, and the question was "without using the heap in any way". If it makes it conceptually easier, how about a solution where the string is simply a global variable? No malloc allowed.
  
  vkazanov 6 years ago
  
  You don't seem to understand who you are arguing with :-)
  
  geezerjay 6 years ago
  
  > You don't seem to understand heap vs. stack.
  Please check who you are replying to.
  https://en.wikipedia.org/wiki/Walter_Bright
  
  jkabrg 6 years ago
  
  WalterBright posted the following code snippet somewhere above:
  char foo[9]; foo[] = "abc"; foo[3..3+5] = "1234";
  `foo` is an array. `foo[3..8]` is a slice, which is an object that does its own bounds-checking. I don't think the heap is used here.
  Another explicit example:

      char foo[5] = "abcd";   // still NUL-terminated, carries length as well
      char[] bar = foo[2..3]; // a fat pointer with length 1
      bar[0] = 'C';
      printf(foo);            // abCd
      bar[1] = 'D';           // ERROR!!! bar has length 1

  Note that the array `foo` is now bounds-checked, which may affect backwards-compatibility. Also, `bar` is no longer null-terminated, which means you can't do printf on it.
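  In today's C the same behaviour can only be emulated by hand. A minimal sketch, assuming a hypothetical `struct char_slice` and helper names (`slice_of`, `slice_set`) invented for this illustration; the proposal would have the compiler generate the check instead:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fat pointer: start address plus element count. */
struct char_slice {
    char *ptr;
    size_t len;
};

/* Takes base[lo..hi) of an array with `n` elements; an out-of-range
   request yields an empty slice. */
static struct char_slice slice_of(char *base, size_t n, size_t lo, size_t hi)
{
    struct char_slice s = { base, 0 };
    if (lo <= hi && hi <= n) {
        s.ptr = base + lo;
        s.len = hi - lo;
    }
    return s;
}

/* Checked store: returns false instead of writing out of bounds. */
static bool slice_set(struct char_slice s, size_t i, char c)
{
    if (i >= s.len)
        return false;
    s.ptr[i] = c;
    return true;
}
```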
  
  WalterBright 6 years ago
  
  I showed you how to do that in another reply.
  
  acehreli 6 years ago
  
  Perfectionism is not engineering. Not finding a way to fix every use case that you can come up with is not an excuse to insist on unsafe features that are proven to be disasters in practice.
  
  thermodynthrway 6 years ago
  
  That sounds like a fairly bad idea, but both Java and C# let you access the stack and mess with variables from previous frames if you really want to. You can't access stack memory outside the heap, but I can't think of a use case.
  
  barrkel 6 years ago
  
  C# does (via ref parameters) but Java does not - any values you mutate must logically live in the heap (or a static).
  Some JIT optimization may allocate an instance in the stack, I'm not counting that.
alphaglosined 6 years ago

This article was not created based upon theory.
It was created based upon real world experience having designed and implemented it in D. Where all of your concerns have not been discussed in the years following that article (it was already in the language for about 8 years at that point aka the start and has been solidly proven to work in the exact same context as it would have done in C).
In D at least, you can grab the pointer with a simple .ptr and the length with .length. To get a specific element, it is as you would expect, &f[i], all nice and straightforward. But what if you want to create an array from malloc? In D that is easy, just slice it! malloc(len)[0 .. len]. And free is just as you would expect from above, free(array.ptr);
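For readers without D at hand, a rough C emulation of that malloc-then-slice pattern. `struct int_slice`, `slice_alloc`, `slice_at` and `slice_free` are hypothetical names for this sketch, and the struct only approximates what D's built-in slice gives you:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a D slice of int: .ptr and .len. */
struct int_slice {
    int *ptr;
    size_t len;
};

/* Roughly D's malloc(len)[0 .. len]: allocate, then remember the length.
   On allocation failure the slice is empty. */
static struct int_slice slice_alloc(size_t len)
{
    struct int_slice s = { calloc(len, sizeof(int)), len };
    if (s.ptr == NULL)
        s.len = 0;
    return s;
}

/* Roughly &f[i], but returns NULL when i is out of bounds. */
static int *slice_at(struct int_slice s, size_t i)
{
    return i < s.len ? &s.ptr[i] : NULL;
}

/* Roughly free(array.ptr); also resets the slice so it cannot dangle. */
static void slice_free(struct int_slice *s)
{
    free(s->ptr);
    s->ptr = NULL;
    s->len = 0;
}
```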
GlitchMr 6 years ago

> The proposal here is way too vague. And if you flesh it out, things start to fall apart
No, it's not, those ideas were implemented in practice in D and Rust, and there are no real issues with those. This feature could be easily implemented in C, there are no dependencies on features that C doesn't have.
> If nul-termination of strings is gone, does that mean that the fat pointers need to be three words long
No need to store the capacity. This is a slice, not a buffer. Go conflates those two for user's convenience, but this is not necessary, and in fact is waste of RAM - not an issue for Go, but it is an issue for C. For instance, `&str` in Rust is a pair of pointer to a string and its length and it works really well.
> If not, how do you manage to get string variable on the stack if its length might change? How does concatenation work such that you can avoid horrible performance (think Java's String vs. StringBuffer)?
Use your own slice buffer abstraction for that purpose. It can be implemented as a struct storing a slice and its capacity. Pass a pointer to slice buffer abstraction, if you want a function to be able to add elements to it. This is also how it works in Go, for that matter.
Slices don't define concatenation. This is C, not a high level programming language.
> Am I able to take the address of an element of an array?
Yes. `&a[3]`. It's still an array, it just knows its size.
> Will that be a fat pointer too?
No.
> How about a pointer to a sequence of elements?
Probably you could add some sort of a range access syntax. Say, something like `&a[1:3]`.
> Can I do arithmetic on these pointers?
I don't know whether pointer arithmetic should be allowed or not, but even if it shouldn't be, there is nothing to stop you from doing `&a[4]` as a replacement for `a + 4`.
> How would you write Quicksort? Heapsort?
The same way you would with a regular array. Think of it as a struct storing an array pointer and its length. If you prefer working with a pair of start/end pointers instead of a start pointer and an array size, then note that `end - start` is the array length, so getting an end pointer is trivial.
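A minimal sketch of the "slice buffer" idea mentioned above: a slice for the used part plus a capacity, wrapped around caller-provided storage. All names here (`char_slice`, `char_buf`, `buf_wrap`, `buf_append`) are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct char_slice {
    char *ptr;
    size_t len;
};

/* Hypothetical "slice buffer": a slice plus the capacity behind it. */
struct char_buf {
    struct char_slice used;  /* bytes currently in use */
    size_t cap;              /* total bytes available at used.ptr */
};

/* Wraps caller-provided storage (stack, static, or heap). */
static struct char_buf buf_wrap(char *storage, size_t cap)
{
    struct char_buf b = { { storage, 0 }, cap };
    return b;
}

/* Appends n bytes; refuses (returns false) rather than overflowing. */
static bool buf_append(struct char_buf *b, const char *src, size_t n)
{
    if (n > b->cap - b->used.len)
        return false;
    memcpy(b->used.ptr + b->used.len, src, n);
    b->used.len += n;
    return true;
}
```

Wrapping a plain `char foo[9]` this way handles the "abc" plus "1234" case from earlier in the thread without the heap and without a terminating NUL.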
- pjmlp 6 years ago
  
  > No, it's not, those ideas were implemented in practice in D and Rust, and there are no real issues with those.
  Even older than that, those features already existed in NEWP, Mesa and Modula-2, just to pick some examples back when C was being designed still.
Too 6 years ago

Can't see how this would affect string concatenation negatively compared to plain char*? The problem with a Java String is that it's pre-allocated to the exact length, and normally char strings are too. They don't magically make things faster just because they are missing a length field. Quite the contrary, actually, because now you need to iterate twice to concatenate two strings without a StringBuffer equivalent: once to figure out the length of the result so you can allocate the correct size, and once to do the actual copy.
Don't see why you shouldn't be able to make a fat pointer point into a range inside the original array either? Just point it to an element and make the length field shorter than the original? This is usually called array_view, span or slice in other languages.
_ph_ 6 years ago

It would be 2/3rds of a Go slice. You have a fat pointer with the capacity of the array it is pointing to. If you want to implement shorter strings, you would have to store the length independently, or use 0 termination. You still would have the length information as a safeguard against overflowing the array. You still could do everything you can do with current C strings, just safer. One could have fat pointers to array elements too, just with an accordingly shorter capacity.
In the end, I think the Go slices are the consequential implementation of safe fat pointers, having both the capacity and length and allowing efficient and still safe reslicing. The overhead of having 24 vs 8 bytes per pointer on a 64 bit machine should be worth it in modern times.
jacinabox 6 years ago

I tried implementing a scheme like this once. What you do for efficiency is allocate some extra header space with the array size, and access it with negative pointer offsets. You pass to fat pointers if asked to take slices of the array. This way the common use case has good locality. The way you get things onto the stack is with a macro, which preallocates and initializes the array with the (statically known) array length.
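A minimal sketch of that header-before-the-data layout, assuming a hypothetical `struct lstr_hdr` and `lstr_*` helpers invented for this illustration. The caller holds an ordinary `char *`, and the length lives just in front of it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: [ size_t len ][ len bytes ][ '\0' ]
   The caller holds a pointer to the bytes, so it still works as a C string. */
struct lstr_hdr {
    size_t len;
    char data[];   /* C99 flexible array member */
};

/* Allocates header and data in one block; returns the data pointer. */
static char *lstr_new(const char *init, size_t len)
{
    struct lstr_hdr *h = malloc(sizeof *h + len + 1);
    if (h == NULL)
        return NULL;
    h->len = len;
    memcpy(h->data, init, len);
    h->data[len] = '\0';
    return h->data;
}

/* Steps back from the data pointer to the header in front of it. */
static struct lstr_hdr *lstr_hdr_of(char *s)
{
    return (struct lstr_hdr *)(s - offsetof(struct lstr_hdr, data));
}

/* Length lookup without scanning for the terminator. */
static size_t lstr_len(char *s)
{
    return lstr_hdr_of(s)->len;
}

static void lstr_free(char *s)
{
    if (s != NULL)
        free(lstr_hdr_of(s));
}
```

The catch is that only pointers returned by `lstr_new` carry a header; a pointer into the middle of the string does not, which is where a fat pointer is still needed.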
fnord123 6 years ago

>If nul-termination of strings is gone, does that mean that the fat pointers need to be three words long, so they have a "capacity" as well as a "current length"?
You don't need a fat pointer. It can be part of the memory layout on the heap. How do you think `free` knows the length of the memory you are deallocating? Because the length is on the heap snuggled in right before the actual pointer malloc returned.

Animats 6 years ago

Yes, that's C's biggest mistake. (But remember, they had to cram the compiler into a 16-bit machine.) No, "fat pointers" are not a backwards-compatible solution. They've been tried. They were a feature of GCC at one time, used by almost nobody.

I once had a proposal on this. See [1]. Enough people looked it over to find errors; this is version 3. The consensus is that it would work technically but not politically.

The basic idea is that the programmer knows how big the array is; they just don't have a way to tell the compiler what expression defines the length of the array. Instead of

    int read(int fd, char buf[], size_t n);

you write

    int read(int n; int fd, char (&buf)[n], size_t n);

It generates the same calling sequence. Arrays are still passed as plain pointers. But the compiler now knows how big "buf" is, both on the caller and callee side, and can check.

I also proposed adding slice syntax to C, so, when you want to talk about part of an array, you do it as a slice, not via pointer arithmetic.

The key idea here is that you can call old code from new ("strict") code, and strict code from old code. When you get to all strict code, subscript errors should be all checkable.

[1] http://www.animats.com/papers/languages/safearraysforc43.pdf

WalterBright 6 years ago

I suspect that the reason your idea was not adopted was the syntax. It's not a phat pointer, it's two arguments with some rather complex syntax to connect the two.
The reason I'm fairly confident of that assessment is I've had similar experiences with D when the syntax for something was too complex. Early on, the syntax for lambdas was rather clunky. Everyone either hated it, or insisted that D didn't even have lambdas. Greatly simplifying the syntax was a revelation, suddenly D had lambdas and they became used everywhere.
Syntax matters a great deal.
- Animats 6 years ago
  
  Yes.
  int read(int n; int fd, char (&buf)[n], size_t n);
  is a bit bulky. The initial "int n;" is a little used GCC extension. Allowing
  int read(int fd, char (&buf)[n], size_t n);
  is an option. "n" is used before it is declared, which is strange for C. This is only a problem because of the UNIX idiom that buffer pointer comes before size in most system calls.
  char (&buf)[n]
  is also a bit bulky, but that, too, is forced by C/C++ tradition.
  char &buf[n]
  would be an array of refs, and
  char buf[n]
  would be an array passed by copy.
  There have been many, many attempts to "fix" C in a non-backwards compatible way. The result is always a new language. It's the backwards compatibility that's hard.
  
  loup-vaillant 6 years ago
  
  How about eschewing the passing by value altogether? If someone wants to do that, they can memcpy() the array inside the function for no extra syntax. So, the following would mean fat pointer:
  int read(int fd, char buf[n], size_t n);
  The following guarantees your array is not modified (but it's still passed by pointer).
  int write(int fd, const char buf[n], size_t n);
  The following emulates passing by value:
  void foo(const int buf[n], size_t n) {
      int tmp[n];
      memcpy(tmp, buf, n * sizeof(int)); // destination first, size in bytes
  }
  Alternatively:
  void foo(const int buf[n], size_t n) {
      int tmp[n];
      arrcpy(tmp, buf); // may or may not check bounds
  }
  Maybe this would render the proposition less useful, but it would already help. Here's for instance authenticated encryption from Monocypher, my crypto library:
  void crypto_lock_aead(uint8_t        mac[16],
                        uint8_t       *cipher_text,
                        const uint8_t  key[32],
                        const uint8_t  nonce[24],
                        const uint8_t *ad,         size_t ad_size,
                        const uint8_t *plain_text, size_t text_size);
  It is not crystal clear that `text_size` is referring to the size of both the `plain_text` and the `cipher_text`. With something like your proposition, I could write this instead:
  void crypto_lock_aead(uint8_t       mac[16],
                        uint8_t       cipher_text[text_size],
                        const uint8_t key[32],
                        const uint8_t nonce[24],
                        const uint8_t ad[ad_size],           size_t ad_size,
                        const uint8_t plain_text[text_size], size_t text_size);
  That way, the size of each buffer is crystal clear. Bonus: a sanitizer can check that I don't overflow my bounds (and I love sanitisers for stuff as sensitive as a crypto library).
raverbashing 6 years ago

> Yes, that's C's biggest mistake. (But remember, they had to cram the compiler into a 16-bit machine.
Pascal's compiler was smaller and it worked in 16bit machines no problem.
Maybe C's base library was bigger? I'm not sure
mFixman 6 years ago

> I also proposed adding slice syntax to C, so, when you want to talk about part of an array, you do it as a slice, not via pointer arithmetic.
I highly disagree with this. One of the advantages of conflating pointers with arrays is an obvious and very consistent way of indexing and slicing on the entire language that has minimal syntactic baggage.
- pjmlp 6 years ago
  
  Yes, because typing ptr =&array[0] vs ptr = array is so hard.
- SamReidHughes 6 years ago
  
  The thing is, you don't want to index pointers. You only want to index a particular kind of "indexable pointers." They should be separate constructs, even if you don't have fat pointers.
  Edit: That is more useful if you have function overloading, or templates, to avoid touchy ambiguities. It's still a slightly useful distinction to have in C, just for human readability.
pjmlp 6 years ago

Burroughs and IBM also had to cram safer languages in more restrained environments.
int_19h 6 years ago

It's not just argument passing, though. You also want to be able to return slices, store them inside structs etc.

MrBingley 6 years ago

I absolutely agree. Adding an array type to C that knows its own length would solve so many headaches, fix so many bugs, and prevent so many security vulnerabilities it's not even funny. Null terminated strings? Gone! Checked array indexing? Now possible! More efficient free that gets passed the array length? Now we could do it! The possibilities are incredible. Sadly, C is so obstinately stuck in its old ways that adding such a radical change will likely never happen. But one can dream ...

bluetomcat 6 years ago

> Adding an array type to C that knows its own length would solve so many headaches
C arrays know their length, it's always `sizeof(arr) / sizeof(*arr)`. It's just that arrays become pointers when passed between functions, and dynamically-sized regions (what is an array in most other languages) are always accessed via a pointer.
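A small sketch of where that length is and is not available. `ARRAY_LEN`, `size_seen_by_callee` and `length_in_caller` are hypothetical names for this illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Element count of a true array; only valid where the array type is visible. */
#define ARRAY_LEN(arr) (sizeof(arr) / sizeof((arr)[0]))

/* A parameter declared as `int a[10]` is adjusted to `int *a`, which is what
   is spelled out here; sizeof therefore reports the pointer size. */
static size_t size_seen_by_callee(int *a)
{
    return sizeof a;
}

/* In the scope that declares the array, the element count is recoverable. */
static size_t length_in_caller(void)
{
    int local[10] = {0};
    (void)size_seen_by_callee(local);   /* decays to int * at the call */
    return ARRAY_LEN(local);
}
```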
- jibal 6 years ago
  
  He said "array type". There isn't one in C.
  "It's just that arrays become pointers when passed between functions"
  Oh, is that all?
  Did you read the article, or the comment you're responding to? They point out the cost of "just" doing that.
m_mueller 6 years ago

I’ll add to this that C having committed to this mistake is one of the main reasons some people (scientific programmers) are still using Fortran. Arrays with dimensions, especially multidimensional ones, allow for a lot of syntactic sugar that is very useful, such as slicing.
- geoalchimista 6 years ago
  
  Modern Fortran (90 to 2008) has evolved a lot regarding array arithmetic and broadcasting, yet still maintains backward compatibility. I don't think that couldn't be done in C, but as many have pointed out, the problem seems to be why bother when there are already C++/D/Java/C#/Go/Rust ...
  However, I'd recommend people who deal heavily with multidimensional arrays but couldn't sacrifice the low-level C environment for a dynamic language to consider using the ISO_C_BINDING of Fortran 2003. It provides fully C compatible native types, and can be compiled together with C (you get gfortran from GCC anyway).
  
  macintux 6 years ago
  
  Without knowing Fortran, I’d speculate it’s easier to maintain backwards compatibility in a language that doesn’t have as direct a mapping to hardware as C. Fortran seems to have more abstractions built in.
  
  geoalchimista 6 years ago
  
  That's true. It predated C but even then abstracted the user away from the hardware (and still does). I wouldn't suggest any use of Fortran beyond number crunching and array arithmetic.
- Athas 6 years ago
  
  Hell, you don't even have to go to slicing for language-supported multidimensional arrays to make sense. Simply being able to index with a[i][j] is so much nicer than the manual flat addressing a[i*n+j] that you end up with in C. (a[i][j] does work in C, but only if the array dimensions are constants.)
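  For comparison, a sketch of the two addressing styles. The second function relies on C99 variably modified parameter types, which C11 made an optional feature, so treat its availability as an assumption about the compiler. `sum_flat` and `sum_2d` are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

/* Manual flat addressing: element (i, j) lives at a[i * cols + j]. */
static double sum_flat(const double *a, size_t rows, size_t cols)
{
    double s = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            s += a[i * cols + j];
    return s;
}

/* Same loop written with a[i][j]; the dimensions must be declared before
   the array parameter that uses them. */
static double sum_2d(size_t rows, size_t cols, double a[rows][cols])
{
    double s = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            s += a[i][j];
    return s;
}
```

  Neither version is bounds-checked; the second only moves the index arithmetic into the compiler.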
ars 6 years ago

> But one can dream ...
There's nothing stopping you from simply doing it. With a couple of macros the whole thing can just be a header file.
True, it doesn't take you all the way there (you'll still need to manually check array access to make sure they don't go over), but it's a start. And those manual checks can be a macro as well, to make it easy to add them where needed.
- slededit 6 years ago
  
  Malloc already includes a length and most arrays are heap based. I wish it could be exposed in a nice way. Of course it would have to support sub allocators or it wouldn't be C.
  
  ric129 6 years ago
  
  There's malloc_usable_size[1], assuming you mean asking the memory allocator what the array size is. But chances are that wouldn't work correctly, because the amount of memory a malloc call gives you and the amount you requested are often not the same. Modern memory allocators round up the request size to the nearest "size class".
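  A quick way to see the mismatch, assuming glibc (malloc_usable_size is declared in <malloc.h> there and is not part of ISO C). `usable_for` is a hypothetical helper name:

```c
#include <assert.h>
#include <malloc.h>   /* malloc_usable_size: glibc extension, not ISO C */
#include <stddef.h>
#include <stdlib.h>

/* Returns how many bytes the allocator reports as usable for a request
   of `requested` bytes, or 0 if the allocation fails. */
static size_t usable_for(size_t requested)
{
    void *p = malloc(requested);
    if (p == NULL)
        return 0;
    size_t usable = malloc_usable_size(p);
    free(p);
    return usable;
}
```

  The result is at least the requested size and frequently larger, so it cannot recover the logical length of an array.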
  [1]: http://man7.org/linux/man-pages/man3/malloc_usable_size.3.ht...
kahlonel 6 years ago

It's actually quite common for C programmers to create their own array type that knows its length, and use it in their projects. See this for example: https://github.com/antirez/sds
- WalterBright 6 years ago
  
  Everybody writes their own string package for C. I've written probably a couple dozen of them. They're all inadequate for one reason or another, hence my subsequent attempts.
  Probably the most damning problem is none of them are able to interoperate.
webkike 6 years ago

Adding anything to C is such a useless exercise because we've made so many advancements in PLT since its release that we might as well make a new language.
- WalterBright 6 years ago
  
  Well, I did that, too :-)

WalterBright 6 years ago

Just for fun, type in this program:

    int fred(int a[10]) {
        return a[11];
    }

It compiles without error with gcc and clang, even with -Wall. The code generated by clang is:

    mov EAX,02Ch[RDI]
    ret

i.e. buffer overflow, even though the array size is given. Compile the equivalent DasBetterC program:

    int fred(ref int[10] a) {
        return a[11];
    }

    fred.d(2): Error: array index 11 is out of bounds a[0 .. 10]

And the 32 bit code generated (when using 9 instead of 11 so it will compile):

    mov     EAX,024h[EAX]
    ret

bluetomcat 6 years ago

Quite surprised to see this not mentioned. C99 allows you to use the "static" keyword in array function parameters like this:

    void foo(int arr[static 10]);

It cannot check whether a passed pointer will point to enough space, but the compiler can warn you if you pass a fixed-size array of a smaller size.
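A minimal sketch of how that looks in use. `sum10` is a hypothetical function, and whether a too-small argument is diagnosed depends on the compiler and its warning settings:

```c
#include <assert.h>

/* `static 10` promises the callee at least 10 valid elements, so reading
   arr[0] through arr[9] is within the declared contract. */
static int sum10(const int arr[static 10])
{
    int s = 0;
    for (int i = 0; i < 10; i++)
        s += arr[i];
    return s;
}
```

Calling `sum10` with an `int small[5]` is the case a compiler may warn about; a pointer of unknown provenance still passes silently.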

WalterBright 6 years ago

Dynamic arrays are far, far more common than static ones.
- bluetomcat 6 years ago
  
  I beg to differ. In C especially, static arrays are quite common as struct members and as static objects at file scope, because dynamic allocations are a pain to manage and unnecessary when the maximum expected size is reasonably small.
  When the size of such arrays is computed at compile-time via macro definitions, that feature is quite handy.
  
  WalterBright 6 years ago
  
  > when the maximum expected size
  I pretty much never use static arrays because if I do I always without fail get a bug report when some user exceeds it.
  
  jibal 6 years ago
  
  "In C especially, static arrays are quite common as struct members and as static objects at file scope"
  Not among good programmers (unless the array is immutable).

WalterBright 6 years ago

Apparently someone posted this here because of my remark: https://www.reddit.com/r/programming/comments/90ov9i/a_respo...

Nice to see it get such a nice response!

chmike 6 years ago

From my experience Go's array (slice) is a far better solution. It does not only carry the size (number of elements), it also carries the array buffer capacity. To me it's the epitome of what arrays should be.

zaphirplane 6 years ago

Usually people have to refer to a slice cheat sheet to work with it, perhaps it’s not an intuitive concept/api
- burntsushi 6 years ago
  
  That's because of a lack of named methods to perform common operations. It has nothing to do with the fact that slices are fat pointers.
  Also, nobody I know constantly looks at a cheat sheet. The concepts motivating the various slice transformations get ingrained pretty quickly.

speedplane 6 years ago

Gimme a break, making stricter requirements on C arrays may theoretically make some things easier, but we’re talking 1% improvement. What makes C hard (and great) is requiring an understanding of not just memory, but memory allocation and deallocation schemes. For many beginners this is hard conceptually, but for everyone, keeping track of allocated and unallocated memory is extremely difficult.

WalterBright 6 years ago

My proposal is purely additive, you won't need to use it at all if you don't want to. But I suspect you'll gravitate towards it over time, most everyone does (experience with D dynamic arrays).
mankash666 6 years ago

Disagree. C was the first language I/we learnt, and it's still my favorite.
It's a bit like the first language you learn. For someone from the Latin family of languages, Mandarin's verbal & written structure might seem hard, but for native Chinese, it's second nature.
- speedplane 6 years ago
  
  Mandarin may be your first and favorite, but that doesn't make it easy to learn. Same with C.

ufmace 6 years ago

I haven't written much C, and I don't have a firm opinion on whether or not that particular issue is C's biggest mistake. I do think that just this one change sounds radical enough, as far as the effort it would take to convert existing C code that uses the high-risk pattern, that it seems better to just wholesale convert to a language that already mandates safety like Rust or Java. Particularly when you consider all of the other high-risk patterns in C that these other languages eliminate.

User23 6 years ago

This is a very good article that highlights the importance of semantics.

hota_mazi 6 years ago

Conflating pointers and arrays seems pretty minor and not the cause of many bugs.

The main source of bugs in C to me would be pointer arithmetic.

WalterBright 6 years ago

Pointer arithmetic is mainly used to access arrays, and is where the buffer overflows come from. Using actual arrays instead allows the compiler to insert overflow checking code.

nearmuse 6 years ago

What's the mistake? You pass a pointer and the number of elements, it's just the C way. At any point in time you have to pay attention. What is the proposal here? Make all arrays structures? Or add some weird un-C syntactic sugar?

SamReidHughes 6 years ago

It's a question of priorities. It depends whether your goal is to maximize productivity and minimize the defect rate, or if your goal is to tell people they need to pay attention.
jibal 6 years ago

"What is the proposal here?"
You're commenting without reading the article?

bluecalm 6 years ago

Why is this such a serious issue? I mean it is inconvenient to always pass length along with the pointer but it's not that inconvenient. It's a bit more typing but that's where problems end.

altrego99 6 years ago

Agree that this is a problem (if the programmer is not careful).

But serious question, why even bother with this one fix?

The only reason for the fix is to make it more difficult to make errors.

Fix arrays, then you would fix null pointer, then you might add objects, templating/generics to support a good collections library, rtti, and before you know it you are creating another one of c++, D, go, java. And we already have those.

C paved the way. Why not let it be the end of it?

WalterBright 6 years ago

Because buffer overflows are probably the number 1 security bug in C programs.
- ahmedalsudani 6 years ago
  
  I was wondering why you were championing this idea and agreeing with the posted link in almost every way. Then I went back to the link and figured it out :)
  P.S. thank you for everything you have done with D. I read in another HN thread about Better C, and it convinced me that D is the language I should be investing my time in learning and using.
  
  bachmeier 6 years ago
  
  > I read in another HN thread about Better C
  A good tool to check out, but which hasn't been promoted much because it's new, is dpp[1]. You can directly reference C header files in your D code. With that, betterC mode becomes a viable option for adding to an existing C project.
  [1] https://github.com/atilaneves/dpp
pjmlp 6 years ago

Paved the way for what? Mainstream security exploits?
There were OSes being written in better languages outside Bell Labs. Had it been allowed to sell UNIX instead of giving it away for a symbolic price to universities, the historical outcome would have been completely different.

toolslive 6 years ago

Isn't the fact that core types don't have a fixed representation a bigger mistake ? A char can be 16 bits, for example, and so on.

jibal 6 years ago

"Isn't the fact that core types don't have a fixed representation a bigger mistake ? "
No. There were problems when 64-bit CPUs came along, but they have been pretty much ironed out, and don't nearly compare to the pervasive bugs that Walter mentions in his article.
WalterBright 6 years ago

The unfixed type sizes are mostly just a nuisance, though I've wasted a lot of time dealing with them.
- p0nce 6 years ago
  
  It can also create untold naming conflicts.
flingo 6 years ago

When can a char be 16 bits? I presume it'd still have a sizeof() of 1 though.
- toolslive 6 years ago
  
  Texas Instruments C54x DSPs
  It can even be funkier, like 12 bits in a char
  https://stackoverflow.com/questions/2098149/what-platforms-h...
  It's a mess
- jibal 6 years ago
  
  "When can a char be 16 bits?"
  Whenever an implementation says so. There are now few machines where the addressable unit is not 8 bits though, which is why languages like D and Java can get away with not supporting them.
  > I presume it'd still have a sizeof() of 1 though.
  The language standard requires that.
- WalterBright 6 years ago
  
  I've seen 32 bit chars on some DSP C implementations.
hcs 6 years ago

Also char not specified in the language as signed or unsigned...

nurettin 6 years ago

Fat pointers manifested themselves in Pascal as strings and are still being used in modern Delphi.

apz28 6 years ago

I would love for programming to one day adhere to the same discipline as bridge or car safety. If simple malpractice meant going to jail, there would be no argument or discussion about this kind of stupid mistake, which can be verified by a tool. Cheers, Pham

xaduha 6 years ago

That's why I hope Red/System and just Red in general takes off https://static.red-lang.org/red-system-specs.html

robert_foss 6 years ago

Fair enough.

Arrays losing dimensionality when passed through functions is a pain every now and then.

grrrrrrrrrrrrr 6 years ago

"C retains the basic philosophy that programmers know what they are doing; it only requires that they state their intentions explicitly."

The real 'mistake' is programmers not stating their intention explicitly.

flingo 6 years ago

Is it better to pass the length of the array, or a pointer to the last valid address in the array? (or one past that) There's probably an advantage in the two types being the same.

Thought of this as I was reading the article.

jibal 6 years ago

The length is better, since that's almost always what you want.
> There's probably an advantage in the two types being the same.
Not really.

rurban 6 years ago

The mentioned Safe C Library is now at https://github.com/rurban/safeclib

pjmlp 6 years ago

The problem with the secure C11 Annex K functions is that they are only secure in name.
They are still as insecure as any traditional C string and memory function.
Yes, they sorted out the issue of a string always getting its null terminator.
However, given that buffer and size are still two different parameters, the issue of mixing up the values is still present.
- rurban 6 years ago
  
  Nope. The buffer size is checked at compile-time. Much like glibc fortify, just better. There's no chance to mix them up. Even the spec'd unsafeties of the truncating n versions are fixed.
  
  pjmlp 6 years ago
  
  Can you please explain how strcpy_s() validates that dest actually points to a memory region with enough space for destsz bytes?
  https://en.cppreference.com/w/c/string/byte/strcpy

analognoise 6 years ago

Fat Pointers - Pascal has had them since I think the beginning?

So...30+ years later, we decide Pascal was right. Just saying. Shoutout to FreePascal/Lazarus!

WalterBright 6 years ago

In Pascal, every array with a different dimension was a different type.
- clouddrover 6 years ago
  
  That was true in the past, but Pascal has had dynamic arrays for over twenty years. The current versions of Free Pascal and Delphi are nice to use.
  
  WalterBright 6 years ago
  
  I kinda gave up on Pascal 35 years ago when I picked up a copy of K+R :-)
  The only thing I really liked from Pascal were the nested functions, which I put in D.
- pjmlp 6 years ago
  
  That was already sorted out in early 80's Pascal dialects, even ISO Extended Pascal included support for it.
TimJYoung 6 years ago

I keep reading articles like this on HN, and coming to the same conclusions as yourself over and over: why isn't Object Pascal more popular ? It solves many issues that have presented themselves over the last decade or so in various, more-popular languages, without any major downsides. I'm talking about the language here - the RTL/system libraries and ecosystem are something that I think solves itself once more people start using the language.
Is it really just about the issues with begin..end and verbosity ???
- analognoise 6 years ago
  
  I don't know, I wish I did. I think nobody who ran into it 20 years ago has looked at it since. It's really a damn shame.
  
  pjmlp 6 years ago
  
  Borland is to blame, they scared people away from Delphi.
- pjmlp 6 years ago
  
  Borland kind of killed it, literally.
  Object Pascal was originally developed by Apple and adopted by Borland into Turbo Pascal 5.5, which then started to adopt ideas from C++.
  In the PC world, Turbo Pascal was the king of Pascal dialects, for a Pascal compiler it was more relevant to be Turbo Pascal compatible than ISO Extended Pascal (the standard revision that fixed the issues with ISO Pascal).
  Borland switched focus to the enterprise, leaving the hobby developers behind, increasing the prices of their compilers to enterprise tools range, and then went through an identity crisis with Inprise and Codegear.
  The Kylix attempt to bring Delphi and C++ Builder into Linux was never that serious.
  They lost key people like Anders to Microsoft; there is a nice account of why he left in this interview.
  https://behindthetech.libsynpro.com/001-anders-hejlsberg-a-c...
  So most of us moved away; in the mid-90s C++ was a welcoming home for Object Pascal refugees.
  It had the mix of OOP and procedural programming, thanks to classes and overloading it was possible to write type-safe abstractions, RAII was better than what Object Pascal had, and even if the standard was a few years away, every compiler had a nice framework that would relieve us from the pain of dealing with plain old C arrays and strings.
  And for those moments that we were forced to deal with C APIs, being almost copy-paste compatible with C helped. Which incidentally is one of the pain points in modern C++.
  This in the PC world.
  On the Mac, Apple decided to cater to the UNIX crowd and started to move away from Object Pascal.
  http://basalgangster.macgui.com/RetroMacComputing/The_Long_V...
  http://basalgangster.macgui.com/RetroMacComputing/The_Long_V...
  http://basalgangster.macgui.com/RetroMacComputing/The_Long_V...
  Outside PC and Mac worlds Object Pascal was hardly used.
  Max Weinreich said "A language is a dialect with an army and navy".
  In the context of systems programming languages, "A systems programming language is a language with an OS".
  If it isn't tied to an OS SDK there will always be friction over why to use it at all.
  
  TimJYoung 6 years ago
  
  Re: the term "Object Pascal". When I say "Object Pascal", I'm referring to the Delphi version that was released in 1995-96, not the older versions. It was still going strong into the early 2000s, but starting to experience some issues due to the C# headwinds and the pressure that MS was putting on Borland (thanks for the link on Anders, I've bookmarked for listening this evening). We were (are) there for the whole thing as a 3rd party component company in the Delphi market.
  But, my original point was about the pros and cons of the language, itself. Object Pascal seems to solve some major pain points experienced with other languages (specifically strings and dynamic arrays, but also others), but doesn't get copied or adopted as much as one would think. Instead, newer languages keep copying the same bad ideas that kill performance and/or limit the versatility of the language.
  
  pjmlp 6 years ago
  
  I think Java is partially to blame for that.
  When it came out I was disappointed that they adopted an interpreter, followed by JIT with 1.2, leaving to commercial third parties the AOT compiler toolchain.
  I was then double disappointed with .NET, due to the NGEN/JIT mix, because NGEN was no match for a proper AOT compilation, just for faster startups.
  And it took them Singularity, Midori, to finally arrive at CoreRT and .NET Native, and still it only applies to certain deployment scenarios.
  Back then it wasn't only Delphi, there was Oberon, Component Pascal, Eiffel.
  But they were all commercial and then around the same time FOSS started to pick up steam, Kylix was very badly managed, and due to its UNIX roots everyone was mostly writing GNU tools in C, which wasn't actually that much used in the PC world where we were already quite happily using OWL, VCL and MFC.
  At least Pascal style syntax is fashionable again.
  
  TimJYoung 6 years ago
  
  Yeah, I keep remarking that I think that a proper AOT C# or Java is a game-changer, but I'm not sure if these languages will ever be able to shed the baggage of the very large frameworks that developed around them. But, I will also be very glad to be wrong.
  As for Kylix, I simply think that there was no way that it was going to work on Linux. IOW, trying to do a GUI-based development tool on Linux was a bad idea from the start. They couldn't even nail down a few distributions very well - it was a constant moving target...

IshKebab 6 years ago

I feel like someone should just write a new language that is C with fixes, and no more.

pjmlp 6 years ago

It has been tried a couple of times; the problem is mostly human, not technical.

Annatar 6 years ago

I don’t understand what the hoopla is about: in assembler we deal with arrays by having to know the size of each field without giving it a second thought. The solution is to learn assembler first, then move on to C. And AWK, as the next-generation C, doesn’t have this problem, or any of the C problems.

WalterBright 6 years ago

I've written a lot of assembler code (including Empire in 100% assembler https://github.com/DigitalMars/Empire-for-PDP-11)
Assembler programs are very tedious to write, and so they tend to be rather small. You don't get any help from the non-existent compiler for even simple HLL features like static type checking.
- Annatar 6 years ago
  
  As a demo scene coder, I have to disagree vehemently: assembler is a joy to write, at least on MC68000 and 6502 (Amiga and C=64). The only reason why I don’t write code in assembler but in C on UNIX is because of portability. To me, C is nothing more than a portable assembler with the extra unnecessary code because of the inefficiencies of optimizing compilers.
pjmlp 6 years ago

There are quite a few reasons why using Assembly in 2018 is very niche.
- Annatar 6 years ago
  
  What exactly are the arguments for that statement?
  
  pjmlp 6 years ago
  
  In the age of IoT and distributed computing we need portability, Assembly is the opposite of that.
  For stuff like SIMD we have compiler intrinsics, loop vectorization and compute shaders.
  Even when writing straight Assembly, unless we are talking about a PIC class processor, the number of opcodes and their behaviors across a CPU family are so broad that no human manages to fit the instruction manuals, several thousand pages long, in their head.
  Modern CPUs aren't a Z80 on a Speccy, where we could fit the opcodes and memory map in our heads.
  So it is constrained to things like a few hundred KB PICs, people writing compiler backends, software decoding for video/audio codecs or kernel level drivers, a very specialized set of tasks, the very definition of niche.
  
  Annatar 6 years ago
  
  Sorry that you feel that way, but the presented arguments are totally bogus:
  - there are only two CPUs nowadays to code for, Intel and ARM, so if you don't need portability across operating systems, that is manageable; and with cpp(1) macros, one might even be able to write code which would assemble across operating systems on the same processor (for example illumos and GNU/Linux on Intel);
  - even if you use the simplest of instructions, assembler code written by a human will always beat the compiler - ALWAYS!; don't take my word for it - try it out for yourself.
  On slower / older systems, assembler is the only way to get the required speed. And it's fun, really lots and lots of fun to code in assembler. Not to mention that it's easy. Lots of people do it, just look at the demo / cracking scene.
  You're just hanging around in the wrong circles, with the wrong crowd if you think assembler is niche.
  
  pjmlp 6 years ago
  
  I don't think I am the one hanging around with the wrong crowd, especially when looking at job advertisements, even in the embedded space.
  You can keep your Assembler, my crowd rather uses C++ with intrinsics when we need performance.
  Being part of the Demoscene was cool, like when Amiga 500 actually mattered.
  
  Annatar 6 years ago
  
  For learning how the machine functions, Amiga is still very much relevant as a teaching tool, as I’ve come to experience recently.
  As someone who is part of the C++ experts group and maintains the GCC compilers at my organization, I will “keep my assembler” over C++ any day of the week.
  But in one thing I’m starting to think that you actually might be right: it is I who seem to be hanging around with the wrong crowd by being here on “Hacker News”, where actual hackers (in the MIT sense of the word) are in very short supply. The more I read what people write and how they think around here, the more n-gate.com is right, critique by critique, point for point. As it stands right now, this site is a gross misnomer.
jibal 6 years ago

> I don’t understand what the hoopla is about
Ignorance is not a virtue.
> The solution is to learn assembler first, then move on to C
So if you learn assembler first, then suddenly C has fat pointers, strings aren't NUL-terminated, and the massive code base written by millions of programmers doesn't contain any buffer overflows?
- Annatar 6 years ago
  
  > Ignorance is not a virtue.
  It's not ignorance, but knowledge: since I know assembler, pointers are no big deal in C. For one who does not understand how the machine functions, they are a big deal. I don't understand why understanding assembler is so hard for so many people since machine code is so simple.
  > So if you learn assembler first, then suddenly C has fat pointers, strings aren't NUL-terminated, and the massive code base written by millions of programmers doesn't contain any buffer overflows?
  No, suddenly your code doesn't have those problems any more because you actually understand what's going on and how it works. It's not magic. Except apparently on "Hacker News", where hackers seem to be in very short supply.
  
  jibal 6 years ago
  
  > It's not ignorance, but knowledge
  "I don't understand" is clearly a statement of ignorance and not knowledge. The ignorance you expressed was about why people make an issue of something. That ignorance could be dispelled if you actually read and attempted to understand their points, but that requires qualities like humility and intellectual honesty.
  As a top class programmer who wrote his first ASM program in 1967, was on the C Standards committee, and has programmed at every other level, I will simply smh at the naivety and point missing of your comments, and avoid engaging you further. Ta ta.
  
  Annatar 6 years ago
  
  You go ahead and do that then. I still don’t understand why it’s so hard for people to master assembler, because no one has explicitly addressed that. If that’s “intellectual dishonesty”, so be it. That’s nothing more than a cliche Anglosaxon phrase with no real meaning.
  I’m interested about getting to the bottom of fighting to master assembler because that’s the real issue here, everything else is overhead. Ta ta!

known 6 years ago

Difference between Array and Linked List is enough to start confusion on pointers

jibal 6 years ago

Say wut?

rebootthesystem 6 years ago

My guess is this won't be a popular post given the average age of HN participants.

There's nothing whatsoever wrong with C. The problem is programmers who grew up completely and utterly disconnected from the machine.

I am from that generation that actually did useful things with machine language. I said "machine language" not "assembler". Yes, I am one of those guys who actually programmed IMSAI era machines using toggle switches. Thankfully not for long.

There is no such thing as an "array". That's a human construct. All you have is some registers and a pile of memory with addresses to go store and retrieve things from it. That's it. That is the entire reality of computing.

And so, you can choose to be a knowledgeable software developer and be keenly aware of what the words you type on your screen actually do or you can live in ignorance of this and perennially think things are broken.

In C you are responsible for understanding that you are not typing magical words that solve all your problems. You are in charge. An array, as such, is just the address of the starting point of some bunch of numbers you are storing in a chunk of memory. Done. Period.

Past that, one can choose to understand and work with this or saddle a language with all kinds of additional code that removes the programmer from the responsibility of knowing what's going on at the expense of having to execute TONS of UNNECESSARY code every single time one wants to do anything at all. An array ceases to be a chunk-o-data and becomes that plus a bunch of other stuff in memory which, in turn, relies on a pile of code that wraps it into something that a programmer can use without much thought given.

This is how, for example, coding something like a Genetic Algorithm in Objective-C can be hundreds of times slower than re-coding it in C (or C++), where you actually have to mind what you are doing.

To me that's just laziness. Or lack of education. Or both. I have never, ever, had any issues with magical things happening in C because, well, I understand what it is and what it is not. Sure, yeah, I program and have programmed in dozens of languages far more advanced than C, from C++ to APL, LISP, Python, Objective-C and others. And I have found that C --or the language-- is never the problem, it's the programmer that's the problem.

I wonder how much energy the world wastes because of the overhead of "advanced" languages? There's a real cost to this in time, energy and resources.

This reminds me of something completely unrelated to programming. On a visit to windmills in The Netherlands we noted that there were no safety barriers to the spinning gears within the windmill. In the US you would likely have lexan shields protecting people and kids from sticking their hands into a gear. In other parts of the world people are expected to be intelligent and responsible enough to understand the danger, not do stupid things and teach their children the same. Only one of those is a formula for breeding people who will not do dumb things.

Stop trying to fix it. There's nothing wrong with it. Fix the software developer.

kazinator 6 years ago

> There is no such thing as an "array". That's a human construct.
Oh yeah; social construct, I would say, like gender.
> I am from that generation that actually did useful things with machine language.
Unfortunately, most of them are undefined behavior in C.
> You are in charge.
Less so than you may imagine. You're in charge as long as you follow the ISO C standard to the letter, and deviate from it only in ways granted by the compiler documentation (or else, careful object code inspection and testing).
- rebootthesystem 6 years ago
  
  This is a typical misinterpretation of the reality of programming. There is no such thing as undefined behavior. Once you get down to bits and bytes in memory and instructions the processor does EXACTLY what it is designed to do and told to do by the programmer.
  Despite what many might believe the universe didn't come to a halt when all we had was C and other "primitive" languages. The world ran and runs on massive amounts of code written in C. And any issues were due to programmers, not the language.
  In the end it all reduces down to data and code in memory. It doesn't matter what language it is created with. Languages that are closer to the metal require the programmer to be highly skilled and also carefully plan and understand the code down to the machine level.
  Higher level languages --say, APL, which I used professionally for about ten years-- disconnect you from all of that. They pad the heck out of data structures and use costly (time and space) code to access these data structures.
  Object oriented languages add yet another layer of code on top of it all.
  In the end a programmer can do absolutely everything done with advanced OO languages in assembler, or more conveniently, C. The cost is in the initial planning and the fact that a much more knowledgeable and skilled programmer is required in order to get close to the machine.
  As an example, someone who thinks of the machine as something that can evaluate list comprehensions in Python and use OO to access data elements has no clue whatsoever about what and how might be happening at the memory level with their creations. Hence code bloat and slow code.
  I am not, even for a second, proposing that the world must switch to pure C. There is justification for being lazy and using languages that operate at a much higher level of abstraction. Like I said above, I used APL for about ten years and it was fantastic.
  My point is that blaming C for a lack of understanding or awareness of what happens at low levels isn't very honest at all. The processor does exactly what you, the programmer, tell it to do. Save failures (whether by design or such things as radiation triggered) I don't know of any processor that creatively misinterprets or modifies instructions loaded from memory, instructions put there by a programmer through one method or another.
  Stop blaming languages and become better software developers.
  
  kazinator 6 years ago
  
  > This is a typical misinterpretation of the reality of programming. There is no such thing as undefined behavior. Once you get down to bits and bytes in memory and instructions the processor does EXACTLY what it is designed to do and told to do by the programmer.
  Sure.
  Only problem is, all you have to do is change some code generation option on the compiler command line and millions of lines of code now produce different instructions. Or, keep those options the same, but use a different version of that compiler: same thing.
  > The processor does exactly what you, the programmer, tell it to do.
  Well, yes; and when you're doing that through C, you're telling the processor what to do via a sort of autistic middleman.
  C is not the low level; you can understand your processor on a very detailed level and that expertise won't mean a thing if you don't understand the ways in which you can be screwed by the C language that have nothing to do with that processor.
  I suspect that you don't know some important things about C if you think it's just a straightforward way to instruct the processor at the low level.
  > Languages that are closer to the metal require the programmer to be highly skilled and also carefully plan and understand the code down to the machine level.
  C isn't one of these languages. (At least not any more!) It's considerably far from the metal, and requires a somewhat different set of skills than what the assembly language coder brings to the table, yet without entirely rendering useless what that coder does bring to the table.
  
  rebootthesystem 6 years ago
  
  > all you have to do is change some code generation option on the compiler command line and millions of lines of code now produce different instructions.
  It is the responsibility of a capable software engineer to KNOW these things and NOT break code in this manner.
  You are trying to blame compilers and languages for the failure of modern software engineers to truly understand what they are doing and the machine they are doing it on.
  If you truly understand the chosen language, the compiler, the machine and take the time to plan, guess what happens? You write excellent code that has few, if any bugs, and everyone walks away happy.
  And you sure as heck are not confused or challenged in any way by pointers. I mean, for Picard's sake, they are just memory addresses. I'll never understand why people get wrapped around an axle with the concept.
  I wonder, when people program in, say Python, do they take the time to know --and I mean really know-- how various data types are stored, represented and managed in memory? My guess is that 99.999% of Python programmers have no clue. And I might be short by a few zeros.
  We've reached a moment in software engineering where people call themselves "software engineers" and yet have no clue what the very technologies they are using might be doing under the hood. And then, when things go wrong, they blame the language, the compiler, the platform and the phase of the moon. They never stop to think that it is their professional duty to KNOW these things and KNOW how to use the tools correctly in the context of the hardware they might be addressing.
  I've also been working with programmable logic and FPGAs, well, ever since the stuff was invented. Hardware is far less forgiving than software --and costly. It forces one to be far more aware of, quite literally, what every single bit is doing and how it is being handled. One has to understand what the funny words one types translate into at the hardware level. You have to think hardware as you type what looks like software. You see flip-flops and shift registers in your statements.
  This is very much the way a skilled software developer used to function before people started to pull farther and farther away from the machine. It is undeniable that today's software is bloated and slow. Horribly so. And 100% of that is because we've gotten lazy. Not more productive, lazy.
  
  kazinator 6 years ago
  
  > It is the responsibility of a capable software engineer
  Nobody is saying that it's acceptable for an engineer to screw up and then blame it on the tools (compiler, slide rule, calculator, ...).
  However, if something goes wrong in your work, it's foolish not to recognize the role of the tools, even though it's not acceptable to blame them as a public position.
  As objective observers of a situation gone wrong in engineering, we do have the privilege of assigning blame between people and tools. Tools are the work of people also. The choice of tools is also susceptible to criticism. We have to be able to take an objective look at our own work.
  
  bendmorris 6 years ago
  
  I don't understand how anyone can spend a career in software development, and still have such a poor understanding of the process. Space and time are far from the only concerns.
  >As an example, someone who thinks of the machine as something that can evaluate list comprehensions in Python and use OO to access data elements has no clue whatsoever about what and how might be happening at the memory level with their creations. Hence code bloat and slow code.
  Not having to care about details that aren't contextually important is a good thing. When someone is constrained more by development time than by computational resources, working in a high level language means you're explicitly shunting low level concerns so you can spend more time dealing with domain logic.
  There are many situations where finishing something faster, which will run 10x slower and use more memory, is a worthwhile tradeoff.
  
  rebootthesystem 6 years ago
  
  Nowhere did I say that modern languages don't have their place and advantages. I use them all the time. In fact, I prefer them when they make sense for precisely the reasons you point out.
  You might be reading far more into my comments than what they were intended to address. Namely that blaming languages for the failings of software engineers is dishonest. A true software engineer will know the chosen tools and languages and use them appropriately. Blaming C for pointer issues is dishonest and misguided. There's nothing wrong with the language if used correctly.
  
  jibal 6 years ago
  
  BTW, even physical machines have undefined behavior, when values exceed the specs and there's no telling what might happen ... I remember the days when people would destroy their monitors by giving them scan frequencies they can't handle. And there are CPU operations that have undefined behavior due to race conditions ... you can get one of several outcomes.
  But there's no arguing with extreme ignorance coupled with extreme unwarranted arrogance.
  
  rebootthesystem 6 years ago
  
  > BTW, even physical machines have undefined behavior, when values exceed the specs and there's no telling what might happen
  And if you (plural) are an ENGINEER, it is your JOB to KNOW these things and prevent them from happening.
  I get the sense that the term "software engineer" has been extended so far that we grant it to absolute hacks who know nothing about what they are doing and what their responsibilities might be. Blaming a language, compiler and machine are perfect examples of this.
  True engineering isn't about HOPING things will work. It is about KNOWING things will work. And testing to ensure success.
  I've been involved in aerospace for quite some time. People can die. This isn't a game. And it requires real engineering not "oh, shit!" engineering that finds problems by pure chance. Sadly, though, we are not perfect and things do happen. It isn't for lack of trying though.
  
  kazinator 6 years ago
  
  > I've been involved in aerospace for quite some time.
  That's nice; not all engineering is aerospace and not all aerospace processes are always appropriate everywhere else.
  Even in aerospace, still I don't want to write code that depends on knowing exactly how the compiler works. I will write code mostly to the language spec. Then treat the compiler as a black box: obtain the object code, and verify that it implements the source code (whose own correctness is separately validated).
  Safety is not treated the same way regardless of project. For instance, an electronic device that has a maximum potential difference of 12V inside the chassis is not designed the same way, from a safety point of view, as one that deals with 1200V.
  
  jibal 6 years ago
  
  rebootthesystem seems to be a chatbot that specializes in shouting cliches and non sequiturs. His responses to me indicate a complete failure to understand what I wrote. smh
  
  rebootthesystem 6 years ago
  
  Nice try at a weak ad hominem.
  Your parent comment is utterly irrelevant. The conversation is about the C language and the perception some seem to have that it has problems. My only argument here is that a capable software engineer knows the language and tools he or she uses and has no such problems, particularly with a language as simple as C. Things like pointer "surprises" are 100% pilot error, not a deficiency of the language itself.
  
  jibal 6 years ago
  
  > There is no such thing as undefined behavior.
  Read the C Standard. (Do you even understand that it defines an abstract machine? Do you have any idea what an abstraction is?)
  
  rebootthesystem 6 years ago
  
  You just proved my point. A programmer who truly knows (a) the machine they are working with and (b) the language they are using will know exactly how to use both in order to deliver intended results.
  For example, reading the processor data book to understand it, the instruction set and how it works could be crucially important in certain contexts. I would not expect someone doing Javascript to do this but how many have studied the virtual machine in depth?
  Don't confuse being lazy with problems with languages and compilers.
  
  kazinator 6 years ago
  
  That programmer could be Mel!
  https://news.ycombinator.com/item?id=7869771
  
  rebootthesystem 6 years ago
  
  That's a great story, thanks!
  I once worked on a project that needed specialized timing in relation to high speed (well, 38.4k) RS422 communications. I don't remember all of the details, it's been decades. I remember one of the engineers came up with a super clever way to trigger the time measurement and actually measure it. Rather than using a UART he bit-banged the communications and actually used the serial stream for timing (meaning the ones and zeros). It worked amazingly well. If I remember correctly that was a Z80 processor with limited resources.
  
  jibal 6 years ago
  
  "You just proved my point."
  This is the least intelligent and least intellectually honest hackneyed phrase on the internet. In this case it's a complete non sequitur. It would tell me a lot about you if you hadn't already made it evident. Over and out, forever.
  
  rebootthesystem 6 years ago
  
  A shift from logic to ad hominem is always an indication that there's nothing further to discuss. Live long and prosper.
jibal 6 years ago

"There is no such thing as an "array". That's a human construct."
There's also no such thing as a computer, or memory, or operating systems ... they're all just a bunch of molecules.
I too am from the generation before people understood the power of abstraction ... but I'm intellectually honest and so I managed to learn.
> Fix the software developer.
Which one?
- rebootthesystem 6 years ago
  
  So you claim an array actually exists in a computer?
  OK. Prove it. And you have to do it without laying out a set of rules and conventions that might allow us to interpret a list of bytes as an array.
  An array is a fabrication by convention. At the simplest level it is a list of numbers in memory. Adding complexity you can store additional numbers that indicate type size and shape. Adding yet more complexity you can extend that to be lists of memory addresses to other lists of numbers, thereby supporting the concept of each array element storing more than just a byte or a word. And, yet another layer removed you can create a pile of subroutines that allow you to do a bunch of standard stuff with these data structures (sort, print, search, add, subtract, trim, reshape, etc.).
  Nowhere in this description does an array exist. There were experimental architectures ages ago that actually defined the concept of arrays in hardware and attempted to build array processors. These lost out to simpler machines where multidimensional arrays could be represented and utilized via convention and software.
  Arrays do not exist. If you land in the middle of a bunch of memory and read the data at that location without having access to the conventions used for that processor or language nothing whatsoever tells you that byte or word is part of an n-dimensional array. The best you can say is "The number at location 1234 is 23". No clue about what that might mean at all.

okket 6 years ago

(2009)

See also discussion from 9 years ago: https://news.ycombinator.com/item?id=1014533 (47 comments)

auslander 6 years ago

OpenBSD replaced strcat by strlcat, strcpy by strlcpy 20 years ago, in OpenBSD 2.4.

They are implemented in the C libraries for OpenBSD, FreeBSD, NetBSD, Solaris, OS X, and QNX.

They have not been included in the GNU C library used by Linux.

WalterBright 6 years ago

Those functions have a separate length parameter. There is no way to mechanically check that the length argument accurately reflects the length of the string. It's not an effective solution.
- pjmlp 6 years ago
  
  Which is the reason why I consider the C11 security annex, anything but safe.
- auslander 6 years ago
  
  Not an expert, but shouldn't they have a length parameter? It makes sense.
  
  acehreli 6 years ago
  
  In current C, yes, they should. The whole point is, the length parameter should not be separate from the array. It's even worse than that: the parameter is not an "array", it's a pointer to a single element. This whole thing relies on a convention and human attention; can't work in practice.
  
  grrrrrrrrrrrrr 6 years ago
  
  And yet, it clearly does (work).
  When it doesn't (work), it is NOT because of a failure in the language; it is because C has (and always will have) the "basic philosophy that programmers know what they are doing;".
  Criticising C, is like criticising assembly. What's the point?
  If people want to criticise a programming language, then they should always start with C++, not C.
  C++ was designed to allow us to develop bigger and more complex programs, and yet, C++ inherited from C?
  How stupid was that! But people are happy to give out various awards and medals to the person who made one of the dumbest decisions ever made, in the whole history of computing!
  Leave C alone. It's fine. It's C++ that is the problem.
  
  auslander 6 years ago
  
  Were OpenBSD people wrong, making strlcat and strlcpy ? Honest question.
  
  grrrrrrrrrrrrr 6 years ago
  
  That is a library issue.
  Trying to make C 'foolproof' however, is an exercise in futility, and in any case, can only come about by morphing it into a fundamentally different language.
  An argument in this thread, is that you shouldn't be able to pass an array without the argument being passed having some implicit 'size' element associated with it. That is NOT C.
  Conflating pointers with arrays, that is C.
  Again I feel the need to quote this:
  "C retains the basic philosophy that programmers know what they are doing; it only requires that they state their intentions explicitly."
  If you don't know what you're doing, don't use C.
  C should be considered a 'specialist' language - much like doing brain surgery - if you're doing it, you better know what you're doing, else go be a GP or something.
  And, if your project doesn't absolutely require that you use C, don't use it. Instead, use something that is more 'foolproof'. (and I don't mean C++!!!)
  D should focus less on being a better C, and more on being a replacement for C++. Then, I might take D more seriously.
  No attempt to morph C (i.e. the language, not the library) into something else will ever succeed.
  Leave C alone!
medecau 6 years ago

A few years ago I remember seeing strl* mentioned in a slide presentation for something that must have come from the oBSD camp.
Since then I thought, not knowing any amount of C, that strl* was part of the language and available to all.
Imagine if you will my confusion every time I read people complaining about C being insecure.
Your comment corrected my perception of this. Thank you.
- auslander 6 years ago
  
  I admire the no-nonsense approach by OpenBSD: "The process we follow to increase security is simply a comprehensive file-by-file analysis of every critical software component. We are not so much looking for security holes, as we are looking for basic software bugs, and if years later someone discovers the problem used to be a security issue, and we fixed it because it was just a bug, well, all the better." https://www.openbsd.org/security.html
  I'm in DevOps field, and this 'make it right from the start' resonates strongly with me. Richard Feynman: Disregard others :)
auslander 6 years ago

Wiki on subj:
https://en.wikipedia.org/wiki/C_string_handling#Replacements

the_duke 6 years ago

Should have a 2009 in the title.

kyberias 6 years ago

Why does the author of D want to "fix" C by changing it into D? Concentrate on that D language.

alphaglosined 6 years ago

You see, before D ever existed, Walter wrote a C/C++ compiler professionally. That is why he would like to see C improved. He has just as much interest in seeing it improved as you or I do. In fact, I would say he has more reason as he has dedicated a good part of his life towards it...
pjmlp 6 years ago

Because it is clear to us, advocates of safer systems programming languages, that no one is going to rewrite POSIX based platforms in something else.
So we would like to improve our foundations, to move UNIX derived OSes to some kind of safer C, instead of having it be the backdoor of the whole security infrastructure.
WalterBright 6 years ago

I like to fix things. Why not share a simple and effective fix?
- kyberias 6 years ago
  
  Isn't there a standards organization / body for the C language? Have you proposed this there? What was the outcome?
jibal 6 years ago

> Why does the author of D want to "fix" C
What's the downside?
> by changing it into D?
That comment is tendentious, hyperbolic, and generally credibility-damaging.
> Concentrate on that D language.
You think he doesn't?

earenndil 6 years ago

> Notable among these are C++, the D programming language, and most recently, Go

I would remove go from that list, and add rust and zig.

dosshell 6 years ago

This was written in 2009; didn't Rust first appear in 2010?
- lightgreen 6 years ago
  
  Why was it posted now without a year in square brackets? That was misleading.
  
  jibal 6 years ago
  
  The year is in parentheses in the title, and of course it's in the article. I for one have learned to always look at the date something was written. It irks me that there are so many web pages with time-relevant content that contain no date.

Drdrdrq 6 years ago

Meh. What do you do with the dynamically allocated arrays then? Do you pass their dimensions alongside pointer? If that bothers you so much, you can create a struct that holds the pointer and metadata, and do the checks yourself. Calling this "C's biggest mistake" is a bit sensationalistic.

EDIT: besides, you should start new projects in Rust anyway, because it takes security to a whole other level. C did a great job, but it's a bit old. :)

PeCaN 6 years ago

>besides, you should start new projects in Rust anyway
Thanks, I almost forgot what website I was on for a second.
- Drdrdrq 6 years ago
  
  No problem - happened to me 6 hours ago too. </s>
int_19h 6 years ago

> Do you pass their dimensions alongside pointer
That is literally the meaning of "fat pointer", and the linked article even explains it.
adamnemecek 6 years ago

The pattern is very present in all C code and rolling your own just causes pain when interacting with 3rd party code.
jibal 6 years ago

The article explains why it's C's biggest mistake, and neither you nor anyone else has refuted it ... in fact, you don't even touch on it.