This analysis is somewhat dated and leaves out one important fact: nowadays, floating point arithmetic is carried out using a set of special scalar SSE instructions (and not the ancient x87 co-processor, as was done in the author's benchmark).

SSE instructions remove performance pitfalls related to infinities and NaNs. The only remaining case where slowdowns are to be expected is denormals (which can be set to flush-to-zero if desired).

In other words: it's perfectly fine to work with infinities and NaNs in your code.

You may also want to read about posits:

https://www.johndcook.com/blog/2018/04/11/anatomy-of-a-posit...

I'm a fan of these not because of the claims regarding precision, but because they drop all the complexity and baggage of IEEE floating point.

> There’s only one zero for posit numbers, unlike IEEE floats that have two kinds of zero, one positive and one negative.

> There’s also only one infinite posit number.

Those two things are a big deal-breaker for me. Yes, having positive and negative 0 can be useful--there are times when you want to think of 0 not as "this is exactly 0" but as "this value underflowed our range", and it matters whether or not you are an underflowing negative number or an underflowing positive number. Of course, using IEEE-754 to check for exactly one of positive and negative 0 is painful.
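To make the "underflowed from which side" point concrete, here is a minimal Python sketch (Python floats are IEEE-754 doubles). The helper name is_negative_zero is made up for illustration; it also shows why the check is awkward, since == treats both zeros as equal.

```python
import math

# Underflow keeps the sign of the true result in IEEE-754 doubles.
tiny_pos = 1e-200 * 1e-200     # too small to represent, becomes +0.0
tiny_neg = -1e-200 * 1e-200    # too small to represent, becomes -0.0

# The two zeros compare equal, so == cannot tell them apart.
zeros_equal = (tiny_pos == tiny_neg)


def is_negative_zero(x: float) -> bool:
    """Hypothetical helper: True only when x is zero with the sign bit set."""
    return x == 0.0 and math.copysign(1.0, x) < 0.0


print(zeros_equal)                  # the zeros are indistinguishable by ==
print(is_negative_zero(tiny_pos))   # sign bit clear
print(is_negative_zero(tiny_neg))   # sign bit set
```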

Similarly, having NaN as a distinct type can be useful. You get to distinguish between "this computation shrank too small to be represented", "this computation grew too large to be represented", and "this computation makes no mathematical sense". Posits don't give you that. Furthermore, as many language runtimes have discovered, the sheer number of NaN values means you can represent every pointer and integer as a tagged NaN.

The only thing in IEEE-754 I would truly toss in a heartbeat is that x != x holds true for NaN values.
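For anyone who hasn't played with this, a small Python sketch of the three outcomes and of the x != x behaviour. The classify function and its labels are my own illustration, not anything from the standard.

```python
import math


def classify(x: float) -> str:
    """Hypothetical labels for the three 'special' outcomes discussed above."""
    if math.isnan(x):
        return "no mathematical sense (NaN)"
    if math.isinf(x):
        return "grew too large (infinity)"
    if x == 0.0:
        return "zero, possibly shrank too small"
    return "ordinary finite value"


too_big = 1e308 * 10.0         # overflows to +inf
too_small = 1e-308 * 1e-308    # underflows to 0.0
undefined = too_big - too_big  # inf - inf has no meaningful value: NaN

for v in (too_big, too_small, undefined, 1.5):
    print(classify(v))

# The part that trips people up: NaN is not equal to itself,
# so the reliable test is math.isnan, not ==.
nan_equals_itself = (undefined == undefined)
nan_differs_from_itself = (undefined != undefined)
```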

Let me share my view on those. Posits deal with really large and really small values differently. If I may assume astronomically large values are essentially overflow and small are underflow, posits use an actual value for these. Because of the increased dynamic range (and diminishing resolution) large numbers don't go to infinity but to the largest representable number. Underflows go to the smallest non-zero number. As such they both retain their sign AND can be used in subsequent computations. NaN is only the result of division by zero or an operation on a NaN. In my experience - and yours may vary of course - this is exactly how I want such situations to be handled. I don't want NaN as a result, I want the most reasonable thing to put in the rest of the calculations. Part of this comes from wanting high performance and part is from the realization that you can't check flags after every computation. Exceptions IMHO should not be raised if your code is correct.

The guy who came up with posits also has his UNUM concept which I think is interesting but not for me. The idea is to use two numbers (posits) to track an upper and lower bound for a computation. At the end you can then see what confidence you have in the result. To me that's just wasted storage and computation, but if you want to validate some complex code it actually seems better to me than having some NaN come out the end.

I think of UNUMs and posits as two different ideas. Posits work the way I want them to, while UNUMs will provide the features you want (correctness checking) whether they use posits or IEEE floating point as the underlying representation.

If you consider the two goals of high performance and correctness of a computation, I think UNUMs and posits handle it far more elegantly and simply. IEEE doesn't actually give you both at the same time, but it pretends to.

I recently implemented Dijkstra's shortest path using nearest neighbor nodes on a 3D image problem. While IEEE-754 is usually just a pain for me, in this case it was pretty cool. Instead of allocating a separate boolean array of "visited" nodes, I just used negative numbers to denote visited. Since negative zero is a thing, I didn't have to add any exceptions to handle zero distance.

That's a nice hack, but you're violating what many people consider a tenet of writing good code. I don't consider that an argument in favor of negative zero and all the baggage of IEEE 754.

Fwiw, in my code, I got to write "-0" explicitly and the method for checking is std::signbit(x)
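A rough Python equivalent of the trick, in case it helps. This is a sketch, not the code described above: the helper names are invented, and math.copysign stands in for std::signbit since Python has no direct signbit function.

```python
import math


def mark_visited(distance: float) -> float:
    """Store 'visited' in the sign bit; works for 0.0 because -0.0 exists."""
    return math.copysign(distance, -1.0)


def is_visited(stored: float) -> bool:
    """Read the sign bit (a plain `stored < 0` would miss -0.0)."""
    return math.copysign(1.0, stored) < 0.0


def distance_of(stored: float) -> float:
    """Recover the real distance regardless of the visited flag."""
    return abs(stored)


dist = [0.0, 2.5, 7.0]           # hypothetical per-node distances
dist[0] = mark_visited(dist[0])  # the source node has distance zero

print(dist[0], is_visited(dist[0]), distance_of(dist[0]))
print(dist[1], is_visited(dist[1]), distance_of(dist[1]))
```

Note that comparing the stored value with `< 0` is exactly the bug the sign-bit check avoids, because -0.0 < 0 is false.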

Grokking the way that floats work really is a lot of fun.

Years ago I put together a math library (https://github.com/KimBurgess/netlinx-common-libraries/blob/...) for a domain specific language that had some "limited" capabilities. All functionality had to be achieved through a combination of some internal serialisation functions and bit twiddling.

It was simultaneously one of the most painful and interesting projects I've done.

On a personal note, this representation annoys me:

value = (-1)^sign * 2^(exponent-127) * 1.fraction

It should be:

value = (-1)^sign * 2^(exponent-127) * (1 + fraction*2^-23)

It sounds trivial, but you can't reason mathematically about the first equation.
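The second form can be checked directly. Here is a short Python sketch that pulls the three fields out of a single-precision float with the struct module and rebuilds the value. The function names are mine, and rebuild only covers normal numbers (exponent field strictly between 0 and 255).

```python
import struct


def decode_float32(x: float):
    """Split the IEEE-754 single-precision encoding of x into its fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF       # 23 bits, an integer
    return sign, exponent, fraction


def rebuild(sign: int, exponent: int, fraction: int) -> float:
    """Apply the formula above; valid for normal numbers only."""
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + fraction * 2.0 ** -23)


for value in (0.15625, -6.5):
    fields = decode_float32(value)
    print(value, fields, rebuild(*fields))
```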

I assume what you are saying is to assume a 1 in the MSB of the mantissa. That has been done. HP minicomputers, I believe, and maybe some others of that era.

Unfortunately, that means that you have no way to represent numbers in the denormal binade, which leads to severe problems with monotonicity. As you move to binades with smaller exponents the distance between representable numbers halves in all the normalizable binades. Unless you allow for denormals, you have a GIANT jump from the smallest normalizable number to zero.

This leads to problems in numerical algorithms. Taking differences to find slopes gets unstable as you approach convergence, causing convergence to fail.
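A small Python illustration of what denormals buy you, assuming the default double-precision behaviour (no flush-to-zero). The variable names are only for this sketch.

```python
import math
import sys

smallest_normal = sys.float_info.min         # 2**-1022
smallest_subnormal = math.ldexp(1.0, -1074)  # 2**-1074, the last step before 0

# Two distinct values just above the normal range.
x = 1.5 * smallest_normal
y = smallest_normal

# Their difference is below the normal range. With denormals it is still
# representable, so x != y implies x - y != 0.
diff = x - y

print(diff != 0.0)             # the difference survives as a denormal
print(diff < smallest_normal)  # it lives in the denormal range
print(smallest_normal / smallest_subnormal)  # size of the gap denormals fill
```

Without denormals, diff would have to become either zero or the smallest normal number, and a slope computed from it would be either zero or badly wrong.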

Yeah, people who aren't deep into numerical algorithms often don't appreciate that there is an important reason to have denormals. The people who came up with the scheme weren't dumb. Even though it has some limitations/oddities due to the time it was designed that would probably be done differently today, I think it's a pretty big design success that it has worked as well as it has for as long as it has.

The complaint is about the text in the article, which reads "The hardware interprets a float as having the value:

"1.fraction" is nonsense that doesn't really mean anything, whereas (1 + fraction * 2^-23) does mean something.

It's just a different notation, same as 1234 is shorthand for

I don't understand your concern, what's the point of the "1" in your representation for the mantissa?

In my opinion, the confusion that arises when programmers get results from floating-point computations that are not what they expect stems from this:

> Floats represent continuous values.

But as you probably know, this isn't possible. The concept of infinite precision is interesting in theory, but disappears when any actual calculation needs to be made, whether on a digital electric computer or not.

I wonder if this is not a flaw in the crude mechanical representation of numbers, but a flaw in the decision to base floating-point computation on the concept of continuous numbers. I believe that a better model for floating-point computational representation and manipulation would be to reflect the rules of scientific measurements - that each number includes an explicit amount of precision that is preserved during mathematical operations.

This would not only keep JavaScript newbies from freaking out when they add 0.1 and 0.2, but prevent problems of thinking calculation results are correct, when they are not.
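For reference, the 0.1 + 0.2 surprise looks like this in Python, which uses the same double-precision floats as JavaScript. The Decimal conversion just prints the exact binary value that the literal 0.1 is rounded to; nothing here is specific to any proposal in this thread.

```python
import math
from decimal import Decimal

total = 0.1 + 0.2

print(total == 0.3)   # the comparison that surprises people
print(repr(total))    # shortest string that round-trips to this double
print(Decimal(0.1))   # exact value actually stored for the literal 0.1

# The usual workaround is to compare with a tolerance instead of ==.
close_enough = math.isclose(total, 0.3, rel_tol=1e-9)
print(close_enough)
```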

If you aren't getting what I am saying, let me give an example. Let's say, for some reason, you want to find the diameter of a ball. You have a measuring tape, so you wrap it around the widest part and record that the circumference is 23.5cm. To calculate the diameter, you divide by π. If you do this in double-precision floating point, you will get 7.480282325319081, but this is nonsense. You can't create a result that is magically more precise than your initial measurement through division or multiplication. The correct answer is 7.48cm. This preserves the amount of precision in the least precise operand, and is arguably the most correct result.
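A minimal Python sketch of that rule, assuming the tape reading has three significant figures. round_sig is a hypothetical helper, not a standard library function, and rounding the final result is only the crudest form of the idea.

```python
import math


def round_sig(x: float, sig: int) -> float:
    """Round x to `sig` significant figures (hypothetical helper)."""
    if x == 0.0:
        return 0.0
    magnitude = math.floor(math.log10(abs(x)))  # position of leading digit
    return round(x, sig - 1 - magnitude)


tape_cm = 23.5                  # measured value, three significant figures
raw_cm = tape_cm / math.pi      # full double-precision quotient
reported_cm = round_sig(raw_cm, 3)

print(raw_cm)       # many digits, most of them not justified by the tape
print(reported_cm)  # keeps only what the measurement supports
```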

I've seen this idea of storing the precision mentioned many times on HN, but I must say I don't believe in it outside of some few niches.

First reason being that it's much more complex and it's unclear what the complexity buys us.

Second reason is that it doesn't model how variables co-vary. As a toy example imagine that I have a number x: 5+-1.

Then I let y = x - 1: 4+-1

Finally I let z = 1 / (x - y).

Now, by construction z will be very close to 1. But a system naively tracking uncertainties will be very concerned about x - y. If it does a worst case analysis it gets 1+-2. If it does an average case analysis assuming independent gaussian errors it gets 1+-sqrt(2). When we perform the division our uncertainty goes infinite.
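Here is the toy example as a Python sketch. The Interval class is an assumption of mine, a deliberately naive worst-case tracker that treats every operand as independent, which is exactly the behaviour being criticised.

```python
import math
from dataclasses import dataclass


@dataclass
class Interval:
    """Naive 'value +- err' number that knows nothing about correlations."""
    value: float
    err: float  # worst-case half-width

    def __sub__(self, other):
        if isinstance(other, Interval):
            # Worst case: the two errors are assumed to be unrelated.
            return Interval(self.value - other.value, self.err + other.err)
        return Interval(self.value - other, self.err)  # exact constant


x = Interval(5.0, 1.0)
y = x - 1.0   # 4 +- 1
d = x - y     # naive tracker reports 1 +- 2

# Independent-gaussian version of the same mistake: 1 +- sqrt(2).
gaussian_err = math.hypot(x.err, y.err)

# The reported range for x - y includes 0, so 1 / (x - y) is unbounded.
contains_zero = d.value - d.err <= 0.0 <= d.value + d.err

# In reality y moves with x, so x - y is 1 for any x in the range.
exact = [xv - (xv - 1.0) for xv in (4.0, 5.0, 6.0)]

print(d, gaussian_err, contains_zero, exact)
```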

I don't see any reason to claim that explicit-precision floating-point is more complex, just different. Yes, change is uncomfortable and takes effort, but it does not necessarily mean the new way is inherently more complicated. I worry about objections based on "that's not what we are taught in school." I think that what we teach can (and should) be improved if need be, and not used as a motivation to deny criticism of established dogma.

I am not sure if I fully understand your example, but I don't see any problem with it. Using basic significant figure rules, this is (with an additional step for clarity):

x: 5e0
y = x - 1: 4e0
z1 = x - y: 1e0
z = 1/1e0
z2 = 1e0

The answer seems to simply be 1+-1. "Significant Figures" are a simplification of precision where the precision is an integer that represents the total digits of the least precise measurement. A more accurate way is to represent precision as standard deviation, then calculate the precision of the result with basic statistical techniques.

I recently saw a presentation that modeled floating point error using Monte Carlo simulation. Any time you introduce uncertainty (such as the sub-ULP rounding error in every operation), you can insert random variation and find which bits end up being stable to find out how good your answer is. While the point was modelling floating point error, it's not hard to extend the idea to modelling the input measurement error in the first place.
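I don't have the presentation's code, so the following is only my own rough sketch of the Monte Carlo idea in Python (3.9+ for math.nextafter): after every addition the result is randomly nudged by at most one ULP, and the spread across repeated runs hints at how many digits are stable. The jitter scheme and the names are assumptions, not the presenter's method.

```python
import math
import random


def jitter(x: float, rng: random.Random) -> float:
    """Randomly move x down one ULP, leave it alone, or move it up one ULP."""
    step = rng.choice((-1, 0, 1))
    if step == 0:
        return x
    return math.nextafter(x, math.inf if step > 0 else -math.inf)


def noisy_sum(values, rng: random.Random) -> float:
    """Left-to-right sum with a random one-ULP perturbation after each add."""
    total = 0.0
    for v in values:
        total = jitter(total + v, rng)
    return total


rng = random.Random(0)   # fixed seed so runs are repeatable
values = [0.1] * 1000    # true sum would be 100
results = [noisy_sum(values, rng) for _ in range(50)]
spread = max(results) - min(results)

print(f"plain sum : {sum(values)!r}")
print(f"min / max : {min(results)!r} / {max(results)!r}")
print(f"spread    : {spread:.3e}")
```

Digits that agree across all runs can be trusted; digits that wobble are noise. Feeding in randomly perturbed inputs instead of perturbing each operation would model measurement error the same way.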

I like the idea of storing the precision by default. Some algorithms already do this in order to keep track of what they're losing in lossy arithmetic operations, such as adding small numbers to big ones.

It would be nontrivial to implement this efficiently - the simplest implementation would require a second float for each original float that has an uncertainty, doubling the time and memory requirements. A tensorflow-like system which can see the whole flow of numbers might be able to provide precision estimates efficiently only where needed.

I don't know if an entire floating-point number is necessary to keep track of precision. If we are willing to accept precision to the closest decimal digit, and my calculation of the max mantissa of a double-precision floating-point number (9.0071993e+15) is right, we could store the precision in 4 bits.
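A quick Python check of that arithmetic, assuming "max mantissa" means 2^53 (the point up to which every integer is exactly representable in a double) and that the stored precision is a digit count from 1 to 16.

```python
max_mantissa = 2 ** 53                   # 9007199254740992
decimal_digits = len(str(max_mantissa))  # how many decimal digits that is

# A digit count in the range 1..decimal_digits has decimal_digits distinct
# values, so it needs this many bits when stored as (count - 1).
bits_needed = (decimal_digits - 1).bit_length()

# Sanity check on the choice of 2**53: the next integer up is not
# representable as a double and rounds back down.
next_is_lost = float(max_mantissa + 1) == float(max_mantissa)

print(max_mantissa, decimal_digits, bits_needed, next_is_lost)
```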

Sounds a bit like unums:

https://en.wikipedia.org/wiki/Unum_(number_format)

I'm not convinced that has something to do with why floats are so confusing. The same could be said about integers but people are fine with 1000/16 as it matches how they naturally think about integers even though in the computer it's done in binary.

To me it seems much more likely the simple explanation that 2^E * 1.M for some number of bits is unnatural to people used to n*10^E with an unlimited number of digits.

To drive this home, I don't think I ever heard anyone be confused about why two fixed-point numbers resulted in a certain value after being told what a fixed-point number was.

The article content is decent, but I can't stand the fact that the author used this bit of CSS to make all the text unreasonably small: <style> body, div, ... { font-size: x-small } </style>