1. IEEE 754-1985 Float Calculation on 2/25/2016, 1:53 pm

Reclaimer Shawn

Code Creator
Alright, so first off: what exactly does IEEE 754-1985 stand for anyways? Well, IEEE stands for the Institute of Electrical and Electronics Engineers, 754 is the number of the standard, and 1985 is the year it was adopted. It is the system modern computers still use to represent fractional (non-whole) numbers in a binary format. Before I go any further: this tutorial assumes you already know how to perform written calculations in both binary and hexadecimal without the use of a calculator. If you cannot do these things, please leave while you still can. Alright, now onto the good stuff.

Float values are stored in 32 bits, meaning they use 32 0's and 1's (bits) to represent a number. The IEEE 754 format splits those bits into 1 sign bit, then 8 exponent bits, then 23 mantissa (significand) bits.

Alright, time to break it apart. The sign uses a sign-magnitude scheme, which means that a 1 signifies a negative number, while a 0 signifies a positive one. Now, let's choose a random number... Let's try -6.125, for instance. The first thing to do is place a 1 in the sign bit to represent a negative:

1    | XXXXXXXX  | XXXXXXXXXXXXXXXXXXXXXXX
Sign | Exponent  | Mantissa/Significand
Now, we find the mantissa. First, we work out the whole-number part of the number. We know by now that 6 = 110 in binary, so we'll place this down here:

110.


Now, you know how in binary we use powers of 2? Well, for the part after the point we use negative powers of 2, much like scientific notation does with powers of 10. The first place is 2^-1, or 1/2^1. The next is 1/2^2, and so on. We still have .125 left to represent, so we check if 1/2^1 goes in... Does .5 go into .125? Nope, so we place a zero. Now, we try 1/2^2. Does .25 go in? Nope, so we place another zero. Now, we try 1/2^3, which is .125. That goes in and zeroes out the remainder, so we place a 1 and stop there. Our "denormalized" number (meaning simply not yet normalized, not the special denormal values of the standard) is as such:

110.001
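If you want to check the "does it go in?" steps by machine, here is a minimal Python sketch of that procedure. The helper name fraction_to_bits is made up for this example, and it uses exact fractions so nothing gets rounded along the way.

```python
from fractions import Fraction

def fraction_to_bits(frac, max_bits=23):
    """Hypothetical helper: expand a fraction in [0, 1) into binary digits.

    Tries 1/2, 1/4, 1/8, ... in turn. A place that fits gets a 1 and is
    subtracted from the remainder; a place that does not fit gets a 0.
    """
    remainder = Fraction(frac)
    bits = ""
    for power in range(1, max_bits + 1):
        if remainder == 0:
            break  # nothing left to represent, so we stop
        place = Fraction(1, 2 ** power)
        if place <= remainder:
            bits += "1"
            remainder -= place
        else:
            bits += "0"
    return bits

whole = bin(6)[2:]               # whole-number part: '110'
frac = fraction_to_bits("0.125")  # fractional part: '001'
print(whole + "." + frac)        # 110.001
```

The max_bits limit is only there so a fraction that never zeroes out (0.1, say) does not loop forever.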


Now, we have to "normalize" this to make it work. Sure, we write binary fractions in the denormalized format, but a computer does not store them that way. What we do is move the binary point to the left as many times as it takes to place it right after the first (leftmost) 1-bit. It turns out like this:

1.10001



Now, we drop the leading 1 and the binary point, and get this:

10001

"Pad" the remaining 18 bits with 0's. We know it's 18 from (number of mantissa bits) - (number of bits in the "normalized strand"): 23 - 5 = 18 remaining bits. We get this:

1    | XXXXXXXX  | 10001000000000000000000
Sign | Exponent  | Significand/Mantissa
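The normalize, drop, and pad steps can be sketched in a few lines of Python. This is only an illustration: the function name normalize is hypothetical, it assumes a positive binary string whose whole part is at least 1, and it simply cuts off extra bits instead of rounding them.

```python
def normalize(binary):
    """Hypothetical helper: turn '110.001' into (shift, 23-bit mantissa field).

    Assumes a positive value with a whole part of at least 1.
    """
    whole, _, frac = binary.partition(".")
    whole = whole.lstrip("0")
    shift = len(whole) - 1             # how many places the point moves left
    kept = whole[1:] + frac            # drop the leading 1 and the point
    field = kept.ljust(23, "0")[:23]   # pad with 0's (extra bits are cut, not rounded)
    return shift, field

shift, field = normalize("110.001")
print(shift)   # 2
print(field)   # 10001000000000000000000
```

Keep the shift count around, since the exponent is built from it.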

Now, we find the exponent. We have to remember how many times we moved the binary point to the left to "normalize" the number. We moved it two times to the left. To that we add what is called the "bias". The bias is the highest number we get in a signed (+/-) system of that many bits. The highest number in a signed system with 8 bits is 127. We now add the exponent (2) to 127, and we get 129. Now, all we do is work out 129 in binary and load it into the exponent bits. 129 = 10000001 in binary, so we load that into the exponent... Our full float notation number is:

1    | 10000001  | 10001000000000000000000
Sign | Exponent  | Significand/Mantissa

In hexadecimal, that is 0xC0C40000.
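To check the hand calculation against what the machine actually stores, something like the following should work. It uses Python's standard struct module to pack -6.125 as a big-endian 32-bit float and read the same four bytes back as an unsigned integer. The variable names are just for this example.

```python
import struct

# Fields worked out by hand for -6.125
sign = "1"                            # negative
exponent = format(2 + 127, "08b")     # shift of 2 plus the bias of 127
mantissa = "10001" + "0" * 18         # normalized bits padded to 23
by_hand = int(sign + exponent + mantissa, 2)

# The same number as the machine encodes it
machine = struct.unpack(">I", struct.pack(">f", -6.125))[0]

print(exponent)                       # 10000001
print(hex(by_hand), hex(machine))     # 0xc0c40000 0xc0c40000
```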



And we're done! Now, on to double notation. I included this in the same lesson due to its similarities. The only differences are these:

A double uses 64 bits: 1 sign bit, 11 exponent bits, and 52 significand bits. The bias is now 1023 (with 11 bits, 1023 is the highest signed number), and the significand holds 52 bits, allowing for a precision down to 1/2^52 instead of the 1/2^23 precision of float. Keep in mind that this is God-awful for fractions that are not sums of powers of 2 (0.1, for example): they have to be rounded in the end, and EVERY bit gets used just to represent that rounded number. I'll put some extra terminology in the next post, but for now, this is how you can calculate in float! Enjoy!
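The same struct trick works for doubles. This sketch splits a 64-bit double into its three fields (the helper name double_fields is made up here) and shows both the clean case of -6.125 and the messy case of 0.1.

```python
import struct

def double_fields(value):
    """Hypothetical helper: return (sign, exponent, significand) bit strings."""
    raw = struct.unpack(">Q", struct.pack(">d", value))[0]
    bits = format(raw, "064b")
    return bits[0], bits[1:12], bits[12:]   # 1 + 11 + 52 bits

sign, exponent, significand = double_fields(-6.125)
print(sign, exponent)              # 1 10000000001  (2 + 1023 = 1025)
print(significand.rstrip("0"))     # 10001

# 0.1 is not a sum of powers of 2, so the pattern fills all 52 bits
print(double_fields(0.1)[2])
```

Notice that the significand bits of -6.125 are the same 10001 as in the float, just padded out to 52 places.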
A float Calculator to check your work:

Last edited by Reclaimer Shawn on 2/25/2016, 2:18 pm; edited 1 time in total

2. Terminology and Other Factoids on 2/25/2016, 2:15 pm

Reclaimer Shawn

Code Creator
Truncation: Dropping the fractional part, which rounds the value toward zero. (Not to be confused with rounding to the nearest whole number, where a fraction below .5 rounds down and .5 or more rounds up.)

Flooring: Rounding a value down. (Bringing it to the floor, as I like to think.)

Ceiling: Rounding a value up. (Raising it up to the ceiling.)

For example:

Number:               -12.4   12.6  -12.6   12.4
Flooring:             -13     12    -13     12
Ceiling:              -12     13    -12     13
Truncating:           -12     12    -12     12
Rounding to nearest:  -12     13    -13     12
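These methods are easy to try out with Python's standard math module. In this quick check math.trunc drops the fractional part (toward zero), while the built-in round goes to the nearest whole number. One caveat: Python's round sends exact .5 ties to the nearest even number, which does not matter for these four sample values.

```python
import math

numbers = [-12.4, 12.6, -12.6, 12.4]

print([math.floor(n) for n in numbers])   # [-13, 12, -13, 12]
print([math.ceil(n) for n in numbers])    # [-12, 13, -12, 13]
print([math.trunc(n) for n in numbers])   # [-12, 12, -12, 12]
print([round(n) for n in numbers])        # [-12, 13, -13, 12]
```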

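Not a Number (NaN)
A NaN is any pattern with all 8 exponent bits set to 1 and a nonzero mantissa.
Types of NaNs
Quiet NaN (QNaN): A NaN that simply results from an undefined or erroneous calculation and passes quietly through later calculations. Say, the hexadecimal number 0x7FFFFFFF, which in a signed 32-bit system is usually the highest number, but here, it's an error. On most current hardware (x86 and ARM, for example) a quiet NaN has the top mantissa bit set, so 0x7FC00000 is a QNaN as well.
Signalling NaN (SNaN): Used for either debugging purposes or flagging illegal program operations. On the same hardware a SNaN has the top mantissa bit clear, so a SNaN might be 0x7FA00000.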
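Here is a rough way to sort bit patterns into quiet and signalling NaNs. It assumes the convention used by x86 and ARM (top mantissa bit set means quiet); some older architectures do it the other way around. The function name classify_nan is made up for this sketch, and it works on the integer pattern directly so the hardware never gets a chance to quiet a signalling NaN.

```python
def classify_nan(pattern):
    """Hypothetical helper: classify a 32-bit float bit pattern.

    Assumes the common convention where the top mantissa bit
    (0x00400000) marks a quiet NaN.
    """
    exponent = (pattern >> 23) & 0xFF
    mantissa = pattern & 0x7FFFFF
    if exponent != 0xFF or mantissa == 0:
        return "not a NaN"          # ordinary number, zero, or infinity
    return "QNaN" if mantissa & 0x400000 else "SNaN"

for pattern in (0x7FFFFFFF, 0x7FC00000, 0x7FA00000, 0x7F800000):
    print(hex(pattern), classify_nan(pattern))
```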

Special Operations in IEEE 754:
Finite number/Infinity = 0
(+/-)Infinity*(+/-)Infinity = (+/-)Infinity
(+/-)Nonzero number/0 = (+/-)Infinity
(+/-)0/(+/-)0 = NaN
Infinity-Infinity = NaN
(+/-)Infinity/(+/-)Infinity = NaN
(+/-)Infinity*0 = NaN
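Some of these rules can be tried directly in Python, with one catch: Python raises ZeroDivisionError for any float division by zero instead of returning Infinity or NaN, so those rows cannot be reproduced with plain floats. The sketch below covers the rest, and also checks Infinity/Infinity and Infinity*0, which give NaN as well.

```python
import math

inf = math.inf

print(5.0 / inf)                # 0.0
print(inf * inf, -inf * inf)    # inf -inf
print(math.isnan(inf - inf))    # True
print(math.isnan(inf / inf))    # True
print(math.isnan(inf * 0.0))    # True

# Python refuses to divide by zero rather than follow the IEEE result
try:
    1.0 / 0.0
except ZeroDivisionError:
    print("ZeroDivisionError")
```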

Special Numbers in IEEE 754:
0x7F800000 = Infinity
0xFF800000 = -Infinity
0x7FC00000 = QNaN (one of many possible NaN patterns)
0x80000000 = Negative Zero
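You can decode these patterns yourself with Python's struct module. This is a small sketch and from_bits is a hypothetical helper name: it packs the pattern as a 32-bit unsigned integer and reads the same bytes back as a float.

```python
import math
import struct

def from_bits(pattern):
    """Hypothetical helper: reinterpret a 32-bit pattern as a float."""
    return struct.unpack(">f", struct.pack(">I", pattern))[0]

print(from_bits(0x7F800000))               # inf
print(from_bits(0xFF800000))               # -inf
print(math.isnan(from_bits(0x7FC00000)))   # True
print(from_bits(0x80000000))               # -0.0

# Negative zero compares equal to zero but keeps its sign bit
print(from_bits(0x80000000) == 0.0)                # True
print(math.copysign(1.0, from_bits(0x80000000)))   # -1.0
```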
