
The IEEE Standard for Floating-Point Numbers

The IEEE standard for floating-point numbers (IEEE 754) specifies how single-precision (32-bit) and double-precision (64-bit) floating-point numbers are represented. 

1. Single Precision: 
The IEEE single-precision floating-point standard representation requires a 32-bit word. The leftmost bit is the sign bit, the next eight bits are the exponent bits, and the final 23 bits form the fraction part.

IEEE single-precision floating-point standard representation

The value of a normalized number $= (-1)^s \times 1.m \times 2^{e-127}$, where $s = 1$ for a negative number and $s = 0$ for a positive number, $m$ is the mantissa (the 23 fraction bits), and $e$ is the value stored in the exponent field. 
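As a minimal sketch of the formula above, the following Python snippet splits a single-precision number into its three fields and rebuilds the value as $(-1)^s \times 1.m \times 2^{e-127}$. The function names `decode_single` and `normalized_value` are illustrative choices, not part of the standard, and the sketch covers normalized numbers only (exponent field between 1 and 254).

```python
import struct

def decode_single(x):
    """Round x to IEEE single precision and return its (s, e, m) bit fields."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31             # 1 sign bit
    e = (bits >> 23) & 0xFF    # 8 exponent bits
    m = bits & 0x7FFFFF        # 23 fraction bits
    return s, e, m

def normalized_value(s, e, m):
    """Evaluate (-1)^s * 1.m * 2^(e-127); assumes 1 <= e <= 254."""
    return (-1) ** s * (1 + m / 2**23) * 2.0 ** (e - 127)

s, e, m = decode_single(-6.25)
print(s, e, m)                    # 1 129 4718592
print(normalized_value(s, e, m))  # -6.25
```

Here $-6.25 = -1.5625 \times 2^2$, so the stored exponent is $2 + 127 = 129$ and the fraction bits encode $0.5625$.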

2. Double Precision: 
The IEEE double-precision floating-point standard representation requires a 64-bit word. The first bit is the sign bit, the next eleven bits are the exponent bits, and the final 52 bits form the fraction part. 

IEEE double-precision floating-point standard representation
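The same decoding can be sketched for double precision. Python's `float` is a 64-bit double on common platforms, so the fields can be read directly. The exponent bias of 1023 used below is not stated in the text above; it is the standard bias for the 11-bit exponent field. The names `decode_double` and `double_value` are illustrative, and the sketch again covers normalized numbers only.

```python
import struct

def decode_double(x):
    """Return the (s, e, m) bit fields of x as an IEEE double."""
    # Pack as a big-endian 64-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = bits >> 63               # 1 sign bit
    e = (bits >> 52) & 0x7FF     # 11 exponent bits
    m = bits & ((1 << 52) - 1)   # 52 fraction bits
    return s, e, m

def double_value(s, e, m):
    """Evaluate (-1)^s * 1.m * 2^(e-1023); assumes 1 <= e <= 2046."""
    return (-1) ** s * (1 + m / 2**52) * 2.0 ** (e - 1023)

print(decode_double(1.0))                  # (0, 1023, 0)
print(double_value(*decode_double(0.1)))   # 0.1
```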

Example Problem: 
The following is a scheme for floating-point number representation using 16 bits. 
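$\begin{array}{|l|c|c|c|}\hline \mbox{Bit Position} & 15 & 14 \ldots 9 & 8 \ldots 0\\ \hline & s & e & m\\ \hline & \mbox{sign} & \mbox{exponent} & \mbox{mantissa}\\ \hline \end{array}$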
Let $s$, $e$, and $m$ be the numbers represented in binary in the sign, exponent, and mantissa fields, respectively. Then the floating-point number represented is
$\left\{\begin{matrix}(-1)^s(1+m\times2^{-9})2^{e-31} & \mbox{if exponent} \neq 111111\\0 & \mbox{otherwise}\end{matrix}\right.$
What is the maximum difference between two successive real numbers representable in this system? (GATE 2003) 
(A) $2^{-40}$ 
(B) $2^{-9}$ 
(C) $2^{22}$ 
(D) $2^{31}$ 

Answer: (C) $2^{22}$ 
Explanation: 
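The mantissa field has 9 bits (positions 8 to 0) and the exponent field has 6 bits (positions 14 to 9). Since the exponent $111111$ is reserved for zero, the largest usable exponent is $e = 111110 = 62$, which gives a scale factor of $2^{62-31} = 2^{31}$. The spacing between successive numbers is widest at this largest exponent. 
Largest positive number: $m = 111111111$, $e = 111110$, value $= (2 - 2^{-9}) \times 2^{31}$ 
Second largest positive number: $m = 111111110$, $e = 111110$, value $= (2 - 2^{-8}) \times 2^{31}$ 
Difference $= 2^{31}(2 - 2^{-9} - 2 + 2^{-8}) = 2^{31} \times 2^{-9} = 2^{22}$. 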
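
The answer can also be checked by brute force. The sketch below enumerates every bit pattern of the 16-bit scheme, reading the exponent of the scale factor as $e - 31$ with a 6-bit exponent field and a 9-bit mantissa field, and finds the largest gap between successive representable values. The helper name `value` is an illustrative choice, and exact rational arithmetic is used so that no rounding affects the comparison.

```python
from fractions import Fraction

def value(s, e, m):
    """Value of the 16-bit pattern with sign s, 6-bit exponent e, 9-bit mantissa m."""
    if e == 0b111111:          # reserved exponent: the number is zero
        return Fraction(0)
    return (-1) ** s * (1 + Fraction(m, 2**9)) * Fraction(2) ** (e - 31)

# Collect the distinct representable values in increasing order.
values = sorted({value(s, e, m)
                 for s in (0, 1)
                 for e in range(64)
                 for m in range(512)})

# Largest difference between neighbouring values.
max_gap = max(b - a for a, b in zip(values, values[1:]))
print(max_gap == 2**22)   # True
```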
