The IEEE Standard for Floating-Point Numbers

The IEEE 754 standard for floating-point numbers specifies how single-precision (32-bit) and double-precision (64-bit) floating-point numbers are represented.

1. Single Precision:

The IEEE single-precision floating-point standard representation requires a 32-bit word. The first bit to the left is the sign bit, the next eight bits are the exponent bits, and the final 23 bits form the fraction part.

The value of a normalized number is $(-1)^s \times 1.m \times 2^{e-127}$, where $s = 1$ for a negative number and $s = 0$ for a positive number, $m$ is the mantissa (the fraction bits), and $e$ is the value of the exponent field.
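As a rough illustration of the single-precision layout, the sketch below splits a Python float into its sign, exponent, and fraction fields and rebuilds the value from the normalized-number formula. The function names `decode_single` and `value_single` are made up for this example, and the reconstruction only applies to normalized numbers (exponent field between 1 and 254); zeros, subnormals, infinities, and NaNs are not handled.

```python
import struct

def decode_single(x):
    """Return the (s, e, m) fields of x stored as a 32-bit IEEE single."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an unsigned int.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31            # 1 sign bit
    e = (bits >> 23) & 0xFF   # 8 exponent bits
    m = bits & 0x7FFFFF       # 23 fraction bits
    return s, e, m

def value_single(s, e, m):
    """Rebuild a normalized value: (-1)^s * 1.m * 2^(e - 127)."""
    # Assumption: 1 <= e <= 254, i.e. the number is normalized.
    return (-1) ** s * (1 + m / 2**23) * 2.0 ** (e - 127)

s, e, m = decode_single(-6.5)
print(s, e, m, value_single(s, e, m))
```

Here $-6.5 = -1.625 \times 2^2$, so the exponent field holds $2 + 127 = 129$ and the fraction field holds $0.625 \times 2^{23}$.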

2. Double Precision:

The IEEE double-precision floating-point standard representation requires a 64-bit word. The first bit is the sign bit, the next eleven bits are the exponent bits, and the final 52 bits form the fraction part.
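The same field extraction can be sketched for double precision. The text above does not state the exponent bias for this format; the code assumes the standard's value of 1023, so a normalized number is $(-1)^s \times 1.m \times 2^{e-1023}$. The names `decode_double` and `value_double` are again hypothetical, and only normalized numbers are handled.

```python
import struct

def decode_double(x):
    """Return the (s, e, m) fields of x stored as a 64-bit IEEE double."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    s = bits >> 63                 # 1 sign bit
    e = (bits >> 52) & 0x7FF       # 11 exponent bits
    m = bits & ((1 << 52) - 1)     # 52 fraction bits
    return s, e, m

def value_double(s, e, m):
    """Rebuild a normalized value, assuming a bias of 1023."""
    return (-1) ** s * (1 + m / 2**52) * 2.0 ** (e - 1023)

s, e, m = decode_double(0.1)
print(s, e, hex(m), value_double(s, e, m) == 0.1)
```

Because Python floats are themselves IEEE doubles on common platforms, the rebuilt value compares equal to the original.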

Example Problem:

The following is a scheme for floating-point number representation using 16 bits.

Bit Position	15	14...9	8...0
	$s$	$e$	$m$
	sign	exponent	mantissa

Let $s$, $e$, and $m$ be the numbers represented in binary in the sign, exponent, and mantissa fields, respectively. Then the floating-point number represented is

$\left\{\begin{matrix}(-1)^s(1+m\times2^{-9})2^{e-31} & \mbox{if exponent} \neq 111111\\0 & \mbox{otherwise}\end{matrix}\right.$

What is the maximum difference between two successive real numbers representable in this system? (GATE 2003)

(A) $2^{-40}$

(B) $2^{-9}$

(C) $2^{22}$

(D) $2^{31}$

Answer: (C) $2^{22}$

Explanation:

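The largest positive number has mantissa $m = 111111111$ (all nine bits set) and exponent $e = 111110$, that is, $e = 62$, because $e = 111111$ is reserved for zero.

The second largest positive number has mantissa $m = 111111110$ and the same exponent, $e = 111110$.

Within a fixed exponent, successive numbers differ by $2^{-9} \times 2^{e-31}$, so the gap is widest at the largest usable exponent, where $2^{e-31} = 2^{31}$.

Difference $= 2^{31}\left[(2 - 2^{-9}) - (2 - 2^{-8})\right] = 2^{31} \times 2^{-9} = 2^{22}$.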
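The result can also be checked by brute force. The sketch below enumerates every bit pattern of the 16-bit scheme, decodes it with the formula from the problem statement, and takes the largest gap between neighbouring values. The helper name `decode16` is invented for this example, and `Fraction` is used only so that the arithmetic is exact.

```python
from fractions import Fraction

def decode16(s, e, m):
    """Value of the 16-bit format: 1 sign bit, 6 exponent bits, 9 mantissa bits."""
    if e == 0b111111:
        return Fraction(0)  # the all-ones exponent encodes zero
    return (-1) ** s * (1 + Fraction(m, 2**9)) * Fraction(2) ** (e - 31)

# Collect every distinct representable value and sort them.
values = sorted({decode16(s, e, m)
                 for s in (0, 1)
                 for e in range(64)
                 for m in range(512)})

# Largest difference between two successive representable numbers.
max_gap = max(b - a for a, b in zip(values, values[1:]))
print(max_gap == 2**22)
```

The widest gap occurs between the two largest values, which matches the reasoning above.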
