
FLOAT(3) BSD Library Functions Manual FLOAT(3)

NAME

float - description of floating-point types available on OS X

DESCRIPTION

This page describes the available C floating-point types. For a list of
math library functions that operate on these types, see the page on the
math library, "man math".

TERMINOLOGY

Floating-point numbers are represented in three parts: a sign, a mantissa
(or significand), and an exponent. Given such a representation with sign
s, mantissa m, and exponent e, the corresponding numerical value is
s*m*2**e.
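As a rough illustration of the sign, mantissa, and exponent decomposition, the sketch below splits a double with the standard frexp() function and rebuilds it with ldexp(). The names fp_parts, fp_split, and fp_join are hypothetical helpers invented for this example. Note that frexp() normalizes the mantissa to the interval [0.5, 1), which is one of several equivalent conventions.

```c
#include <math.h>

/* Hypothetical type for this example: x == sign * mantissa * 2**exponent. */
struct fp_parts {
    int sign;         /* +1 or -1 */
    double mantissa;  /* in [0.5, 1) for finite, non-zero x */
    int exponent;
};

/* Split a finite, non-zero double into its three parts. */
struct fp_parts fp_split(double x)
{
    struct fp_parts p;
    p.sign = signbit(x) ? -1 : 1;
    p.mantissa = frexp(fabs(x), &p.exponent);
    return p;
}

/* Rebuild the value s*m*2**e from its parts. */
double fp_join(struct fp_parts p)
{
    return p.sign * ldexp(p.mantissa, p.exponent);
}
```

For instance, fp_split(-6.0) should report a sign of -1, a mantissa of 0.75, and an exponent of 3, since -6 = -0.75 * 2**3.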

Floating-point types differ in the number of bits of accuracy in the
mantissa (called the precision), and set of available exponents (the
exponent range).

Floating-point numbers with the maximum available exponent are reserved
operands, denoting an infinity if the significand is precisely zero, and
a Not-a-Number, or NaN, otherwise.

Floating-point numbers with the minimum available exponent are either
zero, if the significand is precisely zero, or denormal otherwise. Note
that zero is signed: +0 and -0 are distinct floating-point numbers.

Floating-point numbers with exponents other than the maximum and minimum
available are called normal numbers.

PROPERTIES OF IEEE-754 FLOATING-POINT

Basic arithmetic operations in IEEE-754 floating-point are correctly
rounded: this means that the result delivered is the same as the result
that would be achieved by computing the exact real-number operation on
the operands, then rounding the real-number result to a floating-point
value.

Overflow occurs when the value of the exact result is too large in
magnitude to be represented in the floating-point type in which the
computation is being performed; doing so would require an exponent
outside of the exponent range of the type. By default, computations that
result in overflow return a signed infinity.

Underflow occurs when the value of the exact result is too small in
magnitude to be represented as a normal number in the floating-point type
in which the computation is being performed. By default, underflow is
gradual, and produces a denormal number or a zero.
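The default overflow and underflow results can be observed with a short sketch like the one below. The function names are hypothetical, and the volatile qualifiers are only there to discourage the compiler from evaluating the expressions at compile time. The sketch assumes the default floating-point environment (no flush-to-zero mode).

```c
#include <float.h>
#include <math.h>

/* Nonzero if doubling the largest finite float overflows to +infinity. */
int overflow_gives_infinity(void)
{
    volatile float big = FLT_MAX;
    volatile float r = big * 2.0f;   /* exact result needs too large an exponent */
    return isinf(r) && r > 0;
}

/* Nonzero if halving the smallest normal float yields a denormal. */
int underflow_is_gradual(void)
{
    volatile float tiny = FLT_MIN;
    volatile float r = tiny / 2.0f;  /* too small to be a normal number */
    return fpclassify(r) == FP_SUBNORMAL;
}
```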

All floating-point numbers of a given type are integer multiples of the
smallest non-zero floating-point number of that type; however, the
converse is not true. This means that, in the default mode, (x-y) = 0
only if x = y.
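One way to see this property is to subtract two adjacent doubles near the bottom of the normal range: with gradual underflow the difference is a non-zero denormal rather than zero. The sketch below assumes the default mode, and the function name is a hypothetical helper.

```c
#include <float.h>
#include <math.h>

/* Nonzero if two distinct, adjacent tiny doubles have a non-zero
   (denormal) difference, as gradual underflow guarantees. */
int tiny_difference_is_nonzero(void)
{
    volatile double x = DBL_MIN;                        /* smallest positive normal */
    volatile double y = DBL_MIN * (1.0 + DBL_EPSILON);  /* next double above x */
    volatile double d = y - x;                          /* one unit of the smallest denormal */
    return x != y && d != 0.0 && fpclassify(d) == FP_SUBNORMAL;
}
```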

The sign of zero transforms correctly through multiplication and
division, and is preserved by addition of zeros with like signs, but
x - x yields +0 for every finite floating-point number x. The only
operations that reveal the sign of a zero are x/(+-0) and
copysign(x,+-0). In particular, comparisons (x > y, x != y, etc.) are not
affected by the sign of zero.

The sign of infinity transforms correctly through multiplication and
division, and infinities are unaffected by addition or subtraction of any
finite floating-point number. But Inf-Inf, Inf*0, and Inf/Inf are, like
0/0 or sqrt(-3), invalid operations that produce NaN.
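The behavior of signed zeros and infinities described above can be checked with a sketch along these lines. All three function names are hypothetical helpers, and volatile is used only to keep the compiler from folding the expressions.

```c
#include <math.h>

/* Nonzero if +0 and -0 compare equal although copysign() tells them apart. */
int zeros_compare_equal(void)
{
    volatile double pz = 0.0, nz = -0.0;
    return pz == nz && copysign(1.0, pz) == 1.0 && copysign(1.0, nz) == -1.0;
}

/* Nonzero if division reveals the sign of zero: 1/+0 is +Inf, 1/-0 is -Inf. */
int division_reveals_zero_sign(void)
{
    volatile double pz = 0.0, nz = -0.0, one = 1.0;
    volatile double a = one / pz, b = one / nz;
    return isinf(a) && a > 0 && isinf(b) && b < 0;
}

/* Nonzero if Inf - Inf and Inf * 0 both produce NaN. */
int invalid_infinity_ops_give_nan(void)
{
    volatile double inf = INFINITY, zero = 0.0;
    volatile double d = inf - inf, p = inf * zero;
    return isnan(d) && isnan(p);
}
```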

NaNs are the default results of invalid operations, and they propagate
through subsequent arithmetic operations. If x is a NaN, then x != x is
TRUE, and every other comparison predicate (x > y, x = y, x <= y, etc.)
evaluates to FALSE, regardless of the value of y. Additionally,
predicates that entail an ordered comparison (rather than mere equality
or inequality) signal Invalid Operation when one of the arguments is NaN.
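The comparison rules for NaN can be sketched in C as follows. The helper names are hypothetical; in real code the standard isnan() macro is the usual way to test for NaN. Compiler options that relax IEEE-754 semantics (for example fast-math modes) may break these identities.

```c
#include <math.h>

/* Hypothetical helper: the self-inequality test described above. */
int is_nan_by_comparison(double x)
{
    return x != x;   /* true only when x is a NaN */
}

/* Nonzero if, for a NaN n, only n != y holds among the comparisons with y. */
int nan_comparisons_all_false(double y)
{
    volatile double n = NAN;
    return !(n == y) && !(n < y) && !(n <= y)
        && !(n > y) && !(n >= y) && (n != y);
}
```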

IEEE-754 provides five kinds of floating-point exceptions, listed below:

     Exception            Default Result
     Invalid Operation    NaN or FALSE
     Overflow             +-Infinity
     Divide by Zero       +-Infinity
     Underflow            Gradual Underflow
     Inexact              Rounded Value

NOTE: An exception is not an error unless it is handled incorrectly.

What makes a class of exceptions exceptional is that no single default response can be satisfactory in every instance. On the other hand, because a default response will serve most instances of the exception satisfactorily, simply aborting the computation cannot be justified.

For each kind of floating-point exception, IEEE-754 provides a flag that

is raised each time its exception is signaled, and remains raised until the program resets it. Programs may test, save, and restore the flags, or a subset thereof.

PRECISION AND EXPONENT RANGE OF SPECIFIC FLOATING-POINT TYPES

On both Intel and PPC macs, the type float corresponds to IEEE-754 single

precision. A single-precision number is represented in 32 bits, and has

a precision of 24 significant bits, roughly like 7 significant decimal digits. 8 bits are used to encode the exponent, which gives an exponent

range from -126 to 127, inclusive.

The <float.h> header defines several useful constants for the float type:

FLT_MANT_DIG - the number of binary digits in the significand of a float.

FLT_MIN_EXP - one more than the smallest exponent available in the float
type.

FLT_MAX_EXP - one more than the largest exponent available in the float
type.

FLT_DIG - the precision in decimal digits of a float. A decimal value
with this many digits, stored as a float, always yields the same value up
to this many digits when converted back to decimal notation.

FLT_MIN_10_EXP - the smallest n such that 10**n is a non-zero normal
number as a float.

FLT_MAX_10_EXP - the largest n such that 10**n is finite as a float.

FLT_MIN - the smallest positive normal float.

FLT_MAX - the largest finite float.

FLT_EPSILON - the difference between 1.0 and the smallest float bigger
than 1.0.
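As a quick check of how these constants relate to the figures quoted for single precision, the sketch below compares them against the stated precision and exponent range, and verifies the meaning of the epsilon constant. Both function names are hypothetical helpers, and the expected values assume float is IEEE-754 single precision as described above.

```c
#include <float.h>

/* Nonzero if float has 24 significant bits and exponents -126 .. 127;
   the MIN_EXP and MAX_EXP constants are each one more than those bounds. */
int float_matches_ieee_single(void)
{
    return FLT_MANT_DIG == 24 && FLT_MIN_EXP == -125 && FLT_MAX_EXP == 128;
}

/* Nonzero if FLT_EPSILON is the gap between 1.0f and the next float:
   adding it changes 1.0f, while adding a quarter of it does not. */
int epsilon_is_gap_above_one(void)
{
    volatile float one = 1.0f;
    volatile float eps = FLT_EPSILON;
    volatile float next = one + eps;         /* smallest float bigger than 1.0 */
    volatile float same = one + eps / 4.0f;  /* rounds back to 1.0 */
    return next > one && same == one;
}
```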

On both Intel and PPC macs, the type double corresponds to IEEE-754
double precision. A double-precision number is represented in 64 bits,
and has a precision of 53 significant bits, roughly like 16 significant
decimal digits. 11 bits are used to encode the exponent, which gives an
exponent range from -1022 to 1023, inclusive.

The <float.h> header defines several useful constants for the double
type:

DBL_MANT_DIG - the number of binary digits in the significand of a
double.

DBL_MIN_EXP - one more than the smallest exponent available in the double
type.

DBL_MAX_EXP - one more than the largest exponent available in the double
type.

DBL_DIG - the precision in decimal digits of a double. A decimal value
with this many digits, stored as a double, always yields the same value
up to this many digits when converted back to decimal notation.

DBL_MIN_10_EXP - the smallest n such that 10**n is a non-zero normal
number as a double.

DBL_MAX_10_EXP - the largest n such that 10**n is finite as a double.

DBL_MIN - the smallest positive normal double.

DBL_MAX - the largest finite double.

DBL_EPSILON - the difference between 1.0 and the smallest double bigger
than 1.0.
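The round-trip guarantee behind the decimal-digits constant for double can be illustrated with a small sketch: a decimal string with no more than that many significant digits is converted to double and printed back with the same number of digits. The helper name survives_round_trip is hypothetical, and the string comparison assumes the input is already written in the form that "%g" produces (no trailing zeros, no exponent).

```c
#include <float.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: nonzero if the decimal string `text` is reproduced
   exactly after conversion to double and back with DBL_DIG digits. */
int survives_round_trip(const char *text)
{
    char buf[64];
    double d = strtod(text, NULL);                 /* decimal -> double */
    snprintf(buf, sizeof buf, "%.*g", DBL_DIG, d); /* double -> decimal */
    return strcmp(buf, text) == 0;
}
```

A 15-digit input such as "0.123456789012345" is expected to survive, while a 17-digit input cannot, because the output is limited to 15 significant digits.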

On Intel macs, the type long double corresponds to IEEE-754 double
extended precision. A double extended number is represented in 80 bits,
and has a precision of 64 significant bits, roughly like 19 significant
decimal digits. 15 bits are used to encode the exponent, which gives an
exponent range from -16382 to 16383, inclusive.

The <float.h> header defines several useful constants for the long double
type:

LDBL_MANT_DIG - the number of binary digits in the significand of a long
double.

LDBL_MIN_EXP - one more than the smallest exponent available in the long
double type.

LDBL_MAX_EXP - one more than the largest exponent available in the long
double type.

LDBL_DIG - the precision in decimal digits of a long double. A decimal
value with this many digits, stored as a long double, always yields the
same value up to this many digits when converted back to decimal
notation.

LDBL_MIN_10_EXP - the smallest n such that 10**n is a non-zero normal
number as a long double.

LDBL_MAX_10_EXP - the largest n such that 10**n is finite as a long
double.

LDBL_MIN - the smallest positive normal long double.

LDBL_MAX - the largest finite long double.

LDBL_EPSILON - the difference between 1.0 and the smallest long double
bigger than 1.0.

LONG DOUBLE ON POWERPC MACS

On PowerPC macs, by default the type long double is mapped to IEEE-754

double precision, described above. If additional precision is required,

a non-IEEE-754 128 bit long double format is also available. To use this

format, compile with the -mlong-double-128 flag. If you wish to use the

math library functions, you must include the linker flag -lmx as well as the

usual -lm. The -mlong-double-128 flag is only valid when the target

architecture is ppc or ppc64.

Each 128-bit long double is made up of two IEEE doubles (head and tail).

The value of the long double is the sum of the values of the two parts

(unless the head double has value -0.0, in which case the value of the

long double is -0.0). The value of the head is required to be the value

of the long double rounded to the nearest double. If the head is an infinity, the value of the long double is the value of the head, and the

tail must be +-0.0. The tail of a NaN can be any double value. There

are many 128-bit bit patterns that are not valid as long doubles. These

do not represent any value.
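To give a feel for the head and tail representation, the sketch below uses Knuth's TwoSum algorithm to turn the exact sum of two doubles into such a pair: the head is the sum rounded to the nearest double, and the tail is the rounding error. This is only an illustration of the idea, not the system's actual long double implementation; the type head_tail and the function two_sum are hypothetical names, and the volatile qualifiers are an assumption made to force each step to be rounded to double.

```c
/* Hypothetical type: a pair whose value is head + tail. */
struct head_tail {
    double head;   /* the value rounded to the nearest double */
    double tail;   /* what is left over after rounding */
};

/* Knuth's TwoSum: head + tail equals a + b exactly (barring overflow). */
struct head_tail two_sum(double a, double b)
{
    struct head_tail r;
    volatile double s  = a + b;    /* rounded sum */
    volatile double bv = s - a;    /* the part of b that reached s */
    volatile double av = s - bv;   /* the part of a that reached s */
    volatile double da = a - av;   /* rounding error attributable to a */
    volatile double db = b - bv;   /* rounding error attributable to b */
    r.head = s;
    r.tail = da + db;
    return r;
}
```

For example, two_sum(1.0, 0x1p-60) should give a head of 1.0 and a tail of 2**-60, a value that a single double could not hold.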

The 128-bit long double format provides 106 significant bits, which is

roughly like 31 significant decimal digits. It has the same exponent

range as double, from -1022 to 1023, inclusive. The usual constants are

provided by <float.h>, as described above.
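Because the long double format depends on the target and on compiler flags, a program can inspect the significand-width constant from <float.h> to find out which of the formats described on this page is in use. The following is a minimal sketch; the function name long_double_format and the returned strings are hypothetical.

```c
#include <float.h>

/* Hypothetical helper: describe the long double format in use, based on
   the number of significand bits reported by <float.h>. */
const char *long_double_format(void)
{
#if LDBL_MANT_DIG == 64
    return "80-bit double extended";
#elif LDBL_MANT_DIG == 53
    return "same as double";
#elif LDBL_MANT_DIG == 106
    return "128-bit head and tail pair";
#else
    return "other";   /* a format not covered by this page */
#endif
}
```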

In the 128-bit long double format, addition and subtraction have a
relative error bound of one ulp, or 2**-106. Multiplication has a
relative error within 2 ulps, and division a relative error within 3
ulps.