The Floating Point Class float

Previously, whole numbers, known as integers, were examined. However, not all physical quantities are whole numbers; many have a fractional component…

Method Resolution Order and Consistency

In the previous tutorials, the int and bool classes were examined, both of which correspond to whole numbers. When the method resolution order of the bool class was examined, it was seen to be a child class of the int class. The method resolution order of the float class shows that it is independent of the int class and is a direct subclass of object:

[float, object]
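
A minimal sketch of how these method resolution orders can be inspected, assuming the mro class method is used (the original does not show which call produced the list above):

```python
# Inspect the method resolution order of the float and bool classes.
float_mro = float.mro()
bool_mro = bool.mro()

print(float_mro)  # [<class 'float'>, <class 'object'>]
print(bool_mro)   # [<class 'bool'>, <class 'int'>, <class 'object'>]
```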

If help on the float class is examined:

Help on class float in module builtins:

class float(object)
 |  float(x=0, /)
 |  Convert a string or number to a floating point number, if possible.
 |  Methods defined here:
 |  __abs__(self, /)
 |      abs(self)
 |  __add__(self, value, /)
 |      Return self+value.
 |  __bool__(self, /)
 |      True if self else False
 |  __ceil__(self, /)
 |      Return the ceiling as an Integral.
 |  __divmod__(self, value, /)
 |      Return divmod(self, value).
 |  __eq__(self, value, /)
 |      Return self==value.
 |  __float__(self, /)
 |      float(self)
 |  __floor__(self, /)
 |      Return the floor as an Integral.
 |  __floordiv__(self, value, /)
 |      Return self//value.
 |  __format__(self, format_spec, /)
 |      Formats the float according to format_spec.
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  __getnewargs__(self, /)
 |  __gt__(self, value, /)
 |      Return self>value.
 |  __hash__(self, /)
 |      Return hash(self).
 |  __int__(self, /)
 |      int(self)
 |  __le__(self, value, /)
 |      Return self<=value.
 |  __lt__(self, value, /)
 |      Return self<value.
 |  __mod__(self, value, /)
 |      Return self%value.
 |  __mul__(self, value, /)
 |      Return self*value.
 |  __ne__(self, value, /)
 |      Return self!=value.
 |  __neg__(self, /)
 |      -self
 |  __pos__(self, /)
 |      +self
 |  __pow__(self, value, mod=None, /)
 |      Return pow(self, value, mod).
 |  __radd__(self, value, /)
 |      Return value+self.
 |  __rdivmod__(self, value, /)
 |      Return divmod(value, self).
 |  __repr__(self, /)
 |      Return repr(self).
 |  __rfloordiv__(self, value, /)
 |      Return value//self.
 |  __rmod__(self, value, /)
 |      Return value%self.
 |  __rmul__(self, value, /)
 |      Return value*self.
 |  __round__(self, ndigits=None, /)
 |      Return the Integral closest to x, rounding half toward even.
 |      When an argument is passed, work like built-in round(x, ndigits).
 |  __rpow__(self, value, mod=None, /)
 |      Return pow(value, self, mod).
 |  __rsub__(self, value, /)
 |      Return value-self.
 |  __rtruediv__(self, value, /)
 |      Return value/self.
 |  __sub__(self, value, /)
 |      Return self-value.
 |  __truediv__(self, value, /)
 |      Return self/value.
 |  __trunc__(self, /)
 |      Return the Integral closest to x between 0 and x.
 |  as_integer_ratio(self, /)
 |      Return integer ratio.
 |      Return a pair of integers, whose ratio is exactly equal to the original float
 |      and with a positive denominator.
 |      Raise OverflowError on infinities and a ValueError on NaNs.
 |      >>> (10.0).as_integer_ratio()
 |      (10, 1)
 |      >>> (0.0).as_integer_ratio()
 |      (0, 1)
 |      >>> (-.25).as_integer_ratio()
 |      (-1, 4)
 |  conjugate(self, /)
 |      Return self, the complex conjugate of any float.
 |  hex(self, /)
 |      Return a hexadecimal representation of a floating-point number.
 |      >>> (-0.1).hex()
 |      '-0x1.999999999999ap-4'
 |      >>> 3.14159.hex()
 |      '0x1.921f9f01b866ep+1'
 |  is_integer(self, /)
 |      Return True if the float is an integer.
 |  ----------------------------------------------------------------------
 |  Class methods defined here:
 |  __getformat__(typestr, /) from builtins.type
 |      You probably don't want to use this function.
 |        typestr
 |          Must be 'double' or 'float'.
 |      It exists mainly to be used in Python's test suite.
 |      This function returns whichever of 'unknown', 'IEEE, big-endian' or 'IEEE,
 |      little-endian' best describes the format of floating point numbers used by the
 |      C type named by typestr.
 |  fromhex(string, /) from builtins.type
 |      Create a floating-point number from a hexadecimal string.
 |      >>> float.fromhex('0x1.ffffp10')
 |      2047.984375
 |      >>> float.fromhex('-0x1p-1074')
 |      -5e-324
 |  ----------------------------------------------------------------------
 |  Static methods defined here:
 |  __new__(*args, **kwargs) from builtins.type
 |      Create and return a new object.  See help(type) for accurate signature.
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  imag
 |      the imaginary part of a complex number
 |  real
 |      the real part of a complex number

Despite being independent from the int class, it can be seen that a large number of data model identifiers are consistent between the int and float classes.

Moreover, when a float instance interacts with an int instance, the int is automatically cast into a float and the result is a float instance. For example:

num1 = 1
num2 = 2.1
num3 = num1 + num2
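
The types involved can be checked directly. A short sketch, repeating the three assignments above so that it runs on its own:

```python
num1 = 1      # int instance
num2 = 2.1    # float instance
num3 = num1 + num2

print(type(num1))  # <class 'int'>
print(type(num3))  # <class 'float'>
print(num3)        # 3.1
```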

Decimal Numbering System

The typical numbering system employed by humans is the decimal numbering system, which is a result of humans having 10 fingers.

Often in physics, the physical unit of measurement is not quantised, and in many cases it is not comparable in scale to the physical object being measured. Take length, for example, measured in metres.

The radius of a hydrogen atom:

0.00000000012 m

The radius of a human:


The radius of the sun:

695700000 m

Scientific Notation

As a large number of leading or trailing zeros is hard for a human to transcribe, scientific notation is preferred. Essentially the value is expressed as a mantissa, which is comparable in size to a unit, multiplied by 10 raised to an integer exponent.

The radius of a hydrogen atom becomes:

1.2e-10 m

The radius of the sun becomes:

6.957e8 m

Generally, when a very large number and a very small number are added or subtracted, the large number is left effectively unchanged. The sun, for example, is made up of a very large number of hydrogen atoms, and the length of one hydrogen atom is insignificant in comparison to the error in the sun's radius.
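
This effect can be seen with floats. A small sketch, using the two radii from this section in metres (the variable names are illustrative only):

```python
sun_radius = 6.957e8        # metres
hydrogen_radius = 1.2e-10   # metres

# The small number is far below the precision available at the
# magnitude of the large number, so the sum is unchanged.
total = sun_radius + hydrogen_radius
unchanged = total == sun_radius
print(unchanged)  # True
```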

For operations such as multiplication, the mantissas are multiplied and then the exponents are added. In the case of division, the mantissas are divided and then the exponents are subtracted. The number of hydrogen atoms along the radius of the sun is therefore:

(6.957 / 1.2) e (8-(-10))
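
The same calculation can be sketched in Python by handling the mantissas and exponents separately and then comparing against a direct float division. The variable names are illustrative only:

```python
import math

sun_mantissa, sun_exponent = 6.957, 8
hydrogen_mantissa, hydrogen_exponent = 1.2, -10

# Division: divide the mantissas, subtract the exponents.
ratio_mantissa = sun_mantissa / hydrogen_mantissa
ratio_exponent = sun_exponent - hydrogen_exponent
print(ratio_mantissa, ratio_exponent)  # approximately 5.7975 and exactly 18

# Cross-check against dividing the two floats directly.
direct = 6.957e8 / 1.2e-10
agrees = math.isclose(ratio_mantissa * 10**ratio_exponent, direct)
print(agrees)  # True
```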

Recursive Rounding Issues

In the decimal system, the fraction 1/3 can be examined using integer division:

10 // 3
10 % 3

If the remainder is multiplied by 10 to get to the next unit, the calculation recurs:

(10 % 3) * 10

This means that 1/3 cannot be perfectly represented with a finite number of decimal digits, and ultimately there will be a recursive rounding error due to the limit on the number of digits specified.
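
The recurrence can be sketched as a loop that performs the long division of 1 by 3 one decimal digit at a time. The loop length of 6 is an arbitrary choice for illustration:

```python
remainder = 1
digits = []
for _ in range(6):
    # Move to the next decimal place, then take the integer division.
    digit, remainder = divmod(remainder * 10, 3)
    digits.append(digit)

print(digits)     # [3, 3, 3, 3, 3, 3]
print(remainder)  # 1
```

The remainder never reaches 0, so the digit 3 repeats for as long as the loop is run.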


Binary Encoding

In Python, the floating point number is used to represent a number with a fractional component. Although the number is displayed in decimal, under the hood it is stored in a form of binary scientific notation because the computer's fundamental unit of storage is a bit. This leads to some unexpected recursive rounding issues.

Recursive Rounding Issues

Because the binary system has only 2 unique digits, as opposed to 10 in decimal, only fractions whose denominator is a power of 2 terminate, and this recursive rounding error occurs far more frequently. Moreover, because the floating point number is encoded in binary but displayed in decimal, some of these recursive rounding issues can be unexpected for those who are used to working with decimal. Care needs to be taken when using comparison operators in particular:

0.1 + 0.2 == 0.3

If the left-hand side is examined:

0.1 + 0.2

A recursive rounding error is shown, which is why the two sides of the equality comparison are different.
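
For reference, a short sketch showing the values produced by the two expressions above:

```python
total = 0.1 + 0.2

print(total)         # 0.30000000000000004
print(total == 0.3)  # False
```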

IEEE-754 (Binary Scientific Notation)

The pickle module is used to serialise Python objects into a byte string. The dumps function can be used to retrieve the pickled bytes string for 0.1:

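import pickle
pickle.dumps(0.1)

pickle adds a prefix and suffix, which can be removed using slicing (the indices used here assume the default pickle protocol of recent Python versions):

pickle.dumps(0.1)[12:20]

These bytes can be viewed in hex:

pickle.dumps(0.1)[12:20].hex()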


This string can then be cast into an integer using the base 16:

int(pickle.dumps(0.1)[12:20].hex(), base=16)

The bin function can be used to view this in binary:

bin(int(pickle.dumps(0.1)[12:20].hex(), base=16))

The prefix can be removed using the string method removeprefix:

bin(int(pickle.dumps(0.1)[12:20].hex(), base=16)).removeprefix('0b')

Finally the string method zfill can be used to zero fill to 64 binary characters including leading zeros:

bin(int(pickle.dumps(0.1)[12:20].hex(), base=16)).removeprefix('0b').zfill(64)

The 0th bit corresponds to the sign:

bin(int(pickle.dumps(0.1)[12:20].hex(), base=16)).removeprefix('0b').zfill(64)[0]

Bits 1 to 11 correspond to the biased exponent:

bin(int(pickle.dumps(0.1)[12:20].hex(), base=16)).removeprefix('0b').zfill(64)[1:12]

The remaining bits correspond to the fraction:

bin(int(pickle.dumps(0.1)[12:20].hex(), base=16)).removeprefix('0b').zfill(64)[12:]

The IEEE-754 representation is a form of binary scientific notation that has been optimised to be more memory efficient. It has a 1 bit sign, 11 bit biased exponent and 52 bit fraction.

For the sign bit, 0 is positive and 1 is negative; this number is positive and the sign is 0.

The biased exponent can be converted to an int using string concatenation for the binary prefix and using a base of 2:

int('0b'+bin(int(pickle.dumps(0.1)[12:20].hex(), base=16)).removeprefix('0b').zfill(64)[1:12], base=2)

The exponent is biased so that negative exponents are encoded by a positive value. The actual 0 is at 1023, so subtracting 1023 from the biased exponent gives an unbiased exponent of -4:

int('0b'+bin(int(pickle.dumps(0.1)[12:20].hex(), base=16)).removeprefix('0b').zfill(64)[1:12], base=2) - 1023
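
The three fields can be cross-checked with a different technique. The sketch below uses the struct module, which is not what the walkthrough above uses, to pack 0.1 as a big-endian 64 bit double without any pickle prefix or suffix:

```python
import struct

# Pack 0.1 as an 8 byte big-endian IEEE-754 double.
packed = struct.pack('>d', 0.1)
bits = format(int.from_bytes(packed, 'big'), '064b')

sign = bits[0]                         # bit 0
biased_exponent = int(bits[1:12], 2)   # bits 1 to 11
fraction = bits[12:]                   # bits 12 to 63

print(packed.hex())             # 3fb999999999999a
print(sign)                     # 0
print(biased_exponent)          # 1019
print(biased_exponent - 1023)   # -4
print(fraction)
```

Because struct produces exactly 8 bytes, no slicing indices have to be assumed.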

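To scale 0.1 up to a number of the magnitude of a unit (between 1 and 2), it is repeatedly multiplied by 2, and each multiplication lowers the binary exponent by 1:

0.1 * 2              # exponent -1
0.1 * 2 * 2          # exponent -2
0.1 * 2 * 2 * 2      # exponent -3
0.1 * 2 * 2 * 2 * 2  # exponent -4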

This is where the -4 comes from in the exponent.
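
The repeated doubling can be sketched as a loop that stops once the value reaches the magnitude of a unit:

```python
value = 0.1
exponent = 0
while value < 1:
    value *= 2       # doubling is exact in binary floating point
    exponent -= 1

print(value, exponent)  # 1.6 -4
```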

This leaves 1.6, which has to be encoded in binary to create the fraction. This is done by integer division using descending powers of 2. For convenience this can be done using a series of divmods, where the remainder of each step is the input to the next:

divmod(1.6, 2**0)
(1.0, 0.6000000000000001)

All the remaining powers are below a unit, so a binary point is placed here:

divmod(0.6000000000000001, 2**-1)
(1.0, 0.10000000000000009)
divmod(0.10000000000000009, 2**-2)
(0.0, 0.10000000000000009)
divmod(0.10000000000000009, 2**-3)
(0.0, 0.10000000000000009)
divmod(0.10000000000000009, 2**-4)
(1.0, 0.03750000000000009)
divmod(0.03750000000000009, 2**-5)
(1.0, 0.006250000000000089)
divmod(0.006250000000000089, 2**-6)
(0.0, 0.006250000000000089)
divmod(0.006250000000000089, 2**-7)
(0.0, 0.006250000000000089)
divmod(0.006250000000000089, 2**-8)
(1.0, 0.002343750000000089)
divmod(0.002343750000000089, 2**-9)
(1.0, 0.0003906250000000888)

This is essentially repeated until the divmod involving 2**-52 is reached.
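
The full series of divmods can be sketched as a loop over the powers 2**0 down to 2**-52. The variable names are illustrative only:

```python
remainder = 0.1 * 2**4   # 1.6, the value scaled to the magnitude of a unit
bits = []
for power in range(0, -53, -1):
    # The quotient is the binary digit for this power of 2.
    quotient, remainder = divmod(remainder, 2.0**power)
    bits.append(str(int(quotient)))

leading_digit = bits[0]       # digit before the binary point
fraction = ''.join(bits[1:])  # 52 digits after the binary point

print(leading_digit + '.' + fraction)
print(remainder)  # 0.0
```

The final remainder is 0.0 because the float 1.6 already holds only 52 binary digits after the binary point; the rounding happened when the literal was stored.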

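If each integer division value is taken alongside the binary point, this gives:

1.100110011…

The … represents further digits that occur past the binary point. This is actually a recurring number, which is rounded at the 52nd digit after the binary point and therefore becomes:

1.1001100110011001100110011001100110011001100110011010

Because the leading binary digit is always 1. in binary scientific notation, it is excluded from the encoding to save memory. The fraction is therefore encoded as:

1001100110011001100110011001100110011001100110011010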
To recap a float is encoded in binary using 64 bits. For the float 0.1:

  • Bit 0 is the sign which is 0
  • Bits 1-11 are the biased exponent which is 01111111011 and corresponds to an unbiased power of -4
  • The leading 1. is constant and is not encoded
  • Bits 12-63 are the fraction which is 1001100110011001100110011001100110011001100110011010
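
As a check, the float can be rebuilt from these three fields. This is a sketch of the standard formula for a normal IEEE-754 double, with illustrative variable names:

```python
sign = 0
biased_exponent = int('01111111011', 2)       # 1019
fraction = int('1001' * 12 + '1010', 2)       # the 52 bit fraction

# Restore the implicit leading 1. and remove the exponent bias of 1023.
significand = 1 + fraction / 2**52
value = (-1)**sign * significand * 2.0**(biased_exponent - 1023)

print(value)  # 0.1
```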

These can also be expressed using a hexadecimal floating-point literal, +0x1.999999999999ap-4:

  • The + represents the sign
  • The 0x represents hexadecimal
  • The 1. which is constant for a binary number is explicitly shown.
  • The 999999999999a is the fraction (bits 12-63) expressed in hexadecimal; recall that 1001 (binary) is 9 (hexadecimal) and 1010 (binary) is a (hexadecimal)
  • The p-4 represents an unbiased power of -4

The class method fromhex can be used to create a float from a hexadecimal floating-point literal:

num = float.fromhex('0x1.999999999999ap-4')

There is also the associated method hex, which returns the hexadecimal floating-point literal of a float instance:
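
A short sketch of the two methods used together as a round trip:

```python
num = float.fromhex('0x1.999999999999ap-4')
print(num)        # 0.1

literal = num.hex()
print(literal)    # 0x1.999999999999ap-4

# A negative float carries its sign at the front of the literal.
negative_literal = (-0.1).hex()
print(negative_literal)  # -0x1.999999999999ap-4
```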


Rounding Data Model Identifiers

The data model identifiers __round__, __floor__, __ceil__ and __trunc__ define the behaviour of the builtins function round and the math functions floor, ceil and trunc respectively.

The round function has two input arguments, number which is the number to be rounded and ndigits which is the number of digits:

Signature:  round(number, ndigits=None)
Round a number to a given precision in decimal digits.

The return value is an integer if ndigits is omitted or None.  Otherwise
the return value has the same type as the number.  ndigits may be negative.
Type:      builtin_function_or_method

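ndigits has a default value of None, in which case the number is rounded to the nearest integer. When ndigits is supplied, the number is rounded to that many decimal places: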

num1 = 3.925
round(num1, 2)
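
A short sketch comparing round with and without ndigits. Note that 3.925 is stored in binary as a value slightly below 3.925, which is why rounding to 2 decimal places goes down rather than to the even neighbour:

```python
num1 = 3.925

nearest_int = round(num1)       # ndigits omitted, an int is returned
two_places = round(num1, 2)     # ndigits supplied, a float is returned

print(nearest_int, type(nearest_int))  # 4 <class 'int'>
print(two_places, type(two_places))    # 3.92 <class 'float'>

# Exact halves are rounded toward the even integer.
halves = (round(2.5), round(3.5))
print(halves)  # (2, 4)
```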

The math trunc function, on the other hand, will always truncate a number towards zero, discarding the fractional component:

import math
Signature:  math.trunc(x, /)
Truncates the Real x to the nearest Integral toward 0.

Uses the __trunc__ magic method.
Type:      builtin_function_or_method

The math floor function will always bring the number down to the nearest integer:

Signature:  math.floor(x, /)
Return the floor of x as an Integral.

This is the largest integer <= x.
Type:      builtin_function_or_method

For positive numbers trunc and floor behave identically; there is a difference when negative numbers are used:
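
A small sketch of math.trunc and math.floor applied to a positive and a negative float (the value 2.7 is an arbitrary example):

```python
import math

positive = (math.trunc(2.7), math.floor(2.7))
negative = (math.trunc(-2.7), math.floor(-2.7))

print(positive)  # (2, 2)
print(negative)  # (-2, -3)
```

trunc moves toward zero, whereas floor always moves toward negative infinity.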


The math ceil (ceiling) function will always bring the number up to the nearest integer:

Signature:  math.ceil(x, /)
Return the ceiling of x as an Integral.

This is the smallest integer >= x.
Type:      builtin_function_or_method
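
A small sketch of math.ceil with arbitrary example values:

```python
import math

ceil_positive = math.ceil(2.1)
ceil_negative = math.ceil(-2.1)
ceil_whole = math.ceil(2.0)

print(ceil_positive)  # 3
print(ceil_negative)  # -2
print(ceil_whole)     # 2
```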

Casting Data Model Identifiers

The __int__ data model identifier defines the behaviour when a float is cast to an integer using the builtins int class. This essentially does the same as the math trunc function:
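
A short sketch comparing the int cast with math.trunc for arbitrary example values:

```python
import math

cast_positive = int(2.7)
cast_negative = int(-2.7)

print(cast_positive, math.trunc(2.7))    # 2 2
print(cast_negative, math.trunc(-2.7))   # -2 -2
```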


The __bool__ data model identifier defines the behaviour when a float is cast to a bool using the builtins bool class. All non-zero values are True and a zero value is False:
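
A short sketch of the bool cast with arbitrary example values, including negative zero and a very small non-zero float:

```python
zero = bool(0.0)
negative_zero = bool(-0.0)
tiny = bool(1e-300)
negative = bool(-2.5)

print(zero, negative_zero)  # False False
print(tiny, negative)       # True True
```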


The __float__ data model identifier defines the behaviour when a float is cast to a float using the builtins float class. Because the instance is already a float, a float of the same value is returned:
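
A short sketch of the float cast. Whether the very same object is returned is an implementation detail and is not relied upon here:

```python
num = 3.14
same = float(num)

print(same)          # 3.14
print(type(same))    # <class 'float'>
print(same == num)   # True
```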


Binary Comparison Data Model Identifiers

Since floats are ordinal, the 6 comparison operators are configured:

Data Model Identifier    Operator
__lt__                   <
__le__                   <=
__eq__                   ==
__ne__                   !=
__gt__                   >
__ge__                   >=

Care needs to be taken with these comparison operators due to recursive rounding issues, as in the example seen earlier:

0.1 + 0.2 == 0.3

This can be addressed by rounding both sides before comparing:

round(0.1 + 0.2, 6) == round(0.3, 6)
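
An alternative to rounding, not used in the text above, is the isclose function from the math module, which compares two floats within a tolerance. A minimal sketch using its default relative tolerance:

```python
import math

rounded_equal = round(0.1 + 0.2, 6) == round(0.3, 6)
close = math.isclose(0.1 + 0.2, 0.3)

print(rounded_equal)  # True
print(close)          # True
```

The tolerance can be adjusted with the rel_tol and abs_tol keyword arguments when the default is not appropriate for the data.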