Round-off errors and finite-precision arithmetic

An important consideration when performing a numerical calculation is the fact that all calculations are performed with finite precision. The behavior can be VERY different from that of infinite-precision calculations (i.e., doing the calculation by hand).

For the first example, consider the numerical representation of $\sqrt{2} = 1.4142\ldots$. As we know, $\sqrt{2}$ is an irrational number and it has an infinite number of decimal digits that never repeat. To store this "simple" number on a computer using only information about its decimal digits we would need an infinite amount of storage.

In [1]:
a = sqrt(2)
abs(2-a^2)
a =

    1.4142


ans =

   4.4409e-16
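The same effect appears in any IEEE double-precision environment, not just MATLAB. A quick Python analogue of the cell above:

```python
import math

a = math.sqrt(2)       # double-precision approximation of sqrt(2)
print(abs(2 - a * a))  # small but nonzero: about 4.44e-16
```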

Some "symbolic" mathematical software (such as Maple and Mathematica) treat $\sqrt{2}$ using its mathematical definition that $(\sqrt{2})^2 = 2$. In this way, the full precision is retained without infinite storage. But we are going to use MATLAB in this class, and the only information we will keep about numbers are their (binary) digits.

Binary Machine Numbers

For our purposes, a binary number will be a sequence of 64 ones and zeros:

$$ 0000000010101000001010001010101010101000010101010000001111111111 $$

The digits in a binary number are partitioned for different purposes:

  1. The first digit $s$ indicates the sign of the number, 0: positive, 1: negative, $(-1)^s$
  2. The next 11 digits specify an exponent (base 2, called the characteristic $c$; there are $2^{11} = 2048$ possible values)
  3. The last 52 digits specify a fraction (called the mantissa $f$)

In software, numbers are actually represented as

$$(-1)^s 2^{c-1023} (1 + f)$$
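We can inspect these fields directly. Here is a Python sketch (the name `double_fields` is our own) that unpacks a double's 64 bits into $s$, $c$, and $f$:

```python
import struct

def double_fields(x):
    """Extract sign s, characteristic c, and mantissa f from a 64-bit double."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]  # raw 64 bits as an integer
    s = bits >> 63                     # 1 sign bit
    c = (bits >> 52) & 0x7FF           # 11-bit characteristic
    f = (bits & (2**52 - 1)) / 2**52   # 52-bit fraction as a number in [0, 1)
    return s, c, f

s, c, f = double_fields(-6.5)
# -6.5 = (-1)^1 * 2^(1025 - 1023) * (1 + 0.625)
print(s, c, f)   # 1 1025 0.625
```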

To obtain "zero", set $c = 0$, $f = 0$ and $s = 0$ or $1$ (two possible representations of zero):

$$ 10000000\ldots \quad \text{and} \quad 0000000\ldots$$

For example, consider the binary number

$$ 0 \,11000000001\, 110100001000\ldots $$
  • $s = 0$
  • $c = 1\cdot 2^{10} + 1\cdot 2^9 + 0 \cdot 2^8 + \cdots + 0 \cdot 2^1 + 1 \cdot 2^0 = 1537$
  • $f = \displaystyle 1 \cdot \left(\frac{1}{2}\right)^1 + 1 \cdot \left(\frac{1}{2}\right)^2 + 1 \cdot \left(\frac{1}{2}\right)^4 + 1 \cdot \left(\frac{1}{2}\right)^{9}$ $= 0.814453125$

So the number represented by this binary string, in decimal notation, is

$$(-1)^s 2^{c-1023} (1 + f) = 1 \cdot 2^{514} (1 + .814453125) \approx 9.7311 \times 10^{154}$$
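We can check this arithmetic directly (a Python sketch; any language with doubles works):

```python
c = int('11000000001', 2)          # characteristic from the 11 exponent bits
f = 0.5 + 0.25 + 0.0625 + 0.5**9   # mantissa bits 1, 2, 4, and 9 are set
value = 2.0**(c - 1023) * (1 + f)
print(c, f)    # 1537 0.814453125
print(value)   # about 9.7311e+154
```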

The smallest possible positive (normalized) number is given by $s = 0$, $c = 1$, $f = 0$:

$$ N_{\min} = 2^{-1022} \cdot (1+ 0) $$

The largest positive number is $s = 0$, $c = 2046$ (this is one less than the maximum value for $c$) and $f = 1 - 2^{-52}$

$$ N_{\max} = 2^{1023} \cdot (2 - 2^{-52})$$
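These limits match what any double-precision environment reports. For instance, in Python:

```python
import sys

n_min = 2.0**-1022                  # smallest positive normalized double
n_max = (2 - 2**-52) * 2.0**1023    # largest finite double
print(n_min == sys.float_info.min)  # True
print(n_max == sys.float_info.max)  # True
```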

Any numbers that occur in a computation that are smaller (in absolute value) than $N_{\min}$ give rise to underflow. This is not usually a serious issue, as such numbers can be treated as zero.

Numbers that are larger (in magnitude) than $N_{\max}$ give overflow. This is a much more serious problem.

In [9]:
exp(-100000000)
exp(100000000)
ans =

     0


ans =

   Inf
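For comparison, the same experiment in Python: the underflow quietly becomes $0.0$, while `math.exp` reports the overflow as an error rather than returning `Inf`:

```python
import math

print(math.exp(-1e8))   # underflow: prints 0.0
try:
    math.exp(1e8)       # overflow
except OverflowError as e:
    print("overflow:", e)
```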

Our discussion of rounding errors is somewhat paradoxical. The whole point of using a computer is that we do not need to do every computation ourselves, but at the same time we want to know what the computer is doing. So, we use a simplified model of what a computer is actually doing:

  • We work with the decimal system (base 10)
  • We assume we only have a finite mantissa but an unbounded characteristic

Let $k > 0$ be an integer, and assume that all numbers available to us are represented as $k$-digit machine numbers:

$$ \pm 0.d_1d_2\cdots d_k \times 10^n. $$

And while the mechanism for rounding in a computer can be complicated, we take the convention of chopping: simply discard all digits of the mantissa after the $k$th.

  • If $k = 4$, $y= 1.25555455$ is chopped to $.1255 \times 10^1$
  • If $k = 5$, $y= 302.4533$ is chopped to $.30245 \times 10^3$

We use the notation $fl(y)$ to denote this truncation ($k$ is implied):

$$ fl(1.25555455) = .1255 \times 10^1 $$
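A small Python sketch of this chopping function (the name `fl_chop` and the normalization details are our own; the notebook below builds $fl$ out of MATLAB's `vpa` instead):

```python
import math

def fl_chop(y, k=4):
    """Chop y to the k-digit machine number 0.d1...dk x 10^n (truncation, not rounding)."""
    if y == 0:
        return 0.0
    n = math.floor(math.log10(abs(y))) + 1  # exponent putting the mantissa in [0.1, 1)
    m = abs(y) / 10.0**n                    # normalized mantissa
    return math.copysign(math.floor(m * 10**k) / 10**k * 10.0**n, y)

print(fl_chop(1.25555455))     # 1.255 (up to the last bit of the double)
print(fl_chop(302.4533, k=5))  # 302.45
```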

Let's use MATLAB to construct the function $fl$:

In [3]:
k = 4;
digits(k) % set precision at k
fl = @(x) double(vpa(x)); % vpa() stands for variable-precision arithmetic; using double() truncates
% Note that this can give a different answer from doing it by hand: double() may introduce spurious digits.
a = fl(sqrt(2));
digits(32); %reset the precision to the default of 32 digits
a-vpa(sqrt(2))
ans =
 
-0.00021356237309504880168872420969808

Definition

Suppose that $p^*$ is an approximation to $p\neq 0$. The absolute error of $p^*$ is given by $|p^* - p|$ and the relative error is given by

$$ \left|\frac{p^*-p}{p} \right|.$$

Example

Suppose we approximate $0.89 \times 10^3$ by $0.9 \times 10^3$. The absolute error is

$$ |0.9 \times 10^3 - 0.89 \times 10^3| = 0.1 \times 10^2 $$

and the relative error is

$$ \left | \frac{ 0.89 \times 10^3 - 0.9 \times 10^3}{ 0.89 \times 10^3} \right| = 0.112 \times 10^{-1}. $$


Because the numbers we are comparing are large (compared to one), the absolute error is large but the relative error is small.

When measuring the accuracy of a calculation in terms of correct digits (how correct is the mantissa?), we use the following.

Definition

The approximation $p^*$ of $p$ is said to be correct to $t$ significant digits if

$$ \left| \frac{p^*-p}{p} \right| \leq 5 \times 10^{-t}. $$

Usually we use the largest value of $t$ such that this is true.
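As a quick utility (a Python sketch; the name `significant_digits` is our own), the largest such $t$ can be computed directly from the definition, since $|p^*-p|/|p| \leq 5 \times 10^{-t}$ is equivalent to $t \leq \log_{10}(5/\text{rel})$:

```python
import math

def significant_digits(p_star, p):
    """Largest integer t with |p* - p| / |p| <= 5 * 10^(-t)."""
    rel = abs((p_star - p) / p)
    if rel == 0:
        return math.inf
    return math.floor(math.log10(5 / rel))

# Approximating p = 0.89e3 by p* = 0.9e3 (the example above) is correct to 2 digits.
print(significant_digits(0.9e3, 0.89e3))   # prints 2
```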

Recall that we chose $fl(y)$ to truncate $y$ after the $k$th digit in the fraction. So, for $y = 0.d_1d_2\ldots d_kd_{k+1}\ldots \times 10^n$, let's consider:

$$ \left| \frac{y - fl(y)}{y} \right| = \left| \frac{0.d_1d_2\ldots d_kd_{k+1}\ldots \times 10^n - 0.d_1d_2\ldots d_k \times 10^n}{0.d_1d_2\ldots d_kd_{k+1}\ldots \times 10^n} \right|\\ = \frac{0.d_{k+1}d_{k+2}\ldots \times 10^{n-k}}{0.d_1d_2\ldots d_kd_{k+1}\ldots \times 10^n}\\ = \frac{0.d_{k+1}d_{k+2}\ldots \times 10^{-k}}{0.d_1d_2\ldots d_kd_{k+1}\ldots \times 10^0}.$$

Here we assume that $d_1 \neq 0$ (otherwise we'd reduce $n$), so the denominator is bounded below:

$$ 0.1 \leq 0.d_1d_2\ldots d_kd_{k+1}\ldots \times 10^0 \quad \text{or} \quad \frac{1}{0.d_1d_2\ldots d_kd_{k+1}\ldots \times 10^0} \leq 10.$$

This then gives

$$ \left| \frac{y - fl(y)}{y} \right| \leq 10^{-k+1}. $$

So, with the truncation strategy for $fl(y)$, we obtain an approximation that is guaranteed to be accurate to at least $k-1$ significant digits. If instead $fl(y)$ rounds to the nearest $k$-digit number, we can guarantee

$$ \left| \frac{y - fl(y)}{y} \right| \leq 5 \times 10^{-k}. $$

so that $k$ significant digits are correct.
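We can spot-check the rounding bound numerically (a Python sketch; `fl_round` is our own helper that rounds the normalized mantissa to $k$ digits):

```python
import math
import random

def fl_round(y, k=5):
    """Round y to the nearest k-digit decimal machine number 0.d1...dk x 10^n."""
    if y == 0:
        return 0.0
    n = math.floor(math.log10(abs(y))) + 1  # exponent with mantissa in [0.1, 1)
    return round(y / 10.0**n, k) * 10.0**n

# Worst relative error over many random samples stays within 5 * 10^(-k).
random.seed(0)
worst = max(abs((y - fl_round(y)) / y)
            for y in (random.uniform(0.1, 1000.0) for _ in range(100000)))
print(worst, worst <= 5e-5)
```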

Example

Compute the roots of $\frac{1}{2} x^2 + 60x + 1 = 0$ to a relative accuracy of $10^{-4}$ using $5$-digit machine arithmetic.

We know from the quadratic formula (with $a = 1/2$, $b = 60$, $c = 1$) that

$$ x = \frac{-60 \pm \sqrt{60^2 - 4 \cdot \frac{1}{2} \cdot 1}}{2 \cdot \frac{1}{2}} = -60 \pm \sqrt{60^2-2}. $$

So, we just want to evaluate this expression, right?

In [10]:
sqrt(3598.0)
ans =

   59.9833
In [4]:
digits(5) % set precision at 5
b = 60;
fl = @(x) double(vpa(x)); % vpa() stands for variable-precision arithmetic; using double() truncates
xplus = fl(-b) + fl(sqrt(fl(fl(b^2)-fl(2))))

digits(32)
realxplus = vpa(-b + sqrt(b^2-2))

relerror = double(abs(xplus-realxplus)/abs(realxplus))  % 2 significant digits
xplus =

   -0.0170

 
realxplus =
 
-0.016668982124706133163272170349956
 

relerror =

    0.0199
In [5]:
digits(5) % set precision at 5
b = 60;
fl = @(x) double(vpa(x)); % vpa() stands for variable-precision arithmetic; using double() truncates
xminus = fl(-b) - fl(sqrt(fl(fl(b^2)-fl(2))))

digits(32)
realxminus = vpa(-b - sqrt(b^2-2))

relerror = double(abs(xminus-realxminus)/abs(realxminus)) % five significant digits
xminus =

 -119.9830

 
realxminus =
 
-119.98333101787528676140937022865
 

relerror =

   2.7589e-06

So, there is a problem with the formula

$$x = {-60 + \sqrt{60^2-2}}. $$

This is the subtraction of two nearly equal numbers. With 5-digit arithmetic

$$ fl(60) = 60.000 \\ fl\left(\sqrt{fl(fl(60^2)-2)}\right) = 59.983 \\ fl(60) - fl\left(\sqrt{fl(fl(60^2)-2)}\right) = 00.017 = .17000 \times 10^{-1}.$$

In this calculation we have lost 3 digits! This is why the relative accuracy is only $2 \times 10^{-2}$.

How do we fix this? We can rewrite the formula as:

$$ x = \left( -60 + \sqrt{60^2-2} \right) \left (\frac{-60 - \sqrt{60^2-2}}{-60 - \sqrt{60^2-2}} \right) \\= \frac{-2}{60 + \sqrt{60^2-2}}.$$
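The rationalized form matters even in full double precision once the cancellation is severe enough. A Python sketch, with $b = 10^8$ chosen for illustration (not from the example above) to make the effect dramatic:

```python
import math

b = 1e8                                    # large b makes -b + sqrt(b^2 - 2) catastrophic
naive  = -b + math.sqrt(b * b - 2)         # subtracts two nearly equal numbers
stable = -2 / (b + math.sqrt(b * b - 2))   # rationalized form: no cancellation
print(naive)    # off by roughly 50% from the true root, about -1e-8
print(stable)   # accurate to full precision
```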
In [6]:
digits(5) % set precision at 5
b = 60;
fl = @(x) double(vpa(x)); % vpa() stands for variable-precision arithmetic; using double() truncates
xplus = fl(-2/(fl(b) + fl(sqrt(fl(fl(b^2)-fl(2))))))

digits(32)
realxplus = vpa((-b + sqrt(b^2-2)))

double(abs(xplus-realxplus)/abs(realxplus))  % 5 significant digits
xplus =

   -0.0167

 
realxplus =
 
-0.016668982124706133163272170349956
 

ans =

   1.0724e-06

Polynomial evaluation

Another situation where round-off error comes into play is the evaluation of polynomials. Consider $$ f(x) = x^3 - 4x^2 + \tfrac{1}{4} x - 1.$$ We want to evaluate $f(4.16)$ with $3$-digit machine arithmetic.

In [7]:
digits(3) % set precision at 3
fl = @(x) double(vpa(x));
x = 4.16;
x1 = fl(x); x2 = fl(x*x); x3 = fl(x*x2);
approx = fl(x3-fl(4*x2)+fl(.25*x1)-1)

digits(32)
actual = vpa(x^3 - 4*x^2+.25*x-1)

double(abs(approx-actual)/abs(actual))  % 2 significant digits
approx =

    2.8400

 
actual =
 
2.808896
 

ans =

    0.0111

To help with this issue, we rewrite the polynomial using nested multiplication (Horner's method):

$$ f(x) = x^3 - 4 x^2 + 1/4x -1 = x(x^2 - 4x + 1/4) - 1\\ = x(x(x-4) + 1/4) -1. $$
In [8]:
digits(3) % set precision at 3
fl = @(x) double(vpa(x));
x = 4.16;
approx = fl(x-4);
approx = fl(fl(x*approx) + 1/4);
approx = fl(fl(x*approx)-1)

digits(32)
actual = vpa(x^3 - 4*x^2+.25*x-1)

double(abs(approx-actual)/abs(actual))  % 4 significant digits!
approx =

    2.8100

 
actual =
 
2.808896
 

ans =

   3.9304e-04
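The nested scheme generalizes to any polynomial. A Python sketch of Horner's method (in full double precision, so no $fl$ here):

```python
def horner(coeffs, x):
    """Evaluate a polynomial with coefficients [a_n, ..., a_1, a_0] by nested multiplication."""
    result = 0.0
    for c in coeffs:
        result = result * x + c   # one multiply and one add per coefficient
    return result

# f(x) = x^3 - 4x^2 + 0.25x - 1 at x = 4.16
print(horner([1.0, -4.0, 0.25, -1.0], 4.16))   # about 2.808896
```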