An important consideration when performing a numerical calculation is the fact that all calculations are performed with finite precision. The behavior can be VERY different from that of an infinite-precision calculation (i.e. doing the calculation by hand).
For the first example, consider the numerical representation of $\sqrt{2} = 1.4142\ldots$. As we know, $\sqrt{2}$ is an irrational number and it has an infinite number of decimal digits that never repeat. To store this "simple" number on a computer using only information about its decimal digits we would need an infinite amount of storage.
a = sqrt(2)
abs(2-a^2)
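The same check can be sketched in Python for readers outside MATLAB (the notes themselves use MATLAB); `math.sqrt` returns a finite-precision approximation, so squaring it does not recover 2 exactly:

```python
import math

a = math.sqrt(2)       # finite-precision approximation of sqrt(2)
print(abs(2 - a**2))   # small but nonzero, on the order of 1e-16
```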
Some "symbolic" mathematical software (such as Maple and Mathematica) treats $\sqrt{2}$ using its mathematical definition, namely that $(\sqrt{2})^2 = 2$. In this way, full precision is retained without infinite storage. But we are going to use MATLAB in this class, and the only information we will keep about numbers is their (binary) digits.
For our purposes, a binary number will be a sequence of 64 ones and zeros:
$$ 000000010101000001010001010101010101000010101010000001111111111 $$The digits in a binary number are partitioned for different purposes: the first bit is the sign $s$, the next $11$ bits encode the characteristic (exponent) $c$, and the remaining $52$ bits encode the mantissa (fraction) $f$.
In software, numbers are then represented as
$$(-1)^s 2^{c-1023} (1 + f)$$To obtain "zero" set $c = 0$, $f = 0$ and $s = 0,1$ (two possible choices of zero):
$$ 10000000\ldots \quad \text{and} \quad 0000000\ldots$$For example, consider the binary number
$$ 0 \,11000000001\, 110100001000\ldots $$So, then the number represented by this binary number, in decimal notation is
$$(-1)^s 2^{c-1023} (1 + f) = 1 \cdot 2^{514} (1 + 0.814453125) \approx 9.7311 \times 10^{154}.$$The smallest positive (normalized) number is given by $s = 0$, $c = 1$, $f = 0$:
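We can decode this example directly from its bit fields; a small sketch in Python (the bit strings are copied from the example above):

```python
# Decode the bit pattern by hand: sign / 11-bit characteristic / 52-bit
# fraction, combined via (-1)^s * 2^(c - 1023) * (1 + f).
s_bits = "0"
c_bits = "11000000001"             # 11-bit characteristic
f_bits = "110100001" + "0" * 43    # 52-bit fraction (padded with zeros)

s = int(s_bits, 2)
c = int(c_bits, 2)                 # c = 1537, so c - 1023 = 514
f = sum(int(b) * 2.0**-(i + 1) for i, b in enumerate(f_bits))

value = (-1)**s * 2.0**(c - 1023) * (1 + f)
print(c - 1023, f, value)          # 514, 0.814453125, ~9.7311e154
```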
$$ N_{\min} = 2^{-1022} \cdot (1+ 0) $$The largest positive number is $s = 0$, $c = 2046$ (this is one less than the maximum value for $c$) and $f = 1 - 2^{-52}$
$$ N_{\max} = 2^{1023} \cdot (2 - 2^{-52}).$$Any number that occurs in a computation and is smaller (in absolute value) than $N_{\min}$ gives rise to underflow. This is usually not an issue, as it can be treated as zero.
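These two values can be checked against what the hardware actually reports. In Python they are exposed via `sys.float_info` (MATLAB's `realmin` and `realmax` give the same numbers):

```python
import sys

# N_min = 2^-1022 and N_max = 2^1023 * (2 - 2^-52) from the text
n_min = 2.0**-1022
n_max = 2.0**1023 * (2 - 2.0**-52)

print(n_min == sys.float_info.min)   # matches the reported minimum
print(n_max == sys.float_info.max)   # matches the reported maximum
print(n_min, n_max)                  # ~2.2251e-308, ~1.7977e308
```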
Numbers that are larger (in magnitude) than $N_{\max}$ give overflow. This is a much more serious problem.
exp(-100000000) % underflows to 0
exp(100000000) % overflows to Inf
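Python exhibits the same two behaviors, except that `math.exp` raises `OverflowError` on overflow instead of returning infinity, while underflow silently produces zero:

```python
import math

print(math.exp(-100000000))   # 0.0 -- underflow, harmlessly treated as zero

try:
    math.exp(100000000)       # overflow: Python raises rather than return inf
except OverflowError as e:
    print("overflow:", e)
```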
Our discussion of rounding errors is somewhat paradoxical. The whole point of using a computer is that we do not have to do every computation ourselves. But at the same time, we want to know what the computer is doing. So we use a simplified model of what a computer actually does:
Let $k > 0$ be an integer and we assume that all numbers available to us are represented as $k$-digit machine numbers:
$$ \pm 0.d_1d_2\cdots d_k \times 10^n. $$And while the mechanism for rounding in a computer can be complicated, we take the convention of rounding to the nearest $k$-digit machine number.
We use the notation $fl(y)$ to denote this operation ($k$ is implied):
$$ fl(1.25555455) = .1256 \times 10^1. $$Let's use MATLAB to construct the function $fl$:
k = 4;
digits(k) % set precision at k
fl = @(x) double(vpa(x)); % vpa() rounds to the current k-digit precision; double() converts back
% Note that this can give a different answer from doing it by hand: converting back to double may reintroduce digits beyond the k we kept.
a = fl(sqrt(2));
digits(32); % reset the precision to the default of 32 digits
a-vpa(sqrt(2))
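The same $k$-digit rounding can be sketched in Python without any toolbox, by round-tripping through exponential formatting with $k-1$ digits after the point; this is only an illustration of the idea, not IEEE behavior:

```python
import math

def fl(y, k=4):
    """Round y to k significant decimal digits (a stand-in for MATLAB's
    digits(k) + vpa): round-trip through (k-1)-digit exponential format."""
    return float(f"{y:.{k - 1}e}")

print(fl(1.25555455))                    # 1.256, i.e. .1256 x 10^1
print(fl(math.sqrt(2)) - math.sqrt(2))   # ~ -2.1e-4 rounding error
```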
Suppose that $p^*$ is an approximation to $p\neq 0$. The absolute error of $p^*$ is given by $|p^* - p|$ and the relative error is given by
$$ \left|\frac{p^*-p}{p} \right|.$$Suppose we approximate $0.89 \times 10^3$ by $0.9 \times 10^3$. The absolute error is
$$ 0.89 \times 10^3 - 0.9 \times 10^3 = -0.1 \times 10^2 $$and the relative error is
$$ \left | \frac{ 0.89 \times 10^3 - 0.9 \times 10^3}{ 0.89 \times 10^3} \right| = 0.112 \times 10^{-1}. $$Because the numbers we are comparing are large (compared to one), the absolute error is large but the relative error is small.
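The numbers in this example are small enough to check directly; a quick computation in Python:

```python
p      = 0.89e3   # true value
p_star = 0.9e3    # approximation

abs_err = abs(p_star - p)
rel_err = abs((p_star - p) / p)

print(abs_err)    # 10.0  (= 0.1 x 10^2)
print(rel_err)    # ~0.011236 (= 0.112 x 10^-1)
```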
When measuring the accuracy of a calculation in terms of correct digits (how correct is the mantissa?), we use the following:
The approximation $p^*$ of $p$ is said to be correct to $t$ significant digits if
$$ \left| \frac{p^*-p}{p} \right| \leq 5 \times 10^{-t}. $$Usually we use the largest value of $t$ such that this is true.
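The largest such $t$ can be computed directly from this definition; a small helper (the function name is my own) in Python:

```python
import math

def significant_digits(p_star, p):
    """Largest integer t with |p* - p| / |p| <= 5 * 10^(-t).
    Assumes p != 0 and p_star != p."""
    rel = abs((p_star - p) / p)
    return math.floor(math.log10(5 / rel))

# The example above: approximating 0.89e3 by 0.9e3 is correct to 2 digits
print(significant_digits(0.9e3, 0.89e3))   # 2
```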
First suppose that $fl(y)$ truncates (chops) $y$ after the $k$-th digit of the fraction. So, for $y = 0.d_1d_2\ldots d_kd_{k+1}\ldots \times 10^n$ let's consider:
$$ \left| \frac{y - fl(y)}{y} \right| = \left| \frac{0.d_1d_2\ldots d_kd_{k+1}\ldots \times 10^n - 0.d_1d_2\ldots d_k \times 10^n}{0.d_1d_2\ldots d_kd_{k+1}\ldots \times 10^n} \right|\\ = \frac{0.d_{k+1}d_{k+2}\ldots \times 10^{n-k}}{0.d_1d_2\ldots d_kd_{k+1}\ldots \times 10^n}\\ = \frac{0.d_{k+1}d_{k+2}\ldots \times 10^{-k}}{0.d_1d_2\ldots d_kd_{k+1}\ldots \times 10^0}.$$Here we assume that $d_1 \neq 0$ (otherwise we would decrease $n$), so the denominator satisfies
$$ 0.1 \leq 0.d_1d_2\ldots d_kd_{k+1}\ldots \times 10^0 \quad \text{or} \quad \frac{1}{0.d_1d_2\ldots d_kd_{k+1}\ldots \times 10^0} \leq 10.$$This then gives
$$ \left| \frac{y - fl(y)}{y} \right| \leq 10^{-k+1}. $$So, with the chopping strategy for $fl(y)$ we obtain an approximation that is guaranteed to be accurate to at least $k-1$ significant digits. If instead $fl(y)$ rounds to the nearest $k$-digit number, we are guaranteed
$$ \left| \frac{y - fl(y)}{y} \right| \leq 5 \times 10^{-k}, $$so that $k$ significant digits are correct.
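Both bounds are easy to spot-check numerically. A sketch in Python using the standard `decimal` module, where `ROUND_DOWN` gives chopping and `ROUND_HALF_EVEN` gives rounding (with $k = 4$):

```python
from decimal import Decimal, localcontext, ROUND_DOWN, ROUND_HALF_EVEN

def fl(y, k=4, mode=ROUND_DOWN):
    """Keep k significant decimal digits of y (chopping by default)."""
    with localcontext() as ctx:
        ctx.prec = k
        ctx.rounding = mode
        return float(+Decimal(repr(y)))  # unary + applies the context

print(fl(1.25555455))                          # 1.255 with chopping
print(fl(1.25555455, mode=ROUND_HALF_EVEN))    # 1.256 with rounding

# Spot-check the bounds 10^(1-k) and 5*10^(-k) on a few values (k = 4):
for y in (1.25555455, 0.19999, 3.14159265, 99.99999):
    assert abs((y - fl(y)) / y) <= 10.0**(1 - 4)
    assert abs((y - fl(y, mode=ROUND_HALF_EVEN)) / y) <= 5 * 10.0**-4
```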
Compute the roots of $1/2\, x^2 + 60x + 1 = 0$ to a relative accuracy of $10^{-4}$ using $5$-digit machine arithmetic.
By the quadratic formula (with $a = 1/2$, $b = 60$, $c = 1$) we know that
$$ x = \frac{-60 \pm \sqrt{60^2-2}}{2 \cdot \frac{1}{2}} = -60 \pm \sqrt{60^2-2}. $$So, we just want to evaluate this expression, right?
sqrt(3598.0)
digits(5) % set precision at 5
b = 60;
fl = @(x) double(vpa(x)); % vpa() rounds to the current precision; double() converts back
xplus = fl(-b) + fl(sqrt(fl(fl(b^2)-fl(2))))
digits(32)
realxplus = vpa(-b + sqrt(b^2-2))
relerror = double(abs(xplus-realxplus)/abs(realxplus)) % 2 significant digits
digits(5) % set precision at 5
b = 60;
fl = @(x) double(vpa(x)); % vpa() rounds to the current precision; double() converts back
xminus = fl(-b) - fl(sqrt(fl(fl(b^2)-fl(2))))
digits(32)
realxminus = vpa(-b - sqrt(b^2-2))
relerror = double(abs(xminus-realxminus)/abs(realxminus)) % five significant digits
So, there is a problem with the formula
$$x = {-60 + \sqrt{60^2-2}}. $$This is the subtraction of two nearly equal numbers. With 5-digit arithmetic
$$ fl(60) = 60.000 \\ fl\left(\sqrt{fl(fl(60^2)-2)}\right) = 59.983 \\ fl(60) - fl\left(\sqrt{fl(fl(60^2)-2)}\right) = 0.017 = .17000 \times 10^{-1}.$$In this calculation we have lost 3 digits! This is why the relative accuracy is only $2 \times 10^{-2}$.
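The cancellation is easy to reproduce outside MATLAB; a self-contained sketch in Python, simulating 5-digit arithmetic by rounding to 5 significant digits at each step:

```python
import math

def fl(y, k=5):
    """Round y to k significant decimal digits (simulated 5-digit arithmetic)."""
    return float(f"{y:.{k - 1}e}")

naive = fl(-60) + fl(math.sqrt(fl(fl(60**2) - fl(2))))   # -60 + 59.983
exact = -60 + math.sqrt(60**2 - 2)                        # ~ -0.0166690

print(naive)                              # ~ -0.017
print(abs(naive - exact) / abs(exact))    # ~2e-2: only 2 of the 5 digits survive
```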
How do we fix this? We can rewrite the formula as:
$$ x = \left( -60 + \sqrt{60^2-2} \right) \left (\frac{-60 - \sqrt{60^2-2}}{-60 - \sqrt{60^2-2}} \right) \\= \frac{-2}{60 + \sqrt{60^2-2}}.$$
digits(5) % set precision at 5
b = 60;
fl = @(x) double(vpa(x)); % vpa() rounds to the current precision; double() converts back
xplus = fl(-2/(fl(b) + fl(sqrt(fl(fl(b^2)-fl(2))))))
digits(32)
realxplus = vpa((-b + sqrt(b^2-2)))
double(abs(xplus-realxplus)/abs(realxplus)) % 5 significant digits
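The rationalized formula behaves just as well in the simulated 5-digit Python sketch, since it avoids the subtraction of nearly equal numbers:

```python
import math

def fl(y, k=5):
    """Round y to k significant decimal digits (simulated 5-digit arithmetic)."""
    return float(f"{y:.{k - 1}e}")

fixed = fl(-2 / (fl(60) + fl(math.sqrt(fl(fl(60**2) - fl(2))))))
exact = -60 + math.sqrt(60**2 - 2)

print(fixed)                              # -0.016669
print(abs(fixed - exact) / abs(exact))    # tiny: all 5 digits correct
```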
Another situation where round-off error comes into play is the evaluation of polynomials. Consider $$ f(x) = x^3 - 4x^2 + 1/4\, x - 1.$$ We want to evaluate $f(4.16)$ with $3$-digit machine arithmetic.
digits(3) % set precision at 3
fl = @(x) double(vpa(x));
x = 4.16;
x1 = fl(x); x2 = fl(x*x); x3 = fl(x*x2);
approx = fl(x3-fl(4*x2)+fl(.25*x1)-1)
digits(32)
actual = vpa(x^3 - 4*x^2+.25*x-1)
double(abs(approx-actual)/abs(actual)) % 2 significant digits
To help with this issue, we rewrite the polynomial
$$ f(x) = x^3 - 4 x^2 + 1/4x -1 = x(x^2 - 4x + 1/4) - 1\\ = x(x(x-4) + 1/4) -1. $$
digits(3) % set precision at 3
fl = @(x) double(vpa(x));
x = 4.16;
approx = fl(x-4);
approx = fl(fl(x*approx) + 1/4);
approx = fl(fl(x*approx)-1)
digits(32)
actual = vpa(x^3 - 4*x^2+.25*x-1)
double(abs(approx-actual)/abs(actual)) % 4 significant digits!
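The nested rewriting used above is Horner's method. A general version (the function name is my own) sketched in Python:

```python
def horner(coeffs, x):
    """Evaluate a polynomial at x from its coefficients (highest degree
    first) using Horner's nested form: fewer multiplications, and less
    accumulation of round-off than evaluating each power separately."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result

# f(x) = x^3 - 4x^2 + 0.25x - 1 at x = 4.16, in full double precision
print(horner([1, -4, 0.25, -1], 4.16))   # ~2.808896
```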