CUDA vs OpenCL math builtin precision
CUDA Guarantees
From Appendix E.1 of the CUDA C Programming Guide:
This section specifies the error bounds of each function when executed on the device and also when executed on the host in the case where the host does not supply the function.
The error bounds are generated from extensive but not exhaustive tests, so they are not guaranteed bounds.
The precision of math built-ins is also discussed in Section 11.1.5 (Math Libraries) and Section 11.1.6 (Precision-related Compiler Flags) of the CUDA C Best Practices Guide.
Single Precision
The following table uses the following sources:
Appendix E.1 of the CUDA C Programming Guide, which is referenced from the CUDA Math API documentation
In addition to the following table, the CUDA documentation also includes:
Addition and multiplication are IEEE-compliant, so have a maximum error of 0.5 ulp.
The recommended way to round a single-precision floating-point operand to an integer, with the result being a single-precision floating-point number is rintf(), not roundf(). The reason is that roundf() maps to an 8-instruction sequence on the device, whereas rintf() maps to a single instruction. truncf(), ceilf(), and floorf() each map to a single instruction as well.
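The two functions are not interchangeable on halfway inputs. In C, rintf() rounds ties to even under the default rounding mode, while roundf() rounds ties away from zero. The sketch below mimics that difference in Python on double-precision values. The names rint_like and round_like are hypothetical helpers for illustration, not CUDA or OpenCL APIs, and only finite inputs are handled.

```python
import math


def rint_like(x: float) -> float:
    """Round to the nearest integer, ties to even (like rint in the default mode)."""
    # Python's built-in round() on a float uses round-half-to-even.
    return float(round(x))


def round_like(x: float) -> float:
    """Round to the nearest integer, ties away from zero (like C round)."""
    t = math.trunc(x)            # integer part, rounded toward zero
    if abs(x - t) >= 0.5:        # fractional part is at least one half
        t += math.copysign(1, x)  # step one unit away from zero
    return math.copysign(float(t), x)


if __name__ == "__main__":
    for value in (0.5, 1.5, 2.5, -0.5, -2.5):
        print(value, rint_like(value), round_like(value))
```

Code that switches from roundf() to rintf() for speed may therefore see different results on inputs such as 0.5 or 2.5.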
OpenCL defines ULP (units in last place) as:
If x is a real number that lies between two finite consecutive floating-point numbers a and b, without being equal to one of them, then ulp(x) = |b − a|, otherwise ulp(x) is the distance between the two non-equal finite floating-point numbers nearest x. Moreover, ulp(NaN) is NaN.
Maximum error is defined in the CUDA documentation as:
The maximum error is stated as the absolute value of the difference in ulps between a correctly rounded single-precision result and the result returned by the CUDA library function.
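As a rough illustration of measuring an error in ulps, the sketch below counts how many representable single-precision values separate a reference result from a computed one. This step count is an assumption made for illustration: it matches the ulp difference when both values lie in the same binade, and it is not the test procedure used by either vendor. The helper names float32_index and ulp_distance_f32 are hypothetical, and NaN and values outside the float32 range are not handled.

```python
import struct


def float32_index(x: float) -> int:
    """Map a float32 value to an integer that increases with the value."""
    # Pack as IEEE-754 binary32 (rounding x to float32) and read the raw bits.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Non-negative floats already sort by their bit pattern. Negative floats
    # have the sign bit set, so mirror them onto the negative integers.
    return bits if bits < 0x80000000 else 0x80000000 - bits


def ulp_distance_f32(reference: float, result: float) -> int:
    """Number of float32 steps between a reference value and a result."""
    return abs(float32_index(reference) - float32_index(result))


if __name__ == "__main__":
    next_after_one = 1.0 + 2.0 ** -23  # the float32 value just above 1.0
    print(ulp_distance_f32(1.0, next_after_one))
```

A result that is correctly rounded gives a distance of 0 against the reference, which is how the "0 ulp" entries in the table below can be read.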
OpenCL Built-in |
OpenCL Min Accuracy (ULP) |
CUDA Built-in |
CUDA Maximum Error (ULP) |
---|---|---|---|
|
Correctly rounded |
|
0 ulp (IEEE-754 round-to-nearest-even) |
|
Correctly rounded |
N/A |
N/A |
|
Correctly rounded |
|
0 ulp (IEEE-754 round-to-nearest-even) |
≤ 2.5 ulp |
|
0 ulp (if compute capability ≥ 2 when compiled with |
|
≤ 2.5 ulp |
|
0 ulp (if compute capability ≥ 2 when compiled with |
|
≤ 4 ulp |
3 ulp (full range) |
||
≤ 5 ulp |
N/A |
N/A |
|
≤ 4 ulp |
4 ulp (full range) |
||
≤ 5 ulp |
N/A |
N/A |
|
≤ 5 ulp |
2 ulp (full range) |
||
≤ 6 ulp |
3 ulp (full range) |
||
≤ 5 ulp |
N/A |
N/A |
|
≤ 6 ulp |
N/A |
N/A |
|
≤ 4 ulp |
4 ulp (full range) |
||
≤ 4 ulp |
3 ulp (full range) |
||
≤ 5 ulp |
3 ulp (full range) |
||
≤ 2 ulp |
1 ulp (full range) |
||
Correctly rounded |
0 ulp (full range) |
||
0 ulp |
Undocumented. |
||
≤ 4 ulp |
2 ulp (full range) |
||
≤ 4 ulp |
2 ulp (full range) |
||
≤ 4 ulp |
2 ulp (full range) |
||
N/A |
N/A |
6 ulp (full range) |
|
N/A |
N/A |
6 ulp (full range) |
|
≤ 16 ulp |
4 ulp (full range) |
||
N/A |
N/A |
2 ulp (full range) |
|
N/A |
N/A |
4 ulp (full range) |
|
N/A |
N/A |
2 ulp (full range) |
|
≤ 16 ulp |
2 ulp (full range) |
||
≤ 3 ulp |
2 ulp (full range) |
||
≤ 3 ulp |
2 ulp (full range) |
||
≤ 3 ulp |
2 ulp (full range) |
||
≤ 3 ulp |
1 ulp (full range) |
||
0 ulp |
Undocumented. |
||
Correctly rounded |
0 ulp (full range) |
||
Correctly rounded |
0 ulp (full range) |
||
Correctly rounded |
0 ulp (full range) |
||
0 ulp |
Undocumented. |
||
0 ulp |
Undocumented. |
||
0 ulp |
0 ulp (full range) |
||
Correctly rounded |
N/A |
N/A |
|
0 ulp |
0 ulp (full range) |
||
≤ 4 ulp |
3 ulp (full range) |
||
0 ulp |
0 ulp (full range) |
||
N/A |
N/A |
9 ulp for |
|
N/A |
N/A |
9 ulp for |
|
N/A |
N/A |
For |
|
Correctly rounded |
0 ulp (full range) |
||
N/A |
N/A |
6 ulp (outside interval |
|
≤ 3 ulp |
1 ulp (full range) |
||
≤ 3 ulp |
1 ulp (full range) |
||
≤ 3 ulp |
2 ulp (full range) |
||
≤ 2 ulp |
1 ulp (full range) |
||
0 ulp |
0 ulp (full range) |
||
N/A |
N/A |
0 ulp (full range) |
|
N/A |
N/A |
0 ulp (full range) |
|
N/A |
N/A |
0 ulp (full range) |
|
N/A |
N/A |
0 ulp (full range) |
|
Any value allowed (infinite ulp) |
N/A |
N/A |
|
0 ulp |
N/A |
N/A |
|
0 ulp |
N/A |
N/A |
|
0 ulp |
0 ulp (full range) |
||
0 ulp |
Undocumented. |
||
N/A |
N/A |
0 ulp (full range) |
|
0 ulp |
Undocumented. |
||
N/A |
N/A |
4 ulp (full range) |
|
N/A |
N/A |
5 ulp (full range) |
|
N/A |
N/A |
5 ulp (full range) |
|
N/A |
N/A |
3 ulp (full range) |
|
N/A |
N/A |
3 ulp (full range) |
|
≤ 16 ulp |
8 ulp (full range) |
||
≤ 16 ulp |
N/A |
N/A |
|
≤ 16 ulp |
N/A |
N/A |
|
N/A |
N/A |
1 ulp (full range) |
|
N/A |
N/A |
2 ulp (full range) |
|
N/A |
N/A |
3 ulp (full range) |
|
N/A |
N/A |
2 ulp (full range) |
|
N/A |
N/A |
2 ulp (full range) |
|
0 ulp |
0 ulp (full range) |
||
0 ulp |
0 ulp (full range) |
||
Correctly rounded |
0 ulp (full range) |
||
≤ 16 ulp |
N/A |
N/A |
|
Correctly rounded |
0 ulp (full range) |
||
≤ 2 ulp |
2 ulp (full range) (applies to |
||
N/A |
N/A |
0 ulp (full range) |
|
N/A |
N/A |
0 ulp (full range) |
|
≤ 4 ulp |
2 ulp (full range) |
||
≤ 4 ulp for sine and cosine values |
2 ulp (full range) |
||
N/A |
N/A |
2 ulp (full range) |
|
≤ 4 ulp |
3 ulp (full range) |
||
≤ 4 ulp |
2 ulp (full range) |
||
≤ 3 ulp |
0 ulp (when compiled with |
||
≤ 5 ulp |
4 ulp (full range) |
||
≤ 5 ulp |
2 ulp (full range) |
||
≤ 6 ulp |
N/A |
N/A |
|
≤ 16 ulp |
11 ulp (full range) |
||
Correctly rounded |
0 ulp (full range) |
||
N/A |
N/A |
9 ulp for |
|
N/A |
N/A |
9 ulp for |
|
N/A |
N/A |
|
|
N/A |
N/A |
N/A |
|
N/A |
N/A |
N/A |
|
N/A |
N/A |
N/A |
|
N/A |
N/A |
N/A |
OpenCL’s native_ math built-ins map to the same CUDA built-in as the equivalent non-native_ OpenCL built-in, and their precision is implementation-defined.
In section 7.4 of the OpenCL 2.1 Specification, mad has a different requirement, namely:
Implemented either as a correctly rounded fma or as a multiply followed by an add both of which are correctly rounded.
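The two permitted implementations can return different values, because a fused multiply-add rounds once while a separate multiply and add round twice. The sketch below shows this on double-precision values in Python. It emulates the fused form with exact rational arithmetic followed by a single rounding. The names mad_unfused and mad_fused are hypothetical helpers for illustration and do not correspond to any OpenCL or CUDA API.

```python
from fractions import Fraction


def mad_unfused(a: float, b: float, c: float) -> float:
    """Multiply, round, then add, round: two correctly rounded operations."""
    return a * b + c


def mad_fused(a: float, b: float, c: float) -> float:
    """Compute a * b + c exactly, then round once (emulates a correctly rounded fma)."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)  # no rounding in this step
    return float(exact)                              # a single rounding to double


if __name__ == "__main__":
    a = 1.0 + 2.0 ** -30
    b = 1.0 - 2.0 ** -30
    c = -1.0
    # The exact product is 1 - 2**-60, which rounds to 1.0 when rounded alone.
    print(mad_unfused(a, b, c))
    print(mad_fused(a, b, c))
```

For these inputs the unfused form loses the small term entirely, while the fused form keeps it, so code that depends on bit-exact results cannot assume either behaviour from mad.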
The precision of SPIR-V math instructions for use in an OpenCL environment can be found in this document.
Double Precision
The following table uses the following sources:
Appendix E.1 of the CUDA C Programming Guide, which is referenced from the CUDA Math API documentation
CUDA defines maximum error in the same way as for single precision, and also includes:
The recommended way to round a double-precision floating-point operand to an integer, with the result being a double-precision floating-point number is rint(), not round(). The reason is that round() maps to an 8-instruction sequence on the device, whereas rint() maps to a single instruction. trunc(), ceil(), and floor() each map to a single instruction as well.
Only differences from single precision are included. On the OpenCL side, only the requirements for 1.0 / x, x / y and sqrt change. On the CUDA side, all built-in names change, as do many of the error bounds.
OpenCL Built-in |
OpenCL Min Accuracy (ULP) |
CUDA Built-in |
CUDA Maximum Error (ULP) |
---|---|---|---|
|
Correctly rounded |
|
0 ulp (IEEE-754 round-to-nearest-even) |
|
Correctly rounded |
N/A |
N/A |
|
Correctly rounded |
|
0 ulp (IEEE-754 round-to-nearest-even) |
Correctly rounded |
|
0 ulp (IEEE-754 round-to-nearest-even) |
|
Correctly rounded |
|
0 ulp (IEEE-754 round-to-nearest-even) |
|
≤ 4 ulp |
1 ulp (full range) |
||
≤ 5 ulp |
N/A |
N/A |
|
≤ 4 ulp |
2 ulp (full range) |
||
≤ 5 ulp |
N/A |
N/A |
|
≤ 5 ulp |
2 ulp (full range) |
||
≤ 6 ulp |
2 ulp (full range) |
||
≤ 5 ulp |
N/A |
N/A |
|
≤ 6 ulp |
N/A |
N/A |
|
≤ 4 ulp |
2 ulp (full range) |
||
≤ 4 ulp |
2 ulp (full range) |
||
≤ 5 ulp |
2 ulp (full range) |
||
≤ 2 ulp |
1 ulp (full range) |
||
Correctly rounded |
0 ulp (full range) |
||
0 ulp |
Undocumented. |
||
≤ 4 ulp |
1 ulp (full range) |
||
≤ 4 ulp |
1 ulp (full range) |
||
≤ 4 ulp |
1 ulp (full range) |
||
N/A |
N/A |
6 ulp (full range) |
|
N/A |
N/A |
6 ulp (full range) |
|
≤ 16 ulp |
4 ulp (full range) |
||
N/A |
N/A |
6 ulp (full range) |
|
N/A |
N/A |
3 ulp (full range) |
|
N/A |
N/A |
5 ulp (full range) |
|
≤ 16 ulp |
2 ulp (full range) |
||
≤ 3 ulp |
1 ulp (full range) |
||
≤ 3 ulp |
1 ulp (full range) |
||
≤ 3 ulp |
1 ulp (full range) |
||
≤ 3 ulp |
1 ulp (full range) |
||
0 ulp |
Undocumented. |
||
Correctly rounded |
0 ulp (full range) |
||
Correctly rounded |
0 ulp (full range) |
||
Correctly rounded |
0 ulp (IEEE-754 round-to-nearest-even) |
||
0 ulp |
Undocumented. |
||
0 ulp |
Undocumented. |
||
0 ulp |
0 ulp (full range) |
||
Correctly rounded |
N/A |
N/A |
|
0 ulp |
0 ulp (full range) |
||
≤ 4 ulp |
2 ulp (full range) |
||
0 ulp |
0 ulp (full range) |
||
N/A |
N/A |
7 ulp for |
|
N/A |
N/A |
7 ulp for |
|
N/A |
N/A |
For |
|
Correctly rounded |
0 ulp (full range) |
||
N/A |
N/A |
4 ulp (outside interval |
|
≤ 3 ulp |
1 ulp (full range) |
||
≤ 3 ulp |
1 ulp (full range) |
||
≤ 3 ulp |
1 ulp (full range) |
||
≤ 2 ulp |
1 ulp (full range) |
||
0 ulp |
0 ulp (full range) |
||
N/A |
N/A |
0 ulp (full range) |
|
N/A |
N/A |
0 ulp (full range) |
|
N/A |
N/A |
0 ulp (full range) |
|
N/A |
N/A |
0 ulp (full range) |
|
Any value allowed (infinite ulp) |
N/A |
N/A |
|
0 ulp |
N/A |
N/A |
|
0 ulp |
N/A |
N/A |
|
0 ulp |
|
0 ulp (full range) |
|
0 ulp |
Undocumented. |
||
N/A |
N/A |
0 ulp (full range) |
|
0 ulp |
Undocumented. |
||
N/A |
N/A |
3 ulp (full range) |
|
N/A |
N/A |
5 ulp (full range) |
|
N/A |
N/A |
7 ulp (full range) |
|
N/A |
N/A |
2 ulp (full range) |
|
N/A |
N/A |
2 ulp (full range) |
|
≤ 16 ulp |
2 ulp (full range) |
||
≤ 16 ulp |
N/A |
N/A |
|
≤ 16 ulp |
N/A |
N/A |
|
N/A |
N/A |
1 ulp (full range) |
|
N/A |
N/A |
1 ulp (full range) |
|
N/A |
N/A |
2 ulp (full range) |
|
N/A |
N/A |
1 ulp (full range) |
|
N/A |
N/A |
1 ulp (full range) |
|
0 ulp |
0 ulp (full range) |
||
0 ulp |
0 ulp (full range) |
||
Correctly rounded |
0 ulp (full range) |
||
≤ 16 ulp |
N/A |
N/A |
|
Correctly rounded |
0 ulp (full range) |
||
≤ 2 ulp |
1 ulp (full range) |
||
N/A |
N/A |
0 ulp (full range) |
|
N/A |
N/A |
0 ulp (full range) |
|
≤ 4 ulp |
1 ulp (full range) |
||
≤ 4 ulp for sine and cosine values |
1 ulp (full range) |
||
N/A |
N/A |
1 ulp (full range) |
|
≤ 4 ulp |
1 ulp (full range) |
||
≤ 4 ulp |
1 ulp (full range) |
||
Correctly rounded |
0 ulp (IEEE-754 round-to-nearest-even) |
||
≤ 5 ulp |
2 ulp (full range) |
||
≤ 5 ulp |
1 ulp (full range) |
||
≤ 6 ulp |
N/A |
N/A |
|
≤ 16 ulp |
8 ulp (full range) |
||
Correctly rounded |
0 ulp (full range) |
||
N/A |
N/A |
7 ulp for |
|
N/A |
N/A |
7 ulp for |
|
N/A |
N/A |
For |
|
N/A |
N/A |
N/A |
|
N/A |
N/A |
N/A |
|
N/A |
N/A |
N/A |
|
N/A |
N/A |
N/A |
Half Precision
The following tables use the following sources:
CUDA doesn’t specify the ULP values for any of its half-precision math builtins:
OpenCL Built-in |
OpenCL Min Accuracy (ULP) |
CUDA Built-in |
CUDA Maximum Error (ULP) |
---|---|---|---|
N/A |
N/A |
Undocumented (only specifies “round-to-nearest-even mode”) |
|
N/A |
N/A |
Undocumented (only specifies “round-to-nearest-even mode”) |
|
N/A |
N/A |
Undocumented |
|
≤ 8192 ulp |
Undocumented (only specifies “round-to-nearest-even mode”) |
||
≤ 8192 ulp |
Undocumented (only specifies “round-to-nearest mode”) |
||
N/A |
N/A |
Undocumented |
|
N/A |
N/A |
Undocumented |
|
≤ 8192 ulp |
Undocumented (only specifies “round-to-nearest-even mode”) |
||
≤ 8192 ulp |
Undocumented (only specifies “round-to-nearest-even mode”) |
||
≤ 8192 ulp |
Undocumented (only specifies “round-to-nearest-even mode”) |
||
N/A |
N/A |
Undocumented |
|
N/A |
N/A |
Undocumented (only specifies “round-to-nearest-even mode”) |
|
N/A |
N/A |
Undocumented (only specifies “round-to-nearest-even mode”) |
|
N/A |
N/A |
Undocumented |
|
N/A |
N/A |
Undocumented |
|
N/A |
N/A |
Undocumented |
|
N/A |
N/A |
Undocumented |
|
N/A |
N/A |
Undocumented |
|
N/A |
N/A |
Undocumented |
|
N/A |
N/A |
Undocumented |
|
N/A |
N/A |
Undocumented |
|
≤ 8192 ulp |
Undocumented (only specifies “round-to-nearest-even mode”) |
||
≤ 8192 ulp |
Undocumented (only specifies “round-to-nearest-even mode”) |
||
≤ 8192 ulp |
Undocumented (only specifies “round-to-nearest-even mode”) |
||
N/A |
N/A |
Undocumented |
|
N/A |
N/A |
Undocumented |
|
N/A |
N/A |
Undocumented (only specifies “round-to-nearest mode”) |
|
N/A |
N/A |
Undocumented (only specifies “round-to-nearest mode”) |
|
N/A |
N/A |
Undocumented |
|
N/A |
N/A |
Undocumented |
|
N/A |
N/A |
Undocumented |
|
≤ 8192 ulp |
N/A |
N/A |
|
≤ 8192 ulp |
Undocumented (only specifies “round-to-nearest-even mode”) |
||
N/A |
N/A |
Undocumented (only specifies “halfway cases rounded to nearest even integer value”) |
|
≤ 8192 ulp |
Undocumented (only specifies “round-to-nearest mode”) |
||
≤ 8192 ulp |
Undocumented (only specifies “round-to-nearest-even mode”) |
||
≤ 8192 ulp |
Undocumented (only specifies “round-to-nearest-even mode”) |
||
N/A |
N/A |
Undocumented (only specifies “round-to-nearest mode”) |
|
N/A |
N/A |
Undocumented (only specifies “round-to-nearest mode”) |
|
≤ 8192 ulp |
N/A |
N/A |
|
N/A |
N/A |
Undocumented |
CUDA also defines math builtins that operate on a half2 type, for which there is no OpenCL equivalent:
CUDA Built-in |
CUDA Maximum Error (ULP) |
---|---|
Undocumented (only specifies “round-to-nearest mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-to-nearest mode”) |
|
Undocumented |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “halfway cases rounded to nearest even integer value”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented |
Further, CUDA defines conversion and data movement functions:
CUDA Built-in |
CUDA Maximum Error (ULP) |
---|---|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-down mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-up mode”) |
|
Undocumented (only specifies “round-towards-zero mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented (only specifies “round-down mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-up mode”) |
|
Undocumented (only specifies “round-towards-zero mode”) |
|
Undocumented (only specifies “round-down mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-up mode”) |
|
Undocumented (only specifies “round-towards-zero mode”) |
|
Undocumented (only specifies “round-down mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-up mode”) |
|
Undocumented (only specifies “round-towards-zero mode”) |
|
Undocumented (only specifies “round-down mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-up mode”) |
|
Undocumented (only specifies “round-towards-zero mode”) |
|
Undocumented (only specifies “round-down mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-up mode”) |
|
Undocumented (only specifies “round-towards-zero mode”) |
|
Undocumented (only specifies “round-down mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-up mode”) |
|
Undocumented (only specifies “round-towards-zero mode”) |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented (only specifies “round-down mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-up mode”) |
|
Undocumented (only specifies “round-towards-zero mode”) |
|
Undocumented (only specifies “round-down mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-up mode”) |
|
Undocumented (only specifies “round-towards-zero mode”) |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented |
|
Undocumented (only specifies “round-down mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-up mode”) |
|
Undocumented (only specifies “round-towards-zero mode”) |
|
Undocumented |
|
Undocumented (only specifies “round-down mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-up mode”) |
|
Undocumented (only specifies “round-towards-zero mode”) |
|
Undocumented (only specifies “round-down mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-up mode”) |
|
Undocumented (only specifies “round-towards-zero mode”) |
|
Undocumented (only specifies “round-down mode”) |
|
Undocumented (only specifies “round-to-nearest-even mode”) |
|
Undocumented (only specifies “round-up mode”) |
|
Undocumented (only specifies “round-towards-zero mode”) |
|
Undocumented |