Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[glsl-out] Incorrect rounding for max-float constant expression #4568

Open
Wumpf opened this issue Aug 14, 2023 · 19 comments · May be fixed by gfx-rs/naga#2443
Open

[glsl-out] Incorrect rounding for max-float constant expression #4568

Wumpf opened this issue Aug 14, 2023 · 19 comments · May be fixed by gfx-rs/naga#2443
Labels
naga Shader Translator type: bug Something isn't working

Comments

@Wumpf
Copy link
Member

Wumpf commented Aug 14, 2023

In v0.12 (prior to gfx-rs/naga#2266), this wgsl constant
const f32max = 0x1.fffffep+127;
would inline in glsl as 3.4028234663852886e38.

However, in v0.13 this results into a glsl constant
const float f32max = 3.4028235e38;
When testing a provided max-float value (read from a texture in my case) for >=f32max this will now yield false in a WebGL setup.
Interestingly, this scenario works fine on Metal where this yields constant float f32max = 340282350000000000000000000000000000000.0;. I suspect that glsl applies different rounding here.

More backends and situations might be affected by this, but I was able to verify it only so far for glsl-out and rule it out for msl-out. It would be worth exploring the effect on other backends as well and find out if this has any other accuracy problems.

@fornwall
Copy link
Contributor

fornwall commented Aug 15, 2023

The difference between 3.4028234663852886e38 and 3.4028235e38 comes from casting f32 to f64 or not before formatting f32max using {:?}:

let f32max = 3.4028235e38_f32;
println!("f64: {:?}", f32max as f64); // What glsl-out effectively did before gfx-rs/naga#2266, outputs "3.4028234663852886e38"
println!("f32: {:?}", f32max); // What glsl-out does now, outputs "3.4028235e38"
println!("Display: {}.0", f32max);  // What metal-out does, outputs "340282350000000000000000000000000000000.0"

But while different in string format, parsed as 32-bit floating point numbers by rustc, the three numbers are identical. Check https://evanw.github.io/float-toy/ or the below snippet:

let a = 3.4028234663852886e38_f32;
let b = 3.4028235e38_f32;
let c = 340282350000000000000000000000000000000.0_f32;
println!("a: {:0>32b}", a.to_bits()); // "01111111011111111111111111111111"
println!("b: {:0>32b}", b.to_bits()); // "01111111011111111111111111111111"
println!("c: {:0>32b}", c.to_bits()); // "01111111011111111111111111111111"
println!("{}, {}", a == b, b == c); // "true, true"

Here the bits 01111111011111111111111111111111 are indeed the correct bits for f32max:

0     11111110  11111111111111111111111
^     ^         ^
sign  exponent  fraction

sign: 0, positive number
exponent: 127, maximum value
fraction: 11111111111111111111111, all bits set

Since we've found an observable change in behaviour, we still need to understand what is happening.

@Wumpf To determine if the format makes any difference, could you try wgpu = { git = "https://github.com/fornwall/wgpu", rev = "25934854" }? This is wgpu 0.17 with a single modification to output const float f32max = 3.4028234663852886e38 instead. Does that make any difference? Otherwise it might be the change from inlining to a constant variable which causes the difference in behaviour.

@Wumpf
Copy link
Member Author

Wumpf commented Aug 16, 2023

thanks for the detailed analysis and providing a test case @fornwall!
I tried it against my repro case and it worked with your tweak! This would imply that glsl reads the string interpretation incorrectly? 🤔

I didn't build a simple repro case on my side, so the instructions to see the thing are somewhat.. involved. For reference, this is the commit I tried against rerun-io/rerun@a395e5a
But it should be possible to build something very minimal: What happens in my case is that I read a f32max from a texture and then check it here https://github.com/rerun-io/rerun/blob/andreas/naga-glsl-constant-fix/crates/re_renderer/shader/utils/size.wgsl#L13 which ends up determining the size of a bunch of points (quads) being drawn.
➡️ Meaning the whole thing boils down to comparing a float constant to a float that comes in via a read.

Wumpf referenced this issue in rerun-io/rerun Aug 16, 2023
### What

* Depends on emilk/egui#3253
* allows for a bit nicer callback handling on our side. Still not
amazing but quite a bit better I reckon (no more locks and additional
ref counting on our side!). I was hoping to use the new, more flexible
`shared_paint_callback_resources` to store `ViewBuilder`, but it can't
be accessed without a handle to the eframe renderer.
The end goal would be to get rid of re_renderer's
`per_frame_data_helper`. This will likely be enabled with wgpu's
"Arcanization" effort as this will eliminate `wgpu::RenderPass`'s
lifetime dependency. In that case we could store a `Mutex<ViewBuilder>`
on `ReRendererCallback` 🤔
* For the forseeable future we'll have to enable
`fragile-send-sync-non-atomic-wasm` on wgpu (luckily egui itself no
longer has this restriction). The problem is that a lot of our
infrastructure is built around assuming `Send`/`Sync` whereas in
actuality webgpu rendering resources can't be shared accross threads
easily.
* Had to (discover and) work around
https://github.com/gfx-rs/naga/issues/2436

### Checklist
* [x] I have read and agree to [Contributor
Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and
the [Code of
Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md)
* [x] I've included a screenshot or gif (if applicable)
* [x] I have tested [demo.rerun.io](https://demo.rerun.io/pr/2980) (if
applicable)
* [x] Test WebGPU build

- [PR Build Summary](https://build.rerun.io/pr/2980)
- [Docs
preview](https://rerun.io/preview/pr%3Aandreas%2Fupdate-wgpu/docs)
- [Examples
preview](https://rerun.io/preview/pr%3Aandreas%2Fupdate-wgpu/examples)
@fornwall
Copy link
Contributor

This would imply that glsl reads the string interpretation incorrectly? 🤔

A stackoverflow question shows that there a WebGL2 implementation computed a value as the equivalent 64-bit floating point code in C++ would have, which seems to be allowed behaviour.

Related: gpuweb/gpuweb#3431, wgsl: add way to get special values for a given numeric type (e.g. max, min, lowest).

As I understand it the previous as f64 formatting happened more to work "by chance" (?), so I don't know what a good way forward here is unfortunately.

@Wumpf
Copy link
Member Author

Wumpf commented Aug 17, 2023

If the problem is that webgl uses 64bit floats that would imply in this case that...

  • the max constant is in actuality a 64bit float, meaning it has a higher value than expected
  • the comparison operation may compare two f64 or a f64 with a f32 one, with the potential f32 one having max-f32 loaded -> this works fine if the f64 was correctly set

So that would mean that if we always set the 64bit representation of floats for constants in Glsl as it does on the branch you provided we should be at least better off? I mean the problem boils down to 3.4028234663852886e38_f64 != 3.4028235e38_f64, right? Wouldn't say it worked by chance since all f32 can be represented in f64. If the hypothesis is correct, using f64 literals always would solve the issue consistently I believe 🤔

Agreed that having limit constants in wgsl would be preferable in any case, but arguably this is an orthogonal problem to the one of "constant gets the wrong value assigned"

@fornwall
Copy link
Contributor

In a webgl2 shader, it seems like indeed 3.4028235e38 != 3.4028234663852886e38 with precision highp float: https://fornwall.net/webgl2-float-comparison (at least on all browser and os setups I've tested with)

As said above, this is in contrast to treating the numbers are 32-bit floats, for instance in rust: assert_eq!(3.4028235e38_f32, 3.4028234663852886e38_f32);.

So the effective behaviour is like the shader uses f64 in computations:

let value_from_texture = 3.4028235e38_f32 as f64;
let f32max: f64 = 3.4028235e38_f32;

which causes value_from_texture >= f32max to be false.

Using f64 literals would indeed solve this problem. Are there any potential drawbacks? Let's create a PR to raise visibility and perhaps get more eyes on it (I can do it later today).

@teoxoy
Copy link
Member

teoxoy commented Aug 17, 2023

I think we are running into rust-lang/rust#63171 whereas previously we were passing an f64 for formatting which was causing the fmt function to give us more digits.

The full value of f32::MAX is 340282346638528859811704183484516925440 (ref https://stackoverflow.com/a/60224754). So, if not enough digits are written, we are relying on the rounding behavior of each shading language's float parser.

@teoxoy
Copy link
Member

teoxoy commented Aug 17, 2023

In my case 3.4028235e38 seems to get rounded to NaN (1.0/0.0) on https://fornwall.net/webgl2-float-comparison.

Also found https://float.exposed/0x7f7fffff which is a nice tool for looking at representations of floats.

Not sure what a proper solution for this is, maybe we should just cast to f64 before writing although once we fully support f64 we'd need f128 for those. Or come up with an algo for formatting floats properly but that sounds involved.

@fornwall
Copy link
Contributor

Created gfx-rs/naga#2443, adding the as f64 which was confirmed resolving the issue mentioned here.

fornwall referenced this issue in fornwall/naga Aug 18, 2023
@fornwall
Copy link
Contributor

In my case 3.4028235e38 seems to get rounded to NaN (1.0/0.0) on https://fornwall.net/webgl2-float-comparison.

@teoxoy Interesting! So having f1=f2=3.4028235e38 made the comparison false (=red triangle), as NaN is not equal to itself? I'm not seeing that here - which browser/os/arch did you check on?

@teoxoy
Copy link
Member

teoxoy commented Aug 18, 2023

No, I'm also getting the blue triangle. The result of 1.0/0.0 is unspecified but is usually NaN or Inf, it might be Inf here.

If there is a syntax error in the shader (set f2=;) I'm getting this warning to show up (besides the syntax error):
compiling shader: WARNING: 0:5: '3.4028235e38' : Float overflow

@fornwall
Copy link
Contributor

I'm leaning towards merging gfx-rs/naga#2443, as it

  1. Fixes a real issue
  2. Was the behaviour up until recently
  3. We don't know of any drawbacks

We can always revisit if there is a related issue or more information in the future.

@teoxoy
Copy link
Member

teoxoy commented Aug 18, 2023

Is it only an issue in the GLSL backend? Potentially, all other backends (besides SPV) have the same issue. Or does their rounding match rust's? f64 also most likely has the same issue.

I was trying to figure out yesterday if we can somehow tell rust's formatting to give us one more digit than it thinks its needed since I think that would fix the issue but couldn't find how to do it. There is a way to specify the total nr of digits and also the digits after the period but we can't get the estimate rust is using in the fmt function.

@teoxoy
Copy link
Member

teoxoy commented Aug 18, 2023

The IEEE 754-2019 spec defines H as being the minimum nr of significant digits needed to perform the conversion.

For f32 it's 12 and for f64 it's 20.

Maybe we can instead use the exponential formatting rust provides and pass 11 & 19 for the nr of digits after the period.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=55a02ec8ed1f8adf8c4150995beb2d5e

The downside seems to be that insignficant 0's are still written out.

@fornwall
Copy link
Contributor

fornwall commented Aug 20, 2023

Maybe we can instead use the exponential formatting rust provides and pass 11 & 19 for the nr of digits after the period.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=55a02ec8ed1f8adf8c4150995beb2d5e

The output of println!("{:.11e}", f32::MAX) is 3.40282346639e38, which chrome does not parse as a f32 in wgsl - on the below function:

fn compute() -> f32 { return 3.40282346639e38; }

chrome fails (test here) with the below error:

value 340282346639000000870967122248196947968.0 cannot be represented as 'f32'

It seems like chrome effectively does something like:

let num = "3.40282346639e38".parse::<f64>().unwrap();
if num > (f32::MAX as f64) { // Fail with "{num} cannot be represented as 'f32'".

Is chrome doing the correct thing here? Looking at 6.2.1. Abstract Numeric Types, I guess so?

Certain expressions are evaluated at shader-creation time, and with a numeric range and precision that may be larger than directly implemented by the GPU.
[..]
The AbstractFloat type is the set of finite floating point numbers representable in the IEEE-754 binary64 (double precision) format.

I guess this is something that will be fixed in naga with (or after) const expressions, to propagate Abstract{Float,Int} instead of resolving them during parsing? Anyway, for wgsl output, {:.11e} does not work (at least for compatibility with current chrome behaviour).

@teoxoy
Copy link
Member

teoxoy commented Aug 31, 2023

I'm starting to think the most correct way to fix this is to output the float as its binary representation (via .to_bits()) then bitcasting it (leaving no room for interpretation). We can also add an inline comment with the actual value to aid readers.

@teoxoy
Copy link
Member

teoxoy commented Aug 31, 2023

@fornwall would you be up for updating our text backends (glsl, hlsl, msl & wgsl) with the solution above; as opposed to going with gfx-rs/naga#2443?

@fornwall
Copy link
Contributor

fornwall commented Sep 4, 2023

I'm starting to think the most correct way to fix this is to output the float as its binary representation (via .to_bits()) then bitcasting it (leaving no room for interpretation).

@teoxoy Do you mean a change like this for glsl:

-crate::Literal::F32(value) => write!(self.out, "{:?}", value)?,
+crate::Literal::F32(value) => write!(self.out, "uintBitsToFloat({}u)", value.to_bits())?,

causing changes like this:

-const vec3 const2_ = vec3(0.0, 1.0, 2.0);
+const vec3 const2_ = vec3(uintBitsToFloat(0u), uintBitsToFloat(1065353216u), uintBitsToFloat(1073741824u));

? That does not pass glslangValidator validation:

ERROR: 0:8: '=' : global const initializers must be constant ' const highp 3-component vector of float'
ERROR: 1 compilation errors. No code generated.

@fornwall
Copy link
Contributor

fornwall commented Sep 4, 2023

What tint ends up doing for at least glsl
(https://dawn.googlesource.com/dawn/+/refs/heads/main/src/tint/utils/text/string_stream.h#108):

template <typename T>
StringStream& EmitFloat(const T& value) {
    // Try printing the float in fixed point, with a smallish limit on the precision
    std::stringstream fixed;
    fixed.flags(fixed.flags() | std::ios_base::showpoint | std::ios_base::fixed);
    fixed.imbue(std::locale::classic());
    fixed.precision(20);
    fixed << value;

    std::string str = fixed.str();

    // If this string can be parsed without loss of information, use it.
    // (Use double here to dodge a bug in older libc++ versions which would incorrectly read
    // back FLT_MAX as INF.)
    double roundtripped;
    fixed >> roundtripped;

    // Strip trailing zeros from the number.
    auto float_equal_no_warning = std::equal_to<T>();
    if (float_equal_no_warning(value, static_cast<T>(roundtripped))) {
        while (str.length() >= 2 && str[str.size() - 1] == '0' && str[str.size() - 2] != '.') {
            str.pop_back();
        }

        sstream_ << str;
        return *this;
    }

    // Resort to scientific, with the minimum precision needed to preserve the whole float
    std::stringstream sci;
    sci.imbue(std::locale::classic());
    sci.precision(std::numeric_limits<T>::max_digits10);
    sci << value;
    sstream_ << sci.str();

    return *this;
}

That is, after having special cased infinity and NaN:

void PrintF32(StringStream& out, float value) {
      if (std::isinf(value)) {
          out << "0.0f " << (value >= 0 ? "/* inf */" : "/* -inf */");
      } else if (std::isnan(value)) {
          out << "0.0f /* nan */";
      } else {
          out << tint::writer::FloatToString(value) << "f"; // Ends up calling EmitFloat() above
      }   
}

@jimblandy
Copy link
Member

jimblandy commented Sep 6, 2023

? That does not pass glslangValidator validation:

This is surprising to me. The GLSL spec (3.20.6) says that a constant expression can be:

A built-in function call whose arguments are all constant expressions, with the exception of the texture lookup functions.

This seems to date all the way back to GLSL ES 1.0.

@cwfitzgerald cwfitzgerald added the naga Shader Translator label Oct 25, 2023
@cwfitzgerald cwfitzgerald transferred this issue from gfx-rs/naga Oct 25, 2023
@cwfitzgerald cwfitzgerald added type: bug Something isn't working and removed kind: bug labels Oct 25, 2023
@teoxoy teoxoy added this to the WebGPU Specification V1 milestone Nov 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
naga Shader Translator type: bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants