[clang][WebAssembly] Odd builtin behaviour #92698

Open
Photosounder opened this issue May 19, 2024 · 18 comments

@Photosounder

Photosounder commented May 19, 2024

Using clang (version 18.1.4) to compile WebAssembly and going the `-nostdlib` route, I'm noticing strange things when I try to use some builtins to implement libc functions. For instance, `static double sqrt(double x) { return __builtin_sqrt(x); }` works fine, but `static double exp2(double x) { return __builtin_exp2(x); }` gives me `wasm-ld: error: C:/msys/tmp/rl-89dcca.o: undefined symbol: exp2`, and making it non-static with an `extern` prototype above it just makes it call itself in a loop. That's even though `__has_builtin(__builtin_exp2)` is positive. The same happens with `__builtin_cos` and `__builtin_lroundf` (yet not `__builtin_nearbyintf`).

It seems as though some builtins aren't really there for WebAssembly, as if they were defined by `#define __builtin_cos cos`. There doesn't seem to be a clear way to determine which ones are actually usable, and knowing which I can rely on is exactly what I need.

Edit: I thought about it, and I guess the answer is that only the builtins that map to a WebAssembly opcode are there in my case; the rest would come from libc, which I'm not including, so there's nothing to call. It would be nice if this were documented somewhere though.
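
For reference, a stripped-down reproduction of this, as a sketch (the file and function names are made up, and `-Wl,--no-entry` plus the `export_name` attributes are just one way to get a linkable module without libc, so the exact flags may differ from your setup):

// Compile with: \
clang repro.c -o repro.wasm --target=wasm32 -nostdlib -Wl,--no-entry -O2

// Lowers to the f64.sqrt opcode, so it links fine without libc.
__attribute__((export_name("my_sqrt"))) double my_sqrt(double x)
{
	return __builtin_sqrt(x);
}

// Lowers to a libcall, so wasm-ld reports: undefined symbol: exp2.
__attribute__((export_name("my_exp2"))) double my_exp2(double x)
{
	return __builtin_exp2(x);
}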

@github-actions github-actions bot added the clang (Clang issues not falling into any other category) label May 19, 2024
@EugeneZelenko EugeneZelenko added the backend:WebAssembly label and removed the clang (Clang issues not falling into any other category) label May 19, 2024
@llvmbot
Collaborator

llvmbot commented May 19, 2024

@llvm/issue-subscribers-backend-webassembly

Author: Michel Rouzic (Photosounder)


@ppenzin

ppenzin commented May 20, 2024

What are you using for standard library headers, emscripten?

@Photosounder
Author

Photosounder commented May 20, 2024

What are you using for standard library headers, emscripten?

I'm using `-nostdlib`; there's no emscripten and no WASI, so there are no standard library headers apart from the ones the compiler provides regardless, like `stdint.h` and `stddef.h`.

@sbc100
Collaborator

sbc100 commented May 20, 2024

These builtins lower to calls into compiler-rt functions. In wasi-sdk they would normally be provided by libc/musl code. e.g. https://github.com/WebAssembly/wasi-libc/blob/main/libc-top-half/musl/src/math/exp2.c

If you want to build with `-nostdlib` then you need to somehow compile those math functions yourself and include them in your project.
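
For example, here's a deliberately crude sketch of what "compile those math functions yourself" can look like for exp2 (illustration only: the coefficients are just the Taylor series of 2^f, special cases like NaN/infinity/overflow are ignored, and real code should use a proper implementation such as the musl exp2.c linked above):

// Rough exp2 for -nostdlib builds; the backend-generated libcall resolves to this
// definition because it has external linkage and the name "exp2".
double exp2(double x)
{
	// Split x = n + f with f in [-0.5, 0.5]; nearbyint lowers to f64.nearest, no libcall.
	double n = __builtin_nearbyint(x);
	double f = x - n;

	// 2^f via the degree-5 Taylor series of 2^f = e^(f*ln2); accurate to only about six digits.
	double p = 1.0 + f * (0.6931471805599453 + f * (0.2402265069591007 +
	           f * (0.0555041086648216 + f * (0.0096181291076285 +
	           f * 0.0013333558146428))));

	// 2^n by building the exponent bits directly (assumes n stays in the normal exponent range).
	union { double d; unsigned long long u; } pow2n;
	pow2n.u = (unsigned long long)(1023 + (long long)n) << 52;
	return p * pow2n.d;
}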

@Photosounder
Author

If you want to build with `-nostdlib` then you need to somehow compile those math functions yourself and include them in your project.

I understand, and that's already what I've been doing (see https://github.com/Photosounder/MinQND-libc/blob/main/minqnd_libc.h; I was actually wondering whether I really needed all these homemade implementations instead of relying on builtins). The problem is not having an easy way of knowing what's actually available. `__has_builtin(__builtin_exp2)` is positive, so that's no good, and when I call a builtin in a function I don't know whether it's actually there until the linker needs to link that function; the compiler itself doesn't say anything. So the best way to guess might be to look at the WebAssembly opcodes and work out which builtins they might correspond to. That's not a very good way of doing things; there should be a way to know what's actually there. Even when I look at the clang source code I can't see a difference, probably because I was looking at how builtins turn into IR and not at how the matching IR opcodes turn into wasm bytecode.

@sbc100
Collaborator

sbc100 commented May 20, 2024

I'm afraid I don't know where in the LLVM source code you can look to find a complete list of the libcalls that the Wasm backend depends on. You probably want to search for terms like libcall and RTLIB in the source code. llvm/lib/Target/WebAssembly/WebAssemblyRuntimeLibcallSignatures.cpp might have some clues too.

@dschuff do you know if there is an easy way to tell exactly which libcalls can be generated by llvm?

The conservative thing to do would be to provide a complete set of libcalls which is what libc/compiler-rt would do, but I guess you are trying to make something more minimal. Is there some reason you can't link against the math functions from musl/compiler-rt?

@dschuff
Member

dschuff commented May 20, 2024

A list of all libcalls LLVM knows about can be found in https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/IR/RuntimeLibcalls.def
Practically speaking, only the libcalls in https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/WebAssembly/WebAssemblyRuntimeLibcallSignatures.cpp can be generated for WebAssembly (since we don't have wasm signatures for the others).

@Photosounder
Author

I'm afraid I don't know where in the LLVM source code you can look to find a complete list of the libcalls that the Wasm backend depends on. You probably want to search for terms like libcall and RTLIB in the source code. llvm/lib/Target/WebAssembly/WebAssemblyRuntimeLibcallSignatures.cpp might have some clues too.

It doesn't seem like this highlights the distinction I'm looking for; for instance, we have

Table[RTLIB::SQRT_F64] = f64_func_f64;
Table[RTLIB::EXP2_F64] = f64_func_f64;

The first is available without libc (I assume because it simply turns into `f64.sqrt`) whereas the second isn't.

The conservative thing to do would be to provide a complete set of libcalls which is what libc/compiler-rt would do, but I guess you are trying to make something more minimal. Is there some reason you can't link against the math functions from musl/compiler-rt?

Tbh I'm already too far along in implementing all the functions I need for it to be a big deal, but it feels like a defect of the compiler not to make a distinction between what is and isn't actually available (because the vanilla x86-64 clang I'm using doesn't have a libc for wasm32 no matter what). As for why I'm doing this, I have a whole manifesto about it which I don't think I should bore you with 😄.

@dschuff
Member

dschuff commented May 20, 2024

That's still kind of a conservative estimate though; e.g. we have a signature for RTLIB::MUL_F32, but I can't imagine any case where that would actually be generated, because wasm has a 32-bit float multiply instruction. I don't know offhand of a way to easily tell which operations are supported. https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp#L46 sets up a bunch of cases for how to handle different operations and types, but it's also not really a comprehensive list.
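
As a concrete illustration of that point (a sketch, function name made up): a plain float multiply never needs MUL_F32, because it maps directly to a Wasm instruction.

// Compiles to f32.mul on wasm32; no MUL_F32 libcall is emitted for this.
float mul_f32(float a, float b)
{
	return a * b;
}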

@Photosounder
Author

A list of all libcalls LLVM knows about can be found in https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/IR/RuntimeLibcalls.def Practically speaking, only the libcalls in https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/WebAssembly/WebAssemblyRuntimeLibcallSignatures.cpp can be generated for WebAssembly (since we don't have wasm signatures for the others)

So all the builtins are listed there, but there's no distinction between the ones that rely on libc and the ones that work without it. If you see `cos` or `exp2` or `round`, those are ones that aren't really there without libc.

@Photosounder
Author

That's still kind of a conservative estimate though; e.g. we have a signature for RTLIB::MUL_F32, but I can't imagine any case where that would actually be generated, because wasm has a 32-bit float multiply instruction. I don't know offhand of a way to easily tell which operations are supported. https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp#L46 sets up a bunch of cases for how to handle different operations and types, but it's also not really a comprehensive list.

It seems like it might still not show the right distinction: in this file I see FCOS grouped with FMA, but `__builtin_fma` works without libc whereas `__builtin_cos` doesn't.

@dschuff
Member

dschuff commented May 20, 2024

Yeah; to be clear I'm agreeing that this doesn't really seem to be exactly what you want. I unfortunately don't know of a single place that provides such a list.

Regarding your libc philosophy, emscripten actually shares some of those values. The reasoning mostly boils down to the fact that we place a higher priority on code size than most C implementations, because on the web today you need to ship all of your libc code over the wire with your program, whereas in most C implementations, libc is already on every system from install time, so code size isn't as much of a constraint. As a result we try harder than most C implementations to get separability of library functions so that only things which are actually needed can be included (so there are a few dependencies we try to break inside libc, e.g. our use of printf without long double support by default) but we have to balance that against the maintenance cost of keeping local modifications of upstream libc (and of course we have decided not to roll our own libc but use musl instead). Probably one of the biggest reasons for that is that we can't compromise on standards compliance or accuracy by default as much as you can when you're creating a library mostly for your own use or for a special purpose.

@Photosounder
Author

Photosounder commented May 20, 2024

Yeah; to be clear I'm agreeing that this doesn't really seem to be exactly what you want. I unfortunately don't know of a single place that provides such a list.

Regarding your libc philosophy, emscripten actually shares some of those values. The reasoning mostly boils down to the fact that we place a higher priority on code size than most C implementations, because on the web today you need to ship all of your libc code over the wire with your program, whereas in most C implementations, libc is already on every system from install time, so code size isn't as much of a constraint. As a result we try harder than most C implementations to get separability of library functions so that only things which are actually needed can be included (so there are a few dependencies we try to break inside libc, e.g. our use of printf without long double support by default) but we have to balance that against the maintenance cost of keeping local modifications of upstream libc (and of course we have decided not to roll our own libc but use musl instead). Probably one of the biggest reasons for that is that we can't compromise on standards compliance or accuracy by default as much as you can when you're creating a library mostly for your own use or for a special purpose.

That's good to know; I didn't try compiling with emscripten to see how big the result would be (nor did I think about how what I do is advantageous for the web), but that makes sense. Yes, by making my own libc just for me I can go to further extremes in terms of simplicity, minimalism and mathematical tradeoffs; in fact it makes a lot of sense for someone to have their own libc once you take such considerations into account. Having my own allocator is also quite important, as I can use the external visualiser I made for it to see what's happening in memory, and I even have the option of doing things that aren't normally possible, such as moving the start of an allocated buffer up without moving any data.

As far as my original problem goes, I suppose we could figure out which builtins are available without libc by making a C file with just one function that calls them all and seeing what the linker says, but it's odd that there's no more direct way of determining this.
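
Something along these lines (a sketch; the file and function names are made up) already shows the difference without even linking, since the libcalls appear as undefined symbols in the object file:

// Compile without linking and list the undefined symbols, e.g.: \
clang -c probe.c -o probe.o --target=wasm32 -O2 && llvm-nm --undefined-only probe.o

// Builtins that map to a Wasm opcode leave nothing undefined behind;
// the ones that lower to libcalls show up as undefined symbols (exp2, cos, ...).
double probe(double d)
{
	double r = 0;
	r += __builtin_sqrt(d);  // f64.sqrt, nothing undefined expected
	r += __builtin_floor(d); // f64.floor, nothing undefined expected
	r += __builtin_exp2(d);  // expected to appear as undefined "exp2"
	r += __builtin_cos(d);   // expected to appear as undefined "cos"
	return r;
}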

@dschuff
Member

dschuff commented May 20, 2024

As far as my original problem goes, I suppose we could figure out which builtins are available without libc by making a C file with just one function that calls them all and seeing what the linker says, but it's odd that there's no more direct way of determining this.

Don't forget that most of the libcalls we are talking about here that differ across platforms in terms of which ones you need (e.g. MUL_F32) are from compiler-rt and are not actually part of libc. Unlike those that are actually part of libc, the compiler-rt functions tend to be well-separable and independent, and if you have a function in your library but don't actually emit calls to it, the linker will ensure it doesn't get included in any binary. So you get the size and simplicity of the linked program without any effort (and most people don't want to write these from scratch, and have no problem with taking the compiler-rt builtins library as a whole and including it all, because there's no cost to the user).
Math functions like sqrt_f64 are an exception to this, although there is often some code (e.g. errno handling) that needs to run even when platforms have native instructions for the math part.
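
One concrete example of such a compiler-rt libcall on wasm32 (a sketch, assuming __int128 support in your build): 128-bit multiplication isn't a single Wasm instruction, so it becomes a call to the compiler-rt helper __multi3, which only gets pulled in from the builtins library if something actually uses it.

// Lowers to a call to __multi3 on wasm32; with -nostdlib and no compiler-rt,
// the linker reports __multi3 as an undefined symbol.
__int128 mul128(__int128 a, __int128 b)
{
	return a * b;
}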

@Photosounder
Author

I did a bit of research by making a C file that calls lots of builtins and commented out everything that couldn't be made to work:

// Compile with: \
clang -mexec-model=reactor builtins_test.c -o builtins_test.wasm --target=wasm32 -nostdlib -mbulk-memory

extern void __wasm_call_ctors(void);
__attribute__((export_name("_initialize"))) void _initialize(void) { __wasm_call_ctors(); }

typedef struct { int a, b; } struct_t;
int dummy_printf(const char *fmt, ...) { return 0; }

__attribute__((export_name("builtins_test"))) int builtins_test()
{
	double d = 1.;
	int r, i = 8;
	char s[] = "";
	__float128 v;
	void *p = 0;
	__builtin_va_list args;
	struct_t st;

//undef	r = __builtin_acosf128(d);
//undef	r = __builtin_acoshf128(d);
//undef	r = __builtin_asinf128(d);
//undef	r = __builtin_asinhf128(d);
//undef	r = __builtin_atanf128(d);
//undef	r = __builtin_atanhf128(d);
//undef	r = __builtin_cbrtf128(d);
	r = __builtin_ceil(d);
//undef	r = __builtin_cos(d);
//undef	r = __builtin_coshf128(d);
//undef	r = __builtin_erff128(d);
//undef	r = __builtin_erfcf128(d);
//undef	r = __builtin_exp(d);
//undef	r = __builtin_exp2(d);
//undef	r = __builtin_exp10(d);
//undef	r = __builtin_expm1f128(d);
//undef	r = __builtin_fdimf128(d, d);
	r = __builtin_floor(d);
//undef	r = __builtin_fma(d, d, d);
//undef	r = __builtin_fmax(d, d);
//undef	r = __builtin_fmin(d, d);
//undef	r = __builtin_atan2f128(d, d);
	//r = __builtin_copysignf16(d, d); crashes compiler
//	r = __builtin_copysignf128(d, d); needs other symbols
	//r = __builtin_fabsf16(d);
//	r = __builtin_fabsf128(d); needs other symbols
//undef	r = __builtin_fmod(d, d);
//undef	r = __builtin_frexp(d, &i);
	r = __builtin_huge_val();
	//r = __builtin_huge_valf16();
	r = __builtin_inf();
	//r = __builtin_inff16();
//undef	r = __builtin_ldexp(d, d);
//undef	r = __builtin_modff128(d, &v);
	//r = __builtin_nanf16(s);
//undef	r = __builtin_nanf128(s);
//undef	r = __builtin_nans(s);
//undef	r = __builtin_powi(d, d);
//undef	r = __builtin_pow(d, d);
//undef	r = __builtin_hypotf128(d, d);
//undef	r = __builtin_ilogbf128(d);
//undef	r = __builtin_lgammaf128(d);
//undef	r = __builtin_llrintf128(d);
//undef	r = __builtin_llroundf128(d);
//undef	r = __builtin_log10(d);
//undef	r = __builtin_log1pf128(d);
//undef	r = __builtin_log2(d);
//undef	r = __builtin_logbf128(d);
//undef	r = __builtin_log(d);
//undef	r = __builtin_lrintf128(d);
//undef	r = __builtin_lroundf128(d);
//undef	r = __builtin_nearbyintf128(d);
//undef	r = __builtin_nextafterf128(d, d);
//undef	r = __builtin_nexttowardf128(d, d);
//undef	r = __builtin_remainderf128(d, d);
//undef	r = __builtin_remquof128(d, d, &i);
	r = __builtin_rint(d);
//undef	r = __builtin_round(d);
	r = __builtin_roundeven(d);
//undef	r = __builtin_scalblnf128(d, d);
//undef	r = __builtin_scalbnf128(d, d);
//undef	r = __builtin_sin(d);
//undef	r = __builtin_sinhf128(d);
	r = __builtin_sqrt(d);
//undef	r = __builtin_tanf128(d);
//undef	r = __builtin_tanhf128(d);
//undef	r = __builtin_tgammaf128(d);
	r = __builtin_trunc(d);
	r = __builtin_flt_rounds();
	r = __builtin_complex(d, d);
	r = __builtin_isgreater(d, d);
	r = __builtin_isgreaterequal(d, d);
	r = __builtin_isless(d, d);
	r = __builtin_islessequal(d, d);
	r = __builtin_islessgreater(d, d);
	r = __builtin_isunordered(d, d);
	r = __builtin_fpclassify(d, d, d, d, d, d);
	r = __builtin_isfinite(d);
	r = __builtin_isinf(d);
	r = __builtin_isinf_sign(d);
	r = __builtin_isnan(d);
	r = __builtin_isnormal(d);
	r = __builtin_issubnormal(d);
	r = __builtin_iszero(d);
	r = __builtin_issignaling(d);
	r = __builtin_isfpclass(d, 0);
	r = __builtin_signbit(d);
	r = __builtin_signbitf(d);
//	r = __builtin_signbitl(d); needs extra symbol
	//r = __builtin_canonicalize(d); crashes compiler
	r = __builtin_clz(d);
	r = __builtin_ctz(d);
	r = __builtin_ffs(d);
	r = __builtin_parity(d);
	r = __builtin_popcount(d);
	r = __builtin_clrsb(d);
//undef	p = __builtin_calloc(i, i);
	r = __builtin_constant_p(d);
	r = __builtin_classify_type(d);
	//r = __builtin_va_start(args, d);
	//r = __builtin_stdarg_start(args, d);
	p = __builtin_assume_aligned(p, 4);
//undef	__builtin_free(p);
//undef	p = __builtin_malloc(i);
	__builtin_memcpy_inline(p, p, 0);
	p = __builtin_mempcpy(p, p, d);
	__builtin_memset_inline(p, d, 0);
//undef	r = __builtin_strcspn(s, s);
//undef	p = __builtin_realloc(p, i);
	//p = __builtin_return_address(0); doesn't like it
	p = __builtin_extract_return_addr(p);
	p = __builtin_frame_address(0);
	//__builtin___clear_cache(s, s); crashes compiler
	__builtin_unwind_init();
	r = __builtin_eh_return_data_regno(0);
	//p = __builtin_thread_pointer(); crashes compiler
	p = __builtin_launder(s);
	//__builtin_eh_return(d, p); available but suppresses linker errors
	p = __builtin_frob_return_addr(p);
	p = __builtin_dwarf_cfa();
	/*__builtin_init_dwarf_reg_size_table(p); cannot compile
	//r = __builtin_dwarf_sp_column();*/
	r = __builtin_extend_pointer(p);
	r = __builtin_object_size(p, 2);
	r = __builtin_dynamic_object_size(p, 2);
//undef	p = __builtin___memcpy_chk(p, p, d, d);
//undef	p = __builtin___memccpy_chk(p, p, d, d, d);
//undef	p = __builtin___memmove_chk(p, p, d, d);
//undef	p = __builtin___mempcpy_chk(p, p, d, d);
//undef	p = __builtin___memset_chk(p, d, d, d);
//undef	p = __builtin___stpcpy_chk(s, s, d);
//undef	p = __builtin___strcat_chk(s, s, d);
//undef	p = __builtin___strcpy_chk(s, s, d);
//undef	r = __builtin___strlcat_chk(s, s, d, d);
//undef	r = __builtin___strlcpy_chk(s, s, d, d);
//undef	p = __builtin___strncat_chk(s, s, d, d);
//undef	p = __builtin___strncpy_chk(s, s, d, d);
//undef	p = __builtin___stpncpy_chk(s, s, d, d);
//undef	r = __builtin___snprintf_chk(s, i, i, i, "");
//undef	r = __builtin___sprintf_chk(s, i, i, "");
//undef	r = __builtin___vsnprintf_chk(s, i, i, i, "%s", s);
//undef	r = __builtin___vsprintf_chk(s, i, i, "%s", s);
//undef	r = __builtin___printf_chk(i, s, d, d);
//undef	r = __builtin___vprintf_chk(i, s, args);
	r = __builtin_unpredictable(d);
	r = __builtin_expect(d, d);
	r = __builtin_expect_with_probability(d, d, 1.);
	__builtin_prefetch(p);
	r = __builtin_readcyclecounter();
	__builtin_trap();
	__builtin_debugtrap();
	//__builtin_unreachable(); disabled for obvious reasons
	//r = __builtin_shufflevector(d, d); idk how to use those
	//r = __builtin_convertvector(d, d);
	p = __builtin_alloca_uninitialized(d);
	p = __builtin_alloca_with_align(d, 8);
	p = __builtin_alloca_with_align_uninitialized(d, 8);
	//r = __builtin_call_with_static_chain(d, d); idk how to use this either
	r = __builtin_nondeterministic_value(d);
	r = __builtin_elementwise_abs(d);
	r = __builtin_elementwise_bitreverse(i);
//undef	r = __builtin_elementwise_max(d, d);
//undef	r = __builtin_elementwise_min(d, d);
	r = __builtin_elementwise_ceil(d);
//undef	r = __builtin_elementwise_cos(d);
//undef	r = __builtin_elementwise_exp(d);
//undef	r = __builtin_elementwise_exp2(d);
	r = __builtin_elementwise_floor(d);
//undef	r = __builtin_elementwise_log(d);
//undef	r = __builtin_elementwise_log2(d);
//undef	r = __builtin_elementwise_log10(d);
//undef	r = __builtin_elementwise_pow(d, d);
	r = __builtin_elementwise_roundeven(d);
//undef	r = __builtin_elementwise_round(d);
	r = __builtin_elementwise_rint(d);
	r = __builtin_elementwise_nearbyint(d);
//undef	r = __builtin_elementwise_sin(d);
	r = __builtin_elementwise_sqrt(d);
	//r = __builtin_elementwise_tan(d); unknown
	r = __builtin_elementwise_trunc(d);
	//r = __builtin_elementwise_canonicalize(d); crashes compiler
	r = __builtin_elementwise_copysign(d, d);
//undef	r = __builtin_elementwise_fma(d, d, d);
	r = __builtin_elementwise_add_sat(i, i);
	r = __builtin_elementwise_sub_sat(i, i);
	//r = __builtin_reduce_max(d); idk
	//r = __builtin_reduce_min(d); idk
	/*r = __builtin_reduce_xor(i);
	r = __builtin_reduce_or(i); idk what's a vector of integers
	r = __builtin_reduce_and(i);
	r = __builtin_reduce_add(i);
	r = __builtin_reduce_mul(i);
	r = __builtin_matrix_transpose(d); nor a matrix
	r = __builtin_matrix_column_major_load(d);
	r = __builtin_matrix_column_major_store(d);*/
//undef	r = __builtin_memcmp(p, p, i);
//undef	r = __builtin_printf("%%");
//undef	r = __builtin_bcmp(p, p, d);
	//p = __builtin_objc_memmove_collectable(p, p, d); crashes compiler
	r = __builtin_annotation(i, "%%");
	__builtin_assume(d);
	__builtin_assume_separate_storage(p, p);
	r = __builtin_addc(d, d, d, p);
	r = __builtin_subc(d, d, d, p);
	r = __builtin_add_overflow(i, i, &i);
	r = __builtin_sub_overflow(i, i, &i);
	r = __builtin_mul_overflow(i, i, &i);
	/*r = __builtin_uadd(d); unknown
	r = __builtin_usub(d);
	r = __builtin_umul(d);
	r = __builtin_sadd(d);
	r = __builtin_ssub(d);
	r = __builtin_smul(d);*/
	p = __builtin_addressof(d);
	p = __builtin_function_start(builtins_test);
//undef	p = __builtin_char_memchr(s, i, i);
	__builtin_dump_struct(&st, dummy_printf);
	//r = __builtin_preserve_access_index(d); needs -g
	r = __builtin_is_aligned(p, i);
	p = __builtin_align_up(p, i);
	p = __builtin_align_down(p, i);
//undef	p = __builtin___get_unsafe_stack_start();
//undef	p = __builtin___get_unsafe_stack_bottom();
//undef	p = __builtin___get_unsafe_stack_top();
//undef	p = __builtin___get_unsafe_stack_ptr();
	__builtin_nontemporal_store(d, &i);
	r = __builtin_nontemporal_load(&d);
	/*r = __builtin_store_half(d); unknown
	r = __builtin_load_half(d);*/
	
	return r;
}

Turns out that FMA isn't actually available after all.

@dschuff
Member

dschuff commented May 20, 2024

I think FMA might only be available if you use -mrelaxed-simd. Also, for several of the SIMD builtins (e.g. shufflevector) I would expect them to work for some vector shapes and types of shuffles but not all (in any case I don't know of any libcalls for this; either the compiler can do it, or it can't). You might also need -msimd128 for those.

@Photosounder
Author

Photosounder commented May 21, 2024

I think FMA might only be available if you use -mrelaxed-simd. Also, for several of the SIMD builtins (e.g. shufflevector) I would expect them to work for some vector shapes and types of shuffles but not all (in any case I don't know of any libcalls for this; either the compiler can do it, or it can't). You might also need -msimd128 for those.

`__builtin_fma` doesn't work regardless of extensions; however, `__attribute__((__vector_size__(2 * sizeof(double)))) double d2; d2 = __builtin_wasm_relaxed_madd_f64x2(d2, d2, d2);` does work in those cases, and `__has_builtin(__builtin_wasm_relaxed_madd_f64x2)` accurately reflects the availability, unlike `__has_builtin(__builtin_fma)` which is always positive. So one could advantageously implement fma() using `__builtin_wasm_relaxed_madd_f64x2` when it's available. It seems like `__builtin_fma` is exclusively this, which I expect would be quite slow and big compared to using SIMD, so you guys might want to make fma() rely on SIMD when available.

Edit: Here's my fma() implementation. It turns into 3 x f64x2.replace_lane, a f64x2.relaxed_madd and a f64x2.extract_lane.
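
For reference, a sketch of that kind of fma() (assumes compiling with -msimd128 -mrelaxed-simd; note that f64x2.relaxed_madd is allowed to be either fused or unfused, so this isn't guaranteed to be a correctly rounded IEEE fma on every engine):

// Scalar fma() built on the relaxed-SIMD builtin; only lane 0 carries the result.
typedef double f64x2 __attribute__((__vector_size__(2 * sizeof(double))));

double fma(double x, double y, double z)
{
	f64x2 a = {x, 0.}, b = {y, 0.}, c = {z, 0.};
	f64x2 r = __builtin_wasm_relaxed_madd_f64x2(a, b, c);
	return r[0];
}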

@ppenzin

ppenzin commented May 22, 2024

Long story short: autogeneration of FMA is not available even with relaxed SIMD enabled, since relaxed SIMD uses Wasm-specific builtins that are not mapped to generic FMA, and there is no scalar FMA at all.

@Photosounder I like your implementation. I started playing with a minimal way to add malloc at some point, but never really finished it :)
