Internal error: tests using unique collections containing `nan` are flaky #3926

Zac-HD · 2024-03-17T07:00:33Z

e.g. https://github.com/HypothesisWorks/hypothesis/actions/runs/8313487434/job/22749533612?pr=3924#step:6:581

I've also hit a weird pytest crash a couple of times, but it looks like psf/black#4224 just hasn't been released yet.

tybug · 2024-03-17T16:36:40Z

I haven't tried to bisect this yet, but it's possible this regressed in "implement fake_forced" (#3806). I saw `Flaky` raised internally when I had a bug in my local implementation there. At least we have the strategy definition from this run, though it's a bit of a beast:

while generating 'Draw 1: ' from frozensets(one_of(builds(BaseExceptionGroup, text(), lists(from_type(builtins.BaseException), min_size=1, max_size=5)).filter(_can_hash), none().filter(_can_hash), just(NotImplemented).filter(_can_hash), builds(UnicodeDecodeError, just('unknown encoding'), just(b''), just(0), just(0), just('reason')).filter(_can_hash), builds(UnicodeEncodeError, just('unknown encoding'), text(), just(0), just(0), just('reason')).filter(_can_hash), builds(UnicodeTranslateError, text(), just(0), just(0), just('reason')).filter(_can_hash), builds(classmethod, just(lambda self: self)).filter(_can_hash), functions().filter(_can_hash), iterables(nothing()).filter(_can_hash), builds(dict).map(dict.values).filter(_can_hash), dates().filter(_can_hash), times().filter(_can_hash), one_of(builds(timezone, offset=builds(timedelta, hours=integers(min_value=-23, max_value=23), minutes=integers(min_value=0, max_value=59))).filter(_can_hash), builds(timezone, name=text(alphabet=characters()), offset=builds(timedelta, hours=integers(min_value=-23, max_value=23), minutes=integers(min_value=0, max_value=59))).filter(_can_hash)), just(Ellipsis).filter(_can_hash), builds(frozenset).filter(_can_hash), one_of(tuples(integers(min_value=0, max_value=4294967295), integers(min_value=-32, max_value=0).map(abs)).map(lambda x: ipaddress.IPv4Network(x, strict=False)).filter(_can_hash), sampled_from(('0.0.0.0/8', '10.0.0.0/8', '100.64.0.0/10', '127.0.0.0/8', '169.254.0.0/16', '172.16.0.0/12', '192.0.0.0/24', '192.0.0.0/29', '192.0.0.8/32', '192.0.0.9/32', '192.0.0.10/32', '192.0.0.170/32', '192.0.0.171/32', '192.0.2.0/24', '192.31.196.0/24', '192.52.193.0/24', '192.88.99.0/24', '192.168.0.0/16', '192.175.48.0/24', '198.18.0.0/15', '198.51.100.0/24', '203.0.113.0/24', '240.0.0.0/4', '255.255.255.255/32')).map(IPv4Network).filter(_can_hash)), one_of(tuples(integers(min_value=0, max_value=340282366920938463463374607431768211455), integers(min_value=-128, 
max_value=0).map(abs)).map(lambda x: ipaddress.IPv6Network(x, strict=False)).filter(_can_hash), sampled_from(('::1/128', '::/128', '::ffff:0:0/96', '64:ff9b::/96', '64:ff9b:1::/48', '100::/64', '2001::/23', '2001::/32', '2001:1::1/128', '2001:1::2/128', '2001:2::/48', '2001:3::/32', '2001:4:112::/48', '2001:10::/28', '2001:20::/28', '2001:db8::/32', '2002::/16', '2620:4f:8000::/48', 'fc00::/7', 'fe80::/10')).map(IPv6Network).filter(_can_hash)), binary().map(memoryview).filter(_can_hash), builds(PurePath, text()).filter(_can_hash), builds(property, just(lambda _: None)).filter(_can_hash), randoms().filter(_can_hash), one_of(builds(range, integers(min_value=0)).filter(_can_hash), builds(range, integers(), integers()).filter(_can_hash), builds(range, integers(), integers(), integers().filter(bool)).filter(_can_hash)), text().map(lambda c: re.match(".", c, flags=re.DOTALL)).filter(bool).filter(_can_hash), builds(compile, sampled_from(['', b''])).filter(_can_hash), builds(slice, one_of(none(), integers()), one_of(none(), integers()), one_of(none(), integers())).filter(_can_hash), text().filter(_can_hash), builds(super, from_type(builtins.type)).filter(_can_hash), builds(Bar, integers()).filter(_can_hash), builds(Baz, integers()).filter(_can_hash), builds(tuple).filter(_can_hash), builds(BytesIO, binary()).filter(_can_hash), one_of(booleans().filter(_can_hash), integers().filter(_can_hash), floats().filter(_can_hash), complex_numbers().filter(_can_hash), fractions().filter(_can_hash), decimals().filter(_can_hash), timedeltas().filter(_can_hash)), one_of(booleans().filter(_can_hash), binary().filter(_can_hash), integers(min_value=0, max_value=255).filter(_can_hash), lists(integers(min_value=0, max_value=255)).map(tuple).filter(_can_hash)), one_of(booleans().filter(_can_hash), integers().filter(_can_hash), floats().filter(_can_hash), complex_numbers().filter(_can_hash), decimals().filter(_can_hash), fractions().filter(_can_hash)), one_of(booleans().filter(_can_hash), 
integers().filter(_can_hash), floats().filter(_can_hash), decimals().filter(_can_hash), fractions().filter(_can_hash), floats().map(str).filter(_can_hash)), one_of(integers().filter(_can_hash), booleans().filter(_can_hash)), one_of(booleans().filter(_can_hash), integers().filter(_can_hash), floats().filter(_can_hash), uuids().filter(_can_hash), decimals().filter(_can_hash), from_regex('\\A-?\\d+\\Z').filter(functools.partial(can_cast, int)).filter(_can_hash)), one_of(booleans().filter(_can_hash), integers().filter(_can_hash), floats().filter(_can_hash), decimals().filter(_can_hash), fractions().filter(_can_hash)), builds(StringIO, text()).filter(_can_hash), shared(sampled_from([<class 'NoneType'>, <class 'bool'>, <class 'int'>, <class 'float'>, <class 'str'>, <class 'bytes'>]), key="typevar=<class 'collections.abc.Hashable'>").flatmap(from_type).filter(_can_hash), timezones().filter(_can_hash)))

Zac-HD · 2024-03-18T20:37:24Z

test_resolve_typing_module[typing.ChainMap] flaked with this similar-looking error. Cleaned up:

# while generating 'Draw 1: ' from
types_strat = sampled_from([type(None), bool, int, float, str, bytes])
dictionaries(
    keys=shared(types_strat, key='typevar=~KT').flatmap(from_type).filter(_can_hash), 
    values=shared(types_strat, key='typevar=~VT').flatmap(from_type)
).map(ChainMap)

It might also be relevant that all of the examples I remember have involved sets or mappings; I'd suspect iteration order but ChainMap wrapping a dict should be entirely deterministic in iteration order. Perhaps an interaction between from_type (which tries to cache) and typevars and/or _can_hash?

tybug · 2024-03-18T20:53:27Z

Manually shrunk to the following:

dictionaries(
    keys=st.floats(),
    values=st.just(None),
)

Anecdotally using float seems to be critical, vs say int. At this point I'd suspect -0.0 vs 0.0 or related float issues.

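from hypothesis import strategies as st

def f():
    s = st.dictionaries(
        keys=st.floats(),
        values=st.just(None),
    )
    s.example()

# Draw repeatedly; the flaky error only shows up on some runs.
for i in range(1000):
    print("-" * 25, i, "-" * 25)
    f()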
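For reference, the `-0.0` vs `0.0` suspicion above rests on how the two zeros behave as dict keys. A minimal standard-library sketch of that behaviour (illustrative only, and not the eventual cause of this issue):

```python
import math

d = {0.0: "positive zero"}
d[-0.0] = "negative zero"  # equal key: the value is replaced, the key object is kept

assert 0.0 == -0.0                 # the two zeros compare equal
assert hash(0.0) == hash(-0.0)     # and hash equal, so they collide as keys
assert len(d) == 1
assert math.copysign(1.0, next(iter(d))) == 1.0  # stored key is still +0.0
assert d[0.0] == "negative zero"
```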

tybug · 2024-03-18T22:27:28Z

Reduced further:

lists(
    st.floats(allow_infinity=False),
    min_size=0,
    max_size=3,
    unique=True,
)

Something to do with multiple nans in the same list while filtering for uniqueness. I wonder if

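import math
from hypothesis.internal.floats import int_to_float

n = 18444492273895866368  # 0xFFF8000000000000, a nan bit pattern
assert math.isnan(int_to_float(n))
assert int_to_float(n) not in [int_to_float(n)]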

is relevant? (int_to_float destroys identity).

Zac-HD · 2024-03-18T23:23:25Z

Yep, that would do it!

List containment (and maybe other collections?) uses `a is b` as an optimization over `a == b`, only checking the latter if `a is not b`. Unfortunately this is unsound in the presence of aliased nans, and while that's not usually a problem, here we are. I'd guess the mechanism is that we alias the first time we generate and then don't alias on the second, causing the second run to attempt a different subsequent draw.
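The identity shortcut described above can be reproduced with the standard library alone. This is a minimal sketch, not Hypothesis's own code: `fresh_copy` and `unique_append` are hypothetical helpers standing in for "generate the same nan again" and "filter for uniqueness".

```python
import struct


def fresh_copy(x: float) -> float:
    """Hypothetical helper: a new float object with the same bits as x."""
    return struct.unpack("!d", struct.pack("!d", x))[0]


def unique_append(seen: list, value: float) -> bool:
    """Naive uniqueness filter: append value only if `value not in seen`."""
    if value in seen:  # list containment checks identity first, then ==
        return False
    seen.append(value)
    return True


nan_a = float("nan")
nan_b = fresh_copy(nan_a)  # same bits, different object

assert nan_a != nan_a        # nan never compares equal, even to itself
assert nan_a in [nan_a]      # but the identity shortcut reports it as contained
assert nan_b not in [nan_a]  # a distinct nan object is reported as not contained

aliased, distinct = [], []
unique_append(aliased, nan_a)
unique_append(aliased, nan_a)   # rejected: same object
unique_append(distinct, nan_a)
unique_append(distinct, nan_b)  # accepted: different object

assert len(aliased) == 1
assert len(distinct) == 2
```

The same sequence of nan values therefore yields a one-element or a two-element "unique" list depending only on object identity, which is the kind of run-to-run difference that would surface as flakiness.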

Probably the correct general solution is to ensure that we return a different float object each time we generate a nan, which will ensure that we get the non-aliasing behavior each time.
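One way to picture "a different float object each time" is the sketch below. It is an assumption-laden illustration using only the standard library, not the change that was merged; `dealias_nan` is a hypothetical name.

```python
import math
import struct


def dealias_nan(x: float) -> float:
    """Hypothetical helper: return a fresh object for nan, pass others through."""
    if math.isnan(x):
        # Round-trip through bytes to build a new float object with the
        # same bit pattern as the input nan.
        return struct.unpack("!d", struct.pack("!d", x))[0]
    return x


shared_nan = math.nan
first = dealias_nan(shared_nan)
second = dealias_nan(shared_nan)

assert math.isnan(first) and math.isnan(second)
assert first is not second    # every call yields a new object
assert first not in [second]  # so containment consistently says "not contained"
assert struct.pack("!d", first) == struct.pack("!d", shared_nan)  # bits unchanged
assert dealias_nan(1.5) == 1.5  # non-nan values are untouched
```

With every nan de-aliased like this, a uniqueness check sees the non-aliasing behaviour on every run instead of only on some.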

This trades away a little bit of test power which we had ~by accident; we may well want to bring that back in future but should do so above the IR layer.

Zac-HD added the flaky-tests (for when our tests only sometimes pass) and tests/build/CI (about testing or deployment *of* Hypothesis) labels Mar 17, 2024

Zac-HD added the bug (something is clearly wrong here) label Mar 19, 2024

Zac-HD changed the title ~~test_generic_collections_only_use_hashable_elements[FrozenSet] is flaky~~ Internal error: tests using unique collections containing `nan` are flaky Mar 19, 2024

Zac-HD removed the tests/build/CI (about testing or deployment *of* Hypothesis) label Mar 19, 2024

tybug mentioned this issue Mar 20, 2024

Fix flaky error on unique collections containing nan #3931

Merged

Zac-HD closed this as completed in #3931 Mar 20, 2024
