Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add next_array and collect_array #560

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

orlp
Copy link

@orlp orlp commented Jul 15, 2021

With this pull request I add two new functions to the Itertools trait:

fn next_array<T, const N: usize>(&mut self) -> Option<[T; N]>
where Self: Sized + Iterator<Item = T>;

fn collect_array<T, const N: usize>(mut self) -> Option<[T; N]>
where Self: Sized + Iterator<Item = T>;

These behave exactly like next_tuple and collect_tuple, however they return arrays instead. Since these functions require min_const_generics, I added a tiny build script that checks if Rust's version is 1.51 or higher, and if yes to set the has_min_const_generics config variable. This means that Itertools does not suddenly require 1.51 or higher, only these two functions do.

In order to facilitate this I did have to bump the minimum required Rust version to 1.34 from the (documented) 1.32, since Rust 1.32 and 1.33 have trouble parsing the file even if stuff is conditionally compiled. However, this should not result in any (new) breakage, because Itertools actually already requires Rust 1.34 for 9+ months, since 83c0f04 uses saturating_pow which wasn't stabilized until 1.34.


As for rationale, I think these functions are useful, especially for pattern matching and parsing. I don't think there's a high probability they get added to the standard library either, so that's why I directly make a pull request here. When/if TryFromIterator stabilizes we can simplify the implementation, but even then I believe these functions remain a good addition similarly how collect_vec is nice to have despite .collect::<Vec<_>> existing.

@orlp
Copy link
Author

orlp commented Jul 15, 2021

A possible enhancement might be to return Option<A> where A: FromArray<Self::Item, N> instead, and adding the FromArray trait, something similar to this:

trait FromArray<T, const N: usize> {
    fn from_array(array: [T; N]) -> Self;
}

impl<T, const N: usize> FromArray<T, N> for [T; N] { /* .. */ }
impl<T, const N: usize> FromArray<Option<T>, N> for Option<[T; N]> { /* .. */ }
impl<T, E, const N: usize> FromArray<Result<T, E>, N> for Result<[T; N], E> { /* .. */ }

In fact, I think this is highly useful because it allows things like

let ints = line.split_whitespace().map(|n| n.parse());
if let Ok([x, y, z]) = ints.collect_array() {
    ...
}

This would be completely in line with FromIterator.

@orlp
Copy link
Author

orlp commented Jul 16, 2021

So I have a working implementation of the above idea here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=9dba690b0dfc362971635e21647a4c19.

It makes this compile:

fn main() {
    let line = "32 -12 24";
    let nums = line.split_whitespace().map(|n| n.parse::<i32>());
    if let Some(Ok([x, y, z])) = nums.collect_array() {
        println!("x: {} y: {} z: {}", x, y, z);
    }
}

It would change the interface to:

trait ArrayCollectible<T>: Sized {
    fn array_from_iter<I: IntoIterator<Item = T>>(iterable: I) -> Option<Self>;
}

trait Itertools: Iterator {
    fn collect_array<A>(self) -> Option<A>
    where
        Self: Sized,
        A: ArrayCollectible<Self::Item>;
}

where

  • ArrayCollectible<T> is implemented for [T; N];
  • ArrayCollectible<Option<T>> is implemented for Option<[T; N]>;
  • ArrayCollectible<Result<T, E>> is implemented for Result<[T; N], E>.

Copy link
Member

@phimuemue phimuemue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi there! Thanks for this. I particularly like that you thought about a way of enabling const-generic stuff without raising the minimum required rust version (even if I would imagine something else due to having an aversion against depending on other crates too much).

There has been some discussion recently about basically supporting not only tuples, but also arrays. I just want to make sure that we do not loose input from these discussions when actually settling with your solution:

On top of that, I think there are some changes in there that are not directly related to this issue. If you'd like to have them merged, could you possibly factor them out into separate PRs/commits?

build.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
@@ -0,0 +1,80 @@
use core::mem::MaybeUninit;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phimuemue phimuemue added the const-generics Require Rust 1.51 or newer label Aug 20, 2021
@orlp
Copy link
Author

orlp commented Dec 21, 2021

@phimuemue Any update on this?

@phimuemue
Copy link
Member

@phimuemue Any update on this?

I appreciate your effort, but unfortunately nothing substantial from my side: I changed my mind regarding version-check (so we could use it as a dev-dependency), but I do not have enough time right now to review and merge PRs that ample.

@orlp
Copy link
Author

orlp commented Dec 30, 2021

@phimuemue Just for posterity's sake, version-check would be a build-dependency, not dev-dependency.

@orlp
Copy link
Author

orlp commented Oct 4, 2022

@phimuemue Just checking in what the status is, I feel very strongly about the usefulness of collect_array. I miss it very often in itertools.

@scottmcm
Copy link
Contributor

scottmcm commented Oct 4, 2022

Note that if you want collect_array, you can use https://lib.rs/crates/arrayvec, as the usual way to collect into an array.

I'll also mention Iterator::next_chunk (rust-lang/rust#98326) as a nightly API that'll be next_array.

@Expurple
Copy link

This is a very useful feature. Today there was a thread on Reddit where the author basically asks if there's a crate that provides collect_array(). IMO, itertools should be the crate to do it

@Philippe-Cholet
Copy link
Member

@Expurple
I sure would like to do use const generics and collect_array is one of them.
Our MSRV is quite old (1.43.1 currently) while min-const-generics is 1.51 but I do not think it's the main blocker.
The fact is that there is not much available in recent stable Rust yet which is sad. Iterator::next_chunk and core::array::try_from_fn would be nice to have.
Plus, we currently don't really use unsafe ourselves (only in EitherOrBoth::insert* with obvious unfaillable patterns). I guess we prefer that the std does the heavy work.

@phimuemue
Copy link
Member

I sometimes think about adding arrayvec as a dependency - and falling back to std as soon it's possible. I think it might also solve some other issues (e.g. ExactlyOneError having a manual two-element-arrayvec). Would require Rust 1.51.

Another option I just saw: Crates can offer "nightly-only experimental API" (see https://docs.rs/arrayvec/latest/arrayvec/struct.ArrayVec.html#method.first_chunk for an example) - maybe this would help some users.

I personally would lean towards arrayvec. @jswrenn @Philippe-Cholet Opinions?

@Philippe-Cholet
Copy link
Member

Philippe-Cholet commented Mar 28, 2024

@phimuemue

Another option I just saw: Crates can offer "nightly-only experimental API" (see https://docs.rs/arrayvec/latest/arrayvec/struct.ArrayVec.html#method.first_chunk for an example) - maybe this would help some users.

ArrayVec<T, CAP> implements Deref<Target = [T]> so (nightly-available) slice methods are directly accessible, that seems to be it.

I sometimes think about adding arrayvec as a dependency - and falling back to std as soon it's possible. I think it might also solve some other issues (e.g. ExactlyOneError having a manual two-element-arrayvec). Would require Rust 1.51.

I'm definitely not opposed to the idea but the ExactlyOneError use case is quite small.
I did not give thoughts before, do you have other examples in mind? (with private usage, in order to fall back to std ASAP).

EDIT: ArrayVec has a maximal capacity of u32::MAX, could it be an issue?

EDIT: Well I have some. With tail and k_smallest (and its variants), I had thoughts of extending them to const where I dreamt of unstable Iterator::next_chunk but I guess we could use arrayvec in the meantime.

(My idea would be that .k_smallest(50) could also support .k_smallest(Const/*::<50> if not inferred elsewhere*/) so that we don't multiply method names too much but merely add a new zero-sized type struct Const<const N: usize>; at places we only gave usize before. Then no allocation.
It's not a magic bullet for every usage though but I see a real usage for it, such as .combinations(Const): internal Vec buffer but would return arrays, so no repetitive slow allocations.)


@scottmcm Small discussion about temporarily adding arrayvec as dependency once we move to const-generics. I just saw a comment of yours related to this. Could you elaborate?

@jswrenn
Copy link
Member

jswrenn commented Mar 28, 2024

For collect_array, I think I'd prefer just taking the time myself to write the unsafe code. We can vendor the not-yet-stabilized helper functions from the standard library that we'll need.

I can allocate some time to this next week.

@orlp
Copy link
Author

orlp commented Mar 28, 2024

@jswrenn Please don't forget that we are discussing this on a PR that already has a working implementation without adding dependencies...

src/next_array.rs Outdated Show resolved Hide resolved
src/next_array.rs Outdated Show resolved Hide resolved
@jswrenn
Copy link
Member

jswrenn commented Mar 28, 2024

@orlp, thanks, I had forgotten that this was a PR and not an issue when I made my reply. Still, we're talking about adding some extremely subtle unsafe code to Itertools. I'd like us to take extreme care to avoid accidentally introducing UB.

A PR adding unsafe to itertools should:

  • rigorously document the safety and panicking conditions of every unsafe function it introduces
  • prove that every invocation of an unsafe function (even invocations occurring within other unsafe functions) satisfies the safety precondition of that invocation, with citations to official Rust documentation
  • rigorously document why any potentially panicking function within an unsafe function does not create invalid state that would cause UB upon panicking unwinds
  • intensively test its API with miri

If you can update this PR to do those things, I can see a path forward to merging it.

Copy link
Member

@jswrenn jswrenn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR! I like the ArrayBuilder abstraction quite a bit. As I mentioned, this will need additional documentation and testing before it can be merged. See the recent safety comments in my other project, zerocopy for a sense of the paranoia rigor I'd like these safety comments to take.

src/next_array.rs Outdated Show resolved Hide resolved
src/next_array.rs Outdated Show resolved Hide resolved
src/next_array.rs Outdated Show resolved Hide resolved
src/next_array.rs Outdated Show resolved Hide resolved
src/next_array.rs Outdated Show resolved Hide resolved
src/next_array.rs Outdated Show resolved Hide resolved
src/next_array.rs Outdated Show resolved Hide resolved
src/next_array.rs Outdated Show resolved Hide resolved
build.rs Outdated Show resolved Hide resolved
@orlp
Copy link
Author

orlp commented Mar 28, 2024

@jswrenn I will be busy the upcoming week but I'm willing to bring this up to standards after that. If before then you could decide on whether or not to bump the MSRV to 1.51 I could include that in the rewrite.

@jswrenn
Copy link
Member

jswrenn commented May 31, 2024

@orlp Think you might have time to revisit this soon? :-)

@orlp
Copy link
Author

orlp commented May 31, 2024

@jswrenn Yes, I do. Have you reached a decision yet on the MSRV issue?

@Philippe-Cholet
Copy link
Member

I think it's time to set the MSRV to 1.51, I'm pretty sure @jswrenn and @phimuemue will agree.

@jswrenn
Copy link
Member

jswrenn commented Jun 1, 2024

Absolutely. I'd even be fine going up to 1.55, which is nearly three years old. In my other crates, I've found that to be the lowest MSRV that users actually need.

@Philippe-Cholet
Copy link
Member

Philippe-Cholet commented Jun 1, 2024

After quickly going though release notes, I may have missed something but I only noted two things 1.55 has over 1.51 that I considered to be of potential use for us:

EDIT: 1.51 has those things over 1.43.1: const-generics Require Rust 1.51 or newer 🎉, bool::then, slice::select_nth_unstable[_by[_key]] (unlocking #925), VecDeque::make_contiguous, Option::zip.

@phimuemue
Copy link
Member

Out of curiosity and slightly off-topic: What's a real reason to not update to stable Rust? Does it ever remove support for some platform or raise the system requirements dramatically? Or, put alternatively: Are there situations where someone could use cutting-edge itertools but not stable Rust?

@jswrenn
Copy link
Member

jswrenn commented Jun 4, 2024

Are there situations where someone could use cutting-edge itertools but not stable Rust?

Yes: Libraries that depend on itertools, but set a MSRV lower than stable. They are, of course, welcome to use an older, MSRV-compatible version of itertools, but we currently don't backport bugfixes to older versions.

What's a real reason to not update to stable Rust? Does it ever remove support for some platform or raise the system requirements dramatically?

Rust occasionally does remove support for platforms; e.g.: https://blog.rust-lang.org/2022/08/01/Increasing-glibc-kernel-requirements.html

(The above post suggests that, conservatively, we could increase our MSRV to 1.63 without causing major problem for users. Maybe that's a good target MSRV for now?)

@Philippe-Cholet
Copy link
Member

Philippe-Cholet commented Jun 4, 2024

After skimming all release notes up to 1.78.0, I noted this:

  • 1.57.0: Iterator::map_while, [T; N]::as_[mut_]slice
  • 1.60.0: MaybeUninit::assume_init_[drop|read]
  • 1.62.0: bool::then_some
  • 1.63.0: array::from_fn
  • 1.65.0: let else
  • 1.71.0: [T; N] <--> (T, ...) (N in 1..=12)

I'm not particularly waiting anything in there (maybe array::from_fn). I'd prefer some nightly stuff like the Try trait and array::try_from_fn.

Unless we need <[T; N]>::map (1.55.0), I'm fine with 1.51 for quite some time.
Is there something currently stable that you are waiting for? Maybe something I missed?

@orlp
Copy link
Author

orlp commented Jun 4, 2024

@Philippe-Cholet I feel <[T; N]>::map is rather fundamental and a strong reason to prefer 1.55 over 1.51. Similarly with array::from_fn for 1.63. The other stuff does not feel very relevant for this crate.

@orlp
Copy link
Author

orlp commented Jun 5, 2024

I have force-pushed with a new cleaner implementation, based on 1.55 for [T; N]::map. It only has two instantiations of unsafe.

Copy link
Member

@jswrenn jswrenn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This looks mostly good to me, though I've left some pedantic suggestions.

Can also you bump the MSRV in Cargo.toml and do a quick check for newly-resolvable clippy lints:

itertools/Cargo.toml

Lines 18 to 19 in ad5cc96

# When bumping, please resolve all `#[allow(clippy::*)]` that are newly resolvable.
rust-version = "1.43.1"

src/next_array.rs Outdated Show resolved Hide resolved
src/next_array.rs Outdated Show resolved Hide resolved
Comment on lines 18 to 24
pub fn push(&mut self, value: T) {
// We maintain the invariant here that arr[..len] is initialized.
// Indexing with self.len also ensures self.len < N, and thus <= N after
// the increment.
self.arr[self.len] = MaybeUninit::new(value);
self.len += 1;
}
Copy link
Member

@jswrenn jswrenn Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documenting the panic here is important in cases where a panic may cause invariants to be invalidated. That doesn't happen yet in our code, but could occur in future uses of ArrayBuilder.

Suggested change
pub fn push(&mut self, value: T) {
// We maintain the invariant here that arr[..len] is initialized.
// Indexing with self.len also ensures self.len < N, and thus <= N after
// the increment.
self.arr[self.len] = MaybeUninit::new(value);
self.len += 1;
}
/// Pushes `value` onto the end of the array list.
///
/// # Panics
///
/// This panics if `self.len() >= N`.
pub fn push(&mut self, value: T) {
// PANICS: This will panic if `self.len >= N`.
// SAFETY: Initializing an element of `self.arr` cannot violate its
// safety invariant. Even if this line panics, we have not created any
// intermediate invalid state.
self.arr[self.len] = MaybeUninit::new(value);
// SAFETY: By invariant on `self.arr`, all elements at indicies
// `0..self.len` are valid. Due to the above initialization, the element
// at `self.len` is now also valid. Consequently, all elements at
// indicies `0..(self.len + 1)` are valid, and `self.len` can be safely
// incremented without violating `self.arr`'s invariant.
self.len += 1;
}

Copy link
Author

@orlp orlp Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really feel this is overly verbose and ultimately reduces readability of the code. In my opinion, sometimes less is more.

In my opinion a safety comment should guide a critical thinker towards the right thought process of seeing why code is correct, but should not provide a 'proof-by-nodding-along' experience which is rather dangerous.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I share that opinion, except in the case of documenting unsafe code. This PR is the first substantial addition of unsafe code to itertools, and I'd like us to set our standard for unsafe code very high. I also suspect that ArrayBuilder may turn out to be a major internal tool as we expand our support for const-generic adaptors.

Copy link
Author

@orlp orlp Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jswrenn Of course I think the standard should be very high. And for that reason I think the comments should be pristine, and not overly verbose blocks of text that just make it harder to review the code as correct. I sincerely mean here that less is more, the shorter unsafe comments are better at preventing unsoundness in my opinion.

Copy link
Author

@orlp orlp Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think about it like this, lines of code that do things are signal, comments that do not help you understand the code are noise. In this example in your suggestion there are two lines of code, and nine lines of comments.

If you as a reader can reach the same understanding of the code with the same effort from a three-line comment as a nine-line comment (which I believe you can), then that means six lines were unnecessary and thus noise, giving a signal-to-noise ratio of 25%. I really prefer the signal-to-noise ratio to be as high as possible in unsafe code, allocating maximum headspace and visual area to the actual bits that matter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll have to agree to disagree; you are unlikely to convince me otherwise.

Making these comments has been an essential step in vetting this PR, and these comments will continue to aid future contributors and reviewers in making and evaluating changes to this code.

For example, I only noticed the (temporary) invariant violation issue in ArrayBuilder::take because I attempted to write detailed safety comments for this PR. I also see that you've accepted my suggested reordering, but not my suggested comments. Now, the effect achieved by reordering is totally implicit. We could easily regress on that issue.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the temporary invariant violation was ever an issue, just a preferred order for simplicity. That said, ultimately this is just style/documentation, so let's agree to disagree and move on. Can you push those comments/changes you consider to be not merely suggestions but essential directly to the branch?

src/next_array.rs Outdated Show resolved Hide resolved
Comment on lines 42 to 50
unsafe {
// SAFETY: arr[..len] is initialized, so must be dropped.
// First we create a pointer to this initialized slice, then drop
// that slice in-place. The cast from *mut MaybeUninit<T> to *mut T
// is always sound by the layout guarantees of MaybeUninit.
let ptr_to_first: *mut MaybeUninit<T> = self.arr.as_mut_ptr();
let ptr_to_slice = ptr::slice_from_raw_parts_mut(ptr_to_first.cast::<T>(), self.len);
ptr::drop_in_place(ptr_to_slice);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since our MSRV is now >= 1.60 we can leverage MaybeUninit::assume_init_drop. I know this is contrary to @scottmcm's suggestion here, but it entirely avoids working with raw pointers.

Suggested change
unsafe {
// SAFETY: arr[..len] is initialized, so must be dropped.
// First we create a pointer to this initialized slice, then drop
// that slice in-place. The cast from *mut MaybeUninit<T> to *mut T
// is always sound by the layout guarantees of MaybeUninit.
let ptr_to_first: *mut MaybeUninit<T> = self.arr.as_mut_ptr();
let ptr_to_slice = ptr::slice_from_raw_parts_mut(ptr_to_first.cast::<T>(), self.len);
ptr::drop_in_place(ptr_to_slice);
}
// LEMMA: The elements of `init` reference the valid elements of
// `self.arr`.
//
// PROOF: `slice::split_at_mut(mid)` produces a pair of slices, the
// first of which contains the elements at the indices `0..mid`. By
// invariant on `self.arr`, the elements of `self.arr` at indicies
// `0..self.len` are valid. Assuming that `slice::split_at_mut` is
// correctly implemented, the slice `init` will only reference the
// valid elements of `self.arr`.
let (init, _) = self.arr.split_at_mut(self.len);
// AXIOM: We assume that `slice::into_iter` and the subsequent
// `for_each` are implemented correctly, and will yield each and every
// element of `init`.
init.into_iter().for_each(|elt| {
// SAFETY: `elt` references a valid `T`, because by the above AXIOM,
// `elt` is an element of `init`, and by the above LEMMA, the
// elements of `init` reference the valid elements of `self.arr`.
unsafe { elt.assume_init_drop() }
});

Copy link
Author

@orlp orlp Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually has the wrong semantics compared to e.g. Vec<T> I realized, as it will leak values if a Drop impl panics. ptr::drop_in_place will forcibly keep dropping elements in that scenario, causing an abort if another Drop panics.

Also, again I find this overly verbose, ultimately reducing readability of the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, again I find this overly verbose, ultimately reducing readability of the code

I think it is likely that we have different preferences for how unsafe code is written and documented.

Abstractions like the one we're adding here are a notorious source of soundness issues in the Rust ecosystem (see the RUSTSEC advisories for smallvec, stackvector, and stack, for a sampler). Itertools has not yet had any soundness issues, and I'd like to keep it that way.

I am deeply reluctant to add any unsafe code to itertools. If we are to do so, I will require that it meets the standards I set for the other unsafe crates I maintain. The safety comments I'm proposing are not just suggestions; they're prerequisites of merging this PR.

The argument about a semantic difference is interesting. Leaking is not a soundness issue, and we don't make any promises to match the drop behavior of Vec under such conditions. Nonetheless, I'll grant that minimizing leaking is preferable. I'll have to give this one more thought.

Comment on lines +60 to +62
for _ in 0..N {
builder.push(it.next()?);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that these builder methods are safe, we can go back to calling .take, which can have performance benefits in some cases:

Suggested change
for _ in 0..N {
builder.push(it.next()?);
}
it.take(N).for_each(|elt| builder.push(elt));

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually believe take here is more likely to negatively affect performance, either way the performance is bad due to the optimizer really not liking this code: rust-lang/rust#126000.

src/next_array.rs Outdated Show resolved Hide resolved
tests/test_core.rs Outdated Show resolved Hide resolved
@orlp
Copy link
Author

orlp commented Jun 5, 2024

@jswrenn I actually wanted to suggest bumping the MSRV to whatever version you decide on in a separate PR (with whatever TODO's / clippy lints it resolves). Then we'd merge that one first rather than tacking it onto this one.

I have comitted some of your suggestions, although others I disagree with, see above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
const-generics Require Rust 1.51 or newer
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants