Skip to content

Commit

Permalink
automata: make PikeVM and backtracker work without capture states
Browse files Browse the repository at this point in the history
Previously, construction of these engines checked to make sure the NFA
given had some capture states in it. If the NFA didn't, construction
failed with an error.

To support the case where the NFA has no capture states at all (to avoid
gratuitous memory allocation), we remove this restriction and tweak the
engine implementations to stop assuming that the NFA has capture states.
This turned out to not be too hard, as we only assumed as much in a few
places.

The main reason why this restriction existed in the first place was
semantics. Namely, it's important that the PikeVM remain infallible. But
what happens when you ask for match offsets in a search with an NFA that
has no capture states? The PikeVM just doesn't support that. Previously
it would panic (and thus the reason construction would fail). But now
instead it will just report "no match." It's a little hokey, but we
justify it to ourselves because "simplicity" and "avoids footguns" are
non-goals of this crate.
  • Loading branch information
BurntSushi committed Aug 4, 2023
1 parent c4bf634 commit 0e57a63
Show file tree
Hide file tree
Showing 5 changed files with 72 additions and 61 deletions.
6 changes: 4 additions & 2 deletions regex-automata/src/meta/regex.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2635,8 +2635,10 @@ impl Config {
/// states. For example, using `WhichCaptures::Implicit` will behave as if
/// all explicit capturing groups in the pattern were non-capturing.
///
/// Setting this to `WhichCaptures::None` may result in an error when
/// building a meta regex.
/// Setting this to `WhichCaptures::None` is usually not the right thing to
/// do. When no capture states are compiled, some regex engines (such as
/// the `PikeVM`) won't be able to report match offsets. This will manifest
/// as no match being found.
///
/// # Example
///
Expand Down
29 changes: 16 additions & 13 deletions regex-automata/src/nfa/thompson/backtrack.rs
Original file line number Diff line number Diff line change
Expand Up @@ -300,15 +300,6 @@ impl Builder {
&self,
nfa: NFA,
) -> Result<BoundedBacktracker, BuildError> {
// If the NFA has no captures, then the backtracker doesn't work since
// it relies on them in order to report match locations. However, in
// the special case of an NFA with no patterns, it is allowed, since
// no matches can ever be produced. And importantly, an NFA with no
// patterns has no capturing groups anyway, so this is necessary to
// permit the backtracker to work with regexes with zero patterns.
if !nfa.has_capture() && nfa.pattern_len() > 0 {
return Err(BuildError::missing_captures());
}
nfa.look_set_any().available().map_err(BuildError::word)?;
Ok(BoundedBacktracker { config: self.config.clone(), nfa })
}
Expand Down Expand Up @@ -954,8 +945,14 @@ impl BoundedBacktracker {
None => return Ok(None),
Some(pid) => pid,
};
let start = slots[0].unwrap().get();
let end = slots[1].unwrap().get();
let start = match slots[0] {
None => return Ok(None),
Some(s) => s.get(),
};
let end = match slots[1] {
None => return Ok(None),
Some(s) => s.get(),
};
return Ok(Some(Match::new(pid, Span { start, end })));
}
let ginfo = self.get_nfa().group_info();
Expand All @@ -965,8 +962,14 @@ impl BoundedBacktracker {
None => return Ok(None),
Some(pid) => pid,
};
let start = slots[pid.as_usize() * 2].unwrap().get();
let end = slots[pid.as_usize() * 2 + 1].unwrap().get();
let start = match slots[pid.as_usize() * 2] {
None => return Ok(None),
Some(s) => s.get(),
};
let end = match slots[pid.as_usize() * 2 + 1] {
None => return Ok(None),
Some(s) => s.get(),
};
Ok(Some(Match::new(pid, Span { start, end })))
}

Expand Down
46 changes: 36 additions & 10 deletions regex-automata/src/nfa/thompson/compiler.rs
Original file line number Diff line number Diff line change
Expand Up @@ -316,8 +316,8 @@ impl Config {
/// # Example
///
/// This example demonstrates that some regex engines, like the Pike VM,
/// require capturing groups to be present in the NFA. Building a Pike VM
/// with an NFA without capturing groups will result in an error.
/// require capturing states to be present in the NFA to report match
/// offsets.
///
/// (Note that since this method is deprecated, the example below uses
/// [`Config::which_captures`] to disable capture states.)
Expand All @@ -329,10 +329,13 @@ impl Config {
/// WhichCaptures,
/// };
///
/// let nfa = NFA::compiler()
/// .configure(NFA::config().which_captures(WhichCaptures::None))
/// let re = PikeVM::builder()
/// .thompson(NFA::config().which_captures(WhichCaptures::None))
/// .build(r"[a-z]+")?;
/// assert!(PikeVM::new_from_nfa(nfa).is_err());
/// let mut cache = re.create_cache();
///
/// assert!(re.is_match(&mut cache, "abc"));
/// assert_eq!(None, re.find(&mut cache, "abc"));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
Expand Down Expand Up @@ -363,8 +366,8 @@ impl Config {
/// # Example
///
/// This example demonstrates that some regex engines, like the Pike VM,
/// require capturing groups to be present in the NFA. Building a Pike VM
/// with an NFA without capturing groups will result in an error.
/// require capturing states to be present in the NFA to report match
/// offsets.
///
/// ```
/// use regex_automata::nfa::thompson::{
Expand All @@ -373,10 +376,33 @@ impl Config {
/// WhichCaptures,
/// };
///
/// let nfa = NFA::compiler()
/// .configure(NFA::config().which_captures(WhichCaptures::None))
/// let re = PikeVM::builder()
/// .thompson(NFA::config().which_captures(WhichCaptures::None))
/// .build(r"[a-z]+")?;
/// let mut cache = re.create_cache();
///
/// assert!(re.is_match(&mut cache, "abc"));
/// assert_eq!(None, re.find(&mut cache, "abc"));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
///
/// The same applies to the bounded backtracker:
///
/// ```
/// use regex_automata::nfa::thompson::{
/// backtrack::BoundedBacktracker,
/// NFA,
/// WhichCaptures,
/// };
///
/// let re = BoundedBacktracker::builder()
/// .thompson(NFA::config().which_captures(WhichCaptures::None))
/// .build(r"[a-z]+")?;
/// assert!(PikeVM::new_from_nfa(nfa).is_err());
/// let mut cache = re.create_cache();
///
/// assert!(re.try_is_match(&mut cache, "abc")?);
/// assert_eq!(None, re.try_find(&mut cache, "abc")?);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
Expand Down
12 changes: 0 additions & 12 deletions regex-automata/src/nfa/thompson/error.rs
Original file line number Diff line number Diff line change
Expand Up @@ -68,9 +68,6 @@ enum BuildErrorKind {
/// The invalid index that was given.
index: u32,
},
/// An error that occurs when one tries to build an NFA simulation (such as
/// the PikeVM) without any capturing groups.
MissingCaptures,
/// An error that occurs when one tries to build a reverse NFA with
/// captures enabled. Currently, this isn't supported, but we probably
/// should support it at some point.
Expand Down Expand Up @@ -126,10 +123,6 @@ impl BuildError {
BuildError { kind: BuildErrorKind::InvalidCaptureIndex { index } }
}

pub(crate) fn missing_captures() -> BuildError {
BuildError { kind: BuildErrorKind::MissingCaptures }
}

#[cfg(feature = "syntax")]
pub(crate) fn unsupported_captures() -> BuildError {
BuildError { kind: BuildErrorKind::UnsupportedCaptures }
Expand Down Expand Up @@ -181,11 +174,6 @@ impl core::fmt::Display for BuildError {
"capture group index {} is invalid (too big or discontinuous)",
index,
),
BuildErrorKind::MissingCaptures => write!(
f,
"operation requires the NFA to have capturing groups, \
but the NFA given contains none",
),
#[cfg(feature = "syntax")]
BuildErrorKind::UnsupportedCaptures => write!(
f,
Expand Down
40 changes: 16 additions & 24 deletions regex-automata/src/nfa/thompson/pikevm.rs
Original file line number Diff line number Diff line change
Expand Up @@ -275,15 +275,6 @@ impl Builder {
/// construction of the NFA itself will of course be ignored, since the NFA
/// given here is already built.
pub fn build_from_nfa(&self, nfa: NFA) -> Result<PikeVM, BuildError> {
// If the NFA has no captures, then the PikeVM doesn't work since it
// relies on them in order to report match locations. However, in
// the special case of an NFA with no patterns, it is allowed, since
// no matches can ever be produced. And importantly, an NFA with no
// patterns has no capturing groups anyway, so this is necessary to
// permit the PikeVM to work with regexes with zero patterns.
if !nfa.has_capture() && nfa.pattern_len() > 0 {
return Err(BuildError::missing_captures());
}
nfa.look_set_any().available().map_err(BuildError::word)?;
Ok(PikeVM { config: self.config.clone(), nfa })
}
Expand Down Expand Up @@ -828,16 +819,16 @@ impl PikeVM {
if self.get_nfa().pattern_len() == 1 {
let mut slots = [None, None];
let pid = self.search_slots(cache, &input, &mut slots)?;
let start = slots[0].unwrap().get();
let end = slots[1].unwrap().get();
let start = slots[0]?.get();
let end = slots[1]?.get();
return Some(Match::new(pid, Span { start, end }));
}
let ginfo = self.get_nfa().group_info();
let slots_len = ginfo.implicit_slot_len();
let mut slots = vec![None; slots_len];
let pid = self.search_slots(cache, &input, &mut slots)?;
let start = slots[pid.as_usize() * 2].unwrap().get();
let end = slots[pid.as_usize() * 2 + 1].unwrap().get();
let start = slots[pid.as_usize() * 2]?.get();
let end = slots[pid.as_usize() * 2 + 1]?.get();
Some(Match::new(pid, Span { start, end }))
}

Expand Down Expand Up @@ -1123,15 +1114,15 @@ impl PikeVM {
if self.get_nfa().pattern_len() == 1 {
let mut enough = [None, None];
let got = self.search_slots_imp(cache, input, &mut enough);
// This is OK because we know `enough_slots` is strictly bigger
// than `slots`, otherwise this special case isn't reached.
// This is OK because we know `enough` is strictly bigger than
// `slots`, otherwise this special case isn't reached.
slots.copy_from_slice(&enough[..slots.len()]);
return got;
}
let mut enough = vec![None; min];
let got = self.search_slots_imp(cache, input, &mut enough);
// This is OK because we know `enough_slots` is strictly bigger than
// `slots`, otherwise this special case isn't reached.
// This is OK because we know `enough` is strictly bigger than `slots`,
// otherwise this special case isn't reached.
slots.copy_from_slice(&enough[..slots.len()]);
got
}
Expand Down Expand Up @@ -2108,15 +2099,16 @@ impl SlotTable {
// if a 'Captures' has fewer slots, e.g., none at all or only slots
// for tracking the overall match instead of all slots for every
// group.
self.slots_for_captures = nfa.group_info().slot_len();
self.slots_for_captures = core::cmp::max(
self.slots_per_state,
nfa.pattern_len().checked_mul(2).unwrap(),
);
let len = nfa
.states()
.len()
// We add 1 so that our last row is always empty. We use it as
// "scratch" space for computing the epsilon closure off of the
// starting state.
.checked_add(1)
.and_then(|x| x.checked_mul(self.slots_per_state))
.checked_mul(self.slots_per_state)
// Add space to account for scratch space used during a search.
.and_then(|x| x.checked_add(self.slots_for_captures))
// It seems like this could actually panic on legitimate inputs on
// 32-bit targets, and very likely to panic on 16-bit. Should we
// somehow convert this to an error? What about something similar
Expand Down Expand Up @@ -2170,7 +2162,7 @@ impl SlotTable {
/// compute an epsilon closure outside of the user supplied regex, and thus
/// never want it to have any capturing slots set.
fn all_absent(&mut self) -> &mut [Option<NonMaxUsize>] {
let i = self.table.len() - self.slots_per_state;
let i = self.table.len() - self.slots_for_captures;
&mut self.table[i..i + self.slots_for_captures]
}
}
Expand Down

0 comments on commit 0e57a63

Please sign in to comment.