Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds APIs for accessing encoding of raw stream items #760

Merged
merged 5 commits into from
May 7, 2024
Merged

Conversation

zslayton
Copy link
Contributor

@zslayton zslayton commented May 4, 2024

In aggregate, the changes in this PR make it straightforward to access a "lower level" view of a given item in an Ion stream, determining whether it was an input literal, part of a template, or a constructed value. This enables the development of a variety of tools.

Changes:

  • Uses the bump allocator to handle text escape processing, allowing RawSymbolTokenRef to hold a reference to a &'bump str instead of potentially owning a String. This change allows the RawSymbolTokenRef type to implement Copy, which in turn allows all of the LazyExpandedValue- and LazyValue-related types to also implement Copy.
  • Removes the RawSymbolToken type, which is now redundant to the RawSymbolTokenRef type.
  • Adds a Span type that provides access to the input bytes that comprised various raw stream items.
  • Adds a LazyRawVersionMarker trait and per-encoding impls that can provide a Span upon request.
  • Moves the field encoding information from the LazyDecoder::Value types into implementations of a new LazyRawFieldName trait. This fixes issue Explore moving struct field info out of Lazy*Value and into Lazy*Field #631.
  • Adds a LazyRawField trait and per-encoding impls that can provide a Span upon request.
  • Adds an UnexpandedField type that can represent both raw struct fields and struct fields from a template body. This simplified the code for expanding structs.
  • Adds methods to convert container types back to the general value type.
  • Adds EncodedBinaryValueData_1_0 and EncodedBinaryAnnotations_1_0 types that can be used to access spans and ranges for the various components of a binary 1.0 value.

This patch exposes many functions and types which we likely wish to feature gate, but that change is being left for a future PR.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Zack Slayton added 3 commits May 4, 2024 10:28
* Uses the bump allocator to handle text escape processing, allowing
  `RawSymbolTokenRef` to hold a reference to a `&'bump str` instead
  of potentially owning a `String`. This change allows the
  `RawSymbolTokenRef` type to implement `Copy`, which in turn allows
  all of the `LazyExpandedValue`- and `LazyValue`-related types
  to also implement `Copy`.
* Removes the `RawSymbolToken` type, which is now redundant to the
  `RawSymbolTokenRef` type.
* Adds a `Span` type that provides access to the input bytes that
  comprised various raw stream items.
* Adds a `LazyRawVersionMarker` trait and per-encoding impls that can
  provide a `Span` upon request.
* Adds a `LazyRawField` trait and per-encoding impls that can provide
  a `Span` upon request.
* Adds an `UnexpandedField` type that can represent both raw struct
  fields and struct fields from a template body. This simplified
  the code for expanding structs.
* Adds methods to convert container types back to the general value
  type.
* Adds `EncodedBinaryValueData_1_0` and `EncodedBinaryAnnotations_1_0`
  types that can be used to access spans and ranges for the various
  components of a binary 1.0 value.

This patch exposes many functions and types which we likely wish
to feature gate, but that change is being left for a future PR.
Copy link
Contributor Author

@zslayton zslayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ PR tour

.write(15, RawSymbolToken::SymbolId(19))? // format
.write(13, RawSymbolTokenRef::SymbolId(17))? // logger name
.write(14, RawSymbolTokenRef::SymbolId(18))? // log level
.write(15, RawSymbolTokenRef::SymbolId(19))? // format
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Now that RawSymbolTokenRef can hold a reference to a decoded string in the bump allocator, there's no longer a need for a separate RawSymbolToken type that can own a string.

Down the line, I'd like to rename RawSymbolTokenRef to simply RawSymbol.

Comment on lines +87 to +92
pub enum LazyRawAnyVersionMarkerKind<'top> {
Text_1_0(LazyRawTextVersionMarker_1_0<'top>),
Binary_1_0(LazyRawBinaryVersionMarker_1_0<'top>),
Text_1_1(LazyRawTextVersionMarker_1_1<'top>),
Binary_1_1(LazyRawBinaryVersionMarker_1_1<'top>),
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ There's now a per-encoding version marker type that can provide a Span of its input bytes. Previously we only surfaced the (major: u8, minor: u8) pair, but didn't have information about its encoding (particularly its location).

Comment on lines +97 to +102
match self.encoding {
Text_1_0(marker) => marker.span(),
Binary_1_0(marker) => marker.span(),
Text_1_1(marker) => marker.span(),
Binary_1_1(marker) => marker.span(),
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This file (LazyRawAny_____ types) is entirely enum dispatch. A while ago I looked to see if I could use the enum_dispatch crate to eliminate all the repetition, but the use of lifetimes was an obstacle. I have not checked recently to see if that limitation has been overcome.

pub fn kind(&self) -> LazyRawValueKind<'top> {
self.encoding
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Allows an application to access the encoding-specific data type that underlies this value.

@@ -1211,7 +1412,7 @@ mod tests {

assert!(matches!(
reader.next(&allocator)?,
LazyRawStreamItem::<AnyEncoding>::EndOfStream
LazyRawStreamItem::<AnyEncoding>::EndOfStream(_)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ The EndOfStream item now returns an EndPosition that can provide its range and span. This is not especially useful in and of itself, but it allows the RawStreamItem type that contains it to also provide a range and span for every item.

Comment on lines -39 to -47
fn token_is_identifier(token: &str) -> bool {
if token.is_empty() {
return false;
}
let mut chars = token.chars();
let first = chars.next().unwrap();
(first == '$' || first == '_' || first.is_ascii_alphabetic())
&& chars.all(|c| c == '$' || c == '_' || c.is_ascii_alphanumeric())
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ The methods in this section were originally copied from the IonValueFormatter. This PR introduces the IoFmtShim type that allows them to be re-used instead of copied.

Comment on lines -323 to +396
let mut iter_adapter = raw_struct_iter.map(
|field: IonResult<LazyRawFieldExpr<'top, D>>| match field? {
RawFieldExpr::NameValuePair(name, RawValueExpr::MacroInvocation(m)) => {
let resolved_invocation = m.resolve(context)?;
Ok(RawFieldExpr::NameValuePair(
name,
RawValueExpr::MacroInvocation(resolved_invocation.into()),
))
}
RawFieldExpr::NameValuePair(name, RawValueExpr::ValueLiteral(value)) => Ok(
RawFieldExpr::NameValuePair(name, RawValueExpr::ValueLiteral(value)),
),
RawFieldExpr::MacroInvocation(invocation) => {
let resolved_invocation = invocation.resolve(context)?;
Ok(RawFieldExpr::MacroInvocation(resolved_invocation.into()))
}
},
);
Self::next_field_from(context, state, e_exp_evaluator, &mut iter_adapter)
Self::next_field_from(context, state, e_exp_evaluator, raw_struct_iter)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ The introduction of the UnexpandedField enum made it possible to eliminate a lot of the complexity and generics involved in struct expansion.

Comment on lines -89 to -104
/// This helper function takes a parser and returns a closure that performs the same parsing
/// but prints the Result before returning the output. This is handy for debugging.
// A better implementation would use a macro to auto-generate the label from the file name and
// line number.
fn dbg_parse<I: Debug, O: Debug, E: Debug, P: Parser<I, O, E>>(
label: &'static str,
mut parser: P,
) -> impl Parser<I, O, E> {
move |input: I| {
let result = parser.parse(input);
#[cfg(debug_assertions)]
println!("{}: {:?}", label, result);
result
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This method would cause a clippy warning in release builds because label wouldn't be used. The text reader is spec compliant, so I feel we don't really need this debug helper anymore.

Comment on lines -470 to -472
// TODO: We're discarding the name encoding information here. When we revise our field name
// storage strategy[1], we should make sure to capture this for tooling's sake.
// [1]: https://github.com/amazon-ion/ion-rust/issues/631
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ 🎉

pub fn read<'data>(&self, matched_input: TextBufferView<'data>) -> IonResult<StrRef<'data>> {
pub fn read<'data>(
&self,
allocator: &'data BumpAllocator,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Text escapes and base64 decoding now happen in bump-allocated scratch space instead of heap-allocated Strings and Vecs.

@zslayton zslayton requested review from popematt and nirosys May 4, 2024 15:41
@zslayton zslayton marked this pull request as ready for review May 4, 2024 15:42
src/lazy/any_encoding.rs Outdated Show resolved Hide resolved
src/lazy/any_encoding.rs Outdated Show resolved Hide resolved
src/lazy/binary/encoded_value.rs Outdated Show resolved Hide resolved
src/lazy/binary/immutable_buffer.rs Show resolved Hide resolved
src/lazy/binary/raw/struct.rs Outdated Show resolved Hide resolved
src/lazy/struct.rs Show resolved Hide resolved
src/lazy/span.rs Outdated Show resolved Hide resolved
Comment on lines +188 to +191
/// Returns the stream byte offset range that this buffer contains.
pub fn range(&self) -> Range<usize> {
self.offset..self.offset + self.len()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should TextBufferView just implement HasSpan/HasRange?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could, sure. This method predates those traits.

TextBufferView is not a public-facing type though, so there's not really a need for it IMO.

src/lazy/text/encoded_value.rs Show resolved Hide resolved
src/lazy/value_ref.rs Outdated Show resolved Hide resolved
@zslayton zslayton requested a review from popematt May 7, 2024 14:28
src/lazy/raw_stream_item.rs Outdated Show resolved Hide resolved
@zslayton zslayton merged commit 892ef2d into main May 7, 2024
32 checks passed
@zslayton zslayton deleted the tooling-access branch May 7, 2024 16:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants