Optional unescaping with serde #581

Closed
pigeonhands opened this issue Mar 29, 2023 · 4 comments · Fixed by #583
Labels
enhancement help wanted serde Issues related to mapping from Rust types to XML

Comments

@pigeonhands
Contributor

Currently, quick_xml::de::from_str fails to parse XML that uses entities declared in the doctype.

<!DOCTYPE dict[
    <!ENTITY unc "unclassified">
]>

<dict>
    <word>&unc;</word>
</dict>

let data = fs::read_to_string("../data/dict.xml").unwrap();
let parsed: Dict = quick_xml::de::from_str(&data).unwrap();

thread 'main' panicked at 'called Result::unwrap() on an Err value: InvalidXml(EscapeError(UnrecognizedSymbol(1..4, "unc")))'


Could a feature flag or an alternate method be added that disables unescaping and instead always just returns the raw string?

@Mingun added the enhancement, help wanted, and serde (Issues related to mapping from Rust types to XML) labels Mar 29, 2023
@Mingun
Collaborator

Mingun commented Mar 29, 2023

I'll accept a PR that adds the ability to set up an entity resolver for Deserializer. In the simplest case the resolver is an Fn(&str) -> Option<&str>, like there, but in that case you will not be able to capture entities from the document. A more powerful solution is to declare a Resolver trait that can capture doctypes and then resolve entities.
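
To illustrate the limitation of the closure form, here is a minimal std-only sketch. `expand_entities` is a hypothetical helper written for this example, not a quick-xml function; it only mirrors the `Fn(&str) -> Option<&str>` shape mentioned above. The entity table has to be filled in by hand, because a plain closure offers no hook through which a parser could pass it the DOCTYPE.

```rust
use std::collections::HashMap;

/// Expands `&name;` references in `text` using `resolve`.
/// Hypothetical helper for illustration only; not part of quick-xml.
fn expand_entities<'e, F>(text: &str, resolve: F) -> Result<String, String>
where
    F: Fn(&str) -> Option<&'e str>,
{
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(start) = rest.find('&') {
        // Copy everything before the reference unchanged.
        out.push_str(&rest[..start]);
        let after = &rest[start + 1..];
        let end = after
            .find(';')
            .ok_or_else(|| "unterminated entity reference".to_string())?;
        let name = &after[..end];
        match resolve(name) {
            Some(value) => out.push_str(value),
            None => return Err(format!("unrecognized entity `{}`", name)),
        }
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    Ok(out)
}

fn main() {
    // The table is populated manually, before any parsing happens.
    let mut known = HashMap::new();
    known.insert("unc", "unclassified");
    let resolve = |name: &str| known.get(name).copied();

    println!("{:?}", expand_entities("<word>&unc;</word>", &resolve));
    println!("{:?}", expand_entities("&other;", &resolve));
}
```

A trait with a separate capture step removes the need to know the entity names ahead of time.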

@pigeonhands
Contributor Author

@Mingun I'm not really sure how to go about this. I have created a fork that implements a resolver, but it's not going to be able to access the entities defined in the document because that parsing is handled outside the de module.

pigeonhands@ddacb31

@Mingun
Collaborator

Mingun commented Mar 29, 2023

Wow! Your implementation looks promising! I think it would be better if you created a draft PR so we can discuss the actual implementation in context.

You need to capture DocType events in this match:

quick-xml/src/de/mod.rs

Lines 2745 to 2765 in 2e9123a

fn trim<'a>(&mut self, event: Event<'a>) -> Option<PayloadEvent<'a>> {
    let (event, trim_next_event) = match event {
        Event::Start(e) => (PayloadEvent::Start(e), true),
        Event::End(e) => (PayloadEvent::End(e), true),
        Event::Eof => (PayloadEvent::Eof, true),
        // Do not trim next text event after Text or CDATA event
        Event::CData(e) => (PayloadEvent::CData(e), false),
        Event::Text(mut e) => {
            // If event is empty after trimming, skip it
            if self.trim_start && e.inplace_trim_start() {
                return None;
            }
            (PayloadEvent::Text(e), false)
        }
        _ => return None,
    };
    self.trim_start = trim_next_event;
    Some(event)
}
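
As a rough illustration of what "capturing DocType events in this match" could look like, the sketch below uses simplified stand-in types. `Event`, `Filter`, and the `doctypes` field are assumptions made for this example and are not the real quick-xml types; how the actual change treats the trimming state is not shown in this thread.

```rust
/// Simplified stand-ins for the reader events; not the real quick-xml types.
#[derive(Debug, PartialEq)]
enum Event {
    Start(String),
    Text(String),
    DocType(String),
    Comment(String),
    Eof,
}

/// Hypothetical filter: forwards payload events and records DOCTYPE bodies.
struct Filter {
    trim_start: bool,
    doctypes: Vec<String>,
}

impl Filter {
    fn new() -> Self {
        Filter { trim_start: true, doctypes: Vec::new() }
    }

    fn trim(&mut self, event: Event) -> Option<Event> {
        let (event, trim_next_event) = match event {
            Event::Start(e) => (Event::Start(e), true),
            Event::Eof => (Event::Eof, true),
            Event::Text(mut e) => {
                if self.trim_start {
                    e = e.trim_start().to_string();
                    // If the text is empty after trimming, skip it.
                    if e.is_empty() {
                        return None;
                    }
                }
                (Event::Text(e), false)
            }
            // New arm: keep the declaration for a resolver, emit nothing.
            Event::DocType(e) => {
                self.doctypes.push(e);
                return None;
            }
            // Other events are skipped and leave `trim_start` untouched.
            Event::Comment(_) => return None,
        };
        self.trim_start = trim_next_event;
        Some(event)
    }
}

fn main() {
    let mut filter = Filter::new();
    let events = vec![
        Event::DocType("dict[ <!ENTITY unc \"unclassified\"> ]".to_string()),
        Event::Comment("ignored".to_string()),
        Event::Text("\n".to_string()),
        Event::Start("dict".to_string()),
        Event::Text("&unc;".to_string()),
        Event::Eof,
    ];
    for event in events {
        if let Some(payload) = filter.trim(event) {
            println!("{:?}", payload);
        }
    }
    println!("captured: {:?}", filter.doctypes);
}
```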

To do that, change the EntityResolver trait to the following (feel free to fix the grammar if I made mistakes):

/// Used to resolve unknown entities while parsing
///
/// # Example
/// Add an example here -- you can adapt existing custom_entities.rs example
pub trait EntityResolver {
    /// Called on contents of [`Event::DocType`] to capture declared entities.
    /// Can be called multiple times, for each parsed `<!DOCTYPE >` declaration.
    fn capture(&mut self, doctype: BytesText);

    /// Called when an entity needs to be resolved.
    ///
    /// `None` is returned if a suitable value can not be found.
    /// In that case an [`Error::UnrecognizedSymbol`] will be returned.
    fn resolve<'entity>(&'entity self, entity: &str) -> Option<&'entity str>;
}
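
A minimal sketch of an implementation of a trait with this shape follows. To keep it compilable without the crate, the trait is redeclared locally and `capture` takes `&str` instead of `BytesText`; that substitution, the `DocTypeEntities` type, and the naive `<!ENTITY name "value">` scan are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Local copy of the proposed trait shape. `capture` takes `&str` here
/// (an assumption) so the sketch does not depend on quick-xml's `BytesText`.
trait EntityResolver {
    fn capture(&mut self, doctype: &str);
    fn resolve<'entity>(&'entity self, entity: &str) -> Option<&'entity str>;
}

/// Hypothetical resolver that remembers `<!ENTITY name "value">` declarations.
#[derive(Default)]
struct DocTypeEntities {
    entities: HashMap<String, String>,
}

impl EntityResolver for DocTypeEntities {
    fn capture(&mut self, doctype: &str) {
        // Naive scan: handles only internal entities with double-quoted values.
        let mut rest = doctype;
        while let Some(pos) = rest.find("<!ENTITY") {
            rest = &rest[pos + "<!ENTITY".len()..];
            let open = match rest.find('"') {
                Some(i) => i,
                None => break,
            };
            let name = rest[..open].trim();
            let tail = &rest[open + 1..];
            let close = match tail.find('"') {
                Some(i) => i,
                None => break,
            };
            self.entities
                .insert(name.to_string(), tail[..close].to_string());
            rest = &tail[close + 1..];
        }
    }

    fn resolve<'entity>(&'entity self, entity: &str) -> Option<&'entity str> {
        self.entities.get(entity).map(String::as_str)
    }
}

fn main() {
    let mut resolver = DocTypeEntities::default();
    // Text between `<!DOCTYPE` and the closing `>` of the issue's example.
    resolver.capture("dict[\n    <!ENTITY unc \"unclassified\">\n]");
    println!("{:?}", resolver.resolve("unc"));
    println!("{:?}", resolver.resolve("missing"));
}
```

Returning `Option<&'entity str>` borrowed from `&'entity self` lets the resolver hand out values it captured earlier without cloning them.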

The resolve method does not need to be mutable: it is used only for requesting information. If an implementation needs mutation, it should use interior mutability, via Cell or something similar.
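
As a small sketch of that suggestion, the hypothetical `CountingResolver` below (not part of quick-xml) updates a counter from a `resolve` method that takes `&self`, using `std::cell::Cell`.

```rust
use std::cell::Cell;

/// Hypothetical resolver that counts lookups although `resolve` takes `&self`.
struct CountingResolver {
    lookups: Cell<usize>,
}

impl CountingResolver {
    fn resolve<'entity>(&'entity self, entity: &str) -> Option<&'entity str> {
        // `Cell` allows updating the counter through a shared reference.
        self.lookups.set(self.lookups.get() + 1);
        match entity {
            "unc" => Some("unclassified"),
            _ => None,
        }
    }
}

fn main() {
    let resolver = CountingResolver { lookups: Cell::new(0) };
    println!("{:?}", resolver.resolve("unc"));
    println!("{:?}", resolver.resolve("other"));
    println!("lookups: {}", resolver.lookups.get());
}
```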

@pigeonhands
Contributor Author

@Mingun

The resolve method does not need to be mutable: it is used only for requesting information. If an implementation needs mutation, it should use interior mutability, via Cell or something similar.

I started this a bit too late in the night 😅 It makes a lot more sense now that my brain is a bit more functional.

I have opened a draft PR #583

crapStone added a commit to Calciumdibromid/CaBr2 that referenced this issue Apr 18, 2023
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [quick-xml](https://github.com/tafia/quick-xml) | dependencies | patch | `0.28.1` -> `0.28.2` |

---

### Release Notes

<details>
<summary>tafia/quick-xml</summary>

### [`v0.28.2`](https://github.com/tafia/quick-xml/blob/HEAD/Changelog.md#0282----2023-04-12)

[Compare Source](tafia/quick-xml@v0.28.1...v0.28.2)

##### New Features

-   [#581]: Allow `Deserializer` to set `quick_xml::de::EntityResolver` for
    resolving unknown entities that would otherwise cause the parser to return
    an \[`EscapeError::UnrecognizedSymbol`] error.

##### Misc Changes

-   [#584]: Export `EscapeError` from the crate
-   [#581]: Relax requirements for the `unescape_*` set of functions: they now use
    `FnMut` instead of `Fn` for `resolve_entity` parameters, like `Iterator::map`
    from `std`.

[#581]: tafia/quick-xml#581

[#584]: tafia/quick-xml#584

</details>


Co-authored-by: cabr2-bot <cabr2.help@gmail.com>
Co-authored-by: crapStone <crapstone@noreply.codeberg.org>
Co-authored-by: crapStone <crapstone01@gmail.com>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1862
Reviewed-by: crapStone <crapstone@noreply.codeberg.org>
Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Mingun pushed a commit to danjpgriffin/quick-xml that referenced this issue Jun 9, 2023
Otherwise consecutive `Text` events (possible when they are delimited by
Comment or PI events, which are skipped) will be merged but not trimmed.
That leads to a `Text` event being returned on a call to `deserialize_struct`
or `deserialize_map`, which triggers a `DeError::ExpectedStart` error.

The incorrect trim behavior was introduced in tafia#581, when DocType events began to be processed.
crapStone added a commit to Calciumdibromid/CaBr2 that referenced this issue Jun 29, 2023
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [quick-xml](https://github.com/tafia/quick-xml) | dependencies | minor | `0.28.2` -> `0.29.0` |

---

### Release Notes

<details>
<summary>tafia/quick-xml</summary>

### [`v0.29.0`](https://github.com/tafia/quick-xml/blob/HEAD/Changelog.md#0290----2023-06-13)

[Compare Source](tafia/quick-xml@v0.28.2...v0.29.0)

##### New Features

-   [#601]: Add `serde_helper` module to the crate root with some useful utility
    functions and document the use of enum unit variants as the text content of an element.
-   [#606]: Implement indentation for `AsyncWrite` trait implementations.

##### Bug Fixes

-   [#603]: Fix a regression from [#581] where an XML comment or a processing
    instruction between a `<!DOCTYPE>` and the root element in the file breaks
    deserialization of structs by returning `DeError::ExpectedStart`
-   [#608]: Return a new error `Error::EmptyDocType` on empty doctype instead
    of crashing because of a debug assertion.

##### Misc Changes

-   [#594]: Add a helper macro to help deserialize internally tagged enums
    with Serde, which doesn't work out of the box due to serde limitations.

[#581]: tafia/quick-xml#581

[#594]: tafia/quick-xml#594

[#601]: tafia/quick-xml#601

[#603]: tafia/quick-xml#603

[#606]: tafia/quick-xml#606

[#608]: tafia/quick-xml#608

</details>


Co-authored-by: cabr2-bot <cabr2.help@gmail.com>
Co-authored-by: crapStone <crapstone01@gmail.com>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1940
Reviewed-by: crapStone <crapstone01@gmail.com>
Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>