Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge consequent text and CDATA events into one string #520

Merged
merged 10 commits into from
Feb 20, 2023

Conversation

Mingun
Copy link
Collaborator

@Mingun Mingun commented Dec 4, 2022

This PR fixes #474 and introduces a way to read current parser configuration, which was impossible before that.

I've changed the way how configuration is accessed and changed: instead of having functions to change configuration flags, readers now provides a reference to a Config object. Immutable and mutable references are provided. This new feature is used to temporary disable trimming while read text events in serde Deserializer.

After fixing #516, all configuration flags are safe to changed at any time, because their does not change the internal state of a reader in a user-visible way (for example, the expand_empty_elements changes an internal state of a reader, but that change is rolled back after next call to read_event, so user cannot see it consequences. It is safe to disable this setting just after read fake Start event and still get a fake End event after that).

@Mingun Mingun added bug enhancement serde Issues related to mapping from Rust types to XML labels Dec 4, 2022
@Mingun Mingun requested a review from dralley December 4, 2022 15:50
@codecov-commenter
Copy link

codecov-commenter commented Dec 4, 2022

Codecov Report

Merging #520 (43634b2) into master (221b57d) will increase coverage by 1.94%.
The diff coverage is 95.70%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@            Coverage Diff             @@
##           master     #520      +/-   ##
==========================================
+ Coverage   61.25%   63.20%   +1.94%     
==========================================
  Files          32       32              
  Lines       15654    16439     +785     
==========================================
+ Hits         9589    10390     +801     
+ Misses       6065     6049      -16     
Flag Coverage Δ
unittests 63.20% <95.70%> (+1.94%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/de/simple_type.rs 93.65% <ø> (ø)
src/de/var.rs 90.62% <55.55%> (-0.93%) ⬇️
src/de/map.rs 83.59% <70.00%> (+3.01%) ⬆️
src/events/mod.rs 72.38% <84.61%> (+0.74%) ⬆️
src/de/mod.rs 70.64% <96.87%> (+14.43%) ⬆️
src/reader/mod.rs 94.94% <100.00%> (ø)
src/lib.rs 13.31% <0.00%> (-0.07%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@Mingun
Copy link
Collaborator Author

Mingun commented Dec 4, 2022

I've also checked, that XmlBeans is able to parse splitted numbers:

12<!--comment--><![CDATA[34]]>

is a valid representation for integer 1234, which means that XmlBeans uses the same approach as was selected here: skip all comments and PIs, merge all consequent text and CDATA events and then parse to the required type.

@Mingun Mingun marked this pull request as draft December 4, 2022 18:05
@Mingun
Copy link
Collaborator Author

Mingun commented Dec 4, 2022

I found, that, because of lookahead, we can cache trimmed event and return incorrect results (= trim spaces inside the string). More work needed

@Mingun Mingun force-pushed the merge-text-nodes branch 2 times, most recently from 87e2dc6 to e2ee98a Compare January 22, 2023 20:03
@Mingun Mingun marked this pull request as ready for review January 22, 2023 20:18
@Mingun
Copy link
Collaborator Author

Mingun commented Jan 22, 2023

Ok, it's ready for review now. Of course, it would be great, if quick-xml will provide an API to read correctly trimmed text events from Reader, but this will require lookahead from one event (to trim start) to undefinitely number of events (to trim end, you need to skip all comments / processing instructions to determine if you need to trim end or not).

So I do not provide a low-level API, but provide a BytesText::inplace_trim_start() and BytesText::inplace_trim_end() instead. Then users could use that primitives to trim their events.

@Mingun Mingun mentioned this pull request Jan 30, 2023
13 tasks

### Bug Fixes

- [#537]: Restore ability to deserialize attributes that represents XML namespace
mappings (`xmlns:xxx`) that was broken since [#490]
- [#520]: Merge consequent (delimited only by comments and processing instructions)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this only applies to the serde deserializer, that should probably be noted here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if this is something that isn't practical to handle in the low-level API for the time being, then it would be good to document the limitation. And if it may be possible to do in the future (perhaps requiring API changes to do so) then we should clone #474.

I appreciate the detailed writeup of the issue, by the way.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I clearly indicated that this applies only to the deserializer. The documentation thing probably solved by the last commit in this branch (16fca58)?

Mingun and others added 10 commits February 20, 2023 21:21
…ate events to DeEvents

Actual conversion will be added in follow up commits
We want to merge consequent text and CDATA content into one text, and
because only text parts could be escaped, now we unescape them always
failures (36):
    de::tests::merge_text::cdata_and_cdata
    de::tests::merge_text::cdata_and_text
    de::tests::merge_text::comment_between::cdata
    de::tests::merge_text::comment_between::cdata_and_cdata
    de::tests::merge_text::comment_between::cdata_and_text
    de::tests::merge_text::comment_between::empty_cdata_and_text
    de::tests::merge_text::comment_between::text
    de::tests::merge_text::comment_between::text_and_cdata
    de::tests::merge_text::comment_between::text_and_empty_cdata
    de::tests::merge_text::empty_cdata_and_text
    de::tests::merge_text::pi_between::cdata
    de::tests::merge_text::pi_between::cdata_and_cdata
    de::tests::merge_text::pi_between::cdata_and_text
    de::tests::merge_text::pi_between::empty_cdata_and_text
    de::tests::merge_text::pi_between::text
    de::tests::merge_text::pi_between::text_and_cdata
    de::tests::merge_text::pi_between::text_and_empty_cdata
    de::tests::merge_text::text_and_cdata
    de::tests::merge_text::text_and_empty_cdata
    de::tests::triples::cdata::cdata::cdata
    de::tests::triples::cdata::cdata::end
    de::tests::triples::cdata::cdata::eof
    de::tests::triples::cdata::cdata::start
    de::tests::triples::cdata::cdata::text
    de::tests::triples::cdata::text::cdata
    de::tests::triples::cdata::text::end
    de::tests::triples::cdata::text::eof
    de::tests::triples::cdata::text::start
    de::tests::triples::start::cdata::cdata
    de::tests::triples::start::cdata::text
    de::tests::triples::start::text::cdata
    de::tests::triples::text::cdata::cdata
    de::tests::triples::text::cdata::end
    de::tests::triples::text::cdata::eof
    de::tests::triples::text::cdata::start
    de::tests::triples::text::cdata::text
fixed:
    de::tests::merge_text::comment_between::text_and_empty_cdata
    de::tests::merge_text::pi_between::text_and_empty_cdata
    de::tests::merge_text::text_and_empty_cdata
@Mingun Mingun merged commit f0b3420 into tafia:master Feb 20, 2023
@Mingun Mingun deleted the merge-text-nodes branch February 20, 2023 16:55
@Pastequee
Copy link

Hello, since my question is quite related and issues have been opened for this before, I put my message here, feels more relevant.
Thanks to this PR, the trimming process is much clearer and with little modifications, it is possible to disable end trimming without breaking anything. But removing start trimming still breaks things.
But the thing is, even if it is now partially possible to do so, there is no way of doing it through some sort of config struct or something like that when doing quick_xml::de::from_str, we are still blocked with default trim which is annoying.
Is there a solution planed for this or this needs to be discussed/think @Mingun ?

To be sure, I am only talking here about ::Text and ::CData trimming and not the rest.
Thank you for your time

@Mingun
Copy link
Collaborator Author

Mingun commented Feb 22, 2023

It is possible to add a setting to enable / disable trimming. I'll accept such a PR.

To be clear: CData is never trimmed because it's purpose to represent data exactly as it written in XML.

crapStone added a commit to Calciumdibromid/CaBr2 that referenced this pull request Mar 16, 2023
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [quick-xml](https://github.com/tafia/quick-xml) | dependencies | minor | `0.27.1` -> `0.28.0` |

---

### Release Notes

<details>
<summary>tafia/quick-xml</summary>

### [`v0.28.0`](https://github.com/tafia/quick-xml/blob/HEAD/Changelog.md#&#8203;0280----2023-03-13)

[Compare Source](tafia/quick-xml@v0.27.1...v0.28.0)

##### New Features

-   [#&#8203;541]: (De)serialize specially named `$text` enum variant in [externally tagged]
    enums to / from textual content
-   [#&#8203;556]: `to_writer` and `to_string` now accept `?Sized` types
-   [#&#8203;556]: Add new `to_writer_with_root` and `to_string_with_root` helper functions
-   [#&#8203;520]: Add methods `BytesText::inplace_trim_start` and `BytesText::inplace_trim_end`
    to trim leading and trailing spaces from text events
-   [#&#8203;565]: Allow deserialize special field names `$value` and `$text` into borrowed
    fields when use serde deserializer
-   [#&#8203;568]: Rename `Writter::inner` into `Writter::get_mut`
-   [#&#8203;568]: Add method `Writter::get_ref`
-   [#&#8203;569]: Rewrite the `Reader::read_event_into_async` as an async fn, making the future `Send` if possible.
-   [#&#8203;571]: Borrow element names (`<element>`) when deserialize with serde.
    This change allow to deserialize into `HashMap<&str, T>`, for example
-   [#&#8203;573]: Add basic support for async byte writers via tokio's `AsyncWrite`.

##### Bug Fixes

-   [#&#8203;537]: Restore ability to deserialize attributes that represents XML namespace
    mappings (`xmlns:xxx`) that was broken since [#&#8203;490]
-   [#&#8203;510]: Fix an error of deserialization of `Option<T>` fields where `T` is some
    sequence type (for example, `Vec` or tuple)
-   [#&#8203;540]: Fix a compilation error (probably a rustc bug) in some circumstances.
    `Serializer::new` and `Serializer::with_root` now accepts only references to `Write`r.
-   [#&#8203;520]: Merge consequent (delimited only by comments and processing instructions)
    texts and CDATA when deserialize using serde deserializer. `DeEvent::Text` and
    `DeEvent::CData` events was replaced by `DeEvent::Text` with merged content.
    The same behavior for the `Reader` does not implemented (yet?) and should be
    implemented manually
-   [#&#8203;562]: Correctly set minimum required version of memchr dependency to 2.1
-   [#&#8203;565]: Correctly set minimum required version of tokio dependency to 1.10
-   [#&#8203;565]: Fix compilation error when build with serde <1.0.139

[externally tagged]: https://serde.rs/enum-representations.html#externally-tagged

[#&#8203;490]: tafia/quick-xml#490

[#&#8203;510]: tafia/quick-xml#510

[#&#8203;520]: tafia/quick-xml#520

[#&#8203;537]: tafia/quick-xml#537

[#&#8203;540]: tafia/quick-xml#540

[#&#8203;541]: tafia/quick-xml#541

[#&#8203;556]: tafia/quick-xml#556

[#&#8203;562]: tafia/quick-xml#562

[#&#8203;565]: tafia/quick-xml#565

[#&#8203;568]: tafia/quick-xml#568

[#&#8203;569]: tafia/quick-xml#569

[#&#8203;571]: tafia/quick-xml#571

[#&#8203;573]: tafia/quick-xml#573

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNS42LjAiLCJ1cGRhdGVkSW5WZXIiOiIzNS42LjAifQ==-->

Co-authored-by: cabr2-bot <cabr2.help@gmail.com>
Co-authored-by: crapStone <crapstone01@gmail.com>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1818
Reviewed-by: crapStone <crapstone@noreply.codeberg.org>
Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug enhancement serde Issues related to mapping from Rust types to XML
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Merge text and CDATA events in serde deserializer
4 participants