Merge consequent text and CDATA events into one string #520

Mingun · 2022-12-04T15:50:26Z

This PR fixes #474 and introduces a way to read current parser configuration, which was impossible before that.

I've changed the way how configuration is accessed and changed: instead of having functions to change configuration flags, readers now provides a reference to a Config object. Immutable and mutable references are provided. This new feature is used to temporary disable trimming while read text events in serde Deserializer.

After fixing #516, all configuration flags are safe to changed at any time, because their does not change the internal state of a reader in a user-visible way (for example, the expand_empty_elements changes an internal state of a reader, but that change is rolled back after next call to read_event, so user cannot see it consequences. It is safe to disable this setting just after read fake Start event and still get a fake End event after that).

codecov-commenter · 2022-12-04T16:06:44Z

Codecov Report

Merging #520 (43634b2) into master (221b57d) will increase coverage by 1.94%.
The diff coverage is 95.70%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@            Coverage Diff             @@
##           master     #520      +/-   ##
==========================================
+ Coverage   61.25%   63.20%   +1.94%     
==========================================
  Files          32       32              
  Lines       15654    16439     +785     
==========================================
+ Hits         9589    10390     +801     
+ Misses       6065     6049      -16

Flag	Coverage Δ
unittests	`63.20% <95.70%> (+1.94%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
src/de/simple_type.rs	`93.65% <ø> (ø)`
src/de/var.rs	`90.62% <55.55%> (-0.93%)`	⬇️
src/de/map.rs	`83.59% <70.00%> (+3.01%)`	⬆️
src/events/mod.rs	`72.38% <84.61%> (+0.74%)`	⬆️
src/de/mod.rs	`70.64% <96.87%> (+14.43%)`	⬆️
src/reader/mod.rs	`94.94% <100.00%> (ø)`
src/lib.rs	`13.31% <0.00%> (-0.07%)`	⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Mingun · 2022-12-04T16:07:33Z

I've also checked, that XmlBeans is able to parse splitted numbers:

12<!--comment--><![CDATA[34]]>

is a valid representation for integer 1234, which means that XmlBeans uses the same approach as was selected here: skip all comments and PIs, merge all consequent text and CDATA events and then parse to the required type.

Mingun · 2022-12-04T18:07:34Z

I found, that, because of lookahead, we can cache trimmed event and return incorrect results (= trim spaces inside the string). More work needed

Mingun · 2023-01-22T20:25:24Z

Ok, it's ready for review now. Of course, it would be great, if quick-xml will provide an API to read correctly trimmed text events from Reader, but this will require lookahead from one event (to trim start) to undefinitely number of events (to trim end, you need to skip all comments / processing instructions to determine if you need to trim end or not).

So I do not provide a low-level API, but provide a BytesText::inplace_trim_start() and BytesText::inplace_trim_end() instead. Then users could use that primitives to trim their events.

dralley · 2023-02-04T05:54:30Z

Changelog.md


 ### Bug Fixes

 - [#537]: Restore ability to deserialize attributes that represents XML namespace
  mappings (`xmlns:xxx`) that was broken since [#490]
+- [#520]: Merge consequent (delimited only by comments and processing instructions)


If this only applies to the serde deserializer, that should probably be noted here.

Also, if this is something that isn't practical to handle in the low-level API for the time being, then it would be good to document the limitation. And if it may be possible to do in the future (perhaps requiring API changes to do so) then we should clone #474.

I appreciate the detailed writeup of the issue, by the way.

I clearly indicated that this applies only to the deserializer. The documentation thing probably solved by the last commit in this branch (16fca58)?

…ate events to DeEvents Actual conversion will be added in follow up commits

We want to merge consequent text and CDATA content into one text, and because only text parts could be escaped, now we unescape them always

failures (36): de::tests::merge_text::cdata_and_cdata de::tests::merge_text::cdata_and_text de::tests::merge_text::comment_between::cdata de::tests::merge_text::comment_between::cdata_and_cdata de::tests::merge_text::comment_between::cdata_and_text de::tests::merge_text::comment_between::empty_cdata_and_text de::tests::merge_text::comment_between::text de::tests::merge_text::comment_between::text_and_cdata de::tests::merge_text::comment_between::text_and_empty_cdata de::tests::merge_text::empty_cdata_and_text de::tests::merge_text::pi_between::cdata de::tests::merge_text::pi_between::cdata_and_cdata de::tests::merge_text::pi_between::cdata_and_text de::tests::merge_text::pi_between::empty_cdata_and_text de::tests::merge_text::pi_between::text de::tests::merge_text::pi_between::text_and_cdata de::tests::merge_text::pi_between::text_and_empty_cdata de::tests::merge_text::text_and_cdata de::tests::merge_text::text_and_empty_cdata de::tests::triples::cdata::cdata::cdata de::tests::triples::cdata::cdata::end de::tests::triples::cdata::cdata::eof de::tests::triples::cdata::cdata::start de::tests::triples::cdata::cdata::text de::tests::triples::cdata::text::cdata de::tests::triples::cdata::text::end de::tests::triples::cdata::text::eof de::tests::triples::cdata::text::start de::tests::triples::start::cdata::cdata de::tests::triples::start::cdata::text de::tests::triples::start::text::cdata de::tests::triples::text::cdata::cdata de::tests::triples::text::cdata::end de::tests::triples::text::cdata::eof de::tests::triples::text::cdata::start de::tests::triples::text::cdata::text

fixed: de::tests::merge_text::comment_between::text_and_empty_cdata de::tests::merge_text::pi_between::text_and_empty_cdata de::tests::merge_text::text_and_empty_cdata

Pastequee · 2023-02-22T16:11:59Z

Hello, since my question is quite related and issues have been opened for this before, I put my message here, feels more relevant.
Thanks to this PR, the trimming process is much clearer and with little modifications, it is possible to disable end trimming without breaking anything. But removing start trimming still breaks things.
But the thing is, even if it is now partially possible to do so, there is no way of doing it through some sort of config struct or something like that when doing quick_xml::de::from_str, we are still blocked with default trim which is annoying.
Is there a solution planed for this or this needs to be discussed/think @Mingun ?

To be sure, I am only talking here about ::Text and ::CData trimming and not the rest.
Thank you for your time

Mingun · 2023-02-22T17:56:05Z

It is possible to add a setting to enable / disable trimming. I'll accept such a PR.

To be clear: CData is never trimmed because it's purpose to represent data exactly as it written in XML.

This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [quick-xml](https://github.com/tafia/quick-xml) | dependencies | minor | `0.27.1` -> `0.28.0` | --- ### Release Notes <details> <summary>tafia/quick-xml</summary> ### [`v0.28.0`](https://github.com/tafia/quick-xml/blob/HEAD/Changelog.md#0280----2023-03-13) [Compare Source](tafia/quick-xml@v0.27.1...v0.28.0) ##### New Features - [#541]: (De)serialize specially named `$text` enum variant in [externally tagged] enums to / from textual content - [#556]: `to_writer` and `to_string` now accept `?Sized` types - [#556]: Add new `to_writer_with_root` and `to_string_with_root` helper functions - [#520]: Add methods `BytesText::inplace_trim_start` and `BytesText::inplace_trim_end` to trim leading and trailing spaces from text events - [#565]: Allow deserialize special field names `$value` and `$text` into borrowed fields when use serde deserializer - [#568]: Rename `Writter::inner` into `Writter::get_mut` - [#568]: Add method `Writter::get_ref` - [#569]: Rewrite the `Reader::read_event_into_async` as an async fn, making the future `Send` if possible. - [#571]: Borrow element names (`<element>`) when deserialize with serde. This change allow to deserialize into `HashMap<&str, T>`, for example - [#573]: Add basic support for async byte writers via tokio's `AsyncWrite`. ##### Bug Fixes - [#537]: Restore ability to deserialize attributes that represents XML namespace mappings (`xmlns:xxx`) that was broken since [#490] - [#510]: Fix an error of deserialization of `Option<T>` fields where `T` is some sequence type (for example, `Vec` or tuple) - [#540]: Fix a compilation error (probably a rustc bug) in some circumstances. `Serializer::new` and `Serializer::with_root` now accepts only references to `Write`r. - [#520]: Merge consequent (delimited only by comments and processing instructions) texts and CDATA when deserialize using serde deserializer. `DeEvent::Text` and `DeEvent::CData` events was replaced by `DeEvent::Text` with merged content. The same behavior for the `Reader` does not implemented (yet?) and should be implemented manually - [#562]: Correctly set minimum required version of memchr dependency to 2.1 - [#565]: Correctly set minimum required version of tokio dependency to 1.10 - [#565]: Fix compilation error when build with serde <1.0.139 [externally tagged]: https://serde.rs/enum-representations.html#externally-tagged [#490]: tafia/quick-xml#490 [#510]: tafia/quick-xml#510 [#520]: tafia/quick-xml#520 [#537]: tafia/quick-xml#537 [#540]: tafia/quick-xml#540 [#541]: tafia/quick-xml#541 [#556]: tafia/quick-xml#556 [#562]: tafia/quick-xml#562 [#565]: tafia/quick-xml#565 [#568]: tafia/quick-xml#568 [#569]: tafia/quick-xml#569 [#571]: tafia/quick-xml#571 [#573]: tafia/quick-xml#573 </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).  Co-authored-by: cabr2-bot <cabr2.help@gmail.com> Co-authored-by: crapStone <crapstone01@gmail.com> Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1818 Reviewed-by: crapStone <crapstone@noreply.codeberg.org> Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org> Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>

Mingun added bug enhancement serde Issues related to mapping from Rust types to XML labels Dec 4, 2022

Mingun requested a review from dralley December 4, 2022 15:50

Mingun marked this pull request as draft December 4, 2022 18:05

dralley mentioned this pull request Dec 15, 2022

gh: add pull request template rpm-rs/rpm#58

Merged

Mingun mentioned this pull request Dec 29, 2022

Disable trim_text in Deserializer from_reader #285

Open

Mingun force-pushed the merge-text-nodes branch 2 times, most recently from 87e2dc6 to e2ee98a Compare January 22, 2023 20:03

Mingun marked this pull request as ready for review January 22, 2023 20:18

Mingun mentioned this pull request Jan 30, 2023

Release 0.28.0 #549

Closed

13 tasks

dralley reviewed Feb 4, 2023

View reviewed changes

Mingun force-pushed the merge-text-nodes branch 2 times, most recently from 16fca58 to 43634b2 Compare February 6, 2023 14:58

Mingun force-pushed the merge-text-nodes branch from 43634b2 to 30636d3 Compare February 17, 2023 14:13

dralley approved these changes Feb 20, 2023

View reviewed changes

Mingun and others added 10 commits February 20, 2023 21:21

Create an intermediate event and a reader that will convert intermedi…

d0fa030

…ate events to DeEvents Actual conversion will be added in follow up commits

Always unescape text content

2b88487

We want to merge consequent text and CDATA content into one text, and because only text parts could be escaped, now we unescape them always

Decode texts on time of converting events from PayloadEvent to DeEvent

da0f6fe

Reformat read_string_impl to increase readability

fc6c674

Trim text events during conversion from Event to PayloadEvent

6628898

Trim text events from end during conversion from PayloadEvent to DeEvent

488b70e

Do not trim spaces in events using Reader capabilities

c4a1eb4

fixed: de::tests::merge_text::comment_between::text_and_empty_cdata de::tests::merge_text::pi_between::text_and_empty_cdata de::tests::merge_text::text_and_empty_cdata

Merge consequent text events

6a11473

Add warning to the trim_text and trim_text_end methods

ac1ad0c

Mingun force-pushed the merge-text-nodes branch from 30636d3 to ac1ad0c Compare February 20, 2023 16:29

Mingun merged commit f0b3420 into tafia:master Feb 20, 2023

Mingun deleted the merge-text-nodes branch February 20, 2023 16:55

Pastequee mentioned this pull request Feb 23, 2023

Add trimming option for the de::from_* functions #561

Draft

Mingun mentioned this pull request May 19, 2023

Deserialization of a doctype with very long content fails #533

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Merge consequent text and CDATA events into one string #520

Merge consequent text and CDATA events into one string #520

Mingun commented Dec 4, 2022

codecov-commenter commented Dec 4, 2022 •

edited

Mingun commented Dec 4, 2022

Mingun commented Dec 4, 2022

Mingun commented Jan 22, 2023

dralley Feb 4, 2023

dralley Feb 4, 2023

Mingun Feb 4, 2023

Pastequee commented Feb 22, 2023

Mingun commented Feb 22, 2023

Merge consequent text and CDATA events into one string #520

Merge consequent text and CDATA events into one string #520

Conversation

Mingun commented Dec 4, 2022

codecov-commenter commented Dec 4, 2022 • edited

Codecov Report

Mingun commented Dec 4, 2022

Mingun commented Dec 4, 2022

Mingun commented Jan 22, 2023

dralley Feb 4, 2023

Choose a reason for hiding this comment

dralley Feb 4, 2023

Choose a reason for hiding this comment

Mingun Feb 4, 2023

Choose a reason for hiding this comment

Pastequee commented Feb 22, 2023

Mingun commented Feb 22, 2023

codecov-commenter commented Dec 4, 2022 •

edited