-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add streaming XML loading/parsing support #32
Conversation
BIG NOTE: this is just a proposal and proof of concept. Once consensus is reached, we can/will make the PR a lot more formal and rigorous. We have use cases where we need to load/parse large XML documents in streaming fashion to avoid memory constraints. Example XML: ``` <AAA> <BBB> .... .... </BBB> <BBB> .... .... </BBB> .... </AAA> ``` Currently we duplicate the `node.go/parse()` function in our own private repo and implement the stream parsing. As a result, we had to duplicate `Node` struct, as well as `NodeNavigator` impl. Thought if you want to take a look and consider if it's a use case others might benefit from? Limitations: - The streaming target xpath support is very very basic: `/AAA/BBB/DDD` only. I always wanted to stay away from using string, but instead a compiled `*xpath.Expr`. That way most of xpath features/functionalities (minus those XPATH directives involving position() etc) could be supported. The difficulty here is when I get an element `*Node`, how to evaluate it against the caller specified `*xpath.Expr` to see if the node "fits" the xpath? - Due to the nature and purpose of streaming, every time when a target node is read and done processing by the caller, the very next `StreamParser.Read()` call would remove the previous target node from the DOM tree (otherwise, the DOM tree would eventually grow to the full extent of the XML document, which defeats the purpose of streaming). Now that's a bit unorthodox if you might say. Want to know what's your opinion here. The behavior described here fits our needs perfectly though.
Hello, first, thanks for your support. I've been a little busy on days. Add streaming feature to reading a large XML document is a good idea. Is it only supported the basic XPath query? like "/aaa/bbb", looks it not support logical operation query. I need some time to consider whether we can support XPath.Expr in streaming parser. Thanks again! |
Yes, the proof of concept PR only supports
When the
(of course if the This aforementioned logic works because we always delete target nodes from previous Updated the PR with a commit contains the the logic above. In fact I figured there is no need to do (BTW, I worked on your xpath lib previously with adding |
I saw your new method and tested some cases, it looks good, it has supported the basic logical operations, like |
Great, if you are happy with the basic concept and implementation of the proposal, I will close this PR and start submitting official one (or ones to break it into smaller chunks) with more rigor and testing. Yes will not use panic. :) |
@zhengchun oh, let me you if you prefer a single PR contains all changes, or you prefer or are okay with smaller incremental PRs. Typically I'd like to have small incremental (but still functional, fully test covered and regression free) PRs instead of a big one. But this is your project, so please advise. |
And last but not least, I'd make sure all the new code is thoroughly tested, which means I probably need to introduce two new 3rd party dependencies:
What do you think? |
Okay 😄 |
Close the proof-of-concept PR per agreement above. Small/incremental PRs are coming. |
Reopening because I realized a bug in my implementation that needs some discussion with @zhengchun . My current implementation is roughly this: during the handling of This implementation doesn't work with the following xpath:
The fundamental problem in the current proposed implementation is that we do stream xpath evaluation at Now you might say, how about "let's instead do stream xpath evaluation at
Thoughts? Or we're over-engineering? Originally my proposal only supports extremely simple "/AAA/BBB" style stream xpath. With the supporting of full fledge Or we can simply stick with this proposed implementation and call out limitation in the documentation. |
I just realized there is another way to solve this problem, without loosing much perf and without adding too much complexity. basically we can introduce an optional xpath. For simple case: But for the tricky case: where we eventually want:
The code will use the first xpath to do evaluation in |
Okay, I'm happy with the two xpath solution proposed above. Closing this prototype PR again. |
BIG NOTE: this is just a proposal and proof of concept. Once consensus is reached, we can/will
make the PR a lot more formal and rigorous.
We have use cases where we need to load/parse large XML documents in streaming fashion to avoid
memory constraints. Example XML:
Currently we duplicate the
node.go/parse()
function in our own private repo and implement thestream parsing. As a result, we had to duplicate
Node
struct, as well asNodeNavigator
impl.Thought if you want to take a look and consider if it's a use case others might benefit from?
Limitations:
The streaming target xpath support is very very basic:
/AAA/BBB
only. I always wanted tostay away from using string, but instead a compiled
*xpath.Expr
. That way most of xpathfeatures/functionalities (minus those XPATH directives involving position() etc) could be
supported. The difficulty here is when I get an element
*Node
, how to evaluate it against thecaller specified
*xpath.Expr
to see if the node "fits" the xpath? I could traverse the element nodeall the way to the top and then do select with the
*xpath.Expr
against the top node, but when theexpr returns how do I know the returned node is the element node I'm interested in?
Due to the nature and purpose of streaming, every time when a target node is read and done
processing by the caller, the very next
StreamParser.Read()
call would remove the previoustarget node from the DOM tree (otherwise, the DOM tree would eventually grow to the full extent
of the XML document, which defeats the purpose of streaming). Now that's a bit unorthodox if
you might say. Want to know what's your opinion here. The behavior described here fits our needs
perfectly though.
@zhengchun ^^
PS we also have similar needs in
jsonquery
where we want to stream read a very large json document.