Optimize HTMLTree.build #511

Merged: 2 commits, Dec 28, 2023

ypconstante (Contributor) commented Dec 24, 2023

Today, when HTMLTree.build is called, after each element is built the parent node's children_nodes_ids list is updated, and both the parent node and the current node are put into the tree.nodes map. These repeated updates end up being quite costly in both memory and CPU.

This PR applies multiple changes to this module to significantly reduce the tree building cost:

  • Add parent_node to tree.nodes only after all its children are built and added to children_nodes_ids - benchmark reduced-updates
  • During the build, instead of storing the nodes in a map and updating it for each built node, store them as a list of {node_id, node} tuples and build the map only at the end - benchmark reduced-updates
  • Instead of passing HTMLTree through the build_tree calls, use the node_ids and nodes values directly - benchmark pr
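
To make the difference concrete, here is a minimal sketch of the two strategies. It is not the code from this PR or from Floki: the module name `TreeSketch`, the function names, and the simplified node maps (`id`, `tag`, `children_ids`) are all assumptions made for illustration. `build_eager` rewrites the parent entry and updates the map for every child, roughly the pattern described as "today". `build_deferred` threads only the next id and a list of `{id, node}` pairs, adds each parent after its children are done, and calls `Map.new/1` once at the end.

```elixir
defmodule TreeSketch do
  # Hypothetical illustration only; not Floki's actual implementation.
  # Input is a Floki-style tuple {tag, attributes, children}; text nodes are binaries.

  # Eager strategy: the nodes map is updated for every built node, and the
  # parent entry is rewritten each time a child is attached to it.
  def build_eager(html) do
    root = new_node(html, 1)
    {nodes, _next_id} = eager_children(child_list(html), root, %{1 => root}, 2)
    nodes
  end

  defp eager_children([], _parent, nodes, next_id), do: {nodes, next_id}

  defp eager_children([child | rest], parent, nodes, next_id) do
    node = new_node(child, next_id)
    # Appending to the children list and re-inserting the parent on every child
    # is the repeated work the PR description calls costly.
    parent = %{parent | children_ids: parent.children_ids ++ [next_id]}
    nodes = nodes |> Map.put(parent.id, parent) |> Map.put(next_id, node)
    {nodes, following_id} = eager_children(child_list(child), node, nodes, next_id + 1)
    eager_children(rest, parent, nodes, following_id)
  end

  # Deferred strategy: accumulate {id, node} pairs in a list and build the
  # map exactly once, after the whole tree has been walked.
  def build_deferred(html) do
    {pairs, _next_id} = deferred_node(html, [], 1)
    Map.new(pairs)
  end

  defp deferred_node(item, pairs, id) do
    {child_ids, pairs, next_id} =
      Enum.reduce(child_list(item), {[], pairs, id + 1}, fn child, {ids, acc, child_id} ->
        {acc, following_id} = deferred_node(child, acc, child_id)
        {[child_id | ids], acc, following_id}
      end)

    # The parent is created only now, with its complete children list.
    node = %{new_node(item, id) | children_ids: Enum.reverse(child_ids)}
    {[{id, node} | pairs], next_id}
  end

  defp new_node({tag, _attrs, _children}, id), do: %{id: id, tag: tag, children_ids: []}
  defp new_node(text, id) when is_binary(text), do: %{id: id, tag: :text, children_ids: []}

  defp child_list({_tag, _attrs, children}), do: children
  defp child_list(text) when is_binary(text), do: []
end

html = {"html", [], [{"body", [], ["hi", {"p", [], []}]}]}
# Both strategies are expected to produce the same id => node map.
IO.inspect(TreeSketch.build_eager(html) == TreeSketch.build_deferred(html))
```

Both functions assign ids in the same pre-order, so their results can be compared directly; only the number of intermediate map and list updates differs.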

Benchmark results:

##### With input big #####
Name                      ips        average  deviation         median         99th %
pr                      61.20       16.34 ms    ±40.16%       15.42 ms       31.98 ms
reduced-updates         34.61       28.89 ms    ±17.36%       29.11 ms       36.52 ms
today                   12.61       79.30 ms    ±26.62%       83.80 ms      116.42 ms

Comparison:
pr                      61.20
reduced-updates         34.61 - 1.77x slower +12.55 ms
today                   12.61 - 4.85x slower +62.96 ms

Memory usage statistics:

Name               Memory usage
pr                      7.72 MB
reduced-updates        10.95 MB - 1.42x memory usage +3.22 MB
today                  35.95 MB - 4.65x memory usage +28.23 MB

**All measurements for memory usage were the same**

##### With input medium #####
Name                      ips        average  deviation         median         99th %
pr                     443.81        2.25 ms    ±20.15%        2.37 ms        3.68 ms
reduced-updates        149.36        6.70 ms    ±20.07%        6.01 ms       10.76 ms
today                   55.43       18.04 ms    ±37.78%       20.38 ms       29.83 ms

Comparison:
pr                     443.81
reduced-updates        149.36 - 2.97x slower +4.44 ms
today                   55.43 - 8.01x slower +15.79 ms

Memory usage statistics:

Name               Memory usage
pr                      2.21 MB
reduced-updates         3.43 MB - 1.55x memory usage +1.22 MB
today                   9.96 MB - 4.51x memory usage +7.75 MB

**All measurements for memory usage were the same**

##### With input small #####
Name                      ips        average  deviation         median         99th %
pr                     879.27        1.14 ms   ±103.59%        0.39 ms        4.14 ms
reduced-updates        655.86        1.52 ms    ±85.16%        0.52 ms        4.46 ms
today                  308.46        3.24 ms    ±24.74%        2.99 ms        6.37 ms

Comparison:
pr                     879.27
reduced-updates        655.86 - 1.34x slower +0.39 ms
today                  308.46 - 2.85x slower +2.10 ms

Memory usage statistics:

Name               Memory usage
pr                    507.54 KB
reduced-updates       664.93 KB - 1.31x memory usage +157.39 KB
today                1757.59 KB - 3.46x memory usage +1250.05 KB
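
Benchmark script:

# Reads an HTML fixture located next to this script, parses it,
# and returns the root "html" element of the parsed document.
read_file = fn name ->
  [{"html", _, _} = html | _] =
    __ENV__.file
    |> Path.dirname()
    |> Path.join(name)
    |> File.read!()
    |> Floki.parse_document!()

  html
end

# Fixtures are parsed once, up front, so parsing is not part of the measurement.
inputs = %{
  "big" => read_file.("big.html"),
  "medium" => read_file.("medium.html"),
  "small" => read_file.("small.html")
}

# Only HTMLTree.build is measured: 5 seconds of run-time measurement and
# 2 seconds of memory measurement per input.
Benchee.run(
  %{
    "bench" => fn html -> Floki.HTMLTree.build(html) end
  },
  inputs: inputs,
  time: 5,
  memory_time: 2
)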

philss (Owner) commented Dec 24, 2023

@ypconstante thank you very much for the improvements you are proposing! 💜
I should review all of them this week!

philss (Owner) left a comment

IMPRESSIVE!!! 🎉 🚀

Thank you very much!!

philss merged commit deb5807 into philss:main on Dec 28, 2023
9 checks passed
ypconstante deleted the optimize-html-tree-build branch on December 28, 2023 01:52