Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use unique per-node "node_counter"s rather than a node hashmap in routing #2803

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

TheBlueMatt
Copy link
Collaborator

@TheBlueMatt TheBlueMatt commented Dec 21, 2023

During routing, we spend most of our time doing hashmap lookups. It turns out, we can drop two of them, the first requires a good bit of work - assigning each node in memory a random u32 "node counter", we can then drop the main per-node routefinding state map and replace it with a vec. Once we do that, we can also drop the first-hop hashmap lookup that we do on a per node basis as we walk the network graph, replacing it with a check in the same vec.

This is the first in a series of PRs that, in total, substantially more than double our routefinding performance with real data. This first step optimizes the route-finder itself, with later steps more focused on the scorer.

Based on #2802.

@shaavan
Copy link
Contributor

shaavan commented Dec 22, 2023

CI's unhappy.
Looks like there's some error in the code

@TheBlueMatt TheBlueMatt force-pushed the 2023-12-routing-dist-vec branch 2 times, most recently from 22f6f4b to 06bcbf6 Compare January 8, 2024 21:35
@TheBlueMatt
Copy link
Collaborator Author

Fixed.

@TheBlueMatt
Copy link
Collaborator Author

Rebased.

@codecov-commenter
Copy link

codecov-commenter commented Jan 18, 2024

Codecov Report

Attention: Patch coverage is 93.73297% with 23 lines in your changes are missing coverage. Please review.

Project coverage is 89.24%. Comparing base (b747b39) to head (f53cc4c).
Report is 4 commits behind head on main.

Files Patch % Lines
lightning/src/routing/router.rs 94.33% 9 Missing and 5 partials ⚠️
lightning/src/routing/gossip.rs 91.00% 3 Missing and 6 partials ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2803      +/-   ##
==========================================
+ Coverage   89.21%   89.24%   +0.02%     
==========================================
  Files         117      117              
  Lines       95541    95747     +206     
  Branches    95541    95747     +206     
==========================================
+ Hits        85240    85450     +210     
+ Misses       7814     7805       -9     
- Partials     2487     2492       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

coderabbitai bot commented Jan 18, 2024

Walkthrough

The project has undergone a significant update, focusing on efficiency and data integrity. The .github/workflows/build.yml file reflects updated paths and keys for network graph and scorer binaries, ensuring the latest versions are used. In the lightning source code, there's a refactoring for feature flag checks, structural optimizations for network graph storage, and scoring logic revisions to enhance performance. Additionally, new counters for node tracking in routing have been introduced, suggesting a move towards more detailed network analysis.

Changes

File Path Change Summary
.github/workflows/build.yml Updated paths and keys for net graph and scorer binaries; new SHA sum checks added.
.../src/ln/features.rs Refactored requires_unknown_bits method for efficient flag comparison.
.../src/routing/gossip.rs Added node counters and restructured fields for cache optimization and consistency checks.
.../src/routing/scoring.rs Altered decay_100k_channel_bounds function to use graph scorer and current time update.
.../src/util/test_utils.rs Introduced node counters to routing structs for enhanced route tracking.

🐇✨
To code we hop, with every commit,
A graph update, a refactor bit.
With scores and nodes, we weave the net,
Our binary tales, in silicon set. 🌐🔍

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share

Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit-tests for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit tests for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit tests.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • The JSON schema for the configuration file is available here.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/coderabbit-overrides.v2.json

CodeRabbit Discord Community

Join our Discord Community to get help, request features, and share feedback.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 40

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 5592378 and bcbf56d.
Files selected for processing (6)
  • .github/workflows/build.yml (1 hunks)
  • lightning/src/ln/features.rs (1 hunks)
  • lightning/src/routing/gossip.rs (33 hunks)
  • lightning/src/routing/router.rs (38 hunks)
  • lightning/src/routing/scoring.rs (1 hunks)
  • lightning/src/util/test_utils.rs (2 hunks)
Files not summarized due to errors (1)
  • lightning/src/routing/router.rs: Error: Message exceeds token limit
Additional comments: 40
.github/workflows/build.yml (4)
  • 86-87: Updated binary file paths and keys for the network graph to reflect new versioning. Ensure that the new binary files are correctly placed and accessible at the specified URLs.
  • 91-93: The download and hash verification steps for the network graph binary have been updated. Verify that the hash matches the expected value to ensure integrity of the downloaded file.
  • 98-98: The environment variable for the expected network graph snapshot SHA sum has been updated. Confirm that this new SHA sum is correct and corresponds to the new binary file.
  • 99-115: New steps have been added for caching, fetching, and verifying the scorer binary file. Ensure that the scorer binary is correctly integrated into the CI process and that the SHA sum verification step is accurate.
lightning/src/ln/features.rs (1)
  • 778-794: The refactoring of the requires_unknown_bits method to use chunk iteration for flag comparison is a significant improvement in terms of efficiency. By processing 64 bits at a time instead of 8, the method reduces the number of iterations needed for feature flag checks, which can be beneficial for performance, especially when dealing with a large number of feature flags.
lightning/src/routing/gossip.rs (34)
  • 40-45: > Note: This review was outside the patches, and no patch overlapping with it was found. Original lines [1-6]

Imports and uses are modified, ensure that all the newly added imports (AtomicUsize, Ordering, etc.) are used in the code and that there are no unused imports which can lead to warnings or bloat.

  • 67-67: The NodeId struct is introduced or modified. Ensure that the changes to this struct are consistent with the rest of the codebase, especially with respect to serialization and deserialization, as these are common areas where issues arise when modifying data structures.
  • 168-169: New fields removed_node_counters and next_node_counter are added to the NetworkGraph struct. Verify that the logic for managing these counters is correctly implemented throughout the codebase, especially in the context of node removal and addition.
  • 197-197: The max_node_counter field is added to the ReadOnlyNetworkGraph. Ensure that this field is properly maintained and represents the correct maximum value of node_counter across all nodes.
  • 754-761: The ChannelUpdateInfo struct is annotated with repr(C, align(32)). Confirm that the alignment and representation directives are appropriate and that they do not cause any unforeseen issues on different architectures or with FFI boundaries.
  • 847-858: The ChannelInfo struct is annotated with repr(align(128), C). Similar to the previous comment, verify that the alignment and representation directives are appropriate and do not cause issues on different architectures or with FFI boundaries.
  • 898-907: The PartialEq implementation for ChannelInfo is modified. Ensure that all fields that should be compared are included and that this change does not introduce any regressions in areas where ChannelInfo equality checks are performed.
  • 1024-1025: The node_one_counter and node_two_counter fields in ChannelInfo are initialized with u32::max_value(). Confirm that this is the intended default value and that it is handled correctly in all parts of the code where ChannelInfo is used.
  • 1036-1037: The DirectedChannelInfo struct now includes source_counter and target_counter fields. Verify that these fields are correctly updated and used in routing decisions.
  • 1046-1051: The new method for DirectedChannelInfo is modified to set source_counter and target_counter. Ensure that the logic for determining these values is correct and that it aligns with the intended use of these counters in routing.
  • 1094-1099: The source_counter and target_counter methods are added to DirectedChannelInfo. Verify that these methods are used consistently and correctly throughout the routing logic.
  • 1290-1295: The node_counter field is added to NodeInfo. Ensure that this field is correctly managed throughout the node's lifecycle and that it is consistent with the new vector-based lookup system.
  • 1359-1359: The node_counter field in NodeInfo is initialized with u32::max_value(). Confirm that this is the intended default value and that it is handled correctly in all parts of the code where NodeInfo is used.
  • 1369-1370: The write method for NetworkGraph now includes a call to test_node_counter_consistency. Verify that this method is correctly implemented and that it does not introduce performance regressions.
  • 1405-1415: The deserialization logic for ChannelInfo and NodeInfo is modified to set node_counter. Ensure that the deserialization process is correct and that the node_counter values are consistent with the serialized data.
  • 1437-1438: The NetworkGraph constructor is modified to initialize removed_node_counters and next_node_counter. Verify that these fields are initialized to the correct values and that the constructor's logic is consistent with the rest of the codebase.
  • 1484-1485: The NetworkGraph constructor is modified to initialize next_node_counter to 0 and removed_node_counters to an empty vector. Confirm that these initial values are correct and that they are handled properly throughout the graph's lifecycle.
  • 1493-1521: The test_node_counter_consistency method is added to NetworkGraph. Verify that this method is correctly implemented and that it is called in appropriate places to ensure the consistency of node_counter values.
  • 1680-1681: The node_one_counter and node_two_counter fields in ChannelInfo are initialized with u32::max_value(). Confirm that this is the intended default value and that it is handled correctly in all parts of the code where ChannelInfo is used.
  • 1696-1697: The logic for adding a channel between nodes is modified. Verify that the changes are correct and that they do not introduce any regressions in channel management.
  • 1711-1713: The remove_channel_in_nodes method is called within a match arm. Verify that the logic for removing and updating channel information is correct and that it does not introduce any inconsistencies in the network graph.
  • 1723-1727: The node_counter_id array is introduced to manage node_counter values for channels. Verify that this logic is correct and that it properly updates the node_counter values for both nodes associated with a channel.
  • 1832-1833: The node_one_counter and node_two_counter fields in ChannelInfo are initialized with u32::max_value(). Confirm that this is the intended default value and that it is handled correctly in all parts of the code where ChannelInfo is used.
  • 1862-1862: The remove_channel_in_nodes method is called. Verify that the logic for removing a channel from nodes is correct and that it does not introduce any inconsistencies in the network graph.
  • 1881-1881: The remove_channel_in_nodes method is called within a loop. Verify that the logic for removing channels and managing node counters is correct and that it does not introduce any inconsistencies in the network graph.
  • 1890-1890: The logic for removing nodes and updating removed_node_counters is modified. Verify that the changes are correct and that they do not introduce any regressions in node management.
  • 1973-1973: The remove_channel_in_nodes method is called within a loop. Verify that the logic for removing channels and managing node counters is correct and that it does not introduce any inconsistencies in the network graph.
  • 2152-2160: The remove_channel_in_nodes method is modified to update removed_node_counters. Verify that the logic for removing channels and managing node counters is correct and that it does not introduce any inconsistencies in the network graph.
  • 2218-2222: The max_node_counter method is added to ReadOnlyNetworkGraph. Verify that this method returns the correct maximum value for node_counter and that it is used appropriately throughout the code.
  • 3519-3520: The node_one_counter and node_two_counter fields in ChannelInfo are set to specific values in a test. Verify that these values are appropriate for the test scenario and that the test correctly reflects the intended behavior of the code.
  • 3539-3540: The node_one_counter and node_two_counter fields in ChannelInfo are set to specific values in a test. Verify that these values are appropriate for the test scenario and that the test correctly reflects the intended behavior of the code.
  • 3595-3595: The node_counter field in NodeInfo is set to a specific value in a test. Verify that this value is appropriate for the test scenario and that the test correctly reflects the intended behavior of the code.
  • 3632-3632: The read_network_graph benchmark function is modified. Verify that the benchmark is correctly set up and that it accurately measures the performance of reading the network graph.
  • 3642-3642: The write_network_graph benchmark function is modified. Verify that the benchmark is correctly set up and that it accurately measures the performance of writing the network graph.
lightning/src/routing/scoring.rs (1)
  • 3687-3687: The logic within decay_100k_channel_bounds has been fundamentally changed. Ensure that the new implementation aligns with the intended behavior and that all dependent features are tested with the updated logic.

Comment on lines 1129 to 1132

// While RouteGraphNode can be laid out with fewer bytes, performance appears to be improved
// substantially when it is laid out at exactly 64 bytes.
//
// Thus, we use `#[repr(C)]` on the struct to force a suboptimal layout and check that it stays 64
// bytes here.
#[cfg(any(ldk_bench, not(any(test, fuzzing))))]
const _GRAPH_NODE_SMALL: usize = 64 - core::mem::size_of::<RouteGraphNode>();
#[cfg(any(ldk_bench, not(any(test, fuzzing))))]
const _GRAPH_NODE_FIXED_SIZE: usize = core::mem::size_of::<RouteGraphNode>() - 64;

/// A [`CandidateRouteHop::FirstHop`] entry.
Copy link

@coderabbitai coderabbitai bot Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This review was outside the patches, and no patch overlapping with it was found. Original lines [1-1]

The comments about RouteGraphNode layout for performance should be followed by actual code that reflects this. If the layout is not enforced in the code, this comment is misleading.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is enforced with static assertions.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TheBlueMatt Thank you for clarifying. It's good to know that static assertions are used to ensure the layout for performance. My apologies for any confusion.

(\\_/)
( •_•)
/>🥕

lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
Comment on lines +1344 to +1381
#[inline(always)]
fn src_node_counter(&self) -> u32 {
match self {
CandidateRouteHop::FirstHop(hop) => hop.payer_node_counter,
CandidateRouteHop::PublicHop(hop) => hop.info.source_counter(),
CandidateRouteHop::PrivateHop(hop) => hop.source_node_counter,
CandidateRouteHop::Blinded(hop) => hop.source_node_counter,
CandidateRouteHop::OneHopBlinded(hop) => hop.source_node_counter,
}
}

#[inline]
fn target_node_counter(&self) -> Option<u32> {
match self {
CandidateRouteHop::FirstHop(hop) => Some(hop.target_node_counter),
CandidateRouteHop::PublicHop(hop) => Some(hop.info.target_counter()),
CandidateRouteHop::PrivateHop(hop) => Some(hop.target_node_counter),
CandidateRouteHop::Blinded(_) => None,
CandidateRouteHop::OneHopBlinded(_) => None,
}
}

Copy link

@coderabbitai coderabbitai bot Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The methods src_node_counter and target_node_counter are marked with #[inline(always)] and #[inline] respectively. It's worth noting that forcing inlining is not always beneficial and can sometimes lead to increased binary size without performance gain. It's recommended to benchmark to ensure that these attributes are actually providing a benefit.

Consider removing the #[inline(always)] and #[inline] attributes and rely on the compiler's optimization unless benchmarks show a clear benefit.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was done in response to careful benchmarking :)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TheBlueMatt Thank you for the clarification! It's great to hear that these decisions were backed by careful benchmarking. Performance optimizations like these can indeed make a significant difference when they're informed by real-world data. Keep up the good work!



If you found this review helpful, would you consider giving us a shout-out on X?
Thank you for using CodeRabbit!

// Fetch 1.5x more routes than we need as after we do some scorer updates we may end up
// with some routes we picked being un-routable.
for _ in 0..route_count * 3 / 2 {
for _ in 0..route_count {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop in generate_test_routes for selecting random (source, destination) pairs is non-deterministic and could lead to flaky tests. Consider using a deterministic approach or documenting the non-determinism.

Review the non-deterministic loop in generate_test_routes and consider making it deterministic. If non-determinism is required, document why it is necessary for the test.

Comment on lines +8665 to +8734
let (network_graph, _) = bench_utils::read_graph_scorer(&logger).unwrap();
let scorer = FixedPenaltyScorer::with_penalty(0);
generate_routes(bench, &network_graph, scorer, &Default::default(),
Bolt11InvoiceFeatures::empty(), 0, "generate_routes_with_zero_penalty_scorer");
}

pub fn generate_mpp_routes_with_zero_penalty_scorer(bench: &mut Criterion) {
let logger = TestLogger::new();
let network_graph = bench_utils::read_network_graph(&logger).unwrap();
let (network_graph, _) = bench_utils::read_graph_scorer(&logger).unwrap();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmarks generate_routes_with_zero_penalty_scorer and generate_mpp_routes_with_zero_penalty_scorer are using a fixed penalty scorer with a penalty of 0. Ensure that this is the intended behavior and document the rationale behind using a zero penalty.

Review the use of a fixed penalty scorer with a penalty of 0 in the benchmarks. If this is the intended behavior, add comments explaining why a zero penalty is used.

Comment on lines +8682 to +8760
let (network_graph, scorer) = bench_utils::read_graph_scorer(&logger).unwrap();
let params = ProbabilisticScoringFeeParameters::default();
let scorer = ProbabilisticScorer::new(ProbabilisticScoringDecayParameters::default(), &network_graph, &logger);
generate_routes(bench, &network_graph, scorer, &params, Bolt11InvoiceFeatures::empty(), 0,
"generate_routes_with_probabilistic_scorer");
}

pub fn generate_mpp_routes_with_probabilistic_scorer(bench: &mut Criterion) {
let logger = TestLogger::new();
let network_graph = bench_utils::read_network_graph(&logger).unwrap();
let (network_graph, scorer) = bench_utils::read_graph_scorer(&logger).unwrap();
let params = ProbabilisticScoringFeeParameters::default();
let scorer = ProbabilisticScorer::new(ProbabilisticScoringDecayParameters::default(), &network_graph, &logger);
generate_routes(bench, &network_graph, scorer, &params,
channelmanager::provided_bolt11_invoice_features(&UserConfig::default()), 0,
"generate_mpp_routes_with_probabilistic_scorer");
}

pub fn generate_large_mpp_routes_with_probabilistic_scorer(bench: &mut Criterion) {
let logger = TestLogger::new();
let network_graph = bench_utils::read_network_graph(&logger).unwrap();
let (network_graph, scorer) = bench_utils::read_graph_scorer(&logger).unwrap();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmarks generate_routes_with_probabilistic_scorer and generate_large_mpp_routes_with_probabilistic_scorer are using default scoring parameters. Ensure that these parameters are appropriate for the benchmarks and document any assumptions.

Review the scoring parameters used in the benchmarks. If the default parameters are appropriate, document any assumptions made. Otherwise, adjust the parameters to better reflect the intended benchmark conditions.

Comment on lines +8708 to +8789
let (network_graph, scorer) = bench_utils::read_graph_scorer(&logger).unwrap();
let mut params = ProbabilisticScoringFeeParameters::default();
params.linear_success_probability = false;
let scorer = ProbabilisticScorer::new(
ProbabilisticScoringDecayParameters::default(), &network_graph, &logger);
generate_routes(bench, &network_graph, scorer, &params,
channelmanager::provided_bolt11_invoice_features(&UserConfig::default()), 0,
"generate_routes_with_nonlinear_probabilistic_scorer");
}

pub fn generate_mpp_routes_with_nonlinear_probabilistic_scorer(bench: &mut Criterion) {
let logger = TestLogger::new();
let network_graph = bench_utils::read_network_graph(&logger).unwrap();
let (network_graph, scorer) = bench_utils::read_graph_scorer(&logger).unwrap();
let mut params = ProbabilisticScoringFeeParameters::default();
params.linear_success_probability = false;
let scorer = ProbabilisticScorer::new(
ProbabilisticScoringDecayParameters::default(), &network_graph, &logger);
generate_routes(bench, &network_graph, scorer, &params,
channelmanager::provided_bolt11_invoice_features(&UserConfig::default()), 0,
"generate_mpp_routes_with_nonlinear_probabilistic_scorer");
}

pub fn generate_large_mpp_routes_with_nonlinear_probabilistic_scorer(bench: &mut Criterion) {
let logger = TestLogger::new();
let network_graph = bench_utils::read_network_graph(&logger).unwrap();
let (network_graph, scorer) = bench_utils::read_graph_scorer(&logger).unwrap();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmarks generate_routes_with_nonlinear_probabilistic_scorer and generate_large_mpp_routes_with_nonlinear_probabilistic_scorer are explicitly setting linear_success_probability to false. Ensure that this setting is appropriate for the benchmarks and document the rationale behind it.

Review the setting of linear_success_probability in the benchmarks. If this setting is appropriate, add comments explaining why it is set to false. Otherwise, adjust the setting to match the intended benchmark conditions.

Comment on lines +8745 to +8818
do_route_bench(bench, graph, scorer, score_params, bench_name, route_endpoints);
}

#[inline(never)]
fn do_route_bench<S: ScoreLookUp + ScoreUpdate>(
bench: &mut Criterion, graph: &NetworkGraph<&TestLogger>, scorer: S,
score_params: &S::ScoreParams, bench_name: &'static str,
route_endpoints: Vec<(ChannelDetails, PaymentParameters, u64)>,
) {
let payer = bench_utils::payer_pubkey();
let keys_manager = KeysManager::new(&[0u8; 32], 42, 42);
let random_seed_bytes = keys_manager.get_secure_random_bytes();

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function do_route_bench is using a hardcoded seed for randomness, which may not be ideal for benchmarking. Consider using a dynamic seed or documenting why a static seed is used.

Review the use of a hardcoded seed in the function do_route_bench. If a dynamic seed is more appropriate, update the function to use one. Otherwise, add comments explaining the choice of a static seed.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 28

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 51d9ee3 and 8d80f82.
Files selected for processing (6)
  • .github/workflows/build.yml (1 hunks)
  • lightning/src/ln/features.rs (1 hunks)
  • lightning/src/routing/gossip.rs (33 hunks)
  • lightning/src/routing/router.rs (38 hunks)
  • lightning/src/routing/scoring.rs (1 hunks)
  • lightning/src/util/test_utils.rs (2 hunks)
Files not summarized due to errors (1)
  • lightning/src/routing/router.rs: Error: Message exceeds token limit
Files skipped from review as they are similar to previous changes (4)
  • .github/workflows/build.yml
  • lightning/src/ln/features.rs
  • lightning/src/routing/gossip.rs
  • lightning/src/util/test_utils.rs

Comment on lines 1129 to 1132

// While RouteGraphNode can be laid out with fewer bytes, performance appears to be improved
// substantially when it is laid out at exactly 64 bytes.
//
// Thus, we use `#[repr(C)]` on the struct to force a suboptimal layout and check that it stays 64
// bytes here.
#[cfg(any(ldk_bench, not(any(test, fuzzing))))]
const _GRAPH_NODE_SMALL: usize = 64 - core::mem::size_of::<RouteGraphNode>();
#[cfg(any(ldk_bench, not(any(test, fuzzing))))]
const _GRAPH_NODE_FIXED_SIZE: usize = core::mem::size_of::<RouteGraphNode>() - 64;

/// A [`CandidateRouteHop::FirstHop`] entry.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 NOTE
This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [1-6]

The constants _GRAPH_NODE_SMALL and _GRAPH_NODE_FIXED_SIZE are defined but not used in the provided code snippet. If they are not used elsewhere, they should be removed to avoid confusion.

- const _GRAPH_NODE_SMALL: usize = 64 - core::mem::size_of::<RouteGraphNode>();
- const _GRAPH_NODE_FIXED_SIZE: usize = core::mem::size_of::<RouteGraphNode>() - 64;

Comment on lines 1174 to 1193
pub target_node_id: &'a NodeId,
/// A unique ID which describes the source node of the hop (further from the payment target).
/// It will not conflict with any [`super::gossip::NodeInfo::node_counter`]s, but may be equal
/// to one if the node is a public node.
pub(crate) source_node_counter: u32,
/// A unique ID which describes the destination node of the hop (towards the payment target).
/// It will not conflict with any [`super::gossip::NodeInfo::node_counter`]s, but may be equal
/// to one if the node is a public node.
pub(crate) target_node_counter: u32,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure the unique IDs (source_node_counter and target_node_counter) are properly managed and documented, especially regarding their non-conflict with public node counters and their significance in routing logic.

Consider adding more detailed comments on how these counters are generated and used, ensuring clarity for future maintainers.

Comment on lines 1197 to 1214
/// A unique ID which describes the introduction point of the blinded path.
/// It will not conflict with any [`super::gossip::NodeInfo::node_counter`]s, but will generally
/// be equal to one from the public network graph (assuming the introduction point is a public
/// node).
source_node_counter: u32,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on line 1198 mentions that source_node_counter will generally be equal to one from the public network graph. This implies an assumption about the introduction point's visibility. Ensure this assumption is valid and clearly documented.

Clarify under what conditions the introduction point would not be a public node and how this affects the routing process.

Comment on lines +1344 to +1381
#[inline(always)]
fn src_node_counter(&self) -> u32 {
match self {
CandidateRouteHop::FirstHop(hop) => hop.payer_node_counter,
CandidateRouteHop::PublicHop(hop) => hop.info.source_counter(),
CandidateRouteHop::PrivateHop(hop) => hop.source_node_counter,
CandidateRouteHop::Blinded(hop) => hop.source_node_counter,
CandidateRouteHop::OneHopBlinded(hop) => hop.source_node_counter,
}
}

#[inline]
fn target_node_counter(&self) -> Option<u32> {
match self {
CandidateRouteHop::FirstHop(hop) => Some(hop.target_node_counter),
CandidateRouteHop::PublicHop(hop) => Some(hop.info.target_counter()),
CandidateRouteHop::PrivateHop(hop) => Some(hop.target_node_counter),
CandidateRouteHop::Blinded(_) => None,
CandidateRouteHop::OneHopBlinded(_) => None,
}
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of #[inline(always)] and #[inline] attributes on src_node_counter and target_node_counter methods should be justified with comments, especially if they were added based on performance benchmarks.

Add comments explaining the decision to use these attributes, including any benchmarks or performance tests that support this choice.

Comment on lines +1566 to +1651
const _NODE_MAP_SIZE_TWO_CACHE_LINES: usize = 128 - core::mem::size_of::<Option<PathBuildingHop>>();
const _NODE_MAP_SIZE_EXACTLY_TWO_CACHE_LINES: usize = core::mem::size_of::<Option<PathBuildingHop>>() - 128;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constants _NODE_MAP_SIZE_TWO_CACHE_LINES and _NODE_MAP_SIZE_EXACTLY_TWO_CACHE_LINES are defined but not used in the provided code snippet. If they are not used elsewhere, they should be removed to avoid dead code.

- const _NODE_MAP_SIZE_TWO_CACHE_LINES: usize = 128 - core::mem::size_of::<Option<PathBuildingHop>>();
- const _NODE_MAP_SIZE_EXACTLY_TWO_CACHE_LINES: usize = core::mem::size_of::<Option<PathBuildingHop>>() - 128;

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
const _NODE_MAP_SIZE_TWO_CACHE_LINES: usize = 128 - core::mem::size_of::<Option<PathBuildingHop>>();
const _NODE_MAP_SIZE_EXACTLY_TWO_CACHE_LINES: usize = core::mem::size_of::<Option<PathBuildingHop>>() - 128;

Comment on lines +2725 to +2786
.unwrap_or_else(|| CandidateRouteHop::PrivateHop(PrivateHopCandidate {
hint: hop, target_node_id: target,
source_node_counter: *private_source_node_counter,
target_node_counter: *private_target_node_counter,
}));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic for creating CandidateRouteHop instances based on the type of hop is critical for the routing algorithm. Ensure that this logic is correct and that the correct types of hops are being created.

Review the logic for creating CandidateRouteHop instances to ensure correctness. Add comments to explain the conditions under which different types of hops are created.

Comment on lines 2765 to 2828
if let Some((first_channels, peer_node_counter)) = first_hop_targets.get_mut(&target) {
sort_first_hop_channels(first_channels, &used_liquidities,
recommended_value_msat, our_node_pubkey);
for details in first_channels {
let first_hop_candidate = CandidateRouteHop::FirstHop(FirstHopCandidate {
details, payer_node_id: &our_node_id,
details, payer_node_id: &our_node_id, payer_node_counter,
target_node_counter: *peer_node_counter,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic for searching for a direct channel between the last checked hop and first hop targets is important for the routing algorithm. Ensure that this logic is correct and optimized for performance.

Review and optimize the logic for searching for a direct channel between hops for correctness and performance. Add comments to explain how this search is performed in the routing algorithm.

Comment on lines +2813 to +2876
if let Some((first_channels, peer_node_counter)) = first_hop_targets.get_mut(&NodeId::from_pubkey(&hop.src_node_id)) {
sort_first_hop_channels(first_channels, &used_liquidities,
recommended_value_msat, our_node_pubkey);
for details in first_channels {
let first_hop_candidate = CandidateRouteHop::FirstHop(FirstHopCandidate {
details, payer_node_id: &our_node_id,
details, payer_node_id: &our_node_id, payer_node_counter,
target_node_counter: *peer_node_counter,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic for sorting first hop channels and adding entries for first hop candidates is complex and should be carefully reviewed to ensure correctness.

Review the logic for sorting first hop channels and adding entries for first hop candidates to ensure correctness. Add comments to explain how first hop channels are processed in the routing algorithm.

Comment on lines +2855 to +2920
let mut new_entry = dist[payer_node_counter as usize].take().unwrap();
let mut ordered_hops: Vec<(PathBuildingHop, NodeFeatures)> = vec!((new_entry.clone(), default_node_features.clone()));

'path_walk: loop {
let mut features_set = false;
let target = ordered_hops.last().unwrap().0.candidate.target().unwrap_or(maybe_dummy_payee_node_id);
if let Some(first_channels) = first_hop_targets.get(&target) {
let candidate = &ordered_hops.last().unwrap().0.candidate;
let target = candidate.target().unwrap_or(maybe_dummy_payee_node_id);
let target_node_counter = candidate.target_node_counter();
if let Some((first_channels, _)) = first_hop_targets.get(&target) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic for constructing the final path when the target node is our own node is critical for the routing algorithm. Ensure that this logic is correct and that the path is being constructed properly.

Review the logic for constructing the final path when the target node is our own node to ensure correctness. Add comments to explain how the final path is constructed in the routing algorithm.

Comment on lines +2894 to +2956
if target_node_counter.is_none() {
break 'path_walk;
}
if target_node_counter == Some(payee_node_counter) { break 'path_walk; }

new_entry = match dist.remove(&target) {
new_entry = match dist[target_node_counter.unwrap() as usize].take() {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic for breaking out of the path walk loop based on target_node_counter is critical for the routing algorithm. Ensure that this logic is correct and that the loop is being exited at the appropriate times.

Review the logic for breaking out of the path walk loop to ensure correctness. Add comments to explain the conditions under which the loop is exited in the routing algorithm.

@TheBlueMatt
Copy link
Collaborator Author

Rebased.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 7

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 7a35bf8 and c5e0599.
Files selected for processing (6)
  • .github/workflows/build.yml (1 hunks)
  • lightning/src/ln/features.rs (1 hunks)
  • lightning/src/routing/gossip.rs (33 hunks)
  • lightning/src/routing/router.rs (38 hunks)
  • lightning/src/routing/scoring.rs (1 hunks)
  • lightning/src/util/test_utils.rs (2 hunks)
Files not summarized due to errors (1)
  • lightning/src/routing/router.rs: Error: Message exceeds token limit
Files not reviewed due to errors (1)
  • (no review received)
Additional comments: 19
.github/workflows/build.yml (6)
  • 86-87: The path and key for caching the routing graph snapshot have been updated to lightning/net_graph-2023-12-10.bin and ldk-net_graph-v0.0.118-2023-12-10.bin, respectively. Ensure that the new binary file is correctly generated and available at the specified URL to prevent CI failures.
  • 91-93: The fetching process for the routing graph snapshot includes a SHA-256 checksum verification step. It's crucial to verify that the EXPECTED_ROUTING_GRAPH_SNAPSHOT_SHASUM matches the actual checksum of the file at https://bitcoin.ninja/ldk-net_graph-v0.0.118-2023-12-10.bin to ensure integrity and prevent potential security issues.
  • 98-98: The EXPECTED_ROUTING_GRAPH_SNAPSHOT_SHASUM is updated. Double-check that this SHA-256 checksum accurately corresponds to the new routing graph snapshot file to ensure the integrity of the downloaded file.
  • 103-104: The path and key for caching the scorer snapshot have been updated to lightning/scorer-2023-12-10.bin and ldk-scorer-v0.0.118-2023-12-10.bin, respectively. Confirm that the new binary file is correctly generated and accessible at the provided URL to avoid CI disruptions.
  • 108-110: The fetching process for the scorer snapshot includes a SHA-256 checksum verification step. It's essential to ensure that the EXPECTED_SCORER_SNAPSHOT_SHASUM matches the actual checksum of the file at https://bitcoin.ninja/ldk-scorer-v0.0.118-2023-12-10.bin to maintain integrity and avert potential security risks.
  • 115-115: The EXPECTED_SCORER_SNAPSHOT_SHASUM is updated. Verify that this SHA-256 checksum correctly matches the new scorer snapshot file to guarantee the integrity of the downloaded file.
lightning/src/routing/gossip.rs (8)
  • 67-67: The NodeId struct is correctly annotated with #[derive(Clone, Copy, PartialEq, Eq)] to ensure it can be easily copied and compared.
  • 168-169: The addition of removed_node_counters and next_node_counter fields to the NetworkGraph struct is consistent with the PR's objective to optimize routing performance by using unique counters for nodes.
  • 197-197: The max_node_counter field in ReadOnlyNetworkGraph struct is a good addition for tracking the maximum node counter value, which is likely used for performance optimizations in routing.
  • 753-761: The use of #[repr(C, align(32))] for ChannelUpdateInfo struct is a smart optimization to ensure that the struct's layout is predictable and aligned for cache performance. This is particularly important for structures that are frequently accessed and modified in performance-critical paths.
  • 846-852: Similarly, the ChannelInfo struct's alignment with #[repr(align(128), C)] is a thoughtful optimization for cache performance. Ensuring that frequently accessed fields are likely to be on the same or adjacent cache lines can significantly impact performance in routing.
  • 1024-1025: The initialization of node_one_counter and node_two_counter to u32::max_value() in ChannelInfo struct's Readable implementation seems to be a placeholder. It's crucial to ensure that these counters are correctly set elsewhere in the code to meaningful values, as using u32::max_value() directly could lead to incorrect behavior or performance issues.
Verification successful

The search results indicate that node_one_counter and node_two_counter are indeed updated from their initial u32::max_value() state in various parts of the code. There are assertions and explicit updates that suggest these counters are managed and set to meaningful values before being used in a way that could affect behavior or performance. This evidence supports the idea that there is a mechanism in place to ensure the counters are not left at their placeholder values throughout the code's execution.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify that node counters are correctly assigned and not left at their initial max value state.
rg "node_(one|two)_counter" --type rust

Length of output: 1526

* 1381-1382: The `test_node_counter_consistency` method in `NetworkGraph` is a good practice for ensuring the internal consistency of node counters. This kind of invariant checking is crucial for debugging and maintaining the integrity of the routing graph. * 1537-1537: The `read_only` method in `NetworkGraph` provides a safe way to access a read-only view of the network graph. This is useful for operations that need to read network graph data without the risk of modifying it.
lightning/src/routing/scoring.rs (1)
  • 3687-3687: Consider evaluating the performance impact of the time_passed method within decay_100k_channel_bounds. Frequent updates to channel scores could lead to performance bottlenecks, especially with a large number of channels.

Explore optimizing the decay calculation to reduce computational overhead, possibly by batching updates or adjusting the decay frequency based on network size and activity.

lightning/src/routing/router.rs (4)
  • 1126-1132: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [3-6]

The constants _GRAPH_NODE_SMALL and _GRAPH_NODE_FIXED_SIZE are used in static assertions to enforce a specific layout for RouteGraphNode, as clarified in previous discussions. Ensure that these assertions are present and effectively enforce the intended layout for performance optimization.

  • 1181-1189: The addition of source_node_counter and target_node_counter fields in this context further supports the PR's goal of optimizing routing performance by using unique u32 counters. Good consistency across different structs.
  • 1205-1209: The introduction of source_node_counter for blinded paths aligns with the PR's optimization strategy. Ensure that the assumptions regarding the introduction point's visibility as a public node are valid and clearly documented.
  • 1354-1375: The use of #[inline(always)] and #[inline] attributes is based on careful benchmarking, as previously clarified. Consider adding comments to document the benchmarking results and rationale behind these decisions to aid future maintainers.

Comment on lines +172 to +173
payer_node_counter: u32::max_value(),
target_node_counter: u32::max_value(),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of u32::max_value() as a placeholder for payer_node_counter and target_node_counter in FirstHopCandidate struct is noted. While this approach can serve as a temporary measure, it's crucial to ensure these fields are properly initialized before use to avoid potential bugs or unexpected behavior. Consider implementing a more explicit initialization process or documentation to clarify the intended use and lifecycle of these counters.

Comment on lines +201 to +202
source_node_counter: u32::max_value(),
target_node_counter: u32::max_value(),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, the use of u32::max_value() as a placeholder for source_node_counter and target_node_counter in PrivateHopCandidate struct raises concerns about proper initialization. As with the previous comment, it's important to ensure these fields are correctly set during actual use to prevent issues. Providing clear documentation or an initialization mechanism would help maintain code clarity and reliability.

Comment on lines 3684 to 3690

pub fn decay_100k_channel_bounds(bench: &mut Criterion) {
let logger = TestLogger::new();
let network_graph = bench_utils::read_network_graph(&logger).unwrap();
let mut scorer = ProbabilisticScorer::new(Default::default(), &network_graph, &logger);
// Score a number of random channels
let mut seed: u64 = 0xdeadbeef;
for _ in 0..100_000 {
seed = seed.overflowing_mul(6364136223846793005).0.overflowing_add(1).0;
let (victim, victim_dst, amt) = {
let rong = network_graph.read_only();
let channels = rong.channels();
let chan = channels.unordered_iter()
.skip((seed as usize) % channels.len())
.next().unwrap();
seed = seed.overflowing_mul(6364136223846793005).0.overflowing_add(1).0;
let amt = seed % chan.1.capacity_sats.map(|c| c * 1000)
.or(chan.1.one_to_two.as_ref().map(|info| info.htlc_maximum_msat))
.or(chan.1.two_to_one.as_ref().map(|info| info.htlc_maximum_msat))
.unwrap_or(1_000_000_000).saturating_add(1);
(*chan.0, chan.1.node_two, amt)
};
let path = Path {
hops: vec![RouteHop {
pubkey: victim_dst.as_pubkey().unwrap(),
node_features: NodeFeatures::empty(),
short_channel_id: victim,
channel_features: ChannelFeatures::empty(),
fee_msat: amt,
cltv_expiry_delta: 42,
maybe_announced_channel: true,
}],
blinded_tail: None
};
seed = seed.overflowing_mul(6364136223846793005).0.overflowing_add(1).0;
if seed % 1 == 0 {
scorer.probe_failed(&path, victim, Duration::ZERO);
} else {
scorer.probe_successful(&path, Duration::ZERO);
}
}
let (network_graph, mut scorer) = bench_utils::read_graph_scorer(&logger).unwrap();
let mut cur_time = Duration::ZERO;
cur_time += Duration::from_millis(1);
scorer.time_passed(cur_time);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 NOTE
This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [14-14]

There is a TODO comment regarding missing tests.

Would you like me to help generate tests for this functionality?

scorer.probe_successful(&path, Duration::ZERO);
}
}
let (network_graph, mut scorer) = bench_utils::read_graph_scorer(&logger).unwrap();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decay_100k_channel_bounds function lacks documentation. It's crucial to have clear documentation, especially for complex logic, to help maintainers and contributors understand its purpose and how it should be used.

+ /// Decays the bounds of channel liquidity over time to reflect the passage of time and the potential change in channel state.

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
let (network_graph, mut scorer) = bench_utils::read_graph_scorer(&logger).unwrap();
let (network_graph, mut scorer) = bench_utils::read_graph_scorer(&logger).unwrap();

Comment on lines 1148 to 1157
/// A unique ID which describes the payer. It will not conlfict with any
/// [`super::gossip::NodeInfo::node_counter`]s, but may be equal to one if the payer is a
/// public node.
pub(crate) payer_node_counter: u32,
/// A unique ID which describes the first hop counterparty. It will not conflict with any
/// [`super::gossip::NodeInfo::node_counter`]s, but may be equal to one if the counterparty is
/// a public node.
pub(crate) target_node_counter: u32,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please correct the typo in the comment on line 1148: "conlfict" should be "conflict".

- /// A unique ID which describes the payer. It will not conlfict with any
+ /// A unique ID which describes the payer. It will not conflict with any

The addition of payer_node_counter and target_node_counter fields is a good optimization for enhancing routing performance.


Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
/// A unique ID which describes the payer. It will not conlfict with any
/// [`super::gossip::NodeInfo::node_counter`]s, but may be equal to one if the payer is a
/// public node.
pub(crate) payer_node_counter: u32,
/// A unique ID which describes the first hop counterparty. It will not conflict with any
/// [`super::gossip::NodeInfo::node_counter`]s, but may be equal to one if the counterparty is
/// a public node.
pub(crate) target_node_counter: u32,
/// A unique ID which describes the payer. It will not conflict with any
/// [`super::gossip::NodeInfo::node_counter`]s, but may be equal to one if the payer is a
/// public node.
pub(crate) payer_node_counter: u32,
/// A unique ID which describes the first hop counterparty. It will not conflict with any
/// [`super::gossip::NodeInfo::node_counter`]s, but may be equal to one if the counterparty is
/// a public node.
pub(crate) target_node_counter: u32,

Comment on lines +1530 to +1621
#[repr(align(128))]
struct PathBuildingHop<'a> {
candidate: CandidateRouteHop<'a>,
target_node_counter: Option<u32>,
/// If we've already processed a node as the best node, we shouldn't process it again. Normally
/// we'd just ignore it if we did as all channels would have a higher new fee, but because we
/// may decrease the amounts in use as we walk the graph, the actual calculated fee may
/// decrease as well. Thus, we have to explicitly track which nodes have been processed and
/// avoid processing them again.
was_processed: bool,
/// When processing a node as the next best-score candidate, we want to quickly check if it is
/// a direct counterparty of ours, using our local channel information immediately if we can.
///
/// In order to do so efficiently, we cache whether a node is a direct counterparty here at the
/// start of a route-finding pass. Unlike all other fields in this struct, this field is never
/// updated after being initialized - it is set at the start of a route-finding pass and only
/// read thereafter.
is_first_hop_target: bool,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alignment of PathBuildingHop to 128 bytes is a performance optimization. It would be beneficial to add comments explaining the rationale behind this specific alignment to aid future maintainers and contributors.

Comment on lines +7111 to +7162
use crate::routing::scoring::ProbabilisticScoringFeeParameters;

let logger = ln_test_utils::TestLogger::new();
let graph = match super::bench_utils::read_network_graph(&logger) {
Ok(f) => f,
let (graph, mut scorer) = match super::bench_utils::read_graph_scorer(&logger) {
Ok(res) => res,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of a hardcoded seed in the test generate_routes is noted. Consider documenting the rationale behind the choice of a static seed for consistency and reproducibility purposes.

Copy link
Contributor

@tnull tnull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firstly, let me say sorry for taking this long to have a look this, I believe I self-requested review a while back.

I did a first high-level pass and added an initial round of questions. I have to say that I'm close to a concept NACK on this one: the router logic is hard to reason about as it is and we keep discovering bugs here. It seems to me that this PR significantly increases the code complexity and introduces several new angles how things can go wrong. While this seems to work just fine for now, I fear that we'll see more breakage in the router code as a consequence in the future. If we really want to go ahead with this, it would be great if we could find a better abstraction for our newly created data structure that would offer a foolproof API, e.g., so we don't for get to insert/remove reused counters in the corresponding list.

I have yet to run the benchmarks myself to see how much speedup this PR would gain us, but from my first impression I'm not convinced it's worth the increased risks and maintenance costs. Also, it seems that a good chunk of the performance improvements might come from the last few commits alone, which are optimizations that could be applied independently from switching to node counters?


/// Tries to open a network graph file, or panics with a URL to fetch it.
pub(crate) fn get_route_file() -> Result<std::fs::File, &'static str> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: "This means future changes to the scorer's data may
be harder to benchmark": Would an alternative be to keep both versions as separate benchmarks, so we could still benchmark updates impacting the Scorer's data model with synthetic data?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but I think the random failures model we had before is borderline useless for benchmarking route performance changes. We're somewhat better off doing some kind of conversion from old to new scoring data and then bencharking from there.

///
/// These IDs allow the router to avoid a `HashMap` lookup by simply using this value as an
/// index in a `Vec`, skipping a big step in some of the hottest code when routing.
pub(crate) node_counter: u32,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that storing the counter in here could be pretty dangerous if we or users were to clone the NodeInfo, do something with it and come back, at which point we could have dropped the entry and recreated another NodeInfo with the same counter.

Is this an issue? Do we want store the node_counter as part of a wrapper struct holding both the NodeInfo and the counter? Alternatively, we could make this an Option<u32> and make sure that clone() would reset it to None, asserting we'd have to re-insert/lookup it as a 'fresh' info?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm not convinced its an issue. The network graph is publicly read-only - it has several internal consistency requirements (eg each channel has both side nodes in the nodes map) which imply users can't freely edit bits of it (though they're welcome to take parts of it and copy them locally to build their own graphs).

lightning/src/routing/gossip.rs Show resolved Hide resolved
@@ -1409,15 +1429,42 @@ impl<L: Deref> NetworkGraph<L> where L::Target: Logger {
logger,
channels: RwLock::new(IndexedMap::new()),
nodes: RwLock::new(IndexedMap::new()),
next_node_counter: AtomicUsize::new(0),
removed_node_counters: Mutex::new(Vec::new()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than adding these just side-by-side, could we create a new data structure wrapping them and exposing adequate insert/remove API methods so that we'd never forget to, e.g., call removed_node_counters.push(..) whenever we remove a node?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, sadly that doesn't improve things much, we end up with a bunch of places where we just replace one removed_node_counters.push(..) with a node_counters.removed_node(...). We can't super easily use a method on the graph to mark a node removed, as we're often doing it with hash map entries already in place from a previous lookup.

Still, test_node_counter_consistency is pretty thorough, so if we have any obvious bugs fuzzing or tests should easily hit assertions in that.

@@ -865,6 +857,24 @@ pub struct ChannelInfo {
/// (which we can probably assume we are - no-std environments probably won't have a full
/// network graph in memory!).
announcement_received_time: u64,

/// The [`NodeInfo::node_counter`] of the node pointed to by [`Self::node_one`].
pub(crate) node_one_counter: u32,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that these counters will be reused, how can we be sure that they won't get outdated, especially for cloned ChannelInfos as mentioned above regarding NodeInfo?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They should only get reused when a node was fully removed. The network graph already has internal consistency requirements that any channel has both the source and sink nodes in the graph as well. This just relies on that consistency requirement by adding an additional pointer. In terms of ChannelInfos copied and used outside of a specific graph, indeed, they could point nowhere, but that's kinda by definition - the counters are specific to a NetworkGraph, they aren't global in any other sense, and each route finding operation only cares about a single graph and its contained infos.

lightning/src/routing/router.rs Outdated Show resolved Hide resolved
/// public node.
pub(crate) payer_node_counter: u32,
/// A unique ID which describes the first hop counterparty. It will not conflict with any
/// [`super::gossip::NodeInfo::node_counter`]s, but may be equal to one if the counterparty is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: may be equal to one is ambiguous in this context (here and below).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how this is ambiguous? Its saying that this won't step on the toes of any data in our graph, but it may be equal to some data in our graph if the node is public.

lightning/src/routing/router.rs Outdated Show resolved Hide resolved
@TheBlueMatt
Copy link
Collaborator Author

I did a first high-level pass and added an initial round of questions. I have to say that I'm close to a concept NACK on this one: the router logic is hard to reason about as it is and we keep discovering bugs here. It seems to me that this PR significantly increases the code complexity and introduces several new angles how things can go wrong. While this seems to work just fine for now, I fear that we'll see more breakage in the router code as a consequence in the future. If we really want to go ahead with this, it would be great if we could find a better abstraction for our newly created data structure that would offer a foolproof API, e.g., so we don't for get to insert/remove reused counters in the corresponding list.

Fair, let me encapsulate the node counter logic and remove it from get_route and then we can see how we feel about it.

@TheBlueMatt
Copy link
Collaborator Author

I have yet to run the benchmarks myself to see how much speedup this PR would gain us, but from my first impression I'm not convinced it's worth the increased risks and maintenance costs. Also, it seems that a good chunk of the performance improvements might come from the last few commits alone, which are optimizations that could be applied independently from switching to node counters?

Sadly not. The last few commits reduce the pressure we put on the branch predictor, and improve things a bit on the edges, but the vast majority of the gain here is dropping the hash table lookups. A very large portion of our total routing time is spent just doing hash table lookups directly (we have like 3 or 4 of them we index into in routing - the network graph, gossip data, dist, etc), so dropping one entirely is a huge win.

Until now, our routing benchmarks used a synthetic scorer,
generated by scoring random paths to build up some history. This is
pretty far removed from real-world routing conditions, as
alternative paths generally have no scoring information and even
the paths we do take have only one or two past scoring results.

Instead, we fetch a static serialized scorer, generated using
minutely probes. This means future changes to the scorer's data may
be harder to benchmark, but makes for substantially more realistic
benchmarks for changes which don't impact the serialized state.
@TheBlueMatt TheBlueMatt force-pushed the 2023-12-routing-dist-vec branch 2 times, most recently from 3b12e7a to 75d0c2e Compare March 19, 2024 22:26
This ensures the route benchmarks themselves will appear with a
distinct callgraph, making router profiling somewhat easier.
These counters are simply a unique number describing each node.
They have no specific meaning, but start at 0 and count up, with
counters being reused after a node has been deleted.
In the next commit, we'll use the new `node_counter`s to remove a
`HashMap` from the router, using a `Vec` to store all our per-node
information. In order to make finding entries in that `Vec` cheap,
here we store the source and destintaion `node_counter`s in
`ChannelInfo`, givind us the counters for both ends of a channel
without doing a second `HashMap` lookup.
In the next commit we'll stop using `NodeId`s to look up nodes when
routing, instead using the new per-node counters. Here we take the
first step, adding a local struct which tracks temporary counters
for route hints/source/destination nodes.

Because we must ensure we have a 1-to-1 mapping from node ids to
`node_counter`s, even across first-hop and last-hop hints, we have
to be careful to check the network graph first, then a new
`private_node_id_to_node_counter` map to ensure we only ever end up
with one counter per node id.
Now that we have unique, dense, 32-bit identifiers for all the
nodes in our network graph, we can store the per-node information
when routing in a simple `Vec` rather than a `HashMap`. This avoids
the overhead of hashing and table scanning entirely, for a nice
"simple" optimization win.
When processing the main loop during routefinding, for each node,
we check whether it happens to be our peer in one of our channels.
This ensures we never fail to find a route that takes a hop through
a private channel of ours, to a private node, then through
invoice-provided route hints to reach the ultimate payee.

Because this is incredibly hot code, doing a full `HashMap` lookup
to check if each node is a first-hop target ends up eating a good
chunk of time during routing. Luckily, we can trivially avoid this
cost.

Because we're already looking up the per-node state in the `dist`
map, we can store a bool in each first-hop target's state, avoiding
the lookup unless we know its going to succeed.

This requires storing a dummy entry in `dist`, which feels somewhat
strange, but is ultimately fine as we should never be looking at
per-node state unless we've already found a path to that node,
updating the fields in doign so.
Now that `PathBuildingHop` is stored in a `Vec` (as `Option`s),
rather than `HashMap` entries, they can grow to fill a full two
cache lines without a memory access performance cost. In the next
commit we'll take advantage of this somewhat, but here we update
the assertions and drop the `repr(C)`, allowing rust to lay the
memory out as it wishes.
While LLVM should inline and elide the redundant calls, because the
router is rather large LLVM can decide against inlining in some
cases where it would be an nice win.

Thus, its worth DRY'ing the redundant calls explicitly.
Because we now have some slack space in `PathBuildingHop`, we can
use it to cache some additional hot values. Here we use it to
cache the source and target `node_counter`s for public channels,
effectively prefetching the values from the channel state.
It turns out we spend several percent of our routefinding time just
checking if nodes and channels require unknown features
byte-by-byte. While the cost is almost certainly dominated by the
memory read latency, avoiding doing the checks byte-by-byte should
reduce the branch count slightly, which may reduce the overhead.
Because fetching fields from the `$candidate` often implies an
indirect read, grouping them together may result in one or two
fewer memory loads, so we do so here.
Because we scan per-channel information in the hot inner loop of
our routefinding immediately after looking a channel up in a
`HashMap`, we end up spending a nontrivial portion of our
routefinding time waiting on memory to be read in.

While there is only so much we can do about that, ensuring the
channel information that we care about is sitting on one or
adjacent cache lines avoids paying that penalty twice. Thus, here
we manually lay out `ChannelInfo` and `ChannelUpdateInfo` and set
them to 128b and 32b alignment, respectively. This wastes some
space in memory in our network graph, but improves routing
performance in return.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants