server: rdb loader big string loading in chunks #4623

Merged
merged 4 commits into main from rdb_loader_split_reading_string
Feb 20, 2025

Conversation

adiholden
Collaborator

In this PR we introduce loading of rdb strings iteratively in chunks, to prevent memory spikes when loading big strings.
In this PR:

  1. Add 2 APIs to compact object: reserve memory and append to string.
  2. Read string data from rdb in chunks in case of a big string.
  3. Disable the rdb save compression logic for big buffers, to prevent an additional spike during snapshot save (due to the compression buffer) and in the loader (to decompress the buffer).
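The chunked path described in point 2 can be sketched roughly as follows. This is a minimal illustration only, not Dragonfly's actual RdbLoader code: `LoadStringInChunks`, `kChunkSize`, and the use of `std::istream` as the sink are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Illustrative chunk size; the real loader would use a much larger value.
constexpr size_t kChunkSize = 4;

// Reserve the full destination size once, then fill it chunk by chunk.
// This avoids holding both a temporary full-size read buffer and the
// destination string at the same time (the memory spike the PR fixes).
std::string LoadStringInChunks(std::istream& sink, size_t len) {
  std::string out;
  out.reserve(len);                   // reserve capacity once up front
  std::string buf(kChunkSize, '\0');  // small reusable chunk buffer
  size_t remaining = len;
  while (remaining > 0) {
    size_t n = std::min(remaining, kChunkSize);
    sink.read(&buf[0], n);            // read one chunk from the sink
    out.append(buf.data(), n);        // append it to the destination
    remaining -= n;
  }
  return out;
}
```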

Signed-off-by: adi_holden <adi@dragonflydb.io>
@kostasrim
Contributor

@adiholden ready for review ?

@adiholden
Collaborator Author

> @adiholden ready for review ?

Yes, in general, but I want to see a green CI run.

@kostasrim
Contributor

> @adiholden ready for review ?
>
> Yes, in general, but I want to see a green CI run.

Alright, I'll take a look when it becomes green :)

fix bug
Signed-off-by: adi_holden <adi@dragonflydb.io>
type_ = OBJ_STRING;
encoding_ = OBJ_ENCODING_RAW;
MakeInnerRoom(0, size, mr);
sz_ = 0;
Collaborator

Why sz_ = 0;?
MakeInnerRoom copies the string if it existed before.

Collaborator

Now I see the use case. Maybe just check that inner_obj() == nullptr at line 560?

@@ -230,7 +230,7 @@ RdbSerializer::~RdbSerializer() {
VLOG(2) << "compression mode: " << uint32_t(compression_mode_);
if (compression_stats_) {
VLOG(2) << "compression not effective: " << compression_stats_->compression_no_effective;
VLOG(2) << "small string none compression applied: " << compression_stats_->small_str_count;
VLOG(2) << "string none compression applied: " << compression_stats_->size_skip_count;
Contributor

Suggested change:
- VLOG(2) << "string none compression applied: " << compression_stats_->size_skip_count;
+ VLOG(2) << "string compression skipped " << compression_stats_->size_skip_count;

@@ -1331,15 +1331,15 @@ error_code RdbLoaderBase::ReadObj(int rdbtype, OpaqueObj* dest) {
iores = ReadStreams(rdbtype);
break;
case RDB_TYPE_JSON:
- iores = ReadJson();
+ iores = ReadGeneric(rdbtype);
Contributor

Why don't we move this to line 1313 and let it fall through, just like RDB_TYPE_STRING? (You did the same there 😄)

if (ec)
return make_unexpected(ec);

- if (StrLen(str_obj) == 0) {
+ if (!is_string_type && StrLen(str_obj) == 0) {
Contributor

If is_string_type is true and pending_read_.remaining is false, we won't return Unexpected, and we will miss that the rdb file is corrupted.

Collaborator Author

Not sure I understood your comment correctly.
This logic is to enable empty strings: is_string_type=true and pending_read_.remaining is false, and therefore we do not return rdb file corrupted.

Contributor

Yeah, sorry, let me explain. If we call ReadStringObj above, a string can have:

  1. len < kMaxStringSize, upon which we read the whole string in one iteration -- the usual path
  2. len > kMaxStringSize, and therefore pending_read_.remaining > 0, so we load the string in chunks over multiple iterations

For case (1), if StrLen(str_obj) == 0, the rdb file should be corrupted because the string is an empty string. Isn't that right? This is similar to what we had before: we tried to load a string in one go and received an empty string -- isn't that a corrupted rdb file?

Maybe what I am missing here is the reason we want to support reading an empty string because to my mind:

  1. We read the whole string in one go -> if it's empty then rdb file corrupted
  2. We read the whole string in multiple chunks -> Why would a chunk be empty ?

Collaborator Author

An empty string is valid. You can add an empty string to dragonfly:
set x ""

Contributor

Oh, that's true. Then if pending_read_.remaining > 0 and StrLen(str_obj) == 0, we have a corrupted rdb file, since we know the string is not empty (because we got remaining) but we just read 0!

Collaborator Author

If remaining > 0 we enter ReadRemainingString, and it will read from the sink. If the length I am trying to read is not available, it will return an error.
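The behavior described here can be sketched as a stand-alone function. This is a hypothetical version for illustration only; the real ReadRemainingString reads from the loader's sink and uses Dragonfly's own error types.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <system_error>

// Read exactly `remaining` bytes from the sink into *dest. If the sink
// delivers fewer bytes than the length header announced, the stream is
// truncated/corrupt and we surface an error instead of a short read.
std::error_code ReadRemainingString(std::istream& sink, size_t remaining,
                                    std::string* dest) {
  std::string buf(remaining, '\0');
  sink.read(&buf[0], remaining);
  if (static_cast<size_t>(sink.gcount()) != remaining)
    return std::make_error_code(std::errc::io_error);  // short read: corrupt file
  dest->append(buf);
  return {};
}
```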

Contributor

gotcha cheers

@@ -2695,6 +2695,48 @@ async def test_replication_timeout_on_full_sync_heartbeat_expiry(
await assert_replica_reconnections(replica, 0)


@pytest.mark.exclude_epoll
@dfly_args({"proactor_threads": 1})
async def test_big_string(df_factory):
Contributor

We could just DEBUG POPULATE a big string, save to rdb, shut down df, and reload. Why bother with a replication test here? It's very similar to test_big_value_serialization_memory_limit.

Collaborator Author

The static seeder used in this test is debug populate, but I can change this to use debug populate directly.
I don't think that using replication is a bother; we want to make sure that we don't have a spike in memory when loading the data. From my perspective, using a replica or doing save and then load is similar.

CR fix
Signed-off-by: adi_holden <adi@dragonflydb.io>
CR fix
Signed-off-by: adi_holden <adi@dragonflydb.io>
@@ -1393,10 +1392,24 @@ error_code RdbLoaderBase::ReadStringObj(RdbVariant* dest) {
}
}

if (big_string_split && len > kMaxStringSize) {
Collaborator

do we split a compressed string as well?

Collaborator Author

We do not compress strings on save, only big blobs, e.g. multiple entries together.

Collaborator Author

Hmm, we actually do if we use CompressionMode::SINGLE_ENTRY, baaaaa

Collaborator Author

So in case the string is compressed, the opcode is RDB_ENC_LZF and we don't split.
We don't really use CompressionMode::SINGLE_ENTRY, so I don't think we need to optimise for this.
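A rough sketch of the resulting decision. The Opcode enum, the kMaxStringSize value, and the ShouldSplit helper are all illustrative stand-ins, not the actual loader code.

```cpp
#include <cassert>
#include <cstddef>

// kLzf stands in for the RDB_ENC_LZF opcode of a compressed string.
enum class Opcode { kPlain, kLzf };

constexpr size_t kMaxStringSize = 1u << 20;  // illustrative threshold only

// Compressed strings (RDB_ENC_LZF) are never split; only plain strings
// above the size threshold take the chunked path, and only when the
// big_string_split feature is enabled.
bool ShouldSplit(Opcode op, size_t len, bool big_string_split) {
  return big_string_split && op == Opcode::kPlain && len > kMaxStringSize;
}
```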

@adiholden adiholden merged commit 6016ae1 into main Feb 20, 2025
10 checks passed
@adiholden adiholden deleted the rdb_loader_split_reading_string branch February 20, 2025 14:56