server: rdb loader big string loading in chunks #4623
Conversation
Signed-off-by: adi_holden <adi@dragonflydb.io>
@adiholden ready for review?
Yes, in general, but I want to see a green CI run.
Alright, will take a look when it becomes green :)
src/core/compact_object.cc
Outdated
  type_ = OBJ_STRING;
  encoding_ = OBJ_ENCODING_RAW;
  MakeInnerRoom(0, size, mr);
  sz_ = 0;
Why sz_ = 0?
MakeInnerRoom copies the string if it existed before.
Now I see the use case. Maybe just check that inner_obj() == nullptr at line 560?
src/server/rdb_save.cc
Outdated
@@ -230,7 +230,7 @@ RdbSerializer::~RdbSerializer() {
   VLOG(2) << "compression mode: " << uint32_t(compression_mode_);
   if (compression_stats_) {
     VLOG(2) << "compression not effective: " << compression_stats_->compression_no_effective;
-    VLOG(2) << "small string none compression applied: " << compression_stats_->small_str_count;
+    VLOG(2) << "string none compression applied: " << compression_stats_->size_skip_count;
Suggested change:
- VLOG(2) << "string none compression applied: " << compression_stats_->size_skip_count;
+ VLOG(2) << "string compression skipped " << compression_stats_->size_skip_count;
src/server/rdb_load.cc
Outdated
@@ -1331,15 +1331,15 @@ error_code RdbLoaderBase::ReadObj(int rdbtype, OpaqueObj* dest) {
     iores = ReadStreams(rdbtype);
     break;
   case RDB_TYPE_JSON:
-    iores = ReadJson();
+    iores = ReadGeneric(rdbtype);
Why don't we move this to line 1313 and let it fall through, just like RDB_TYPE_STRING (which you did the same way 😄)?
  if (ec)
    return make_unexpected(ec);

-  if (StrLen(str_obj) == 0) {
+  if (!is_string_type && StrLen(str_obj) == 0) {
If is_string_type is true and pending_read_.remaining is false, we won't return Unexpected, and we will miss that the rdb file is corrupted.
Not sure I understood your comment correctly.
This logic is there to allow empty strings: is_string_type is true and pending_read_.remaining is false, and therefore we do not report the rdb file as corrupted.
Yeah, sorry, let me explain. When we call ReadStringObj above, a string can have:
- len < kMaxStringSize, in which case we read the whole string in one iteration -- the usual path
- len > kMaxStringSize, and therefore pending_read_.remaining > 0, so we load the string in chunks over multiple iterations

For case (1), if StrLen(str_obj) == 0, the rdb file should be corrupted because the string is empty. Isn't that right? This is similar to what we had before: we tried to load a string in one go and received an empty string -- isn't that a corrupted rdb file?
Maybe what I am missing here is the reason we want to support reading an empty string, because to my mind:
- We read the whole string in one go -> if it's empty, then the rdb file is corrupted
- We read the whole string in multiple chunks -> why would a chunk be empty?
An empty string is valid. You can add an empty string to Dragonfly:
set x ""
Oh, that's true. Then if pending_read_.remaining > 0 and StrLen(str_obj) == 0, we have a corrupted rdb file, since we know the string is not empty (because we got remaining) but we just read 0!
If remaining > 0 we enter ReadRemainingString, and it reads from the sink. If the length I am trying to read is not available, it returns an error.
gotcha cheers
tests/dragonfly/replication_test.py
Outdated
@@ -2695,6 +2695,48 @@ async def test_replication_timeout_on_full_sync_heartbeat_expiry(
     await assert_replica_reconnections(replica, 0)


+@pytest.mark.exclude_epoll
+@dfly_args({"proactor_threads": 1})
+async def test_big_string(df_factory):
We could just DEBUG POPULATE a big string, save to rdb, shut down df and reload. Why bother with a replication test here? It's very similar to test_big_value_serialization_memory_limit.
The static seeder used in this test is debug populate, but I can change this to use debug populate directly.
I don't think that using replication is a bother; we want to make sure we don't have a spike in memory when loading the data. From my perspective, using a replica or doing a save and then a load is similar.
@@ -1393,10 +1392,24 @@ error_code RdbLoaderBase::ReadStringObj(RdbVariant* dest) {
     }
   }

+  if (big_string_split && len > kMaxStringSize) {
do we split a compressed string as well?
We do not compress strings on save, only big blobs, e.g. entries together.
hmm we actually do if we use CompressionMode::SINGLE_ENTRY baaaaa
So in case the string is compressed, the opcode is RDB_ENC_LZF and we don't split.
We don't really use CompressionMode::SINGLE_ENTRY, so I don't think we need to optimize for this.
In this PR we introduce rdb loading of strings iteratively in chunks, to prevent memory spikes when loading big strings.