chathistory Support #370
Conversation
Force-pushed from 23874d0 to bdd144e
@andymandias I'll look to review this very soon 👀
Force-pushed from 3bb4486 to d6f23e3
Thanks @tarkah! It's definitely still a work in progress, in particular query buffers don't have

Things I think particularly need reviewing:

Brief addendum to note it looks like my next push will involve a moderate restructuring, so it's probably best to review after that (hopefully tonight).

I'll still be chipping away at this, but it's probably in a good spot for a WIP review now.
src/screen/dashboard.rs
Outdated
Some(buffer::Event::ScrolledToTop) => {
    if config.buffer.chathistory.infinite_scroll {
        if let Some(buffer) = pane.buffer.data() {
            let server = buffer.server();

            if clients.get_server_supports_chathistory(server) {
                if let Some(target) = buffer.target() {
                    let message_reference_type = clients
                        .get_server_chathistory_message_reference_type(
                            server,
                        );

                    let oldest_message_reference = self
                        .get_oldest_message_reference(
                            server,
                            &target,
                            message_reference_type,
                        );

                    clients.send_chathistory_request(
                        ChatHistorySubcommand::Before,
                        server,
                        &target,
                        oldest_message_reference,
                    );
                }
            }
        }
    }
}
Some(buffer::Event::ChatHistoryBeforeRequest) => {
    if let Some(buffer) = pane.buffer.data() {
        let server = buffer.server();

        if clients.get_server_supports_chathistory(server) {
            if let Some(target) = buffer.target() {
                let message_reference_type = clients
                    .get_server_chathistory_message_reference_type(
We can consider `infinite_scroll` within the `scroll_view` and conditionally return the event or not. Then we can define a single event variant `Event::RequestMoreChatHistory` that's used from both the button message & the scroll-to-top message and dedupe all this code.
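A rough sketch of that suggestion, with hypothetical names (`ScrollView`, `on_scrolled_to_top`, etc. are illustrative stand-ins, not Halloy's actual API): the scroll view consults `infinite_scroll` itself and only ever emits the one variant, so the dashboard has a single place to handle the request.

```rust
#[derive(Debug, PartialEq)]
enum Event {
    RequestMoreChatHistory,
}

struct ScrollView {
    infinite_scroll: bool,
}

impl ScrollView {
    // Fired when the user scrolls to the top of the buffer.
    fn on_scrolled_to_top(&self) -> Option<Event> {
        // Only emit the event when infinite scroll is enabled; otherwise
        // the dashboard never sees anything and has nothing to dedupe.
        self.infinite_scroll.then(|| Event::RequestMoreChatHistory)
    }

    // Fired when a "load more history" button is pressed.
    fn on_load_more_pressed(&self) -> Option<Event> {
        Some(Event::RequestMoreChatHistory)
    }
}
```

Both entry points funnel into the same variant, so the chathistory request logic above would only need to exist once.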
src/screen/dashboard.rs
Outdated
            }
        }
    }
    _ => (),
`_` is dangerous as the compiler won't notify us when new variants are added. Match on `None` instead to fully exhaust the match.
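A minimal illustration of the difference, with stand-in types (not Halloy's actual event enum):

```rust
#[derive(Debug)]
enum Event {
    ScrolledToTop,
    ChatHistoryBeforeRequest,
}

fn describe(event: Option<Event>) -> &'static str {
    match event {
        Some(Event::ScrolledToTop) => "scrolled to top",
        Some(Event::ChatHistoryBeforeRequest) => "history requested",
        // Matching `None` (rather than `_`) keeps the match exhaustive:
        // adding a new `Event` variant becomes a compile error here
        // instead of being silently swallowed by a wildcard arm.
        None => "no event",
    }
}
```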
Hadn't thought about it that way before, thanks for the suggestion/tip :)
src/screen/dashboard.rs
Outdated
if config.buffer.chathistory.infinite_scroll {
    if let Some((_, pane)) = self.get_focused_mut() {
        if let Some(buffer) = pane.buffer.data() {
            let server = buffer.server();

            if clients.get_server_supports_chathistory(server) {
                if let Some(target) = buffer.target() {
                    let message_reference_type = clients
                        .get_server_chathistory_message_reference_type(server);

                    let oldest_message_reference = self
                        .get_oldest_message_reference(
                            server,
                            &target,
                            message_reference_type,
                        );

                    clients.send_chathistory_request(
                        ChatHistorySubcommand::Before,
                        server,
                        &target,
                        oldest_message_reference,
                    );
                }
            }
        }
    }
}
This is the 3rd place this logic is defined. Time to refactor it into a helper method :)
I'm a bit unsure about all the API changes to So it seems to me the only API change of

It'd be great to sync up some time to discuss this over IRC so I can better understand the model at a high level, as there's a lot to internalize. I'm also seeing async functions getting called in from
    true
} else {
    false
};
Why `contains_starting_with` when we then insert it as the full value? We need to ensure we're requesting the exact same capability that we checked existed. Why doesn't just `contains` work here, since we check both with and without the `draft` prefix?
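For illustration, an exact-match lookup along those lines might look like the following (a hypothetical helper with simplified types, not Halloy's actual capability API): check both names with plain `contains` and request precisely the name that was found.

```rust
use std::collections::HashSet;

// Returns the exact capability name to request. Checking both the
// ratified and draft names with `contains` ensures we request the
// same string the server actually advertised.
fn chathistory_cap(available: &HashSet<String>) -> Option<&'static str> {
    ["chathistory", "draft/chathistory"]
        .into_iter()
        .find(|cap| available.contains(*cap))
}
```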
I thought I had run into a server that listed the `chathistory` capability with a value attached (e.g. `draft/chathistory=1000`), and I used `contains_starting_with` to sidestep having an issue with it, but I can't find any server where that's the case now. Will switch it back to `contains`, since that's my reading of the specification.
data/src/client.rs
Outdated
let events = if let Some(target) = batch_tag
    .as_ref()
    .and_then(|batch| self.batches.get(batch))
    .and_then(|batch| batch.chathistory_target.clone())
{
    if let Some(ChatHistoryRequest {
        subcommand,
        message_reference,
        ..
    }) = self.chathistory_requests.get(&target)
    {
You can collapse these two `if let` branches together by calling `self.chathistory_requests.get` in an `and_then` and pattern matching the `Some(ChatHistoryRequest` in the root `if let`.
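A sketch of the collapsed form with simplified stand-in types (the real code destructures more fields and builds events, elided here):

```rust
use std::collections::HashMap;

// Simplified stand-in for the real request type.
struct ChatHistoryRequest {
    subcommand: String,
}

// One `if let`: the map lookup happens inside `and_then`, and the
// root pattern destructures the `ChatHistoryRequest` directly.
fn pending_subcommand(
    requests: &HashMap<String, ChatHistoryRequest>,
    target: Option<String>,
) -> Option<String> {
    if let Some(ChatHistoryRequest { subcommand }) =
        target.and_then(|target| requests.get(&target))
    {
        Some(subcommand.clone())
    } else {
        None
    }
}
```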
data/src/client.rs
Outdated
_ => vec![Event::ChatHistorySingle(
    message,
    self.nickname().to_owned(),
    subcommand.clone(),
    message_reference.clone(),
)],
I assume we're just forwarding all messages within here to history and not calling `handle` as we don't want to update local state based on history? That makes sense.
Correct. Particularly since we're requesting `draft/event-playback` when available, there can be messages that would mess with local state.
data/src/isupport.rs
Outdated
pub enum MessageReference {
    Timestamp(DateTime<Utc>, String),
What's the 2nd field / string value of the `Timestamp` variant here? I don't see that documented anywhere here: https://ircv3.net/specs/extensions/chathistory

This seems to be our own internal representation of a message reference value. I don't think this should live in `isupport`, but probably in `message::Reference`.
I think this warrants discussion, particularly since I'm not 100% convinced the way I've implemented this is what we want long term. Difficulties arise with `Timestamp`-based `chathistory` since the `DateTime<Utc>` is not a unique identifier. All messages within the same second will have the same timestamp, and while that doesn't happen all the time, it happens enough (especially since we're getting `JOIN`/`QUIT`/`PART`/etc messages as part of our requests). Also, because of that lumping together, we need to request messages with a server time before the last message we saw, otherwise we might miss messages that happened in the same interval but the client didn't receive. That means we will basically have at least one duplicate message per chathistory request when using `Timestamp`. In order to perform deduplication and proper stitching together of message histories, we need to be able to distinguish between messages with the same timestamp, which is where the `String` comes in.

Originally I had the `String` be the `text` of the message, but that has obvious issues when the message text depends on the current state (i.e. for `JOIN`/`QUIT`/etc messages where the user in `text` will be represented by what we currently know about them). So, I switched the second bit of information to be a `client_id`, which is a seahash of the `Encoded` that comes in. I don't love this really, since changing the structure of `Encoded` will cause ids to shift, but it works better than using `text`.

Client ids are distinguished from server ids by starting with `:`, which is an invalid character in a message tag (which is how server ids are delivered).
I swear I was seeing second-rounded server times, but I must have screwed up a debug print somewhere, because I see proper millisecond times looking at it now.
Pulling out the `client_id` business for now. Will push that along with other changes to match our discussion tonight.
@tarkah I figured out why I thought I was seeing server times rounded to the nearest second: Halloy's public soju bouncer is returning message server times rounded to the nearest second. The instance of soju I fired up for personal use (edited: it's v0.8.0 now) is returning millisecond server times with messages. Maybe the public bouncer is running an older version?
data/src/history/manager.rs
Outdated
pub async fn load_now(&mut self, server: Server, target: &str) {
    let kind = if proto::is_channel(target) {
        history::Kind::Channel(target.to_string())
    } else {
        history::Kind::Query(target.to_string().into())
    };

    let loaded_server = server.clone();
    let loaded_kind = kind.clone();

    let loaded_messages = history::load(&server.clone(), &kind.clone())
        .map(move |result| Message::Loaded(loaded_server, loaded_kind, result))
        .await;

    self.update(loaded_messages);
}

#[tokio::main]
pub async fn make_partial_now(
    &mut self,
    server: Server,
    target: &str,
    message_reference: Option<MessageReference>,
) {
    let kind = if proto::is_channel(target) {
        history::Kind::Channel(target.to_string())
    } else {
        history::Kind::Query(target.to_string().into())
    };

    self.data.map.get_mut(&server).and_then(|map| {
        map.get_mut(&kind)
            .and_then(|history| history.make_partial(message_reference))
    });
}
What are these `_now` methods used for? They're async but we're calling them from a blocking thread, so we aren't actually doing anything.
My understanding is that the `#[tokio::main]` allows them to run ~synchronously. The histories are loaded, that much I am certain of. Regardless, this is the crime against asynchronicity I was referring to earlier, and is a hack that's not really intended to stay. The ultimate goal of this is to load the history such that the history can be used to (a) find a reference message to base the chathistory request on and (b) perform any necessary deduplication on received messages. I've been working on reducing the need for (b) with the aim of ultimately removing that need entirely, but I haven't tested it lately to see if the need is gone yet. I presume that (a) could be done more elegantly than I'm doing here, but I wanted to hash out some of the basic chathistory functionality before trying to work that out.
data/src/history.rs
Outdated
let unread_messages =
    message_reference.and_then(|message_reference| match message_reference {
        isupport::MessageReference::None => Some(messages.split_off(0)),
        _ => messages
            .iter()
            .rev()
            .position(|message| message_reference == *message)
            .map(|reference_position| {
                messages.split_off(messages.len() - reference_position)
            }),
    });
let opened_at = if let Some(ref unread_messages) = unread_messages {
    unread_messages.iter().fold(
        Posix::now(),
        |earliest_received_at, unread_message| {
            std::cmp::min(earliest_received_at, unread_message.received_at)
        },
    )
} else {
    Posix::now()
};
let messages = std::mem::take(messages);

*self = Self::partial(server.clone(), kind.clone(), Posix::now());
*self = Self::partial(server.clone(), kind.clone(), unread_messages, opened_at);
I'm not sure I follow what `unread_messages` and `opened_at` mean here when we convert from a full "opened" buffer history to a partial "closed" buffer history.

Why are we calling `make_partial` from a chat history event? The idea behind our history API and full vs partial is that these are full / partial based on whether the underlying buffer is open / closed. Our buffer state is the source of truth that drives how history is opened / closed.

- Partial exists as a way to store messages received on a closed buffer, which are then fully flushed to disk after X seconds of no activity.
- Full, for open buffers, has all messages in it and we append new messages; we save this after X seconds of no activity but keep it open. When a buffer is closed, we flush this entire history to disk and convert it to partial.

This lifecycle is managed by `Manager::track`, which is called after processing any `Dashboard` messages, where we see which set of buffers are open.

Shouldn't chathistory messages just feed into the existing full / partial history that exists for a buffer? I don't think chathistory should break this lifecycle.
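The lifecycle in those bullets could be sketched roughly like this (a deliberately simplified stand-in: the real `History` also tracks timestamps, unread counts, and disk I/O, all elided here):

```rust
// Buffer open/closed state decides the variant; messages carry across
// the transition. Flushing to and loading from disk are elided.
enum History {
    // Closed buffer: messages accumulate and are flushed to disk
    // after a period of inactivity.
    Partial { messages: Vec<String> },
    // Open buffer: holds all messages; converted back to Partial
    // when the buffer closes.
    Full { messages: Vec<String> },
}

impl History {
    // Buffer opened: partial becomes full (loading the on-disk
    // backlog would happen here).
    fn buffer_opened(self) -> History {
        match self {
            History::Partial { messages } | History::Full { messages } => {
                History::Full { messages }
            }
        }
    }

    // Buffer closed: full is flushed to disk (elided) and becomes partial.
    fn buffer_closed(self) -> History {
        match self {
            History::Partial { messages } | History::Full { messages } => {
                History::Partial { messages }
            }
        }
    }
}
```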
This is mainly a continuation of the hack above, and is the second thing I was trying to highlight needed review. I don't disagree with anything you're saying here, crimes are being committed which need to be rectified. But let me lay out the functionality that the crimes provide, because I'm not yet sure how else to get at it.

When joining a channel after launch, we want to get all messages since we last closed Halloy. To do that properly, we need to know what the last message we received for that channel was, but we can't do that without having loaded the message history for the channel. We can request the `LATEST` however many messages, but there will often be duplicates in there we don't really want to process, and we won't know if they're duplicates until we can compare them to the message history. That messes with our ability to tell the user whether the channel has any new messages or not, and I wasn't able to see any way around it. The process of loading a channel would also become more complicated in that scenario (deduplication would need to be done on channel load, along with any additional requests needed to fill in chat history that wasn't covered by the initial request), but that part is doable.

So, the hack that's being performed now is to load the history for every channel joined, use the loaded history to form the chathistory request for the latest messages, position received messages, and deduplicate, then make the loaded history partial. All this junk right here is a way to convert a loaded history with all the chathistory request messages added into a proper partial history that we'd see if the chathistory messages had been loaded into a proper, fresh partial history.
data/src/history.rs
Outdated
fn add_chathistory_message(
    &mut self,
    message: Message,
    subcommand: ChatHistorySubcommand,
    message_reference: MessageReference,
) {
    match self {
I haven't tested it, but I think we can simplify this and expose a single `add_message` API with the following. This is also not dependent on `ChatHistorySubcommand`; I'm not sure why that's needed here for splicing the message in.

We basically splice messages in `server_time` order, allowing us to `binary_search` on that time, which should be pretty fast.

Binary search returns a `Result<usize, usize>`, where the Ok value means an identical `Ordering::Equal` match was found, so we can swap out that value with the incoming value.

This is 100% safe when we use the `Lookup::MessageId`, since we return `Equal` if the ids match; otherwise the binary search continues based on server time.

For `Lookup::Timestamp` this means the message has an identical timestamp. Since the `server-time` extension mandates millisecond accuracy, and if we don't have server-time we use `Utc::now`, which has nanosecond accuracy, I think this is a strong match. I doubt we will have too many false positives.

The Err value means no match was found, but the position returned is the index where the element can be safely inserted to preserve ordering (`server_time`), so we can just safely insert it.
fn add_message(&mut self, message: Message, reference: Option<MessageReference>) {
    if let History::Partial {
        unread_message_count,
        ..
    } = self
    {
        if message.triggers_unread() {
            *unread_message_count += 1;
        }
    }

    match self {
        History::Partial {
            messages,
            last_received_at,
            ..
        }
        | History::Full {
            messages,
            last_received_at,
            ..
        } => {
            enum Lookup {
                Timestamp(DateTime<Utc>),
                MessageId(String),
            }

            let lookup = match reference {
                Some(MessageReference::Timestamp(timestamp, _)) => Lookup::Timestamp(timestamp),
                Some(MessageReference::MessageId(id)) => Lookup::MessageId(id),
                Some(MessageReference::None) | None => {
                    if let Some(id) = message.id.clone() {
                        Lookup::MessageId(id)
                    } else {
                        Lookup::Timestamp(message.server_time)
                    }
                }
            };

            let search = match lookup {
                Lookup::Timestamp(timestamp) => {
                    messages.binary_search_by(|stored| stored.server_time.cmp(&timestamp))
                }
                Lookup::MessageId(id) => messages.binary_search_by(|stored| {
                    if stored.id.as_ref() == Some(&id) {
                        Ordering::Equal
                    } else {
                        stored.server_time.cmp(&message.server_time)
                    }
                }),
            };

            match search {
                // Identical value found, replace it
                //
                // Message id is a 100% match
                // Timestamp (server time) is millisecond accuracy so
                // a very strong match
                Ok(pos) => {
                    messages.splice(pos..pos + 1, Some(message));
                }
                // No match found, insert it at pos (server time sorted order)
                Err(pos) => messages.insert(pos, message),
            }

            *last_received_at = Some(Instant::now());
        }
    }
}
I had considered something like this (was thinking of using a `BTreeMap`), but I was worried about the overhead it would put on the non-chathistory variant of `add_message`.

I'll put this in and test it.
Binary search on 10k messages will be pretty quick, that's like 13-14 iterations before you've exhausted the search.
However, I think I need to make it more robust. I have some ideas and will dig in more tomorrow.
I believe we'd need to sort the messages by `server_time` or `id` for this to work. Sorting by `server_time` should be okay, that's more or less how we expect messages to be already, but we have no guarantee that sorting by `id` will match the ordering given by `server_time`.
Correct, the binary search will insert things in server-time sorted order.

I plan to do 2 binary search lookups +/- N duration from the current message's server time.

I can then iterate through those messages and check for an id match. If an id match occurs, replace. If no id match, I can insert after the last position where server time <= current message.

This allows us to handle checking messages with duplicate server times, which the binary search function on its own doesn't handle.

We can also check user + content as well if no id exists for the duplicate.
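The windowed lookup described above can be sketched as follows (a std-only sketch with simplified stand-in types: `i64` milliseconds instead of `DateTime<Utc>`, and a pared-down `Message`; the real implementation differs):

```rust
#[derive(Debug, PartialEq)]
struct Message {
    server_time: i64, // millis; stand-in for DateTime<Utc>
    id: Option<String>,
    content: String,
}

const FUZZ_MILLIS: i64 = 1_000;

// Assumes `messages` is sorted by server_time.
fn insert_message(messages: &mut Vec<Message>, message: Message) {
    // Two binary searches find the bounds of the +/- N window.
    let start = messages
        .partition_point(|m| m.server_time < message.server_time - FUZZ_MILLIS);
    let end = messages
        .partition_point(|m| m.server_time <= message.server_time + FUZZ_MILLIS);

    // Linear scan inside the window only: replace on id match.
    if let Some(offset) = messages[start..end]
        .iter()
        .position(|m| m.id.is_some() && m.id == message.id)
    {
        messages[start + offset] = message;
        return;
    }

    // No duplicate found: insert after the last position where
    // server_time <= incoming, preserving sorted order.
    let pos = messages.partition_point(|m| m.server_time <= message.server_time);
    messages.insert(pos, message);
}
```

This keeps the linear scan bounded to roughly a one-second slice of the history while the binary searches stay logarithmic in the total message count.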
For sure. I'm thinking this `add_message` can be completely unaware of chat history, since we already have the id and server time in the incoming message, so we can just treat all incoming messages the same for how we insert them.
Sounds good to me. I will leave it to you :) Edit: Happy to help out with it in any way still, will just be getting out of your way for now.
Ok, check out 0cd37be and let me know what you think.

It inserts items in server-time sorted order and uses a fuzz duration of +/- 1 second of the incoming message's server time to check the surrounding messages for "deduplication" (a message id match, or a looser definition of server time + target + content match).

This gives us the benefit of only having to do linear search on a 1-second subset of the messages, and a binary search to find the bounds of that linear search.

Otherwise, if we want to check ALL messages for a clash of message id, it'd be linear search on 10k messages, which would hurt. I don't think it's reasonable to check messages outside +/- 1 second for any collision.
Looks good to me! I agree that checking all messages for a clash is not necessary (and potentially not desirable, since a non-duplicate clash means something odd has happened; let's cross that bridge when we come to it).

I incorporated it further into the existing PR and made a couple small tweaks:

- I think we don't want to mark a message as incrementing the unread count when it is truly a duplicate of an existing message. Timestamp-based requests are fuzzed, so we will frequently get a message back that the user has already seen. So I added a bit of logic to not increment the unread message count in that scenario.
- `JOIN`/`PART`/`QUIT` messages frequently, in my experience, do not match well on text due to how the user attributes resolve. E.g. when a new user joins a channel, it's not uncommon for their user attributes to not resolve at all if they weren't in another channel with the Halloy user, but if they're still in the channel with the Halloy user when the `chathistory` batch is received, then their user attributes will resolve when the `chathistory` copy of the join is received. So I relaxed the match requirements for those messages to just match on the target, which contains the nick of the user involved in the message.
Sounds good to me!
Force-pushed from 317a1b8 to cb9b86c
@andymandias how does this feature interact w/ bouncers that send a replay buffer on connect? Are the bouncers programmed to only send one or the other, or will both get sent? If both are sent, do we just rely on our dedupe strategy to eliminate dupes, or do we have some other mechanism to handle this?

@tarkah the ircv3 description of the specification says that the replay buffer should not be sent automatically when a client has negotiated
Add `id` component to messages, and record message id when available. Request all messages since last seen message or latest <=500 messages when joining a channel. Request <=500 older messages when scrolling to the top of channel buffer. Drop duplicate messages.
Initial timestamp-based deduplication.
…or of its utilization). Fix timestamp-based deduplication. Skip `RPL_TOPIC` and `RPL_TOPICWHOTIME` messages when looking for the oldest/latest message in a channel.
…p-to-date message reference. Add fuzz interval to timestamp based `chathistory` requests. Ignore messages received at/after `JOIN` when finding the latest message in a channel (replaces ignoring `RPL_TOPIC` and `RPL_TOPICWHOTIME` messages). Ignore internal messages when finding the oldest/latest message in a channel.
…when finding oldest/latest message in channel.
…herwise unrelated to `chathistory`).
…lication, then convert to `History::Partial` for all channels without a pane.
…d when joining a channel. Filter timestamp-based `chathistory` batches in order to reduce the need for record-time deduplication. Use hash of `Encoded` for client-side message-id (used to distinguish between timestamp-based `chathistory` messages with the same timestamp).
…ry` messages. Add preference for automatically loading older `chathistory` messages when scrolled to the top of channel history.
…g at the first message that belongs in the backlog. This way the backlog divider won't be moved when messages are added to the beginning of a message history via `chathistory`.
Renamed configuration option to more typical phrase.
Use `server_time` to position message with respect to unreferenceable messages.
Fix for `opened_at` calculation in `make_partial` (the earliest received message does not always appear at the beginning of the `unread_messages` vector).
Other simplifications and clarifications suggested by review.
Efficient deduplication via the new insertion strategy can be used in place of filtering received `chathistory` batches. Minor tweaks to insertion strategy: more stringent checking for triggering unread notification (to avoid false positives) and more relaxed matching for `JOIN`/`PART`/`QUIT` messages (to avoid false negatives).
… all `chathistory` subcommands. Switch from repeated `AFTER` subcommands to `LATEST` → repeated `BETWEEN` commands to receive history when joining a channel (`AFTER` can fail silently, with 0 returned messages, if the message reference is no longer in the history available from the server). Tweaks to `insert_message` (utilize same fuzz constant as `send_chathistory_request`) and to `is_referenceable_message` (disallow `RPL_TOPIC` and `RPL_TOPICWHOTIME` as message references).
Force-pushed from 729d27e to d7e9d49
Another restructuring, to better allow for using Should be a bit clearer in intent and use than before, but it doesn't change any of the problem areas of the PR (
Allow messages with matching content to match as duplicate without matching `server_time` iff the stored message is a sent message and the inserted message is a received message; reduced strictness to match sent messages where the server's copy will have a slightly different time than the client.
… channel. Messages retrieved when joining a channel are no longer properly deduplicated.
Channels will be marked as having unread messages even if those messages will ultimately wind up deduplicated.
Work in progress PR to add IRCv3 CHATHISTORY support (i.e. #206). Since it's not a widely available feature yet, the goal is to minimize any effects on Halloy's operation when CHATHISTORY is not available. Testing on ircs://irc.ergo.chat:6697/ to start.
Currently (more-or-less) implemented:
Planned:
- `event-playback` to allow replay of channel events (`JOIN`/`PART`/`QUIT`/etc).
- `HistServ` appears to be non-standard, so all its messages are filtered out for now for parity of experience.
- Convert `HistServ` PRIVMSG messages into the appropriate `JOIN`/`PART`/`QUIT`/etc messages.
- `JOIN`.
- `JOIN` messages request(s)).
- `LATEST` functionality (artificially limit the size of batches).
- (`indexmap` for message history?). And/or, avoid the need for deduplication.
- `TARGETS` to open Query buffers for direct messages sent while disconnected.
- `chathistory` request timeout.