-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mark virtual machines as tainted and pick untainted machines from the vm pool #1583
base: main
Are you sure you want to change the base?
Conversation
Sequel.migration do | ||
change do | ||
alter_table(:vm) do | ||
add_column :tainted_at, :timestamptz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tainted typically means something became bad/unusable. And maybe also that it should be destroyed. Also, I'm wondering if we should distinguish between the time we take a vm from a pool and the time it becomes available to a customer. Potentially something like this:
allocated_from_pool_at
: the time we took the vm from the pool
supplied_at
: the time it became available to the user
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I borrowed the term 'tainted' from @fdr, who suggested to mark virtual machines. We can use more positive work. tainted
might have little negative impact.
The time we take a vm from a pool and the time it becomes available to a customer have distinct implications for different resources. For the standard VM product, these two times are same. However, for the runner product, we use the ready_at
column to differentiate the latter. Thus, the final product should include a column to distinguish this. If we consider the runner service as a customer for the standard VM service, we have a time when the VM becomes available to the runner (which is the same as when it is picked from the pool), and when the runner becomes available to the end customer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, fair, taint is often with some kind of badness, though per computer science tradition badness can be pretty abstract, e.g. in garbage collectors. In our case, the badness is increasing restriction on how we can manipulate the VM, as it could plausibly have data that we absolutely must not leak between customers, and not delete in a casual manner. In this respect it has some similarity how the jargon is used in garbage collectors (or even virtual memory, "dirty pages" is a related concept), where objects are often "tainted" as somehow being more complex to collect.
We could keep this wording in that tradition or have a new one.
As an aside, something I did previously was the inverse, and not at all for wording reasons, which was more like a "purity" (or, to get outside that nomenclature, a "flexibility") flag that objects started with and then was removed the moment it was plausible a customer might have put their irreplaceable data in it.
This took the form of a can_recreate
semaphore that relevant objects started with and would be decr'd at some point during their provisioning process, when it was plausible irreplaceable user data might have entered the system. If recreate
was incremented and can_recreate
was still present, then the object could be re-provisioned. This "double-trigger" style provided very precise results in not permitting mistakes, and being concurrency safe. I'm not saying it's necessarily superior to what this might do, it's more just for reflection by the author.
I suppose the nice thing about that system is that the overwhelmingly common case results in can_recreate
disappearing, i.e. there is zero bloat in the common case, provided one things that knowing the far-distant tainted_at
time is extraneous...which it may not be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thinking it over, how many times would we reference this symbol? maybe three, no more than six times? it is okay to avoid a jargon at expense of length. has_customer_data
or something like that might work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm personally ok if you decide to go with "tainted". Just wanted to bring this up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, both are fine, but I'd prefer something similar to has_customer_data
, which requires less thinking about its function if you don't have the context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I like the approach but maybe we can make it better?
57c0bc2
to
8b6c2eb
Compare
prog/vm/nexus.rb
Outdated
@@ -176,6 +176,7 @@ def clear_stack_storage_volumes | |||
def before_run | |||
when_destroy_set? do | |||
if strand.label != "destroy" | |||
vm.update(tainted_at: Time.now) unless vm.tainted_at |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose one advantage of a "purity flag removal" idea is that the code to remove it is branchless, i.e. multiple unguarded updates do not mislead.
But a way to mitigate this would be to put the branch into a special method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may refactor it to vm.this.where(tainted_at: nil).update(tainted_at: Time.now)
to make it branchless, if you believe this is not an anti-pattern
model/vm_pool.rb
Outdated
# first lock the whole pool in the join table so that no other thread can | ||
# pick a vm from this pool | ||
vms_dataset.for_update.all | ||
pick_vm_id_q = vms_dataset.left_join(:github_runner, vm_id: :id) | ||
.where(Sequel[:github_runner][:vm_id] => nil, Sequel[:vm][:display_state] => "running") | ||
.select(Sequel[:vm][:id]) | ||
Vm.where(id: pick_vm_id_q).first | ||
end | ||
def pick_vm(new_name) | ||
vm = idle_vms_dataset.order(Sequel.lit("RANDOM()")).first | ||
return if vm.nil? | ||
|
||
# If we can successfully update the VM, we can be certain that no other | ||
# thread has selected it. If the update fails, we continue to pick different | ||
# VMs until we find an idle one or exhaust the pool of runners. | ||
return pick_vm(new_name) if idle_vms_dataset.where(id: vm.id).update(name: new_name, tainted_at: Time.now).zero? | ||
|
||
vm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something about this code seems a little weird, it's odd to:
- Run a query that qualifies a tuple
- Run it again, but this time with update, to get the locking/atomicity you need
Particularly when there's not much going on between steps (1) and (2).
How about update
with returning
? You can use Model#call
to make model objects with any hash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The challenge is updating a single idle vm without an ID for filtering in the update query. Using limit
with update
results in Sequel::InvalidOperation: Dataset#update not supported on datasets with limits or offsets
.
A potential solution is to incorporate the first query as a subquery in the second one.
An example:
p.idle_vms_dataset
.where(id: p.idle_vms_dataset.order(Sequel.lit("RANDOM()")).limit(1).select(:id))
.update(name: "new-name", tainted_at: Time.now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could also use a query with a CTE, in SQL it'd look like:
with candidate as (
select id from vm where ... limit 1
)
update vm set ... from candidate where vm.id = candidate.id
returning ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored it into a CTE version with @bsatzger's help.
vm_hash = DB[:vm]
.returning
.with(:candidate, idle_vms_dataset.limit(1))
.from(:vm, :candidate)
.where(Sequel[:vm][:id] => Sequel[:candidate][:id])
.update(name: new_name, tainted_at: Time.now)
.first
Vm.call(vm_hash) unless vm_hash.nil?
At a certain stage in a virtual machine's lifecycle, it may become tainted with user data. From the moment we start the hypervisor for the first time, it's uncertain when the user logged in and began processing. After this, the virtual machine is deemed tainted if it wasn't created for a VM pool. Virtual machines from the pool are considered tainted once they are selected from the pool. Customers cannot access them until they are chosen from the pool. If the virtual machine hasn't been marked as tainted before its destroy, we mark it as tainted during the destroy process. When a virtual machine is considered tainted, it theoretically contains irreplaceable and private data. We can develop various logics based on the taint value. For instance, we discussed some 'recreate' logics for virtual machines. We can easily recreate untainted virtual machines, but we need to exercise caution with tainted ones. Furthermore, the taint value can be utilized in our pool logic. We can select only untainted virtual machines from the pool.
We don't disassociate the virtual machines from the pool when picked because we aim to maintain a fixed size of virtual machines provisioned by the pool. We join the VM dataset with the runner table to identify idle virtual machines, a practice that previously led to an incident. This incident occurred when I modified the runner to not wait for the destruction of the VM, resulting in virtual machines being selected by two different runners. In addition, we lock all virtual machines in the pool while picking one, it inadvertently locks the picked virtual machine that's in use by a runner, leading to deadlocks. We shouldn't lock the virtual machines that have already been picked. We can use the tainted value to pick an idle virtual machine from the pool. We can also use the updated row count to manage concurrent virtual machine selections by multiple threads. This approach allows us to avoid locking the virtual machines table entirely.
The 'display_state' function is used to show the current state of the virtual machines to the customer, and it's dynamically calculated in certain cases. The 'display' prefix implies that we shouldn't use it into our backend logics. 'provisioned_at' serves as a more reliable indicator to determine whether the virtual machine has been provisioned.
.from(:vm, :candidate) | ||
.where(Sequel[:vm][:id] => Sequel[:candidate][:id]) | ||
.update(name: new_name, tainted_at: Time.now) | ||
.first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I am having difficulty to visualize what kind of SQL query this would generate when returning
, with
, update
and first
are combined, hence I'm not confident on its correctness. It feels like if we plainly write the SQL query here, it would be easier to understand and reason about it.
BTW, do we have a convention about whether we prefer Sequel or SQL for non-trivial stuff? Personally I use both without much thinking, so I don't have a strong opinion.
Add migration file to add tainted_at column to the vm
Mark virtual machines as tainted
At a certain stage in a virtual machine's lifecycle, it may become tainted with user data.
From the moment we start the hypervisor for the first time, it's uncertain when the user logged in and began processing. After this, the virtual machine is deemed tainted if it wasn't created for a VM pool. Virtual machines from the pool are considered tainted once they are selected from the pool. Customers cannot access them until they are chosen from the pool. If the virtual machine hasn't been marked as tainted before its destroy, we mark it as tainted during the destroy process.
When a virtual machine is considered tainted, it theoretically contains irreplaceable and private data.
We can develop various logics based on the taint value.
For instance, we discussed some 'recreate' logics for virtual machines. We can easily recreate untainted virtual machines, but we need to exercise caution with tainted ones.
Furthermore, the taint value can be utilized in our pool logic. We can select only untainted virtual machines from the pool.
Use tainted value to pick a vm from the pool
We don't disassociate the virtual machines from the pool when picked because we aim to maintain a fixed size of virtual machines provisioned by the pool. We join the VM dataset with the runner table to identify idle virtual machines, a practice that previously led to an incident. This incident occurred when I modified the runner to not wait for the destruction of the VM, resulting in virtual machines being selected by two different runners.
In addition, we lock all virtual machines in the pool while picking one, it inadvertently locks the picked virtual machine that's in use by a runner, leading to deadlocks. We shouldn't lock the virtual machines that have already been picked.
We can use the tainted value to pick an idle virtual machine from the pool. We can also use the updated row count to manage concurrent virtual machine selections by multiple threads. This approach allows us to avoid locking the virtual machines table entirely.
Use provisioned at instead of display state to pick a vm from the pool
The 'display_state' function is used to show the current state of the virtual machines to the customer, and it's dynamically calculated in certain cases. The 'display' prefix implies that we shouldn't use it into our backend logics. 'provisioned_at' serves as a more reliable indicator to determine whether the virtual machine has been provisioned.