Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mark virtual machines as tainted and pick untainted machines from the vm pool #1583

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

enescakir
Copy link
Member

@enescakir enescakir commented May 12, 2024

Add migration file to add tainted_at column to the vm

Mark virtual machines as tainted

At a certain stage in a virtual machine's lifecycle, it may become tainted with user data.

From the moment we start the hypervisor for the first time, it's uncertain when the user logged in and began processing. After this, the virtual machine is deemed tainted if it wasn't created for a VM pool. Virtual machines from the pool are considered tainted once they are selected from the pool. Customers cannot access them until they are chosen from the pool. If the virtual machine hasn't been marked as tainted before its destroy, we mark it as tainted during the destroy process.

When a virtual machine is considered tainted, it theoretically contains irreplaceable and private data.

We can develop various logics based on the taint value.

For instance, we discussed some 'recreate' logics for virtual machines. We can easily recreate untainted virtual machines, but we need to exercise caution with tainted ones.

Furthermore, the taint value can be utilized in our pool logic. We can select only untainted virtual machines from the pool.

Use tainted value to pick a vm from the pool

We don't disassociate the virtual machines from the pool when picked because we aim to maintain a fixed size of virtual machines provisioned by the pool. We join the VM dataset with the runner table to identify idle virtual machines, a practice that previously led to an incident. This incident occurred when I modified the runner to not wait for the destruction of the VM, resulting in virtual machines being selected by two different runners.

In addition, we lock all virtual machines in the pool while picking one, it inadvertently locks the picked virtual machine that's in use by a runner, leading to deadlocks. We shouldn't lock the virtual machines that have already been picked.

We can use the tainted value to pick an idle virtual machine from the pool. We can also use the updated row count to manage concurrent virtual machine selections by multiple threads. This approach allows us to avoid locking the virtual machines table entirely.

Use provisioned at instead of display state to pick a vm from the pool

The 'display_state' function is used to show the current state of the virtual machines to the customer, and it's dynamically calculated in certain cases. The 'display' prefix implies that we shouldn't use it into our backend logics. 'provisioned_at' serves as a more reliable indicator to determine whether the virtual machine has been provisioned.

@enescakir enescakir requested review from fdr, byucesoy, furkansahin and a team May 12, 2024 15:17
@enescakir enescakir self-assigned this May 12, 2024
Sequel.migration do
change do
alter_table(:vm) do
add_column :tainted_at, :timestamptz
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tainted typically means something became bad/unusable. And maybe also that it should be destroyed. Also, I'm wondering if we should distinguish between the time we take a vm from a pool and the time it becomes available to a customer. Potentially something like this:

allocated_from_pool_at: the time we took the vm from the pool
supplied_at: the time it became available to the user

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I borrowed the term 'tainted' from @fdr, who suggested to mark virtual machines. We can use more positive work. tainted might have little negative impact.

The time we take a vm from a pool and the time it becomes available to a customer have distinct implications for different resources. For the standard VM product, these two times are same. However, for the runner product, we use the ready_at column to differentiate the latter. Thus, the final product should include a column to distinguish this. If we consider the runner service as a customer for the standard VM service, we have a time when the VM becomes available to the runner (which is the same as when it is picked from the pool), and when the runner becomes available to the end customer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, fair, taint is often with some kind of badness, though per computer science tradition badness can be pretty abstract, e.g. in garbage collectors. In our case, the badness is increasing restriction on how we can manipulate the VM, as it could plausibly have data that we absolutely must not leak between customers, and not delete in a casual manner. In this respect it has some similarity how the jargon is used in garbage collectors (or even virtual memory, "dirty pages" is a related concept), where objects are often "tainted" as somehow being more complex to collect.

We could keep this wording in that tradition or have a new one.

As an aside, something I did previously was the inverse, and not at all for wording reasons, which was more like a "purity" (or, to get outside that nomenclature, a "flexibility") flag that objects started with and then was removed the moment it was plausible a customer might have put their irreplaceable data in it.

This took the form of a can_recreate semaphore that relevant objects started with and would be decr'd at some point during their provisioning process, when it was plausible irreplaceable user data might have entered the system. If recreate was incremented and can_recreate was still present, then the object could be re-provisioned. This "double-trigger" style provided very precise results in not permitting mistakes, and being concurrency safe. I'm not saying it's necessarily superior to what this might do, it's more just for reflection by the author.

I suppose the nice thing about that system is that the overwhelmingly common case results in can_recreate disappearing, i.e. there is zero bloat in the common case, provided one things that knowing the far-distant tainted_at time is extraneous...which it may not be

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thinking it over, how many times would we reference this symbol? maybe three, no more than six times? it is okay to avoid a jargon at expense of length. has_customer_data or something like that might work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm personally ok if you decide to go with "tainted". Just wanted to bring this up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, both are fine, but I'd prefer something similar to has_customer_data, which requires less thinking about its function if you don't have the context.

Copy link
Contributor

@furkansahin furkansahin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I like the approach but maybe we can make it better?

model/vm_pool.rb Outdated Show resolved Hide resolved
@enescakir enescakir force-pushed the pool-lock-1 branch 2 times, most recently from 57c0bc2 to 8b6c2eb Compare May 14, 2024 15:07
prog/vm/nexus.rb Outdated
@@ -176,6 +176,7 @@ def clear_stack_storage_volumes
def before_run
when_destroy_set? do
if strand.label != "destroy"
vm.update(tainted_at: Time.now) unless vm.tainted_at
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose one advantage of a "purity flag removal" idea is that the code to remove it is branchless, i.e. multiple unguarded updates do not mislead.

But a way to mitigate this would be to put the branch into a special method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may refactor it to vm.this.where(tainted_at: nil).update(tainted_at: Time.now) to make it branchless, if you believe this is not an anti-pattern

prog/vm/nexus.rb Outdated Show resolved Hide resolved
model/vm_pool.rb Outdated
Comment on lines 16 to 24
# first lock the whole pool in the join table so that no other thread can
# pick a vm from this pool
vms_dataset.for_update.all
pick_vm_id_q = vms_dataset.left_join(:github_runner, vm_id: :id)
.where(Sequel[:github_runner][:vm_id] => nil, Sequel[:vm][:display_state] => "running")
.select(Sequel[:vm][:id])
Vm.where(id: pick_vm_id_q).first
end
def pick_vm(new_name)
vm = idle_vms_dataset.order(Sequel.lit("RANDOM()")).first
return if vm.nil?

# If we can successfully update the VM, we can be certain that no other
# thread has selected it. If the update fails, we continue to pick different
# VMs until we find an idle one or exhaust the pool of runners.
return pick_vm(new_name) if idle_vms_dataset.where(id: vm.id).update(name: new_name, tainted_at: Time.now).zero?

vm
Copy link
Collaborator

@fdr fdr May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something about this code seems a little weird, it's odd to:

  1. Run a query that qualifies a tuple
  2. Run it again, but this time with update, to get the locking/atomicity you need

Particularly when there's not much going on between steps (1) and (2).

How about update with returning? You can use Model#call to make model objects with any hash.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The challenge is updating a single idle vm without an ID for filtering in the update query. Using limit with update results in Sequel::InvalidOperation: Dataset#update not supported on datasets with limits or offsets.
A potential solution is to incorporate the first query as a subquery in the second one.

An example:

p.idle_vms_dataset
    .where(id: p.idle_vms_dataset.order(Sequel.lit("RANDOM()")).limit(1).select(:id))
    .update(name: "new-name", tainted_at: Time.now)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could also use a query with a CTE, in SQL it'd look like:

with candidate as (
  select id from vm where ... limit 1
)
update vm set ... from candidate where vm.id = candidate.id
returning ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored it into a CTE version with @bsatzger's help.

    vm_hash = DB[:vm]
      .returning
      .with(:candidate, idle_vms_dataset.limit(1))
      .from(:vm, :candidate)
      .where(Sequel[:vm][:id] => Sequel[:candidate][:id])
      .update(name: new_name, tainted_at: Time.now)
      .first

    Vm.call(vm_hash) unless vm_hash.nil?

model/vm_pool.rb Show resolved Hide resolved
At a certain stage in a virtual machine's lifecycle, it may become
tainted with user data.

From the moment we start the hypervisor for the first time, it's
uncertain when the user logged in and began processing. After this, the
virtual machine is deemed tainted if it wasn't created for a VM pool.
Virtual machines from the pool are considered tainted once they are
selected from the pool. Customers cannot access them until they are
chosen from the pool. If the virtual machine hasn't been marked as
tainted before its destroy, we mark it as tainted during the destroy
process.

When a virtual machine is considered tainted, it theoretically contains
irreplaceable and private data.

We can develop various logics based on the taint value.

For instance, we discussed some 'recreate' logics for virtual machines.
We can easily recreate untainted virtual machines, but we need to
exercise caution with tainted ones.

Furthermore, the taint value can be utilized in our pool logic. We can
select only untainted virtual machines from the pool.
We don't disassociate the virtual machines from the pool when picked
because we aim to maintain a fixed size of virtual machines provisioned
by the pool. We join the VM dataset with the runner table to identify
idle virtual machines, a practice that previously led to an incident.
This incident occurred when I modified the runner to not wait for the
destruction of the VM, resulting in virtual machines being selected by
two different runners.

In addition, we lock all virtual machines in the pool while picking one,
it inadvertently locks the picked virtual machine that's in use by a
runner, leading to deadlocks. We shouldn't lock the virtual machines
that have already been picked.

We can use the tainted value to pick an idle virtual machine from the
pool. We can also use the updated row count to manage concurrent virtual
machine selections by multiple threads. This approach allows us to avoid
locking the virtual machines table entirely.
The 'display_state' function is used to show the current state of the
virtual machines to the customer, and it's dynamically calculated in
certain cases. The 'display' prefix implies that we shouldn't use it
into our backend logics. 'provisioned_at' serves as a more reliable
indicator to determine whether the virtual machine has been provisioned.
.from(:vm, :candidate)
.where(Sequel[:vm][:id] => Sequel[:candidate][:id])
.update(name: new_name, tainted_at: Time.now)
.first
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I am having difficulty to visualize what kind of SQL query this would generate when returning, with, update and first are combined, hence I'm not confident on its correctness. It feels like if we plainly write the SQL query here, it would be easier to understand and reason about it.

BTW, do we have a convention about whether we prefer Sequel or SQL for non-trivial stuff? Personally I use both without much thinking, so I don't have a strong opinion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants