Mark virtual machines as tainted and pick untainted machines from the vm pool #1583

enescakir · 2024-05-12T15:17:39Z

Add migration file to add tainted_at column to the vm

Mark virtual machines as tainted

At a certain stage in a virtual machine's lifecycle, it may become tainted with user data.

From the moment we start the hypervisor for the first time, it's uncertain when the user logged in and began processing. After this, the virtual machine is deemed tainted if it wasn't created for a VM pool. Virtual machines from the pool are considered tainted once they are selected from the pool. Customers cannot access them until they are chosen from the pool. If the virtual machine hasn't been marked as tainted before its destroy, we mark it as tainted during the destroy process.

When a virtual machine is considered tainted, it theoretically contains irreplaceable and private data.

We can develop various logics based on the taint value.

For instance, we discussed some 'recreate' logics for virtual machines. We can easily recreate untainted virtual machines, but we need to exercise caution with tainted ones.

Furthermore, the taint value can be utilized in our pool logic. We can select only untainted virtual machines from the pool.

Use tainted value to pick a vm from the pool

We don't disassociate the virtual machines from the pool when picked because we aim to maintain a fixed size of virtual machines provisioned by the pool. We join the VM dataset with the runner table to identify idle virtual machines, a practice that previously led to an incident. This incident occurred when I modified the runner to not wait for the destruction of the VM, resulting in virtual machines being selected by two different runners.

In addition, we lock all virtual machines in the pool while picking one, it inadvertently locks the picked virtual machine that's in use by a runner, leading to deadlocks. We shouldn't lock the virtual machines that have already been picked.

We can use the tainted value to pick an idle virtual machine from the pool. We can also use the updated row count to manage concurrent virtual machine selections by multiple threads. This approach allows us to avoid locking the virtual machines table entirely.

Use provisioned at instead of display state to pick a vm from the pool

The 'display_state' function is used to show the current state of the virtual machines to the customer, and it's dynamically calculated in certain cases. The 'display' prefix implies that we shouldn't use it into our backend logics. 'provisioned_at' serves as a more reliable indicator to determine whether the virtual machine has been provisioned.

bsatzger · 2024-05-13T10:30:54Z

migrate/20240510_add_tainted_at_vm.rb

+Sequel.migration do
+  change do
+    alter_table(:vm) do
+      add_column :tainted_at, :timestamptz


tainted typically means something became bad/unusable. And maybe also that it should be destroyed. Also, I'm wondering if we should distinguish between the time we take a vm from a pool and the time it becomes available to a customer. Potentially something like this:

allocated_from_pool_at: the time we took the vm from the pool
supplied_at: the time it became available to the user

I borrowed the term 'tainted' from @fdr, who suggested to mark virtual machines. We can use more positive work. tainted might have little negative impact.

The time we take a vm from a pool and the time it becomes available to a customer have distinct implications for different resources. For the standard VM product, these two times are same. However, for the runner product, we use the ready_at column to differentiate the latter. Thus, the final product should include a column to distinguish this. If we consider the runner service as a customer for the standard VM service, we have a time when the VM becomes available to the runner (which is the same as when it is picked from the pool), and when the runner becomes available to the end customer.

yeah, fair, taint is often with some kind of badness, though per computer science tradition badness can be pretty abstract, e.g. in garbage collectors. In our case, the badness is increasing restriction on how we can manipulate the VM, as it could plausibly have data that we absolutely must not leak between customers, and not delete in a casual manner. In this respect it has some similarity how the jargon is used in garbage collectors (or even virtual memory, "dirty pages" is a related concept), where objects are often "tainted" as somehow being more complex to collect.

We could keep this wording in that tradition or have a new one.

As an aside, something I did previously was the inverse, and not at all for wording reasons, which was more like a "purity" (or, to get outside that nomenclature, a "flexibility") flag that objects started with and then was removed the moment it was plausible a customer might have put their irreplaceable data in it.

This took the form of a can_recreate semaphore that relevant objects started with and would be decr'd at some point during their provisioning process, when it was plausible irreplaceable user data might have entered the system. If recreate was incremented and can_recreate was still present, then the object could be re-provisioned. This "double-trigger" style provided very precise results in not permitting mistakes, and being concurrency safe. I'm not saying it's necessarily superior to what this might do, it's more just for reflection by the author.

I suppose the nice thing about that system is that the overwhelmingly common case results in can_recreate disappearing, i.e. there is zero bloat in the common case, provided one things that knowing the far-distant tainted_at time is extraneous...which it may not be

thinking it over, how many times would we reference this symbol? maybe three, no more than six times? it is okay to avoid a jargon at expense of length. has_customer_data or something like that might work.

I'm personally ok if you decide to go with "tainted". Just wanted to bring this up.

IMO, both are fine, but I'd prefer something similar to has_customer_data, which requires less thinking about its function if you don't have the context.

furkansahin

Overall I like the approach but maybe we can make it better?

model/vm_pool.rb

fdr · 2024-05-16T00:22:54Z

prog/vm/nexus.rb

@@ -176,6 +176,7 @@ def clear_stack_storage_volumes
  def before_run
    when_destroy_set? do
      if strand.label != "destroy"
+        vm.update(tainted_at: Time.now) unless vm.tainted_at


I suppose one advantage of a "purity flag removal" idea is that the code to remove it is branchless, i.e. multiple unguarded updates do not mislead.

But a way to mitigate this would be to put the branch into a special method.

I may refactor it to vm.this.where(tainted_at: nil).update(tainted_at: Time.now) to make it branchless, if you believe this is not an anti-pattern

prog/vm/nexus.rb

fdr · 2024-05-16T04:00:04Z

model/vm_pool.rb

-      # first lock the whole pool in the join table so that no other thread can
-      # pick a vm from this pool
-      vms_dataset.for_update.all
-      pick_vm_id_q = vms_dataset.left_join(:github_runner, vm_id: :id)
-        .where(Sequel[:github_runner][:vm_id] => nil, Sequel[:vm][:display_state] => "running")
-        .select(Sequel[:vm][:id])
-      Vm.where(id: pick_vm_id_q).first
-    end
+  def pick_vm(new_name)
+    vm = idle_vms_dataset.order(Sequel.lit("RANDOM()")).first
+    return if vm.nil?
+
+    # If we can successfully update the VM, we can be certain that no other
+    # thread has selected it. If the update fails, we continue to pick different
+    # VMs until we find an idle one or exhaust the pool of runners.
+    return pick_vm(new_name) if idle_vms_dataset.where(id: vm.id).update(name: new_name, tainted_at: Time.now).zero?
+
+    vm


Something about this code seems a little weird, it's odd to:

Run a query that qualifies a tuple

Run it again, but this time with update, to get the locking/atomicity you need

Particularly when there's not much going on between steps (1) and (2).

How about update with returning? You can use Model#call to make model objects with any hash.

The challenge is updating a single idle vm without an ID for filtering in the update query. Using limit with update results in Sequel::InvalidOperation: Dataset#update not supported on datasets with limits or offsets.
A potential solution is to incorporate the first query as a subquery in the second one.

An example:

p.idle_vms_dataset .where(id: p.idle_vms_dataset.order(Sequel.lit("RANDOM()")).limit(1).select(:id)) .update(name: "new-name", tainted_at: Time.now)

you could also use a query with a CTE, in SQL it'd look like:

with candidate as ( select id from vm where ... limit 1 ) update vm set ... from candidate where vm.id = candidate.id returning ...

I refactored it into a CTE version with @bsatzger's help.

vm_hash = DB[:vm] .returning .with(:candidate, idle_vms_dataset.limit(1)) .from(:vm, :candidate) .where(Sequel[:vm][:id] => Sequel[:candidate][:id]) .update(name: new_name, tainted_at: Time.now) .first Vm.call(vm_hash) unless vm_hash.nil?

model/vm_pool.rb

At a certain stage in a virtual machine's lifecycle, it may become tainted with user data. From the moment we start the hypervisor for the first time, it's uncertain when the user logged in and began processing. After this, the virtual machine is deemed tainted if it wasn't created for a VM pool. Virtual machines from the pool are considered tainted once they are selected from the pool. Customers cannot access them until they are chosen from the pool. If the virtual machine hasn't been marked as tainted before its destroy, we mark it as tainted during the destroy process. When a virtual machine is considered tainted, it theoretically contains irreplaceable and private data. We can develop various logics based on the taint value. For instance, we discussed some 'recreate' logics for virtual machines. We can easily recreate untainted virtual machines, but we need to exercise caution with tainted ones. Furthermore, the taint value can be utilized in our pool logic. We can select only untainted virtual machines from the pool.

We don't disassociate the virtual machines from the pool when picked because we aim to maintain a fixed size of virtual machines provisioned by the pool. We join the VM dataset with the runner table to identify idle virtual machines, a practice that previously led to an incident. This incident occurred when I modified the runner to not wait for the destruction of the VM, resulting in virtual machines being selected by two different runners. In addition, we lock all virtual machines in the pool while picking one, it inadvertently locks the picked virtual machine that's in use by a runner, leading to deadlocks. We shouldn't lock the virtual machines that have already been picked. We can use the tainted value to pick an idle virtual machine from the pool. We can also use the updated row count to manage concurrent virtual machine selections by multiple threads. This approach allows us to avoid locking the virtual machines table entirely.

The 'display_state' function is used to show the current state of the virtual machines to the customer, and it's dynamically calculated in certain cases. The 'display' prefix implies that we shouldn't use it into our backend logics. 'provisioned_at' serves as a more reliable indicator to determine whether the virtual machine has been provisioned.

byucesoy · 2024-05-22T23:30:59Z

model/vm_pool.rb

+      .from(:vm, :candidate)
+      .where(Sequel[:vm][:id] => Sequel[:candidate][:id])
+      .update(name: new_name, tainted_at: Time.now)
+      .first


Personally, I am having difficulty to visualize what kind of SQL query this would generate when returning, with, update and first are combined, hence I'm not confident on its correctness. It feels like if we plainly write the SQL query here, it would be easier to understand and reason about it.

BTW, do we have a convention about whether we prefer Sequel or SQL for non-trivial stuff? Personally I use both without much thinking, so I don't have a strong opinion.

enescakir requested review from fdr, byucesoy, furkansahin and a team May 12, 2024 15:17

enescakir self-assigned this May 12, 2024

bsatzger reviewed May 13, 2024

View reviewed changes

furkansahin reviewed May 13, 2024

View reviewed changes

model/vm_pool.rb Outdated Show resolved Hide resolved

enescakir force-pushed the pool-lock-1 branch 2 times, most recently from 57c0bc2 to 8b6c2eb Compare May 14, 2024 15:07

enescakir requested review from furkansahin and bsatzger May 14, 2024 15:11

fdr reviewed May 16, 2024

View reviewed changes

prog/vm/nexus.rb Outdated Show resolved Hide resolved

fdr reviewed May 16, 2024

View reviewed changes

model/vm_pool.rb Show resolved Hide resolved

enescakir added 4 commits May 20, 2024 21:47

Add migration file to add tainted_at column to the vm

9c1ac27

enescakir force-pushed the pool-lock-1 branch from 8b6c2eb to 55a85d1 Compare May 20, 2024 20:14

bsatzger approved these changes May 22, 2024

View reviewed changes

byucesoy reviewed May 22, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mark virtual machines as tainted and pick untainted machines from the vm pool #1583

Mark virtual machines as tainted and pick untainted machines from the vm pool #1583

enescakir commented May 12, 2024 •

edited

bsatzger May 13, 2024

enescakir May 14, 2024

fdr May 15, 2024

fdr May 16, 2024

bsatzger May 22, 2024

byucesoy May 22, 2024

furkansahin left a comment

fdr May 16, 2024

enescakir May 20, 2024

fdr May 16, 2024 •

edited

enescakir May 16, 2024

bsatzger May 19, 2024

enescakir May 20, 2024

byucesoy May 22, 2024

Mark virtual machines as tainted and pick untainted machines from the vm pool #1583

Are you sure you want to change the base?

Mark virtual machines as tainted and pick untainted machines from the vm pool #1583

Conversation

enescakir commented May 12, 2024 • edited

Add migration file to add tainted_at column to the vm