
[aws-eks] Upgrading to v1.20.0 #5544

Closed

eladb opened this issue Dec 24, 2019 · 4 comments

Labels: @aws-cdk/aws-eks (Related to Amazon Elastic Kubernetes Service), management/tracking (Issues that track a subject or multiple issues)

Comments

eladb (Contributor) commented Dec 24, 2019

As described in #5540, version 1.20.0 of the experimental @aws-cdk/aws-eks module includes a new implementation of the resource providers behind `Cluster` and `KubernetesResource`, in order to address several stability issues.

This change requires replacement of your existing EKS clusters, and since this module is experimental, we decided to introduce these breaking changes without backwards compatibility. To alleviate the pain, we will publish the previous version of this module under @aws-cdk/aws-eks-legacy until March 1st, 2020. The legacy module can be used as a drop-in replacement while you plan this migration.

We are aware that this can be disruptive, especially if your EKS cluster runs production workloads, but since the EKS module is still experimental, we are unable to invest the resources needed to offer a clean migration process in such cases. We are committed not to introduce breaking changes to stable modules.

If you try to update a stack that contains an existing EKS cluster to this new version, you will get an error saying that the service token of a custom resource cannot be changed.

Unfortunately, this means that you will have to destroy and recreate your cluster in order to use the new aws-eks library. We understand that in production systems this requires intentional planning.

To allow you to migrate at your own pace, we have published the old version under @aws-cdk/aws-eks-legacy. If you replace @aws-cdk/aws-eks with @aws-cdk/aws-eks-legacy, your stacks will stay unchanged, as well as your cluster.
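
As a sketch of what the swap looks like in a TypeScript CDK app (the stack and cluster names here are hypothetical), only the import and the corresponding package.json dependency change:

```ts
import * as cdk from '@aws-cdk/core';
// Drop-in swap: '@aws-cdk/aws-eks' -> '@aws-cdk/aws-eks-legacy'.
// The construct API is the same, so the synthesized stack (and the
// cluster itself) stays unchanged.
import * as eks from '@aws-cdk/aws-eks-legacy';

class MyEksStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);
    new eks.Cluster(this, 'MyCluster'); // hypothetical cluster id
  }
}
```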

When you are ready to recreate your cluster, the safest option is to follow these steps (a sketch follows the list):

  1. Delete the code that defines the EKS cluster from your CDK app
  2. Deploy an update, and wait for your cluster to be destroyed
  3. Take a dependency on @aws-cdk/aws-eks@1.20.0 (or above)
  4. Re-add your cluster definition to your CDK app
  5. Deploy.
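
Continuing the hypothetical stack above, a sketch of steps 1 and 4 (the interim state from step 1 is shown as a comment):

```ts
class MyEksStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);
    // Step 1: delete (or comment out) the cluster definition, then deploy
    // and wait for the old cluster to be destroyed (step 2):
    //   new eks.Cluster(this, 'MyCluster');

    // Step 4: after bumping the dependency to @aws-cdk/aws-eks@1.20.0
    // (step 3), re-add the definition and deploy again (step 5):
    new eks.Cluster(this, 'MyCluster');
  }
}
```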

Alternatively, you can modify the logical ID of your cluster resource, so CloudFormation will treat it as a new cluster and delete the old one. Bear in mind that this technique cannot be used if your cluster uses a physical name.
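
For example, a minimal sketch of the logical-ID technique, assuming the cluster's underlying CloudFormation resource is reachable as its default child (the exact construct tree in this version of the module, and the new ID, are assumptions for illustration):

```ts
import * as cdk from '@aws-cdk/core';
import * as eks from '@aws-cdk/aws-eks';

class MyEksStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);
    const cluster = new eks.Cluster(this, 'MyCluster');

    // Renaming the logical ID makes CloudFormation see a brand-new resource:
    // on the next deploy it creates the new cluster and deletes the old one.
    const cfnCluster = cluster.node.defaultChild as cdk.CfnResource;
    cfnCluster.overrideLogicalId('MyClusterV2');
  }
}
```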

@eladb eladb added the needs-triage This issue or PR still needs to be triaged. label Dec 24, 2019
@eladb eladb changed the title AWS EKS: upgrade to v1.20.0 AWS EKS: Upgrading to v1.20.0 Dec 24, 2019
eladb pushed a commit that referenced this issue Dec 30, 2019
There were two causes of timeouts for EKS cluster creation: a creation time longer than the AWS Lambda timeout (15 minutes), and a lack of retries when applying kubectl manifests after the cluster has been created.

The change fixes the first issue by leveraging the custom resource provider framework to implement the cluster resource as an asynchronous resource. The custom resource providers are now bundled as nested stacks so they don't take up too many resources in users' stacks, and they are reused by multiple clusters within the same stack. This required that the creation role not be the same as the Lambda role, so we define this role separately and assume it within the providers.

The second issue is fixed by adding 3 retries to `kubectl apply`.

**Backwards compatibility**: as described in #5544, since the resource provider handler of `Cluster` and `KubernetesResource` has changed, this change requires replacement of existing clusters (deployment fails with a "service token cannot be changed" error). Since this can be disruptive to users, this change includes an exact copy of the previous version under a new module called `@aws-cdk/aws-eks-legacy`, which can be used as a drop-in replacement until users decide to upgrade to the new version. Using the legacy cluster will emit a synthesis warning that the module will no longer be released as part of the CDK starting March 1st, 2020.

- Fixes #4087
- Fixes #4695
- Fixes #5259
- Fixes #5501

---

BREAKING CHANGE: (in experimental module) the providers behind the AWS EKS module have been rewritten to address multiple stability issues. Since this change requires cluster replacement, the old version of this module is available under `@aws-cdk/aws-eks-legacy`. Please read #5544 carefully for upgrade instructions.
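
As an illustration of the retry fix described in this commit (not the actual provider code; the command, attempt count, and error handling are simplified assumptions):

```ts
import { execSync } from 'child_process';

// Retry `kubectl apply` a fixed number of times to ride out transient
// failures that can occur right after the cluster has been created.
function kubectlApplyWithRetries(manifestFile: string, attempts = 3): void {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      execSync(`kubectl apply -f ${manifestFile}`, { stdio: 'inherit' });
      return; // applied successfully
    } catch (err) {
      if (attempt === attempts) {
        throw err; // out of retries; surface the failure
      }
      console.warn(`kubectl apply failed (attempt ${attempt}/${attempts}), retrying...`);
    }
  }
}
```
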
mergify bot added a commit that referenced this issue Dec 30, 2019 (same commit message as above; co-authored by mergify[bot]).
@SomayaB added the @aws-cdk/aws-eks (Related to Amazon Elastic Kubernetes Service) and management/tracking (Issues that track a subject or multiple issues) labels, and removed the feature-request and needs-triage labels, Dec 30, 2019
@eladb eladb pinned this issue Jan 13, 2020
@MatteoJoliveau

Hi @eladb. I understand the reasoning, and I understand that this package was marked experimental from the start. The problem is that this kind of upgrade path is really not ideal; we have invested a lot in CDK, and now we fear that this kind of solution could be repeated in the future.

Can the CDK team guarantee, or at least commit to trying to ensure, that this kind of solution will not become the norm for your packages? This is also a clear violation of semantic versioning: a minor version upgrade should not introduce breaking changes, especially one as huge as this. I suggest that you either change the way packages are versioned to reflect their changes, or respect the major/minor semantics. Otherwise, as a customer, we really cannot trust this project and will have to migrate away to avoid potentially losing money and time rebuilding our infrastructure each time there is a change in tooling.

I'm sorry if I sound harsh or angry; I'm really not, but this update has scared us a lot, and management is starting to question our technical choices, which as you can imagine puts me in a really difficult position.

Thank you for your understanding

eladb (Contributor, Author) commented Jan 15, 2020

@MatteoJoliveau thanks for your feedback.

We absolutely commit that modules marked "stable" will not be broken in minor versions and that such migrations will not be required, but unfortunately we can't make this commitment for "experimental" modules like EKS.

Since the entire framework uses a single version line (for a myriad of reasons), we are unable to conform to semantic versioning for modules that are still unstable. This is actually not an uncommon practice in this space: Node.js uses the same approach, where experimental modules in the Node.js API are not bound to semantic versioning.

I believe this type of breakage is not going to be common, and we tried hard to make it possible for you to avoid the breakage by using the @aws-cdk/aws-eks-legacy module until you are ready to make the switch. Had this module already been marked "stable", this would not have been our approach, and we would have needed to provide better tools for you to migrate from your existing cluster setup.

It's a nasty tradeoff between progress and stability, one I am sure you are familiar with from your own work. For example, if EKS had already been marked "stable", it would have been much harder to implement a robust fix for the issues this change addresses without breaking existing clusters.

We understand this could be very painful and apologize if this caused grief with your team.

@MatteoJoliveau

Thank you @eladb for your reply. I understand it is not an easy task to maintain such a large and complex ecosystem of packages. We'll chart a plan to upgrade our clusters somehow, and will be more cautious with experimental packages in the future. Being reassured that stable package upgrades are handled more carefully is more than enough for us.

@cseickel

Have you considered moving "experimental" constructs out of the main library and into a separate package? That would serve three purposes:

  1. Make it absolutely clear that these are not final versions.
  2. You could then have independent upgrade paths.
  3. You can utilize semantic versioning on those experimental features.

As for #1: just because there is a label in the documentation does not mean that people expect large breaking changes on a point release. We tend to think of the entire library as either GA or beta, not a little bit of both. Having a separate library makes it crystal clear.

Although it may add more complication for you in keeping track of dependencies, #2 would benefit customers by letting us take advantage of improvements to the core library without having to deal with a possible breaking change in an experimental library. This could also work the other way around, where an experimental library can iterate faster than the stable core.

I think the advantage of #3 is obvious: it would result in happy consumers of your API. Most importantly, it would give us more confidence and trust in you as providers of a core technology.

All of this is meant as constructive advice to help you build a better project. I love the product and I just want it to be as good as it can be.

@eladb eladb closed this as completed Jan 23, 2020
@eladb eladb unpinned this issue Feb 26, 2020
@iliapolo iliapolo changed the title AWS EKS: Upgrading to v1.20.0 [aws-eks] Upgrading to v1.20.0 Aug 16, 2020
eladb pushed a commit to cdklabs/decdk that referenced this issue Jan 18, 2022
(Same commit message as above; co-authored by mergify[bot].)