Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(stepfunctions): Distributed Map #23306

Closed
wants to merge 3 commits into from

Conversation

ayush987goyal
Copy link
Contributor

feat(stepfunctions): Distributed Map

Add native support for distributed map.

TODO: Write better documentation of the new features.

Refs:

  1. https://docs.aws.amazon.com/step-functions/latest/dg/concepts-asl-use-map-state-distributed.html
  2. https://docs.aws.amazon.com/step-functions/latest/dg/input-output-fields-dist-map.html

closes #23216


All Submissions:

Adding new Construct Runtime Dependencies:

  • This PR adds new construct runtime dependencies following the process described here

New Features

  • Have you added the new feature to an integration test?
    • Did you use yarn integ to deploy the infrastructure and generate the snapshot (i.e. yarn integ without --dry-run)?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@gitpod-io
Copy link

gitpod-io bot commented Dec 10, 2022

@aws-cdk-automation aws-cdk-automation requested a review from a team December 10, 2022 15:15
@github-actions github-actions bot added star-contributor [Pilot] contributed between 25-49 PRs to the CDK effort/medium Medium work item – several days of effort feature-request A feature should be added or improved. p2 labels Dec 10, 2022
@aws-cdk-automation
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: 0caf146
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@patrickwerz
Copy link

This would be helpful to have :)

@Jacco
Copy link
Contributor

Jacco commented Dec 15, 2022

Would it be good to have the bucket in S3ReaderProps as an IBucket in stead of a string?

Also then it could be arranged the resource policy of the S3 bucket is amended properly?


Step Functions provides a high-concurrency mode for the Map state known as Distributed mode. In this mode, the Map state can accept input from large-scale Amazon S3 data sources. For example, your input can be a JSON or CSV file stored in an Amazon S3 bucket, or a JSON array passed from a previous step in the workflow.

A `Map` state with `mode` set to Distributed is known as a Distributed Map state. To create a Distributed Map state, set the `mode` property to `MapProcessorMode.DISTRIBUTED`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the benefit of this over let's say a new DistributedMap state?

@@ -500,6 +485,27 @@ export class StateMachine extends StateMachineBase {
enabled: true,
};
}

private addDistributedMapStatePolicies(stateMachineName?: string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making this optional and using the wildcard * later on seems like it's not required?
This is a private method and we only ever call it with the argument.

/**
* Distributed Map State Resources
*/
export enum DistributedMapResource {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this provides a good user experience. What are these values for? What do they mean?
I could only find getObject in the docs you've linked.

Preferably we would refactor this to represent the semantic meaning instead of leaking the technical implementation. At least we should provide documentation explaining each of these.

Comment on lines +136 to +148
/**
* The location of the header in the CSV file
*
* @default CSVHeaderLocation.FIRST_ROW
*/
readonly headerLocation?: CSVHeaderLocation;

/**
* The headers of the CSV file
*
* @default - no headers
*/
readonly headers?: string[];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering about UX again. Maybe an enum like class would be better:

new S3CSVReader({
  headers: CSVHeaders.useFirstRow(),
});
new S3CSVReader({
  headers: CSVHeaders.use(['one', 'two', 'three']),
});

/**
* The S3 bucket
*/
readonly bucket: string;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reasons not to use IBucket?

Comment on lines +339 to +358
private validateProps(props: MapProps) {
if (props.itemSelector && props.parameters) {
throw new Error('Only one of itemSelector or parameters can be provided');
}
if (props.mode === MapProcessorMode.INLINE && props.distributatedMapOptions) {
throw new Error('distributatedMapOptions can only be used with DISTRIBUTED mode');
}
if (props.distributatedMapOptions?.label && props.distributatedMapOptions?.label.length > 40) {
throw new Error('Label can only be 40 characters long');
}
if (props.distributatedMapOptions?.itemBatcher) {
const itemBatcher = props.distributatedMapOptions?.itemBatcher;
if (itemBatcher.maxItemsPerBatch && itemBatcher.maxItemsPerBatchPath) {
throw new Error('Only one of maxItemsPerBatch or maxItemsPerBatchPath can be provided');
}
if (itemBatcher.maxInputBytesPerBatch && itemBatcher.maxInputBytesPerBatchPath) {
throw new Error('Only one of maxInputBytesPerBatch or maxInputBytesPerBatchPath can be provided');
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This long list of validations is an indication that we didn't get the UX right. There are so many ways a user can configure this wrong. We need to improve this by design.

@@ -59,9 +166,17 @@ export interface MapProps {
* The JSON that you want to override your default iteration input
*
* @default $
* @deprecated use `itemSelector` instead
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? We need to also provide an explanation for deprecations. If not for the users, than for the people working on this code base.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is deprecated by StepFunctions service: https://docs.aws.amazon.com/step-functions/latest/dg/input-output-itemselector.html

Would highly recommend going through the complete AWS documentation for the feature to understand how these things work together.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. This explanation should go into the docstring ;)

Copy link
Contributor

@mrgrain mrgrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, but we need to iterate over the design. Or as a first start if you could share what your design thinking was.

@ayush987goyal
Copy link
Contributor Author

@mrgrain Thanks for the comments and I do agree we could probably make it a bit simpler in some aspects. My overall thought around the design was:

  1. To make it as smooth a transition from existing Map => DistributedMap so that it doesn't feel like a lot of overhead (StepFunction's docs also do not explicitly differentiate between these two maps, they only provide a opt-in flag for the same)
  2. To have the API similar to the documentation so that people reading the docs should understand how those bits related to the API in CDK. This has been one of the principals of writing these L2 constructs but of-course these are not set to stone.

Feel free to provide any other comments/thoughts and we can take it from there.

@rishisubramanian
Copy link

Does this support the tolerated failure threshold parameters? I don't see them in the diff.

https://docs.aws.amazon.com/step-functions/latest/dg/maprun-fail-threshold.html

@aws-cdk-automation
Copy link
Collaborator

This PR has been in the CHANGES REQUESTED state for 3 weeks, and looks abandoned. To keep this PR from being closed, please continue work on it. If not, it will automatically be closed in a week.

@ayush987goyal
Copy link
Contributor Author

@mrgrain I have marked as re-requested for review since I want to have someone form CDK team review it as well once so that I do any major refactoring only once and I do not want this PR to be closed by the bot.

@mrgrain
Copy link
Contributor

mrgrain commented Jan 9, 2023

@mrgrain I have marked as re-requested for review since I want to have someone form CDK team review it as well once so that I do any major refactoring only once and I do not want this PR to be closed by the bot.

Hi @ayush987goyal I'm not sure I understand what you mean by this, I am from the CDK team and I kindly ask you to address my specific comments (either by code change or discussion) to move this forward.

If you are concerned (which is understandable) about multiple rounds of refactoring, maybe our RFC process would be more suitable for you. I'd say this feature is just on the edge of needing a RFC or just being a PR.

Finally, you could also publish a custom construct independently of the core package. This way you can gauge interest and confirm the API before graduating it to core.

@aws-cdk-automation
Copy link
Collaborator

This PR has been deemed to be abandoned, and will be automatically closed. Please create a new PR for these changes if you think this decision has been made in error.

@aws-cdk-automation aws-cdk-automation added the closed-for-staleness This issue was automatically closed because it hadn't received any attention in a while. label Jan 15, 2023
@jacob-kleidman
Copy link

Hi @ayush987goyal; has this work been genuinely abandoned?

I really hope not 😟. I think this Distributed Map is critical to be inside the main CDK repo. I've been waiting on this for a while now

@MorelSerge
Copy link

Also very much looking forward to using this out of the box!

@patrickwerz
Copy link

ayush987goyal, could you create a cdk-constructs package of this? I would be super thankful to use your patch from there.

@beck3905
Copy link

I created a new PR for this feature linked above. I tried to take some of the feedback from this PR into account. Thanks @ayush987goyal for some inspiration and concepts.

@brian-moore
Copy link

For anyone else looking for the PR @beck3905 mentioned above

#24331

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
closed-for-staleness This issue was automatically closed because it hadn't received any attention in a while. effort/medium Medium work item – several days of effort feature-request A feature should be added or improved. p2 star-contributor [Pilot] contributed between 25-49 PRs to the CDK
Projects
None yet
Development

Successfully merging this pull request may close these issues.

(stepfunctions): Add native support for DistributedMap
10 participants