Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve fine-tuning checkpoint support #259

Merged
merged 11 commits into from
Mar 11, 2025
Merged

Conversation

artek0chumak
Copy link
Contributor

Issue #ENG-18828 , #ENG-18829

Describe your changes

Improve fine-tuning checkpoint support with two features:

  • Add new command to show all checkpoints for the jobs
  • Add new argument to start a job from the checkpoint

@artek0chumak artek0chumak marked this pull request as ready for review March 10, 2025 09:47
@mryab mryab requested review from mryab and removed request for VProv March 10, 2025 09:53
click.echo(f"Job {fine_tune_id} contains the following checkpoints:")
table = tabulate(display_list, headers="keys", tablefmt="grid")
click.echo(table)
click.echo("\nTo download a checkpoint, use cmd: together fine-tuning download")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
click.echo("\nTo download a checkpoint, use cmd: together fine-tuning download")
click.echo("\nTo download a checkpoint, use `together fine-tuning download`")

display_list.append(
{
"Type": checkpoint.type,
"Timestamp": checkpoint.timestamp,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to format the timestamps to strings only in this function. As you can see in your own example, truncating the seconds etc. can lead to a result where the order of checkpoints becomes incorrect: therefore, it's better to sort by the integer timestamp value. Furthermore, users of the Python CLI might want to apply other datetime formats to display the values in their code, so our current behavior is too destructive

get_event_step,
)

_FT_JOB_REGEX = r"^ft-[\dabcdef-]+:\d+$"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should be the behavior if the user passes a FT job ID without a colon? I believe this expression won't match, perhaps it's best to rename it to clarify that it only handles job IDs with step numbers?

if event_type == FinetuneEventType.CHECKPOINT_SAVE:
formatted_time = format_event_timestamp(event)
step = get_event_step(event)
checkpoint_name = f"{id}:{step}" if step else id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

step=0 will put the code in a surprise :)

Suggested change
checkpoint_name = f"{id}:{step}" if step else id
checkpoint_name = f"{id}:{step}" if step is not None else id

)

# Sort by timestamp (newest first)
checkpoints.sort(key=lambda x: x.timestamp, reverse=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For correct sorting, timestamps should not be strings at this point


checkpoints.append(
FinetuneCheckpoint(
type="Intermediate",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about listing the step here as well, like "Intermediate (step 20)"? It's not completely obvious how to interpret the table now, but if we add a bit of extra information, it will be clearer to the user what the names stand for

@@ -207,6 +286,9 @@ def create(
For datasets with the "messages" field (conversational format) or "prompt" and "completion" fields
(Instruction format), inputs will be masked.
Defaults to "auto".
from_checkpoint (str, optional): The checkpoint to be used in the fine-tuning.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from_checkpoint (str, optional): The checkpoint to be used in the fine-tuning.
from_checkpoint (str, optional): The checkpoint identifier to continue training from a previous fine-tuning job.

@@ -548,6 +654,9 @@ async def create(
For datasets with the "messages" field (conversational format) or "prompt" and "completion" fields
(Instruction format), inputs will be masked.
Defaults to "auto".
from_checkpoint (str, optional): The checkpoint to be used in the fine-tuning.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from_checkpoint (str, optional): The checkpoint to be used in the fine-tuning.
from_checkpoint (str, optional): The checkpoint identifier to continue training from a previous fine-tuning job.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
@artek0chumak artek0chumak force-pushed the artem/add-checkpoints branch from 6c41e61 to 05d85d3 Compare March 11, 2025 14:35
@artek0chumak artek0chumak merged commit 271af98 into main Mar 11, 2025
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants