
v2.3.0 - S3FS datapath support

Released by @PicoCreator on 25 Jan 00:09 · 2db0b8d

What's Changed

  • Fixed a bug in DS3 for A100/H100 nodes
  • Added support for S3 datapath
  • Changed the lr scheduler to cosine (credit: @SmerkyG)
  • Changed the step calculation (this may affect your training scripts/templates)
  • Several bug fixes (credit: @SmerkyG)

Full Changelog: v2.2.1...v2.3.0

Example of S3 datapath config

data:
  # Skip the datapath setup
  #
  # Ignored by preload_datapath.py. Useful for speeding up trainer startup,
  # provided all your datasets have already been properly preinitialized
  # ---
  skip_datapath_setup: True
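
  # For example, to preinitialize the datasets ahead of time (a hedged sketch:
  # assuming preload_datapath.py takes the trainer config file as its argument):
  #
  #   python3 preload_datapath.py ./your-trainer-config.yaml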

  # data_path for the prebuilt dataset, loaded with HF `load_from_disk()`
  #
  # Use this if you have built your own dataset and saved it with `save_to_disk()`,
  # with source left as null. Otherwise, configure this as the directory in which
  # the dataset will be built and tokenized by the huggingface datasets process.
  #
  # If using a relative path, it should be relative to the trainer script path
  data_path: s3://bucket-name/subpath/

  # Data path storage options, used to support cloud storage
  # via the huggingface datasets API. See:
  # https://huggingface.co/docs/datasets/v2.16.1/en/filesystems#amazon-s3
  #
  # Note: As of Jan 2023, these options have only been tested with AWS S3 and Backblaze. YMMV
  #       For S3 bucket support you will also need to install s3fs: `python3 -m pip install s3fs`
  #
  # If you want to reduce the risk of accidental key/secret commits, you can use
  # the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables instead
  #
  # The data_path should use the `s3://bucket-name/subpath` format
  # ---
  data_path_storage_options:
     key: <example S3 key>
     secret: <example S3 secret>
     endpoint_url: <example S3 endpoint>
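
For reference, below is a minimal sketch of loading such a prebuilt dataset
directly with the huggingface datasets API, assuming it was saved with
`save_to_disk()` to the bucket above. The endpoint URL is illustrative, and the
key/secret are read from the environment variables mentioned in the config
comments; this mirrors the trainer's storage options rather than reproducing
the trainer's own code.

import os
from datasets import load_from_disk

# Same shape as `data_path_storage_options` above; reading the key/secret
# from the environment avoids committing them to your config file
storage_options = {
    "key": os.environ["AWS_ACCESS_KEY_ID"],
    "secret": os.environ["AWS_SECRET_ACCESS_KEY"],
    "endpoint_url": "https://s3.us-east-1.amazonaws.com",  # illustrative endpoint
}

# Requires s3fs for the s3:// protocol: `python3 -m pip install s3fs`
dataset = load_from_disk("s3://bucket-name/subpath/", storage_options=storage_options)
print(dataset)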