
blob: update New[Range]Reader to take ReaderOptions #765

Closed
vangent opened this issue Nov 19, 2018 · 15 comments

@vangent
Contributor

vangent commented Nov 19, 2018

Currently blob exposes:

- NewRangeReader: takes offset, len and calls driver.NewRangeReader.
- NewReader: calls NewRangeReader with offset=0, len=-1 (read all).

The implementations of NewRangeReader generally pass through their arguments to the providers, as expected.

Imagine a user trying to read a huge blob and stream it somewhere. They would have to implement their own chunking of the blob: determine the size, read first chunk via NewRangeReader, etc. If they just use NewReader, we'll try to download the whole blob.

I propose the following signature as a replacement for both functions:

type ReaderOptions struct {
  Offset    int64
  ReadLen   int64 // 0 means whole blob
  ChunkSize int64 // default = 1GB or some such
}

func (b *Bucket) NewReader(ctx context.Context, key string, opts *ReaderOptions) (*Reader, error)
@vangent vangent self-assigned this Nov 20, 2018
@vangent vangent added this to the Sprint 20 milestone Nov 20, 2018
@vangent vangent changed the title blob: should NewReader take Options and support chunking? blob: unify New[Range]Reader to take Options and support chunking Nov 20, 2018
@vangent
Contributor Author

vangent commented Nov 20, 2018

IMHO the first part of this is pretty clearly better (one function instead of 2, and an Options struct for more flexibility in the future). I will send a PR to make that change, and we can add the new chunking option separately.

@vangent vangent changed the title blob: unify New[Range]Reader to take Options and support chunking blob: unify New[Range]Reader to take Options Nov 20, 2018
@vangent
Contributor Author

vangent commented Nov 20, 2018

I moved the part about chunking to #778. This issue is now just about converting to a single NewReader with options.

@vangent vangent added the in progress This is being actively worked on label Nov 20, 2018
@zombiezen
Contributor

zombiezen commented Nov 27, 2018

I'm on board, with the exception of zero ReadLen meaning "whole blob": it gives an incongruous meaning to that parameter. While zero-length reads are not generally useful, allowing them means that any non-negative value results in a read of exactly that size.

@vangent
Contributor Author

vangent commented Nov 27, 2018

@zombiezen So are you suggesting that 0 (the default value) should in fact be a 0 length read? That doesn't seem ideal to me, as anyone who doesn't provide a ReaderOptions would get a zero-length read, which essentially nobody wants, meaning that everyone would always need to supply a ReaderOptions.

IMHO the zero/default value should provide reasonable default functionality; for reading, reading the whole blob is the most reasonable default behavior.

Is there a better name than ReadLen? In the PR I've called it Length, but maybe MaxLength?

@zombiezen
Contributor

That is my suggestion, yes. I understand why that's not ideal for the Options struct, but that's precisely why the GCP libraries picked the approach of having the two methods (and why we emulated it).

@vangent
Contributor Author

vangent commented Nov 28, 2018

Is there a use case for a zero-length read? I'm not sure what it would be. If the answer is "no", I guess I don't see the problem with interpreting 0 as "read all" instead of -1.

I don't think it's feasible to change the proposal so that the default ReaderOptions does something nobody ever wants. The alternative would be to keep the two separate methods, add an empty ReaderOptions struct, and add it to both of them. I'm not convinced that two methods are better than one here.

it makes it such that any non-negative value results in a read of that size

True; with this change it would be "any positive value..." instead.

@zombiezen
Contributor

I see allowing zero-length reads as a way of avoiding special cases. As a contrived example, let's say you had some naive code that did something like this:

// limitedRead reads all the bytes from n to a predefined, arbitrary limit in the blob.
func limitedRead(ctx context.Context, bucket *blob.Bucket, n int64) ([]byte, error) {
  const limit = 1024
  if n > limit {
    return nil, errors.New("too far")
  }
  r, err := bucket.NewRangeReader(ctx, "foo", n, limit-n)
  if err != nil {
    return nil, err
  }
  data, err := io.ReadAll(r)
  closeErr := r.Close()
  if err != nil {
    return data, err
  }
  if closeErr != nil {
    return data, closeErr
  }
  return data, nil
}

Not allowing zero-length reads would require another branch in the caller to maintain correctness.

@vangent
Contributor Author

vangent commented Nov 28, 2018

Wouldn't that code just change to if n >= limit to be correct?
I see what you're saying, but isn't there a similar argument for using -1 as a sentinel? A pretty similar math error could pass -1 by accident.

I guess we could add a LimitedLengthRead bool (or similar) that you'd have to set explicitly: setting Length to a non-zero value without setting LimitedLengthRead to true would be an error; Length=0 with LimitedLengthRead=false would read the full blob; Length=0 with LimitedLengthRead=true would read 0 bytes.

@zombiezen
Contributor

The zero case above is not an error. The code change would be:

func limitedRead(ctx context.Context, bucket *blob.Bucket, n int64) ([]byte, error) {
  const limit = 1024
  if n > limit {
    return nil, errors.New("too far")
  }
  if n == limit {
    return nil, nil
  }
  // ...
}

The point that I'm trying to illustrate is that while zero is a valid length for a read, a negative number is not. I think it would be a shame to treat zero differently instead of treating an invalid length as the sentinel.

The solution you state would work, but I wonder if there's a better approach. The primary constraint here is that we want the zero value of ReaderOptions to be useful, so I wonder if we could just pull out the length as an explicit argument:

type ReaderOptions struct {
  Offset int64
}

// NewReader opens a new reader that reads at most size
// bytes from the blob with the given key. If size is negative,
// then the entire blob is read. [...]
func (b *Bucket) NewReader(ctx context.Context, key string, size int64, opts *ReaderOptions) (*Reader, error)

@jba
Contributor

jba commented Nov 28, 2018

So now the normal case (read everything) looks like

b.NewReader(ctx, "key", -1, nil)

which seems a bit much.

@vangent
Contributor Author

vangent commented Nov 28, 2018

To summarize, the options we have are:

  1. Keep both NewReader and NewRangeReader, just add (empty) ReaderOptions to both.
  2. Merge them, keeping length as a top-level arg and moving offset into ReaderOptions (@zombiezen suggested this above, @jba says it "seems a bit much").
  3. Merge them, putting both length and offset in ReaderOptions; length=0 is interpreted as "read it all" (@zombiezen is objecting to this as 0 is a legal read length).
  4. Merge them, putting both length and offset in ReaderOptions; length=0 is interpreted as "0 length read" (@vangent objects to this).
  5. Merge them, putting both length and offset in ReaderOptions, plus a LimitLength boolean to distinguish whether length=0 actually means "0-length read" or "whole blob" (the default).

I think we've rejected options 3 and 4. I would vote for option 5. Thoughts?

@zombiezen
Contributor

I agree with the summary. I vote for the first option, since I feel the usages are different.

@shantuo
Contributor

shantuo commented Nov 29, 2018

I think the question now becomes whether 0 should be a valid length. The GCS client library says it is, and treats 0 as reading the metadata only. S3 only has the auto-generated HTTP client, which means the Range header can only take values > 0; there is a separate HEAD request for a 0-length "read". Since we already have a separate Attributes method, I think we could make 0-length reads invalid.

With that said, comparing 1) and 5), I still think 1) is a better API overall. But if we can agree that 0 is an invalid length, then I would vote for 3).

@vangent
Contributor Author

vangent commented Nov 29, 2018

Yeah, if we could declare a 0-length read invalid, I agree that 3) would be best; that's what I have in the associated PR. I'm not sure there's a valid use case for a 0-length read (especially given the separate Attributes call, as you point out), but @zombiezen is arguing that 0-length reads should be considered valid, and after some web searching I'm inclined to agree. This leaves us with 1), 2), or 5).

I'll update the PR to go with 1). 5) would work but seems clunky; I haven't seen that pattern much before, and I agree with @jba that 2) is a bit much.

This whole thing is an unfortunate side effect of zero values that don't correspond to the common/default usage; it comes up a lot with proto3 inside Google as well.

@jba
Contributor

jba commented Nov 29, 2018

I'm okay with 1).

@vangent vangent changed the title blob: unify New[Range]Reader to take Options blob: update New[Range]Reader to take ReaderOptions Nov 29, 2018
@go-cloud-bot go-cloud-bot bot removed the in progress This is being actively worked on label Nov 29, 2018