Shell string encoder #1526

Closed
vit-zikmund opened this issue Jan 18, 2023 · 9 comments

Comments

@vit-zikmund
Contributor

Please describe your feature request.
Hi Mike (thanks for this great tool and remarkable development effort), it would be great to have something like jq's @sh operator, which encodes a string into a literal shell string representation, so one could safely generate things like .env files and otherwise combine yq with shell scripts.

Describe the solution you'd like
In general, one should be able to wrap a string with single quotes and escape any that are in the payload. The handling differs from the yaml "single" quoting style, though.

Let's say we have this little test.yaml:

a: |-
  some string with spaces,
  newline and ' quote '

So the result from yq '.a | @sh' test.yaml would do the same as:

$ yq . test.yaml -o json | jq -r '.a | @sh'
'some string with spaces,
newline and '\'' quote '\'''

The shell treats any consecutive run of characters as a single token as long as it doesn't contain unescaped/unquoted characters from the special IFS variable, which by default contains <space><tab><newline>. These are the default (and usually the only) separators, and I would safely assume this to be the case so as not to overcomplicate the code. As there's no way to escape a single quote within a single-quoted string, one needs to exit the current quoted block, print a literal escaped quote and start another quoted block, which is exactly what can be seen in the output above.

Naturally, this encoder would only apply to the string type and should bail out on another type. Also I think there's no need to have a decoder thanks to all the environment variable handling operators yq already has.
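For illustration, the basic encoding requested here can be sketched in a few lines of Python. The name `sh_quote_basic` is made up for this example and is not part of yq or jq; the sketch only assumes the rules stated above (single-quote everything, turn each embedded quote into `'\''`, reject non-strings).

```python
def sh_quote_basic(value):
    """Wrap value in single quotes, escaping embedded single quotes.

    Hypothetical helper for illustration only.
    """
    if not isinstance(value, str):
        # Bail out on anything that is not a string.
        raise TypeError(f"cannot shell-quote {type(value).__name__}")
    # Close the block, emit an escaped quote, reopen the block: ' -> '\''
    return "'" + value.replace("'", "'\\''") + "'"


text = "some string with spaces,\nnewline and ' quote '"
print(sh_quote_basic(text))
```

For the test.yaml value above, this prints the same two lines as the jq output shown earlier.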

Describe alternatives you've considered
I've imagined that some folks would also appreciate single-line output. Although this is not supported in lightweight shells, the widespread Bash can use literal escapes in a "dollar single-quoted string" like $'\n', which is a simple newline. So in addition to the above, the encoder could escape newlines as \n and wrap that part with $'...' instead of plain '...'.

In that case the output from yq '.a | @sh("oneline")' test.yaml could look like this:

# when blindly wrapping all '-separated parts with `$'...'`
$'some string with spaces,\nnewline and '\'$' quote '\'$''
# or when using `$'...'` only for parts containing a newline
$'some string with spaces,\nnewline and '\'' quote '\'''
# or escaping just the newlines same as the quotes
'some string with spaces,'$'\n''newline and '\'' quote '\'''

I have no preference among the above; each gets the job done. However, the first and last can be implemented as a single pass without holding any context. Also, while this alternative is fun to have, it's way beyond the original request ;)
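As a rough sketch of the last variant (escaping newlines the same way as quotes), the two replacements can simply be chained. The function name `sh_quote_oneline` is hypothetical, and the output relies on `$'...'` support, so it assumes Bash or a similar shell. Only newlines are handled; other control characters pass through unchanged.

```python
def sh_quote_oneline(value: str) -> str:
    """Single-line variant: each newline becomes $'\\n' between quoted blocks.

    Hypothetical helper; its output needs a shell that understands $'...'.
    """
    # Escape the quotes first, so the quotes introduced for the newline
    # handling in the next step are not escaped a second time.
    quoted = value.replace("'", "'\\''")       # ' -> '\''
    quoted = quoted.replace("\n", "'$'\\n''")  # newline -> '$'\n''
    return "'" + quoted + "'"


print(sh_quote_oneline("some string with spaces,\nnewline and ' quote '"))
```

The order of the two replacements matters: swapping them would escape the quotes that belong to the `'$'\n''` sequence itself.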

@vit-zikmund
Contributor Author

I figured out (duh) that the basic functionality can already be achieved within yq, but handling the quotes turns the code into a little escaping nightmare, not for the faint-hearted:

yq "\"'\" + sub(\"'\", \"'\''\") + \"'\"

@mikefarah
Owner

I'd love to put something like this into yq, but I'm not confident about all the escaping rules. I found this library which I could use in yq: would that do the trick? https://github.com/alessio/shellescape

@vit-zikmund
Contributor Author

Hi @mikefarah. Yes, the library would surely fit the functionality.

Pretty much its only interesting piece of code is this simple line, where the author replaces ' with "'", which is the only other way to escape a literal ': wrapping it in double quotes (my example passes it on escaped with a backslash, \'). These approaches are technically equivalent; the only difference is that the backslash escape costs one character less. Given my performance OCD, I prefer that solution, but I also admit that shellescape's one looks kinda nicer 🙂

On top of that, shellescape also searches the input string for at least one character that needs quoting/escaping and quotes the whole string only in that case. This is nice on the eyes, but I'd like to know if it outperforms a blind approach that quotes everything. Shells don't care whether the string is fun or 'fun' as long as it contains no special characters.
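To see the two escape styles side by side, Python's standard `shlex.quote` can stand in for the strategy described here: it returns strings made only of safe characters untouched, and otherwise single-quotes them with each embedded quote wrapped in double quotes. This is an analogy for comparison, not shellescape's code, and `backslash_quote` is a hypothetical helper for the backslash style.

```python
import shlex


def backslash_quote(value: str) -> str:
    # Hypothetical helper: always quote, escape ' as '\''
    return "'" + value.replace("'", "'\\''") + "'"


for s in ["fun", "it's fun"]:
    a = shlex.quote(s)      # double-quote style, quotes only when needed
    b = backslash_quote(s)  # backslash style, always quotes
    # Both spellings must parse back to the original string.
    assert shlex.split(a) == shlex.split(b) == [s]
    print(f"{a} ({len(a)} chars) vs {b} ({len(b)} chars)")
```

Note that `shlex.split` only emulates POSIX-style lexing, so the round-trip check is an approximation of what a real shell would do.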

And to alleviate your concerns - the shell quoting is really that dead simple. Once in a single quoted string block, it gobbles everything as-is. The only concern is the ' itself, as next occurrence just ends the block. So the technique to pass the ' is to end the block, escape the one ' outside of it and start a new ' block.

All in all, it's always better to have a feature, even though potentially not performance optimal 🙂 So feel free to use shellescape off the bat. It's a good one.

mikefarah added a commit that referenced this issue Feb 2, 2023
@mikefarah
Owner

Thanks for your help - I ended up doing the \' - I prefer that too :) Checking with jq that's what it does too, makes it easy to compare :)

Will be available in the next release. I didn't end up doing the newline logic - happy to take an MR for it though

@vit-zikmund
Contributor Author

vit-zikmund commented Feb 2, 2023

Hey @mikefarah , I love you got to it this quickly, thank you, however there's a bug in your implementation (at least in the commit referenced above). Let me break this down real quick in an example, but pls read this carefully once more 🙂

Once in a single quoted string block, it gobbles everything as-is. The only concern is the ' itself, as next occurrence just ends the block. So the technique to pass the ' is to end the block, escape the one ' outside of it and start a new ' block.

TL;DR, the replace logic (line 41) should be:

value = "'" + strings.ReplaceAll(value, "'", "'\\''") + "'"

and the rest accommodated accordingly.

In more detail - given the input string "strings with spaces and a 'quote'", all we need to do is make sure the single quotes pass through intact.

  1. We wrap the whole string in single quotes, so we start with '
  2. Up to the first quote (or the end of the string) it's trivial, just passing things on: 'strings with spaces and a
  3. But at the quote in the payload, we must handle it properly, first by ending the previous quote block with ', so now we have 'strings with spaces and a '
  4. Next we need to pass on the quote unaltered - i.e. escaped (otherwise shell would think we're starting a new quote block) - \', so now it's 'strings with spaces and a '\'
  5. With the quote handled, we can safely start a new quote block to pass on whatever comes next (and is not a quote) - coming to 'strings with spaces and a '\''. This step is virtually a goto step 1, running the same till the end, where we insert the last pairing quote '.
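The steps above can be made visible by building the pieces explicitly instead of doing a single replace. This is only an illustrative sketch; `quote_blocks` is a made-up name, and joining its result gives the same string as the ReplaceAll one-liner.

```python
def quote_blocks(value):
    """Return the alternating pieces: quoted blocks and escaped quotes.

    Hypothetical helper for illustration only.
    """
    pieces = []
    for i, chunk in enumerate(value.split("'")):
        if i > 0:
            pieces.append("\\'")          # the escaped literal quote
        pieces.append("'" + chunk + "'")  # a quoted block (may be empty)
    return pieces


pieces = quote_blocks("strings with spaces and a 'quote'")
print(pieces)
print("".join(pieces))
```

The final piece in the printed list is an empty quoted block `''`, because the input ends with a quote.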

I think there's actually one missing piece to the puzzle - the way the shell concatenates strings. It does so by sticking adjacent pieces together and eventually stripping the quotes, e.g.:

  • 'abc def', abc' 'def and abc\ def parse the same as abc def
  • "abc'def", abc\'def and 'abc'\''def' parse the same as abc'def
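These equivalences can be checked without a shell at hand using Python's `shlex.split`, which emulates POSIX-style tokenization. It is an approximation of a real shell, so treat this as a sanity check rather than proof.

```python
import shlex

same_as_abc_def = ["'abc def'", "abc' 'def", "abc\\ def"]
same_as_abc_quote_def = ['"abc\'def"', "abc\\'def", "'abc'\\''def'"]

for src in same_as_abc_def:
    # Adjacent quoted and unquoted pieces merge into one token.
    assert shlex.split(src) == ["abc def"], src

for src in same_as_abc_quote_def:
    assert shlex.split(src) == ["abc'def"], src

print("all spellings parse to a single token")
```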

FYI, I just realized that while this rather trivial approach is sound and will produce a valid string, it's a bit fuzzy around two edge cases that produce quoted empty strings '' as a by-product:

  1. Having multiple consecutive quotes like a'''b:
    • will blindly end up as 'a'\'''\'''\''b' ('a' \' '' \' '' \' 'b')
    • while it could be easier on the eyes as 'a'\'\'\''b' ('a' \' \' \' 'b')
  2. Having the string start or end with the quote, e.g. 'a':
    • will blindly end up as ''\''a'\''' ('' \' 'a' \' '')
    • while it could be \''a'\' (\' 'a' \')

To tackle this, the replacing logic would have to track whether it's inside a quote or not and might look like this:

# Python sketch (sorry, not yet a Go dev)
def sh_quote(value):
  out = []
  output = out.append
  if not value:
    return "''"  # assumption: keep an empty string as an explicit empty token
  inQuote = False
  for ic in value:
    if ic == "'":  # ic is a quote
      if inQuote:
        output("'")  # get out of the quote block
        inQuote = False
      output("\\'")  # print the escaped quote
    else:  # not a quote
      if not inQuote:
        output("'")  # get into a quote block
        inQuote = True
      output(ic)
  if inQuote:
    output("'")  # print the last closing quote if required
  return "".join(out)

IMHO this would work the most predictably - non-quote chunks wrapped in quotes, literal quotes backslash-escaped, e.g.:

  • 'abc' -> \''abc'\'
  • ab cd -> 'ab cd'
  • $ab'cd -> '$ab'\''cd'
  • $ab' cd' -> '$ab'\'' cd'\'

This doesn't take the "safe chars" \w@%+=:,./- into account and wraps them in quotes like anything else, but if you wanted to use those, we could add that to the logic as:

  else:  # not a quote
    if not is_safe(ic) and not inQuote:
      output("'")  # get into a quote block

In that case the quoted blocks would start with the first unsafe character and end on a quote or the very end.

This would be the minimal option (regarding extra quotes):

  • 'abc' -> \'abc\'
  • ab cd -> ab' cd' (not something I'd write by hand, but still valid)
  • $ab'cd -> '$ab'\'cd
  • $ab' cd' -> '$ab'\'' cd'\'
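A runnable sketch of this minimal variant might look as follows. The name `sh_quote_minimal` is hypothetical, the safe set is taken from the character class quoted above (treated as ASCII-only), and the empty-string case is an added assumption that the thread does not cover.

```python
import re

# Assumption: "safe" means the character class quoted in the thread.
_SAFE = re.compile(r"[\w@%+=:,./-]", re.ASCII)


def sh_quote_minimal(value: str) -> str:
    """Context-tracking quoting that opens a block only on an unsafe char."""
    if not value:
        return "''"  # assumption: an empty string still needs a token
    out = []
    in_quote = False
    for ch in value:
        if ch == "'":
            if in_quote:
                out.append("'")  # leave the current block
                in_quote = False
            out.append("\\'")    # escaped literal quote
        else:
            if not in_quote and not _SAFE.fullmatch(ch):
                out.append("'")  # the first unsafe char opens a block
                in_quote = True
            out.append(ch)
    if in_quote:
        out.append("'")          # close the last block
    return "".join(out)


for s in ["'abc'", "ab cd", "$ab'cd", "$ab' cd'"]:
    print(s, "->", sh_quote_minimal(s))
```

For these four inputs the sketch reproduces the outputs listed above. Once a block is open, safe characters stay inside it until the next quote or the end of the string.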

@vit-zikmund
Contributor Author

Hi @mikefarah, I thought it might be fun to code some actual Go, so here's a PR built on top of what you did.

@vit-zikmund
Contributor Author

Working implementation (hopefully) merged to master. Thank you!

@mikefarah
Owner

mikefarah commented Feb 9, 2023 via email

@mikefarah
Owner

Released in v4.31.1 :)
