Propose code string literals #3450

Open
wants to merge 9 commits into base: master

Conversation

Diggsey
Contributor

@Diggsey Diggsey commented Jun 18, 2023

Add a new syntax for multi-line string literals designed to contain code and play nicely with rustfmt.

Rendered

@programmerjake
Member

programmerjake commented Jun 18, 2023

it would be nice if there was a way to have the first line have indentation, e.g.:

fn f() {
    // something like this -- not sure the best way to indicate indentation that should be included
    let s = ```
        abc
    ```;
    assert_eq!(s, "    abc\n");
}

@petar-dambovaliev

How big of a problem is this? I am inclined to be against adding more stuff to the language for a problem we don't know we have.

@programmerjake
Member

programmerjake commented Jun 18, 2023

How big of a problem is this? I am inclined to be against adding more stuff to the language for a problem we don't know we have.

I've encountered this issue of wanting auto-formatted multi-line strings with embedded indentation multiple times myself. I would likely use ``` code strings a lot, maybe more often than I'd use the dyn keyword.

@digama0
Contributor

digama0 commented Jun 18, 2023

This design makes it impossible to indent the first line.

@Diggsey
Contributor Author

Diggsey commented Jun 18, 2023

This design makes it impossible to indent the first line.

Yes, it's designed to handle code, and in languages where whitespace is significant (e.g. Python) indentation is also relative to previous lines.

In cases where you need to indent the first line, then a normal or raw string literal may be more appropriate.

@calebcartwright
Member

cc @rust-lang/rustfmt and @rust-lang/style for awareness

@digama0
Contributor

digama0 commented Jun 18, 2023

This design makes it impossible to indent the first line.

Yes, it's designed to handle code, and in languages where whitespace is significant (e.g. Python) indentation is also relative to previous lines.

In cases where you need to indent the first line, then a normal or raw string literal may be more appropriate.

This seems rather over-indexed on python code. I'm all for preserving whitespace in this way, but why not do it like it is done in doc strings, and just trim as many characters as possible from the left margin (uniformly to all lines)?
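
A minimal sketch of the doc-comment-style rule suggested here, assuming spaces-only indentation and that blank lines are ignored when computing the margin. `dedent_min` is a hypothetical helper name for illustration, not anything from the RFC:

```rust
/// Strip the largest common run of leading spaces from every non-blank line.
/// Hypothetical helper; assumes indentation uses spaces only.
fn dedent_min(body: &str) -> String {
    // Blank (whitespace-only) lines do not constrain the margin.
    let margin = body
        .lines()
        .filter(|l| !l.trim().is_empty())
        .map(|l| l.len() - l.trim_start_matches(' ').len())
        .min()
        .unwrap_or(0);

    // Remove `margin` spaces from each non-blank line; blank lines become empty.
    body.lines()
        .map(|l| if l.trim().is_empty() { "" } else { &l[margin..] })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Under this rule a first line can keep extra indentation as long as some other line is indented less, but a string in which every line is indented still cannot be expressed.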

@programmerjake
Member

i thought of a way the first line could be indented, just have another line be indented less:

{
    let return_and_close = ```
        return retval;
    }
    ```;
    assert_eq!(return_and_close, "    return retval;\n}\n");
}

@digama0
Contributor

digama0 commented Jun 18, 2023

By my reading of the proposal the string literal would be an error because the second line is less indented than the first.

@programmerjake
Member

By my reading of the proposal the string literal would be an error because the second line is less indented than the first.

well, the proposal can fix that :)

@Diggsey
Contributor Author

Diggsey commented Jun 18, 2023

This seems rather over-indexed on python code.

Python is simply the most prominent example from this list: https://en.wikipedia.org/wiki/Off-side_rule

If you look through the examples on that page, you'll see they all use indentation in the same way, so I didn't see much need to support indentation on the first line.

I went with the more conservative approach for this RFC because a) I was unable to come up with a use-case for indentation on the first line, b) relaxing the rules later is a backwards compatible change, and c) even with the relaxed rules, there are still strings you cannot represent (eg. a single indented line, or all lines indented) and the set of things that can't be represented is much harder to quantify under the relaxed rules.

ie. it seems much easier to say that if you need precise control over indentation, use raw strings.

Co-authored-by: Caleb Cartwright <calebcartwright@users.noreply.github.com>
@ehuss ehuss added the T-lang Relevant to the language team, which will review and decide on the RFC. label Jun 18, 2023
@Diggsey
Contributor Author

Diggsey commented Jun 19, 2023

@petar-dambovaliev

How big of a problem is this? I am inclined to be against adding more stuff to the language for a problem we don't know we have.

sqlx has over 6 million downloads and is rapidly approaching diesel as the most popular crate for interacting with SQL databases. I've personally run into this problem with this specific crate at two different companies using Rust in production, and in numerous personal projects.

The asm! macro is part of the standard library, and could benefit greatly from this.

The json! macro from serde_json (a crate with ~150 million downloads) is horribly slow to compile, and could be replaced with a procedural macro that processes a code literal.

There are many different crates implementing some form of html! macro - these could all compile faster and have better UX if they used code literals.

DSLs are extremely powerful, and this kind of string literal is well suited to embedding DSLs within Rust programs.

I think Rust would benefit immensely from having some kind of "relative-indentation" string literal, regardless of whether it takes the exact form proposed here.

@VitWW

VitWW commented Jun 19, 2023

For example, PHP quite recently (in v7.3) added indentation support to heredoc strings.

So it is a useful feature.

@Lokathor
Contributor

I think the actual mechanism of three backticks instead of a double quote is perhaps strange. can't we do a hash prefix on the string literal to mark the string literal as being code-ish and then rustfmt can know to format such string literals?
code#"words here"#
or something like that?

@petar-dambovaliev

For example, PHP quite recently (in v7.3) added indentation support to heredoc strings.

So it is a useful feature.

With all due respect, I don't think we should take notes on what the PHP people are doing.

@petar-dambovaliev

@petar-dambovaliev

How big of a problem is this? I am inclined to be against adding more stuff to the language for a problem we don't know we have.

sqlx has over 6 million downloads and is rapidly approaching diesel as the most popular crate for interacting with SQL databases. I've personally run into this problem with this specific crate at two different companies using Rust in production, and in numerous personal projects.

The asm! macro is part of the standard library, and could benefit greatly from this.

The json! macro from serde_json (a crate with ~150 million downloads) is horribly slow to compile, and could be replaced with a procedural macro that processes a code literal.

There are many different crates implementing some form of html! macro - these could all compile faster and have better UX if they used code literals.

DSLs are extremely powerful, and this kind of string literal is well suited to embedding DSLs within Rust programs.

I think Rust would benefit immensely from having some kind of "relative-indentation" string literal, regardless of whether it takes the exact form proposed here.

Many people, including myself, are writing Rust in production and are using DSLs. You are fixing a problem that I (and anyone I worked with) didn't know I even had. That should tell you something. Is there anything else to this other than formatting?

@digama0
Contributor

digama0 commented Jun 19, 2023

I'm sure I'm not the only one who has written things like this:

  writeln!(w, "  \
        <!-- <link rel=\"shortcut icon\" href=\"{rel}favicon.ico\"> -->\
    \n</head>\
    \n<body>\
    \n  <div class=\"body\">\
    \n    <h1 class=\"title\">\
    \n      {h1}\
    \n      <span class=\"nav\">{nav}</span>\
    \n    </h1>")

Multiline string literals where you want to be able to indent the body (because having them flush left looks atrocious) but also preserve leading indentation is a mess right now, so I definitely appreciate the motivation behind this RFC. I'm not sure it captures all the things I might want to do though, since for example it might not be the case that the first line is unindented (like in this example, where the first line has a two space indent) or that the minimum indentation is 0 (which is true in this case but might not be if interpolating an inner element instead of the <body> tag).

ie. it seems much easier to say that if you need precise control over indentation, use raw strings.

This seems like quite an unsatisfactory resolution, since the whole point of the syntax is to allow for precise control over indentation which is not otherwise preserved by multiline strings using \ line terminators.

As the example above should show, if you are outputting indented syntax for a language like python (or in this case, formatted HTML), just because the top level is zero indent doesn't mean that fragments of the string are also zero indent, or that the first line will be zero indent since it might be a fragment of the full output. I have seen the same pattern with formatted Rust code generation, the fragments will generally be inside some kind of scope and hence be nonzero indent, and the repeating block may or may not align with language constructs so the first line might not be the least indented.
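
To make the status quo concrete, here is a small sketch of how today's `\` line continuation behaves: it removes the newline and all leading whitespace on the following line, which is why the `writeln!` example above has to spell out `\n` and the wanted spaces by hand. The function names are only for illustration:

```rust
/// Line continuation: the backslash swallows the newline and all leading
/// whitespace on the next line, so the source indentation is lost.
fn continued() -> &'static str {
    "<body>\
       <div>"
}

/// The workaround: an explicit `\n` followed by the wanted spaces,
/// placed right after the continuation.
fn workaround() -> &'static str {
    "<body>\
    \n  <div>"
}
```

Here `continued()` yields `<body><div>` on one line, while `workaround()` yields `<body>`, a newline, and then `<div>` with a two-space indent.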

@clarfonthey
Contributor

With all due respect, I don't think we should take notes on what the PHP people are doing.

This level of disrespect for other communities is unproductive and doesn't help with the discussion. Whether your feelings are justified or not, it's better to explain why this particular feature doesn't fit for Rust, instead of just showing a general aversion to other language(s).

@digama0
Contributor

digama0 commented Jun 20, 2023

Incidentally, the PHP heredoc syntax solves the indentation issue by using the indentation of the closing quote character, rather than the first content line or the minimum indentation. Thus:

let x = ```
      4 space indented
    2 space indented
  ```;

let x = ```
      2 space indented
     1 space indented
    ```;

let x = ```
    2 space indented
 error, less indented than end delimiter
  ```;

Also note that the newline before the end delimiter does not count as part of the string, you would have to add a \n at the end if you want a trailing newline.
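
The closing-delimiter rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the RFC's specification: `dedent_by_closing` is a hypothetical helper, indentation is spaces only, blank lines are exempt from the check, and the caller passes the content lines together with the number of spaces before the closing delimiter:

```rust
/// Dedent `lines` by the indentation of the closing delimiter.
/// `closing_indent` is the number of spaces before the closing delimiter.
/// Returns an error if a non-blank line is indented less than that.
fn dedent_by_closing(lines: &[&str], closing_indent: usize) -> Result<String, String> {
    let mut out: Vec<&str> = Vec::with_capacity(lines.len());
    for (i, &line) in lines.iter().enumerate() {
        // Assumption: whitespace-only lines are emitted as empty lines.
        if line.trim().is_empty() {
            out.push("");
            continue;
        }
        let indent = line.len() - line.trim_start_matches(' ').len();
        if indent < closing_indent {
            return Err(format!(
                "line {} is less indented than the end delimiter",
                i + 1
            ));
        }
        out.push(&line[closing_indent..]);
    }
    // The newline before the end delimiter is not part of the string.
    Ok(out.join("\n"))
}
```

Applied to the three examples above, the first gives lines indented by four and two spaces, the second by two and one, and the third is rejected.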

@programmerjake
Member

Incidentally, the PHP heredoc syntax solves the indentation issue by using the indentation of the closing quote character, rather than the first content line or the minimum indentation.

TLDR: I think the PHP heredoc syntax is the best so far.

The PHP heredoc syntax is basically what I was going to suggest (I didn't know PHP used it), though I didn't since it only works when the closing ``` is at the start of its line (ignoring indentation), which wouldn't work for the RFC's proposed syntax for no trailing \n which is:

let s = ```
line with no ending line terminator```;

Also note that the newline before the end delimiter does not count as part of the string, you would have to add a \n at the end if you want a trailing newline.

This neatly solves the issue by making the syntax for no trailing \n be:

let s = ```
line with no ending line terminator
```;

and the corresponding syntax with a trailing \n is:

let s = ```
line with a ending line terminator

```;

functionality as required.

If it is necessary to include triple backticks within a code string
literal, more than three backticks may be used to enclose the
Member

Hmm, I'm torn here. Using the same thing as in doccomments makes sense, but at the same time when we already have "use more #s to escape more" I don't feel amazing about also having a "use more `s to escape more" construct.

@Diggsey
Copy link
Contributor Author

Diggsey commented Jun 20, 2023

Thanks for the feedback. I've updated the RFC to propose a variant of the "heredoc"-style indentation rules and updated the "prior art" section.

I've also attempted to enumerate every possible syntax variation that has been suggested in the alternatives section.

I've kept the triple backtick quote style for now, but I am torn between that and some of the other quote styles. However, I think the new choice for the indentation rules is the best option so far, especially when combined with the modification to optionally suppress the final newline.

@ksaadDE

ksaadDE commented Jun 29, 2023

@Diggsey That is very off-topic. I keep it short. Most ORMs I know of are optimized for general-purpose (e.g. CRUD).

  1. DSLs are nothing new. Almost two decades ago they were already used in several companies (that existed back then) and gov bodies. Only the amount of work needed to create them has been reduced (looks at nim).
  2. DSL and ORM are not opposing concepts, instead they work together see: Engineer/Software <-> DBO <-> ORM <-> DSL (<-> SQL <-> RDBMS) Sometimes you are directly using DSLs to obtain data e.g. for Data Science
  3. Professionals are using ORMs to work on the objects (maintainability) and not running into issues (and thus reinventing the wheel) like those that have been already solved many years ago (efficiency). The boss/customer does not wait or reward work that does that. So no, in most (general-purpose) use-cases using the ORM is better (and more efficient).

I'm fully aware that there is NoSQL out there. And other (semi-)old technology, new/old approaches etc. Intentionally left out. DTOs are excepted as well. Also, (qualitative) complexity depends on the problem you want to solve (e.g. OS complexity).

@ChrisDenton

Yes, I think this conversation is drifting very offtopic. I believe the central point is that people can and do use DSLs in Rust. This RFC is proposing one way to improve support for people who do.

@ksaadDE

ksaadDE commented Jun 29, 2023

@ChrisDenton

central point is that people can and do use DSLs in Rust. [...] proposing one way to improve support for people who do

Ack & agreed, no dispute about that.

Thinking further we should talk (at some point) about "how" people should (be allowed to) use that cool feature. The mentioned "fear" of bloated files with these "code blocks" in them (very lengthy too) could have a serious impact on maintainability (code-quality) and production use of software written in Rust.

@tmccombs

Regarding length enforcement/linting, are there any existing lints around string length? I can't think of any reason why there should be lints for length of indented strings but not regular non-indented strings.

@VitWW

VitWW commented Jun 29, 2023

What about nested strings?
I think, multiple "#" could be fine (for h#"..."# syntax)

let s1 = h#"
    let s2 = h##"
        let s3 = h###"
            string
            "###;
        "##;
    "#;

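For comparison, existing raw strings already nest this way by varying the number of `#`s, so the suggestion mirrors current behaviour. A small sketch using only today's syntax (the function name is just for illustration):

```rust
/// An outer raw string delimited with two hashes can contain a complete
/// inner raw string literal that is delimited with one hash.
fn outer() -> &'static str {
    r##"let inner = r#"string"#;"##
}
```

The inner `"#` does not terminate the outer literal because the outer one only ends at `"##`.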
@Nemo157
Member

Nemo157 commented Jul 1, 2023

I think a clippy lint like "this literal is very long, consider moving it into a separate file and using include_str!(...)" would be a decent lint to have.

This is not (yet) possible if the literal is being passed to a proc-macro. Maybe once proc-macro-expand is stabilized such a lint would be useful (though, it'd need proc-macros to be updated to use expansion) but for now if the literal is going into a proc-macro (likely the common case) it should be suppressed.

1. It adds four new types of string literals given all
the combinations.

# Rationale and alternatives
Member

Another alternative I have mentioned on zulip:

Improve include! handling (when passed as literals to macros? in editors?) instead to make it more ergonomic to outline other-language code rather than inlining.

Pros:

  • works better with simple tools that don't handle nested languages well

  • establishes a new indent context, i.e. doesn't need to be adjusted with surrounding code which in my experience can be error-prone if the editor's indentation handling is imperfect. Examples of confusions:

    • inside comments
    • inside macros
    • inside doc comments
    • when current indentation is inconsistent with configured rules
    • when copy-pasting into a differently indented context
  • generally avoids stacking complexity

    some_proc_macro!{
       mod m {
          /// This is an example with nesting and several levels of indentation and whitespaces
          ///
          /// ```rust
          /// let p = h"python
          ///      def py():
          ///          a = '''Lorem ipsum dolor sit amet,
          ///          consectetur adipiscing elit,
          ///          sed do eiusmod tempor incididunt
          ///          ut labore et dolore magna aliqua.'''
          ///          print(a) 
          ///      ";
          /// ```
          ///
          fn nesting_fun() {}
       }
    }

Cons:

  • requires editor support if you want to view or even edit the included file in the context of its parent instead of opening a new view. But showing an overlay might be less complex than all the text nesting
  • context of substitutions may be harder to see

Since this is motivated by making things easier for rustfmt I recommend contacting the maintainers of other tools (syntax highlighters, editors, IDEs, ...) to see if this change helps or adds complexity for them.

Contributor Author

I don't consider this an alternative. Requiring powerful editor support to even use the feature makes it a no-go, and having to store things in separate files is a maintenance burden that's worse than the current situation, since it requires coming up with a naming scheme for those files that makes sense, makes it harder to resolve merge conflicts since tools like git will never understand this "magic include", and is way more complicated than what is proposed in this RFC.

The advantages you list I also consider to be problems with your approach. You say it works better with simple tools, but the opposite is true: you end up with something unworkable without powerful editor features. In contrast this RFC doesn't require any editor features at all to be an improvement over the status quo. Any support for nested language is an optional extra that doesn't affect the core functionality.

Your example of "stacking complexity" seems very straightforward tbh. Infinitely better than having to go to a separate file.

Since this is motivated by making things easier for rustfmt I recommend contacting the maintainers of other tools (syntax highlighters, editors, IDEs, ...) to see if this change helps or adds complexity for them.

It by definition does not add any complexity for tools other than rustfmt, since the only required change as a result of this RFC is allowing a new prefix letter (h proposed here) and tools must already support that. Beside that, anything that is valid to do with a raw string literal is also valid to do with an h raw string literal.

Contributor

It by definition does not add any complexity for tools other than rustfmt, since the only required change as a result of this RFC is allowing a new prefix letter (h proposed here) and tools must already support that. Beside that, anything that is valid to do with a raw string literal is also valid to do with an h raw string literal.

Anything that adds syntax complicates syn and any other tools that use it or otherwise parse rust code. I can't imagine that it would ever be safe to just assume that any string prefix acts like a regular string literal, since raw strings already violate that, hence individual new letters have to be added to anything that parses rust code, including syntax highlighters (although the backup behavior is usually good enough for these).

Member

You end up with something unworkable without powerful editor features.

How? Even many simple editors at least have tabs, panes or similar UI elements to view more than one file at a time.

At its most primitive you rely on your window manager and file browser to open multiple files at the same time in separate windows and show them side by side.

Any support for nested language is an optional extra that doesn't affect the core functionality.

A simple editor can have primitive syntax-highlighting that will work with separate files based on file extensions but won't work with inlined content. So this RFC makes things worse for simple editors

makes it harder to resolve merge conflicts since tools like git will never understand this "magic include"

I don't see how it would make things more difficult for git. If anything it makes diffs simpler due to fewer whitespace adjustments.

Requiring powerful editor support to even use the feature makes it a no-go,

Where did I say that a powerful editor would be required? Rather I'm suggesting
a) improve powerful editors
b) keep things simple for simple editors

This covers both.

Your example of "stacking complexity" seems very straightforward tbh. Infinitely better than having to go to a separate file.

What is straightforward about it? If you actually have to edit, indent, copy-paste, syntax-highlight or auto-complete that, there are lots of pitfalls.
Note the outer macro, which tends to make things more difficult for tools because at that point they might not even know anymore whether they're dealing with Rust or just things that happen to tokenize like Rust.
And it's rust -> macro -> markdown -> codeblock (with language annotation) -> multiline string (with another language annotation).
These languages could be configured to have different indent rules!

Member

I don't consider improving support for include!-like things to be a negative (the proc-macro-include RFC and the proc-macro-expand feature would both be great to have), but it's a feature for different use cases than this. This RFC improves support for things that people are already doing. Even if we had better forms of include! I would not pull out 3 lines of SQL to a separate file just to get syntax highlighting, I would simply do what we do currently: use the existing literal strings and fight with rustfmt every time the surrounding code changes.

Member

Separation of languages is the norm and should be encouraged. See the HTML/CSS/JS split that is encouraged instead of having inline script handlers and styles. See template files. See module trees.

You say my approach is a no-go because it makes things more difficult for simple editors. And yet you acknowledge that this RFC will primarily benefit complex editors. While I think my approach would benefit simple editors because they can then work with the outlined language.

At the moment, these strings are in the file and so can be reviewed and have conflicts resolved in-place. By moving them to a separate file you can no longer perform these actions with any context about the surrounding code.

I assume they'd conventionally still be placed in the same directory and show up in the diffs next to each other.

To make that at all workable you'd need a powerful editor to allow treating them as though they weren't in a separate file

Not necessarily. E.g. when you have an SQL query query!(include!("query.psql"), param1="val", param2="val", ...) then it has an API, like a function call. You edit functions separately and then fix their callsites.
So "jump to definition" + error messages from the query! macro about missing arguments would already cover that.

Member

@Nemo157 Nemo157 Jul 8, 2023

And yet you acknowledge that this RFC will primarily benefit complex editors.

My expectation is that this RFC will not affect complex editors (in cases where they are not acting as simple editors).

A complex editor that is using heuristics to determine when to apply other-language syntax highlighting to a literal could similarly use those heuristics to determine when to apply other-language auto-formatting to a literal.

This RFC simply provides support for auto-indentation (but not formatting) of literals for simple editors (and complex editors where their heuristics don't apply) that use rustfmt.

EDIT: actually, I forgot that this RFC also included language hints, which would allow a very strong hint to the complex editor heuristics of what other-language to treat a literal as, but it also likely allows editors in between simple and complex to use very simple heuristics and start multi-language highlighting where they couldn't previously.

EDIT2: To clarify some of my categorical assumptions to make sure there's no misunderstanding:

  • simple editor: notepad -> notepad++ -> unconfigured vim
    • no code understanding or only simple regex based highlighting
  • complex editor: neovim/vscode + LSP, jetbrains
    • semantic code understanding, so it actually knows which macro literals are being passed to
  • in between: minimally configured vim/neovim without an LSP
    • still just syntactic code understanding, but better than the simple regexes, so it doesn't know which macro is which to use for multi-language heuristics, but it can parse and use language hints


Separation of languages is the norm and should be encouraged.

That is not my experience. I've almost never seen sql queries pulled out into separate files. Most assembly I've seen is inline. Shader languages are a bit of a mix, and I don't have as much familiarity with it, but I don't think it is at all unusual to include shader code inline, especially if it is small. And this feature would be very useful for help text for cli programs. I can't imagine using a separate file for the help comment for every option in my cli that uses clap.

See the HTML/CSS/JS split that is encouraged instead of having inline script handlers and styles

But we also have frameworks like react, where html and css are embedded in Javascript. Or svelte where the JS is included in an html template.

Contributor

Shader languages are the one case I can think of where people actually care about "separation of languages", and then it has to do more with the fact that GPU code inherently has a modularity to it, because it is run in passes, and people tend to pull out modules into, well, modules. So you may as well have, e.g.

  • code.cpp
  • code.hpp
  • code.vert
  • code.frag

But ofc you may well just encounter something like

  • code.cs
  • code.hlsl

Depending.

Member

@the8472 the8472 Jul 9, 2023

notepad++

Has syntax highlighting.

But we also have frameworks like react, where html and css are embedded in Javascript. Or svelte where the JS is included in an html template.

Yes, and I have encountered issues with those kinds of multi-language, framework-specific file formats that make me prefer separate files. Simple editors either didn't support them at all or mistook them for only one of the languages; complex editors had configuration issues because they picked up the wrong preprocessor version or something, which led to lots of bogus squiggles in those files while vanilla JS files had no issues.

Most assembly I've seen is inline.

https://github.com/xiph/rav1e/tree/master/src/arm
https://github.com/memorysafety/rav1d/tree/main/src/x86
https://github.com/rust-lang/stacker/tree/master/psm/src/arch

Though none of that needs to be include!ed / act as a template in the first place, it's static code with a fixed interface and compiled separately. I can't think of a project that needs templated ASM.

@mattheww

mattheww commented Jul 9, 2023

The reference-level explanation should say what happens in Rust 2018 and earlier (where supporting these literals would be an incompatible change; see reserved-prefixes).

Contributor

@workingjubilee workingjubilee left a comment

I do not believe this proposal sufficiently engages with why programming languages other than Python and Markdown make the choices they do. In particular, the Swift programming language chooses to instead reject anything on the first line (before the multiline literal "proper"), and I think for good reasons. It is very easy to go from emitting a string literal something like this:

let text = "text\ntext\ntext\ntext";

to emitting this, in pursuit of nicer formatting for the generated code:

let text = "text
    text
    text
    text
";

This accidentally loses the first line. Even with a clarification of this RFC to add restrictions to what is allowed to go there so fewer inputs can be silently dropped, I don't think it is very "in character" for Rust to allow code that may be incorrect to pass compiling when it would be very easy to use a slightly different rule and catch a common mistake.
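
The failure mode can be sketched as follows, assuming (as the quoted RFC text below puts it) that anything directly after the opening quote, up to the first newline, is treated as a hint and not as string content. `split_hint` is a hypothetical helper for illustration, not the RFC's algorithm:

```rust
/// Split the raw text between the quotes into (language hint, body),
/// assuming everything up to the first newline is the hint.
/// Returns `None` when there is no newline at all.
fn split_hint(raw: &str) -> Option<(&str, &str)> {
    raw.split_once('\n')
}
```

With the input `"text\n    text\n    text\n"`, the first `text` ends up in the hint position and is silently absent from the body.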

Comment on lines +145 to +146
Anything directly after the opening quote is not considered
part of the string literal. It may be used as a language hint or
Contributor

Suggested change
Anything directly after the opening quote is not considered
part of the string literal. It may be used as a language hint or
Anything directly after the opening quote is not considered
part of the string literal. It may be used as a language hint or

There is no specified separator aside from the implied separator of the newline. Some people have mistaken this proposal as only allowing a constrained option here. It does not. It says "Anything", and specifies no compiler error if the symbols that immediately follow the " are, say... ". Perhaps you meant to constrain it.

Contributor Author

I'm not sure if this is what you're getting at, but the first line is still constrained by the delimiters of the string, i.e. if the string begins h" then a single " character will still close the string even if it's on the first line. If a single " character would not close the string then it would still be allowed on the first line. The indentation and language hint rules apply "after" we've determined the bounds of the string literal.

Comment on lines +146 to +153
part of the string literal. It may be used as a language hint or
processed by macros (similar to the treatment of doc comments).

```rust
let sql = hr#"sql
SELECT * FROM table;
"#;
```
Copy link
Contributor

I do not believe placing metadata regarding the string inside the visible string delimiter tokens should be accepted, as it has many negative impacts. In particular, there is an isomorphism between strings written using r#""# and strings written using "" (and without STRING_CONTINUE, i.e. 0x5C 0x0A), currently, that as far as I know is complete. This proposal would create a surjective function: there would be string literals written using

h"languagetag
"

which have no mirror image using the other syntactic forms for string literals. This causes great amounts of confusion for:

  • Lexing
  • Parsing
  • Code generation

And the very purpose of this language hint is for the service of syntax highlighters and the like, which are very likely going to be written in a language that may have no easy access to simply running syn or tree-sitter or whatever, and may instead be bashed together out of JavaScript and regexes.

Copy link
Contributor Author

In particular, there is an isomorphism between strings written using r#""# and strings written using "" (and without STRING_CONTINUE, i.e. 0x5C 0x0A), currently, that as far as I know is complete.

Not sure what you're getting at here. There are already many ways to get the same "literal value" using different "encodings" of the same literal. For example, tabs could be encoded with \t or an actual tab character.

This proposal would create a surjective function: there would be string literals written using

h"languagetag
"

which have no mirror image using the other syntactic forms for string literals.

This causes great amounts of confusion for:

Lexing
Parsing
Code generation

This is going to need some more justification.

First of all, the language hint is purely a syntactic feature, it doesn't change the "value" of a string literal, so in terms of "values" (which if we're using set theoretic terms, is the most plausible thing to talk about but you haven't actually defined that...) there is the same amount of isomorphism between code string literals and raw string literals as there was between raw string literals and string literals (modulo indentation being relative, which is the entire point of the proposal).

Secondly, I flat out don't believe that this does introduce significant complexity in those areas. The compiler/tooling is already capable of dealing with string literals and raw string literals. This RFC doesn't change the basic rules for when a string begins/ends - the parsing rules are identical to the corresponding non-code-literal form. The only change is to how the content within the literal is converted into a value for use by the program.

- Byte string literals `hb"`
- Raw byte string literals `hbr#"`

The `h` modifier will appear before all characters in the prefix.
Copy link
Contributor

There is no persuasive and particular reason offered to have this precede all other characters in the prefix. It would be preferable to assume that we are going to explore accepting a non-canonical ordering.

Copy link
Contributor Author

@Diggsey Diggsey Jul 10, 2023

Experimentally, br"<content>" compiles, but rb"<content>" does not compile. This implies that we are already particular about the order of string prefixes, and so I wrote this RFC with consistency in mind. I don't particularly care about what order is "canonical" but this rule was easy to define and seemed reasonably intuitive. If you have a strong reason to prefer a different order I'd love to hear it.
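A small sketch of that observation, using only existing Rust syntax. The function name is made up for the example, and the rejected spelling is left commented out so the snippet still compiles.

```rust
fn prefix_order_is_fixed() {
    // `br` is the accepted prefix for a raw byte string literal.
    let raw_bytes: &[u8] = br"a\n";

    // Raw literals do not process escapes, so this holds three bytes:
    // 'a', a backslash, and 'n'.
    assert_eq!(raw_bytes, &b"a\\n"[..]);
    assert_eq!(raw_bytes.len(), 3);

    // The reversed prefix does not compile today:
    // let rejected = rb"a\n";
}
```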

Comment on lines +178 to +184
An `h` modifier may be added to the prefix of the following string
literal types:

- String literals `h"`
- Raw string literals `hr#"`
- Byte string literals `hb"`
- Raw byte string literals `hbr#"`
Copy link
Contributor

Why does this not include c"?

Copy link
Contributor Author

No particular reason - I was using stable Rust as a baseline, but I can update the RFC to include C string literals. The intent is that they combine in the natural way. That said, it looks like the implementation of the C string literal RFC was reverted due to breakage, so... We'll see.

Comment on lines +269 to +272
The main drawback is increased complexity of the language:

1. It adds a four new types of string literals given all
the combinations.
Copy link
Contributor

Teaching.

String literals are used in pattern matching. It will be very annoying to explain why a metadata tag that can be written as part of the literal and lives inside what appears to be the string's delimiter tokens does or does not participate in pattern matching. I would prefer the question simply not arise.

Specifically, this works:

let "SELECT" = &maybe_select_expr[0..6] else {
    return;
};

And I presume, with this proposal, that this would work:

let h"
    SELECT
" = &maybe_select_expr[0..6] else {
    return;
};

But I do not want to explain why either of these may or may not work:

let h"x86asm
" = &maybe_sql_expr[0..0] else {
    return;
};

let h"sql
" = &maybe_sql_expr[0..0] else {
    return;
};

All answers seem bad, to me. Introducing a form that allows the question to arise in the first place can simply be avoided.

Copy link

Sure, we could add a more complex rule for language_tag, for example that a first line with a tag must end with #.

let h#"sql#
    SELECT
    "# = &maybe_sql_expr[0..2] else { return; };

Copy link
Contributor

You have not actually clarified anything as long as the tag is inside the quotation marks.

Copy link
Contributor

@TimNN TimNN Jul 10, 2023

I guess the h#<lang> syntax was more natural when this RFC was still proposing the markdown-like triple backtick syntax (```<lang>).

Once feature(stmt_expr_attributes) is stabilized, I think that would nicely enable something like this (even if it is somewhat more verbose):

let sql = #[editor::inject_lang(sql)] h#"
    SELECT * FROM table;
    "#;

Copy link
Contributor Author

But I do not want to explain why either of these may or may not work:

As proposed in the RFC, both of those would match as the language hint is not part of the value. I don't think this case is really any different from eg.

let "\t" = &maybe_tab_expr[0..0] else {
    return;
};

let "	" = &maybe_tab_expr[0..0] else {
    return;
};

Or:

enum Foo {
    Bar,
    Baz,
}

use Foo::Bar as Bat;


fn main() {
    match Foo::Bar {
        self::Bat => println!("Bat"),
        Foo::Baz => println!("Baz")
    }
}

Ultimately, you can't expect pattern matching to be syntactic - it's fundamentally about the value.
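As a sketch of that point with syntax that exists today: the pattern below is written with one spelling of each literal, and every other spelling of the same value matches it. The function names are invented for the example.

```rust
fn classify(s: &str) -> &'static str {
    match s {
        // The pattern is written with the `\t` escape...
        "\t" => "tab",
        "SELECT" => "select",
        _ => "other",
    }
}

fn patterns_compare_values() {
    // ...but other spellings of the same one-character value match it too.
    assert_eq!(classify("\x09"), "tab");
    assert_eq!(classify("\u{9}"), "tab");

    // Raw and non-raw spellings of the same text are equally interchangeable.
    assert_eq!(classify(r"SELECT"), "select");
    assert_eq!(classify(r#"SELECT"#), "select");

    // A two-character string (backslash, then 't') is a different value.
    assert_eq!(classify(r"\t"), "other");
}
```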

Copy link
Contributor

Yes yes, and 2_5_5 also is matched by 255 and 0xFF, as those names alias, and if you introduce a specific alias for something, shockingly, it matches. And introducing redundant aliases without consideration for the potential harms to understanding is what I am objecting to.

However, your comments have made apparent to me that you fundamentally do not actually believe this increases language complexity, as you don't think it makes it harder to parse or understand the source code, so I also object to the text written here. If mere quantitative increase in ways to express something does not count as an increase in complexity, then this entry is a lie and there is no drawback.

Suggested change
The main drawback is increased complexity of the language:
1. It adds a four new types of string literals given all
the combinations.
None.

Copy link
Contributor Author

However, your comments have made apparent to me that you fundamentally do not actually believe this increases language complexity, as you don't think it makes it harder to parse or understand the source code

I do think it increases language complexity, but only in the sense that the language now has N+1 features rather than N. Where I disagree with you is the idea that there is a qualitative rather than quantitative difference in complexity in comparison to existing string literals.

My hope is that even this incremental increase in complexity could be later reduced: given that the feature is designed to be able to represent every possible string, I think there's a world where a future edition simply makes all multiline literals behave like the literals proposed here. I think it would be appropriate to propose this more drastic change if we later find that the use of code string literals naturally replaces the use of multiline string / raw string literals due to people preferring an "indentation relative form", and if no unforeseen drawbacks are encountered.

@tmccombs
Copy link

I wonder if maybe the tag part should be deferred to a later PR (but kept in the future possibilities section). And for now just error if there is any text on the first line. Although, then there is a risk that macros or external tools rely on that behavior and break if and when tags are added later.

I also think that the RFC should better specify what it means to measure the whitespace. IMO, the cleanest way would be to require that the indentation must exactly match on each line. So for example you can't have tabs on one line, and spaces on another, or even the same number of spaces and tabs, but in different order. Or go even further and forbid mixed spaces and tabs altogether.

It also feels a little weird to me that the empty string takes multiple lines with this:

let empty = h"
    -";

and I'm not overly fond of the "-" to suppress the final newline. I can't think of anything obviously better though.

I will suggest another alternative. The final newline could be suppressed with a backslash on the penultimate line, like so:

 let s = h"
    something \
    ";

That doesn't require adding any additional syntax, since it works the same as regular strings.
This has a couple of problems, though. It doesn't work for raw strings, unless we add a special exception for this to raw indented strings. And you now need three lines to represent the empty string with the h prefix.
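For reference, this is how the backslash continuation behaves in today's literals, and why it has no effect in raw strings. The function name is only for the example.

```rust
fn continuation_in_existing_literals() {
    // In a normal string, a backslash directly before the newline removes
    // the newline and the leading whitespace of the following line.
    let continued = "something \
        ";
    assert_eq!(continued, "something ");

    // A raw string performs no escape processing, so the backslash and the
    // newline are both kept as content.
    let raw = r"something \
";
    assert_eq!(raw, "something \\\n");
}
```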

@Diggsey
Copy link
Contributor Author

Diggsey commented Jul 10, 2023

@workingjubilee

I think for good reasons. It is very easy to go from emitting a string literal something like this:

It's a fair criticism. There's certainly a risk there, but it's difficult to say how significant that risk actually is. Syntax highlighting the "language hint" differently would significantly mitigate that risk, and is trivial to do even if an IDE has no support for syntax highlighting the nested code itself.

My opinion is that you are overstating the risk here: in the example you provided the first line clearly stands out from the rest given the differing indentation, even without syntax highlighting. It's not clear why making a mistake here would be more significant, or harder to catch, than making a mistake anywhere else in the code.

If there was an alternative way to specify the language hint which wasn't worse and avoided the risk entirely, then I would be open to that, but I think the far bigger danger here is ending up with a syntax that is too heavy to use effectively. The current syntax:

let _ = hr#"foo
    <content>
    "#;

Is about at the limit of what I think is reasonable for such a QoL improvement to cost, so using eg. inline attribute syntax such as:

let _ = #[lang(sql)] hr#"
    <content>
    "#;

Would be too intrusive, especially the excessive use of #s.

The reason to support the language feature at all is to enable better tooling support. From basic syntax highlighting to more advanced features. It opens up many opportunities that didn't exist before, and is useful information for the programmer to be able to express in the code and for others reading it.

@workingjubilee
Copy link
Contributor

My opinion is that you are overstating the risk here:

And my opinion is that you are understating it.

in the example you provided the first line clearly stands out from the rest given the differing indentation, even without syntax highlighting. It's not clear why making a mistake here would be more significant, or harder to catch, than making a mistake anywhere else in the code.

My concern includes on-the-fly generated Rust code which is, in a strict, computational sense, impossible for me to eye-check and check-in for every example which I might want to generate, but which I may wish to have nicely formatted when I emit it, nonetheless, for various reasons. For example, it may later be inspected for debugging purposes. I would rather the compiler immediately err in those cases of emitting a malformed string, and that I can begin handling the compiler error that has been propagated into my tools via process::Command and logging things so that data can get back to me and I can later write a proper regression test so that I do not generate that code again. That is much preferable. I do lean heavily on the fact that the compiler errors in a lot of cases where I might fuck up in code generation. I would love for this to be a feature I can also lean heavily on because it adopted a highly regular syntax and was conservative about changes to the current lex and parse rule:

    STRING_LITERAL :
       " (
          ~[" \ IsolatedCR]
          | QUOTE_ESCAPE
          | ASCII_ESCAPE
          | UNICODE_ESCAPE
          | STRING_CONTINUE
       )* " SUFFIX?

    STRING_CONTINUE :
       \ followed by \n

and kept them to affecting whitespace which is comparatively easy to reason about, in a quasi-inverse of the rule regarding STRING_CONTINUE, rather than being constantly afraid that it might fucking bite me in the ass because it is sufficiently lossy as to gobble up actual text I might be counting on being present.

@workingjubilee
Copy link
Contributor

If there was an alternative way to specify the language hint which wasn't worse and avoided the risk entirely, then I would be open to that, but I think the far bigger danger here is ending up with a syntax that is too heavy to use effectively.

Truly, genuinely, I am content with a change as small as this:

let _ = h_foo_r#"
    <content>
    "#;

Or if you prefer:

let _ = hr#foo"
    <content>
    "#;

I believe feature(stmt_expr_attributes) is still worth reasoning about as it may be the case that it is desired that we have a more general solution for wanting to have sugar in for annotating a string with clearly external data, like what language it is about but potentially also other things, but I am not going to pretend that going all the way to our full attribute syntax is not a chore.

@Diggsey
Copy link
Contributor Author

Diggsey commented Jul 10, 2023

My concern includes on-the-fly generated Rust code

But why generate "code string literals" at all in that case? If the code is not intended to be edited by humans, then could you not generate a raw string literal?

Let's say for the sake of argument that you both want to generate nicely indented output, and you don't want the extraneous whitespace that would come with a raw string literal, and the generated code is not intended to be checked into source control / generally viewed by a human being. In that case, even with a code string literal you'd need to make sure that every line was properly indented right?

In that case, a simple validation rule that would catch mistakes in your code relating to the first line would be to disallow whitespace between the opening quote and the start of the language hint if present.

let _ = h_foo_r#"
    <content>
    "#;

Or if you prefer:

let _ = hr#foo"
    <content>
    "#;

I will add the former as an alternative in the RFC. The latter doesn't really work as the # is neither required in raw string literals, nor allowed at all in normal string literals, eg.

let _ = h"
    <content>
    ";

This is not a strongly held opinion, but I think it's suboptimal to use _ as a separator in this way. _ is typically used as a character that's not a separator, since it's treated as word-like in many respects. I think it would be better if the language hint was its own token if outside the string.

@workingjubilee
Copy link
Contributor

workingjubilee commented Jul 11, 2023

Let's say for the sake of argument that you both want to generate nicely indented output, and you don't want the extraneous whitespace that would come with a raw string literal, and the generated code is not intended to be checked into source control / generally viewed by a human being.

That is not quite my concern. My concern is specifically that I do want it to be potentially viewable by a human being, and that it is somewhere, logged for later review if necessary, but it's not like I am reviewing every single instance on git or whatever. This later review may happen whether the compilation succeeds or fails. Indeed, my concern is I would like to be able to make my codegen nice and legible for the benefit of places that I may never see it, without a concern that the result may be miscompiled. And some of the strings may, indeed, be SQL which my generated Rust code will later tell a database to execute, and I want to make examining the source easy and keep it easy to reason about why things are wrong even for people who may not write Rust programs very often, as they can still examine and easily read nicely formatted SQL that is also nicely formatted in the context of the Rust program.

And judging by the occasional error reports I get from these faraway databases, I am pretty sure they're not that familiar with Markdown and its quirks, either.

Part of what makes what I have made possible is that rustc is so very enthusiastic already about valid parses, so that I can simply defer a lot of work into the compiler instead of precompiling the code myself, because then it becomes a simple transaction with the compiler.
"simple"
"""simple"""
The production software I have helped midwife is quite terrifyingly complex and thus I am extremely interested in anything that I can leverage to improve its authorship and debugging experiences. Woe to me. It is not yours to take responsibility for my questionable life choices, but I would like to not model this as something I would have to be wary of in the codebases I work on, and instead simply look forward to its implementation so I can make use of it as soon as I possibly can.

@Diggsey
Copy link
Contributor Author

Diggsey commented Jul 12, 2023

@tmccombs

I also think that the RFC should better specify what it means to measure the whitespace. IMO, the cleanest way would be to require that the indentation must exactly match on each line.

This is already specified in the RFC:

Remove exactly the measured whitespace from each non-empty line. If this cannot be done, then issue a compiler error. The whitespace must match down to the exact character sequence.

It also feels a little weird to me that the empty string takes multiple lines with this:

let empty = h"
    -";

That would be an error, since there is no final newline to suppress in that example. The empty string would be simply:

let empty = h"
    ";

With zero lines between the opening and closing quote, there is no newline to suppress.

Contrast this to:

let single_line = h"
    content
    ";

In this case there is a single line, and so there does exist a final newline that can be suppressed.
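The rules discussed in this comment can be sketched as an ordinary function over the text between the quotes. This is an illustration under stated assumptions, not the RFC's implementation: the name `dedent`, the error messages, treating only spaces and tabs as indentation, and exempting only completely empty lines from the prefix check are all choices made for this sketch.

```rust
/// Sketch only. `raw` stands for the text between the quotes of an `h"..."`
/// literal, after the bounds of the literal have been determined.
fn dedent(raw: &str) -> Result<String, String> {
    // Everything up to the first newline is the language hint and is dropped.
    let (_hint, rest) = raw
        .split_once('\n')
        .ok_or_else(|| "expected a newline after the opening quote".to_string())?;

    // The last line holds the indentation before the closing quote,
    // optionally followed by `-` to suppress the final newline.
    let (body, last) = match rest.rsplit_once('\n') {
        Some((body, last)) => (Some(body), last),
        None => (None, rest),
    };
    let (indent, suppress) = match last.strip_suffix('-') {
        Some(ws) => (ws, true),
        None => (last, false),
    };
    if !indent.chars().all(|c| c == ' ' || c == '\t') {
        return Err("closing line must contain only indentation".to_string());
    }

    let mut out = String::new();
    if let Some(body) = body {
        for line in body.split('\n') {
            if !line.is_empty() {
                // The prefix must match the measured whitespace exactly,
                // character for character (tabs and spaces are not mixed up).
                let stripped = line
                    .strip_prefix(indent)
                    .ok_or_else(|| format!("inconsistent indentation: {line:?}"))?;
                out.push_str(stripped);
            }
            out.push('\n');
        }
    }
    // With zero content lines there is no final newline to suppress.
    if suppress && out.pop().is_none() {
        return Err("no final newline to suppress".to_string());
    }
    Ok(out)
}

fn examples_from_this_comment() {
    assert_eq!(dedent("\n    ").unwrap(), "");
    assert_eq!(dedent("\n    content\n    ").unwrap(), "content\n");
    assert!(dedent("\n    -").is_err());
}
```

Under these assumptions the three examples above come out as described: the two-line form is the empty string, the single-line form keeps its final newline, and `-` with no content lines is rejected.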

@Animeshz
Copy link

Just a suggestion, but one could also look at Nix's syntax for inspiration:

{
    environment.etc."auto-cpufreq.conf".text = ''
      [charger]
      governor = powersave
      turbo = never

      [battery]
      governor = powersave
      turbo = never
    '';
}

Normal strings are as is, with double quotes ", whereas multiline strings open with two single quotes '' and close with the same. This doesn't conflict with the character literal, because any character literal must have a 1-length character before the closing quote.

The indentation is stripped at evaluation time, and if the ending quote ''; is at the same level as the starting quote, the newline is removed automatically.

@ksaadDE
Copy link

ksaadDE commented Sep 14, 2023

Why not combine the idea of the backtick syntax with brackets?

Example:

let mymultilinestr = {<language>
     <yourtext>
}

Everything that replaces <yourtext> is treated as a string until the closing bracket.

@programmerjake
Copy link
Member

let mymultilinestr = {<language>
     <yourtext>
}

well, that's currently valid code, so changing it to be a string would conflict:

pub fn foo() {bar
    () // weird formatting for calling bar()
}

fn bar() {}

@Animeshz
Copy link

Since {} is used for scoping already, I don't think it could also be used to store strings.

@ksaadDE
Copy link

ksaadDE commented Sep 17, 2023

@Animeshz
@programmerjake

I almost forgot that unclean syntax is a thing in Rust.

I could provide another alternative:

let mymultilinestr  = S{<lang>
   <string>
};

The S in front of the bracket indicates it is a multi-line string. After that a language tag can be added, and the multi-line string starts on a new line and runs until the last bracket.

Because of the S prefix it does not conflict with scoping or any other use, to my knowledge. Simple but effective trick.

@digama0
Copy link
Contributor

digama0 commented Sep 17, 2023

S{} is already legal syntax for creating a structure named S:

struct S { lang: u32, b: u32 }
let lang = 1;
let b = 2;

let mymultilinestr  = S{lang
   , b
};

@ksaadDE
Copy link

ksaadDE commented Sep 17, 2023

S{} is already legal syntax for creating a structure named S:

Right, this would conflict. What about prefixing it with a -?

let  mymultilinestr = -S{<lang>,
<mlstring>
};

The other alternatives I would suggest are backticks or a backslash, to indicate that a struct is not meant.

I'm just playing around with ideas, how to make it usable.

@digama0
Copy link
Contributor

digama0 commented Sep 17, 2023

S could have a negation operator (impl std::ops::Neg for S)
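A minimal sketch of why the `-S{` prefix is already taken. The struct `S`, its single field, and the function name are placeholders for the example.

```rust
use std::ops::Neg;

#[derive(Debug, PartialEq)]
struct S {
    lang: i32,
}

// Any user type may implement unary minus.
impl Neg for S {
    type Output = S;

    fn neg(self) -> S {
        S { lang: -self.lang }
    }
}

fn minus_prefix_is_taken() {
    let lang = 1;
    // This already parses today as "negate a struct literal", so it cannot
    // be reinterpreted as the start of a multi-line string.
    let value = -S{lang
    };
    assert_eq!(value, S { lang: -1 });
}
```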

@ksaadDE
Copy link

ksaadDE commented Sep 19, 2023

S could have a negation operator (impl std::ops::Neg for S)

@digama0

Good point. What about the tilde ~? I looked it up in the docs, and it does not seem to be used (yet).

let  mymultilinestr = ~{<lang>,
<mlstring>
};

@kanashimia
Copy link

A comment can also be used for specifying a language:

let cool_codes = /*rust*/r#"
    fn main(){unsafe{*(0 as*mut _)=0}}
"#;

This is similar to how the Helix editor highlights strings for the Nix language.
It doesn't require any language changes, just a change in the guidelines and tooling.
But it is not as clear to parse as a dedicated language construct.
Dedenting is still a problem.

There seem to be two separate features proposed here:

  1. dedented string literals
  2. string literal language hint

Also for prior art: https://github.com/tc39/proposal-string-dedent

Labels
T-lang Relevant to the language team, which will review and decide on the RFC.