Propose code string literals #3450

Diggsey · 2023-06-18T17:01:25Z

Add a new syntax for multi-line string literals designed to contain code and play nicely with rustfmt.

programmerjake · 2023-06-18T17:32:59Z

it would be nice if there was a way to have the first line have indentation, e.g.:

fn f() {
    // something like this -- not sure the best way to indicate indentation that should be included
    let s = ```
        abc
    ```;
    assert_eq!(s, "    abc\n");
}

petar-dambovaliev · 2023-06-18T17:41:21Z

How big of a problem is this? I am inclined to be against adding more stuff to the language for a problem we don't know we have.

programmerjake · 2023-06-18T17:55:57Z

How big of a problem is this? I am inclined to be against adding more stuff to the language for a problem we don't know we have.

I've run into this issue of wanting auto-formatted multi-line strings with embedded indentation several times myself. I would likely use ``` code strings a lot, maybe more often than I'd use the dyn keyword.

digama0 · 2023-06-18T18:08:06Z

This design makes it impossible to indent the first line.

Diggsey · 2023-06-18T18:24:41Z

This design makes it impossible to indent the first line.

Yes, it's designed to handle code, and in languages where whitespace is significant (e.g. Python) indentation is also relative to previous lines.

In cases where you need to indent the first line, then a normal or raw string literal may be more appropriate.

calebcartwright · 2023-06-18T18:44:13Z

cc @rust-lang/rustfmt and @rust-lang/style for awareness

digama0 · 2023-06-18T18:45:45Z

This design makes it impossible to indent the first line.

Yes, it's designed to handle code, and in languages where whitespace is significant (e.g. Python) indentation is also relative to previous lines.

In cases where you need to indent the first line, then a normal or raw string literal may be more appropriate.

This seems rather over-indexed on python code. I'm all for preserving whitespace in this way, but why not do it like it is done in doc strings, and just trim as many characters as possible from the left margin (uniformly to all lines)?
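A minimal sketch of that "trim uniformly from the left margin" rule, assuming a hypothetical helper named `dedent_min` that counts only spaces and skips blank lines when computing the margin (neither detail comes from the RFC):

```rust
/// Strip the longest run of leading spaces shared by every non-blank line.
/// Hypothetical helper: it only illustrates the doc-comment-style rule
/// suggested above and is not taken from the RFC text.
fn dedent_min(body: &str) -> String {
    // Smallest indentation among lines that contain more than whitespace.
    let margin = body
        .lines()
        .filter(|line| !line.trim().is_empty())
        .map(|line| line.len() - line.trim_start_matches(' ').len())
        .min()
        .unwrap_or(0);

    body.lines()
        .map(|line| {
            if line.trim().is_empty() {
                "" // blank lines carry no indentation
            } else {
                &line[margin..] // safe: every non-blank line has >= margin spaces
            }
        })
        .collect::<Vec<_>>()
        .join("\n") // note: a trailing newline in `body` is not preserved
}
```

Under this rule `dedent_min("        return retval;\n    }")` gives `"    return retval;\n}"`, while a lone indented line such as `"    abc"` comes out as `"abc"`.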

programmerjake · 2023-06-18T18:49:20Z

i thought of a way the first line could be indented, just have another line be indented less:

{
    let return_and_close = ```
        return retval;
    }
    ```;
    assert_eq!(return_and_close, "    return retval;\n}\n");
}

digama0 · 2023-06-18T18:50:09Z

By my reading of the proposal the string literal would be an error because the second line is less indented than the first.

programmerjake · 2023-06-18T18:52:01Z

By my reading of the proposal the string literal would be an error because the second line is less indented than the first.

well, the proposal can fix that :)

Diggsey · 2023-06-18T18:55:07Z

This seems rather over-indexed on python code.

Python is simply the most prominent example from this list: https://en.wikipedia.org/wiki/Off-side_rule

If you look through the examples on that page, you'll see they all use indentation in the same way, so I didn't see much need to support indentation on the first line.

I went with the more conservative approach for this RFC because a) I was unable to come up with a use-case for indentation on the first line, b) relaxing the rules later is a backwards compatible change, and c) even with the relaxed rules, there are still strings you cannot represent (eg. a single indented line, or all lines indented) and the set of things that can't be represented is much harder to quantify under the relaxed rules.

ie. it seems much easier to say that if you need precise control over indentation, use raw strings.

Diggsey · 2023-06-19T17:47:06Z

@petar-dambovaliev

How big of a problem is this? I am inclined to be against adding more stuff to the language for a problem we don't know we have.

sqlx has over 6 million downloads and is rapidly approaching diesel as the most popular crate for interacting with SQL databases. I've personally run into this problem with this specific crate at two different companies using Rust in production, and in numerous personal projects.

The asm! macro is part of the standard library, and could benefit greatly from this.

The json! macro from serde_json (a crate with ~150 million downloads) is horribly slow to compile, and could be replaced with a procedural macro that processes a code literal.

There are many different crates implementing some form of html! macro - these could all compile faster and have better UX if they used code literals.

DSLs are extremely powerful, and this kind of string literal is well suited to embedding DSLs within Rust programs.

I think Rust would benefit immensely from having some kind of "relative-indentation" string literal, regardless of whether it takes the exact form proposed here.

VitWW · 2023-06-19T18:23:12Z

For example, PHP quite recently (in v7.3) added indentation handling to its heredoc strings.

So it is a useful feature.

Lokathor · 2023-06-19T19:29:03Z

I think the actual mechanism of three backticks instead of a double quote is perhaps strange. Can't we put a hash prefix on the string literal to mark it as being code-ish, so that rustfmt knows to format such string literals?
code#"words here"#
or something like that?

petar-dambovaliev · 2023-06-19T21:50:21Z

For example, PHP quite recently (in v7.3) added indentation handling to its heredoc strings.

So it is a useful feature.

With all due respect, I don't think we should take notes on what the PHP people are doing.

petar-dambovaliev · 2023-06-19T21:59:55Z

@petar-dambovaliev

How big of a problem is this? I am inclined to be against adding more stuff to the language for a problem we don't know we have.

sqlx has over 6 million downloads and is rapidly approaching diesel as the most popular crate for interacting with SQL databases. I've personally run into this problem with this specific crate at two different companies using Rust in production, and in numerous personal projects.

The asm! macro is part of the standard library, and could benefit greatly from this.

The json! macro from serde_json (a crate with ~150 million downloads) is horribly slow to compile, and could be replaced with a procedural macro that processes a code literal.

There are many different crates implementing some form of html! macro - these could all compile faster and have better UX if they used code literals.

DSLs are extremely powerful, and this kind of string literal is well suited to embedding DSLs within Rust programs.

I think Rust would benefit immensely from having some kind of "relative-indentation" string literal, regardless of whether it takes the exact form proposed here.

Many people, including myself, are writing Rust in production and are using DSLs. You are fixing a problem that I (and anyone I have worked with) didn't even know I had. That should tell you something. Is there anything more to this than formatting?

digama0 · 2023-06-19T23:00:55Z

I'm sure I'm not the only one who has written things like this:

  writeln!(w, "  \
        <!-- <link rel=\"shortcut icon\" href=\"{rel}favicon.ico\"> -->\
    \n</head>\
    \n<body>\
    \n  <div class=\"body\">\
    \n    <h1 class=\"title\">\
    \n      {h1}\
    \n      <span class=\"nav\">{nav}</span>\
    \n    </h1>")

Multiline string literals where you want to be able to indent the body (because having them flush left looks atrocious) but also preserve leading indentation is a mess right now, so I definitely appreciate the motivation behind this RFC. I'm not sure it captures all the things I might want to do though, since for example it might not be the case that the first line is unindented (like in this example, where the first line has a two space indent) or that the minimum indentation is 0 (which is true in this case but might not be if interpolating an inner element instead of the <body> tag).

ie. it seems much easier to say that if you need precise control over indentation, use raw strings.

This seems like quite an unsatisfactory resolution, since the whole point of the syntax is to allow for precise control over indentation which is not otherwise preserved by multiline strings using \ line terminators.

As the example above should show, if you are outputting indented syntax for a language like python (or in this case, formatted HTML), just because the top level is zero indent doesn't mean that fragments of the string are also zero indent, or that the first line will be zero indent since it might be a fragment of the full output. I have seen the same pattern with formatted Rust code generation, the fragments will generally be inside some kind of scope and hence be nonzero indent, and the repeating block may or may not align with language constructs so the first line might not be the least indented.
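For readers unfamiliar with the workaround in the `writeln!` example, here is a small sketch of how `\` line continuations behave in today's string literals. The function names are made up for illustration:

```rust
/// A `\` at the end of a line removes the newline and all leading
/// whitespace on the following line, so source indentation is lost.
fn with_continuation() -> &'static str {
    "<body>\
        <div>\
    </body>"
}

/// Writing `\n` and the wanted spaces explicitly after each continuation
/// is what keeps the layout, at the cost of readability.
fn with_explicit_newlines() -> &'static str {
    "<body>\
    \n  <div>\
    \n</body>"
}
```

The first function returns `"<body><div></body>"`; the second returns `"<body>\n  <div>\n</body>"`.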

clarfonthey · 2023-06-20T01:18:56Z

With all due respect, I don't think we should take notes on what the PHP people are doing.

This level of disrespect for other communities is unproductive and doesn't help with the discussion. Whether your feelings are justified or not, it's better to explain why this particular feature doesn't fit for Rust, instead of just showing a general aversion to other language(s).

digama0 · 2023-06-20T02:42:23Z

Incidentally, the PHP heredoc syntax solves the indentation issue by using the indentation of the closing quote character, rather than the first content line or the minimum indentation. Thus:

let x = ```
      4 space indented
    2 space indented
  ```;

let x = ```
      2 space indented
     1 space indented
    ```;

let x = ```
    2 space indented
 error, less indented than end delimiter
  ```;

Also note that the newline before the end delimiter does not count as part of the string, you would have to add a \n at the end if you want a trailing newline.
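A rough sketch of that closing-delimiter rule, written as a plain function over the text between the delimiters. The name `dedent_by_closing`, the spaces-only restriction, the treatment of whitespace-only lines as empty, and the choice to ignore anything on the opening line are all assumptions made for illustration:

```rust
/// `raw` is everything between the opening and closing delimiters,
/// e.g. "\n      a\n    b\n  " for a literal whose end delimiter is
/// indented by two spaces.
fn dedent_by_closing(raw: &str) -> Result<String, String> {
    // Whatever follows the last newline is the closing delimiter's indentation.
    // The newline itself is dropped, so it is not part of the value.
    let (content, closing_indent) = raw
        .rsplit_once('\n')
        .ok_or_else(|| "closing delimiter must be on its own line".to_string())?;
    if !closing_indent.chars().all(|c| c == ' ') {
        return Err("only spaces may precede the closing delimiter".to_string());
    }

    // Skip the remainder of the opening line (assumption: it is ignored).
    let body = match content.split_once('\n') {
        Some((_opening_line, rest)) => rest,
        None => return Ok(String::new()), // no content lines at all
    };

    let mut out = Vec::new();
    for line in body.split('\n') {
        if line.trim().is_empty() {
            out.push(""); // whitespace-only lines become empty lines
        } else if let Some(stripped) = line.strip_prefix(closing_indent) {
            out.push(stripped);
        } else {
            return Err(format!(
                "line {:?} is less indented than the end delimiter",
                line
            ));
        }
    }
    Ok(out.join("\n"))
}
```

Fed the three literals above, this returns `"    4 space indented\n  2 space indented"` for the first, `"  2 space indented\n 1 space indented"` for the second, and an error for the third.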

programmerjake · 2023-06-20T03:07:04Z

Incidentally, the PHP heredoc syntax solves the indentation issue by using the indentation of the closing quote character, rather than the first content line or the minimum indentation.

TLDR: I think the PHP heredoc syntax is the best so far.

The PHP heredoc syntax is basically what I was going to suggest (I didn't know PHP used it), though I didn't since it only works when the closing ``` is at the start of its line (ignoring indentation), which wouldn't work for the RFC's proposed syntax for no trailing \n which is:

let s = ```
line with no ending line terminator```;

Also note that the newline before the end delimiter does not count as part of the string, you would have to add a \n at the end if you want a trailing newline.

This neatly solves the issue by making the syntax for no trailing \n be:

let s = ```
line with no ending line terminator
```;

and the corresponding syntax with a trailing \n is:

let s = ```
line with an ending line terminator

```;

scottmcm · 2023-06-20T19:52:17Z

text/0000-code-literals.md

+functionality as required.
+
+If it is necessary to include triple backticks within a code string
+literal, more than three backticks may be used to enclose the


Hmm, I'm torn here. Using the same thing as in doc comments makes sense, but at the same time when we already have "use more #s to escape more" I don't feel amazing about also having a "use more `s to escape more" construct.
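For reference, this is how the existing "use more #s" rule works with today's raw strings. The function names are illustrative only, and the backtick variant is not shown because it is not valid Rust:

```rust
/// One `#` is enough to embed plain double quotes.
fn one_hash() -> &'static str {
    r#"say "hi""#
}

/// Two `#`s are needed once the content itself contains `"#`.
fn two_hashes() -> &'static str {
    r##"let s = r#"say "hi""#;"##
}
```

The second literal evaluates to the source text of a statement containing the first literal.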

Diggsey · 2023-06-20T21:20:42Z

Thanks for the feedback. I've updated the RFC to propose a variant of the "heredoc"-style indentation rules and updated the "prior art" section.

I've also attempted to enumerate every possible syntax variation that has been suggested in the alternatives section.

I've kept the triple backtick quote style for now, but I am torn between that and some of the other quote styles. However, I think the new choice for the indentation rules is the best option so far, especially when combined with the modification to optionally suppress the final newline.

ksaadDE · 2023-06-29T11:39:33Z

@Diggsey That is very off-topic. I keep it short. Most ORMs I know of are optimized for general-purpose (e.g. CRUD).

DSLs are nothing new. Almost two decades ago they were already in use at several companies (that existed back then) and government bodies. Only the amount of work needed to create them has gone down (looks at nim).
DSLs and ORMs are not opposing concepts; they work together, see: Engineer/Software <-> DBO <-> ORM <-> DSL (<-> SQL <-> RDBMS). Sometimes you use DSLs directly to obtain data, e.g. for data science.
Professionals use ORMs to work on the objects (maintainability) and to avoid running into issues that were already solved many years ago, and thus reinventing the wheel (efficiency). The boss/customer does not wait for or reward that kind of work. So no, in most (general-purpose) use cases using the ORM is better (and more efficient).

I'm fully aware that there is NoSQL out there. And other (semi-)old technology, new/old approaches etc. Intentionally left out. DTOs are excepted as well. Also, (qualitative) complexity depends on the problem you want to solve (e.g. OS complexity).

ChrisDenton · 2023-06-29T11:47:40Z

Yes, I think this conversation is drifting very off-topic. I believe the central point is that people can and do use DSLs in Rust. This RFC is proposing one way to improve support for people who do.

ksaadDE · 2023-06-29T12:08:13Z

@ChrisDenton

central point is that people can and do use DSLs in Rust. [...] proposing one way to improve support for people who do

Ack & agreed, no dispute about that.

Thinking further we should talk (at some point) about "how" people should (be allowed to) use that cool feature. The mentioned "fear" of bloated files with these "code blocks" in them (very lengthy too) could have a serious impact on maintainability (code-quality) and production use of software written in Rust.

tmccombs · 2023-06-29T16:49:37Z

Regarding length enforcement/linting, are there any existing lints around string length? I can't think of any reason why there should be lints for length of indented strings but not regular non-indented strings.

VitWW · 2023-06-29T20:42:19Z

What about nested strings?
I think multiple "#" could be fine (for the h#"..."# syntax)

let s1 = h#"
    let s2 = h##"
        let s3 = h###"
            string
            "###;
        "##;
    "#;

Nemo157 · 2023-07-01T10:57:09Z

I think a clippy lint like "this literal is very long, consider moving it into a separate file and using include_str!(...)" would be a decent lint to have.

This is not (yet) possible if the literal is being passed to a proc-macro. Maybe once proc-macro-expand is stabilized such a lint would be useful (though, it'd need proc-macros to be updated to use expansion) but for now if the literal is going into a proc-macro (likely the common case) it should be suppressed.

the8472 · 2023-07-08T09:12:27Z

text/0000-code-literals.md

+1.  It adds a four new types of string literals given all
+    the combinations.
+
+# Rationale and alternatives


Another alternative I have mentioned on zulip:

Improve include! handling (when passed as literals to macros? in editors?) instead to make it more ergonomic to outline other-language code rather than inlining.

Pros:

works better with simple tools that don't handle nested languages well

establishes a new indent context, i.e. doesn't need to be adjusted with surrounding code which in my experience can be error-prone if the editor's indentation handling is imperfect. Examples of confusions:

inside comments

inside macros

inside doc comments

when current indentation is inconsistent with configured rules

when copy-pasting into a differently indented context

generally avoids stacking complexity

some_proc_macro!{
    mod m {
        /// This is an example with nesting and several levels of indentation and whitespaces
        ///
        /// ```rust
        /// let p = h"python
        ///     def py():
        ///         a = '''Lorem ipsum dolor sit amet,
        ///                consectetur adipiscing elit,
        ///                sed do eiusmod tempor incididunt
        ///                ut labore et dolore magna aliqua.'''
        ///         print(a)
        ///     ";
        /// ```
        ///
        fn nesting_fun() {}
    }
}

Cons:

requires editor support if you want to view or even edit the included file in the context of its parent instead of opening a new view. But showing an overlay might be less complex than all the text nesting

context of substitutions may be harder to see

Since this is motivated by making things easier for rustfmt I recommend contacting the maintainers of other tools (syntax highlighters, editors, IDEs, ...) to see if this change helps or adds complexity for them.

I don't consider this an alternative. Requiring powerful editor support to even use the feature makes it a no-go, and having to store things in separate files is a maintenance burden that's worse than the current situation, since it requires coming up with a naming scheme for those files that makes sense, makes it harder to resolve merge conflicts since tools like git will never understand this "magic include", and is way more complicated than what is proposed in this RFC.

The advantages you list I also consider to be problems with your approach. You say it works better with simple tools, but the opposite is true: you end up with something unworkable without powerful editor features. In contrast this RFC doesn't require any editor features at all to be an improvement over the status quo. Any support for nested language is an optional extra that doesn't affect the core functionality.

Your example of "stacking complexity" seems very straightforward tbh. Infinitely better than having to go to a separate file.

Since this is motivated by making things easier for rustfmt I recommend contacting the maintainers of other tools (syntax highlighters, editors, IDEs, ...) to see if this change helps or adds complexity for them.

It by definition does not add any complexity for tools other than rustfmt, since the only required change as a result of this RFC is allowing a new prefix letter (h proposed here) and tools must already support that. Beside that, anything that is valid to do with a raw string literal is also valid to do with an h raw string literal.

It by definition does not add any complexity for tools other than rustfmt, since the only required change as a result of this RFC is allowing a new prefix letter (h proposed here) and tools must already support that. Beside that, anything that is valid to do with a raw string literal is also valid to do with an h raw string literal.

Anything that adds syntax complicates syn and any other tools that use it or otherwise parse rust code. I can't imagine that it would ever be safe to just assume that any string prefix acts like a regular string literal, since raw strings already violate that, hence individual new letters have to be added to anything that parses rust code, including syntax highlighters (although the backup behavior is usually good enough for these).

You end up with something unworkable without powerful editor features.

How? Even many simple editors at least have tabs, panes or similar UI elements to view more than one file at a time.

At its most primitive you rely on your window manager and file browser to open multiple files at the same time in separate windows and show them side by side.

Any support for nested language is an optional extra that doesn't affect the core functionality.

A simple editor can have primitive syntax-highlighting that will work with separate files based on file extensions but won't work with inlined content. So this RFC makes things worse for simple editors

makes it harder to resolve merge conflicts since tools like git will never understand this "magic include"

I don't see how it would make things more difficult for git? If anything it makes diffs simple due to fewer whitespace adjustments.

Requiring powerful editor support to even use the feature makes it a no-go,

Where did I say that a powerful editor would be required? Rather I'm suggesting
a) improve powerful editors
b) keep things simple for simple editors

This covers both.

Your example of "stacking complexity" seems very straightforward tbh. Infinitely better than having to go to a separate file.

What is straightforward about it? If you actually have to edit, indent, copy-paste, syntax-highlight or auto-complete that, there are lots of pitfalls.
Note the outer macro, which tends to make things more difficult for tools because at that point they might not even know anymore whether they're dealing with Rust or just things that happen to tokenize like Rust.
And it's rust -> macro -> markdown -> codeblock (with language annotation) -> multiline string (with another language annotation).
These languages could be configured to have different indent rules!

I don't consider improving support for include!-like things to be a negative (the proc-macro-include RFC and the proc-macro-expand feature would both be great to have), but it's a feature for different use cases than this. This RFC improves support for things that people are already doing. Even if we had better forms of include! I would not pull out 3 lines of SQL to a separate file just to get syntax highlighting, I would simply do what we do currently: use the existing literal strings and fight with rustfmt every time the surrounding code changes.

Separation of languages is the norm and should be encouraged. See the HTML/CSS/JS split that is encouraged instead of having inline script handlers and styles. See template files. See module trees.

You say my approach is a no-go because it makes things more difficult for simple editors. And yet you acknowledge that this RFC will primarily benefit complex editors. While I think my approach would benefit simple editors because they can then work with the outlined language.

At the moment, these strings are in the file and so can be reviewed and have conflicts resolved in-place. By moving them to a separate file you can no longer perform these actions with any context about the surrounding code.

I assume they'd conventionally still be placed in the same directory and show up in the diffs next to each other.

To make that at all workable you'd need a powerful editor to allow treating them as though they weren't in a separate file

Not necessarily. E.g. when you have an SQL query query!(include!("query.psql"), param1="val", param2="val", ...) then it has an API, like a function call. You edit functions separately and then fix their callsites.
So "jump to definition" + error messages from the query! macro about missing arguments would already cover that.

And yet you acknowledge that this RFC will primarily benefit complex editors.

My expectation is that this RFC will not affect complex editors (in cases where they are not acting as simple editors).

A complex editor that is using heuristics to determine when to apply other-language syntax highlighting to a literal could similarly use those heuristics to determine when to apply other-language auto-formatting to a literal.

This RFC simply provides support for auto-indentation (but not formatting) of literals for simple editors (and complex editors where their heuristics don't apply) that use rustfmt.

EDIT: actually, I forgot that this RFC also included language hints, which would allow a very strong hint to the complex editor heuristics of what other-language to treat a literal as, but it also likely allows editors in between simple and complex to use very simple heuristics and start multi-language highlighting where they couldn't previously.

EDIT2: To clarify some of my categorical assumptions to make sure there's no misunderstanding:

simple editor: notepad -> notepad++ -> unconfigured vim

no code understanding or only simple regex based highlighting

complex editor: neovim/vscode + LSP, jetbrains

semantic code understanding, so it actually knows which macro literals are being passed to

in between: minimally configured vim/neovim without an LSP

still just syntactic code understanding, but better than the simple regexes, so it doesn't know which macro is which to use for multi-language heuristics, but it can parse and use language hints

Separation of languages is the norm and should be encouraged.

That is not my experience. I've almost never seen sql queries pulled out into separate files. Most assembly I've seen is inline. Shader languages are a bit of a mix, and I don't have as much familiarity with it, but I don't think it is at all unusual to include shader code inline, especially if it is small. And this feature would be very useful for help text for cli programs. I can't imagine using a separate file for the help comment for every option in my cli that uses clap.

See the HTML/CSS/JS split that is encouraged instead of having inline script handlers and styles

But we also have frameworks like react, where html and css are embedded in Javascript. Or svelte where the JS is included in an html template.

Shader languages are the one case I can think of where people actually care about "separation of languages", and then it has to do more with the fact that GPU code inherently has a modularity to it, because it is run in passes, and people tend to pull out modules into, well, modules. So you may as well have, e.g.

code.cpp

code.hpp

code.vert

code.frag

But ofc you may well just encounter something like

code.cs

code.hlsl

Depending.

notepad++

Has syntax highlighting.

But we also have frameworks like react, where html and css are embedded in Javascript. Or svelte where the JS is included in an html template.

Yes, and I have encountered issues with those kinds of multi-language, framework-specific file formats that make me prefer separate files. Simple editors just didn't support them at all or mistook them for only one of the languages, and complex editors had configuration issues because they picked up the wrong preprocessor version or something, which led to lots of bogus squiggles in those files while vanilla JS files had no issues.

Most assembly I've seen is inline.

https://github.com/xiph/rav1e/tree/master/src/arm
https://github.com/memorysafety/rav1d/tree/main/src/x86
https://github.com/rust-lang/stacker/tree/master/psm/src/arch

Though none of that needs to be include!ed / act as a template in the first place, it's static code with a fixed interface and compiled separately. I can't think of a project that needs templated ASM.

mattheww · 2023-07-09T09:02:06Z

The reference-level explanation should say what happens in Rust 2018 and earlier (where supporting these literals would be an incompatible change; see reserved-prefixes).

workingjubilee

I do not believe this proposal sufficiently engages with why programming languages other than Python and Markdown make the choices they do. In particular, the Swift programming language chooses to instead reject anything on the first line (before the multiline literal "proper"), and I think for good reasons. It is very easy to go from emitting a string literal something like this:

let text = "text\ntext\ntext\ntext";

to emitting this, in pursuit of nicer formatting for the generated code:

let text = "text
    text
    text
    text
";

This causes accidentally losing the first line. Even with a clarification of this RFC to add restrictions to what is allowed to go there so fewer inputs can be silently dropped, I don't think it is very "in character" for Rust to allow code that may be incorrect to pass compiling when it would be very easy to use a slightly different rule and catch a common mistake.
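To make the difference concrete, here is a sketch of the two first-line policies as plain functions over the text that follows the opening quote. Both function names are hypothetical, and this is not how the compiler would implement either rule:

```rust
/// Lenient rule: whatever follows the opening quote on the same line is
/// discarded and never reaches the string value.
fn lenient_body(raw: &str) -> &str {
    match raw.split_once('\n') {
        Some((_first_line, body)) => body,
        None => "",
    }
}

/// Stricter, Swift-like rule: the opening line must be empty, so text
/// placed there by mistake is reported instead of silently dropped.
fn strict_body(raw: &str) -> Result<&str, String> {
    match raw.split_once('\n') {
        Some((first, body)) if first.trim().is_empty() => Ok(body),
        Some((first, _)) => Err(format!(
            "unexpected text after the opening quote: {:?}",
            first
        )),
        None => Err("expected a line break after the opening quote".to_string()),
    }
}
```

With the input `"text\n    text\n"`, `lenient_body` quietly returns `"    text\n"`, whereas `strict_body` returns an error naming the stray `"text"`.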

workingjubilee · 2023-07-09T22:28:06Z

text/0000-code-literals.md

+Anything directly after the opening quote is not considered
+part of the string literal. It may be used as a language hint or


There is no specified separator aside from the implied separator of the newline. Some people have mistaken this proposal as only allowing a constrained option here. It does not. It says "Anything", and specifies no compiler error if the symbols that immediately follow the " are, say... ". Perhaps you meant to constrain it.

I'm not sure if this is what you're getting at, but the first line is still constrained by the delimiters of the string, i.e. if the string begins with h" then a single " character will still close the string even if it's on the first line. If a single " character would not close the string then it would still be allowed on the first line. The indentation and language hint rules apply "after" we've determined the bounds of the string literal.

workingjubilee · 2023-07-09T22:35:38Z

text/0000-code-literals.md

+part of the string literal. It may be used as a language hint or
+processed by macros (similar to the treatment of doc comments).
+
+```rust
+let sql = hr#"sql
+    SELECT * FROM table;
+    "#;
+```


I do not believe placing metadata regarding the string inside the visible string delimiter tokens should be accepted, as it has many negative impacts. In particular, there is an isomorphism between strings written using r#""# and strings written using "" (and without STRING_CONTINUE, i.e. 0x5C 0x0A), currently, that as far as I know is complete. This proposal would create a surjective function: there would be string literals written using

h"languagetag
"

which have no mirror image using the other syntactic forms for string literals. This causes great amounts of confusion for:

Lexing

Parsing

Code generation

And the very purpose of this language hint is for the service of syntax highlighters and the like, which are very likely going to be written in a language that may have no easy access to simply running syn or tree-sitter or whatever, and may instead be bashed together out of JavaScript and regexes.

In particular, there is an isomorphism between strings written using r#""# and strings written using "" (and without STRING_CONTINUE, i.e. 0x5C 0x0A), currently, that as far as I know is complete.

Not sure what you're getting at here. There are already many ways to get the same "literal value" using different "encodings" of the same literal. For example, tabs could be encoded with \t or an actual tab character.
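A small sketch of that point using only existing literal forms. The function names are illustrative:

```rust
/// Three spellings of the same two-letter string with a tab in between.
fn tab_spellings() -> [&'static str; 3] {
    ["a\tb", "a\x09b", "a\u{9}b"]
}

/// An escaped literal and a raw literal that denote the same value.
fn backslash_spellings() -> (&'static str, &'static str) {
    ("C:\\tmp", r"C:\tmp")
}
```

All entries returned by each function compare equal, even though their source text differs.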

This proposal would create a surjective function: there would be string literals written using

h"languagetag
"

which have no mirror image using the other syntactic forms for string literals.

This causes great amounts of confusion for:

Lexing Parsing Code generation

This is going to need some more justification.

First of all, the language hint is purely a syntactic feature, it doesn't change the "value" of a string literal, so in terms of "values" (which if we're using set theoretic terms, is the most plausible thing to talk about but you haven't actually defined that...) there is the same amount of isomorphism between code string literals and raw string literals as there was between raw string literals and string literals (modulo indentation being relative, which is the entire point of the proposal).

Secondly, I flat out don't believe that this does introduce significant complexity in those areas. The compiler/tooling is already capable of dealing with string literals and raw string literals. This RFC doesn't change the basic rules for when a string begins/ends - the parsing rules are identical to the corresponding non-code-literal form. The only change is to how the content within the literal is converted into a value for use by the program.

workingjubilee · 2023-07-09T22:37:37Z

text/0000-code-literals.md

+- Byte string literals `hb"`
+- Raw byte string literals `hbr#"`
+
+The `h` modifier will appear before all characters in the prefix.


There is no persuasive and particular reason offered to have this precede all other characters in the prefix. It would be preferable to assume that we are going to explore accepting a non-canonical ordering.

Experimentally, br"<content>" compiles, but rb"<content>" does not compile. This implies that we are already particular about the order of string prefixes, and so I wrote this RFC with consistency in mind. I don't particularly care about what order is "canonical" but this rule was easy to define and seemed reasonably intuitive. If you have a strong reason to prefer a different order I'd love to hear it.
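For reference, a minimal sketch of the existing prefix ordering on stable Rust. The function name is hypothetical.

```rust
/// Returns the bytes of a raw byte string literal written with the
/// accepted `br` prefix (`b` for byte, then `r` for raw).
fn raw_bytes() -> &'static [u8] {
    // Inside a raw literal `\n` is not an escape: it is a backslash
    // followed by the letter `n`, so this literal is three bytes long.
    // Spelling the prefix as `rb` instead is rejected by the compiler.
    br"a\n"
}
```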

workingjubilee · 2023-07-09T22:40:29Z

text/0000-code-literals.md

+An `h` modifier may be added to the prefix of the following string
+literal types:
+
+- String literals `h"`
+- Raw string literals `hr#"`
+- Byte string literals `hb"`
+- Raw byte string literals `hbr#"`


Why does this not include c"?

No particular reason - I was using stable Rust as a baseline, but I can update the RFC to include C string literals. The intent is that they combine in the natural way. That said, it looks like the implementation of the C string literal RFC was reverted due to breakage, so... We'll see.

rust-lang/rust#113476

workingjubilee · 2023-07-09T22:52:47Z

text/0000-code-literals.md

+The main drawback is increased complexity of the language:
+
+1.  It adds a four new types of string literals given all
+    the combinations.


Teaching.

String literals are used in pattern matching. It will be very annoying to explain why a metadata tag that can be written as part of the literal and lives inside what appears to be the string's delimiter tokens does or does not participate in pattern matching. I would prefer the question simply not arise.

Specifically, this works:

let "SELECT" = &maybe_select_expr[0..6] else { return; };

And I presume, with this proposal, that this would work:

let h" SELECT " = &maybe_select_expr[0..6] else { return; };

But I do not want to explain why either of these may or may not work:

let h"x86asm " = &maybe_sql_expr[0..0] else { return; };
let h"sql " = &maybe_sql_expr[0..0] else { return; };

All answers seem bad, to me. Introducing a form that allows the question to arise in the first place can simply be avoided.

Sure, we could add a more complex rule for language_tag, for example: the first line with a tag must end with #.

let h#"sql# SELECT "# = &maybe_sql_expr[0..2] else { return; };

You have not actually clarified anything as long as the tag is inside the quotation marks.

I guess the h#<lang> syntax was more natural when this RFC was still proposing the markdown-like triple backtick syntax (```<lang>).

Once feature(stmt_expr_attributes) is stabilized, I think that would nicely enable something like this (even if it is somewhat more verbose):

let sql = #[editor::inject_lang(sql)] h#" SELECT * FROM table; "#;

But I do not want to explain why either of these may or may not work:

As proposed in the RFC, both of those would match as the language hint is not part of the value. I don't think this case is really any different from eg.

let "\t" = &maybe_tab_expr[0..0] else { return; };
let " " = &maybe_tab_expr[0..0] else { return; };

Or:

enum Foo {
    Bar,
    Baz,
}
use Foo::Bar as Bat;
fn main() {
    match Foo::Bar {
        self::Bat => println!("Bat"),
        Foo::Baz => println!("Baz")
    }
}

Ultimately, you can't expect pattern matching to be syntactic - it's fundamentally about the value.
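A short sketch of what "matching on the value" means for string patterns today. The helper names are invented for this example.

```rust
/// The pattern is spelled with a Unicode escape, but it matches any
/// string whose value is a single tab, however the caller wrote it.
fn is_tab(s: &str) -> bool {
    matches!(s, "\u{9}")
}

/// The pattern is a raw string: a backslash followed by the letter `n`.
fn is_backslash_n(s: &str) -> bool {
    matches!(s, r"\n")
}
```

Calling `is_tab("\t")` succeeds even though the pattern and the argument are written differently, and `is_backslash_n("\\n")` succeeds for the same reason.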

Yes yes, and 2_5_5 also is matched by 255 and 0xFF, as those names alias, and if you introduce a specific alias for something, shockingly, it matches. And introducing redundant aliases without consideration for the potential harms to understanding is what I am objecting to.

However, your comments have made apparent to me that you fundamentally do not actually believe this increases language complexity, as you don't think it makes it harder to parse or understand the source code, so I also object to the text written here. If mere quantitative increase in ways to express something does not count as an increase in complexity, then this entry is a lie and there is no drawback.

Suggested change

The main drawback is increased complexity of the language:

1. It adds a four new types of string literals given all

the combinations.

None.

However, your comments have made apparent to me that you fundamentally do not actually believe this increases language complexity, as you don't think it makes it harder to parse or understand the source code

I do think it increases language complexity, but only in the sense that the language now has N+1 features rather than N. Where I disagree with you is the idea that there is a qualitative rather than quantitative difference in complexity in comparison to existing string literals.

My hope is that even this incremental increase in complexity could be later reduced: given that the feature is designed to allow representing every possible string, I think there's a world where a future edition simply makes all multiline literals behave like the literals proposed here. I think it would be appropriate to propose this more drastic change if we later find that the use of code string literals naturally replaces the use of multiline string / raw string literals due to people preferring an "indentation relative form", and if no unforeseen drawbacks are encountered.

tmccombs · 2023-07-10T18:30:36Z

I wonder if maybe the tag part should be deferred to a later PR (but kept in the future possibilities section). And for now just error if there is any text on the first line. Although, then there is a risk that macros or external tools rely on that behavior and break if and when tags are added later.

I also think that the RFC should better specify what it means to measure the whitespace. IMO, the cleanest way would be to require that the indentation must exactly match on each line. So for example you can't have tabs on one line, and spaces on another, or even the same number of spaces and tabs, but in different order. Or go even further and forbid mixed spaces and tabs altogether.

It also feels a little weird to me that the empty string takes multiple lines with this:

let empty = h"
    -";

and I'm not overly fond of the "-" to suppress the final newline. I can't think of anything obviously better though.

I will suggest another alternative. The final newline could be suppressed with a backslash on the penultimate line, like so:

 let s = h"
    something \
    ";

That doesn't require adding any additional syntax, since it works the same as regular strings.
This has a couple of problems though. It doesn't work for raw strings, unless we add a special exception for this to raw indented strings. And you now need three lines to represent the empty string with the h prefix.
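For comparison, this is how the backslash-newline continuation already behaves in stable Rust, and why it has no effect in raw strings. The function names are hypothetical.

```rust
/// In a normal string, a backslash right before a line break drops the
/// line break and all whitespace that immediately follows it.
fn continued() -> &'static str {
    "something \
     "
}

/// In a raw string there are no escapes, so the backslash and the
/// newline both stay in the value.
fn raw_not_continued() -> &'static str {
    r"something \
"
}
```

The first function yields the text with its trailing space and no newline; the second keeps both the backslash and the newline.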

Diggsey · 2023-07-10T22:30:50Z

@workingjubilee

I think for good reasons. It is very easy to go from emitting a string literal something like this:

It's a fair criticism. There's certainly a risk there, but it's difficult to say how significant that risk actually is. Syntax highlighting the "language hint" differently would significantly mitigate that risk, and is trivial to do even if an IDE has no support for syntax highlighting the nested code itself.

My opinion is that you are overstating the risk here: in the example you provided the first line clearly stands out from the rest given the differing indentation, even without syntax highlighting. It's not clear why making a mistake here would be more significant, or harder to catch, than making a mistake anywhere else in the code.

If there was an alternative way to specify the language hint which wasn't worse and avoided the risk entirely, then I would be open to that, but I think the far bigger danger here is ending up with a syntax that is too heavy to use effectively. The current syntax:

let _ = hr#"foo
    <content>
    "#;

Is about at the limit of what I think is reasonable for such a QoL improvement to cost, so using eg. inline attribute syntax such as:

let _ = #[lang(sql)] hr#"
    <content>
    "#;

Would be too intrusive, especially the excessive use of #s.

The reason to support the language feature at all is to enable better tooling support. From basic syntax highlighting to more advanced features. It opens up many opportunities that didn't exist before, and is useful information for the programmer to be able to express in the code and for others reading it.

workingjubilee · 2023-07-10T23:05:15Z

My opinion is that you are overstating the risk here:

And my opinion is that you are understating it.

in the example you provided the first line clearly stands out from the rest given the differing indentation, even without syntax highlighting. It's not clear why making a mistake here would be more significant, or harder to catch, than making a mistake anywhere else in the code.

My concern includes on-the-fly generated Rust code which is, in a strict, computational sense, impossible for me to eye-check and check-in for every example which I might want to generate, but which I may wish to have nicely formatted when I emit it, nonetheless, for various reasons. For example, it may later be inspected for debugging purposes. I would rather the compiler immediately err in those cases of emitting a malformed string, and that I can begin handling the compiler error that has been propagated into my tools via process::Command and logging things so that data can get back to me and I can later write a proper regression test so that I do not generate that code again. That is much preferable. I do lean heavily on the fact that the compiler errors in a lot of cases where I might fuck up in code generation. I would love for this to be a feature I can also lean heavily on because it adopted a highly regular syntax and was conservative about changes to the current lex and parse rule:

    STRING_LITERAL :
       " (
          ~[" \ IsolatedCR]
          | QUOTE_ESCAPE
          | ASCII_ESCAPE
          | UNICODE_ESCAPE
          | STRING_CONTINUE
       )* " SUFFIX?

    STRING_CONTINUE :
       \ followed by \n

and kept them to affecting whitespace which is comparatively easy to reason about, in a quasi-inverse of the rule regarding STRING_CONTINUE, rather than being constantly afraid that it might fucking bite me in the ass because it is sufficiently lossy as to gobble up actual text I might be counting on being present.

workingjubilee · 2023-07-10T23:15:18Z

If there was an alternative way to specify the language hint which wasn't worse and avoided the risk entirely, then I would be open to that, but I think the far bigger danger here is ending up with a syntax that is too heavy to use effectively.

Truly, genuinely, I am content with a change as small as this:

let _ = h_foo_r#"
    <content>
    "#;

Or if you prefer:

let _ = hr#foo"
    <content>
    "#;

I believe feature(stmt_expr_attributes) is still worth reasoning about as it may be the case that it is desired that we have a more general solution for wanting to have sugar in for annotating a string with clearly external data, like what language it is about but potentially also other things, but I am not going to pretend that going all the way to our full attribute syntax is not a chore.

Diggsey · 2023-07-10T23:25:39Z

My concern includes on-the-fly generated Rust code

But why generate "code string literals" at all in that case? If the code is not intended to be edited by humans, then could you not generate a raw string literal?

Let's say for the sake of argument that you both want to generate nicely indented output, and you don't want the extraneous whitespace that would come with a raw string literal, and the generated code is not intended to be checked into source control / generally viewed by a human being. In that case, even with a code string literal you'd need to make sure that every line was properly indented right?

In that case, a simple validation rule that would catch mistakes in your code relating to the first line would be to disallow whitespace between the opening quote and the start of the language hint if present.

let _ = h_foo_r#"

"#;

Or if you prefer:

let _ = hr#foo"

"#;

I will add the former as an alternative in the RFC. The latter doesn't really work as the # is neither required in raw string literals, nor allowed at all in normal string literals, eg.

let _ = h"
    <content>
    ";

This is not a strongly held opinion, but I think it's suboptimal to use _ as a separator in this way. _ is typically used as a character that's not a separator, since it's treated as word-like in many respects. I think it would be better if the language hint was its own token if outside the string.

workingjubilee · 2023-07-11T00:10:15Z

Let's say for the sake of argument that you both want to generate nicely indented output, and you don't want the extraneous whitespace that would come with a raw string literal, and the generated code is not intended to be checked into source control / generally viewed by a human being.

That is not quite my concern. My concern is specifically that I do want it to be potentially viewable by a human being, and that it is somewhere, logged for later review if necessary, but it's not like I am reviewing every single instance on git or whatever. This later review may happen whether the compilation succeeds or fails. Indeed, my concern is I would like to be able to make my codegen nice and legible for the benefit of places that I may never see it, without a concern that the result may be miscompiled. And some of the strings may, indeed, be SQL which my generated Rust code will later tell a database to execute, and I want to make examining the source easy and keep it easy to reason about why things are wrong even for people who may not write Rust programs very often, as they can still examine and easily read nicely formatted SQL that is also nicely formatted in the context of the Rust program.

And judging by the occasional error reports I get from these faraway databases, I am pretty sure they're not that familiar with Markdown and its quirks, either.

Part of what makes what I have made possible is that rustc is so very enthusiastic already about valid parses, so that I can simply defer a lot of work into the compiler instead of precompiling the code myself, because then it becomes a simple transaction with the compiler.
"simple"
"""simple"""
The production software I have helped midwife is quite terrifyingly complex and thus I am extremely interested in anything that I can leverage to improve its authorship and debugging experiences. Woe to me. It is not yours to take responsibility for my questionable life choices, but I would like to not model this as something I would have to be wary of in the codebases I work on, and instead simply look forward to its implementation so I can make use of it as soon as I possibly can.

Diggsey · 2023-07-12T17:40:58Z

@tmccombs

I also think that the RFC should better specify what it means to measure the whitespace. IMO, the cleanest way would be to require that the indentation must exactly match on each line.

This is already specified in the RFC:

Remove exactly the measured whitespace from each non-empty line. If this cannot be done, then issue a compiler error. The whitespace must match down to the exact character sequence.

It also feels a little weird to me that the empty string takes multiple lines with this:
let empty = h"
    -";

That would be an error, since there is no final newline to suppress in that example. The empty string would be simply:

let empty = h"
    ";

With zero lines between the opening and closing quote, there is no newline to suppress.

Contrast this to:

let single_line = h"
    content
    ";

In this case there is a single line, and so there does exist a final newline that can be suppressed.
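The stripping rule and the two cases above can be sketched as an ordinary function. This is only an illustration, not the RFC's implementation: `dedent` is a hypothetical helper, it takes the text between the quotes as input, and it assumes that the "measured whitespace" is the indentation in front of the closing quote. It does not model the "-" newline-suppression marker.

```rust
/// Hypothetical model of the value of a code string literal, given the
/// raw text between the opening and closing quote.
fn dedent(raw: &str) -> Result<String, String> {
    // Everything up to the first newline is the language hint; drop it.
    let (_hint, rest) = raw
        .split_once('\n')
        .ok_or("a code literal must span more than one line")?;

    // Assumption: the whitespace before the closing quote is the
    // measured indentation.
    let (body, indent) = match rest.rsplit_once('\n') {
        Some((body, indent)) => (Some(body), indent),
        None => (None, rest), // zero content lines: the empty string
    };
    if !indent.chars().all(|c| c == ' ' || c == '\t') {
        return Err("the closing quote must be on its own line".to_string());
    }

    let mut out = String::new();
    if let Some(body) = body {
        for line in body.split('\n') {
            if !line.is_empty() {
                // The prefix must match character for character, so tabs
                // never stand in for spaces or the other way round.
                let stripped = line.strip_prefix(indent).ok_or_else(|| {
                    format!("line {line:?} does not start with the measured indentation")
                })?;
                out.push_str(stripped);
            }
            out.push('\n');
        }
    }
    Ok(out)
}
```

Under these assumptions `dedent("\n    ")` is the empty string and `dedent("\n    content\n    ")` is `"content\n"`, while a content line indented with a tab against a four-space closing line is an error.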

Animeshz · 2023-07-29T07:20:28Z

Just a suggestion, but one could also look at nix's syntax for inspiration:

{
    environment.etc."auto-cpufreq.conf".text = ''
      [charger]
      governor = powersave
      turbo = never

      [battery]
      governor = powersave
      turbo = never
    '';
}

Normal strings are as is, with double quotes ", whereas multiline strings are made using double single-quotes '' and close with the same. This doesn't conflict with the character literal, because any character literal must have a 1-length character before the closing quote.

The indentation is cleared by the compiler at compile-time, and if the ending quote ''; is at the same level as the starting quote, it automatically removes the new-line.

ksaadDE · 2023-09-14T01:43:30Z

Why not combine the idea of the back-tick syntax with the brackets?

Example:

let mymultilinestr = {<language>
     <yourtext>
}

everything that replaces <yourtext> is going to be seen as string until the closing bracket.

programmerjake · 2023-09-14T01:59:43Z

let mymultilinestr = {<language>
     <yourtext>
}

well, that's currently valid code, so changing it to be a string would conflict:

pub fn foo() {bar
    () // weird formatting for calling bar()
}

fn bar() {}

Animeshz · 2023-09-14T11:24:40Z

Since {} is used for scoping already, I don't think it could also be used to store strings.

ksaadDE · 2023-09-17T01:08:25Z

@Animeshz
@programmerjake

I almost forgot that unclean syntax is a thing in Rust.

I could provide another alternative:

let mymultilinestr  = S{<lang>
   <string>
};

The S in front of the bracket indicates it is a multi-line string. After that a language tag can be added, and on a new line the multi-line string starts and runs until the last bracket.

Because of the prefixing S it does ~~not~~ conflict with scoping or any other use ~~, to my knowledge~~. ~~Simple but effective trick~~.

digama0 · 2023-09-17T01:22:13Z

S{} is already legal syntax for creating a structure named S:

struct S { lang: u32, b: u32 }
let lang = 1;
let b = 2;

let mymultilinestr  = S{lang
   , b
};

ksaadDE · 2023-09-17T17:38:57Z

S{} is already legal syntax for creating a structure named S:

Right, this would conflict. What about prefixing it with a - ?

let  mymultilinestr = -S{<lang>,
<mlstring>
};

The other alternative I would suggest are back ticks or a backslash to indicate that not a struct is meant.

I'm just playing around with ideas, how to make it usable.

digama0 · 2023-09-17T20:58:52Z

S could have a negation operator (impl std::ops::Neg for S)
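A minimal sketch of that conflict on stable Rust. The struct and its field are made up; the point is only that a unary minus in front of a struct expression already parses and type-checks today.

```rust
use std::ops::Neg;

#[derive(Debug, PartialEq)]
struct S {
    lang: i32,
}

impl Neg for S {
    type Output = S;

    fn neg(self) -> S {
        S { lang: -self.lang }
    }
}

/// Already valid Rust: unary minus applied to a struct expression.
fn negated() -> S {
    let lang = 1;
    -S { lang }
}
```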

ksaadDE · 2023-09-19T05:31:42Z

S could have a negation operator (impl std::ops::Neg for S)

@digama0

good point. What about the wave ~ ? I looked it up in the docs, seemingly it is not used (yet).

let  mymultilinestr = ~{<lang>,
<mlstring>
};

kanashimia · 2023-11-10T19:23:07Z

A comment can also be used for specifying a language:

let cool_codes = /*rust*/r#"
    fn main(){unsafe{*(0 as*mut _)=0}}
"#;

This is similar to how Helix editor highlights strings for Nix language.
Doesn't require any language changes, just a change in the guidelines and tooling.
But not as clear to parse as a dedicated language construct.
Dedenting is still a problem.

There seem to be two separate features proposed here:

dedented string literals
string literal language hint

Also for prior art: https://github.com/tc39/proposal-string-dedent

Propose code string literals

cc7e7c8

calebcartwright reviewed Jun 18, 2023

Update text/0000-code-literals.md

a48ef56

Co-authored-by: Caleb Cartwright <calebcartwright@users.noreply.github.com>

ehuss added the T-lang Relevant to the language team, which will review and decide on the RFC. label Jun 18, 2023

scottmcm reviewed Jun 20, 2023

Diggsey added 2 commits June 20, 2023 22:10

Incorporate feedback and suggestions

f889c3e

Merge branch 'code-string-literals' of github.com:Diggsey/rfcs into code-string-literals

47a4b6c

Diggsey force-pushed the code-string-literals branch from 732e5f4 to 47a4b6c Compare June 20, 2023 21:16

Add another alternative

a00a4a9

the8472 reviewed Jul 8, 2023

Add note about editions

90ff817

workingjubilee reviewed Jul 9, 2023

Tweaks

ceca328

calebcartwright mentioned this pull request Jul 22, 2023

Suspicious formatting with string literals in macros rust-lang/rustfmt#5855

Closed

		Anything directly after the opening quote is not considered
		part of the string literal. It may be used as a language hint or
