
Support formatting jupyter notebooks #1218

Closed
4 of 6 tasks
Tracked by #5188
konstin opened this issue Dec 12, 2022 · 11 comments · Fixed by #4665
Labels: core (Related to core functionality), help wanted (Contributions especially welcome)

@konstin
Member

konstin commented Dec 12, 2022

black supports formatting Jupyter notebooks via the jupyter extra, black[jupyter]. When installed this way, it formats Jupyter notebooks just like Python files. It would be nice if ruff could similarly lint and fix Jupyter notebooks.

Steps to implement this:

  • Read Jupyter notebooks, build a mapping between cells and the concatenated code, and print lint errors with cell ids (see the sketch after this list). The code is in jupyter/notebooks.rs.
  • Apply fixes to the concatenated code and map them back to cells
  • Write back Jupyter notebooks, ensuring that if there are no changes the JSON is identical (roundtrip support). You might need to extend or replace the current schema, or drop parts of it in favor of plain serde_json::Value. See also the black implementation.
  • Test roundtrip support with a variety of notebooks found in the wild, ideally from different generators (e.g. Jupyter Notebook and PyCharm) and different schema versions.
  • Remove the jupyter_notebook feature and include .ipynb files in ruff by default
  • Optional: Check whether any lints should be off by default for Jupyter notebooks; e.g. "no print" or isort rules might not make sense
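
A minimal sketch of the first step, assuming the notebook is parsed into a plain `serde_json::Value` rather than the schema in jupyter/notebooks.rs; the `concatenate_code_cells` helper and the `notebook.ipynb` path are illustrative, and the field names follow the nbformat JSON layout:

```rust
use serde_json::Value;

/// Concatenate the code cells of a notebook and record the byte offset at
/// which each cell ends, so lint locations can be mapped back to a cell.
fn concatenate_code_cells(notebook: &Value) -> (String, Vec<usize>) {
    let mut source = String::new();
    let mut offsets = vec![0];
    if let Some(cells) = notebook["cells"].as_array() {
        for cell in cells {
            if cell["cell_type"] != "code" {
                continue;
            }
            // In the nbformat schema, `source` is either a string or a list of lines.
            match &cell["source"] {
                Value::String(text) => source.push_str(text),
                Value::Array(lines) => {
                    for line in lines.iter().filter_map(Value::as_str) {
                        source.push_str(line);
                    }
                }
                _ => {}
            }
            if !source.ends_with('\n') {
                source.push('\n');
            }
            offsets.push(source.len());
        }
    }
    (source, offsets)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string("notebook.ipynb")?;
    let notebook: Value = serde_json::from_str(&raw)?;
    let (code, offsets) = concatenate_code_cells(&notebook);
    println!("{} code cells, {} bytes of code", offsets.len() - 1, code.len());
    Ok(())
}
```
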
@charliermarsh
Member

I assume Ruff would work with nbQA, but I too would like it to support Jupyter out-of-the-box. Looking into it...

@charliermarsh
Member

Ruff was actually integrated into nbQA in the latest release, so you can now run (e.g.):

❯ nbqa ruff Untitled.ipynb
Untitled.ipynb:cell_1:2:5: F841 Local variable `x` is assigned to but never used
Untitled.ipynb:cell_2:1:1: E402 Module level import not at top of file
Untitled.ipynb:cell_2:1:8: F401 `os` imported but unused
Found 3 error(s).
1 potentially fixable with the --fix option.

(Would still like to have a first-party integration at some point.)

charliermarsh added the core (Related to core functionality) label and removed the enhancement label Dec 31, 2022
@blakeNaccarato

blakeNaccarato commented Mar 10, 2023

See a parallel effort over in the Ruff VSCode extension, where LSP support for analyzing entire Jupyter notebooks seems blocked by pygls still being on LSP v3.16. Handling whole-notebook formatting through the Ruff LSP implementation could be a shorter road if effort is concentrated over at pygls to get LSP 3.17 out. See my comment in the other thread for more detail and links to other issues.

It would be nice to have native support in Ruff as well! But in the short run, LSP support looks like the path of less resistance.

@konstin
Member Author

konstin commented May 11, 2023

The current implementation can lint Jupyter files, but it can't yet apply fixes or write them back to the file. I've written instructions in the top comment and can mentor if someone wants to tackle this.

konstin added the help wanted (Contributions especially welcome) label May 12, 2023
@sladyn98
Contributor

@konstin I could take this on

@charliermarsh
Member

@sladyn98 - Sorry for the churn, I already chatted with @dhruvmanila about taking this one on!

@dhruvmanila
Member

@charliermarsh you can assign this to me to avoid any future confusion :)

dhruvmanila added a commit that referenced this issue Jun 12, 2023
## Summary

Add support for applying auto-fixes in Jupyter Notebook.

### Solution

Cell offsets are the boundaries of each cell in the concatenated source
code, represented using `TextSize`. They include the start and end
offsets as well, thus creating a range for each cell. These offsets
are updated using the `SourceMap` markers.

### SourceMap

`SourceMap` contains markers constructed from each edit, which map
positions in the original source code to positions in the transformed
source. The following drawing might make it clearer:

![SourceMap visualization](https://github.com/astral-sh/ruff/assets/67177269/3c94e591-70a7-4b57-bd32-0baa91cc7858)

The center column, where the dotted lines are, represents the markers
included in the `SourceMap`. The `Notebook` looks at these markers and
updates the cell offsets after each linter loop. If you look closely,
each destination takes into account all of the markers before it.

The index is constructed only when required, as it's only used to render
the diagnostics, so an `OnceCell` is used for this purpose. The cell
offsets, cell content, and the index are updated after each iteration
of linting, in that order. The order is important because the content is
updated from the new offsets, and the index from the new content.
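
Below is a minimal sketch of that offset update. It is not Ruff's actual `SourceMap` API; the `Marker` struct and `update_cell_offsets` helper are hypothetical, but they illustrate how each destination offset accounts for all the markers before it:

```rust
/// Records that the text at `source` in the original code now starts at
/// `dest` in the fixed code (a hypothetical stand-in for the real marker type).
struct Marker {
    source: u32,
    dest: u32,
}

/// Shift each cell offset by the last marker at or before it.
fn update_cell_offsets(offsets: &mut [u32], markers: &[Marker]) {
    for offset in offsets.iter_mut() {
        if let Some(marker) = markers.iter().take_while(|m| m.source <= *offset).last() {
            *offset = *offset - marker.source + marker.dest;
        }
    }
}

fn main() {
    // One fix removed 5 bytes inside the first cell, so later cells shift left.
    let markers = [Marker { source: 40, dest: 35 }];
    let mut cell_offsets = [0, 60, 120];
    update_cell_offsets(&mut cell_offsets, &markers);
    assert_eq!(cell_offsets, [0, 55, 115]);
}
```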

## Limitations

### 1

Styling rules such as the ones in `pycodestyle` will not be applicable
everywhere in a Jupyter notebook, especially at cell boundaries. Take
an example where a rule suggests having two blank lines before a
function and the cells contain the following code:

```python
import something
# ---
def first():
    pass

def second():
    pass
```

(Again, the comment is only to visualize cell boundaries.)

In the concatenated source code, the two blank lines will be added, but
they shouldn't be when viewed in terms of the Jupyter notebook: it's as
if the function `first` were at the start of a file.

`nbqa` solves this by recording the newlines around each cell before
running `autopep8`, then running the tool and restoring the newlines at
the end (see nbQA-dev/nbQA#807); a sketch of this workaround follows.

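A rough sketch of that record-and-restore idea (not nbQA's actual code; the helper names and example strings are illustrative):

```rust
/// Count the blank lines at the start and end of a cell's source.
fn blank_line_counts(cell: &str) -> (usize, usize) {
    let leading = cell.lines().take_while(|l| l.trim().is_empty()).count();
    let trailing = cell.lines().rev().take_while(|l| l.trim().is_empty()).count();
    (leading, trailing)
}

/// Drop whatever blank lines the tool added or removed at the cell
/// boundaries and restore the recorded counts.
fn restore_blank_lines(formatted: &str, leading: usize, trailing: usize) -> String {
    let core = formatted.trim_matches('\n');
    format!("{}{}{}", "\n".repeat(leading), core, "\n".repeat(trailing))
}

fn main() {
    let original = "def first():\n    pass";
    let (leading, trailing) = blank_line_counts(original);
    // A blank-lines fix on the concatenated source adds two newlines before
    // `first`; restoring the recorded counts undoes it for this cell.
    let formatted = "\n\ndef first():\n    pass";
    assert_eq!(restore_blank_lines(formatted, leading, trailing), original);
}
```
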
## Test Plan

Three commands were run in order with common flags (`--select=ALL
--no-cache --isolated`) to isolate the stage at which any problem occurs:
1. Only diagnostics
2. Fix with diff (`--fix --diff`)
3. Fix (`--fix`)

### https://github.com/facebookresearch/segment-anything

```
-------------------------------------------------------------------------------
 Jupyter Notebooks       3            0            0            0            0
 |- Markdown             3           98            0           94            4
 |- Python               3          513          468            4           41
 (Total)                            611          468           98           45
-------------------------------------------------------------------------------
```

```console
$ cargo run --all-features --bin ruff -- check --no-cache --isolated --select=ALL /path/to/segment-anything/**/*.ipynb --fix
...
Found 180 errors (89 fixed, 91 remaining).
```

### https://github.com/openai/openai-cookbook

```
-------------------------------------------------------------------------------
 Jupyter Notebooks      65            0            0            0            0
 |- Markdown            64         3475           12         2507          956
 |- Python              65         9700         7362         1101         1237
 (Total)                          13175         7374         3608         2193
===============================================================================
```

```console
$ cargo run --all-features --bin ruff -- check --no-cache --isolated --select=ALL /path/to/openai-cookbook/**/*.ipynb --fix
error: Failed to parse /path/to/openai-cookbook/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb:cell 4:29:18: unexpected token '-'
...
Found 4227 errors (2165 fixed, 2062 remaining).
```

### https://github.com/tensorflow/docs

```
-------------------------------------------------------------------------------
 Jupyter Notebooks     150            0            0            0            0
 |- Markdown             1           55            0           46            9
 |- Python               1          402          289           60           53
 (Total)                            457          289          106           62
-------------------------------------------------------------------------------
```

```console
$ cargo run --all-features --bin ruff -- check --no-cache --isolated --select=ALL /path/to/tensorflow-docs/**/*.ipynb --fix
error: Failed to parse /path/to/tensorflow-docs/site/en/guide/extension_type.ipynb:cell 80:1:1: unexpected token Indent
error: Failed to parse /path/to/tensorflow-docs/site/en/r1/tutorials/eager/custom_layers.ipynb:cell 20:1:1: unexpected token Indent
error: Failed to parse /path/to/tensorflow-docs/site/en/guide/data.ipynb:cell 175:5:14: unindent does not match any outer indentation level
error: Failed to parse /path/to/tensorflow-docs/site/en/r1/tutorials/representation/unicode.ipynb:cell 30:1:1: unexpected token Indent
...
Found 12726 errors (5140 fixed, 7586 remaining).
```

### https://github.com/tensorflow/models

```
-------------------------------------------------------------------------------
 Jupyter Notebooks      46            0            0            0            0
 |- Markdown             1           11            0            6            5
 |- Python               1          328          249           19           60
 (Total)                            339          249           25           65
-------------------------------------------------------------------------------
```

```console
$ cargo run --all-features --bin ruff -- check --no-cache --isolated --select=ALL /path/to/tensorflow-models/**/*.ipynb --fix
...
Found 4856 errors (2690 fixed, 2166 remaining).
```

resolves: #1218
fixes: #4556
@dhruvmanila
Member

Sorry, not yet completed.

dhruvmanila reopened this Jun 12, 2023
@charliermarsh
Member

What's the outstanding work here? Testing?

@dhruvmanila
Member

  • Roundtrip support
  • End-to-end testing support

dhruvmanila added a commit that referenced this issue Jun 12, 2023
## Summary

Add roundtrip support for Jupyter notebook.

1. Read the notebook
2. Extract the source code content
3. Use it to update the notebook itself (which should be exactly the same [^1])
4. Serialize it into JSON and print it to stdout (a minimal sketch follows below)

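A minimal sketch of that roundtrip, assuming the notebook is held as a plain `serde_json::Value` (the real implementation uses the notebook schema); byte-identical output also depends on details such as key order (serde_json's `preserve_order` feature) and the original indentation:

```rust
use serde_json::Value;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Read the notebook and parse it.
    let original = std::fs::read_to_string("notebook.ipynb")?;
    let notebook: Value = serde_json::from_str(&original)?;

    // 2./3. Here the source code would be extracted and written back into
    //       `cells`; for a pure roundtrip the value is left untouched.

    // 4. Serialize back to JSON and print it to stdout.
    let roundtripped = serde_json::to_string_pretty(&notebook)?;
    println!("{roundtripped}");

    if roundtripped.trim_end() == original.trim_end() {
        eprintln!("roundtrip output matches the original");
    } else {
        eprintln!("roundtrip output differs from the original");
    }
    Ok(())
}
```
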
## Test Plan

`cargo run --all-features --bin ruff_dev --package ruff_dev -- round-trip <path/to/notebook.ipynb>`

<details><summary>Example output:</summary>
<p>

```
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f3c286e9-fa52-4440-816f-4449232f199a",
   "metadata": {},
   "source": [
    "# Ruff Test"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2b7bc6c-778a-4b07-86ae-dde5a2d9511e",
   "metadata": {},
   "source": [
    "Markdown block before the first import"
   ]
  },
  {
   "cell_type": "code",
   "id": "5e3ef98e-224c-450a-80e6-be442ad50907",
   "metadata": {
    "tags": []
   },
   "source": "",
   "execution_count": 1,
   "outputs": []
  },
  {
   "cell_type": "code",
   "id": "6bced3f8-e0a4-450c-ae7c-f60ad5671ee9",
   "metadata": {},
   "source": "import contextlib\n\nwith contextlib.suppress(ValueError):\n    print()\n",
   "outputs": []
  },
  {
   "cell_type": "code",
   "id": "d7102cfd-5bb5-4f5b-a3b8-07a7b8cca34c",
   "metadata": {},
   "source": "import random\n\nrandom.randint(10, 20)",
   "outputs": []
  },
  {
   "cell_type": "code",
   "id": "88471d1c-7429-4967-898f-b0088fcb4c53",
   "metadata": {},
   "source": "foo = 1\nif foo < 2:\n    msg = f\"Invalid foo: {foo}\"\n    raise ValueError(msg)",
   "outputs": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (ruff-playground)",
   "name": "ruff-playground",
   "language": "python"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "pygments_lexer": "ipython3",
   "nbconvert_exporter": "python",
   "version": "3.11.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
```

</p>
</details> 

[^1]: The type in JSON might be different (#4665 (comment))

Part of #1218
konstin pushed a commit that referenced this issue Jun 13, 2023
konstin pushed a commit that referenced this issue Jun 13, 2023
dhruvmanila added a commit that referenced this issue Jun 15, 2023
## Summary

Ability to perform integration tests on Jupyter notebooks

Part of #1218

## Test Plan

`cargo test`
@dhruvmanila
Member

Meta issue: #5188
