FEA Add examples recommender system #1125

Merged
merged 97 commits into from Nov 17, 2023
Commits
28f4461
FEA Add examples recommender system
Apr 5, 2023
d27c9fe
Add recommend_examples to gallery_config
Apr 5, 2023
0e027a7
Use rubric directive instead of title
Apr 6, 2023
d74b825
Set number of recommended examples from gallery_config
Apr 6, 2023
7b9959c
Add parameter validation for n_examples
Apr 6, 2023
81f742e
Run black
Apr 6, 2023
c9c4589
Clean commented code
Apr 6, 2023
cbc55cd
Apply suggestions from https://github.com/ArturoAmorQ/sphinx-gallery/…
Apr 7, 2023
fc5b8a4
Apply suggestions from code review
ArturoAmorQ Apr 11, 2023
8624f68
Some fixes
Apr 11, 2023
d042591
Fix format
Apr 11, 2023
8a04e69
Rename data_dict into data
Apr 11, 2023
29fe98b
Simplify for loop
Apr 11, 2023
4a23db1
Simplify dtype
Apr 11, 2023
5b1b757
Change notation and divide inplace
Apr 11, 2023
fa2389a
Apply suggestions from code review
ArturoAmorQ Apr 17, 2023
3c8e957
Support dense matrices only
Apr 17, 2023
9b4f6f0
Rename tfidf_transformer into compute_tf_idf
Apr 17, 2023
2d217e3
Update documentation to dense matrix support only
Apr 17, 2023
1d611dc
Fix bugs
Apr 19, 2023
7d1f621
Change default value for backward compatibility
Apr 19, 2023
aeb5489
Fix format
Apr 20, 2023
108f523
Add entry to configuration.rst
Apr 20, 2023
ed7c6fc
Import numpy only if recommender is enabled
Apr 21, 2023
a52b5c1
Fix format
Apr 21, 2023
9b5ef86
Fix bugs
Apr 21, 2023
a2cc203
Apply suggestions from code review
ArturoAmorQ Apr 24, 2023
8dd83d0
Add missing import
Apr 24, 2023
8672914
Fix format
May 3, 2023
94e5a40
Merge branch 'master' into recommender_system
ArturoAmorQ Aug 3, 2023
bbe8a20
Fix format
Aug 3, 2023
e7be84c
Merge branch 'master' into recommender_system
larsoner Aug 9, 2023
9b2ec7c
Add test for recommended files
Aug 10, 2023
6231bca
FIX: importorskip
larsoner Aug 10, 2023
9c56f62
Merge branch 'master' into recommender_system
larsoner Aug 10, 2023
c6ea293
FIX: More complete test
larsoner Aug 10, 2023
12fd707
FIX: Path
larsoner Aug 10, 2023
581c7f4
FIX: No need
larsoner Aug 10, 2023
cd01604
Solve conflicts
Aug 21, 2023
904be25
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 21, 2023
72526be
Fix file header
Aug 21, 2023
ec7ed38
Merge branch 'recommender_system' of github.com:ArturoAmorQ/sphinx-ga…
Aug 21, 2023
40bb513
Merge branch 'master' of https://github.com/sphinx-gallery/sphinx-gal…
Aug 21, 2023
8706c81
Clean code
Aug 21, 2023
8d11a7a
Ensure that summary line fits on one line
Aug 21, 2023
b1083a7
Add test for html render of n_examples
Aug 21, 2023
083c541
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 21, 2023
30dc03d
Remove possible support for backrefs tokenizer
Aug 23, 2023
00e78c8
Fix wrong test comment
Aug 23, 2023
0714ef0
Prefer explicit nested numpy imports
Aug 23, 2023
1c8351d
Add support to min_df
Aug 23, 2023
c29c621
Merge branch 'master' into recommender_system
ArturoAmorQ Aug 23, 2023
69966a2
Add support for float min_df
Aug 25, 2023
c0dee6f
Modify doc on min_df accordingly
Aug 25, 2023
a8daadc
Add support for max_df
Aug 25, 2023
a06f026
Simplify use of default values
Aug 25, 2023
92aeff5
Improve documentation in the user guide
Aug 25, 2023
3feac7b
Fix ruff errors
Aug 25, 2023
8dea184
Iter
Aug 25, 2023
5289260
Update sphinx_gallery/recommender.py
ArturoAmorQ Aug 25, 2023
0dafc32
Add link to wikipedia for TF-IDF
Aug 25, 2023
a214c05
Change default values for ExampleRecommender
Aug 25, 2023
829ee56
Fix conflicts
Oct 23, 2023
76c0a81
Apply suggestions from code review
ArturoAmorQ Oct 23, 2023
4c7dafb
Iter on suggestions
Oct 23, 2023
f5d849f
Iter on conflict solving
Oct 23, 2023
2a21ce6
Change iterator variable name
Oct 23, 2023
a3e98eb
Prefer Path over os.listdir
Oct 23, 2023
d1ec903
Try import numpy on gen_gallery
Oct 23, 2023
6464c4f
Carry parameter validation before vectorizing
Oct 24, 2023
67dccf0
Remove nested numpy imports
Oct 24, 2023
8092e14
Fix conflicts
Oct 24, 2023
eae2a4a
Revert removal of nested imports
Oct 24, 2023
3ba4d26
Add importorskip to test
Oct 24, 2023
494d177
Use clearer variable names
Oct 25, 2023
1189948
Make docstring clearer
Oct 25, 2023
96c557a
Attempt to make test_rebuild more lenient
Oct 25, 2023
1859608
Iter on making test_rebuild more lenient
Oct 26, 2023
fbea244
Make rubric header customizable
Oct 26, 2023
a56e6ca
Merge branch 'master' into recommender_system
ArturoAmorQ Oct 27, 2023
38404d4
Merge branch 'master' into recommender_system
larsoner Oct 27, 2023
063df21
Improve rubric header description
Nov 6, 2023
44d2885
Add test for dict_vectorizer and compute_tf_idf
Nov 6, 2023
878772a
Merge branch 'recommender_system' of github.com:ArturoAmorQ/sphinx-ga…
Nov 6, 2023
dd762b1
Update sphinx_gallery/recommender.py
ArturoAmorQ Nov 6, 2023
767c6cb
Remove rubric_header from parameters of recommender if exists
Nov 6, 2023
a16867b
Merge branch 'recommender_system' of github.com:ArturoAmorQ/sphinx-ga…
Nov 6, 2023
d5398f3
Factorize tests into a single file
Nov 8, 2023
43b97ed
Keep rubric directive non-editable
Nov 8, 2023
c69f075
Add test for custom header
Nov 9, 2023
8e11051
Handle existing directory correctly
Nov 9, 2023
dc32406
Apply suggestions from code review
ArturoAmorQ Nov 13, 2023
e5e0045
Prefer Path over os.path
Nov 13, 2023
2d3cb5d
Move test_recommend_n_examples back to test_full
Nov 13, 2023
488f64d
improve doc, test same examples
lucyleeow Nov 16, 2023
f591624
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 16, 2023
53e2740
Merge branch 'master' into recommender_system
lucyleeow Nov 17, 2023
2 changes: 1 addition & 1 deletion continuous_integration/install.sh
@@ -12,7 +12,7 @@ if [ "$DISTRIB" == "mamba" ]; then
     if [ "$PLATFORM" != "Linux" ]; then
         conda remove -y memory_profiler
     fi
-    PIP_DEPENDENCIES="jupyterlite-sphinx>=0.8.0,<0.9.0 jupyterlite-pyodide-kernel<0.1.0 libarchive-c"
+    PIP_DEPENDENCIES="jupyterlite-sphinx>=0.8.0,<0.9.0 jupyterlite-pyodide-kernel<0.1.0 libarchive-c numpy"
 elif [ "$DISTRIB" == "minimal" ]; then
     PIP_DEPENDENCIES=""
 elif [ "$DISTRIB" == "pip" ]; then
52 changes: 52 additions & 0 deletions doc/configuration.rst
@@ -45,6 +45,7 @@ file:
- ``reset_modules_order`` (:ref:`reset_modules_order`)
- ``abort_on_example_error`` (:ref:`abort_on_first`)
- ``only_warn_on_example_error`` (:ref:`warning_on_error`)
- ``recommender`` (:ref:`recommend_examples`)
- ``expected_failing_examples`` (:ref:`dont_fail_exit`)
- ``min_reported_time`` (:ref:`min_reported_time`)
- ``show_memory`` (:ref:`show_memory`)
@@ -1785,6 +1786,57 @@ flag is passed to ``sphinx-build``. This can be enabled by setting::
}


.. _recommend_examples:

Enabling the example recommender system
=======================================

Sphinx-Gallery can be configured to generate content-based recommendations for
an example gallery. A list of related examples is automatically generated by
computing the closest examples in the `TF-IDF
<https://en.wikipedia.org/wiki/Tf%E2%80%93idf>`_ space of their text contents.
Only examples within a single gallery (including its sub-galleries) are used to
compute the closest examples. The most similar content is then displayed at the bottom
of each example as a set of thumbnails.

The recommender system can be enabled by setting ``enable`` to ``True``. To
configure it, pass a dictionary to the ``sphinx_gallery_conf``, e.g.::

    sphinx_gallery_conf = {
        ...
        "recommender": {"enable": True, "n_examples": 5, "min_df": 3, "max_df": 0.9},
    }

The only required parameter is ``enable``. If any other parameter is not
specified, the default value is used. Below is a more complete explanation of
each field:

enable (type: bool, default: False)
    Whether to generate recommendations inside the example gallery. Enabling this
    feature requires adding ``numpy`` to the dependencies.
n_examples (type: int, default: 5)
    Number of most relevant examples to display.
min_df (type: float in range [0.0, 1.0] | int, default: 3)
    When building the vocabulary, ignore terms that have a document frequency
    strictly lower than the given threshold. If float, the parameter represents a
    proportion of documents; if integer, it represents absolute counts. This value
    is also called cut-off in the literature.
max_df (type: float in range [0.0, 1.0] | int, default: 0.9)
    When building the vocabulary, ignore terms that have a document frequency
    strictly higher than the given threshold. If float, the parameter represents a
    proportion of documents; if integer, it represents absolute counts.
rubric_header (type: str, default: "Related examples")
    Customizable rubric header. It can be edited to more descriptive text or to
    add external links, e.g. to the API doc of the recommender system on the
    sphinx-gallery documentation.

The parameters ``min_df`` and ``max_df`` can be customized by the user to trim
the very rare/very common words. This may improve the quality of the
recommendations, but more importantly, it saves computational resources that
would otherwise be wasted on non-informative tokens.
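
To make the two thresholds concrete, here is a minimal sketch of document-frequency trimming on a dense count matrix. The helper name `trim_vocabulary` and the toy data are hypothetical and chosen for illustration; they are not taken from `sphinx_gallery/recommender.py`, which may handle this differently.

```python
import numpy as np


def trim_vocabulary(counts, min_df=3, max_df=0.9):
    """Return a boolean mask over the columns (terms) of ``counts`` to keep.

    ``counts`` is a dense (n_documents, n_terms) array of term counts.
    A float threshold is read as a proportion of documents, an int as an
    absolute number of documents (hypothetical helper, for illustration).
    """
    counts = np.asarray(counts)
    n_docs = counts.shape[0]
    # Document frequency: number of documents in which each term appears.
    doc_freq = (counts > 0).sum(axis=0)
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    # Terms strictly below ``low`` or strictly above ``high`` are dropped.
    return (doc_freq >= low) & (doc_freq <= high)


# Four documents, four terms: the first term appears everywhere (too common),
# the second and third appear once (too rare), the last appears in three.
counts = np.array(
    [
        [1, 0, 2, 1],
        [1, 0, 0, 3],
        [1, 4, 0, 1],
        [1, 0, 0, 0],
    ]
)
mask = trim_vocabulary(counts, min_df=2, max_df=0.9)
```

With these toy values only the last term survives, since it is neither present in fewer than 2 documents nor in more than 90% of them.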

Contributor

I know this is already implied above but could we explicitly say that only the examples within a single gallery (and it's sub galleries) are used for computing closest examples.

Also this is probably obvious but can we add that only recommendations for .py files will be generated.

Currently example recommendations are only computed for ``.py`` files.
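
As an illustration of what "closest examples in the TF-IDF space" can mean, the following is a minimal sketch using only NumPy. The function names `tf_idf` and `closest`, the smoothed IDF variant, and the use of cosine similarity are assumptions made for this sketch; the recommender added in this PR may use a different weighting or distance.

```python
import numpy as np


def tf_idf(counts):
    """Compute a TF-IDF matrix from a dense (n_documents, n_terms) count array."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    # Term frequency: counts normalized by document length.
    doc_len = counts.sum(axis=1, keepdims=True)
    tf = np.divide(counts, doc_len, out=np.zeros_like(counts), where=doc_len > 0)
    # Smoothed inverse document frequency (one common variant, assumed here).
    doc_freq = (counts > 0).sum(axis=0)
    idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
    return tf * idf


def closest(tfidf, index, n_examples=2):
    """Return indices of the ``n_examples`` rows most similar to row ``index``."""
    norms = np.linalg.norm(tfidf, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    unit = tfidf / norms[:, None]
    similarity = unit @ unit[index]  # cosine similarity to the query row
    similarity[index] = -np.inf  # never recommend the example itself
    return np.argsort(-similarity)[:n_examples]


# Toy corpus: documents 0 and 1 share terms, as do documents 2 and 3.
counts = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 3, 1], [0, 0, 1, 3]])
weights = tf_idf(counts)
print(closest(weights, index=0, n_examples=1))
```

In this toy corpus the nearest neighbor of document 0 is document 1, because they are the only pair sharing the first two terms.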

.. _setting_thumbnail_size:

Setting gallery thumbnail size
3 changes: 3 additions & 0 deletions setup.py
@@ -30,6 +30,8 @@
with open("requirements.txt") as fid:
    install_requires = [line.strip() for line in fid if line.strip()]

extras_require = {"recommender": ["numpy"]}
Contributor

@larsoner , if we add this, I wonder if we should add other optional extras like graphviz and memory_profiler (not in this PR of course)?

Contributor

Probably at some point, yeah


setup(
name="sphinx-gallery",
description=description, # noqa: E501, analysis:ignore
@@ -55,6 +57,7 @@
author="Óscar Nájera",
author_email="najera.oscar@gmail.com",
install_requires=install_requires,
extras_require=extras_require,
python_requires=">=3.8",
license="3-clause BSD",
classifiers=[
39 changes: 39 additions & 0 deletions sphinx_gallery/gen_gallery.py
@@ -18,6 +18,7 @@
import os
import pathlib
from xml.sax.saxutils import quoteattr, escape
from itertools import chain

from sphinx.errors import ConfigError, ExtensionError
import sphinx.util
@@ -39,6 +40,7 @@
from .interactive_example import post_configure_jupyterlite_sphinx
from .interactive_example import create_jupyterlite_contents
from .directives import MiniGallery, ImageSg, imagesg_addnode
from .recommender import ExampleRecommender, _write_recommendations

_KNOWN_CSS = (
"sg_gallery",
@@ -85,6 +87,7 @@ def __call__(self, gallery_conf, script_vars):
"download_all_examples": True,
"abort_on_example_error": False,
"only_warn_on_example_error": False,
"recommender": {"enable": False},
"failing_examples": {},
"passing_examples": [],
"stale_examples": [], # ones that did not need to be run due to md5sum
@@ -663,6 +666,42 @@ def generate_gallery_rst(app):
costs += subsection_costs
write_computation_times(gallery_conf, target_dir, subsection_costs)

# Build recommendation system
if gallery_conf["recommender"]["enable"]:
    try:
        import numpy as np  # noqa: F401
    except ImportError:
        raise ConfigError("gallery_conf['recommender'] requires numpy")

    recommender_params = copy.deepcopy(gallery_conf["recommender"])
    recommender_params.pop("enable")
    recommender_params.pop("rubric_header", None)
Contributor

Good catch! Would we be able to get a test to check config processing?

Contributor Author

Is something similar as done in c69f075 what you had in mind?

Contributor

Yes thanks
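
For readers following this thread, the config processing under discussion can be sketched in isolation as below. `split_recommender_conf` is a hypothetical helper written for this note, not a function from the PR; it only mirrors the deepcopy-and-pop pattern shown in the diff, so that gallery-level keys such as ``rubric_header`` never reach the recommender constructor.

```python
import copy


def split_recommender_conf(conf):
    """Separate gallery-level keys from the recommender keyword arguments.

    Hypothetical helper: mirrors the deepcopy/pop pattern from the diff.
    """
    params = copy.deepcopy(conf)  # leave the user's configuration untouched
    enabled = params.pop("enable")
    # "Related examples" is the documented default header.
    rubric_header = params.pop("rubric_header", "Related examples")
    return enabled, rubric_header, params


conf = {"enable": True, "n_examples": 3, "rubric_header": "See also"}
enabled, rubric_header, params = split_recommender_conf(conf)
```

A check along these lines would assert that ``params`` contains only the keys the recommender accepts and that the original dictionary is unchanged.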

    recommender = ExampleRecommender(**recommender_params)

    gallery_py_files = []
    # root and subsection directories containing python examples
    gallery_directories = [gallery_dir_abs_path] + subsecs
    for current_dir in gallery_directories:
        src_dir = os.path.join(gallery_dir_abs_path, current_dir)
        # sort python files to have a deterministic input across call
        py_files = sorted(
            [
                fname
                for fname in Path(src_dir).iterdir()
                if fname.suffix == ".py"
            ],
            key=gallery_conf["within_subsection_order"](src_dir),
        )
        gallery_py_files.append(
            [os.path.join(src_dir, fname) for fname in py_files]
        )
    # flatten the list of list
    gallery_py_files = list(chain.from_iterable(gallery_py_files))

    recommender.fit(gallery_py_files)
    for fname in gallery_py_files:
        _write_recommendations(recommender, fname, gallery_conf)
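
The file collection step above can be sketched as a standalone function. This is an illustrative simplification under stated assumptions: `collect_py_files` is a hypothetical name, and it sorts by file name instead of using the configurable ``within_subsection_order`` key from the diff.

```python
import tempfile
from pathlib import Path


def collect_py_files(gallery_dir, subsections=()):
    """List the ``.py`` files of a gallery root and its subsections.

    Hypothetical helper: files are sorted by name within each directory so
    that the input to the recommender is deterministic across calls.
    """
    root = Path(gallery_dir)
    files = []
    for directory in [root, *(root / sub for sub in subsections)]:
        files.extend(sorted(p for p in directory.iterdir() if p.suffix == ".py"))
    return files


# Build a throwaway gallery with one subsection and a non-Python file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("plot_b.py", "plot_a.py", "README.txt", "sub/plot_c.py"):
        (root / name).write_text("")
    found = [p.relative_to(root).as_posix() for p in collect_py_files(root, ["sub"])]
```

Only the ``.py`` files are returned, root directory first, which matches the note in the documentation that recommendations are only computed for ``.py`` files.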

# generate toctree with subsections
if gallery_conf["nested_sections"] is True:
subsections_toctree = _format_toctree(
9 changes: 9 additions & 0 deletions sphinx_gallery/gen_rst.py
@@ -226,6 +226,10 @@ def __exit__(self, type_, value, tb):
:download:`Download Jupyter notebook: {0} <{0}>`
"""

RECOMMENDATIONS_INCLUDE = """\n
.. include:: {0}.recommendations
"""


def codestr2rst(codestr, lang="python", lineno=None):
"""Return reStructuredText code block from code string."""
@@ -1467,6 +1471,11 @@ def save_rst_example(

example_rst += CODE_DOWNLOAD.format(example_file.name, language)

if gallery_conf["recommender"]["enable"]:
    # extract the filename without the extension
    recommend_fname = Path(example_fname).stem
    example_rst += RECOMMENDATIONS_INCLUDE.format(recommend_fname)

if gallery_conf["show_signature"]:
    example_rst += SPHX_GLR_SIG
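
To see what this hunk appends to a generated example page, the template can be formatted on its own. The template string is copied from the diff; the example path `examples/plot_sin.py` is a hypothetical value used only for illustration.

```python
from pathlib import Path

# Template copied from the gen_rst.py hunk above.
RECOMMENDATIONS_INCLUDE = """\n
.. include:: {0}.recommendations
"""

example_fname = "examples/plot_sin.py"  # hypothetical example path
# Path.stem drops both the directory and the ".py" extension.
snippet = RECOMMENDATIONS_INCLUDE.format(Path(example_fname).stem)
print(snippet)
```

The resulting ``include`` directive points at a ``.recommendations`` file named after the example, so the recommendation thumbnails are pulled into the page at build time.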
