FEA Add examples recommender system #1125

Merged 97 commits on Nov 17, 2023
Changes from 15 commits

Commits (97):
28f4461
FEA Add examples recommender system
Apr 5, 2023
d27c9fe
Add recommend_examples to gallery_config
Apr 5, 2023
0e027a7
Use rubric directive instead of title
Apr 6, 2023
d74b825
Set number of recommended examples from gallery_config
Apr 6, 2023
7b9959c
Add parameter validation for n_examples
Apr 6, 2023
81f742e
Run black
Apr 6, 2023
c9c4589
Clean commented code
Apr 6, 2023
cbc55cd
Apply suggestions from https://github.com/ArturoAmorQ/sphinx-gallery/…
Apr 7, 2023
fc5b8a4
Apply suggestions from code review
ArturoAmorQ Apr 11, 2023
8624f68
Some fixes
Apr 11, 2023
d042591
Fix format
Apr 11, 2023
8a04e69
Rename data_dict into data
Apr 11, 2023
29fe98b
Simplify for loop
Apr 11, 2023
4a23db1
Simplify dtype
Apr 11, 2023
5b1b757
Change notation and divide inplace
Apr 11, 2023
fa2389a
Apply suggestions from code review
ArturoAmorQ Apr 17, 2023
3c8e957
Support dense matrices only
Apr 17, 2023
9b4f6f0
Rename tfidf_transformer into compute_tf_idf
Apr 17, 2023
2d217e3
Update documentation to dense matrix support only
Apr 17, 2023
1d611dc
Fix bugs
Apr 19, 2023
7d1f621
Change default value for backward compatibility
Apr 19, 2023
aeb5489
Fix format
Apr 20, 2023
108f523
Add entry to configuration.rst
Apr 20, 2023
ed7c6fc
Import numpy only if recommender is enabled
Apr 21, 2023
a52b5c1
Fix format
Apr 21, 2023
9b5ef86
Fix bugs
Apr 21, 2023
a2cc203
Apply suggestions from code review
ArturoAmorQ Apr 24, 2023
8dd83d0
Add missing import
Apr 24, 2023
8672914
Fix format
May 3, 2023
94e5a40
Merge branch 'master' into recommender_system
ArturoAmorQ Aug 3, 2023
bbe8a20
Fix format
Aug 3, 2023
e7be84c
Merge branch 'master' into recommender_system
larsoner Aug 9, 2023
9b2ec7c
Add test for recommended files
Aug 10, 2023
6231bca
FIX: importorskip
larsoner Aug 10, 2023
9c56f62
Merge branch 'master' into recommender_system
larsoner Aug 10, 2023
c6ea293
FIX: More complete test
larsoner Aug 10, 2023
12fd707
FIX: Path
larsoner Aug 10, 2023
581c7f4
FIX: No need
larsoner Aug 10, 2023
cd01604
Solve conflicts
Aug 21, 2023
904be25
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 21, 2023
72526be
Fix file header
Aug 21, 2023
ec7ed38
Merge branch 'recommender_system' of github.com:ArturoAmorQ/sphinx-ga…
Aug 21, 2023
40bb513
Merge branch 'master' of https://github.com/sphinx-gallery/sphinx-gal…
Aug 21, 2023
8706c81
Clean code
Aug 21, 2023
8d11a7a
Ensure that summary line fits on one line
Aug 21, 2023
b1083a7
Add test for html render of n_examples
Aug 21, 2023
083c541
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 21, 2023
30dc03d
Remove possible support for backrefs tokenizer
Aug 23, 2023
00e78c8
Fix wrong test comment
Aug 23, 2023
0714ef0
Prefer explicit nested numpy imports
Aug 23, 2023
1c8351d
Add support to min_df
Aug 23, 2023
c29c621
Merge branch 'master' into recommender_system
ArturoAmorQ Aug 23, 2023
69966a2
Add support for float min_df
Aug 25, 2023
c0dee6f
Modify doc on min_df accordingly
Aug 25, 2023
a8daadc
Add support for max_df
Aug 25, 2023
a06f026
Simplify use of default values
Aug 25, 2023
92aeff5
Improve documentation in the user guide
Aug 25, 2023
3feac7b
Fix ruff errors
Aug 25, 2023
8dea184
Iter
Aug 25, 2023
5289260
Update sphinx_gallery/recommender.py
ArturoAmorQ Aug 25, 2023
0dafc32
Add link to wikipedia for TF-IDF
Aug 25, 2023
a214c05
Change default values for ExampleRecommender
Aug 25, 2023
829ee56
Fix conflicts
Oct 23, 2023
76c0a81
Apply suggestions from code review
ArturoAmorQ Oct 23, 2023
4c7dafb
Iter on suggestions
Oct 23, 2023
f5d849f
Iter on conflict solving
Oct 23, 2023
2a21ce6
Change iterator variable name
Oct 23, 2023
a3e98eb
Prefer Path over os.listdir
Oct 23, 2023
d1ec903
Try import numpy on gen_gallery
Oct 23, 2023
6464c4f
Carry parameter validation before vectorizing
Oct 24, 2023
67dccf0
Remove nested numpy imports
Oct 24, 2023
8092e14
Fix conflicts
Oct 24, 2023
eae2a4a
Revert removal of nested imports
Oct 24, 2023
3ba4d26
Add importorskip to test
Oct 24, 2023
494d177
Use clearer variable names
Oct 25, 2023
1189948
Make docstring clearer
Oct 25, 2023
96c557a
Attempt to make test_rebuild more lenient
Oct 25, 2023
1859608
Iter on making test_rebuild more lenient
Oct 26, 2023
fbea244
Make rubric header customizable
Oct 26, 2023
a56e6ca
Merge branch 'master' into recommender_system
ArturoAmorQ Oct 27, 2023
38404d4
Merge branch 'master' into recommender_system
larsoner Oct 27, 2023
063df21
Improve rubric header description
Nov 6, 2023
44d2885
Add test for dict_vectorizer and compute_tf_idf
Nov 6, 2023
878772a
Merge branch 'recommender_system' of github.com:ArturoAmorQ/sphinx-ga…
Nov 6, 2023
dd762b1
Update sphinx_gallery/recommender.py
ArturoAmorQ Nov 6, 2023
767c6cb
Remove rubric_header from parameters of recommender if exists
Nov 6, 2023
a16867b
Merge branch 'recommender_system' of github.com:ArturoAmorQ/sphinx-ga…
Nov 6, 2023
d5398f3
Factorize tests into a single file
Nov 8, 2023
43b97ed
Keep rubric directive non-editable
Nov 8, 2023
c69f075
Add test for custom header
Nov 9, 2023
8e11051
Handle existing directory correctly
Nov 9, 2023
dc32406
Apply suggestions from code review
ArturoAmorQ Nov 13, 2023
e5e0045
Prefer Path over os.path
Nov 13, 2023
2d3cb5d
Move test_recommend_n_examples back to test_full
Nov 13, 2023
488f64d
improve doc, test same examples
lucyleeow Nov 16, 2023
f591624
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 16, 2023
53e2740
Merge branch 'master' into recommender_system
lucyleeow Nov 17, 2023
32 changes: 32 additions & 0 deletions sphinx_gallery/gen_gallery.py
@@ -20,6 +20,7 @@
import os
import pathlib
from xml.sax.saxutils import quoteattr, escape
from itertools import chain

from sphinx.errors import ConfigError, ExtensionError
import sphinx.util
@@ -40,6 +41,7 @@
from .interactive_example import post_configure_jupyterlite_sphinx
from .interactive_example import create_jupyterlite_contents
from .directives import MiniGallery, ImageSg, imagesg_addnode
from .recommender import ExampleRecommender, _write_recommendations

_KNOWN_CSS = ('sg_gallery', 'sg_gallery-binder', 'sg_gallery-dataframe',
'sg_gallery-rendered-html')
@@ -76,6 +78,7 @@ def __call__(self, gallery_conf, script_vars):
'download_all_examples': True,
'abort_on_example_error': False,
'only_warn_on_example_error': False,
'recommender': {'enable': True, 'n_examples': 5},
'failing_examples': {},
'passing_examples': [],
'stale_examples': [], # ones that did not need to be run due to md5sum
@@ -616,6 +619,35 @@ def generate_gallery_rst(app):
gallery_conf, target_dir, subsection_costs
)

# Build recommendation system
if gallery_conf["recommender"]["enable"]:
n_examples = gallery_conf["recommender"]["n_examples"]
recommender = ExampleRecommender(n_examples=n_examples)

gallery_py_examples = []
# root and subsection directories containing python examples
gallery_directories = [gallery_dir_abs_path] + subsecs
for current_dir in gallery_directories:
src_dir = os.path.join(gallery_dir_abs_path, current_dir)
# sort python files to have a deterministic input across call
py_files = sorted(
[
fname
for fname in os.listdir(src_dir)
if os.path.splitext(fname)[1] == ".py"  # splitext keeps the leading dot
],
key=gallery_conf["within_subsection_order"](src_dir),
)
gallery_py_examples.append(
[os.path.join(src_dir, fname) for fname in py_files]
)
# flatten the list of list
gallery_py_examples = list(chain.from_iterable(gallery_py_examples))

recommender.fit(gallery_py_examples)
for fname in gallery_py_examples:
_write_recommendations(recommender, fname, gallery_conf)

# generate toctree with subsections
if gallery_conf["nested_sections"] is True:
subsections_toctree = _format_toctree(
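
The file-collection step in this hunk depends on how the extension is compared: `os.path.splitext` returns the suffix with its leading dot, so the match has to be against ".py". Below is a minimal sketch of the same collection written with `pathlib`, which later commits in this PR move towards. The helper name `collect_py_examples` and the directory layout are hypothetical, and plain alphabetical sorting stands in for `within_subsection_order`.

```python
import tempfile
from itertools import chain
from pathlib import Path


def collect_py_examples(gallery_dirs):
    """Return a flat, deterministically ordered list of .py files.

    Hypothetical helper: plain name sorting stands in for the
    ``within_subsection_order`` key used in the diff.
    """
    per_dir = (
        sorted(p for p in Path(d).iterdir() if p.suffix == ".py")
        for d in gallery_dirs
    )
    # Flatten the per-directory lists into a single list of paths.
    return [str(p) for p in chain.from_iterable(per_dir)]


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    sub = Path(root) / "sub"
    sub.mkdir()
    for name in ("plot_b.py", "plot_a.py", "README.txt"):
        (Path(root) / name).write_text("")
    (sub / "plot_c.py").write_text("")
    found = [Path(f).name for f in collect_py_examples([root, sub])]

print(found)
```

Only the `.py` files are kept, in a stable order per directory, which is what the recommender needs to get the same input on every build.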
11 changes: 11 additions & 0 deletions sphinx_gallery/gen_rst.py
@@ -177,6 +177,11 @@ def __exit__(self, type_, value, tb):
<br />"""


RECOMMENDATIONS_INCLUDE = """\n
.. include:: {0}.recommendations
"""


def codestr2rst(codestr, lang='python', lineno=None):
"""Return reStructuredText code block from code string."""
if lineno is not None:
@@ -1321,6 +1326,12 @@ def save_rst_example(example_rst, example_file, time_elapsed,
binder_badge_rst,
ref_fname,
jupyterlite_rst)

if gallery_conf["recommender"]["enable"]:
# extract the filename without the extension
recommendation_fname = os.path.splitext(os.path.split(example_fname)[1])[0]
example_rst += RECOMMENDATIONS_INCLUDE.format(recommendation_fname)

if gallery_conf['show_signature']:
example_rst += SPHX_GLR_SIG

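
To make the effect of this hunk concrete, here is a small sketch of how the include snippet is derived from an example's file name. The template is copied from the diff; the helper `recommendations_include` and the sample path are hypothetical.

```python
from pathlib import Path

# Same template as in the gen_rst.py hunk.
RECOMMENDATIONS_INCLUDE = """\n
.. include:: {0}.recommendations
"""


def recommendations_include(example_fname):
    """Build the include snippet for one example (hypothetical helper)."""
    # Path.stem drops both the directory and the ".py" extension.
    return RECOMMENDATIONS_INCLUDE.format(Path(example_fname).stem)


snippet = recommendations_include("auto_examples/plot_demo.py")
print(snippet)
```

The generated RST for `plot_demo.py` therefore pulls in a sibling file named `plot_demo.recommendations`, which `_write_recommendations` is expected to have written beforehand.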
308 changes: 308 additions & 0 deletions sphinx_gallery/recommender.py
@@ -0,0 +1,308 @@
# -*- coding: utf-8 -*-
# Author: Arturo Amor
# License: 3-clause BSD
"""
Recommendation system generator
===============================

Generate recommendations based on TF-IDF representation and a KNN model.
"""
import numbers
# import pickle
import re
from collections import defaultdict
from pathlib import Path

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import norm

from .backreferences import (
_thumbnail_div,
THUMBNAIL_PARENT_DIV,
THUMBNAIL_PARENT_DIV_CLOSE,
)
from .py_source_parser import split_code_and_text_blocks
from .gen_rst import extract_intro_and_title


class ExampleRecommender:
"""Compute content-based KNN-TF-IDF recommendation system.

Parameters
----------
n_examples : int, default=5
Number of most relevant examples to display.

tokenizer : {"raw", "backrefs"}, default="raw"
The type of tokenizer to use. If "raw", the raw text will be used as
tokens. If "backrefs", the list of sphinx-gallery backreferences will
be used as tokens.

Attributes
----------
file_names_ : list of str
The list of file names used for computing the similarity matrix.
The recommended examples will be chosen among this list.

similarity_matrix_ : ndarray of shape (n_samples, n_samples)
Fitted matrix of pairwise cosine similarities.
"""

def __init__(self, *, n_examples=5, tokenizer="raw"):
self.n_examples = n_examples
self.tokenizer = tokenizer

@staticmethod
def token_freqs(doc):
"""Extract a dict mapping raw tokens from doc to their occurrences."""
token_generator = (tok.lower() for tok in re.findall(r"\w+", doc))
return ExampleRecommender.dict_freqs(token_generator)

@staticmethod
def dict_freqs(doc):
"""Extract a dict mapping list of tokens to their occurrences."""
freq = defaultdict(int)
for tok in doc:
freq[tok] += 1
return freq

@staticmethod
def dict_vectorizer(data):
"""Convert a dictionary of feature arrays into a sparse matrix.

Parameters
----------
data : list of dict
An iterable of dictionaries of feature arrays, where each key
corresponds to a feature name, and each value is an array of feature
values.

Returns
-------
X : sparse matrix
A sparse matrix in CSR format of shape (n_samples, n_features) where
n_samples is the number of samples in the dataset and n_features is the
total number of features across all samples.
"""
feature_names = []
all_values = defaultdict(list)
for row in data:
for feature_name, feature_value in row.items():
feature_names.append(feature_name)
all_values[feature_name].append(feature_value)

feature_names = sorted(set(feature_names))
# use a distinct name so the `data` argument is not shadowed before the loop
values, indices, indptr = [], [], [0]
for row in data:
for j, feature_name in enumerate(feature_names):
if feature_name in row:
feature_value = row[feature_name]
values.append(feature_value)
indices.append(j)
indptr.append(len(indices))
X = sparse.csr_matrix(
(values, indices, indptr), shape=(len(indptr) - 1, len(feature_names))
)
return X

@staticmethod
def tfidf_transformer(X):
"""Transform a term frequency matrix into a term frequency-inverse document
frequency (TF-IDF) matrix.

Parameters
----------
X : {ndarray, sparse matrix} of shape (n_samples, n_features)
A term frequency matrix.

Returns
-------
X_tfidf : {ndarray, sparse matrix} of shape (n_samples, n_features)
A tf-idf matrix of the same shape as X.
"""
if not sparse.issparse(X):
X = sparse.csr_matrix(X, dtype=X.dtype)

n_samples, n_features = X.shape

# Count the number of non-zero values for each feature in sparse X
if sparse.isspmatrix_csr(X):
df = np.bincount(X.indices, minlength=n_features)
else:
df = np.diff(X.indptr)
df = df.astype(X.dtype, copy=False)
# perform idf smoothing
df += 1
n_samples += 1
idf = np.log(n_samples / df) + 1

idf_diag = sparse.diags(
idf,
offsets=0,
shape=(n_features, n_features),
format="csr",
dtype=X.dtype,
)
X_tfidf = X * idf_diag
X_tfidf = (X_tfidf.T / norm(X_tfidf, axis=1)).T
X_tfidf = sparse.csr_matrix(X_tfidf, dtype=X.dtype)

return X_tfidf

@staticmethod
def cosine_similarity(X, Y=None, dense_output=True):
"""
Compute the cosine similarity between the samples in X and Y.

Parameters
----------
X : {ndarray, sparse matrix} of shape (n_samples_X, n_features)
Input data.

Y : {ndarray, sparse matrix} of shape (n_samples_Y, n_features), default=None
Input data. If `None`, the output will be the pairwise
similarities between all samples in `X`.

dense_output : bool, default=True
Whether to return dense output even when the input is sparse. If
`False`, the output is sparse if both input arrays are sparse.

Returns
-------
cosine_similarity : ndarray of shape (n_samples_X, n_samples_Y)
Cosine similarity matrix.
"""

if Y is X or Y is None:
Y = X

X_normalized = X / norm(X)
if X is Y:
Y_normalized = X_normalized
else:
Y_normalized = Y / norm(Y)

X_normalized = sparse.csr_matrix(X_normalized, dtype=X.dtype)
similarity = X_normalized @ Y_normalized.T

if dense_output and hasattr(similarity, "toarray"):
return similarity.toarray()
return similarity

def fit(self, file_names):
"""
Compute the similarity matrix of a group of documents.

Parameters
----------
file_names : list or generator of file names.

Returns
-------
self : object
Fitted recommender.
"""
n_examples = self.n_examples
known_tokenizers = {"raw", "backrefs"}
if self.tokenizer not in known_tokenizers:
raise ValueError(
f"Unknown tokenizer {self.tokenizer}. "
f"Expected one of {known_tokenizers}."
)
if not isinstance(n_examples, numbers.Integral):
raise ValueError("n_examples must be an integer")
elif n_examples < 1:
raise ValueError("n_examples must be strictly positive")

if self.tokenizer == "raw":
frequency_func = self.token_freqs
counts_matrix = self.dict_vectorizer(
[frequency_func(Path(fname).read_text()) for fname in file_names]
)
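else: # self.tokenizer == "backrefs"
frequency_func = self.dict_freqs
# the "backrefs" path is still a commented-out draft; fail early rather
# than reach the TF-IDF step with counts_matrix undefined
raise NotImplementedError('tokenizer="backrefs" is not implemented yet')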
# backrefs_list = []
# for fname in file_names:
# pickle_file = fname[:-3] + "_codeobj.pickle"
# try:
# with open(pickle_file, "rb") as f:
# names = pickle.load(f)
# back_references = [
# name.split("_codeobj")[0] for name in names.keys()
# ]
# except:
# back_references = []
# continue
# backrefs_list.append(back_references)
# counts_matrix = dict_vectorizer(
# [frequency_func(backref) for backref in backrefs_list]
# )

tfidf_matrix = self.tfidf_transformer(counts_matrix)
self.similarity_matrix_ = self.cosine_similarity(tfidf_matrix)
self.file_names_ = file_names
return self


def predict(self, file_name):
"""Compute the `n_examples` most similar documents to the query.

Parameters
----------
file_name : str
Name of the file corresponding to the query index `item_id`.

Returns
-------
recommendations : list of str
Name of the files most similar to the query.
"""
item_id = self.file_names_.index(file_name)
similar_items = list(enumerate(self.similarity_matrix_[item_id]))
sorted_items = sorted(similar_items, key=lambda x: x[1], reverse=True)

# Get the top k items similar to item_id
top_k_items = [index for index, _ in sorted_items[1 : self.n_examples + 1]]
recommendations = [self.file_names_[index] for index in top_k_items]
return recommendations
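
For readers following the review, the `fit`/`predict` pipeline above (token counts, smoothed TF-IDF, cosine similarity, top-k ranking) can be sketched in a few lines. This is an illustration only, not the code under review: it uses the standard library instead of NumPy/SciPy, the function names `tfidf_vectors` and `recommend` are hypothetical, and it assumes every document contains at least one token.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalised {token: weight} dict per document."""
    counts = [Counter(t.lower() for t in re.findall(r"\w+", d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for c in counts for tok in c)
    vectors = []
    for c in counts:
        # Smoothed idf, as in the diff: ln((1 + n) / (1 + df)) + 1.
        vec = {
            t: tf * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def recommend(docs, query_idx, n_examples=1):
    """Indices of the n_examples documents most similar to the query."""
    vectors = tfidf_vectors(docs)
    query = vectors[query_idx]
    # For unit vectors the cosine similarity is just the dot product.
    scores = [
        (sum(w * other.get(t, 0.0) for t, w in query.items()), idx)
        for idx, other in enumerate(vectors)
        if idx != query_idx
    ]
    scores.sort(key=lambda pair: -pair[0])
    return [idx for _, idx in scores[:n_examples]]


docs = [
    "plot a sine wave with matplotlib",
    "plot a cosine wave with matplotlib",
    "train a linear model on text data",
]
print(recommend(docs, query_idx=0))  # prints [1]
```

One design difference is worth noting: the sketch excludes the query by index, whereas `predict` above drops the first entry of the sorted list, which assumes the query always ranks first against itself.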


def _write_recommendations(recommender, fname, gallery_conf):
"""Generate `.recommendations` RST file for a given example.

Parameters
----------
recommender : ExampleRecommender
Instance of a fitted ExampleRecommender.

fname : str
Path to the example file.

gallery_conf : dict
Configuration dictionary for the sphinx-gallery extension.
"""
path_fname = Path(fname)
recommendation_fname = f"{path_fname.parent / path_fname.stem}.recommendations"
recommended_examples = recommender.predict(fname)

with open(recommendation_fname, "w", encoding="utf-8") as ex_file:
ex_file.write("\n\n.. rubric:: Related examples\n")

Review comment (Contributor):

Based on @betatim's suggestion in the linked scikit-learn PR.

Suggested change
ex_file.write("\n\n.. rubric:: Related examples\n")
ex_file.write(
"\n\n.. rubric:: Automatically generated list of related examples\n"
)

Maybe you should make this rubric header user-settable via a configuration parameter instead of hard-coding it in the Python code.

Suggested change
ex_file.write("\n\n.. rubric:: Related examples\n")
default_rubric_header = (
".. rubric:: Automatically generated list of related examples"
)
rubric_header = gallery_conf['recommender'].get(
'rubric_header', default_rubric_header
)
ex_file.write(f"\n\n{rubric_header}\n")

This way it gives full freedom to customize this later (e.g. add a tooltip or a small paragraph to explain how the examples are generated).
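
A minimal sketch of the lookup suggested in this comment, separated from the file writing so it can be tried on its own. The `rubric_header` key comes from the suggestion; the helper name `rubric_block` and the sample configuration values are hypothetical.

```python
DEFAULT_RUBRIC_HEADER = (
    ".. rubric:: Automatically generated list of related examples"
)


def rubric_block(gallery_conf):
    """Return the header text to write, honouring a user override."""
    recommender_conf = gallery_conf.get("recommender", {})
    header = recommender_conf.get("rubric_header", DEFAULT_RUBRIC_HEADER)
    return f"\n\n{header}\n"


# Without an override the default header is used.
default_block = rubric_block({"recommender": {"enable": True}})

# With an override the user-provided RST is written verbatim.
custom_block = rubric_block(
    {"recommender": {"enable": True, "rubric_header": ".. rubric:: See also"}}
)
print(default_block, custom_block)
```

Because the override is inserted as raw RST, a user could replace the rubric with any directive or paragraph, which is the flexibility the comment asks for.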

ex_file.write(THUMBNAIL_PARENT_DIV)
for example_fname in recommended_examples:
example_path = Path(example_fname)
_, script_blocks = split_code_and_text_blocks(
example_fname, return_node=False
)
intro, title = extract_intro_and_title(fname, script_blocks[0][1])
ex_file.write(
_thumbnail_div(
example_path.parent,
gallery_conf["src_dir"],
example_path.name,
intro,
title,
is_backref=True,
)
)
ex_file.write(THUMBNAIL_PARENT_DIV_CLOSE)

Review comment (Contributor):

We could also insert a user-customizable rubric footer here, e.g. to link to the API documentation of the recommender system on the sphinx-gallery documentation website.
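
One way the footer idea could look, as a sketch under stated assumptions: the `rubric_footer` key and the helper `write_with_footer` are hypothetical and not part of this PR, and an in-memory buffer stands in for the `.recommendations` file.

```python
import io


def write_with_footer(ex_file, thumbnails_rst, recommender_conf):
    """Write the thumbnail block, then an optional user-provided footer."""
    ex_file.write(thumbnails_rst)
    footer = recommender_conf.get("rubric_footer")  # hypothetical option
    if footer:
        ex_file.write(f"\n\n{footer}\n")


with_footer = io.StringIO()
write_with_footer(
    with_footer,
    "<thumbnails>",
    {"rubric_footer": "See the recommender documentation for details."},
)

without_footer = io.StringIO()
write_with_footer(without_footer, "<thumbnails>", {})

print(with_footer.getvalue())
```

Leaving the key unset writes nothing extra, so existing galleries would render exactly as before.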