sphinx index only published cms pages #3708

akostadinov · 2024-02-29T10:41:53Z

We filter out unpublished pages in SearchPresenter anyway, so it is better not to index them to begin with.

Also, globally, don't index anything that falls outside the Sphinx/Manticore index scope.

akostadinov · 2024-02-29T10:42:40Z

test/factories/cms.rb

-    # association is copied to child factory because we reference provider
-    # and it is not yet created, but copying association to this factory fixes it
-    association :provider, :factory => :provider_account


This seems to be outdated, because the factories work fine without it.

That will automatically create a :provider for the :cms_page, using the :provider_account factory, unless you manually provide a :provider. Since a provider is mandatory to create a cms_page, I'd say you shouldn't remove this.

:cms_template already defines this association. Probably in the past there was some inheritance issue and it had to be duplicated here as well.

partial improvement for THREESCALE-10793

mayorova · 2024-03-01T13:17:41Z

app/indices/cms_page_index.rb

@@ -8,5 +8,5 @@
  has tenant_id, type: :bigint

  indexes :published
-  scope { CMS::Page.where(searchable: true) }
+  scope { CMS::Page.where(searchable: true).where.not(published: nil) }


Maybe just add return false if published.nil? here

porta/app/models/cms/page.rb

Lines 68 to 83 in e6cf20e

  def is_searchable?
    case
    when ! mime_type.html?
      false

    when liquid_enabled?
      template = Liquid::Template.parse(published)
      nodelist = template.instance_variable_get("@root").instance_variable_get("@nodelist")

      nodelist.none?{ |i| i.is_a?(Liquid::Tag) and not i.is_a?(Liquid::Include) }

    else
      true

    end
  end

This seems to be intended to set the searchable attribute. And it shouldn't depend on whether the page is published or not.

Yeah, but I mean that searchable defines some criteria for when a page should be "searchable". We just add a new criterion here, "the page must be searchable if it's published", so I just suggested putting it all in the same bucket. But that's fine as it is.

It makes logical sense. But then we would need to ensure that this value changes reliably whenever a page is published or unpublished, and I'm not really keen on making sure all the details are correct. It also follows the previous (bad?) example of filtering out unpublished pages separately in the presenter.

What @mayorova says makes sense, and apparently it wouldn't require any additional change, as the model includes a callback to update the searchable value on every change:

porta/app/models/cms/page.rb

Line 29 in e6cf20e

before_save :mark_for_searchability

If they would move in sync, then what is the point to have both of these?

This searchable attribute doesn't make much sense to me to begin with. To make it right IMO a bigger refactoring would be needed.

And I don't find it a priority to make these attributes make sense. It requires attention that is currently due elsewhere.

For now I think this is an improvement and we can have a JIRA to sanitize these attributes.

If the would move in sync, then what is the point to have both of these?

I don't understand this.

"they"

I meant: if both attributes are changed together, then why keep both of them? I think this searchable attribute is probably redundant, and we could just define the proper index scope to skip the elements that currently define it. Why keep a separate field in the database when the index can have a scope?

But again, not something I find very useful to spend time on now.

Understood, thanks.
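To make the two options from this thread concrete, here is a minimal plain-Ruby sketch. It does not use ActiveRecord or the real CMS::Page model; PageRow, in_index_scope? and searchable_value are hypothetical names made up for illustration, and it assumes (as the new scope suggests) that an unpublished page has published set to nil.

```ruby
# Hypothetical stand-in for a CMS page row; not the real CMS::Page model.
PageRow = Struct.new(:searchable, :published, keyword_init: true)

# Option A (what this PR does): leave `searchable` alone and add the
# "published is not nil" condition to the index scope.
def in_index_scope?(page)
  page.searchable && !page.published.nil?
end

# Option B (suggested in review): fold the published check into the value
# that gets stored in `searchable`. `passes_existing_criteria` stands in
# for the result of the existing is_searchable? checks.
def searchable_value(published:, passes_existing_criteria:)
  return false if published.nil?

  passes_existing_criteria
end

draft = PageRow.new(searchable: true, published: nil)
live  = PageRow.new(searchable: true, published: "<p>Hello</p>")

puts in_index_scope?(draft)
puts in_index_scope?(live)
puts searchable_value(published: nil, passes_existing_criteria: true)
```

With option A the stored searchable flag can stay true for a draft and the scope still keeps it out of the index. With option B the stored flag has to be recomputed every time published changes, which is the reliability concern raised above.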

mayorova · 2024-03-01T13:42:25Z

app/presenters/search_presenters.rb

    end

    def search
+      return @search if @search


Hmm... I don't understand this change... Would the rest of the method ever execute? 🤔 (i.e. in which cases @search would not be present?)

Just normal memoization: before the method is called for the first time, the variable is nil.
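A minimal sketch of the memoization pattern in question, in plain Ruby. SearchSketch, expensive_lookup and the calls counter are hypothetical and exist only for illustration; only the guard line mirrors the diff.

```ruby
# Hypothetical stand-in for the presenter; only the guard clause mirrors the diff.
class SearchSketch
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def search
    # On the first call @search is nil, so the guard is skipped
    # and the rest of the method runs.
    return @search if @search

    @calls += 1 # counts how often the expensive part actually runs
    @search = expensive_lookup
  end

  private

  # Placeholder for the real search query.
  def expensive_lookup
    %w[page-1 page-2]
  end
end

sketch = SearchSketch.new
sketch.search
sketch.search
puts sketch.calls
```

One caveat with this style of guard: if the computed value is nil or false, the guard never short-circuits and the body runs on every call.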

Co-authored-by: Daria Mayorova <mayorova@users.noreply.github.com>

jlledom

I think we lack a test to actually verify the behavior implemented by this PR: that existing but unpublished CMS pages aren't indexed.

jlledom · 2024-03-13T13:46:10Z

test/factories/cms.rb

-    # association is copied to child factory because we reference provider
-    # and it is not yet created, but copying association to this factory fixes it
-    association :provider, :factory => :provider_account


That will automatically create a :provider for the :cms_page, using the :provider_account factory, unless you manually provide a :provider. Since a provider is mandatory to create a cms_page, I'd say you shouldn't remove this.

jlledom · 2024-03-13T13:57:11Z

app/indices/cms_page_index.rb

@@ -8,5 +8,5 @@
  has tenant_id, type: :bigint

  indexes :published
-  scope { CMS::Page.where(searchable: true) }
+  scope { CMS::Page.where(searchable: true).where.not(published: nil) }


What @mayorova says makes sense, and apparently it wouldn't require any additional change, as the model includes a callback to update the searchable value on every change:

porta/app/models/cms/page.rb

Line 29 in e6cf20e

before_save :mark_for_searchability

jlledom · 2024-03-13T14:05:13Z

app/workers/sphinx_account_indexation_worker.rb

+  def reindex(instance)
+    ThinkingSphinx::Processor.new(instance: instance).upsert
+  end
+
+  def delete_from_index(model, *ids)
+    ids.each do |id|
+      ThinkingSphinx::Processor.new(model: model, id: id).delete
+    end
+  end


I would need some help to understand these changes and how they relate to CMS pages.

This code was removed from parent class, so it had to move here. No need to keep these methods in parent if they are not used there.

And the changes you made in the parent and in initializers/sphinx.rb don't make this unnecessary?

No, because here we also remove the associated buyers from the index when the provider is marked for deletion (but not yet actually deleted, to allow their own callbacks to kick in).

jlledom · 2024-03-13T14:05:31Z

app/workers/sphinx_indexation_worker.rb

-    ids.each do |id|
-      ThinkingSphinx::Processor.new(model: model, id: id).delete
-    end
+    ThinkingSphinx::Processor.new(model: model, id: id).stage


What does this do, and how does it replace the old code?

See the stage method implementation. It implements this upsert/delete in a better way, based on index scope, not only on the presence of the object.

jlledom · 2024-03-13T14:11:06Z

config/initializers/sphinx.rb

+# implement conditionally inserting or deleting from the index
+# see https://github.com/pat/thinking-sphinx/pull/1258
+ThinkingSphinx::Processor.include(Module.new do
+  def stage
+    real_time_indices.each do |index|
+      found = index.scope.find_by(model.primary_key => id)
+
+      if found
+        ThinkingSphinx::RealTime::Transcriber.new(index).copy found
+      else
+        ThinkingSphinx::Deletion.perform(index, id)
+      end
+    end
+  end
+end)


I read the linked PR and now I'm even more confused. Could you please provide a global picture of what this is for?

When the object is found in the ActiveRecord scope, it is indexed in Manticore; otherwise it is removed from the index.
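The idea can be sketched without ThinkingSphinx at all. The following is an in-memory illustration only: Page, FakeIndex, scope_find and documents are hypothetical names, the hash stands in for the Manticore index, and the scope condition is copied from the cms_page_index.rb diff.

```ruby
# In-memory stand-ins; none of these classes exist in ThinkingSphinx or porta.
Page = Struct.new(:id, :searchable, :published, keyword_init: true)

class FakeIndex
  attr_reader :documents

  def initialize(records)
    @records = records # plays the role of the database table
    @documents = {}    # plays the role of the search index
  end

  # Mirrors the index scope from the diff: searchable and published not nil.
  def scope_find(id)
    @records.find { |r| r.id == id && r.searchable && !r.published.nil? }
  end

  # Upsert when the record is inside the scope, delete otherwise.
  def stage(id)
    found = scope_find(id)

    if found
      @documents[id] = found.published
    else
      @documents.delete(id)
    end
  end
end

records = [
  Page.new(id: 1, searchable: true, published: "<h1>Docs</h1>"),
  Page.new(id: 2, searchable: true, published: nil)
]
index = FakeIndex.new(records)
records.each { |r| index.stage(r.id) }
p index.documents.keys # only the published page is indexed

records[0].published = nil # unpublish page 1
index.stage(1)
p index.documents.keys # page 1 has been removed again
```

The caller no longer decides between "reindex" and "delete"; it always calls stage with an id, and the scope decides. A record that was deleted from the table and a record that merely dropped out of the scope are handled the same way.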

akostadinov · 2024-04-13T10:17:24Z

I think we lack a test to actually verify the behavior implemented by this PR: that existing but unpublished CMS pages aren't indexed.

We have tests that verify that a record matching the scope is indexed, and that a record not found in the scope is deindexed. It would be redundant to write a dedicated CMS one.

https://github.com/3scale/porta/pull/3708/files?diff=unified&w=0#diff-b457d353bc438cc33ec35d44c791aa37721418b3588608b31e6725c51cae2038R92

jlledom · 2024-04-15T15:56:38Z

app/workers/sphinx_account_indexation_worker.rb

+  def reindex(instance)
+    ThinkingSphinx::Processor.new(instance: instance).upsert
+  end
+
+  def delete_from_index(model, *ids)
+    ids.each do |id|
+      ThinkingSphinx::Processor.new(model: model, id: id).delete
+    end
+  end


And the changes you made in the parent and in initializers/sphinx.rb don't make this unnecessary?

jlledom · 2024-04-15T16:09:05Z

app/indices/cms_page_index.rb

@@ -8,5 +8,5 @@
  has tenant_id, type: :bigint

  indexes :published
-  scope { CMS::Page.where(searchable: true) }
+  scope { CMS::Page.where(searchable: true).where.not(published: nil) }


If the would move in sync, then what is the point to have both of these?

I don't understand this.

akostadinov self-assigned this Feb 29, 2024

akostadinov commented Feb 29, 2024

akostadinov force-pushed the manticore branch from b27b0ef to 8ac5284 Compare February 29, 2024 10:43

optimize CMS::Page search scope and memoization

6ce4d60

akostadinov force-pushed the manticore branch 2 times, most recently from 1ca4d89 to a562646 Compare February 29, 2024 18:50

index should upsert/delete based on scope

2cd01cc

partial improvement for THREESCALE-10793

akostadinov force-pushed the manticore branch from a562646 to 2cd01cc Compare February 29, 2024 19:18

mayorova reviewed Mar 1, 2024

mayorova reviewed Mar 1, 2024

mayorova reviewed Mar 1, 2024

style

b0a6354

Co-authored-by: Daria Mayorova <mayorova@users.noreply.github.com>

jlledom reviewed Mar 13, 2024

github-actions bot added the Stale label Apr 13, 2024

3scale deleted a comment from github-actions bot Apr 13, 2024

github-actions bot removed the Stale label Apr 14, 2024

jlledom approved these changes Apr 15, 2024

github-actions bot added the Stale label May 17, 2024

akostadinov removed the Stale label May 17, 2024

3scale deleted a comment from github-actions bot May 17, 2024

akostadinov requested a review from mayorova May 29, 2024 21:56

akostadinov merged commit 42c6152 into 3scale:master May 29, 2024
17 of 21 checks passed
