
PR/historical branches are getting indexed by Google #3645

Open
chalin opened this issue Dec 5, 2023 · 3 comments

@chalin (Contributor) commented Dec 5, 2023

Originally posted by @thesuperzapper in #3628 (comment):

@chalin Also, all our PR/historical branches are getting indexed by Google; we should fix that at the same time as this PR.

The goals would be:

  1. The main www.kubeflow.org site should be indexed
  2. All PR preview sites (deploy-preview-XXXX--competent-brattain-de2d6d.netlify.app) should NOT be indexed
  3. All other versioned branch sites (like v1-7-branch.kubeflow.org) should NOT be indexed:
    • (these are just CNAME records pointing to the branch domains like v1-7-branch--competent-brattain-de2d6d.netlify.app)

I believe your changes here achieve goal 2: because you are setting -e dev in the hugo command, the build is not "production", so Docsy adds noindex <meta> tags.
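
For context, a sketch of how that looks in netlify.toml (the exact build command in the PR may differ; the only part that matters here is the -e dev flag):

```toml
# netlify.toml (sketch) -- deploy previews build with a non-production Hugo
# environment, so Docsy emits its noindex <meta> tag on every page.
[context.deploy-preview]
  command = "hugo -e dev"
```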

We need to be careful about goal 1: are you 100% confident that not setting -e production (or HUGO_ENV=production) is safe?

To achieve goal 3, we could set HUGO_ENV to dev via [context.branch-deploy.environment], but it will probably propagate faster if we use a robots.txt disallow on those domains (otherwise, the <meta> tags won't take effect until Google next re-crawls each page).
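
For reference, a minimal netlify.toml sketch of that context-based approach, assuming Docsy keys its noindex behaviour off HUGO_ENV as described above (build commands and other settings omitted):

```toml
# netlify.toml (sketch) -- only production builds advertise themselves as such;
# branch deploys and deploy previews get a non-production environment.
[context.production.environment]
  HUGO_ENV = "production"

[context.deploy-preview.environment]
  HUGO_ENV = "dev"

[context.branch-deploy.environment]
  HUGO_ENV = "dev"
```

Note that this only affects deploys that actually rebuild with the change, so existing branch deploys would still need to be rebuilt.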

@chalin (Contributor, Author) commented Dec 6, 2023

> To achieve goal 3, we could set HUGO_ENV to dev via [context.branch-deploy.environment], but it will probably propagate faster if we use a robots.txt disallow on those domains (otherwise, the <meta> tags won't take effect until Google next re-crawls each page).

AFAIK, what you propose won't work. I've had to work through a similar issue for another CNCF project that had multiple versions of its docs being indexed. Based on my experience, you'll need to change each old-version branch individually (to set/configure it to emit noindex, nofollow as appropriate for that branch) and have it rebuilt and redeployed.

Btw, you can't use robots.txt to prevent domains from being indexed -- see https://developers.google.com/search/docs/crawling-indexing/robots/intro:

[screenshot: excerpt from the linked Google robots.txt documentation]
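
For what it's worth, a signal Google does honor is the X-Robots-Tag response header, which Netlify can send via netlify.toml. A minimal sketch, assuming it is added to each old-version branch (AFAIK header rules in netlify.toml are not scoped per deploy context, so this fits the per-branch rebuild described above):

```toml
# netlify.toml on an old-version branch (sketch, untested).
# Google treats "X-Robots-Tag: noindex" like a <meta name="robots"> noindex tag,
# whereas robots.txt only controls crawling, not indexing.
[[headers]]
  for = "/*"
  [headers.values]
    X-Robots-Tag = "noindex, nofollow"
```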

/cc @nate-double-u

@chalin (Contributor, Author) commented Dec 6, 2023

As I mentioned elsewhere, I'm OOO, but I'll be glad to help with this in the new year.

@thesuperzapper (Member) commented:

@chalin It's possible that, if the Netlify configs for all branches are defined in master (rather than in the branches themselves), as discussed in #3628 (comment), we might only need to update master and then trigger a re-deploy of the older Netlify branches.

(However, I think the much newer version of Hugo running in master will probably break our really old Docsy versions, and the deploys might fail.)
