initialCanonicalUrl in scripts confuse search crawlers #53274

Closed
1 task done
frontimin opened this issue Jul 27, 2023 · 25 comments · Fixed by #64594 or #69370
Labels
bug Issue was opened via the bug report template. linear: next Confirmed issue that is tracked by the Next.js team.

Comments

@frontimin

frontimin commented Jul 27, 2023

Verify canary release

  • I verified that the issue exists in the latest Next.js canary release

Provide environment information

    Operating System:
      Platform: darwin
      Arch: x64
      Version: Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:44 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T8103
    Binaries:
      Node: 18.13.0
      npm: 9.8.1
      Yarn: 1.22.10
      pnpm: N/A
    Relevant packages:
      next: 13.4.7
      eslint-config-next: 13.4.1
      react: 18.2.0
      react-dom: 18.2.0
      typescript: 5.0.4

Which area(s) of Next.js are affected? (leave empty if unsure)

Metadata (metadata, generateMetadata, next/head)

To Reproduce

Open the Vercel website's /about page, open the DevTools 'Elements' tab, open the search input with Cmd + F, and type initialCanonicalUrl. You will see this in one of the <script> tags:
... \"initialCanonicalUrl\":\"/app-future/en/about\", ...

Open /app-future/en/about and see that you get the same page!
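
The manual check above can also be scripted. The following is only a minimal sketch and not part of the original report: the helper name `findInitialCanonicalUrl` and the sample HTML string are hypothetical, and the regular expression assumes the value is inlined as `"initialCanonicalUrl":"..."` with or without backslash-escaped quotes, as in the snippet above.

```javascript
// Hypothetical helper: extract the inlined initialCanonicalUrl value from page HTML.
// Assumes the key appears as "initialCanonicalUrl":"..." with or without
// backslash-escaped quotes.
function findInitialCanonicalUrl(html) {
  const match = html.match(/\\?"initialCanonicalUrl\\?":\\?"([^"\\]+)/);
  return match ? match[1] : null;
}

// Sample payload modeled on the snippet above (String.raw keeps the backslashes).
const sampleHtml = String.raw`<script>... \"initialCanonicalUrl\":\"/app-future/en/about\", ...</script>`;

// Compare the inlined value with the path the page is actually served from.
const pagePath = "/about";
const inlined = findInitialCanonicalUrl(sampleHtml);
const mismatch = inlined !== null && inlined !== pagePath;
console.log({ pagePath, inlined, mismatch });
```

Running a check like this over a list of known pages would show which ones expose an internal path that differs from the public one.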

Describe the Bug

Actual page path and initialCanonicalUrl are different.
Our SEO department reported this problem to us. They said it confuses search engine robots (don't ask me why) and reduces our SEO score.
For our case it looks like:
/blog/example-article – actual page path at English locale
/en/blog/example-article – initialCanonicalUrl from <script>

Expected Behavior

/blog/example-article – actual page path at English locale
/blog/example-article – initialCanonicalUrl from <script>

Which browser are you using? (if relevant)

Google Chrome

How are you deploying your application? (if relevant)

No response

@frontimin frontimin added the bug Issue was opened via the bug report template. label Jul 27, 2023
@HaykMkrtich

HaykMkrtich commented Jul 29, 2023

I also ran into this issue. I looked for a solution for weeks but found nothing. I also asked ChatGPT to recommend a solution, but none of its suggestions solved the problem. It's nice to find this issue; I hope it leads to a fix. Thanks.

@zlwaterfield
Contributor

Any update on this? Still seems to be an outstanding bug.

@tombennet

Any updates / advice on workarounds? Thanks!

@omerp-explorium

omerp-explorium commented Jun 17, 2024

Did anyone find a workaround for this?

@adomaskizogian

adomaskizogian commented Jun 17, 2024

> Did anyone find a workaround for this?

We have partially mitigated this by adding more patterns to robots.txt, but that does nothing to address the root cause.
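
For readers wondering what such a mitigation might look like, here is a minimal sketch that generates `Disallow` rules for the internal path prefixes. It is an illustration only: the function name `buildRobotsTxt` and the prefixes `/app-future/` and `/en/` are hypothetical (taken from the paths mentioned in this thread), and the actual patterns used in the comment above are not known.

```javascript
// Hypothetical helper: build a robots.txt body that blocks internal path prefixes.
// Only block prefixes that never correspond to public pages on your site.
function buildRobotsTxt(internalPrefixes) {
  const lines = ["User-agent: *"];
  for (const prefix of internalPrefixes) {
    lines.push(`Disallow: ${prefix}`);
  }
  return lines.join("\n") + "\n";
}

// Example prefixes modeled on the paths reported in this issue (assumptions).
const robotsTxt = buildRobotsTxt(["/app-future/", "/en/"]);
console.log(robotsTxt);
```

This only stops crawlers from fetching the internal URLs; the URL-like strings are still present in the page source.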

@omerp-explorium

omerp-explorium commented Jun 17, 2024

@adomaskizogian I see, thanks. I think I will try to manually replace the incorrect text that Next generates on our static pages. Hopefully it will work.

@omerman

omerman commented Jun 17, 2024

I have so many 404s in my Google Search Console, and I just realized it's because of this issue 😓

@huozhi
Member

huozhi commented Jun 26, 2024

Google won't pick up the initialCanonicalUrl in the HTML response for SEO; that value is only for internal state. The canonical URL should be configured through the Metadata API via alternates.canonical (https://nextjs.org/docs/app/api-reference/functions/generate-metadata) so that Google can pick it up properly.

> Open Vercel website on /about page, open devtools 'Elements' tab, open search input with cmd + F and type initialCanonicalUrl, you will see this in one of the <script> tags:
> ... "initialCanonicalUrl":"/app-future/en/about", ...
> Open /app-future/en/about and see that you get the same page!

The Vercel about page is not affected, since Google picks up the canonical URL from the meta tag; initialCanonicalUrl is not related to SEO.
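
As a rough illustration of the alternates.canonical setting mentioned above, here is a minimal sketch of a page's metadata object. The domain `https://example.com` and the path `/blog/example-article` are placeholders, and the last lines only show how a relative canonical resolves against `metadataBase` using the standard `URL` class; they do not reproduce Next.js's own rendering.

```javascript
// Sketch of a page's metadata object. In a real App Router page file it would be
// exported as `export const metadata = ...`. Domain and path are placeholders.
const metadata = {
  metadataBase: new URL("https://example.com"),
  alternates: {
    canonical: "/blog/example-article",
  },
};

// Resolve the relative canonical against the base to see the absolute URL
// that a crawler should end up with.
const canonicalHref = new URL(metadata.alternates.canonical, metadata.metadataBase).href;
console.log(canonicalHref);
```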

@huozhi huozhi closed this as completed Jun 26, 2024
@omerman

omerman commented Jun 26, 2024

> Google won't pick up the initialCanonicalUrl in the HTML response for SEO; that value is only for internal state. The canonical URL should be configured through the Metadata API via alternates.canonical (https://nextjs.org/docs/app/api-reference/functions/generate-metadata) so that Google can pick it up properly.

I'm telling you for sure that I have a bunch (thousands) of URLs that Google "discovered" due to this variable.
And I'm sure it's this variable, because Google reports them as referenced by the page itself.

I agree that Google shouldn't pick it up as a URL, but it does.

@omerman

omerman commented Jun 26, 2024

@huozhi

@omerman

omerman commented Jun 27, 2024

> Google won't pick up the initialCanonicalUrl in the HTML response for SEO; that value is only for internal state. The canonical URL should be configured through the Metadata API via alternates.canonical (https://nextjs.org/docs/app/api-reference/functions/generate-metadata) so that Google can pick it up properly.
>
> > Open Vercel website on /about page, open devtools 'Elements' tab, open search input with cmd + F and type initialCanonicalUrl, you will see this in one of the <script> tags:
> > ... "initialCanonicalUrl":"/app-future/en/about", ...
> > Open /app-future/en/about and see that you get the same page!
>
> The Vercel about page is not affected, since Google picks up the canonical URL from the meta tag; initialCanonicalUrl is not related to SEO.

@huozhi
P.S. The Vercel about page, https://vercel.com/about, has initialCanonicalUrl pointing to /app-future/en-US/about

I bet that if you go to Google Search Console and look at the section for pages that cause a redirect or a 404, you will see the page:
https://vercel.com/app-future/en-US/about

@studentIvan

studentIvan commented Jul 8, 2024

> Google won't pick up the initialCanonicalUrl in the HTML response for SEO; that value is only for internal state. The canonical URL should be configured through the Metadata API via alternates.canonical (https://nextjs.org/docs/app/api-reference/functions/generate-metadata) so that Google can pick it up properly.
>
> > Open Vercel website on /about page, open devtools 'Elements' tab, open search input with cmd + F and type initialCanonicalUrl, you will see this in one of the <script> tags:
> > ... "initialCanonicalUrl":"/app-future/en/about", ...
> > Open /app-future/en/about and see that you get the same page!
>
> The Vercel about page is not affected, since Google picks up the canonical URL from the meta tag; initialCanonicalUrl is not related to SEO.

This is not correct. Google does pick it up, and it wastes the crawl budget.

image

For people who face this problem and have a CDN, one possible solution is to filter the HTML coming from the server with a regexp like this:

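  // Strip the "?_rsc=..." query string from the inlined initialCanonicalUrl value.
  if (responseHTML.includes("initialCanonicalUrl")) {
    responseHTML = responseHTML.replace(
      // p1: key and path, p2: optional "?", p3: query string (or path tail), p4: closing quote
      /("initialCanonicalUrl\\?":\\?"[^?"]+)(\??)([^"\\]+)(\\?")/,
      (match, p1, p2, p3, p4) => {
        if (p3.includes("_rsc=")) {
          // Drop the "?" and the query string, keep the path and the closing quote.
          return p1 + p4;
        }
        return p1 + p2 + p3 + p4;
      }
    );
  }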
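
To make the behavior of the snippet above easier to check, here is a self-contained sketch that wraps the same regular expression in a function and runs it on sample input. The function name `stripRscFromCanonicalUrl` and the sample strings are hypothetical; how the HTML is obtained and returned depends on the CDN and is not shown.

```javascript
// Hypothetical wrapper around the regexp from the comment above.
function stripRscFromCanonicalUrl(responseHTML) {
  if (!responseHTML.includes("initialCanonicalUrl")) {
    return responseHTML;
  }
  return responseHTML.replace(
    /("initialCanonicalUrl\\?":\\?"[^?"]+)(\??)([^"\\]+)(\\?")/,
    // Remove the query string only when it carries the _rsc parameter.
    (match, p1, p2, p3, p4) => (p3.includes("_rsc=") ? p1 + p4 : match)
  );
}

// Sample payload fragments (String.raw keeps the backslash-escaped quotes).
const withQuery = String.raw`{\"initialCanonicalUrl\":\"/en/about?_rsc=abc12\",\"x\":1}`;
const withoutQuery = String.raw`{\"initialCanonicalUrl\":\"/en/about\",\"x\":1}`;

console.log(stripRscFromCanonicalUrl(withQuery));
console.log(stripRscFromCanonicalUrl(withoutQuery));
```

Note that this only removes the `_rsc` query string; the path itself stays in the page source.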

@huozhi

@huozhi huozhi reopened this Jul 8, 2024
@StepanKhvatov

I also see a problem with initialCanonical

Page

initialCanonicalUrl\":\"/en/imin-cyprus/guide/real-estate\

● Next js Team Overview - 301 redirect from the initialcanonical prefixes (fix for the 404 errors) - Asana 2024-07-05 15-25-09

@gnoff
Contributor

gnoff commented Jul 11, 2024

This should be resolved with this refactor which will no longer serialize a prop called "initialCanonicalUrl" in the RSC payload

#64594

@omerman

omerman commented Jul 11, 2024

> This should be resolved with this refactor which will no longer serialize a prop called "initialCanonicalUrl" in the RSC payload
>
> #64594

@gnoff I looked at the PR description and code, and I'm not sure it will resolve it.
It suggests that the output will look something like ...pagesource...c="x/y/z", where c is the initialCanonicalUrl.
I don't think Google captures the initialCanonicalUrl as a URL because of its name, but because the value matches the pattern of a link.
I say that because I've seen Google also try to crawl the variable paths in my code when I used this folder structure:

app
  route
     [...paths]

My source had ...pagesource...{\"children\":[[\"paths\", "x/y/z"]]...} and Google also tried to use x/y/z as a page referenced by my page.

Anyway, I believe that as long as Next inserts this data inline as part of the page, it will still be picked up by Google.
Hope you understand 😅

@ztanner
Member

ztanner commented Jul 12, 2024

Reopening this until we can verify the above mentioned PR fixes it.

@ztanner ztanner reopened this Jul 12, 2024
@gnoff
Contributor

gnoff commented Jul 12, 2024

hmm @omerman it seems like this would be a problem then for that string appearing anywhere in the document. Seems quite aggressive for google to make any kind of assumption about url-matching string sequences though I don't doubt you are seeing what you are seeing. Unfortunately the benefits of encoding the initial flight data in the document are significant and so we don't have an easy way to completely omit it. We could do some alternate encoding like an array of path sequences that get reified on the client but that really only solves url-like sequences for this one specific use case and not any url-apparent string.

Will think about this some more and see if we can come up with something that makes sense
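
The "array of path sequences" idea above could look roughly like the following. This is a hypothetical sketch to illustrate the encoding, not the format Next.js uses; the function names `encodePath` and `decodePath` are made up for the example.

```javascript
// Hypothetical encoding: store the path as an array of segments so that the
// serialized payload contains no "/"-separated, URL-like string.
function encodePath(pathname) {
  return pathname.split("/");
}

// Reify the original path on the client by joining the segments again.
function decodePath(segments) {
  return segments.join("/");
}

const segments = encodePath("/en/blog/example-article");
const serialized = JSON.stringify(segments);
const restored = decodePath(JSON.parse(serialized));
console.log({ serialized, restored });
```

As the comment above notes, this would only help for this one value and not for other URL-like strings in the document.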

@gnoff
Contributor

gnoff commented Jul 12, 2024

Reviewing further it seems like google is crawling anything url-like it finds in the byte sequence of the document (meaning it doesn't have to parse the contents as a string per-se). This is presumably to discover links that might be worth visiting and would certainly find false positives amongst a large variety of data that can be present on any given page.

What I'd like to better understand is why this matters. What is the practical downside to google attempting to visit pages that it thinks might exist? While we might have a workaround for this one specific url-patterned string I'm not particularly motivated to special case this in Next.js when arbitrary data can also trigger the same exact behavior from google. But I admit I might be ignorant of some consequence to this that warrants handling this very specific special case

@rdadoune

As @studentIvan mentioned, it's a waste of crawl budget. Maybe base64 encoding the URL could help?
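
A minimal sketch of the base64 suggestion, assuming a Node.js runtime with `Buffer`. It uses the `base64url` variant because the standard base64 alphabet itself contains `/`. The sample path is a placeholder, and nothing here reflects how Next.js actually serializes the value.

```javascript
// Encode a path so the serialized value no longer looks like a URL.
const path = "/en/blog/example-article";
const encoded = Buffer.from(path, "utf8").toString("base64url");

// Decode it again on the receiving side.
const decoded = Buffer.from(encoded, "base64url").toString("utf8");

// The encoded form is longer than the original, which is the size cost of this approach.
console.log({ encoded, decoded, overhead: encoded.length - path.length });
```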

@omerman

omerman commented Jul 12, 2024

@gnoff My preferred approach would be for Next to write a JS file with the preflight data and reference it with a script tag in the main HTML document. That way it's not inline, and to my knowledge Google doesn't analyze the files referenced by script tags. But that's something I'm saying without knowing the implications it has on performance. In any case, to be honest, I'm not sure myself whether Google will rank me worse because of these resulting 404s… it's not like they share it 😂
So to sum up, I'm not really sure either. I just get a massive 404 report from Google, and I'm not sure whether it's that bad. I'm in the dark.

@gnoff
Contributor

gnoff commented Jul 12, 2024

I understand that theoretically it's a wasted crawl. But my read of crawl budget is that 404's effectively neutralize this and that the budget only impacts sites operating at a scale where you are likely to have sitemaps or other alternate crawling mechanisms in place to index the correct pages explicitly.

Has anyone on this thread experienced a material adverse outcome because of this string? I'm genuinely asking b/c I want to better understand the concern. In my experience SEO/indexing behavior has a perception of being arcane knowledge that is conveyed in "best practices" because getting clear answers from google can be hard and so there is a demand for certain things to work in accordance with perceived problems even if they are not experienced problems

@gnoff
Contributor

gnoff commented Jul 12, 2024

@omerman I wrote up a bit about why we don't do this here: #42170 (comment)

We don't base64 encode the entire inlined flight stream b/c it increases its size and then you need to decode it anyway, which is a bit slower than just parsing. Though when we support binary data we will need to do this for the bits that contain binary data.

I think we can look into changing the format of this URL. It's internal only so it's not like it being a string is part of any public API. It will help this specific case

@ztanner ztanner added the linear: next Confirmed issue that is tracked by the Next.js team. label Aug 27, 2024