`initialCanonicalUrl` in scripts confuse search crawlers #53274

frontimin · 2023-07-27T23:00:06Z

Verify canary release

I verified that the issue exists in the latest Next.js canary release

Provide environment information

    Operating System:
      Platform: darwin
      Arch: x64
      Version: Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:44 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T8103
    Binaries:
      Node: 18.13.0
      npm: 9.8.1
      Yarn: 1.22.10
      pnpm: N/A
    Relevant packages:
      next: 13.4.7
      eslint-config-next: 13.4.1
      react: 18.2.0
      react-dom: 18.2.0
      typescript: 5.0.4

Which area(s) of Next.js are affected? (leave empty if unsure)

Metadata (metadata, generateMetadata, next/head)

To Reproduce

Open the Vercel website on the /about page, open the devtools 'Elements' tab, press Cmd + F, and search for initialCanonicalUrl. You will see this in one of the <script> tags:
... \"initialCanonicalUrl\":\"/app-future/en/about\", ...

Open /app-future/en/about and see that you get the same page!

Describe the Bug

The actual page path and initialCanonicalUrl are different.
Our SEO department reported this problem to us. They said it confuses search engine robots (don't ask me why) and reduces our SEO optimization points.
In our case it looks like:
/blog/example-article – actual page path at English locale
/en/blog/example-article – initialCanonicalUrl from <script>

Expected Behavior

/blog/example-article – actual page path at English locale
/blog/example-article – initialCanonicalUrl from <script>
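
One way to check a page for this mismatch is to pull the serialized value out of the HTML and compare it with the public path. The sketch below is only illustrative: `extractInitialCanonicalUrl`, `hasCanonicalMismatch`, and the sample markup are hypothetical names and data, not part of Next.js.

```javascript
// Hypothetical helper: pulls the serialized initialCanonicalUrl out of a page's HTML.
// Handles both plain JSON ("key":"value") and the escaped form (\"key\":\"value\")
// that appears inside inline <script> payloads. Any query string is ignored.
function extractInitialCanonicalUrl(html) {
  const match = html.match(/\\?"initialCanonicalUrl\\?":\\?"([^"\\?]+)/);
  return match ? match[1] : null;
}

// Hypothetical helper: true when a serialized value exists and differs from the public path.
function hasCanonicalMismatch(html, publicPath) {
  const serialized = extractInitialCanonicalUrl(html);
  return serialized !== null && serialized !== publicPath;
}

// Sample markup modeled on the snippet in this report (not real Next.js output).
const sampleHtml = String.raw`<script>... \"initialCanonicalUrl\":\"/en/blog/example-article\", ...</script>`;

console.log(extractInitialCanonicalUrl(sampleHtml));
console.log(hasCanonicalMismatch(sampleHtml, "/blog/example-article"));
```

Running this against the HTML of a few public URLs should show quickly whether a site is affected.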

Which browser are you using? (if relevant)

Google Chrome

How are you deploying your application? (if relevant)

No response

HaykMkrtich · 2023-07-29T20:02:49Z

I also ran into this issue. I looked for a solution for weeks but found nothing. I also asked ChatGPT to recommend a solution, but none of its suggestions solved the problem. It's nice to find this issue; I hope it leads to a fix. Thanks.

zlwaterfield · 2023-12-27T16:22:24Z

Any update on this? Still seems to be an outstanding bug.

tombennet · 2024-01-17T08:58:52Z

Any updates / advice on workarounds? Thanks!

omerp-explorium · 2024-06-17T08:27:30Z

Did anyone find a workaround for this?

adomaskizogian · 2024-06-17T08:33:03Z

> Did anyone find a workaround for this?

We have partially mitigated this by adding more patterns to robots.txt, but that does nothing to address the root cause.
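
For readers who generate robots.txt from the app itself, a mitigation along these lines could be sketched with the App Router's `app/robots.js` file convention. The disallowed prefixes below are assumptions for illustration only; do not block a prefix that also serves real localized pages.

```javascript
// app/robots.js (Next.js App Router metadata file convention).
// The "/app-future/" prefix is a hypothetical example; list the prefixes that
// show up in your own initialCanonicalUrl values and are not meant to be public.
function robots() {
  return {
    rules: {
      userAgent: "*",
      allow: "/",
      disallow: ["/app-future/"],
    },
  };
}

// In a real app this function would be the file's default export:
// export default robots;
console.log(JSON.stringify(robots()));
```

As the comment above says, this only keeps crawlers away from the stray URLs; the value is still serialized into the page.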

omerp-explorium · 2024-06-17T08:46:04Z

@adomaskizogian I see, thanks. I think I will try to manually replace the wrong text that Next generates on our static pages. Hopefully it will work.

omerman · 2024-06-17T13:58:16Z

I have so many 404s in my Google Search Console... and I just realized it's because of this issue 😓

huozhi · 2024-06-26T12:14:29Z

Google won't pick up the initialCanonicalUrl in the HTML response for SEO; that value is only for internal state. The canonical URL should be configured through the Metadata API via alternates.canonical (https://nextjs.org/docs/app/api-reference/functions/generate-metadata) so that Google can pick it up properly.

> Open Vercel website on /about page, open devtools 'Elements' tab, open search input with cmd + F and type initialCanonicalUrl, you will see this in one of the <script> tags:
> ... "initialCanonicalUrl":"/app-future/en/about", ...
> Open /app-future/en/about and see that you get the same page!

The Vercel about page is not affected, since Google picks up the canonical URL from the meta tag; that value is not related to SEO.
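
A minimal sketch of the `alternates.canonical` configuration mentioned above, assuming a hypothetical `app/blog/[slug]/page.js` route. `buildMetadata` is an illustrative helper name; only the `alternates.canonical` field and `generateMetadata` come from the Metadata API.

```javascript
// app/blog/[slug]/page.js (hypothetical route)
// Builds the object a page would return from generateMetadata.
function buildMetadata(slug) {
  return {
    alternates: {
      // Rendered as <link rel="canonical" href="...">. A relative value is
      // resolved against metadataBase when one is set in a layout.
      canonical: `/blog/${slug}`,
    },
  };
}

// In a real page this would be wired up roughly as follows (the exact shape of
// `params` depends on the Next.js version):
// export async function generateMetadata({ params }) {
//   return buildMetadata(params.slug);
// }
console.log(JSON.stringify(buildMetadata("example-article")));
```

This sets the canonical link that search engines read; it does not change the `initialCanonicalUrl` value serialized in the inline script.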

omerman · 2024-06-26T12:17:43Z

> Google won't pick up the initialCanonicalUrl in the HTML response for SEO; that value is only for internal state. The canonical URL should be configured through the Metadata API via alternates.canonical (https://nextjs.org/docs/app/api-reference/functions/generate-metadata) so that Google can pick it up properly.

I'm telling you for sure that I have a bunch (thousands) of URLs that Google “discovered” due to this variable.
And I'm sure it's this variable, because Google reports them as referenced by the page itself.

I agree that Google shouldn't pick it up as a URL... but it does.

omerman · 2024-06-26T12:18:01Z

@huozhi

omerman · 2024-06-27T10:10:52Z

> Google won't pick up the initialCanonicalUrl in the HTML response for SEO; that value is only for internal state. The canonical URL should be configured through the Metadata API via alternates.canonical (https://nextjs.org/docs/app/api-reference/functions/generate-metadata) so that Google can pick it up properly.
>
> > Open Vercel website on /about page, open devtools 'Elements' tab, open search input with cmd + F and type initialCanonicalUrl, you will see this in one of the <script> tags:
> > ... "initialCanonicalUrl":"/app-future/en/about", ...
> > Open /app-future/en/about and see that you get the same page!
>
> The Vercel about page is not affected, since Google picks up the canonical URL from the meta tag; that value is not related to SEO.

@huozhi
P.S. The Vercel about page, https://vercel.com/about, has initialCanonicalUrl pointing to /app-future/en-US/about

I bet that if you go to Google Search Console and look at the section for pages that redirect or return 404, you will see this page:
https://vercel.com/app-future/en-US/about

studentIvan · 2024-07-08T12:58:47Z

> Google won't pick up the initialCanonicalUrl in the HTML response for SEO; that value is only for internal state. The canonical URL should be configured through the Metadata API via alternates.canonical (https://nextjs.org/docs/app/api-reference/functions/generate-metadata) so that Google can pick it up properly.
>
> > Open Vercel website on /about page, open devtools 'Elements' tab, open search input with cmd + F and type initialCanonicalUrl, you will see this in one of the <script> tags:
> > ... "initialCanonicalUrl":"/app-future/en/about", ...
> > Open /app-future/en/about and see that you get the same page!
>
> The Vercel about page is not affected, since Google picks up the canonical URL from the meta tag; that value is not related to SEO.

That is not correct. Google does pick it up, and it wastes crawl budget.

For people who face this problem and have a CDN, one possible solution is to filter the HTML coming from the server with a regexp like this:

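  // Strip a `?_rsc=...` query from the serialized initialCanonicalUrl.
  // Only the first occurrence is rewritten; the path itself is left as-is.
  if (responseHTML.includes("initialCanonicalUrl")) {
    responseHTML = responseHTML.replace(
      // p1: key plus path, p2: optional "?", p3: query string (or the tail of
      // the path when there is no query), p4: closing quote, possibly escaped
      /("initialCanonicalUrl\\?":\\?"[^?"]+)(\??)([^"\\]+)(\\?")/,
      (match, p1, p2, p3, p4) => {
        if (p3.includes("_rsc=")) {
          // Drop the "?" and the query, keep the path and the closing quote.
          return p1 + p4;
        }
        return p1 + p2 + p3 + p4;
      }
    );
  }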

@huozhi
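
To see what the regexp filter above does, it can be wrapped in a function and run on sample payloads. `stripRscQuery` and the two sample strings are hypothetical; the regexp is the one from the comment, unchanged.

```javascript
// Wraps the filter from the comment above in a function (hypothetical name).
function stripRscQuery(responseHTML) {
  if (!responseHTML.includes("initialCanonicalUrl")) {
    return responseHTML;
  }
  return responseHTML.replace(
    /("initialCanonicalUrl\\?":\\?"[^?"]+)(\??)([^"\\]+)(\\?")/,
    (match, p1, p2, p3, p4) =>
      p3.includes("_rsc=") ? p1 + p4 : p1 + p2 + p3 + p4
  );
}

// Sample payload fragments in the escaped form used inside inline scripts.
const withQuery = String.raw`\"initialCanonicalUrl\":\"/en/about?_rsc=1abc2\",`;
const withoutQuery = String.raw`\"initialCanonicalUrl\":\"/en/about\",`;

console.log(stripRscQuery(withQuery));    // the ?_rsc=... part is removed
console.log(stripRscQuery(withoutQuery)); // left unchanged
```

Note that the filter only removes the `_rsc` query string. It does not rewrite a locale or basePath prefix in the path, so it addresses one source of stray URLs rather than the mismatch described in the original report.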

StepanKhvatov · 2024-07-11T04:50:05Z

I also see a problem with initialCanonical

Page

initialCanonicalUrl\":\"/en/imin-cyprus/guide/real-estate\

● Next js Team Overview - 301 redirect from the initialCanonical prefixes (fix for the 404 error) - Asana 2024-07-05 15-25-09
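
The screenshot title suggests redirecting the prefixed URLs back to the public ones. A minimal sketch of such a rule in `next.config.js` follows, assuming the unwanted prefix is `/en` as in the path above; `localePrefixRedirects` is a hypothetical name, and the rule should be checked against any locale middleware before use.

```javascript
// next.config.js sketch. The "/en" prefix is an assumption taken from the example
// path above; adjust it to the prefixes in your own initialCanonicalUrl values.
const localePrefixRedirects = [
  {
    source: "/en/:path*",
    destination: "/:path*",
    // `permanent: true` would answer with 308; `statusCode` is the documented
    // alternative when a specific code such as 301 is wanted (use one or the other).
    statusCode: 301,
  },
];

const nextConfig = {
  async redirects() {
    return localePrefixRedirects;
  },
};

// In a real project: module.exports = nextConfig;
console.log(JSON.stringify(localePrefixRedirects));
```

With a rule like this, crawlers that follow the stray URL land on the real page instead of a 404, though the stray URL is still discovered.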

gnoff · 2024-07-11T17:05:50Z

This should be resolved with this refactor which will no longer serialize a prop called "initialCanonicalUrl" in the RSC payload

#64594

omerman · 2024-07-11T18:27:44Z

> This should be resolved with this refactor which will no longer serialize a prop called "initialCanonicalUrl" in the RSC payload
>
> #64594

@gnoff I looked at the PR description and code, and I'm not sure it will resolve it.
It suggests that the output will look something like ...pagesource...c="x/y/z",
where c is the initialCanonicalUrl.
I don't think Google captures the initialCanonicalUrl as a URL because of its name, but because the value matches the pattern of a link.
I say that because I've seen Google also try to map the variable paths in my code when I used this folder structure:

app
  route
     [...paths]

So my source had ...pagesource...{\"children\":[[\"paths\", "x/y/z"]]...} and Google also tried to treat x/y/z as a page referenced by my page.

Anyway, I believe that as long as Next inserts this data inline as part of the page, it will still be picked up by Google.
Hope you understand 😅

ztanner · 2024-07-12T19:03:07Z

Reopening this until we can verify the above mentioned PR fixes it.

gnoff · 2024-07-12T21:12:20Z

Hmm @omerman, it seems like this would then be a problem for that string appearing anywhere in the document. It seems quite aggressive for Google to make any kind of assumption about URL-matching string sequences, though I don't doubt you are seeing what you are seeing. Unfortunately, the benefits of encoding the initial flight data in the document are significant, so we don't have an easy way to completely omit it. We could do some alternate encoding, like an array of path sequences that get reified on the client, but that really only solves URL-like sequences for this one specific use case and not any URL-apparent string.

Will think about this some more and see if we can come up with something that makes sense

gnoff · 2024-07-12T21:56:47Z

Reviewing further, it seems like Google is crawling anything URL-like it finds in the byte sequence of the document (meaning it doesn't have to parse the contents as a string per se). This is presumably to discover links that might be worth visiting, and it would certainly find false positives among the large variety of data that can be present on any given page.

What I'd like to better understand is why this matters. What is the practical downside to Google attempting to visit pages that it thinks might exist? While we might have a workaround for this one specific URL-patterned string, I'm not particularly motivated to special-case this in Next.js when arbitrary data can also trigger the exact same behavior from Google. But I admit I might be ignorant of some consequence that warrants handling this very specific special case.
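
To get a feel for how many URL-like strings a page exposes this way, one could audit inline scripts for quoted, path-like values. This is only a rough approximation of what a crawler might treat as a link, not a description of Google's actual behavior; `findPathLikeStringsInScripts` and the sample page are hypothetical.

```javascript
// Hypothetical audit helper: lists quoted strings that look like absolute paths
// inside inline <script> bodies. Regular links such as <a href> are ignored.
function findPathLikeStringsInScripts(html) {
  const scriptBodies = [
    ...html.matchAll(/<script\b[^>]*>([\s\S]*?)<\/script>/gi),
  ].map((m) => m[1]);
  // A quote, then one or more "/segment" parts, then a (possibly escaped) quote.
  const pathPattern = /"(\/[\w.~-]+(?:\/[\w.~-]+)*)\\?"/g;
  const found = new Set();
  for (const body of scriptBodies) {
    for (const m of body.matchAll(pathPattern)) {
      found.add(m[1]);
    }
  }
  return [...found];
}

// Sample page modeled on the payload quoted in this thread (not real output).
const samplePage = String.raw`<a href="/about">About</a><script>x.push("{\"initialCanonicalUrl\":\"/app-future/en/about\",\"tree\":[\"about\"]}")</script>`;

console.log(findPathLikeStringsInScripts(samplePage));
```

Any path in the result that is not a real public URL is a candidate for the 404 reports described in this thread.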

rdadoune · 2024-07-12T22:06:14Z

As @studentIvan mentioned, it's a waste of crawl budget. Maybe base64 encoding the URL could help?
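
A minimal sketch of the base64 idea, assuming Node.js (`Buffer` and its `base64url` encoding). The function names are hypothetical. The URL-safe variant is used here because standard base64 output can itself contain `/` characters.

```javascript
// Node.js sketch: "base64url" is a Buffer encoding that avoids "/" and "+" in the output.
function encodeUrl(url) {
  return Buffer.from(url, "utf8").toString("base64url");
}

function decodeUrl(encoded) {
  return Buffer.from(encoded, "base64url").toString("utf8");
}

const originalUrl = "/en/blog/example-article";
const encodedUrl = encodeUrl(originalUrl);

// The encoded form is about a third longer than the input and must be decoded
// on the client before use.
console.log(encodedUrl, encodedUrl.length, originalUrl.length);
console.log(decodeUrl(encodedUrl) === originalUrl);
```

The trade-off is size and an extra decode step in exchange for a value that no longer looks like a path.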

omerman · 2024-07-12T22:32:42Z

@gnoff My preferred approach would be for Next to write a JS file with the preflight data and reference it with a script tag in the main HTML document (that way it's not inline, and to my knowledge Google doesn't analyze files referenced by script tags in the same way). But I'm saying that without knowing the implications for performance. In any case, to be honest, I'm not sure myself whether Google will rank me worse because of the resulting 404s… it's not like they share it 😂
So to sum up, I'm not really sure either. I just get a massive 404 report from Google, and I don't know how bad that is. I'm in the dark.

gnoff · 2024-07-12T22:34:26Z

I understand that theoretically it's a wasted crawl. But my read of crawl budget is that 404s effectively neutralize this, and that the budget only impacts sites operating at a scale where you are likely to have sitemaps or other alternate crawling mechanisms in place to index the correct pages explicitly.

Has anyone on this thread experienced a material adverse outcome because of this string? I'm genuinely asking b/c I want to better understand the concern. In my experience SEO/indexing behavior has a perception of being arcane knowledge that is conveyed in "best practices" because getting clear answers from google can be hard and so there is a demand for certain things to work in accordance with perceived problems even if they are not experienced problems

gnoff · 2024-07-12T22:39:39Z

@omerman I wrote up a bit about why we don't do this here: #42170 (comment)

We don't base64 encode the entire inlined flight stream because it increases its size, and then you need to decode it anyway, which is a bit slower than just parsing. Though when we support binary data we will need to do this for the bits that contain binary data.

I think we can look into changing the format of this URL. It's internal only, so it's not like it being a string is part of any public API. It will help this specific case.
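
The "array of path sequences" idea mentioned earlier in the thread could look roughly like the sketch below. This is an illustration of the general technique, not the actual Next.js implementation, and `toSegments` / `fromSegments` are hypothetical names.

```javascript
// Server side: split the URL so the serialized payload contains no "/a/b/c" string.
function toSegments(url) {
  return url.split("/");
}

// Client side: rebuild the original URL from the segments.
function fromSegments(segments) {
  return segments.join("/");
}

const canonicalUrl = "/en/blog/example-article";
const serializedSegments = JSON.stringify(toSegments(canonicalUrl));

console.log(serializedSegments); // ["","en","blog","example-article"]
console.log(fromSegments(JSON.parse(serializedSegments)) === canonicalUrl); // true
```

Because `split` and `join` on the same separator are inverses, the client recovers the exact string while the document never contains the path in one piece.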

frontimin added the bug label (Issue was opened via the bug report template.) Jul 27, 2023

This was referenced Dec 27, 2023

initialCanonicalUrl is not taking into account basePath from config #59970

Closed

initialCanonicalUrl is not taking into account basePath from config #59971

Closed

huozhi closed this as completed Jun 26, 2024

huozhi reopened this Jul 8, 2024

ztanner mentioned this issue Jul 11, 2024

refactor <AppRouter /> structure #64594

Merged

ztanner closed this as completed in #64594 Jul 12, 2024

ztanner closed this as completed in df1a427 Jul 12, 2024

ztanner reopened this Jul 12, 2024

ztanner added the linear: next label (Confirmed issue that is tracked by the Next.js team.) Aug 27, 2024

ztanner mentioned this issue Aug 27, 2024

send initialCanonicalUrl in array format to prevent crawler confusion #69370

Merged

ztanner closed this as completed in 7f57d4b Aug 27, 2024

ztanner closed this as completed in #69370 Aug 27, 2024
