Post by radact » Mon Jul 08, 2024 3:00 pm

Google is throwing up thousands of errors while trying to crawl my site.

Code:

https://www.anotherworld.com.au/shop/msi-stealth-NBM-STEALT14A13VF059?tag=MSI&sort=rating&order=DESC&limit=20
Jul 6, 2024
https://www.anotherworld.com.au/shop/simplecom-cm412-hdmi-splitter?sort=p.model&order=ASC&limit=50
Jul 6, 2024
https://www.anotherworld.com.au/shop/astrotek-gpu-adapter-cable?tag=Astrotek&sort=p.model&order=DESC&limit=100
Jul 6, 2024
https://www.anotherworld.com.au/shop/simplecom-se203-tool-free-hdd-sdd-usb3-enclosure?sort=p.model&order=ASC&limit=20
Jul 6, 2024
https://www.anotherworld.com.au/shop/mame-arcade-collection-roms-retro-games-coinop-emulator?tag=MAME&limit=75
Jul 6, 2024
https://www.anotherworld.com.au/shop/smartoo?sort=p.price&order=ASC&limit=25
Jul 6, 2024
https://www.anotherworld.com.au/shop/j5-create?limit=25
Jul 6, 2024
https://www.anotherworld.com.au/shop/deepcool-z3-thermal-paste?tag=Heatsink Compound&sort=rating&order=ASC&limit=75
Jul 6, 2024
https://www.anotherworld.com.au/shop/j5-create?sort=rating&order=DESC
Jul 6, 2024
https://www.anotherworld.com.au/shop/j5-create?sort=p.price&order=DESC
Jul 6, 2024
In the robots.txt file, I've tried blocking the crawler from indexing pages like this:

Code:

User-Agent: *
#Disabled 8/7/24
#Allow: /

User-agent: Googlebot
#Disabled 8/7/24
#Allow: /


# Allow: /sitemap.htm
Sitemap: https://www.anotherworld.com.au/shop/index.php?route=extension/feed/simple_google_sitemap

#Added 13/6/2024
Allow: https://www.anotherworld.com.au/shop/index.php?route=information/contact
Disallow: /*?sort=
Disallow: /*?limit=
Disallow: /*?tag=
Disallow: /*?route=
Am I doing this the right way to stop the errors regarding alternate pages with canonical tags? Why can't it just index the page itself without the ?xxxx info that seems to be causing these errors?

New member

Posts

Joined
Fri Nov 25, 2016 11:36 am

Post by ADD Creative » Mon Jul 08, 2024 4:37 pm

That's not actually an error, just a notice. Google is just telling you that it has seen the canonical tag. https://support.google.com/webmasters/a ... onical_tag

You can block the pages in robots.txt. This will just move them into the "Blocked by robots.txt" reason. It will also save crawl budget. However, if any of the pages have already been indexed, you will end up with "Indexed, though blocked by robots.txt" warnings and blank pages in the search results.
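
Before relying on robots.txt rules, it can help to check which URLs the patterns would actually cover. Below is a minimal Python sketch of the wildcard matching that Google documents for robots.txt paths ('*' matches any run of characters, a trailing '$' anchors the end, and matching starts at the beginning of the path plus query string). The helper names, the order-first URL and the broader '/*?*sort=' pattern are made up for illustration. This is not Google's actual parser, and it ignores user-agent groups and Allow precedence.

```python
import re
from urllib.parse import urlsplit


def pattern_to_regex(pattern):
    """Turn a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters, a trailing '$' anchors the end,
    everything else is literal. Matching starts at the beginning of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(url, patterns):
    """Return True if the URL's path plus query string matches any pattern."""
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    return any(pattern_to_regex(p).match(target) for p in patterns)


# Disallow patterns taken from the robots.txt in the first post.
POSTED = ["/*?sort=", "/*?limit=", "/*?tag=", "/*?route="]

# Hypothetical broader pattern: the parameter anywhere in the query string.
BROADER = ["/*?*sort="]

BASE = "https://www.anotherworld.com.au/shop/"
urls = [
    BASE + "j5-create?limit=25",                # from the report
    BASE + "j5-create?sort=rating&order=DESC",  # from the report
    BASE + "j5-create",                         # clean SEO URL
    BASE + "j5-create?order=DESC&sort=rating",  # hypothetical parameter order
    BASE + "index.php?route=extension/feed/simple_google_sitemap",  # sitemap feed
]

if __name__ == "__main__":
    for url in urls:
        print(is_disallowed(url, POSTED), is_disallowed(url, BROADER), url)
```

With the patterns from the first post, the two reported URLs and the sitemap feed URL come back as matched, the clean URL does not, and the hypothetical order-first URL slips through because '?sort=' only matches when sort is the first parameter. In this model '/*?route=' also covers index.php?route=information/contact, and since rule values are matched against the path, a full URL in an Allow line would not match anything.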

Another option is to make pages with the URL parameters noindex.

www.add-creative.co.uk


Expert Member

Posts

Joined
Sat Jan 14, 2012 1:02 am
Location - United Kingdom

Post by radact » Tue Jul 09, 2024 6:56 am

So I should just ignore those non-canonical alerts, then? I was under the impression that lots of alerts could negatively affect my rankings.

Or is it possible to set noindex just for the search results that start with ?tag, ?route, etc.?

Post by ADD Creative » Tue Jul 09, 2024 7:20 pm

I don't know if there is a downside, other than wasting crawl budget. This is what Google say about it in the link I posted.
Alternate page with proper canonical tag
This page is marked as an alternate of another page (that is, an AMP page with a desktop canonical, or a mobile version of a desktop canonical, or the desktop version of a mobile canonical). This page correctly points to the canonical page, which is indexed, so there is nothing you need to do. Alternate language pages are not detected by Search Console.
You can make the pages noindex by adding something like the following to .htaccess. However, this will just change the reason for not being indexed from "Alternate page with proper canonical tag" to "Excluded by ‘noindex’ tag". It's "Duplicate without user-selected canonical" or "Duplicate, Google chose different canonical than user" that are probably the ones you want to avoid.

Code:

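# Requires Apache 2.4+ (for <If>) and mod_headers (for Header).
# Sort, limit, tag, search and filter parameters anywhere in the query string.
<If "%{QUERY_STRING} =~ m#(sort|order|limit|tag|search|sub_category|description|filter)=#i">
Header set X-Robots-Tag "noindex, nofollow"
</If>
# Search, compare, checkout, affiliate and account routes, except account/login.
<If "%{QUERY_STRING} =~ m#(route=(product/(search&|compare)|checkout/|affiliate/|account/(?!login)))#i">
Header set X-Robots-Tag "noindex, nofollow"
</If>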
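
To see which query strings those two expressions would tag, the patterns can be tried out offline. This is a minimal Python sketch, assuming Python's re module behaves like Apache's PCRE for these particular patterns (alternation, a negative lookahead and case-insensitive matching). The function name and the sample query strings are made up for illustration.

```python
import re

# The two patterns from the .htaccess snippet, copied unchanged.
PARAM_RE = re.compile(
    r"(sort|order|limit|tag|search|sub_category|description|filter)=", re.I
)
ROUTE_RE = re.compile(
    r"(route=(product/(search&|compare)|checkout/|affiliate/|account/(?!login)))",
    re.I,
)


def gets_noindex(query_string):
    """Return True if either pattern is found anywhere in the query string."""
    return bool(PARAM_RE.search(query_string) or ROUTE_RE.search(query_string))


# Hypothetical query strings to try.
samples = [
    "sort=rating&order=DESC",
    "tag=MSI&limit=20",
    "",
    "route=account/login",
    "route=account/register",
    "route=product/product&product_id=42",
    "hashtag=x",
]

if __name__ == "__main__":
    for qs in samples:
        print(repr(qs), gets_noindex(qs))
```

The login route, a plain product route and an empty query string are left alone. Any parameter whose name merely ends in one of the listed words (the made-up hashtag= here) is also caught, because the match is not anchored to the start of a parameter name.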
There is a third option of preventing your site from generating the links in the first place. You would have to consider how to do that and the implications.

Post by radact » Wed Jul 10, 2024 7:06 am

I'll give that a go and see what happens to the pages report. I just don't know why it has to be so aggressive in its crawling; there would be hundreds of combinations that would give canonical errors for every product if it uses the tag, route, sort, etc. I already went to the trouble of setting up unique, friendly SEO URLs for everything. I have around 1300 products listed, but over 45000 various errors/alerts in the Google console! I'm pretty sure that would negatively impact the crawl budget and possibly the rankings.

Post by ADD Creative » Wed Jul 10, 2024 4:59 pm

The bigger your site, the more blocking with robots.txt looks like the best option, as that will save crawl budget. It's just that fixing the "Indexed, though blocked by robots.txt" warnings becomes tricky.

Google used to have a URL parameter tool where you could set parameters to ignore. They removed it, and now Googlebot seems to be even better at finding the links in the OpenCart JavaScript, making the issue worse.

Post by radact » Wed Jul 10, 2024 7:54 pm

By the sounds of it, it's probably not worth trying to fix it, as long as the URLs themselves are friendly.
Just ignore the alerts until Google change the way their crawlbot works?

Post by ADD Creative » Thu Jul 11, 2024 1:22 am

As long as the canonical tag is pointing to the correct page, I can't see that it would be an issue, other than a waste of crawl budget.

Post by radact » Thu Jul 11, 2024 7:24 am

Yes, the pages are in the following format:
https://www.anotherworld.com.au/shop/re ... scriptions
https://www.anotherworld.com.au/shop/al ... -computers
https://www.anotherworld.com.au/shop/le ... b-lighting
etc

Rather than the old generic method where the arbitrary productid and code was used.
