Post by dimko » Mon Apr 02, 2012 8:00 pm

newbie007 wrote:
uksitebuilder wrote:Could you share your working version for all who come across this
I use OpenCart 1.5.0.


thank you
He meant that you should share the final code of your robots.txt file, not the version of your OpenCart :)

Would you, please?

Thanks.

Using OpenCart v1.5.1.3



Post by kadal » Thu May 15, 2014 4:49 pm

Hello,
can anyone explain to me what "Disallow: /*&limit" does?




Post by samaraki » Tue Jun 10, 2014 7:12 am

Should I use this, since I have the SEO mod enabled...

user-agent: *
Disallow: /*&limit
Disallow: /*&sort
Disallow: /*?route=checkout/
Disallow: /*?route=account/
Disallow: /*?route=product/search
Disallow: /*?route=affiliate/
Disallow: /*checkout/
Disallow: /*account/
Disallow: /*product/search
Disallow: /*affiliate/

??

Because the search engines show tons of my links with /product/search etc., and Google and Bing don't index most of my pages; instead they index search results.

ATM I have it like:

user-agent: *
Disallow: /*&limit
Disallow: /*&sort
Disallow: /*?route=checkout/
Disallow: /*?route=account/
Disallow: /*?route=product/search
Disallow: /*?route=affiliate/

But I think I need to add the SEO disallows too?


Post by Dhaupin » Thu Jul 03, 2014 11:37 pm

@jimaras - if your site puts some kind of language tag in the URL, you can do this. I don't know exactly how it works in OC, but let's say when you switch to Spanish the site URLs look like: http://www.MYSITE.com/es/path-to-my-content. To block it, add a deny rule like Disallow: /*es/ and it will effectively lock es-style URLs out of being indexed.


@kadal - the &limit and &sort disallows apply in places like category pages or search results. You will see a filter to change the "items per page" (that's &limit) and a "Sort By" filter (that's &sort). Multiply the number of available limit and sort values and you can see how many duplicate crawls there can be.
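For example, the same category page can be reached under many parameter combinations (illustrative OpenCart-style URLs, not from a real store):

Code: Select all

http://www.MYSITE.com/index.php?route=product/category&path=20
http://www.MYSITE.com/index.php?route=product/category&path=20&limit=100
http://www.MYSITE.com/index.php?route=product/category&path=20&sort=p.price&order=ASC
http://www.MYSITE.com/index.php?route=product/category&path=20&sort=p.price&order=ASC&limit=100

All of these show the same products, so every extra combination a bot crawls is a duplicate.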


@samaraki - They are definitely indexing your pages if you submitted your sitemaps. They look for stuff beyond the maps though, so if you find tons of search results in the SERPs, then yes, disallow search.



Guide to Setting Up Your Realm - Robotic Thoughts:
You don't need to define routes or extra lines, just use an asterisk... and mind the slash locations, they act like triggers. The robots disallows in the first post should be updated to the version below, as it's more flexible. If you want to allow bots to index one of those areas, put a # before that line. Keep in mind that limit is available in OC as ? or &, and the same goes for sort and order (see the example just below). This also covers SEO URLs that use the utility _route_= method, as well as the SEO URLs themselves. Also remember that a slash at the end of a directive makes it specific to routes.
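For example (illustrative URLs only), the same limit parameter can arrive as either ? or & depending on whether it is the first parameter in the query string, which is why both forms get a rule:

Code: Select all

http://www.MYSITE.com/some-category?limit=100               <- matched by Disallow: /*?limit
http://www.MYSITE.com/some-category?sort=p.price&limit=100  <- matched by Disallow: /*&limit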



Step 1) Before Blocking Bots, Allow Good Parameters:
Go to https://www.google.com/webmasters/tools/ and click your site. Navigate to "Crawl > URL parameters". You will notice some important ones that aren't in this robots.txt, like filter, product_id, and page. You will see "edit" buttons at the end of each row. Click edit and tell Google about the parameter. For example, for "page" you tell it:

Does this parameter change page content seen by the user?
- Yes: Changes, reorders, or narrows page content

How does this parameter affect page content?
- Paginates

Which URLs with this parameter should Googlebot crawl? (Choose 1 of the 2 below)
- Let googlebot decide (the safe way)
- Every URL (may lead to duplicates if set wrong)

You have to set these parameter rules in every multistore/account for every domain on OC in Google Webmaster Tools, but they all share the same robots.txt.
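For example, an OpenCart pagination URL (hypothetical category path) carries the page parameter like this; that is the parameter you are setting the rule for:

Code: Select all

http://www.MYSITE.com/some-category?page=2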



Step 2) Set Robots.txt Correctly:
This disallows access to these areas in a specific, safe way. Something like http://www.MYSITE.com/account or http://www.MYSITE.com/view-my-account WILL still work since there is no slash after "account", whereas the http://www.MYSITE.com/account/anything route is blocked since there is something after the slash.

Code: Select all

# safe method - specific denies
user-agent: *
Disallow: /*&limit
Disallow: /*?limit
Disallow: /*&sort
Disallow: /*?sort
Disallow: /*&order
Disallow: /*?order
Disallow: /*checkout/
Disallow: /*account/
# Disallow: /*product/search/
Disallow: /*affiliate/
Disallow: /*download/
Disallow: /*admin/
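
To illustrate the slash behaviour of the safe rules (example URLs only):

Code: Select all

http://www.MYSITE.com/account            <- crawlable, nothing after "account"
http://www.MYSITE.com/view-my-account    <- crawlable
http://www.MYSITE.com/account/password   <- blocked by Disallow: /*account/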

Optional) You Can Also Use Wildcard Rules (Dangerous)
This disallows access to these areas in a general, blanket way. Something like http://www.MYSITE.com/account or http://www.MYSITE.com/view-my-account WON'T work, since the rule matches the word "account" anywhere in the URL. Do not use this method if you have URLs that contain the word "account" or any other word on this list, unless you know what you are doing.

Code: Select all

# use at own risk - wildcard denies
user-agent: *
Disallow: /*&limit
Disallow: /*?limit
Disallow: /*&sort
Disallow: /*?sort
Disallow: /*&order
Disallow: /*?order
Disallow: /*checkout*
Disallow: /*account*
# Disallow: /*product/search/
Disallow: /*affiliate*
Disallow: /*download*
Disallow: /*admin*
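
And to illustrate why the wildcard form is dangerous (example URLs only):

Code: Select all

http://www.MYSITE.com/account/password      <- blocked by Disallow: /*account*
http://www.MYSITE.com/view-my-account       <- also blocked, "account" matches anywhere
http://www.MYSITE.com/accounting-services   <- also blocked, probably not what you wanted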

Step 3) Test Your Robots.txt Treaty:
Go to https://www.google.com/webmasters/tools/ click your site, then find "Crawl > Blocked URLs". Once you're there, paste your robots.txt content in the top box, then paste a link you want to test in the bottom. For example, if you want to test

Code: Select all

http://www.YOURSITE.com/product/search&search=test&sort=p.price&order=DESC&limit=100
you will see it's blocked by several of those rules.
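Specifically, matching against the safe rule set above:

Code: Select all

Disallow: /*&sort     <- matches &sort=p.price
Disallow: /*&order    <- matches &order=DESC
Disallow: /*&limit    <- matches &limit=100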



And that's a good starting point. Sitemaps weren't included because the multistore sitemap idea, with PHP injected into robots.txt, is awesome; you can read about it somewhere above this post. Anyway, thanks all, I hope this helps clarify 2014-era robots<->GWT for everyone!
Last edited by Dhaupin on Tue Jul 08, 2014 8:23 am, edited 7 times in total.

https://creadev.org | support@creadev.org - Opencart Extensions, Integrations, & Development. Made in the USA.



Post by villagedefrance » Fri Jul 04, 2014 1:47 am

Here is a typical "robots.txt" file:

Code: Select all

User-agent: *

# Directories
Disallow: /admin/
Disallow: /download/
Disallow: /image/
Disallow: /system/

# Files
Disallow: /php.ini
Disallow: /config.php
Disallow: /address.php
Disallow: /account.php
Disallow: /cart.php
Disallow: /checkout.php
Disallow: /history.php
Disallow: /manual.php
Disallow: /payment_address.php
Disallow: /shipping_address.php
Disallow: /order.php
Disallow: /transaction.php
Disallow: /wishlist.php
Disallow: /reward.php
Disallow: /voucher.php
Disallow: /success.php
Disallow: /pagination.php
Disallow: /password.php
Disallow: /search.php
Disallow: /edit.php

# Sitemap

# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.

# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/wc/robots.html
Hope that helps

OpenCart custom solutions @ https://villagedefrance.net


User avatar
Active Member

Posts

Joined
Wed Oct 13, 2010 10:35 pm
Location - UK

Post by Dhaupin » Mon Jul 07, 2014 10:37 pm

villagedefrance wrote:Here is a typical "robots.txt" file:
(robots.txt snipped - see the previous post)
Hope that helps
There is a difference between blocking file access and de-indexing URLs. This is file access blocking, and you are using the wrong tool for some of these -- you would want to use .htaccess instead, since robots.txt is simply a "treaty". Also, some of these can't even be accessed at .com/file.php, so the paths are wrong too. The main .htaccess already denies indexing, so the folders don't really need to be blocked. Actually, since robots.txt is public, what you have basically done here is tell harvesters, bad bots, hackers, and exploiters where all your sensitive file locations are.

Also, if you visit your page for those URLs like config.php, you will notice there is nothing to parse visually anyway, so it's just a blank page or an error. Bots can't "index" the invisible, you know? And if there is no link to them, they won't even see them. It's best to keep the utility-layer blocks hardcore in .htaccess instead.
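
For example, a minimal .htaccess sketch (Apache 2.2-style directives assumed here for illustration; adapt to your server) that hard-denies direct access to sensitive files without advertising them in a public robots.txt:

Code: Select all

# server-layer deny - visitors and bots get a 403, and robots.txt stays quiet about it
<FilesMatch "^(config\.php|php\.ini)$">
    Order allow,deny
    Deny from all
</FilesMatch>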

https://creadev.org | support@creadev.org - Opencart Extensions, Integrations, & Development. Made in the USA.



Post by sunsys » Fri Oct 02, 2015 4:28 am

Dhaupin wrote:
Step 2) Set Robots.txt Correctly:
(instructions and robots.txt snipped - see Dhaupin's post above)
@Dhaupin: But the problem with the above Disallow rules is that they throw up lots and lots of crawl errors, since the URLs are blocked by the robots.txt file. Please advise a way out, as I have some 5000+ crawl errors due to such blocked URLs.

Regards,
Sun Systems
Industrial Electronics and Instrumentation



Post by doduae » Sun Jan 24, 2016 5:31 pm

Hi, I want to ask how I can get all my products indexed by Google. I have 1000+ products live on my website, but Google doesn't index all my URLs. I think my sitemap is not working well; please advise.

Sitemap Url:

https://doduae.com/sitemap.xml

https://doduae.com/bags-and-luggage/handbags


