@jimaras - if your site puts some kind of language tag in the URL, you can do this. I don't know exactly how it works in OC, but let's say when you switch to Spanish the site URLs look like:
http://www.MYSITE.com/es/path-to-my-content. To block it, add a deny rule like Disallow: /*es/ and it will effectively lock the es-style URLs out of indexing.
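A minimal sketch of what that would look like, assuming your Spanish pages really do live under an /es/ prefix (the paths are hypothetical). Note that /*es/ matches "es/" anywhere in a URL, so it would also catch something like /accessories/shoes/, whereas /es/ only matches URLs that start with /es/:
Code: Select all
User-agent: *
# blocks only URLs that begin with /es/, e.g. /es/path-to-my-content
Disallow: /es/
# wildcard form from above - blocks "es/" wherever it appears in a URL
# Disallow: /*es/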
@kadal - the disallow rules for &limit and &sort cover places like category pages and search results. There you will see a filter to change the "items per page", which is the &limit parameter, and a "Sort By" filter, which is the &sort parameter. Multiply the number of available limit values by the number of sort options and you can see how many duplicate crawls of the same content there can be.
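To picture the duplicates, here are a few hypothetical URLs that all show the same category content (the path and values are made up, but this is the pattern OpenCart generates):
Code: Select all
http://www.MYSITE.com/my-category?sort=p.price&order=ASC&limit=25
http://www.MYSITE.com/my-category?sort=p.price&order=ASC&limit=50
http://www.MYSITE.com/my-category?sort=p.model&order=DESC&limit=100
Every extra limit value multiplied by every sort/order combination is one more duplicate of the same page for a crawler to fetch.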
@samaraki - they are definitely indexing your pages if you submitted your sitemaps. They crawl beyond the sitemaps too, so if you find tons of search-result URLs in the SERPs, then yes, disable search from being indexed.
Guide to Setting Up Your Realm - Robotic Thoughts:
You don't need to define routes or extra lines, just use an asterisk... but mind where the slashes go, they act like triggers. The robots.txt disallows in the first post should be updated to this style since it's more flexible. If you want to allow indexing of one of these areas again, put a # in front of that line to comment it out. Keep in mind that limit can appear in OC URLs as either ? or &, and the same goes for sort and order. This also accounts for SEO URLs that use the _route_= method, as well as the SEO URLs themselves. Finally, keep in mind that putting a slash at the end of a directive makes it specific to routes; the hypothetical URLs just below show why both the ? and & forms are listed.
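For example (hypothetical URLs, just to show the difference): when limit is the first parameter on a URL it appears after a ?, otherwise it appears after an &, which is why each parameter gets both a ? rule and an & rule:
Code: Select all
http://www.MYSITE.com/my-category?limit=50
http://www.MYSITE.com/my-category?sort=p.price&limit=50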
Step 1) Before Blocking Bots, Allow Good Parameters:
Go to
https://www.google.com/webmasters/tools/ and click your site, then navigate to "Crawl > URL Parameters". You will notice some important parameters that aren't in this robots.txt, like filter, product_id, and page. Each row has an "Edit" button at the end. Click Edit and tell Google about the parameter. For example, for "page" you tell it the following (a concrete example URL follows the settings below):
Does this parameter change page content seen by the user?
- Yes: Changes, reorders, or narrows page content
How does this parameter affect page content?
- Paginates
Which URLs with this parameter should Googlebot crawl? (choose one of the two below)
- Let Googlebot decide (the safe way)
- Every URL (may lead to duplicates if set wrong)
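As a concrete (hypothetical) example of what the page parameter covers, a paginated category looks like this, and both URLs show slices of the same category content:
Code: Select all
http://www.MYSITE.com/my-category?page=2
http://www.MYSITE.com/my-category?page=3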
You have to set these parameter rules in Google Webmaster Tools for every multistore/account and every domain on your OC install, but they all share the same robots.txt.
Step 2) Set Robots.txt Correctly:
This disallows access to these areas in a specific, safe way. URLs like
http://www.MYSITE.com/account or
http://www.MYSITE.com/view-my-account WILL still work (they stay crawlable), since there is no slash after the word, whereas any
http://www.MYSITE.com/account/anything route is blocked, since something follows the slash. (See the example URLs after the rule set below.)
Code: Select all
# safe method - specific denies
User-agent: *
Disallow: /*&limit
Disallow: /*?limit
Disallow: /*&sort
Disallow: /*?sort
Disallow: /*&order
Disallow: /*?order
Disallow: /*checkout/
Disallow: /*account/
# Disallow: /*product/search/
Disallow: /*affiliate/
Disallow: /*download/
Disallow: /*admin/
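To make the slash behaviour concrete, here are some hypothetical URLs and how the safe rules above treat them:
Code: Select all
http://www.MYSITE.com/account                          allowed - no "account/" in the URL
http://www.MYSITE.com/view-my-account                  allowed
http://www.MYSITE.com/index.php?route=account/login    blocked by Disallow: /*account/
http://www.MYSITE.com/my-category?sort=p.price         blocked by Disallow: /*?sort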
Optional) You Can Also Use Wildcard Rules (Dangerous):
This disallows access to these areas in a general, blanket way. URLs like
http://www.MYSITE.com/account or
http://www.MYSITE.com/view-my-account WON'T work (they get blocked), since the pattern matches the word "account" anywhere in the URL. Do not use this method if you have URLs that contain "account" or any other word on this list, unless you know what you are doing. (See the example URLs after the rule set below.)
Code: Select all
# use at own risk - wildcard denies
User-agent: *
Disallow: /*&limit
Disallow: /*?limit
Disallow: /*&sort
Disallow: /*?sort
Disallow: /*&order
Disallow: /*?order
Disallow: /*checkout*
Disallow: /*account*
# Disallow: /*product/search/
Disallow: /*affiliate*
Disallow: /*download*
Disallow: /*admin*
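And a hypothetical illustration of why the blanket version is risky - the same kind of URLs under the wildcard rules:
Code: Select all
http://www.MYSITE.com/view-my-account                  blocked by Disallow: /*account*  (probably not what you want)
http://www.MYSITE.com/free-download-guide              blocked by Disallow: /*download*
http://www.MYSITE.com/index.php?route=account/login    blocked (this one is intended)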
Step 3) Test Your Robots.txt:
Go to
https://www.google.com/webmasters/tools/, click your site, then find "Crawl > Blocked URLs". Once you're there, the top box holds your robots.txt rules; paste the link you want to test into the bottom box. For example, if you want to test
Code: Select all
http://www.YOURSITE.com/product/search&search=test&sort=p.price&order=DESC&limit=100
you will see it is blocked by several of those rules (the &sort, &order, and &limit directives all match).
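For reference, here is how that test URL trips the safe rule set above, matching each query part to its directive:
Code: Select all
&sort=p.price   ->  matched by Disallow: /*&sort
&order=DESC     ->  matched by Disallow: /*&order
&limit=100      ->  matched by Disallow: /*&limit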
And that's a good starting point. Sitemaps weren't included here because the multistore sitemap idea (injecting them into robots.txt with PHP) is awesome; you can read about it somewhere above this post. Anyway, thanks all, I hope this helps clarify the 2014 robots.txt <-> GWT setup for everyone!