
How to set up a Robots.txt file

26 February 2019

Frank
Product Guru

Do you want to hide specific areas of your platform from search engines, or do you want to tell them you have a sitemap available? Say hello to the Robots.txt file. It’s a simple text file with huge responsibilities: it tells search engines how to crawl and index the content of your platform.

Robots.txt cannot force search engines to do anything; they are free to do whatever they please. But it has become a universal standard, and most of the bigger search engines (most importantly Google) will respect and follow the rules you provide. Want to learn how to set up your robots.txt file? We explain as much as we can below!

Default Robots.txt file on inSided platforms

Tip: Use the default rules listed below to stop search engines from crawling member, sort, and search pages (and to save crawl budget).

User-agent: *
Disallow: /members/
Disallow: ?userid=
Disallow: ?sort=
Disallow: search_type=tag
Disallow: search?q=

Robots.txt elements

User-agent: determines which search engines the rules apply to; * indicates that the rules apply to all user-agents
Allow: determines which content is accessible for the user-agent (not part of the original robots.txt standard; supported by Google and Bing, among others)
Disallow: determines which content is not accessible for the user-agent
Sitemap: tells the search engine where it can find the sitemap.xml file. This should be an absolute URL pointing to a file in .xml format. (The inSided platform does not come with built-in sitemap functionality.)
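
As a rough illustration of how these four elements fit together, the sketch below parses a small robots.txt with Python's standard urllib.robotparser module and asks which URLs a crawler may fetch. The host community.example.com, the /members/faq path, and the sitemap URL are made-up placeholders, and this parser only approximates how real search engines interpret the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt using all four elements described above.
ROBOTS_TXT = """\
User-agent: *
Allow: /members/faq
Disallow: /members/
Sitemap: https://community.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "https://community.example.com"  # placeholder host
for path in ("/members/faq", "/members/123", "/topic/hello"):
    verdict = "allowed" if parser.can_fetch("*", BASE + path) else "blocked"
    print(f"{path}: {verdict}")

# site_maps() needs Python 3.8+; it returns the listed Sitemap URLs, or None.
print(parser.site_maps())
```

The Allow line is placed before the Disallow line on purpose: this parser applies the first rule that matches, while Google picks the most specific (longest) matching rule. With this ordering both approaches give the same answer.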

How to set up a Robots.txt file for your community

Go to Control → Settings → SEO → Robots.txt (you have to be an administrator to access this page)
Provide Robots.txt elements in the details section. The following elements are supported: user-agent, disallow, allow, sitemap
Hit Save changes, and you’re done! You have successfully configured your Robots.txt file.
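
Before pasting rules into the details section, it can help to check that every line uses one of the four supported elements. The helper below is only a sketch of such a check: the function name is made up, the Crawl-delay line is just an example of an element that is not in the supported list, and the assumption that # starts a comment comes from general robots.txt practice, not from the platform itself.

```python
SUPPORTED = {"user-agent", "disallow", "allow", "sitemap"}

def unsupported_lines(robots_txt: str) -> list:
    """Return (line number, text) pairs for lines using an unsupported element."""
    problems = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # blank or comment-only line
        name, sep, _value = line.partition(":")
        if not sep or name.strip().lower() not in SUPPORTED:
            problems.append((number, raw))
    return problems

# Hypothetical draft with one element outside the supported list.
draft = """\
User-agent: *
Disallow: /members/
Crawl-delay: 10
"""
print(unsupported_lines(draft))
```

An empty list means every line uses a supported element; it does not say whether the rules do what you intend.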

Example: How to crawl all of my content

User-agent: * 
Allow: /

Example: How to crawl none of my content

User-agent: * 
Disallow: /
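
To see the effect of the two examples above, you could feed them to Python's standard urllib.robotparser module. This is a sketch for local experimentation only: the may_crawl helper and the community.example.com URL are placeholders, and real search engines may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

def may_crawl(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

CRAWL_ALL = "User-agent: *\nAllow: /\n"
CRAWL_NONE = "User-agent: *\nDisallow: /\n"

url = "https://community.example.com/topic/hello"  # placeholder URL
print(may_crawl(CRAWL_ALL, url))   # True
print(may_crawl(CRAWL_NONE, url))  # False
```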

Which search engines support Robots.txt

Google (documentation)
Bing (documentation)
Yahoo (documentation)
DuckDuckGo (documentation)
Yandex (documentation)
Baidu (documentation)

Note: It is not required to have a Robots.txt file. If you don’t provide one, search engines will crawl all publicly accessible pages of your platform.

Beware: The inSided platform does not come with built-in sitemap functionality. If you want to make use of a sitemap, you have to create and host your own sitemap (in XML).
A Robots.txt file is custom made for your platform only – inSided support won’t be able to assist you with issues related to your robots.txt.
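
If you do decide to create your own sitemap, a minimal file can be generated with a few lines of Python. The sketch below uses the standard sitemaps.org namespace; the build_sitemap function and the community.example.com URLs are placeholders, and how you collect the list of pages and where you host the resulting file is up to you.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls) -> str:
    """Build a minimal sitemap document listing the given absolute URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

# Placeholder pages; replace with the public URLs of your own platform.
pages = [
    "https://community.example.com/",
    "https://community.example.com/topic/hello",
]
print(build_sitemap(pages))
```

Once the file is hosted, its absolute URL can be referenced with a Sitemap: line in your Robots.txt file.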

4 replies

wieger
Gainsight Employee: Rookie
28 February 2019

@Frank Hidden/private categories, only accessible by certain user roles, are never crawled, right? Only the public parts are normally indexed by search engines, is that right?

A robots.txt file will not override that, or can it? Thanks!

Frank
Author
Product Guru
4 March 2019

Hey Wieger, those categories cannot be crawled because search engines can't get access to them, so a robots.txt file won't override this.

Perhaps I have not yet learned enough about robots.txt, so please forgive my question!

Does robots.txt also have the ability to prevent search engines from crawling content from before a certain date? With the instructions in this topic it seems that it's only possible for "areas" on the platform, is that correct?

If so, are there any plans to make this possible in the near future? And if I have overlooked it, how can I make sure that outdated content does not get crawled?

Thanks in advance!

Frank
Author
Product Guru
26 March 2019

No worries David, all questions are welcome here. This website can tell you more about Robots.txt: http://www.robotstxt.org/robotstxt.html

It is not possible to prevent search engines from crawling content from before a certain date. Robots.txt rules match URL paths, so you can only tell search engines which parts of your platform they may or may not crawl. Unfortunately we have to work with the limitations that robots.txt gives us, so we cannot make your request possible.

If you want to make sure outdated content will not be crawled, you can use the Disallow element for the paths of that content.
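
For illustration only, here is a small sketch of how such rules could be generated once you have collected the paths of your outdated content yourself. The disallow_rules helper and the two paths are invented for this example; robots.txt has no notion of dates, so maintaining that list stays a manual job.

```python
def disallow_rules(paths, agent="*"):
    """Return robots.txt text that disallows each given path for `agent`."""
    lines = [f"User-agent: {agent}"]
    lines += [f"Disallow: {path}" for path in paths]
    return "\n".join(lines) + "\n"

# Hypothetical paths of outdated content, collected by hand.
outdated = ["/archive-2016/", "/topic/old-release-notes"]
print(disallow_rules(outdated))
```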
