Gatsby robots.txt and sitemap.xml Files
This is a 4-part series about building an SEO-optimized Gatsby blog:

- `gatsby-config.js` and `gatsby-node.js` files
- GraphQL Fragments
- SEO component
- `sitemap.xml` and `robots.txt`

In Part 4, we will learn about the `sitemap.xml` and `robots.txt` files.
At any point, feel free to check out the source code on GitHub or the live blog.
sitemap.xml

`sitemap.xml` is a way to provide more information to search engines about the pages on your website. A sitemap is a file that lists a website's URLs along with additional metadata about each URL.
To generate `sitemap.xml`, we have to first install the `gatsby-plugin-sitemap` plugin.
```sh
yarn add gatsby-plugin-sitemap
```
Add `gatsby-plugin-sitemap` to `gatsby-config.js`:
```js
// ✂️
module.exports = {
  plugins: [
    `gatsby-plugin-sitemap`,
    // ✂️
  ],
}
```
By default, `gatsby-plugin-sitemap` expects the site URL from `siteMetadata.siteUrl`, so we don't need to do anything else.
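For reference, a minimal `siteMetadata` block with `siteUrl` might look like the sketch below. The URL is only a placeholder; later in this part we compute `siteUrl` from Netlify environment variables instead.

```js
// A minimal sketch — the siteUrl value here is only a placeholder.
module.exports = {
  siteMetadata: {
    siteUrl: `https://gatsby-seo.netlify.app`,
  },
  plugins: [`gatsby-plugin-sitemap`],
}
```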
It's important to note that this plugin generates sitemap files only in `production` mode. To validate the generated `sitemap.xml` file, we will need to build and serve the site:
```sh
gatsby build && gatsby serve
```
Once we navigate to `http://localhost:9000/sitemap/sitemap-index.xml`, we should see the following:
```xml
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://gatsby-seo.netlify.app/sitemap/sitemap-0.xml</loc>
  </sitemap>
</sitemapindex>
```
This is not the actual sitemap, but the sitemap index page. According to Google, a single `sitemap.xml` is limited to 50 MB (uncompressed) and 50,000 URLs. These limits were created to ensure that our web server wouldn't become overloaded by serving large files to search engines. The sitemap index page solves this issue by letting us break our sitemap down into smaller pieces. If you want to know the maximum number of URLs that can be submitted with a single sitemap index, check out this Stack Overflow answer.
If we want to open the actual `sitemap.xml`, we have to copy the URL from the `loc` node and open it in the browser: `http://localhost:9000/sitemap/sitemap-0.xml` (where `localhost` stands in for the production URL).
```xml
<urlset
  xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
  xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
  xmlns:video="http://www.google.com/schemas/sitemap-video/1.1"
>
  <url>
    <loc>https://gatsby-seo.netlify.app/</loc>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>https://gatsby-seo.netlify.app/blog/siberian-husky/</loc>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
  </url>
  <!-- ✂️ -->
</urlset>
```
As it happens, `changefreq` and `priority` are no longer used by Google. We can safely remove this bloat from `sitemap.xml` and add what's important: the last modified date (`lastmod`).

Now we update the `gatsby-config.js` file:
```js
const path = require("path")
const siteMetadata = require("./site-metadata")
const slashify = require("./src/helpers/slashify")
// ✂️
const {
  NODE_ENV,
  SITE_URL,
  URL: NETLIFY_SITE_URL = SITE_URL,
  DEPLOY_PRIME_URL: NETLIFY_DEPLOY_URL = NETLIFY_SITE_URL,
  CONTEXT: NETLIFY_ENV = NODE_ENV,
} = process.env

const isNetlifyProduction = NETLIFY_ENV === `production`
const siteUrl = isNetlifyProduction ? NETLIFY_SITE_URL : NETLIFY_DEPLOY_URL

module.exports = {
  siteMetadata: {
    ...siteMetadata,
    siteUrl,
  },
  plugins: [
    // ✂️
    {
      resolve: `gatsby-plugin-sitemap`,
      options: {
        query: `
          {
            allMdx {
              nodes {
                frontmatter {
                  published
                  modified
                }
                fields {
                  slug
                }
              }
            }
          }
        `,
        resolveSiteUrl: () => siteUrl,
        resolvePages: ({ allMdx: { nodes: mdxNodes } }) => {
          const { pages } = siteMetadata
          const blogPathName = pages.blog.pathName

          const allPages = Object.values(pages).reduce((acc, { pathName }) => {
            if (pathName) {
              acc.push({ path: slashify(pathName) })
            }
            return acc
          }, [])

          const allArticles = mdxNodes.map(
            ({ frontmatter: { published, modified }, fields: { slug } }) => ({
              path: slashify(blogPathName, slug),
              lastmod: modified ? modified : published,
            })
          )

          return [...allPages, ...allArticles]
        },
        serialize: ({ path: url, lastmod }) => ({
          url,
          lastmod,
        }),
      },
    },
  ],
}
```
Next, we fetch and format (thanks to the `slashify` function) all the page and article URLs. Then we fetch the last modified date for each article and create a sitemap entry for each one. If an article doesn't have a last modified date, we fall back to its published date.
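The `slashify` helper itself isn't shown in this part. A rough sketch of what it might look like, assuming it simply joins path segments and normalizes leading and trailing slashes, is:

```js
// src/helpers/slashify.js — a hypothetical implementation: joins the given
// segments into a path with a single leading and trailing slash.
module.exports = (...segments) =>
  `/${segments
    .filter(Boolean)
    .map(segment => String(segment).replace(/^\/+|\/+$/g, ``))
    .join(`/`)}/`.replace(/\/{2,}/g, `/`)
```

With this sketch, `slashify("blog", "siberian-husky")` would return `/blog/siberian-husky/`, which matches the paths we see in the sitemap.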
After the above changes, our `sitemap.xml` should look like this:
```xml
<urlset
  xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
  xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
  xmlns:video="http://www.google.com/schemas/sitemap-video/1.1"
>
  <url>
    <loc>https://gatsby-seo-draft.netlify.app/</loc>
  </url>
  <url>
    <loc>https://gatsby-seo-draft.netlify.app/blog/siberian-husky/</loc>
    <lastmod>2017-02-02T00:00:00.000Z</lastmod>
  </url>
  <!-- ✂️ -->
</urlset>
```
robots.txt

The `robots.txt` file instructs search engines how to crawl the pages of your website. It indicates whether a particular user agent can or cannot crawl parts of the website. These crawling instructions are specified by `disallow` and `allow` directives for all or specific user agents.
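For illustration only (the user agent and paths below are made up, not part of this site's configuration), a `robots.txt` can mix per-agent and global directives like this:

```
# Illustrative example only — block one crawler entirely,
# and keep a hypothetical /drafts/ section off-limits for everyone else.
User-agent: BadBot
Disallow: /

User-agent: *
Allow: /
Disallow: /drafts/
```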
To generate `robots.txt`, we have to install `gatsby-plugin-robots-txt`.
```sh
yarn add gatsby-plugin-robots-txt
```
Update `gatsby-config.js`:
```js
// ✂️
const {
  NODE_ENV,
  SITE_URL,
  URL: NETLIFY_SITE_URL = SITE_URL,
  DEPLOY_PRIME_URL: NETLIFY_DEPLOY_URL = NETLIFY_SITE_URL,
  CONTEXT: NETLIFY_ENV = NODE_ENV,
} = process.env
// ✂️

module.exports = {
  // ✂️
  plugins: [
    // ✂️
    {
      resolve: `gatsby-plugin-robots-txt`,
      options: {
        resolveEnv: () => NETLIFY_ENV,
        env: {
          production: {
            policy: [{ userAgent: `*` }],
          },
          "branch-deploy": {
            policy: [{ userAgent: `*`, disallow: [`/`] }],
            sitemap: null,
            host: null,
          },
          "deploy-preview": {
            policy: [{ userAgent: `*`, disallow: [`/`] }],
            sitemap: null,
            host: null,
          },
        },
      },
    },
    // ✂️
  ],
}
```
We can use Netlify to make sure that search engines do not have access to the development version of our website. We do so with the `NETLIFY_ENV` environment variable. When `NETLIFY_ENV` is set to `production`, we allow all user agents to access all URLs. When `NETLIFY_ENV` is set to `branch-deploy` or `deploy-preview`, we deny all user agents access to all URLs.
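To sanity-check the non-production output locally, one option (assuming the environment handling above; the `SITE_URL` value is just a placeholder) is to set the Netlify variables ourselves before building:

```sh
# Simulate a Netlify deploy preview locally. CONTEXT feeds NETLIFY_ENV in
# gatsby-config.js, and SITE_URL here is only a placeholder.
SITE_URL=http://localhost:9000 CONTEXT=deploy-preview gatsby build && gatsby serve
```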
`robots.txt` is generated in the root of our website at `http://<SITE_URL>/robots.txt`. With the above configuration, our production `robots.txt` will look like this:
```
User-agent: *
Sitemap: https://gatsby-seo.netlify.app/sitemap/sitemap-index.xml
Host: https://gatsby-seo.netlify.app
```
And in non-production environments (`branch-deploy` or `deploy-preview`), it will look like this:
```
User-agent: *
Disallow: /
```