Experience Sitecore ! | All posts tagged 'SEO'


More than 200 articles about the best DXP by Martin Miles

Sitemaps in Sitecore XM Cloud: Automation, Customization, and SEO Best Practices

In Sitecore XM Cloud, sitemaps are generated and served via Experience Edge to inform search engines about all discoverable URLs. XM Cloud uses SXA’s built‑in sitemap features by default, storing the generated XML as media items in the CMS so they can be published to Experience Edge. Sitemap behavior is controlled by the Sitemap configuration item under /sitecore/content/<SiteCollection>/<Site>/Settings/Sitemap. A few fields matter most: Refresh threshold, which defines the minimum time between regenerations; Cache expiration; Maximum number of pages per sitemap, which controls splitting into a sitemap index; and Generate sitemap media items, which must be enabled to publish via Edge. The Sitemap media items field of the Site item lists the generated sitemap(s) under /sitecore/media library/Project/<Site>/<Site>/Sitemaps/<Site>, and the default link provider is used unless overridden. Tip: you can configure a custom provider via <linkManager> and choose its name in the Sitemap settings.

Automated Sitemap Generation Workflow

When content authors publish pages, XM Cloud schedules sitemap regeneration automatically based on the refresh threshold. Behind the scenes, a publish-end event handler (often the SitemapCacheClearer.OnPublishEnd handler in SXA) checks each site’s sitemap settings. If enough time has elapsed since the last build, a Sitemap Refresh job runs. In this job, the old sitemap media item is deleted and a new one is generated and saved in the Media Library. Once created, the new sitemap item is linked in the Sitemap media items field of the site and then published. This typically triggers two publish actions: one to publish the new media item (/sitecore/media library/Project/.../Sitemaps/<Site>/sitemap) and one to re-publish the Site item so Experience Edge sees the updated link.

For high-volume publishing, it’s best to set a reasonable refresh threshold so that sitemap generation is batched. Setting the threshold to 0 forces a rebuild on every publish, which is handy for debugging but wasteful on a busy site. If you publish many pages daily, prefer a longer threshold or schedule a daily publish so the sitemap is updated once per day. Generating sitemaps can be resource-intensive, especially for large sites, so avoid rebuilding on every small change unless necessary.
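The "has enough time elapsed since the last build" check can be pictured roughly as follows. This is only an illustrative TypeScript sketch, not SXA's actual server-side implementation; the function name, the choice of minutes as the unit, and the sample timestamps are all assumptions made for the example.

```typescript
// Hypothetical illustration of a refresh-threshold check; this is not SXA code.
// Assumption: the threshold is expressed in minutes.
function shouldRegenerateSitemap(
  lastBuild: Date | null,
  now: Date,
  refreshThresholdMinutes: number
): boolean {
  // No previous build recorded: always generate.
  if (lastBuild === null) return true;
  const elapsedMinutes = (now.getTime() - lastBuild.getTime()) / 60000;
  // With a threshold of 0 every publish qualifies for a rebuild.
  return elapsedMinutes >= refreshThresholdMinutes;
}

// Sample values: the last build happened 20 minutes before the publish.
const lastBuild = new Date("2024-01-01T10:00:00Z");
const publishedAt = new Date("2024-01-01T10:20:00Z");
const rebuildWithLongThreshold = shouldRegenerateSitemap(lastBuild, publishedAt, 60);
const rebuildWithZeroThreshold = shouldRegenerateSitemap(lastBuild, publishedAt, 0);
```

With a 60-minute threshold the publish 20 minutes after the last build is skipped, while a threshold of 0 lets it through, which is why 0 is mostly useful while debugging.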

Sitemap Filtering: SXA provides pipeline processors to include or exclude pages. By default, items inheriting SXA’s base page templates have a Change frequency field. Setting it to "do not include" will exclude that page from the sitemap. The SXA sitemap pipelines (sitemap.filterItem) include built‑in processors for base template filtering and change-frequency logic. To exclude a page, simply open it in Content Editor (or Experience Editor SEO dialog) and set Change frequency to "do not include".

GraphQL Sitemap Query: Once published, the XM Cloud GraphQL API provides access to the sitemap media URL. For example, the following query returns the sitemap XML URL for a given site name:

query SitemapQuery($site: String!) {
  site {
    siteInfo(site: $site) {
      sitemap
    }
  }
}

This returns the Experience Edge URL of the generated sitemap media item. You can use this in headless code or debugging to verify the sitemap’s existence and freshness.
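For debugging from headless code, the query can be wrapped in a plain GraphQL-over-HTTP POST. The sketch below only builds the request; the endpoint URL, the API key value, the site name, and the sc_apikey header name are assumptions for illustration and should be checked against your own environment.

```typescript
// The query from above, sent as a standard GraphQL POST body.
const SITEMAP_QUERY = `
  query SitemapQuery($site: String!) {
    site {
      siteInfo(site: $site) {
        sitemap
      }
    }
  }
`;

interface GraphQLRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// endpoint and apiKey are placeholders supplied by the caller (assumptions).
function buildSitemapRequest(endpoint: string, apiKey: string, siteName: string): GraphQLRequest {
  return {
    url: endpoint,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        sc_apikey: apiKey, // assumed header name for the API key
      },
      body: JSON.stringify({ query: SITEMAP_QUERY, variables: { site: siteName } }),
    },
  };
}

// Hypothetical values, for illustration only.
const request = buildSitemapRequest("https://example.invalid/graphql", "my-api-key", "my-site");
```

The result can be passed to fetch(request.url, request.init); the sitemap value is then read from data.site.siteInfo.sitemap in the JSON response.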

Sitemaps in Local Docker Containers

In a local XM Cloud Docker setup, the /sitemap.xml route often returns an empty file by default because the Experience Edge publish never occurs. There is no web database or Edge target, so the OnPublishEnd process never actually runs, leaving the empty sitemap item. Attempting to publish locally throws an exception (Invalid Authority connection string for Edge). To debug or test sitemap issues locally, you can manually trigger the SXA sitemap pipeline.

I really like the Sitemap Developer Utility approach suggested by Jeff L'Heureux: in your XM Cloud solution’s Docker files, create a page (e.g. generateSitemap.aspx) inside docker\deploy\platform with code that simulates a publish event. For example, one can invoke the SitemapCacheClearer.OnPublishEnd() method manually in C#.

// Simulate a publish event for the "Edge" target
Database master = Factory.GetDatabase("master");
List<string> targets = new List<string> { "Edge" };
// Language.Parse("en") is used here; substitute the language your site publishes in
PublishOptions options = new PublishOptions(master, master, PublishMode.SingleItem,
    Language.Parse("en"), DateTime.Now, targets);
Publisher publisher = new Publisher(options);
// Wrap the publisher in event args, the way a publish-end event would deliver it
SitecoreEventArgs args = new SitecoreEventArgs("OnPublishEnd", new object[] { publisher }, new EventResult());
// Call the SXA handler directly to kick off the sitemap rebuild
new SitemapCacheClearer().OnPublishEnd(null, args);

This code triggers the same sitemap build logic as a real publish. Jeff's utility page provides buttons to run various steps (OnPublishEnd, the sitemap.generateSitemapJob pipeline, etc.) and shows output.

Once you run the utility and the cache job completes, the media item is regenerated. Then restart or refresh your Next.js site locally to see the updated sitemap at http://front-end-site.localhost/sitemap.xml. The browser will display the raw XML with <loc>, <lastmod>, <changefreq>, and <priority> entries as it normally should.

Sitemap Customization for Multi-Domain Sites

A common scenario is one XM Cloud instance serving multiple language or regional domains (say, www.siteA.com and www.siteA.fr) with one shared content tree. In SXA this is often handled by a Site Grouping with multiple hostnames. By default, SXA will generate a single sitemap based on the primary hostname. This leads to two issues: the same XML file is returned on both domains, and each page appears several times (once per language) under the same <loc>. For example, a bilingual site without customization might show both English and French URLs under the English domain, duplicating <url> entries.

To fix this, customize the Next.js API route (e.g. pages/api/sitemap.ts) that serves /sitemap.xml. The approach is: detect which host/domain the request is for, fetch the raw sitemap XML via GraphQL, and then filter and rewrite the entries accordingly. For instance, if the host header contains the French domain, only include the French URLs and update the <loc> and hreflang="fr" links to use the French hostname. Pseudocode for the filtering might look like:

if (lang === 'en') {
  // Filter out French URLs and fix alternate links
  urls = urls.filter(u => !u.loc[0].includes(FRENCH_PREFIX))
             .map(updateFrenchAlternateLinks);
} else if (lang === 'fr') {
  // Filter out English URLs and swap French loc to French domain
  urls = urls.filter(u => u.loc[0].includes(FRENCH_PREFIX))
             .map(updateLocToFrenchDomain)
             .map(updateFrenchAlternateLinks);
}

Here, FRENCH_PREFIX is something like en.mysite.com/fr, and we replace it with the French hostname. In practice, the XML is parsed (e.g. via xml2js), the result.urlset.url array is filtered and modified, and the result is rebuilt into XML. There is a great solution suggested by Mike Payne which uses two helper functions, filterUrlsEN and filterUrlsFR, to drop unwanted entries, and updateLoc/updateFrenchXhtmlURLs to replace URL prefixes. Finally, the modified XML is sent in the HTTP response. This ensures that when a sitemap is requested from www.site.ca, all <loc> URLs and alternate links point to www.site.ca, and when requested from www.othersite.com, they point to www.othersite.com.
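To make the filtering step more concrete, here is a minimal TypeScript sketch loosely modeled on the pseudocode above. It is not Mike Payne's implementation: the helper bodies are my own guess, the entry shape assumes xml2js default output (child elements as arrays, attributes under "$"), and FRENCH_PREFIX and FRENCH_HOST are hypothetical values that would normally come from environment variables.

```typescript
// Shape of one <url> entry as xml2js produces it by default (assumption).
interface AlternateLink { $: { rel: string; hreflang: string; href: string } }
interface UrlEntry { loc: string[]; "xhtml:link"?: AlternateLink[] }

// Hypothetical values for illustration.
const FRENCH_PREFIX = "en.mysite.com/fr";
const FRENCH_HOST = "fr.mysite.com";

const toFrenchHost = (url: string): string => url.replace(FRENCH_PREFIX, FRENCH_HOST);

// Point hreflang="fr" alternate links at the French hostname.
function updateFrenchAlternateLinks(entry: UrlEntry): UrlEntry {
  const links = entry["xhtml:link"]?.map((link) =>
    link.$.hreflang === "fr" ? { $: { ...link.$, href: toFrenchHost(link.$.href) } } : link
  );
  return links ? { ...entry, "xhtml:link": links } : entry;
}

// Rewrite <loc> so French pages live under the French hostname.
function updateLocToFrenchDomain(entry: UrlEntry): UrlEntry {
  return { ...entry, loc: entry.loc.map(toFrenchHost) };
}

function filterForLanguage(urls: UrlEntry[], lang: "en" | "fr"): UrlEntry[] {
  if (lang === "en") {
    // Keep only English URLs, then fix their French alternates.
    return urls.filter((u) => !u.loc[0].includes(FRENCH_PREFIX)).map(updateFrenchAlternateLinks);
  }
  // Keep only French URLs, move them to the French host, fix alternates.
  return urls
    .filter((u) => u.loc[0].includes(FRENCH_PREFIX))
    .map(updateLocToFrenchDomain)
    .map(updateFrenchAlternateLinks);
}

const sample: UrlEntry[] = [
  {
    loc: ["https://en.mysite.com/about"],
    "xhtml:link": [
      { $: { rel: "alternate", hreflang: "en", href: "https://en.mysite.com/about" } },
      { $: { rel: "alternate", hreflang: "fr", href: "https://en.mysite.com/fr/about" } },
    ],
  },
  { loc: ["https://en.mysite.com/fr/about"] },
];
const englishUrls = filterForLanguage(sample, "en");
const frenchUrls = filterForLanguage(sample, "fr");
```

Each domain ends up with one entry for the sample page instead of two, and the French alternate link points at the French hostname in both outputs.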

SEO Considerations and Best Practices

  • Include Alternate Languages (hreflang): XM Cloud (via SXA) automatically adds <xhtml:link rel="alternate" hreflang="..."> entries in the sitemap for multi-lingual pages. Ensure these are correct for your domains. After customizing for multiple hostnames, the <xhtml:link> URLs should also be updated to the appropriate domain. This helps Google index the right language version for each region.

  • Set Change Frequency and Priority: Use SXA’s SEO dialog or Content Editor on the page item to set Change frequency and Priority for each page. For example, if a page is static, set a low change frequency. These values are written into <changefreq> and <priority> in the sitemap. Note: Pages can be excluded by setting frequency to "do not include".

  • Maximize Crawling via Sitemap Index: If your site has many pages, configure Maximum number of pages per sitemap so XM Cloud generates a sitemap index with multiple files. This avoids any single sitemap exceeding search engine limits and keeps crawlers from giving up on a very large file.

  • Robots.txt: SXA will append the sitemap link /sitemap.xml to the site’s robots.txt automatically. Verify that your robots.txt in production references the correct sitemap and hostname.

  • Media Items and Edge: Always keep Generate sitemap media items enabled; without it, XM Cloud cannot deliver the XML to the front-end. After a successful build, the sitemap XML is stored in a media item and served by Experience Edge. You can confirm the published sitemap exists by checking /sitecore/media library/Project/<Site>/<Site>/Sitemaps/<Site> or by running the GraphQL query mentioned above.

  • Link Provider Configuration: If your site uses custom URL routing (e.g. language segments or rewritten paths), you can override the link provider used for sitemap URLs. In a patch config, add something like:

    <linkManager defaultProvider="switchableLinkProvider">
      <providers>
        <add name="customSitemapLinkProvider"
             type="Sitecore.XA.Foundation.Multisite.LinkManagers.LocalizableLinkProvider, Sitecore.XA.Foundation.Multisite"
             lowercaseUrls="true" .../>
      </providers>
    </linkManager>

    Don't forget to set the "Link provider name" field in the Sitemap settings to customSitemapLinkProvider afterwards. This ensures the sitemap uses the correct domain and culture prefixes as needed.

Diagnostics and Troubleshooting

If the sitemap isn’t updating or the XML is wrong, check these:

  • Site Item Settings: On the site’s Settings/Sitemap item, confirm the refresh threshold and expiration are as expected. During debugging you can set threshold to 0 to force immediate rebuilds.

  • Was it published to Edge? Ensure the sitemap media item was published to Edge. You might need to publish the Site item or Media Library manually if it wasn’t picked up.

  • Cache Type: In the SXA Sitemap settings, the Cache Type can be set to "Inactive," "Stored in cache", or "Stored in file". For XM Cloud, the default "Stored in file" is typically used so the XML is persisted. If set to "Inactive", the sitemap generator will not run.

  • Inspect Job History: In the CM admin (/sitecore/admin/Jobs.aspx), look for the "Sitemap refresh" jobs to see if these succeeded or threw errors.

  • Next.js Route Errors: If your Next.js site’s /sitemap.xml endpoint returns an error, inspect its handler. The custom API route uses GraphQLSitemapXmlService.getSitemap(). Ensure the hostnames in your logic match your ENV variables, namely PUBLIC_EN_HOSTNAME. Add logging around the xml2js parsing if the output seems empty or malformed.

By following the above patterns - configuring SXA sitemap settings, automating generation on publish, and customizing for your site topology - you can ensure that XM Cloud serves up accurate, SEO‑friendly sitemaps. This helps search engines index your content fully and respects multi-lingual domain structures and refresh logic specific to a headless architecture.


Evolutional approach to Next.js and its modes

As you might have heard, Sitecore has chosen Next.js to be used along with its JSS SDK. But what makes Next.js such a great tool for those of us switching to a new paradigm of Sitecore development? In this blog post, I'll go through the evolution of Sitecore development, reviewing how it has progressed over time.

Old school development

A decade ago, we used classical ASP.NET WebForms to render a page on a server and pass it to the client. The whole idea of WebForms was faulty, as it tried to mimic the event-based model of desktop development to make web development feel familiar to desktop developers. That came at the cost of ignoring the stateless nature of HTTP and creating weird, ugly abstractions (e.g. ViewState and EventValidation).

It was later made obsolete by the MVC approach, which turned ASP.NET web development into what it should be in a better world: no state and event abstractions, no server controls, and no Master pages. It brought proper separation of code and markup (which itself became cleaner and more readable with the introduction of Razor views). It all benefited from the MVC architecture, proven in other web technologies such as Ruby on Rails. Moreover, the implementation allowed extensibility at almost every stage of the web request lifecycle, while ASP.NET MVC going open-source allowed writing your code aligned with the exact implementation of the framework.

MVC was a great step ahead and stayed the default way of building sites with Sitecore for some 5-8 years. Being so close to the raw request was a great strength, but it came at the cost of a lot of repetitive work.

The introduction of SXA fixed most of these issues by relying heavily on Sitecore PowerShell to automate most of the things that should be automated. The overall experience of developers and editors improved with SXA thanks to Page and Partial Designs, powerful components adjustable with rendering variants, support for the most popular grid systems, and flexible search and SEO tools.

SXA was great in most aspects except one, and the most important one: it was still built on top of MVC. That means web pages were generated on the server by rendering content into HTML views. Or in other words, it was not headless...

Headless

Meanwhile, the world of front-end development experienced massive growth, and after half a decade of JS frameworks appearing one after another, a triad of winners stood out: React, Angular, and Vue. The good old days of jQuery came to an end, giving way to industry-proven frameworks with bigger feature sets and a revised architecture that suits modern web development.

With time it became harder and harder to split work between back-end and front-end teams (as for full-stack developers, most of them tend to choose one side or the other). Even more effort was spent on the unwanted work of keeping FE and BE teams in sync, which could not last long as both sides were suffering from that situation.

The headless approach was the right answer to all those issues, with JSS being Sitecore's response.

With the release of JSS, it became possible to separate BE and FE in a way that each side is responsible only for its own duties. Front-end developers became free of previous limitations and could use React / Vue / Angular as much as they wanted. They did not need a heavily loaded web server with Sitecore to generate HTML pages - a new component called Rendering Host did that job exclusively for them. The only interaction left with the back-end was receiving just the necessary data asynchronously, thanks to Layout Service and GraphQL.

NOTE: Actually headless means anything can consume the data from back-end services, not just FE frameworks. That is well done in ASP.NET Core renderings as an alternative option for headless implementation for Sitecore.

Client-Side Rendering

With a typical non-Sitecore single-page application, the web server first sends the browser an HTML page in some initial state. Once that page is loaded, the browser executes its JavaScript code, which raises one or more asynchronous requests to an API endpoint in order to get the actual data. As the user progresses through the app, the browser sends more requests, which partially update content on the page without reloading the whole page from the web server. This approach is known as Client-Side Rendering (CSR), and it brings lots of advantages, such as apps responding faster and reduced traffic between client and server.

What's wrong with single-page applications?

Since single-page apps load the initial HTML page only once, that initial page is all that search engine bots get. They struggle to obtain follow-up data from APIs and cannot index the page. Also, without a page reload the URL remains the same and can vary only by appending a #-anchor to the page URL. Often these URLs cannot be correctly processed when called directly.

Next.js

To address the above we have Next.js - a framework for statically generated and server-rendered React applications that opens up a lot of possibilities for developers: creating ready-to-use, zero-configuration applications, code separation, static HTML exporting, better UX, faster performance, and more. You can see many of its features below:

Next.js makes a site SEO-friendly without any extra actions beyond creating the application. To be clear, that results not from Next.js specifically, but from server-side rendering.

One can run SEO reports with Lighthouse even at the early stages, as you begin building your application.

But that still wasn't that....

SSG challenge

The idea behind Jamstack is truly attractive: instead of serving webpages in real-time (even when taking those from a cache), the webpages are already pre-rendered and deployed to CDN being globally accessible immediately upon publishing. In a simple scenario, one does not even have to keep a running server up as the traffic never reaches it going to CDN. Static content is fast, resilient to downtime, and gets indexed immediately by crawlers.

This approach however has some issues.

Let's think about a huge site with millions of pages. Deploying such a site may take hours rather than minutes due to static page generation and the number of files to process. An increasing amount of content means increasing generation time. It seems reasonable to regenerate only those pages that have been updated, but that is only a small part of the solution (deployment becomes complicated, and even a one-character change in a common part like a header will still make you process all the pages).

ISR

That is where Incremental Static Regeneration (ISR) comes into play. ISR is a new evolution step for Jamstack. Next.js allows you to create or update static pages after you’ve built a site. Incremental Static Regeneration enables developers and content editors to use static generation on a per-page basis, without needing to rebuild the entire site. With ISR, you get the best of both worlds while scaling to millions of pages.

The principal difference is that static pages can now be generated on demand at runtime. The developer's job is now to decide which portion of pages to pre-generate, following the well-known 80/20 Pareto principle: 80% of traffic is served by only 20% of pages, while the other 80% of pages get the remaining 20% of traffic.

So it makes good sense to pre-generate that heavily used 20% of pages. How do you know which pages or sections to pick? You've got an arsenal of tools like analytics, A/B testing, and alternative metrics - in any case, you have the flexibility to make your own tradeoff on build times, as the image below compares:

Given the choice, developers can define options A and B and pick between them: with option A the build gets faster, while option B pre-generates more pages.

This becomes crucial when working on large eCommerce implementations or headless CMSs such as Sitecore.
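As a rough illustration of the 80/20 tradeoff, the following TypeScript sketch picks the smallest set of top pages that covers a given share of traffic. The function name, the PageTraffic shape, and the visit numbers are hypothetical; in a real project the figures would come from your analytics tool, and the resulting paths would be returned from getStaticPaths.

```typescript
interface PageTraffic { path: string; visits: number }

// Return the most visited pages until `trafficShare` of all visits is covered.
function pickPagesToPrerender(pages: PageTraffic[], trafficShare: number): string[] {
  const total = pages.reduce((sum, p) => sum + p.visits, 0);
  const sorted = [...pages].sort((a, b) => b.visits - a.visits);
  const picked: string[] = [];
  let covered = 0;
  for (const page of sorted) {
    // Stop once the requested share of traffic is already covered.
    if (total === 0 || covered / total >= trafficShare) break;
    picked.push(page.path);
    covered += page.visits;
  }
  return picked;
}

// Made-up analytics data for illustration.
const analytics: PageTraffic[] = [
  { path: "/about", visits: 80 },
  { path: "/", visits: 500 },
  { path: "/legal", visits: 30 },
  { path: "/products", visits: 350 },
  { path: "/contact", visits: 40 },
];
const prerendered = pickPagesToPrerender(analytics, 0.8);
```

Here two of the five pages already cover more than 80% of visits, so only those would be built up front and the rest left to on-demand generation.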

How that works

ISR relies on the same getStaticProps API that is used for static site generation. The difference is that by returning a revalidate value (60 in this example) we make Next.js use ISR for the page. Here's how a request flows with ISR:

  1. With Next.js one can define a revalidation time per page (e.g. 60 seconds).
  2. The initial request to a product page returns the cached page with the original price.
  3. At this stage, someone changes the product data, and the change is saved to the database.
  4. All requests to the page after the initial request but before the 60 seconds elapse are served immediately from the cache.
  5. After the 60-second window, the next request still shows the cached (old) page, but Next.js triggers a background regeneration of that page. Once it completes, the cache for that single page is updated; if the background regeneration fails, the old cached page is kept.
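A minimal sketch of such a page's data function might look as follows. The Product type and the fetchProduct helper are hypothetical stand-ins for a real call to Layout Service, GraphQL, or a database, and the Next.js types are replaced with inline ones so the snippet stays self-contained.

```typescript
interface Product { id: string; price: number }

// Hypothetical data access, hard-coded for illustration.
async function fetchProduct(id: string): Promise<Product> {
  return { id, price: 100 };
}

// In a page file this function would be exported as getStaticProps.
async function getStaticProps(context: { params: { id: string } }) {
  const product = await fetchProduct(context.params.id);
  return {
    props: { product },
    // Returning `revalidate` opts this page into ISR: at most one
    // background regeneration per 60-second window.
    revalidate: 60,
  };
}
```

Leaving out revalidate would turn this back into plain static generation, where the page changes only on the next full build.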

Finding a compromise

Since all sites vary by volume, audience, purpose, and internal architecture, there's no silver bullet to cover them all with a universal solution. That is why Next.js is end-user-centric, letting developers shift between solutions without leaving the bounds of the framework. It's for you to choose the right tool for a project.

Edge caching

In certain cases ISR is not the best option, for example in apps where displaying live data is crucial. Those are better handled with server rendering, optionally with your own Cache-Control headers and surrogate keys to invalidate content. Server-rendered pages can be cached at edge servers. With a hybrid framework, one can make one's own tradeoff and still stay within the framework.

SSR with edge server caching may look similar to ISR (especially with stale-while-revalidate headers for cache control).

The major difference comes from the way the first request is handled. With ISR it returns a statically rendered page, which ensures the user will see a page even in case of API connectivity loss or database failure. SSR allows tailoring pages to the specifics of each request.

One thing to keep in mind is that using SSR without caching may hurt performance, as every millisecond of waiting matters. In addition, SSR with no cache badly impacts the TTFB metric (Time to First Byte) used by Lighthouse.
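As a sketch of SSR combined with edge caching, a server-rendered page can set a Cache-Control header with s-maxage and stale-while-revalidate. The snippet below uses a minimal stand-in interface instead of the real Next.js and Node types so it is self-contained; the helper name and the 10/59-second values are illustrative assumptions, not recommendations.

```typescript
// Minimal stand-in for the response object passed to getServerSideProps.
interface ResponseLike { setHeader(name: string, value: string): void }

// Build a Cache-Control value aimed at shared (edge/CDN) caches.
function edgeCacheControl(sMaxAgeSeconds: number, staleWhileRevalidateSeconds: number): string {
  return `public, s-maxage=${sMaxAgeSeconds}, stale-while-revalidate=${staleWhileRevalidateSeconds}`;
}

// In a page file this function would be exported as getServerSideProps.
async function getServerSideProps(context: { res: ResponseLike }) {
  // Fresh for 10 seconds; may be served stale for up to 59 more seconds
  // while the edge revalidates in the background.
  context.res.setHeader("Cache-Control", edgeCacheControl(10, 59));
  return { props: { renderedAt: new Date().toISOString() } };
}
```

Whether the header is honored depends on the CDN or hosting platform in front of the application.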

In addition, ISR is not beneficial for small websites. If the build time for the whole site is much lower than the revalidation interval, there is little to gain - just use classic SSG or SSR instead.

ISR fallback options

This is an important parameter with two potential options. When working with data that is fast to retrieve, it makes sense to use fallback: blocking. In that case, you do not need to display a temporary "in progress" page while the data is retrieved. That guarantees users see the right page regardless of whether it is cached or not.

For uncertain or slow-loading data the above approach will affect UX badly, so setting fallback: true immediately displays a "please wait" page while the data is processed.
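The choice between the two options can be sketched in TypeScript as below. The fastData flag and the buildStaticPaths helper are hypothetical and exist only to illustrate the decision; they are not Next.js options, and the types are inlined so the snippet stays self-contained.

```typescript
type Fallback = boolean | "blocking";

interface StaticPathsResult {
  paths: { params: { id: string } }[];
  fallback: Fallback;
}

// Assumption: `fastData` is a flag you decide per page type.
function buildStaticPaths(prerenderedIds: string[], fastData: boolean): StaticPathsResult {
  return {
    // Pages listed here are generated at build time.
    paths: prerenderedIds.map((id) => ({ params: { id } })),
    // "blocking": wait for the page to render on the first request.
    // true: serve a fallback ("please wait") state first, then swap in the data.
    fallback: fastData ? "blocking" : true,
  };
}

// In a page file this function would be exported as getStaticPaths.
async function getStaticPaths(): Promise<StaticPathsResult> {
  return buildStaticPaths(["1", "2"], true);
}
```

With fallback: true the page component also has to render a loading state while the data for a not-yet-generated path is being fetched.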

SEO is the cause

SEO (search engine optimization) is a set of techniques (and even unobvious tricks) for changing your site in order to attract higher traffic from search engines. In order to increase the site's search rate, one needs to keep in mind many of them, such as:

Visitors won’t wait an eternity until your page loads. Performance is actually a crucial factor for SEO and therefore should be a main concern when building an app. In addition to TTFB (mentioned previously), there is another important parameter abbreviated as FCP (First Contentful Paint). Google uses FCP as a key metric for performance - FCP directly affects SEO rating. You can read more about improving FCP.

With Next.js you can analyze FCP and LCP (Largest Contentful Paint - the time it takes for the main content to be shown) by creating a custom App component with a reportWebVitals function:

// pages/_app.js
export function reportWebVitals(metric)
{
  console.log(metric)
}

Once these parameters are calculated, the reportWebVitals function is called with each metric for you to log and analyze. Follow this link for more details about measuring performance with Next.js.

I hope this post gives an overall picture of the rendering evolution from the old days up to ISR, and of the nuances of choosing between these modes with Next.js.
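If only a few metrics are of interest, the handler can filter by the metric name. The sketch below assumes the metric object carries name and value fields, and that the names of interest are TTFB, FCP and LCP; the collected array is a hypothetical stand-in for whatever analytics endpoint you would send the data to.

```typescript
// Assumed shape of the metric object passed to reportWebVitals.
interface WebVitalsMetric { id: string; name: string; value: number; label: string }

const TRACKED = ["TTFB", "FCP", "LCP"];
const collected: WebVitalsMetric[] = [];

// pages/_app would export this function; it is called once per metric.
function reportWebVitals(metric: WebVitalsMetric): void {
  // Ignore metrics we are not tracking.
  if (TRACKED.indexOf(metric.name) === -1) return;
  collected.push(metric);
  console.log(`${metric.name}: ${Math.round(metric.value)} ms`);
}

// Simulated calls with made-up values.
reportWebVitals({ id: "v1-1", name: "FCP", value: 812.4, label: "web-vital" });
reportWebVitals({ id: "v1-2", name: "CLS", value: 0.02, label: "web-vital" });
```

In the simulated calls only the FCP metric is kept; the CLS one is dropped by the filter.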

Yet another SXA rendering variant - Script Reference Tag coming to improve your SEO

Note! The code used in this post can be cloned from the GitHub repository: SXA.Foundation.Variants

I previously wrote a post about a rendering variant holding inline JavaScript, which one might need for adding some basic JS functionality to components.

This is useful when you're at an early stage of developing your pages and have no possibility or capacity to recompile the entire frontend and update the Creative Exchange package in your solution just because a few lines were added or changed. However, that approach is not SEO-friendly, as search engines penalize sites for excessive inline scripts and styles. So treat it as technical debt that should be addressed prior to going to production.

The most minimal change one can make is to replace the inline script with a reference to that same script stored in the Media Library - the same thing SXA itself does with themes. This blog post shows the approach:

Firstly, create a template:

Then reference the template and field IDs in the Constants.cs file:

using Sitecore.Data;

namespace Platform.Foundation.Variants.Pipelines.VariantFields.ScriptReferenceTag
{
    public static partial class Constants
    {
        public static partial class RenderingVariants
        {
            public static partial class Templates
            {
                public static ID ScriptReferenceTag { get; } = new ID("{0EC036D7-384D-4CF6-AD1F-FE949E96126A}");
            }

            public static partial class Fields
            {
                public static class ScriptReferenceTag
                {
                    public static ID ScriptMedia { get; } = new ID("{F1497AF9-7DD3-4B38-BE22-5F092007F929}");
                }
            }
        }
    }
}

Model class, having just one property that stores the GUID of a referenced script from the Media Library:

using Sitecore.Data.Items;
using Sitecore.XA.Foundation.RenderingVariants.Fields;

namespace Platform.Foundation.Variants.Pipelines.VariantFields.ScriptReferenceTag
{
    public class VariantScriptReferenceTag : RenderingVariantFieldBase
    {
        public string ScriptMedia { get; set; }

        public VariantScriptReferenceTag(Item variantItem) : base(variantItem)
        {
        }
    }
}

Parser:

using Sitecore.Data;
using Sitecore.XA.Foundation.Variants.Abstractions.Pipelines.ParseVariantFields;

namespace Platform.Foundation.Variants.Pipelines.VariantFields.ScriptReferenceTag
{
    public class ParseScriptReferenceTag : ParseVariantFieldProcessor
    {
        public override ID SupportedTemplateId =>  Constants.RenderingVariants.Templates.ScriptReferenceTag;
        
        public override void TranslateField(ParseVariantFieldArgs args)
        {
            ParseVariantFieldArgs variantFieldArgs = args;

            var variantHtmlTag = new VariantScriptReferenceTag(args.VariantItem) { Tag = "script" };
            variantHtmlTag.ScriptMedia = args.VariantItem[Constants.RenderingVariants.Fields.ScriptReferenceTag.ScriptMedia];
            variantFieldArgs.TranslatedField = variantHtmlTag;
        }
    }
}

Renderer:

using System;
using System.Web.UI.HtmlControls;
using Sitecore; // needed for Context.Database used below
using Sitecore.Data;
using Sitecore.Resources.Media;
using Sitecore.XA.Foundation.RenderingVariants.Pipelines.RenderVariantField;
using Sitecore.XA.Foundation.Variants.Abstractions.Pipelines.RenderVariantField;
namespace Platform.Foundation.Variants.Pipelines.VariantFields.ScriptReferenceTag
{
    public class RenderScriptReferenceTag : RenderVariantField
    {
        public override Type SupportedType => typeof(VariantScriptReferenceTag);

        public override void RenderField(RenderVariantFieldArgs args)
        {
            var variantField = args.VariantField as VariantScriptReferenceTag;
            if (variantField != null)
            {
                var id = variantField?.ScriptMedia;
                if (string.IsNullOrWhiteSpace(id))
                {
                    return;
                }

                var scriptItem = Context.Database.GetItem(new ID(id));
                if(scriptItem == null)
                {
                    return;
                }

                var url = MediaManager.GetMediaUrl(scriptItem);

                var tag = new HtmlGenericControl(variantField.Tag);
                tag.Attributes.Add("type", "text/javascript");
                tag.Attributes.Add("defer", String.Empty);
                tag.Attributes.Add("src", url);

                args.ResultControl = tag;
                args.Result = RenderControl(args.ResultControl);
            }
        }
    }
}

Example of usage:

This rendering variant field generates the following output:

<script src="/-/media/Project/Platform/Other/Scripts/Header-script.js" type="text/javascript" defer="" ></script>

This approach works perfectly well. But think about it for a second: have you considered moving such scripts into a Theme along with the related component (if any) instead of leaving them like that? Hope this helps!

Creating XML Sitemap for the Helix solution

I am working on a solution that already has an HTML sitemap as part of the Navigation feature. Now I have a request to also add a basic XML sitemap with a common set of requirements. Habitat ships with an interface template _Navigable, so let's extend this template by adding a checkbox field called ShowInSitemap, stating whether a particular page will be shown in that sitemap:


To start, we need to create a handler. Registering handlers in web.config is not the desired way of doing things, as it would also require a configuration transform for deployments, so let's do it the Sitecore way (Feature.Navigation.config file):

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
    <sitecore>
        <pipelines>
            <httpRequestBegin>
                <processor type="Platform.Feature.Navigation.Pipelines.SitemapHandler, Platform.Feature.Navigation"
                           patch:before="processor[@type='Sitecore.Pipelines.HttpRequest.CustomHandlers, Sitecore.Kernel']">
                </processor>
            </httpRequestBegin>
            <preprocessRequest>
                <processor type="Sitecore.Pipelines.PreprocessRequest.FilterUrlExtensions, Sitecore.Kernel">
                    <param desc="Allowed extensions">aspx, ashx, asmx, xml</param>
                </processor>
            </preprocessRequest>
        </pipelines>
    </sitecore>
</configuration>

We rely on the httpRequestBegin pipeline and insert our new SitemapHandler from the Navigation feature right before the CustomHandlers processor.

SitemapHandler is an ordinary processor for the httpRequestBegin pipeline, so it inherits from HttpRequestProcessor:

    public class SitemapHandler : HttpRequestProcessor
    {
        const string sitemapHandler = "sitemap.xml";

        private readonly INavigationRepository _navigationRepository;

        public SitemapHandler()
        {
            _navigationRepository = new NavigationRepository(RootItem);
        }

        public override void Process(HttpRequestArgs args)
        {
            if (Context.Site == null 
                || args == null
                || string.IsNullOrEmpty(Context.Site.RootPath.Trim()) 
                || Context.Page.FilePath.Length > 0 
                || !args.Url.FilePath.Contains(sitemapHandler))
            {
                return;
            }

            Response.ClearHeaders();
            Response.ClearContent();
            Response.ContentType = "text/xml";

            try
            {
                var navigationItems = _navigationRepository.GetSitemapItems(RootItem);
                // Pass the retrieved items to the CreateSitemapXml method of the service shown below
                string xml = new XmlSitemapService().CreateSitemapXml(navigationItems);

                Response.Write(xml);
            }
            finally
            {
                Response.Flush();
                Response.End();
            }
        }

        private Item RootItem => Context.Site.GetRootItem();

        private HttpResponse Response => HttpContext.Current.Response;
    }

The XmlSitemapService code is below:

    public class XmlSitemapService
    {
        public string CreateSitemapXml(IEnumerable<NavigationItem> items)
        {
            var doc = new XmlDocument();

            var declarationNode = doc.CreateXmlDeclaration("1.0", "UTF-8", null);
            doc.AppendChild(declarationNode);

            var urlsetNode = doc.CreateElement("urlset");

            var xmlnsAttr = doc.CreateAttribute("xmlns");
            xmlnsAttr.Value = "http://www.sitemaps.org/schemas/sitemap/0.9";
            urlsetNode.Attributes.Append(xmlnsAttr);
            doc.AppendChild(urlsetNode);

            foreach (NavigationItem itm in items)
            {
                doc = CreateSitemapRecord(doc, itm);
            }
            return doc.OuterXml;
        }

        private XmlDocument CreateSitemapRecord(XmlDocument doc, NavigationItem item)
        {
            string link = item.Url;

            string lastModified = HttpUtility
             .HtmlEncode(item.Item.Statistics.Updated.ToString("yyyy-MM-ddTHH:mm:sszzz"));

            XmlNode urlsetNode = doc.LastChild;

            XmlNode url = doc.CreateElement("url");
            urlsetNode.AppendChild(url);

            XmlNode loc = doc.CreateElement("loc");
            url.AppendChild(loc);
            loc.AppendChild(doc.CreateTextNode(link));

            XmlNode lastmod = doc.CreateElement("lastmod");
            url.AppendChild(lastmod);
            lastmod.AppendChild(doc.CreateTextNode(lastModified));

            return doc;
        }
    }
Also, NavigationItem is a custom POCO:
 
public class NavigationItem
{
    public Item Item { get; set; }
    public string Title { get; set; }
    public string Url { get; set; }
    public bool IsActive { get; set; }
    public int Level { get; set; }
    public NavigationItems Children { get; set; }
    public string Target { get; set; }
    public bool ShowChildren { get; set; }
}


A few things to mention.

1. Since you are using LinkManager to generate the links, make sure it produces the full URL, as required by the sitemap protocol, rather than a site-root-relative path. To do that, pass custom options:

var options = LinkManager.GetDefaultUrlOptions();
options.AlwaysIncludeServerUrl = true;
options.SiteResolving = true;
var url = LinkManager.GetItemUrl(item, options);

or, as a better option in Helix, rely on Sitecore.Foundation.SitecoreExtensions:

item.Url(options)

2. Once deployed to production, you may face the unpleasant behavior of HTTPS links being generated with the 443 port number included. That happens because LinkManager is not smart enough to anticipate such a case. There is, however, a not-so-obvious setting that makes LinkManager work as expected.


//TODO: Update the code with the recent



That's it!

Sitecore with SEO: overview and compare ways for managing duplicate content

In this blog post I decided to cover all the ways of managing duplicate content in Sitecore and compare them with an emphasis on SEO. We have the following options to consider:

  1. Duplicates
  2. Clones
  3. Proxies
  4. Aliases
  5. IIS URL Rewrite module
  6. Sitecore Redirect Module
  7. External Reverse Proxy

1. Duplicates are the best-known and most straightforward way of creating copies of items. The easiest way to do that is to right-click the item you want to copy, select Duplicate from the context menu, and specify the new item's name.


This produces an entirely independent new item (along with all its descendants) located at the same level, including all field values, presentation details, permissions, etc. Beware of locks and workflows: those will also exactly match what the original items have. After that, the new item lives its own life and is in no way synchronized with its original prototype (except through Standard Values, of course, as both the original and the duplicate share the same template).

It is also worth mentioning Copy To: it behaves similarly, but lets you create duplicates that keep the same name at a path other than the original item's. Copy To is available from the same context menu.


2. Clones are somewhat similar to duplicates, with the difference that field values are not physically copied when using clones. To create a clone of the item highlighted in the Sitecore tree, select the Configure tab, hit Clone, and specify where you want to place your clone.


Notice that clones are displayed in the content tree in a slightly lighter font color. I personally think that may cause issues later, when business users perform actions on an item without realizing that it is a clone. Why is that important to know? Let's look at how clones function at a lower level.

When you create a clone, the item and its values are not physically copied. Instead, an inheritance relationship is created, similar to the one between Standard Values (a sort of prototype item for a template) and a real item of that template, except that the clone inherits from the original item rather than from Standard Values. When you modify a field value on the original item, the same field of the cloned item changes too. The reverse does not hold: when you modify a field value on the cloned item, that individual field is overridden and is no longer tied to the original item's field. Other fields of the same item still keep the reference to their originals. Clones use the __Source field in the Advanced section of the standard template to specify the cloned item:


Unlike duplicates, clones do not copy most of the standard fields (those coming from the standard template), such as locks, workflows, and statistics (created, updated, revision). But they do copy security settings, which, again, can be overridden on the clone item.

If you want to get rid of a clone, there are two ways to do that: just delete the clone (obvious) or unclone it. Uncloning turns a cloned item into a normal item and copies the field values from the original. Clones exist only in the master database; when you publish to CD servers, uncloning takes place there.

You can also do some crazy things like creating clones of clones: an inheritance chain is formed in this case, and each field at each level can, of course, be overridden.

To better understand how clones work in Sitecore, I recommend reading the Cloning What Ifs article.


3. Proxies are another mechanism for creating and managing duplicate content in Sitecore. They are frequently used when you have an item that you want to be a child of multiple parent items. To use them, you must ensure that the config file setting called proxiesEnabled is true; then you create proxy items at /Sitecore/System/Proxies based on the /System/Proxy template. However, proxies are considered outdated in favor of clones. Please do not use proxies!


4. Aliases are a different beast. They are perfectly good for promos and campaigns, as they normally specify a quick URL for a campaign landing page. Aliases have an out-of-the-box limitation: they are set only at the root level and are not multisite-friendly (however, there is a link that explains how to implement that feature on your own).

Aliases are defined under /sitecore/system/Aliases based on the System/Alias template.

There is just one field in the alias template, which lets you select the target item.

There are two more overheads when working with aliases. Sometimes you may need to identify whether a given path is an alias:

bool isAlias = Context.Database.Aliases.Exists(path);

You may also need to set canonical URLs for them to improve SEO. A good way of doing that is:

public class AliasResolver : Sitecore.Pipelines.HttpRequest.AliasResolver
{
    public override void Process(HttpRequestArgs args)
    {
        base.Process(args);

        if (Context.Item != null)
        {
            args.Context.Items["CanonicalUrl"] = Context.Item.GetFullUrl(args.Context.Request.Url);
        }
    }
}
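
For the custom resolver to take effect, it has to replace the stock processor in the httpRequestBegin pipeline. A minimal sketch of such a patch file is below; the type name MySite.Pipelines.AliasResolver and the assembly name MySite are placeholders assumed for illustration, so substitute the namespace and assembly that actually contain the class above.

```xml
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
    <sitecore>
        <pipelines>
            <httpRequestBegin>
                <!-- Hypothetical type and assembly names: swaps the stock AliasResolver for the custom one -->
                <processor type="MySite.Pipelines.AliasResolver, MySite"
                           patch:instead="processor[@type='Sitecore.Pipelines.HttpRequest.AliasResolver, Sitecore.Kernel']" />
            </httpRequestBegin>
        </pipelines>
    </sitecore>
</configuration>
```

The canonical URL stored in the request context items can then be read by whatever rendering outputs the canonical link tag.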

Also, do not forget to publish your aliases to the content delivery databases, as they won't work until published.



5. The IIS URL Rewrite module is probably the most functional option. It is external to Sitecore, which means it runs before routing and before the pipelines.

As for the drawbacks of using IIS URL Rewrite, I would mention that you need access to IIS Manager or write permission to web.config on each of the content delivery servers. I previously wrote a blog post, IIS URL Rewrite module - few SEO tricks, that demonstrates how powerful it is.

I would also warn you about some Sitecore-specific URLs, for which you need to create appropriate exceptions (for example, for WebResource.axd).
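
As an illustration of such exceptions, here is a sketch of a rule (a lowercase redirect is used only as an example) that skips .axd handlers and the Sitecore client. The rule name and the exact list of excluded paths are assumptions; a real project will likely need a longer list.

```xml
<rule name="LowerCaseWithSitecoreExceptions" stopProcessing="true">
  <match url="[A-Z]" ignoreCase="false" />
  <conditions logicalGrouping="MatchAll">
    <!-- Leave ASP.NET resource handlers such as WebResource.axd untouched -->
    <add input="{URL}" pattern="\.axd$" negate="true" />
    <!-- Leave the Sitecore client (/sitecore/...) untouched -->
    <add input="{URL}" pattern="^/sitecore(/|$)" negate="true" />
  </conditions>
  <action type="Redirect" url="{ToLower:{URL}}" redirectType="Permanent" />
</rule>
```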



6. The Sitecore Redirect Module is another good choice, as it performs a proper server-side 301 redirect for both URLs and items. It is almost as powerful as IIS URL Rewrite, but because it is configured in Sitecore, you do not need access to the CD environments at all: just create and publish redirect rules (as you normally do with regular content) and they take effect immediately! The module works transparently with multi-site configurations; it can redirect from one site's URL or item to another's.

One more advantage of the module is the availability of its source code, so the functionality can be extended to meet any bespoke requirement. It can also be made compatible with new Sitecore versions by simply rebuilding it against the appropriate Sitecore.Kernel.dll and replacing the module DLL in the webroot bin folder.

The only drawback, probably, is that by default it performs only 301 redirects (however, you may implement whatever you require). Please remember that 301 responses are cached by the browser, so if you are testing intensively, you may need to purge the browser cache from time to time.


7. An external reverse proxy can be another option. It can not only rewrite requests to external websites, but also rewrite some requests to an alternative internal URL and pass that to IIS as given, and further down to Sitecore. I have met such scenarios several times on projects I took part in. By the way, did you know that IIS can also serve as a reverse proxy?

Performing rewrites and URL resolving logic outside of Sitecore can be both advantageous and disadvantageous. What traps does it bring?

Well, imagine you are a new developer who starts working on a fresh working copy of the source code. When you run it locally, you may have different URL patterns compared to those on the production environment. Business users usually deal with external production URLs and do not know the internal structure, and that is how they phrase tasks and change requests. If you are not lucky enough to have comprehensive documentation or senior colleagues who can explain how it is configured, you may spend many puzzling hours trying to find and match URLs across different environments.

Also, SEO relies heavily on sitemaps, so if you are using dynamic sitemaps, you need to replicate in them the custom URL resolving logic that you have on the reverse proxy. The Sitemap Module from the Marketplace would not work for you in that case either.


I hope this article helped you understand what your options are, with their pros and cons, and pick the proper implementation for your exact scenario.

IIS URL Rewrite module - as reverse proxy with links rewrite

Not many people know that IIS itself can serve as a reverse proxy, rewriting URLs on the fly. We are going to take a look at how to configure that feature. Let's assume we have two websites: a primary website with the URL http://test2/ that is hosted by IIS and has an instance of Sitecore installed, and an external static website with the URL http://external/ that has a few static pages and resources. For this experiment I have the external website hosted on the same IIS instance, while in reality it can be literally anything, anywhere.


Apart from having IIS, you will need the following prerequisites:

- URL Rewrite Module installed, version 2.0

- Application Request Routing version 2.0


The easiest way to get all the prerequisites is to install them through Web Platform Installer. It installs all of them, so you just need to refresh IIS and you are ready to start.



The external website contains a static.html file with the following code:

<div>
    img/sitecore.png<br>
    <img src="img/sitecore.png" alt="sitecore" width="230" height="106">
</div>
<div>
    /img/sitecore.png<br>
    <img src="/img/sitecore.png" alt="sitecore" width="230" height="106">
</div>
<div>
    http://external/img/sitecore.png<br>
    <img src="http://external/img/sitecore.png" alt="sitecore" width="230" height="106">
</div>
<p>
    <a href="sitecore.zip">sitecore.zip</a><br>
    <a href="/sitecore.zip">/sitecore.zip</a><br>
    <a href="http://external/sitecore.zip">http://external/sitecore.zip</a><br>
</p>

This code has three images and three links to an archive file; each of them is a relative link (relative to the document, of course), an absolute link (from the web root), or a fully qualified link including the protocol and domain name. This HTML renders as shown in the following screenshot:


Our objective is to have a "virtual folder" called ext on the test2 website that is mapped to the external website and that also correctly maps and rewrites all the resources of the external website on the resulting page.

Example:

When we hit http://test2/ in the browser, we get the default Sitecore page served by the test2 website, as usual.

When we hit http://test2/ext/static.html, we get a page at that URL with the content of the external/static.html page, with all links and references rewritten to test2/ext/*.* instead of external/*.*

So, to make IIS URL Rewrite work as a reverse proxy, let's take the following steps:


Make sure "Enable proxy" is checked, otherwise nothing will work.


In the URL Rewrite section, click the "Add Rule(s)" link, then select "Reverse Proxy" on the popup screen and specify the rule. Also check the outbound rules, as they are the rules that actually rewrite the internal links. Please note that this function may add some overhead to your website's performance.


After you specify the rules (one inbound and two outbound, shown below), the reverse proxy is functional, and you can verify that by requesting the following URL (as on the screenshot below):


Notice that all links and images look correct, as they were before. To ensure they were rewritten correctly, let's view the source of the resulting page. Here it is:

<div>
    img/sitecore.png<br>
    <img src="img/sitecore.png" alt="sitecore" width="230" height="106">
</div>
<div>
    /img/sitecore.png<br>
    <img src="http://test2/ext/img/sitecore.png" alt="sitecore" width="230" height="106">
</div>
<div>
    http://external/img/sitecore.png<br>
    <img src="http://test2/ext/img/sitecore.png" alt="sitecore" width="230" height="106">
</div>
<p>
    <a href="sitecore.zip">sitecore.zip</a><br>
    <a href="http://test2/ext/sitecore.zip">/sitecore.zip</a><br>
    <a href="http://test2/ext/sitecore.zip">http://external/sitecore.zip</a><br>
</p>

As there was no need to rewrite relative URLs, they remain untouched. However, the root-relative URL and the full URL were rewritten to match the new domain name and the desired folder path.

And finally, here is the resulting configuration that makes it all work. Everything we configured above is stored in web.config, within the system.webServer node, in the rewrite section:

<rewrite>
      <rules>
        <clear />
          <!-- Element and attribute names in web.config are case-sensitive -->
          <rule name="ReverseProxyInboundRule2" stopProcessing="true">
            <match url="(ext)/(.*)?" />
              <conditions>
                  <add input="{CACHE_URL}" pattern="^(https?)://" />
              </conditions>
              <action type="Rewrite" url="{C:1}://external/{R:2}" />
          </rule>
      </rules>
      <outboundRules>
        <rule name="ReverseProxyOutboundRule2" preCondition="ResponseIsHtml1">
          <match filterByTags="A, Form, Img" pattern="^/(.*)" negate="false" />
          <action type="Rewrite" value="http://test2/ext/{R:1}" />
        </rule>
        <rule name="ReverseProxyOutboundRule1" preCondition="ResponseIsHtml1">
          <match filterByTags="A, Form, Img" pattern="^http://external/(.*)?" negate="false" />
          <action type="Rewrite" value="http://test2/ext/{R:1}" />
        </rule>
        <preConditions>
              <preCondition name="ResponseIsHtml1">
                  <add input="{RESPONSE_CONTENT_TYPE}" pattern="^text/html" />
              </preCondition>
          </preConditions>
      </outboundRules>
    </rewrite>

There is no need to use the visual configuration tool at all: you may just drop this snippet into the appropriate section of web.config and it will start working straight away!


IIS URL Rewrite module - few SEO tricks



1. Canonicals - perform a 301 permanent redirect to the canonical host name (this also works with HTTPS).
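
A minimal sketch of a canonical host name rule, to be placed inside the rules element of the rewrite section. The host www.example.com and the rule name are placeholders assumed for illustration. The last condition captures the request scheme so that HTTPS requests are redirected to HTTPS.

```xml
<rule name="CanonicalHostName" stopProcessing="true">
  <match url="(.*)" />
  <conditions>
    <!-- Fire only when the request came in on a non-canonical host -->
    <add input="{HTTP_HOST}" pattern="^www\.example\.com$" negate="true" />
    <!-- Capture http or https so the redirect keeps the scheme -->
    <add input="{CACHE_URL}" pattern="^(https?)://" />
  </conditions>
  <action type="Redirect" url="{C:1}://www.example.com/{R:1}" redirectType="Permanent" />
</rule>
```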






    

  


2. Rewrite URLs to lowercase - one more rewrite rule aimed at SEO improvement.
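
A minimal sketch of a lowercase rule, to be placed inside the rules element of the rewrite section; the rule name is arbitrary. It matches any URL containing an uppercase letter and issues a permanent redirect to the lowercase form using the built-in ToLower function.

```xml
<rule name="LowerCaseUrl" stopProcessing="true">
  <!-- ignoreCase="false" is required, otherwise [A-Z] would match lowercase letters too -->
  <match url="[A-Z]" ignoreCase="false" />
  <action type="Redirect" url="{ToLower:{URL}}" redirectType="Permanent" />
</rule>
```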






3. Append trailing slash - this is another (sometimes debated) SEO improvement. It is believed that having a trailing slash on your URLs (except for URLs that point to files, of course) improves search engine ranking.
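
A minimal sketch of a trailing slash rule, to be placed inside the rules element of the rewrite section; the rule name is arbitrary. It assumes that any URL ending in a file extension points to a file and should be left alone, which is a simplification.

```xml
<rule name="AddTrailingSlash" stopProcessing="true">
  <!-- Match any URL that does not already end with a slash -->
  <match url="(.*[^/])$" />
  <conditions>
    <!-- Skip physical files on disk -->
    <add input="{REQUEST_FILENAME}" matchType="IsFile" negate="true" />
    <!-- Skip anything that ends with a file extension, e.g. .aspx or .png -->
    <add input="{URL}" pattern="\.[a-zA-Z0-9]+$" negate="true" />
  </conditions>
  <action type="Redirect" url="{R:1}/" redirectType="Permanent" />
</rule>
```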










4. Query string rewrite - an example of extracting query string parameters and rewriting them in your own manner.
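
A minimal sketch of a query string rule, to be placed inside the rules element of the rewrite section. The page name product.aspx, the id parameter, and the target products/ path are all hypothetical and serve only to show the mechanics: the value is captured from the query string by a condition and reused in the target URL as {C:1}.

```xml
<rule name="ProductQueryToFriendlyUrl" stopProcessing="true">
  <match url="^product\.aspx$" />
  <conditions>
    <!-- Capture the numeric id from ?id=123 -->
    <add input="{QUERY_STRING}" pattern="^id=([0-9]+)$" />
  </conditions>
  <!-- appendQueryString="false" drops the original ?id=... from the target -->
  <action type="Redirect" url="products/{C:1}" appendQueryString="false" redirectType="Permanent" />
</rule>
```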






    



5. Redirect to HTTPS - forcibly redirect all non-secure requests to HTTPS.
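
A minimal sketch of an HTTPS redirect rule, to be placed inside the rules element of the rewrite section; the rule name is arbitrary. It checks the {HTTPS} server variable and redirects to the same host and path over HTTPS.

```xml
<rule name="RedirectToHttps" stopProcessing="true">
  <match url="(.*)" />
  <conditions>
    <!-- {HTTPS} is "off" for plain HTTP requests -->
    <add input="{HTTPS}" pattern="^OFF$" />
  </conditions>
  <action type="Redirect" url="https://{HTTP_HOST}/{R:1}" redirectType="Permanent" />
</rule>
```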









6. Prevent image hot-linking - stop strangers from reusing images hosted on your website, in order to protect both the images and your traffic.
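
A minimal sketch of a hot-linking rule, to be placed inside the rules element of the rewrite section. The domain example.com, the rule name, and the list of image extensions are placeholders assumed for illustration. Requests with an empty referrer are allowed through, since some browsers and proxies strip that header.

```xml
<rule name="PreventImageHotlinking" stopProcessing="true">
  <match url=".*\.(gif|jpg|jpeg|png)$" />
  <conditions>
    <!-- Allow requests that carry no referrer at all -->
    <add input="{HTTP_REFERER}" pattern="^$" negate="true" />
    <!-- Allow requests coming from our own pages -->
    <add input="{HTTP_REFERER}" pattern="^https?://(www\.)?example\.com/" negate="true" />
  </conditions>
  <action type="CustomResponse" statusCode="403" statusReason="Forbidden" statusDescription="Hot-linking of images is not allowed" />
</rule>
```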