Configure tenant crawling - Fluid Topics - 3.9

Fluid Topics Configuration and Administration Guide

Category
Reference Guides
Audience
public
Version
3.9

How crawling works

Search engines use crawlers, also called bots, to visit websites on a regular basis and index their content. As a result, users can enter search terms in an external search engine and be directed straight to the websites containing the most relevant information.

For example, Google regularly crawls one of Antidot's documentation portals, doc.fluidtopics.com. Since Google's crawler indexes content from doc.fluidtopics.com, users can search for and access Antidot's documentation via Google's list of search results. When searching for "enrich and clean" on Google, the "Enrich and Clean" Technical Note on Antidot's portal appears at the top of the list of search results:

[Screenshot: Google search results for "enrich and clean", showing the "Enrich and Clean" Technical Note from doc.fluidtopics.com at the top]

Use case

When Google crawls doc.fluidtopics.com, it retrieves information from the following URL:

doc.fluidtopics.com/sitemap.xml

This URL grants Google access to all public content, that is, content not protected by content access rights. The following examples show excerpts of the Antidot portal's sitemap.

  • Three structured documents (maps) which open in the Reader page:

    <url>
    <loc>
    https://doc.fluidtopics.com/r/map_1
    </loc>
    </url>
    <url>
    <loc>
    https://doc.fluidtopics.com/r/map_2
    </loc>
    </url>
    <url>
    <loc>
    https://doc.fluidtopics.com/r/map_3
    </loc>
    </url>

  • An unstructured document which opens in the Viewer page:

    <url>
    <loc>
    https://doc.fluidtopics.com/v/u/PDF_1
    </loc>
    </url>

From these URLs, crawlers can access and index the following content:

  • The maps map_1, map_2, and map_3 and their topics.
  • The unstructured document PDF_1.
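
A crawler's first step is simply to parse the sitemap and collect every <loc> entry. The following is a minimal sketch of that step in Python, using the sitemap excerpts above embedded as a sample document (the <urlset> wrapper and its namespace are the standard ones from the sitemap protocol, not shown in the excerpts):

```python
# Sketch: extracting page URLs from a sitemap, as a crawler would.
# The sample below combines the excerpts above into one document;
# the xmlns value is the standard sitemap protocol namespace.
import xml.etree.ElementTree as ET

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://doc.fluidtopics.com/r/map_1</loc></url>
  <url><loc>https://doc.fluidtopics.com/r/map_2</loc></url>
  <url><loc>https://doc.fluidtopics.com/r/map_3</loc></url>
  <url><loc>https://doc.fluidtopics.com/v/u/PDF_1</loc></url>
</urlset>"""

def extract_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> value found in a sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", ns)]

urls = extract_urls(SITEMAP)
print(urls)
```

In a real crawler the sitemap would of course be fetched over HTTP from doc.fluidtopics.com/sitemap.xml rather than embedded; the parsing step is the same.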

Crawling and content access rights

Crawlers can only access and index public content. By default, every document is public until a user with the ADMIN or KHUB_ADMIN role configures content access rights to keep it from being exposed to unauthorized users, including crawlers. Configuring content access rights is the only way to keep crawlers from accessing and indexing content.