
Record Extractor

Introduction

Info: This documentation only covers the helpers.docsearch method. See the Algolia Crawler documentation for more information on the Algolia Crawler.

Pages are extracted by a recordExtractor. Extractors are assigned to actions via the recordExtractor parameter, which takes a function that returns the data you want to index as an array of JSON objects.
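To make the "array of JSON objects" contract concrete, here is a minimal sketch of a hand-written extractor that does not use the helpers. The attribute names (`url`, `title`) are arbitrary choices for illustration, the sketch assumes `url` is a standard URL object, and `fake$` is a hypothetical stand-in that only mimics the `.text()` call so the snippet can run outside the crawler.

```javascript
// Hand-written extractor: returns an array of plain objects (one per record).
// `url` and `title` are illustrative attribute names, not a required schema.
const recordExtractor = ({ url, $ }) => {
  return [
    {
      url: url.href,
      title: $("head title").text(),
    },
  ];
};

// Hypothetical stand-in for the Cheerio instance, only for running the sketch locally.
const fake$ = (selector) => ({
  text: () => (selector === "head title" ? "Getting started" : ""),
});

const records = recordExtractor({
  url: new URL("https://example.com/docs/getting-started"),
  $: fake$,
});
console.log(records);
```

In a real crawler configuration only the `recordExtractor` function is needed; the crawler supplies `url` and `$` itself.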

The helpers are a collection of functions to help you extract content and generate Algolia records.

Usage

The most common way to use the DocSearch helper is to return its result from the recordExtractor function.

recordExtractor: ({ helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
    },
  });
},

Manipulate the DOM with Cheerio

The Cheerio instance ($) allows you to manipulate the DOM:

recordExtractor: ({ $, helpers }) => {
  // Removing DOM elements we don't want to crawl
  $(".my-warning-message").remove();

  return helpers.docsearch({
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
    },
  });
},

Provide fallback selectors

Fallback selectors are useful when retrieving content that might not exist on some pages:

recordExtractor: ({ $, helpers }) => {
  return helpers.docsearch({
    recordProps: {
      // `.exists h1` will be selected if `.exists-probably h1` does not exist.
      lvl0: {
        selectors: [".exists-probably h1", ".exists h1"],
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      // `.exists p, .exists li` will be selected.
      content: [
        ".does-not-exists p, .does-not-exists li",
        ".exists p, .exists li",
      ],
    },
  });
},

Provide raw text (`defaultValue`)

Note: Only lvl0 and custom variable selectors support this option.

You might want to structure your search results differently from your website, or provide a defaultValue for a selector that might not match anything:

recordExtractor: ({ $, helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl0: {
        // It also supports the fallback DOM selectors syntax!
        selectors: ".exists-probably h1",
        defaultValue: "myRawTextIfDoesNotExists",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
      // The variables below can be used to filter your search
      language: {
        // It also supports the fallback DOM selectors syntax!
        selectors: ".exists-probably .language",
        // Since custom variables are used for filtering, we allow sending
        // multiple raw values
        defaultValue: ["en", "en-US"],
      },
    },
  });
},
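The interplay between fallback selectors and defaultValue can be sketched as follows. This is a simplified model of the behavior described above, not the crawler's actual implementation: `resolveProp` and `getText` are hypothetical names, and `getText` stands in for a DOM lookup that returns an empty string when nothing matches.

```javascript
// Hypothetical resolver: tries each selector in order, then falls back to defaultValue.
// `getText` stands in for a DOM lookup and returns "" when the selector matches nothing.
function resolveProp({ selectors = [], defaultValue }, getText) {
  const list = Array.isArray(selectors) ? selectors : [selectors];
  for (const selector of list) {
    const text = getText(selector);
    if (text !== "") return text;
  }
  return defaultValue;
}

// Fake page content: only `.exists h1` is present.
const pageText = { ".exists h1": "API Reference" };
const getText = (selector) => pageText[selector] ?? "";

// A later selector matches, so the DOM text wins over the default.
const fromDom = resolveProp(
  { selectors: [".exists-probably h1", ".exists h1"], defaultValue: "myRawTextIfDoesNotExists" },
  getText
);

// No selector matches, so the raw default is used.
const fromDefault = resolveProp(
  { selectors: ".exists-probably h1", defaultValue: "myRawTextIfDoesNotExists" },
  getText
);

// No selectors at all: the raw values are returned as-is.
const rawOnly = resolveProp({ defaultValue: ["latest", "stable"] }, getText);

console.log(fromDom, fromDefault, rawOnly);
```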

Indexing content for faceting

Note: These selectors also support defaultValue and fallback selectors.

You might want to index content that will be used as filters in your frontend (e.g. version or language). You can add any custom variable to the recordProps object to include it in your Algolia records:

recordExtractor: ({ helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
      // The variables below can be used to filter your search
      foo: ".bar",
      language: {
        // It also supports the fallback DOM selectors syntax!
        selectors: ".does-not-exists",
        // Since custom variables are used for filtering, we allow sending
        // multiple raw values
        defaultValue: ["en", "en-US"],
      },
      version: {
        // You can send raw values without `selectors`
        defaultValue: ["latest", "stable"],
      },
    },
  });
},

The following foo, language and version attributes will be available in your records:

foo: "valueFromBarSelector",
language: ["en", "en-US"],
version: ["latest", "stable"]

You can now use them to filter your search in the frontend.
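As a rough illustration of the frontend side, the sketch below builds an Algolia `facetFilters` value from the attributes indexed above. `buildFacetFilters` is a hypothetical helper, not part of DocSearch. It relies on the Algolia convention that filters in a nested array are combined with OR while top-level entries are combined with AND. Attributes typically also need to be declared in the index's `attributesForFaceting` setting before they can be used as filters.

```javascript
// Hypothetical helper: turns { attribute: value | values[] } into Algolia facetFilters.
// Values for the same attribute are grouped in a nested array (OR);
// different attributes sit side by side at the top level (AND).
function buildFacetFilters(selection) {
  return Object.entries(selection).map(([attribute, value]) =>
    Array.isArray(value)
      ? value.map((item) => `${attribute}:${item}`)
      : `${attribute}:${value}`
  );
}

const facetFilters = buildFacetFilters({
  language: ["en", "en-US"],
  version: "latest",
});
console.log(JSON.stringify(facetFilters));

// Possible use with the DocSearch frontend (other options omitted):
// docsearch({ /* appId, apiKey, indexName, container */ searchParameters: { facetFilters } });
```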

Boost search results with `pageRank`

Note: pageRank used to be an integer; it is now a string.

This parameter allows you to boost records built from the current pathsToMatch. Pages with a higher pageRank are returned before pages with a lower pageRank. You can pass any numeric value as a string, including negative values:

{
  indexName: "YOUR_INDEX_NAME",
  pathsToMatch: ["https://YOUR_WEBSITE_URL/api/**"],
  recordExtractor: ({ $, helpers }) => {
    return helpers.docsearch({
      recordProps: {
        lvl0: "header h1",
        lvl1: "article h2",
        lvl2: "article h3",
        lvl3: "article h4",
        lvl4: "article h5",
        lvl5: "article h6",
        content: "article p, article li",
        pageRank: "30",
      },
    });
  },
},
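Because pageRank must be a numeric value written as a string, a small check before deploying a configuration can catch leftover integers or typos. The sketch below is one possible check, assuming that "numeric" means anything JavaScript's `Number()` parses to a finite value; `isValidPageRank` is a hypothetical helper and the crawler's own validation may differ.

```javascript
// Hypothetical check: accepts only strings that parse to a finite number.
// Assumption: the crawler may apply stricter or looser rules than this.
function isValidPageRank(value) {
  return (
    typeof value === "string" &&
    value.trim() !== "" &&
    Number.isFinite(Number(value))
  );
}

console.log(isValidPageRank("30")); // string holding a positive number
console.log(isValidPageRank("-5")); // negative values are allowed
console.log(isValidPageRank(30)); // integers are no longer accepted
console.log(isValidPageRank("high")); // not numeric
```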

Reduce the number of records

If you encounter the Extractors returned too many records error, your page is producing more than 750 records. The aggregateContent option helps you reduce the number of records at the content level of the extractor.

{
  indexName: "YOUR_INDEX_NAME",
  pathsToMatch: ["https://YOUR_WEBSITE_URL/api/**"],
  recordExtractor: ({ $, helpers }) => {
    return helpers.docsearch({
      recordProps: {
        lvl0: "header h1",
        lvl1: "article h2",
        lvl2: "article h3",
        lvl3: "article h4",
        lvl4: "article h5",
        lvl5: "article h6",
        content: "article p, article li",
      },
      aggregateContent: true,
    });
  },
},
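To see why aggregation lowers the record count, consider the simplified model below. It is only an illustration: the `{ heading, content }` shape and the `aggregateByHeading` function are assumptions made for this sketch and do not reflect the crawler's real record format or how it joins text.

```javascript
// Simplified model: merge all content-level records that share a heading
// into one record per heading, joining their text with a space.
function aggregateByHeading(contentRecords) {
  const grouped = new Map();
  for (const { heading, content } of contentRecords) {
    const parts = grouped.get(heading) ?? [];
    parts.push(content);
    grouped.set(heading, parts);
  }
  return [...grouped].map(([heading, parts]) => ({
    heading,
    content: parts.join(" "),
  }));
}

// Four content-level records spread over two headings.
const contentRecords = [
  { heading: "Install", content: "Run npm install." },
  { heading: "Install", content: "Then import the package." },
  { heading: "Usage", content: "Call the helper." },
  { heading: "Usage", content: "Return its result." },
];

const aggregated = aggregateByHeading(contentRecords);
console.log(aggregated.length, aggregated);
```

In this model, four paragraph-level records collapse into two, one per heading, which is the kind of reduction that keeps a long page under the record limit.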

Reduce the record size

If you encounter the Records extracted are too big error when crawling your website, it is most likely because your records contain too much information, or because your page is too big. The recordVersion option helps you reduce the record size by removing information that is only used by DocSearch v2.

{
  indexName: "YOUR_INDEX_NAME",
  pathsToMatch: ["https://YOUR_WEBSITE_URL/api/**"],
  recordExtractor: ({ $, helpers }) => {
    return helpers.docsearch({
      recordProps: {
        lvl0: "header h1",
        lvl1: "article h2",
        lvl2: "article h3",
        lvl3: "article h4",
        lvl4: "article h5",
        lvl5: "article h6",
        content: "article p, article li",
      },
      recordVersion: "v3",
    });
  },
},

`recordProps` API Reference

`lvl0`

type: Lvl0 | required

type Lvl0 = {
  selectors: string | string[];
  defaultValue?: string;
};

`lvl1`, `content`

type: string | string[] | required

`lvl2`, `lvl3`, `lvl4`, `lvl5`, `lvl6`

type: string | string[] | optional

`pageRank`

type: string | optional

See the live example

Custom variables

type: string | string[] | CustomVariable | optional

type CustomVariable =
  | {
      defaultValue: string | string[];
    }
  | {
      selectors: string | string[];
      defaultValue?: string | string[];
    };

Custom variables are used to filter your search. You can define them in recordProps.

`helpers.docsearch` API Reference

`aggregateContent`

type: boolean | default: true | optional

This option groups the Algolia records created at the content level of the selector into a single record for its matching heading.

`recordVersion`

type: 'v3' | 'v2' | default: v2 | optional

This option removes content from the Algolia records that is only used by DocSearch v2. If you are using the latest version of DocSearch, you can set it to v3.

`indexHeadings`

type: boolean | { from: number, to: number } | default: true | optional

This option tells the crawler whether the headings (lvlX) should be indexed.

When false, only records for the content level are created.
When from and to are provided, only records for the levels lvlX (from) to lvlY (to) are created.
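The three forms of indexHeadings can be modeled with a small predicate. This is a sketch under stated assumptions, not the crawler's logic: `isHeadingIndexed` is a hypothetical function, and it assumes that `from` and `to` are inclusive bounds, which the description above does not spell out.

```javascript
// Hypothetical model of the `indexHeadings` option for heading levels lvl0..lvl6.
// Assumption: `from` and `to` are inclusive bounds.
function isHeadingIndexed(level, indexHeadings = true) {
  if (typeof indexHeadings === "boolean") return indexHeadings;
  return level >= indexHeadings.from && level <= indexHeadings.to;
}

const levels = [0, 1, 2, 3, 4, 5, 6];

// { from: 1, to: 3 }: only lvl1, lvl2 and lvl3 headings produce records.
const withRange = levels.filter((level) => isHeadingIndexed(level, { from: 1, to: 3 }));

// false: no heading records, only content-level records remain.
const withFalse = levels.filter((level) => isHeadingIndexed(level, false));

// Option omitted: the default (true) indexes every heading level.
const withDefault = levels.filter((level) => isHeadingIndexed(level));

console.log(withRange, withFalse, withDefault);
```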
