(Translation) Automatically Generating a Sitemap in Laravel 8

Sep 10, 2021

Is a sitemap really necessary?

In theory, a sitemap helps search engine crawlers discover every page of your site. Google's documentation explains this in detail:

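If your site's pages are properly linked, crawlers can usually discover your pages. Even so, a sitemap can improve crawling, particularly if your site meets one of the following criteria:

- Your site is very large. As a result, Google's crawlers may overlook some of your recently updated pages.
- Your site has a large archive of content pages that are isolated or not well linked to each other. If your pages do not naturally reference each other, you can list them in a sitemap to make sure Google does not overlook them.
- Your site is new and has few external links pointing to it. Googlebot and other crawlers find pages by following links from one page to another. If no other site links to yours, Google may not discover your pages at all.
- Your site uses rich media content, is shown in Google News, or uses other sitemap-compatible annotations.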

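On large sites, not every page is linked from somewhere (on an e-commerce site, for example, not every product is linked from a page), so defining a sitemap is necessary. For small to medium-sized sites, where all pages are probably linked properly, the conclusion from Google's documentation is that a sitemap is not required.

That said, it is often mentioned that crawling is somewhat faster when a sitemap exists and is submitted to search engines. Another commonly cited benefit of submitting a sitemap in Google Search Console is that you can compare the number of pages you submitted with the number Google has picked up. This lets you detect whether Google is failing to crawl parts of the site that you want crawled.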

Installation

$ composer require spatie/laravel-sitemap

Configuration

To override the default configuration, first publish the config file:

$ php artisan vendor:publish --provider="Spatie\Sitemap\SitemapServiceProvider"

The config file is config/sitemap.php:

<?php

use GuzzleHttp\RequestOptions;
use Spatie\Sitemap\Crawler\Profile;

return [

    /*
     * These options will be passed to GuzzleHttp\Client when it is created.
     * For in-depth information on all options see the Guzzle docs:
     *
     * http://docs.guzzlephp.org/en/stable/request-options.html
     */
    'guzzle_options' => [

        /*
         * Whether or not cookies are used in a request.
         */
        RequestOptions::COOKIES => true,

        /*
         * The number of seconds to wait while trying to connect to a server.
         * Use 0 to wait indefinitely.
         */
        RequestOptions::CONNECT_TIMEOUT => 10,

        /*
         * The timeout of the request in seconds. Use 0 to wait indefinitely.
         */
        RequestOptions::TIMEOUT => 10,

        /*
         * Describes the redirect behavior of a request.
         */
        RequestOptions::ALLOW_REDIRECTS => false,
    ],

    /*
     * The sitemap generator can execute JavaScript on each page so it will
     * discover links that are generated by your JS scripts. This feature
     * is powered by headless Chrome.
     */
    'execute_javascript' => false,

    /*
     * The package will make an educated guess as to where Google Chrome is installed.
     * You can also manually pass its location here.
     */
    'chrome_binary_path' => null,

    /*
     * The sitemap generator uses a CrawlProfile implementation to determine
     * which urls should be crawled for the sitemap.
     */
    'crawl_profile' => Profile::class,

];

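Creating a sitemap

Imagine you have a Laravel application on the example.com domain in which every page is properly linked. It has a home page, a contact page, and project pages. With this package you can build the sitemap like this: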

use App\Models\Project; // assumed location of the Project model
use Spatie\Sitemap\Sitemap;
use Spatie\Sitemap\Tags\Url;

$sitemap = Sitemap::create()
    ->add(Url::create('/home'))
    ->add(Url::create('/contact'));

// Add one entry per project.
Project::all()->each(function (Project $project) use ($sitemap) {
    $sitemap->add(Url::create("/project/{$project->slug}"));
});

$sitemap->writeToFile(public_path('sitemap.xml'));

This does what we need, but it is a bit tedious: whenever you add another page, you have to remember to come back and add it here too.
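For orientation, the file written to public/sitemap.xml is an XML document in the sitemaps.org format. The following is a minimal plain-PHP sketch of that structure, not the package's implementation; the function name buildSitemapXml and the sample paths are assumptions made for illustration.

```php
<?php
// Minimal sketch (not the package's code): build sitemap XML by hand
// to show the structure that ends up in sitemap.xml.
// buildSitemapXml() and the sample paths are hypothetical.

function buildSitemapXml(string $baseUrl, array $paths): string
{
    $xml = new XMLWriter();
    $xml->openMemory();
    $xml->setIndent(true);
    $xml->startDocument('1.0', 'UTF-8');

    // Root element with the sitemap protocol namespace.
    $xml->startElement('urlset');
    $xml->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');

    foreach ($paths as $path) {
        // One <url><loc>...</loc></url> entry per page, with an absolute URL.
        $xml->startElement('url');
        $xml->writeElement('loc', rtrim($baseUrl, '/') . '/' . ltrim($path, '/'));
        $xml->endElement();
    }

    $xml->endElement();
    $xml->endDocument();

    return $xml->outputMemory();
}

echo buildSitemapXml('https://example.com', ['/home', '/contact']);
```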

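Generating a sitemap

To avoid adding pages by hand, the package provides SitemapGenerator. This class crawls your site automatically and generates the sitemap.

With SitemapGenerator, the example above can be replaced by the following: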

use Spatie\Sitemap\SitemapGenerator;

SitemapGenerator::create('https://example.com')->writeToFile(public_path('sitemap.xml'));

You can easily create an Artisan command that generates the sitemap and then schedule it. That way new pages and content are added automatically.

$ php artisan make:command GenerateSitemap

<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;
use Spatie\Sitemap\SitemapGenerator;

class GenerateSitemap extends Command
{
    /**
     * The console command name.
     *
     * @var string
     */
    protected $signature = 'sitemap:generate';

    /**
     * The console command description.
     *
     * @var string
     */
    protected $description = 'Generate the sitemap.';

    /**
     * Execute the console command.
     *
     * @return mixed
     */
    public function handle()
    {
        // modify this to your own needs
        SitemapGenerator::create(config('app.url'))
            ->writeToFile(public_path('sitemap.xml'));
    }
}

The command can then be scheduled to run daily in the console kernel:

// app/Console/Kernel.php
protected function schedule(Schedule $schedule)
{
    // ...
    $schedule->command('sitemap:generate')->daily();
    // ...
}

The best of both worlds

You can combine the two approaches: let the generator crawl the site, then add extra pages by hand:

SitemapGenerator::create('https://example.com')
    ->getSitemap()
    ->add(Url::create('/extra-page'))
    // ->add(...) for any further pages
    ->writeToFile($path);

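Limitations

This package mainly targets small to medium-sized sites. According to the specification, a single sitemap can hold up to 50,000 entries; beyond that you need a sitemap index. The specification also allows marking links for specific content types such as videos and images, which the package does not currently support.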
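To make the 50,000-entry limit concrete, here is a rough plain-PHP sketch, independent of the package, that splits a URL list into chunks of at most 50,000 and lists the resulting files in a sitemap index. The function names, the file naming scheme (sitemap-1.xml, sitemap-2.xml, ...), and the product URLs are assumptions made for illustration.

```php
<?php
// Sketch under stated assumptions: chunk URLs into sitemap files of at most
// 50,000 entries and reference those files from a sitemap index.
// Function names, file names and URLs are hypothetical.

const MAX_URLS_PER_SITEMAP = 50000;

/**
 * @param string[] $urls absolute URLs
 * @return array<string, string[]> sitemap file name => URLs it should contain
 */
function chunkUrlsIntoSitemaps(array $urls): array
{
    $files = [];
    foreach (array_chunk($urls, MAX_URLS_PER_SITEMAP) as $i => $chunk) {
        $files['sitemap-' . ($i + 1) . '.xml'] = $chunk;
    }
    return $files;
}

/**
 * @param string[] $fileNames names of the individual sitemap files
 */
function buildSitemapIndexXml(string $baseUrl, array $fileNames): string
{
    $xml = new XMLWriter();
    $xml->openMemory();
    $xml->setIndent(true);
    $xml->startDocument('1.0', 'UTF-8');
    $xml->startElement('sitemapindex');
    $xml->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');

    foreach ($fileNames as $name) {
        // One <sitemap><loc>...</loc></sitemap> entry per sitemap file.
        $xml->startElement('sitemap');
        $xml->writeElement('loc', rtrim($baseUrl, '/') . '/' . $name);
        $xml->endElement();
    }

    $xml->endElement();
    $xml->endDocument();

    return $xml->outputMemory();
}

// 120,000 URLs are split into three files: 50,000 + 50,000 + 20,000.
$urls = [];
for ($i = 1; $i <= 120000; $i++) {
    $urls[] = "https://example.com/product/{$i}";
}

$files = chunkUrlsIntoSitemaps($urls);
echo buildSitemapIndexXml('https://example.com', array_keys($files));
```

Each chunk would then be written to its own file, and only the index is submitted to search engines.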

Custom crawl profile

You can control which URLs, domains, and sub-domains are crawled by extending Spatie\Crawler\CrawlProfiles\CrawlProfile and customizing its shouldCrawl() method.

use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlProfiles\CrawlProfile;

class CustomCrawlProfile extends CrawlProfile
{
    public function shouldCrawl(UriInterface $url): bool
    {
        if ($url->getHost() !== 'localhost') {
            return false;
        }
        
        return $url->getPath() === '/';
    }
}

Then register it in the config file config/sitemap.php:

return [
    // ...

    /*
     * The sitemap generator uses a CrawlProfile implementation to determine
     * which urls should be crawled for the sitemap.
     */
    'crawl_profile' => CustomCrawlProfile::class,
];

Changing properties

For example, to change the lastmod, changefreq, and priority of the contact page /contact:

use Carbon\Carbon;
use Spatie\Sitemap\SitemapGenerator;
use Spatie\Sitemap\Tags\Url;

SitemapGenerator::create('https://example.com')
    ->hasCrawled(function (Url $url) {
        if ($url->segment(1) === 'contact') {
            $url->setPriority(0.9)
                ->setLastModificationDate(Carbon::create('2016', '1', '1'));
        }

        return $url;
    })
    ->writeToFile($sitemapPath);

Leaving out links

If you do not want some crawled links to appear in the sitemap, handle them in hasCrawled and return nothing for them:

use Spatie\Sitemap\SitemapGenerator;
use Spatie\Sitemap\Tags\Url;

SitemapGenerator::create('https://example.com')
    ->hasCrawled(function (Url $url) {
        if ($url->segment(1) === 'contact') {
            return;
        }

        return $url;
    })
    ->writeToFile($sitemapPath);

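Preventing the crawler from crawling some pages

You can also tell the crawler itself to skip certain pages. Note that shouldCrawl only works with the default crawl profile or with custom crawl profiles that implement a shouldCrawlCallback: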

use Psr\Http\Message\UriInterface;
use Spatie\Sitemap\SitemapGenerator;

SitemapGenerator::create('https://example.com')
    ->shouldCrawl(function (UriInterface $url) {
        // All pages will be crawled, except the contact page.
        // Links present on the contact page won't be added to the
        // sitemap unless they are present on a crawlable page.

        return strpos($url->getPath(), '/contact') === false;
    })
    ->writeToFile($sitemapPath);
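The check in that callback is a substring match, so it is worth seeing which paths it actually rejects. Below is a small plain-PHP sketch that applies the same strpos test to path strings instead of UriInterface objects; the helper name shouldCrawlPath and the sample URLs are assumptions made for illustration.

```php
<?php
// Plain-PHP sketch of the predicate used in the shouldCrawl callback,
// applied to path strings instead of UriInterface objects.
// shouldCrawlPath() and the sample URLs are hypothetical.

function shouldCrawlPath(string $path): bool
{
    // Same test as in the callback: reject any path containing "/contact".
    return strpos($path, '/contact') === false;
}

$samples = [
    'https://example.com/',
    'https://example.com/contact',
    'https://example.com/contact-us',
    'https://example.com/about/contact',
    'https://example.com/projects',
];

foreach ($samples as $url) {
    // Extract the path component; fall back to "/" if the URL has none.
    $path = parse_url($url, PHP_URL_PATH) ?? '/';
    printf("%-40s %s\n", $url, shouldCrawlPath($path) ? 'crawl' : 'skip');
}
```

Because the test matches anywhere in the path, /contact-us and /about/contact are skipped as well. If only the contact page itself should be excluded, comparing the path with '/contact' exactly is the stricter option.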

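Configuring the crawler

The crawler itself can be configured to do a few different things. You can configure it through the sitemap generator, for example to ignore robots checks: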

use Spatie\Crawler\Crawler;
use Spatie\Sitemap\SitemapGenerator;

SitemapGenerator::create('http://localhost:4020')
    ->configureCrawler(function (Crawler $crawler) {
        $crawler->ignoreRobots();
    })
    ->writeToFile($file);

Limiting the number of pages crawled

use Spatie\Sitemap\SitemapGenerator;

SitemapGenerator::create('https://example.com')
    ->setMaximumCrawlCount(500) // only the 500 first pages will be crawled
    ->writeToFile($sitemapPath);

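Executing JavaScript

Note: on Laravel 8, enabling this setting currently does not correctly pick up links generated by Inertia.

The sitemap generator can execute JavaScript on every page it crawls, so it can discover links that are generated by JS. To enable this feature, set execute_javascript to true in the config file.

Under the hood, headless Chrome executes the JavaScript, so Chrome must be installed on the machine that generates the sitemap.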

Adding alternate links

On a multilingual site, a page may have several alternate versions. For that case you can add alternate links:

use Spatie\Sitemap\SitemapGenerator;
use Spatie\Sitemap\Tags\Url;

SitemapGenerator::create('https://example.com')
    ->getSitemap()
    // here we add one extra link, but you can add as many as you'd like
    ->add(Url::create('/extra-page')->setPriority(0.5)->addAlternate('/extra-pagina', 'nl'))
    ->writeToFile($sitemapPath);
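In sitemap XML, an alternate language version is expressed as an xhtml:link element with rel="alternate" and an hreflang attribute inside the url entry. The plain-PHP sketch below writes one such entry by hand to show the shape; it is not the package's implementation, and the function name and the absolute URLs are assumptions made for illustration.

```php
<?php
// Sketch (not the package's code): one <url> entry with an alternate
// language link, written with XMLWriter.
// buildUrlEntryWithAlternate() and the URLs are hypothetical.

function buildUrlEntryWithAlternate(string $loc, string $alternateHref, string $locale): string
{
    $xml = new XMLWriter();
    $xml->openMemory();
    $xml->setIndent(true);
    $xml->startDocument('1.0', 'UTF-8');

    $xml->startElement('urlset');
    $xml->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
    // The xhtml namespace is needed for the <xhtml:link> element.
    $xml->writeAttribute('xmlns:xhtml', 'http://www.w3.org/1999/xhtml');

    $xml->startElement('url');
    $xml->writeElement('loc', $loc);

    // Alternate language version of the same page.
    $xml->startElement('xhtml:link');
    $xml->writeAttribute('rel', 'alternate');
    $xml->writeAttribute('hreflang', $locale);
    $xml->writeAttribute('href', $alternateHref);
    $xml->endElement();

    $xml->endElement(); // closes <url>
    $xml->endElement(); // closes <urlset>
    $xml->endDocument();

    return $xml->outputMemory();
}

echo buildUrlEntryWithAlternate(
    'https://example.com/extra-page',
    'https://example.com/extra-pagina',
    'nl'
);
```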