GSA (Google Search Appliance) Crawling Content


  1. If no crawlers are defined, there won't be any data in the index.
  2. Term "Crawling" is used when getting the web content , "Traversal" is the term used to while reading info from the filesystem.
  3. The crawler name is set under Admin Console --> Crawl and Index --> HTTP Headers.
  4. Configuring crawlers
    1. From the admin console:
      1. Log in to the GSA (http://gsa:8000/EnterpriseController)
      2. Go to Crawl and Index --> Crawl URLs
      3. Fill in the start URLs, follow patterns, and do not follow patterns text areas
    2. Can point to a root or index page
      1. The GSA sends a GET request to the start page (or root page), which contains links to the other pages on the web server, and then crawls those links recursively.
    3. Can specify follow patterns, i.e. a whitelist (file extensions can be specified)
      1. For example: http://sivavaka.com/, /sales/, *.pdf, *.html
    4. Can specify do not follow patterns, i.e. a blocklist
      1. For example: http://sivavaka.com/sensitive, *.exe, *.bin
      2. Typical do-not-crawl patterns include calendar links (previous/next-year links can create an infinite crawl loop)
    5. Rules for configuring the patterns (see the sample patterns after this list)
      1. Add a trailing "/" for directories: this causes the crawler to follow all links under the directory and also look for the index.html file by default.
      2. Use "#" for comments in patterns.
      3. Regular expressions: regexp:\\.mdb$  (here "\\" is the escape character)
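
A minimal sketch of how the three text areas might be filled in, using the example host from above (sivavaka.com and the paths shown are illustrative, not required values):

Start URLs:
  http://sivavaka.com/

Follow Patterns:
  # index the sales area plus all PDF and HTML files
  http://sivavaka.com/sales/
  *.pdf
  *.html

Do Not Follow Patterns:
  # sensitive area, binaries, and Access databases
  http://sivavaka.com/sensitive
  *.exe
  regexp:\\.mdb$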
NOTE: The GSA doesn't store the binary version of binary documents (Word documents, XLS files, etc.); it converts the document to HTML and stores the text. (The text version link returns the HTML version.)


  5. Document Dates
    1. Document dates can be found in several locations:
      1. By default, the Last-Modified date returned by the web server
      2. In the URL
      3. In a meta tag
      4. In the body
      5. In the title
      6. Configured from Admin Console --> Document Dates
        1. */ --> last modified  ("*/" means all documents)
Note: If no date is found, the document is indexed without a date and appears last in date-sorted results.
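
For example, a page that exposes its date in a meta tag might carry something like the tag below; on the Document Dates page you would then associate a URL pattern with that meta tag name (the tag name "date" here is only an assumption for illustration):

<meta name="date" content="2011-05-20">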


  6. Recrawling and load
    1. Recrawl interval (see the sketch below):
      1. The GSA sets the recrawl interval to 3 days by default. If a document has changed, it shortens the recrawl interval by 50%, i.e. to a day and a half; if the GSA sees another change it shortens the interval by 50% again, and so on until it stops seeing changes.
      2. If a document doesn't change, the recrawl interval grows by 50%.
    2. Removing documents
      1. If a document returns a 404, the GSA can be configured to remove it from the index.
      2. You can also specify URL patterns under Crawl and Index --> Freshness Tuning.
    3. Host load schedule
      1. Settings such as parallel connections, maximum document size, number of documents to crawl, etc.
    4. You can see each document's crawl history under GSA Console --> Crawl Diagnostics.
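
A small sketch of the adaptive recrawl schedule described above (the 3-day default and the 50% factor come from the notes; the code is only an illustration of the rule, not GSA's actual implementation):

DEFAULT_INTERVAL_HOURS = 72.0  # 3-day default recrawl interval

def next_interval(current_hours: float, changed: bool) -> float:
    """Halve the interval when a change is seen; grow it 50% otherwise."""
    return current_hours * (0.5 if changed else 1.5)

# Example: a document that changes twice, then stabilizes
interval = DEFAULT_INTERVAL_HOURS
for changed in (True, True, False, False):
    interval = next_interval(interval, changed)
    print(f"changed={changed} -> next recrawl in {interval:g} hours")
# 72h -> 36h -> 18h -> 27h -> 40.5h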

  7. Reasons for URLs not being in the index
    1. Files blocked by robots.txt. Sample:
       User-agent: *
       Disallow: /Something/sensitive
       User-agent: googlebot
       Disallow: /
    2. Unsupported file types
    3. Files beyond the license/coverage limit
    4. Do-not-crawl patterns
    5. Orphaned files (files no links point to)
    6. Password-protected files
    7. AJAX content
    8. <frame> and <a> outside of <frameset>
  8. Troubleshooting a URL that is not in the results
    1. Browser access: if crawler access isn't set up, the GSA may not be able to crawl secure documents
    2. Crawl Diagnostics
    3. Real-time Diagnostics
    4. Cached version
    5. Search results
You can search directly with the following query operators:
info:<url>  verifies that the page is indexed
link:<url>  finds pages that link to this URL
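
For example, the operator can be issued directly against a GSA search URL (the hostname, collection, and front-end names below are placeholders for your own setup):

http://gsa/search?q=info:http://sivavaka.com/&site=default_collection&client=default_frontend&output=xml_no_dtd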
