GSA (Google Search Appliance) Crawling Content


  1. If no crawlers are defined, there won't be any data in the index.
  2. Term "Crawling" is used when getting the web content , "Traversal" is the term used to while reading info from the filesystem.
  3. The crawler name is set under Admin Console --> Crawl and Index --> HTTP Headers.
  4. Configuring crawlers
    1. From the admin console:
      1. Log in to the GSA (http://gsa:8000/EnterpriseController)
      2. Go to Crawl and Index --> Crawl URLs
      3. Fill in the start URLs, follow patterns, and do not follow patterns text areas
    2. Can point to a root or index page
      1. The GSA sends a GET request to the start page (or root page), which contains links to the other pages on the web server, and then crawls those links recursively.
    3. Can specify follow patterns, i.e. a whitelist (file extensions can be specified)
      1. For example: http://sivavaka.com/, /sales/, *.pdf, *.html
    4. Can specify do not follow patterns, i.e. a blocklist
      1. For example: http://sivavaka.com/sensitive, *.exe, *.bin
      2. Typical do-not-crawl patterns include calendar links (previous/next-year links can create an infinite crawl loop)
    5. Rules for configuring the patterns (see the sample patterns after this list)
      1. Add a trailing "/" for directories: this causes the crawler to follow all links under the directory and also look for the index.html file by default.
      2. Use "#" for comments in patterns.
      3. Regular expressions: regexp:\\.mdb$  (here "\\" is the escape character)
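
A minimal sketch of how the three text areas might be filled in, using the example host from above (sivavaka.com and the paths shown are illustrative, not required values):

Start URLs:
  http://sivavaka.com/

Follow Patterns:
  # index the sales area plus all PDF and HTML files
  http://sivavaka.com/sales/
  *.pdf
  *.html

Do Not Follow Patterns:
  # sensitive area, binaries, and Access databases
  http://sivavaka.com/sensitive
  *.exe
  regexp:\\.mdb$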
NOTE: The GSA doesn't store the binary version of binary documents (Word documents, XLS files, etc.); it converts the document to HTML and stores the text. (The text version link returns the HTML version.)


  5. Document Dates
    1. Document dates can be found in several locations:
      1. By default, the Last-Modified date returned by the web server
      2. In the URL
      3. In a meta tag
      4. In the body
      5. In the title
      6. Configured from Admin Console --> Document Dates
        1. */ --> last modified  ("*/" means all documents)
Note: If no date is found, the document is indexed without a date and appears last in date-sorted results.
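
For example, a page that exposes its date in a meta tag might carry something like the tag below; on the Document Dates page you would then associate a URL pattern with that meta tag name (the tag name "date" here is only an assumption for illustration):

<meta name="date" content="2011-05-20">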


  6. Recrawling and load
    1. Recrawl interval (see the sketch below):
      1. The GSA sets the recrawl interval to 3 days by default. If a document has changed, it shortens the recrawl interval by 50%, i.e. to a day and a half; if the GSA sees another change it shortens the interval by 50% again, and so on until it stops seeing changes.
      2. If a document doesn't change, the recrawl interval grows by 50%.
    2. Removing documents
      1. If a document returns a 404, the GSA can be configured to remove it from the index.
      2. You can also specify URL patterns under Crawl and Index --> Freshness Tuning.
    3. Host load schedule
      1. Settings such as parallel connections, maximum document size, number of documents to crawl, etc.
    4. You can see each document's crawl history under GSA Console --> Crawl Diagnostics.
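
A small sketch of the adaptive recrawl schedule described above (the 3-day default and the 50% factor come from the notes; the code is only an illustration of the rule, not GSA's actual implementation):

DEFAULT_INTERVAL_HOURS = 72.0  # 3-day default recrawl interval

def next_interval(current_hours: float, changed: bool) -> float:
    """Halve the interval when a change is seen; grow it 50% otherwise."""
    return current_hours * (0.5 if changed else 1.5)

# Example: a document that changes twice, then stabilizes
interval = DEFAULT_INTERVAL_HOURS
for changed in (True, True, False, False):
    interval = next_interval(interval, changed)
    print(f"changed={changed} -> next recrawl in {interval:g} hours")
# 72h -> 36h -> 18h -> 27h -> 40.5h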

  7. Reasons for URLs not being in the index
    1. Files blocked by robots.txt. Sample:
       User-agent: *
       Disallow: /Something/sensitive
       User-agent: googlebot
       Disallow: /
    2. Unsupported file types
    3. Files beyond the license/coverage limit
    4. Do-not-crawl patterns
    5. Orphaned files (files no links point to)
    6. Password-protected files
    7. AJAX content
    8. <frame> and <a> outside of <frameset>
  8. Troubleshooting a URL that is not in the results
    1. Browser access: if crawler access isn't set up, the GSA may not be able to crawl secure documents
    2. Crawl Diagnostics
    3. Real-time Diagnostics
    4. Cached version
    5. Search results
You can search directly with the following query operators:
info:<url>  verifies that the page is indexed
link:<url>  finds pages that link to this URL
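
For example, the operator can be issued directly against a GSA search URL (the hostname, collection, and front-end names below are placeholders for your own setup):

http://gsa/search?q=info:http://sivavaka.com/&site=default_collection&client=default_frontend&output=xml_no_dtd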
