Google Search Appliance (GSA) Feeds


Two types of feeds can be sent to the GSA:

  1. Metadata-and-URL
    1. The client sends the metadata and the URL of the actual document (a binary document or web content such as HTML) to the GSA.
    2. The GSA indexes the metadata and makes an HTTP GET call to crawl the content at the URL. (It also crawls the links found in the HTML.)
    3. The records are added to the existing feed.
  2. Content Feeds
    1. The feed includes both the metadata and the content (the whole HTML content, text, …).
    2. Binary content must be base64-encoded and included in the XML.
    3. The feed can be large: an XML feed can be up to 1 GB.
    4. You need to specify the feed type as full or incremental (indicates whether to replace or add to the existing feed content).
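
Since binary content has to be encoded before it goes into the XML, here is a minimal Python sketch of the two encodings the DTD allows for the content element: base64binary (plain base64) and base64compressed (zlib-compressed, then base64-encoded). The function names encode_content and decode_content are made up for this illustration; they are not part of any GSA tooling.

```python
import base64
import zlib
from typing import Tuple


def encode_content(data: bytes, compressed: bool = False) -> Tuple[str, str]:
    """Return (encoding attribute value, encoded text) for a <content> element."""
    if compressed:
        # base64compressed: zlib-compress first, then base64-encode
        payload = base64.b64encode(zlib.compress(data))
        return "base64compressed", payload.decode("ascii")
    # base64binary: plain base64
    return "base64binary", base64.b64encode(data).decode("ascii")


def decode_content(encoding: str, text: str) -> bytes:
    """Reverse encode_content; handy for checking a feed before pushing it."""
    raw = base64.b64decode(text)
    if encoding == "base64compressed":
        return zlib.decompress(raw)
    return raw
```

The first return value goes into the encoding attribute of the content element and the second becomes its text.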


XML Feed Elements
  1. datasource: the name of the feed
  2. feedtype: incremental, full, or metadata-and-url
    1. For a content feed it is incremental or full
    2. For a metadata-and-URL feed it is metadata-and-url; the feed carries only metadata and URLs
  3. record: each record is a specific document and can have attributes such as url, action, and mimetype
    1. action: add or delete
    2. url: for a metadata-and-URL feed, the GSA makes the GET call with this URL; for a content feed, the GSA uses it as the unique identifier of the record. (You may need to configure a user ID and password so the GSA can fetch the content from the URL.)
    3. mimetype: tells the GSA how to treat the document, e.g. application/pdf or text/html
    4. last-modified: the last-modified date of the document, in RFC 822 format
    5. authmethod: none, httpbasic, ntlm, or httpsso
    6. displayurl: an alternate URL, used in search results
  4. metadata: you can send ACLs using the metadata, as below
<meta name="google:aclusers" content="siva,vaka"/>
<meta name="google:aclgroups" content="admins,hr"/>

  5. content: the actual content (HTML content, binary content, etc.)
    1. encoding: specifies the encoding, e.g. base64binary or base64compressed.


<gsafeed>
<header>
<datasource>myfeed</datasource>
<feedtype>incremental</feedtype>
</header>
<group>
<record url="" action="add" mimetype="text/html">
<metadata>
<meta name="state" content="new york"/>
<meta name="city" content="buffalo"/>
<meta name="google:aclusers" content="siva,vaka"/>
<meta name="google:aclgroups" content="admins,hr"/>
</metadata>
<content>
<![CDATA[
<html>
<head><title>my doc</title></head>
<body>this is html document body</body>
</html>
]]>
</content>
</record>

<record url="" action="add" mimetype="application/pdf">
<content encoding="base64binary">
SDfadfxdafeadfdafJLKadfad
Adfadsfadf KDLDJSPofaifadf…
</content>
</record>

<record url="http://test.com/page1.html" displayurl="http://test.com/pagesIndex.html" action="add" mimetype="text/html">
<content>

</content>

</record>

</group>
</gsafeed>
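
A feed like the one above can also be generated from code instead of being written by hand. The following is a rough Python sketch using only the standard library. The build_feed function and the dict shape of each record (url, mimetype, optional action, meta, and content keys) are assumptions made for this example. Note that ElementTree escapes the content text rather than wrapping it in CDATA, which is equivalent for an XML parser.

```python
import xml.etree.ElementTree as ET


def build_feed(datasource, feedtype, records):
    """Build a gsafeed document as a string (hypothetical helper)."""
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = feedtype
    group = ET.SubElement(root, "group")
    for rec in records:
        # url and mimetype are required by the DTD; action defaults to add
        record = ET.SubElement(group, "record", {
            "url": rec["url"],
            "action": rec.get("action", "add"),
            "mimetype": rec["mimetype"],
        })
        meta = rec.get("meta")
        if meta:
            metadata = ET.SubElement(record, "metadata")
            for name, content in meta.items():
                ET.SubElement(metadata, "meta", {"name": name, "content": content})
        if "content" in rec:
            ET.SubElement(record, "content").text = rec["content"]
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


feed_xml = build_feed("myfeed", "metadata-and-url", [
    {
        "url": "http://test.com/page1.html",
        "mimetype": "text/html",
        "meta": {"state": "new york", "city": "buffalo"},
    },
])
print(feed_xml)
```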


NOTE: Metadata-and-URL feeds give better relevance because the GSA follows the links and so preserves the link structure; content feeds may not, because the GSA does not follow the links in fed content.

A sample HTML page to post the feed to the GSA is as follows:
<html>
<body>
<form enctype="multipart/form-data" action="http://<gsa>:19900/xmlfeed" method="post">
<input type="text" name="datasource">
<input type="radio" name="feedtype" value="full">full
<input type="radio" name="feedtype" value="incremental">incremental
<input type="radio" name="feedtype" value="metadata-and-url">metadata-and-url
<input type="file" name="data">
<input type="submit" value="submit">
</form>
</body>
</html>
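
The same POST can be made from a script. Below is a hedged Python sketch that sends the three form fields from the page above (datasource, feedtype, data) as multipart/form-data to port 19900. The helper names encode_multipart and push_feed are invented for this example, and it assumes the appliance accepts the data field as a plain form field without a filename, so treat it as a starting point rather than a verified client.

```python
import urllib.request
import uuid


def encode_multipart(fields):
    """Encode a dict of text fields as multipart/form-data.

    Returns (body bytes, Content-Type header value).
    """
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")
        lines.append(value)
    lines.append(f"--{boundary}--")
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def push_feed(gsa_host, datasource, feedtype, feed_xml):
    """POST a feed to http://<gsa_host>:19900/xmlfeed and return (status, text)."""
    body, content_type = encode_multipart({
        "datasource": datasource,
        "feedtype": feedtype,
        "data": feed_xml,
    })
    request = urllib.request.Request(
        f"http://{gsa_host}:19900/xmlfeed",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8", "replace")
```

push_feed needs a reachable appliance, so encode_multipart is the part that can be checked locally.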

NOTE:
  1. The default port for HTTPS feeds is 19902.
  2. The maximum feed file size is 1 GB; if a feed is larger, break it into several feeds with the same feed name.
  3. Timeouts may happen if too many feeds are pushed.
  4. To delete a feed:
    1. Add a do-not-crawl pattern.
    2. For content feeds, you can delete the feed from the Feeds page.
    3. Delete individual records using action="delete".
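
To illustrate the last option, here is a small Python sketch that produces a feed whose records all carry action="delete". The function name build_delete_feed is hypothetical, and using the incremental feed type and text/plain as the (required) mimetype for deleted records are assumptions of this example.

```python
from xml.sax.saxutils import escape, quoteattr


def build_delete_feed(datasource, urls):
    """Build a feed that deletes the given URLs (hypothetical helper)."""
    # quoteattr escapes the URL and wraps it in quotes for use as an attribute
    records = "\n".join(
        f'<record url={quoteattr(url)} action="delete" mimetype="text/plain"/>'
        for url in urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<gsafeed>\n"
        "<header>\n"
        f"<datasource>{escape(datasource)}</datasource>\n"
        "<feedtype>incremental</feedtype>\n"
        "</header>\n"
        "<group>\n"
        f"{records}\n"
        "</group>\n"
        "</gsafeed>\n"
    )
```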


GSA Feed DTD

<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT gsafeed (header, group+)>
<!ELEMENT header (datasource, feedtype)>
<!-- datasource name should match the regex [a-zA-Z_][a-zA-Z0-9_-]*,
     the first character must be a letter or underscore,
     the rest of the characters can be alphanumeric, dash, or underscore. -->
<!ELEMENT datasource (#PCDATA)>
<!-- feedtype must be either 'full', 'incremental', or 'metadata-and-url' -->
<!ELEMENT feedtype (#PCDATA)>
<!-- group element lets you group records together and
     specify a common action for them -->
<!ELEMENT group (record*)>
<!-- record element can have attribute that overrides group's element-->
<!ELEMENT record (metadata*,content*)>
<!ELEMENT metadata (meta*)>
<!ELEMENT meta EMPTY>
<!ELEMENT content (#PCDATA)>
<!-- default is 'add' -->
<!-- last-modified date as per RFC822 -->
<!ATTLIST group
   action (add|delete) "add"
   pagerank CDATA #IMPLIED>
<!ATTLIST record
   url CDATA #REQUIRED
   displayurl CDATA #IMPLIED
   action (add|delete) #IMPLIED
   mimetype CDATA #REQUIRED
   last-modified CDATA #IMPLIED
   lock (true|false) "false"
   authmethod (none|httpbasic|ntlm|httpsso) "none"
   pagerank CDATA #IMPLIED>
<!ATTLIST meta
   encoding (base64binary) #IMPLIED
   name CDATA #REQUIRED
   content CDATA #REQUIRED>
<!-- for content, if encoding is specified, it should be either base64binary
     (base64 encoded) or base64compressed (zlib compressed and then base64
     encoded). -->
<!ATTLIST content encoding (base64binary|base64compressed) #IMPLIED>
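
The comments in the DTD spell out two constraints that are easy to check before a feed is pushed: the datasource name pattern and the allowed feedtype values. A minimal Python sketch of that check follows; the validate_header function is made up for this example.

```python
import re

# Pattern and values taken from the comments in the DTD above
DATASOURCE_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_-]*")
FEEDTYPES = {"full", "incremental", "metadata-and-url"}


def validate_header(datasource, feedtype):
    """Return a list of problems found in the feed header values."""
    errors = []
    if not DATASOURCE_RE.fullmatch(datasource):
        errors.append(f"invalid datasource name: {datasource!r}")
    if feedtype not in FEEDTYPES:
        errors.append(f"invalid feedtype: {feedtype!r}")
    return errors
```

An empty list means both values satisfy the constraints stated in the DTD comments.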

Resources
  1. https://developers.google.com/search-appliance/documentation/612/feedsguide#system
