There are two types of feeds we can send to the GSA:
- Metadata-and-URL feeds
  - The client sends the metadata and the URL of the actual document (a binary document or a web page such as HTML) to the GSA.
  - The GSA indexes the metadata and makes an HTTP GET call to crawl the content at the URL. (It also crawls the links on the HTML page.)
  - The records are added to the existing feed.
- Content feeds
  - The feed includes both the metadata and the content (the whole HTML content or text…).
  - If the content is binary, base64-encode it and include it in the XML.
  - The feed can be large; an XML feed file can be up to 1 GB.
  - You need to specify the feed type as full or incremental, which indicates whether to replace or add to the existing feed content.
XML Feed Elements
- datasource: the name of the feed.
- feedtype: incremental, full or metadata-and-url.
  - For a content feed it is incremental or full.
  - For a metadata-and-url feed, the feed carries only metadata and URLs.
- record: each record is a specific document and can have attributes such as url, action and mimetype.
  - action: add or delete.
  - url: for a metadata-and-url feed, the GSA makes the GET call with this URL; for a content feed, the GSA uses it as the unique identifier of the record. (You may need to configure the user ID and password to fetch the content using the URL.)
  - mimetype: denotes how the GSA treats the document, e.g. application/pdf or text/html.
  - last-modified: the last-modified date of the document, as per RFC822.
  - authmethod: none, httpbasic, ntlm or httpsso.
  - displayurl: an alternate URL, used in the search results.
- metadata: you can send the ACLs using the metadata, as below:
<meta name="google:aclusers" content="siva,vaka"/>
<meta name="google:aclgroups" content="admins,hr"/>
- content: the actual content (such as HTML content, binary content, etc.).
  - encoding: specifies the encoding, e.g. base64binary or base64compressed.
<gsafeed>
  <header>
    <datasource>myfeed</datasource>
    <feedtype>incremental</feedtype>
  </header>
  <group>
    <record url="" action="add" mimetype="text/html">
      <metadata>
        <meta name="state" content="new york"/>
        <meta name="city" content="buffalo"/>
        <meta name="google:aclusers" content="siva,vaka"/>
        <meta name="google:aclgroups" content="admins,hr"/>
      </metadata>
      <content>
        <![CDATA[
        <html>
        <head><title>my doc</title></head>
        <body>this is html document body</body>
        </html>
        ]]>
      </content>
    </record>
    <record url="" action="add" mimetype="application/pdf">
      <content encoding="base64binary">
        SDfadfxdafeadfdafJLKadfad
        Adfadsfadf
        KDLDJSPofaifadf…
      </content>
    </record>
    <record url="http://test.com/page1.html" displayurl="http://test.com/pagesIndex.html" action="add" mimetype="text/html">
      <content>
      </content>
    </record>
  </group>
</gsafeed>
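A feed like the one above can also be generated programmatically. The following is a minimal sketch in Python using only the standard library, not an official client. The record dictionary shape (`url`, `mimetype`, `meta`, `html`, `binary`) and the example URLs are assumptions made up for illustration. Note that `ElementTree` escapes the HTML text instead of wrapping it in CDATA, which an XML parser reads the same way.

```python
import base64
import zlib
import xml.etree.ElementTree as ET


def encode_content(data, compress=False):
    """Return (encoding name, encoded text) for a binary document."""
    if compress:
        # base64compressed: zlib-compress first, then base64-encode
        return "base64compressed", base64.b64encode(zlib.compress(data)).decode("ascii")
    return "base64binary", base64.b64encode(data).decode("ascii")


def build_feed(datasource, feedtype, records):
    """Build a gsafeed document from record dicts (hypothetical shape)."""
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = feedtype
    group = ET.SubElement(root, "group")
    for rec in records:
        record = ET.SubElement(group, "record", {
            "url": rec["url"],
            "action": rec.get("action", "add"),
            "mimetype": rec["mimetype"],
        })
        if rec.get("meta"):
            metadata = ET.SubElement(record, "metadata")
            for name, value in rec["meta"].items():
                ET.SubElement(metadata, "meta", {"name": name, "content": value})
        if "html" in rec:
            # Text content is XML-escaped on output rather than wrapped in CDATA
            ET.SubElement(record, "content").text = rec["html"]
        elif "binary" in rec:
            encoding, text = encode_content(rec["binary"])
            ET.SubElement(record, "content", {"encoding": encoding}).text = text
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


# Example usage with made-up URLs and content
feed = build_feed("myfeed", "incremental", [
    {"url": "http://test.com/doc1.html", "mimetype": "text/html",
     "meta": {"state": "new york", "google:aclusers": "siva,vaka"},
     "html": "<html><body>this is html document body</body></html>"},
    {"url": "http://test.com/doc2.pdf", "mimetype": "application/pdf",
     "binary": b"%PDF-1.4 example bytes"},
])
print(feed.decode("utf-8"))
```

Passing `compress=True` to `encode_content` produces the base64compressed form instead of base64binary.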
NOTE:
Metadata-and-URL feeds give better relevance because the GSA maintains the link
structure (it follows the links); content feeds may not, because the GSA does not
follow the links for them.
A sample HTML page for posting the feed to the GSA is as follows:
<html>
<body>
<form enctype="multipart/form-data" action="http://<gsa>:19900/xmlfeed" method="post">
  <input type="text" name="datasource">
  <input type="radio" name="feedtype" value="full">full
  <input type="radio" name="feedtype" value="incremental">incremental
  <input type="radio" name="feedtype" value="metadata-and-url">metadata-and-url
  <input type="file" name="data">
  <input type="submit" value="submit">
</form>
</body>
</html>
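The same multipart/form-data POST can be sent from a script instead of a browser form. Below is a rough sketch in Python using only the standard library. The form field names (`datasource`, `feedtype`, `data`) and port 19900 come from the form above; the function names, the `feed.xml` filename and the `text/xml` part type are assumptions made for this example.

```python
import urllib.request
import uuid


def encode_multipart(fields, file_field, file_bytes):
    """Encode plain form fields plus one file part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        part = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n'
            "\r\n"
            f"{value}\r\n"
        )
        parts.append(part.encode("utf-8"))
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="feed.xml"\r\n'
        "Content-Type: text/xml\r\n"
        "\r\n"
    )
    parts.append(file_header.encode("utf-8") + file_bytes + b"\r\n")
    # Closing boundary marks the end of the form body
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def push_feed(gsa_host, datasource, feedtype, feed_xml):
    """POST the feed bytes to the GSA feed port and return (status, response text)."""
    body, content_type = encode_multipart(
        {"datasource": datasource, "feedtype": feedtype}, "data", feed_xml)
    request = urllib.request.Request(
        f"http://{gsa_host}:19900/xmlfeed", data=body,
        headers={"Content-Type": content_type}, method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8", "replace")


# Example (needs a reachable appliance, so it is left commented out):
# status, text = push_feed("gsa.example.com", "myfeed", "incremental", feed_bytes)
```

The `datasource` and `feedtype` form values should match the values in the feed's header.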
NOTE:
- The default port for HTTPS feeds is 19902.
- The maximum feed file size is 1 GB; if a feed is larger than 1 GB, you can break it into several feeds with the same feed name.
- Timeouts may happen if too many feeds are pushed.
- To delete a feed:
  - Add a do-not-crawl pattern.
  - For content feeds, you can delete the feed from the feeds page.
  - Delete individual records using action="delete".
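Breaking a large feed into several smaller feeds with the same feed name can be sketched as a simple greedy grouping. This is only an illustration in Python: the function name is hypothetical, and it assumes each record has already been serialized to bytes. In practice the limit passed in should leave some headroom for the header and group wrapper around each batch.

```python
def split_into_batches(records, max_bytes):
    """Greedily group serialized records so each batch stays within max_bytes."""
    batches = []
    current = []
    current_size = 0
    for record in records:
        size = len(record)
        if size > max_bytes:
            raise ValueError("a single record exceeds the batch limit")
        # Start a new batch when adding this record would overflow the current one
        if current and current_size + size > max_bytes:
            batches.append(current)
            current = []
            current_size = 0
        current.append(record)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Example with tiny stand-in records and an 8-byte limit
records = [b"aaaa", b"bbbb", b"cccc"]
batches = split_into_batches(records, 8)
print(len(batches))
```

Each batch would then be wrapped in its own gsafeed document with the same datasource name and pushed separately.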
GSA Feed DTD
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT gsafeed (header, group+)>
<!ELEMENT header (datasource, feedtype)>
<!-- datasource name should match the regex [a-zA-Z_][a-zA-Z0-9_-]*,
the first character must be a letter or underscore,
the rest of the characters can be alphanumeric, dash, or underscore. -->
<!ELEMENT datasource (#PCDATA)>
<!-- feedtype must be either 'full', 'incremental', or 'metadata-and-url' -->
<!ELEMENT feedtype (#PCDATA)>
<!-- group element lets you group records together and
specify a common action for them -->
<!ELEMENT group (record*)>
<!-- record element can have attribute that overrides group's element -->
<!ELEMENT record (metadata*,content*)>
<!ELEMENT metadata (meta*)>
<!ELEMENT meta EMPTY>
<!ELEMENT content (#PCDATA)>
<!-- default is 'add' -->
<!-- last-modified date as per RFC822 -->
<!ATTLIST group
action (add|delete) "add"
pagerank CDATA #IMPLIED>
<!ATTLIST record
url CDATA #REQUIRED
displayurl CDATA #IMPLIED
action (add|delete) #IMPLIED
mimetype CDATA #REQUIRED
last-modified CDATA #IMPLIED
lock (true|false) "false"
authmethod (none|httpbasic|ntlm|httpsso) "none"
pagerank CDATA #IMPLIED>
<!ATTLIST meta
encoding (base64binary) #IMPLIED
name CDATA #REQUIRED
content CDATA #REQUIRED>
<!-- for content, if encoding is specified, it should be either base64binary
(base64 encoded) or base64compressed (zlib compressed and then base64
encoded). -->
<!ATTLIST content encoding (base64binary|base64compressed) #IMPLIED>
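The constraints stated in the DTD comments (the datasource name pattern and the three allowed feedtype values) can be checked before a feed is pushed. Here is a small sketch in Python; the function name and error messages are made up for illustration, while the regex and the feed type values are taken from the DTD comments above.

```python
import re

# Pattern from the DTD comment: first character is a letter or underscore,
# the rest are alphanumeric, dash, or underscore.
DATASOURCE_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_-]*")
FEED_TYPES = {"full", "incremental", "metadata-and-url"}


def validate_header(datasource, feedtype):
    """Return a list of problems found in the header values (empty if none)."""
    errors = []
    if not DATASOURCE_RE.fullmatch(datasource):
        errors.append(f"invalid datasource name: {datasource!r}")
    if feedtype not in FEED_TYPES:
        errors.append(f"invalid feedtype: {feedtype!r}")
    return errors


# Example usage
print(validate_header("myfeed", "incremental"))
print(validate_header("1feed", "partial"))
```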