This library is released free of charge with source code included under the terms of the GPL. See the LICENSE file for details.
Includes Lucene from the Apache Group, and HTML Parser written by David McNicol.
Questions/Comments: Email James Cooper <pixel@bitmechanic.com>
SSL support: If you want to crawl encrypted URLs (i.e. https://) you need to install JSSE. This is included by default in JDK 1.4, but earlier versions require an add-on. Spindle attempts to initialize the JSSE package, and if it fails, will not try to request https URLs.
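The fallback described above can be sketched roughly as follows. This is only an illustration, not Spindle's actual initialization code: the class HttpsCheck and its methods are hypothetical names, and the sketch assumes that a JVM without an https protocol handler reports this by throwing MalformedURLException from the java.net.URL constructor.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper for illustration only; not part of Spindle.
class HttpsCheck {

    // Returns true if this JVM has a protocol handler for https.
    // The URL constructor throws MalformedURLException for unknown protocols.
    static boolean httpsSupported() {
        try {
            new URL("https://localhost/");
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    // Decides whether a URL should be requested at all:
    // https URLs are skipped when https support is unavailable.
    static boolean shouldRequest(String url, boolean httpsAvailable) {
        boolean isHttps = url.regionMatches(true, 0, "https://", 0, 8);
        return httpsAvailable || !isHttps;
    }

    public static void main(String[] args) {
        boolean https = httpsSupported();
        System.out.println("https supported: " + https);
        System.out.println("request https URL: "
                + shouldRequest("https://www.bitmechanic.com/", https));
    }
}
```

Checking once at startup and passing the result around keeps the per-URL test cheap, which matters when several spider threads are running.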
Command Line Arguments:
-u [start url]
    Starting URL to index. The spider will recursively request all links from this URL that are on the same host/port. This argument may be specified multiple times.
    Example: -u http://www.bitmechanic.com/
    Required: yes

-d [index dir]
    Directory to write the Lucene index files out to.
    Example: -d /www/search/index
    Required: yes

-i [include substring]
    String of URLs to include. If set, only URLs that include this String will be added to the index and traversed. Can be specified multiple times.
    Example: -i /~pixel/ (would index files in my home dir only)
    Required: no

-e [exclude substring]
    String of URLs to exclude. If set, URLs that include this String will be excluded from the index. Can be specified multiple times.
    Example: -e .cgi
    Required: no

-v
    Turns on verbose mode.
    Example: -v
    Required: no

-a
    Indicates that an existing index should be appended to, rather than overwritten.
    Example: -a
    Required: no

-m [mime type]
    Specifies MIME types to index. By default "text/plain" and "text/html" are indexed. May be specified multiple times.
    Example: -m text/plain
    Required: no

-t [num threads]
    Specifies the number of Java threads to use when spidering. The threads run in parallel. Increasing this number will increase the load on the server while indexing, but up to a point will increase the overall throughput. This parameter does not affect the size or quality of the resulting index.
    Example: -t 5
    Required: no (default is 2)

-s [description bytes]
    Specifies the number of characters to store as the description for each page.
    Example: -s 1024
    Required: no (default is 256)
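The -i and -e substring rules can be illustrated with a short sketch. It is not Spindle's own code: the class UrlFilter and the method accept are hypothetical, and two behaviors are assumptions that the text above does not settle, namely that matching any one of several -i values is enough and that an -e match wins over an -i match.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating the -i / -e substring rules; not Spindle's code.
class UrlFilter {

    // Returns true if the URL should be indexed and traversed.
    // Assumption: an exclude match takes precedence over any include match.
    // Assumption: with several include strings, matching any one is enough.
    static boolean accept(String url, List<String> includes, List<String> excludes) {
        for (String ex : excludes) {
            if (url.contains(ex)) {
                return false;
            }
        }
        if (includes.isEmpty()) {
            return true; // no -i given: everything not excluded is accepted
        }
        for (String in : includes) {
            if (url.contains(in)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> includes = Arrays.asList("/~pixel/"); // as in: -i /~pixel/
        List<String> excludes = Arrays.asList(".cgi");     // as in: -e .cgi
        String[] urls = {
            "http://www.bitmechanic.com/~pixel/index.html",
            "http://www.bitmechanic.com/~pixel/form.cgi",
            "http://www.bitmechanic.com/about.html"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + accept(url, includes, excludes));
        }
    }
}
```

Because the match is a plain substring test, a value such as .cgi also matches URLs that merely contain those characters elsewhere in the path or query string.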

Search on command line

I suggest running your search on the command line first just to test that your index is working properly. To do this, run the following command:

    java com.bitmechanic.spindle.Search [index dir] [keyword]

for example:

    java com.bitmechanic.spindle.Search c:/spindle/bitmechanic-index spider

This will return a set of hits to STDOUT. Once you have this test working, you're ready to add the search to your web site.

Using the Search class

You can use the Search class programmatically to run searches against the index from your Java code:

    import com.bitmechanic.spindle.Search;
    import java.util.*;

    public void runSearch() {
        ArrayList hits = Search.search("c:/spindle/bitmechanic-index", "mykeyword");

        // These are the keys in the HashMap objects returned from the
        // search.
        String keys[] = { "url", "title", "score", "desc" };

        for (int i = 0; i < hits.size(); i++) {
            HashMap map = (HashMap)hits.get(i);
            for (int x = 0; x < keys.length; x++) {
                System.out.println(keys[x] + "=" + map.get(keys[x]));
            }
            System.out.println();
        }
    }

Searching with listlib

In most cases, you want to run a search from a JSP. In this case, try using listlib. Put listlib.jar in your server's CLASSPATH, and add listlib.tld to your document root. Here's a sample JSP (which is included in the distribution as search.jsp):

    <%
      String query = request.getParameter("query");
      if (query == null) query = "";
    %>
    <%@ taglib uri="/listlib.tld" prefix="list" %>
    <html>
    <head>
    <title>Search Page</title>
    </head>
    <body>
    <h1>Search</h1>
    <form method="post" action="search.jsp">
    Search term: <input type="text" name="query" value="<%=query%>">
    <input type="submit">
    </form>

    <% if (!query.equals("")) { %>

    <!-- Create a spindle Search object -- it implements ListCreator -->
    <jsp:useBean id="search" class="com.bitmechanic.spindle.Search"/>

    <!-- Specify the directory that stores our Lucene index created by the spindle spider -->
    <jsp:setProperty name="search" property="dir" value="c:/bitmechanic/spindle/bitmechanic"/>

    <!-- Copy the query from the web form into the Search object -->
    <jsp:setProperty name="search" property="query"/>

    <!-- Run the search, and create our ListContainer with the results -->
    <list:init name="customers" listCreator="search" max="20">
    <jsp:useBean id="customers" class="com.bitmechanic.listlib.ListContainer"/>

    <p><hr><p>

    <list:hasResults>
      <!-- Using a scriptlet on the ListContainer to test size -->
      <% if (customers.getSize() == 1) { %>1 page
      <% } else { %> <list:prop property="size"/> pages <% } %>
      matched your query.
      <br>
      Now displaying <list:prop property="start"/>-<list:prop property="end"/>
      <br>
      <list:hasPrev>
        <a href="<list:prevLink/>">Previous</a>
        <list:hasNext> | </list:hasNext>
      </list:hasPrev>
      <list:hasNext>
        <a href="<list:nextLink/>">Next</a>
      </list:hasNext>
      <p>
      <list:iterate>
        <dl>
        <dt><a href="<list:iterateProp property="url"/>"><list:iterateProp property="title"/></a> (score: <list:iterateProp property="score"/>)
        <dd><list:iterateProp property="desc"/><br>
        <list:iterateProp property="url"/></dd>
        </dl>
      </list:iterate>
    </list:hasResults>

    <list:hasNoResults>
      No pages matched your query.
    </list:hasNoResults>

    </list:init>

    <% } %>
    </body>
    </html>
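
The "Now displaying start-end" line and the Previous/Next links in the sample JSP depend on some simple paging arithmetic over the result list. The sketch below shows one way that arithmetic could work. It is not listlib's implementation: the class PageWindow and its fields are hypothetical, and it assumes a 0-based offset for the first hit on the page and 1-based start/end values for display.

```java
// Hypothetical sketch of result paging; not listlib's implementation.
class PageWindow {
    final int start;       // 1-based position of the first hit shown, 0 if there are no hits
    final int end;         // 1-based position of the last hit shown
    final boolean hasPrev; // true if an earlier page exists
    final boolean hasNext; // true if a later page exists

    // total:  number of hits returned by the search
    // offset: 0-based index of the first hit on this page (assumption)
    // max:    page size, e.g. 20 as in max="20" in the sample JSP
    PageWindow(int total, int offset, int max) {
        this.start = (total == 0) ? 0 : offset + 1;
        this.end = Math.min(offset + max, total);
        this.hasPrev = offset > 0;
        this.hasNext = offset + max < total;
    }

    public static void main(String[] args) {
        // Second page of 45 hits, 20 per page.
        PageWindow page = new PageWindow(45, 20, 20);
        System.out.println("Now displaying " + page.start + "-" + page.end);
        System.out.println("Previous link: " + page.hasPrev + ", Next link: " + page.hasNext);
    }
}
```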