bitmechanic: spindle

Current version: 0.90

This library is released free of charge with source code included under the terms of the GPL. See the LICENSE file for details.

Includes Lucene from the Apache Group and the HTML Parser written by David McNicol.

Questions/Comments: Email James Cooper <pixel@bitmechanic.com>


Overview

spindle is a web indexing and search tool built on top of the Lucene toolkit. It includes an HTTP spider that builds the index, and a search class that queries it. Support is also provided for the Bitmechanic listlib JSP TagLib, so that a search can be added to a JSP-based site without writing any Java classes.

Installation

Indexing a Site

Run the bin/spindle or bin\spindle.bat script to index your site. You may need to edit this script so that the CLASSPATH matches your installation.

SSL support: If you want to crawl encrypted URLs (i.e. https://), you need to install JSSE (the Java Secure Socket Extension). JSSE is included by default in JDK 1.4, but earlier versions require an add-on. Spindle attempts to initialize the JSSE package at startup; if that fails, it will not try to request https URLs.
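To confirm that JSSE is visible to your JVM before starting a crawl, a small standalone check can help. The source does not show how spindle itself performs its initialization test, so the following is only a sketch of one possible check. It assumes that finding the standard class javax.net.ssl.SSLSocketFactory on the CLASSPATH is a good enough signal. The class name JsseCheckSketch is hypothetical and is not part of spindle.

```java
public class JsseCheckSketch {

    // Returns true when the core JSSE socket factory class can be loaded.
    // Assumption: class presence is treated as "JSSE is installed"; this
    // does not verify that a working SSL provider is configured.
    static boolean jsseAvailable() {
        try {
            Class.forName("javax.net.ssl.SSLSocketFactory");
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (jsseAvailable()) {
            System.out.println("JSSE found: https URLs can be requested");
        } else {
            System.out.println("JSSE missing: https URLs will be skipped");
        }
    }
}
```

Run it with the same JVM and CLASSPATH that the bin/spindle script uses, so that the result reflects what the spider will see.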

Command Line Arguments:

Each entry below gives the argument, its description, an example, and whether it is required.

-u [start url]
    Starting URL to index. The spider will recursively request all links from this URL that are on the same host/port. This argument may be specified multiple times.
    Example: -u http://www.bitmechanic.com/
    Required: yes

-d [index dir]
    Directory to write the Lucene index files out to.
    Example: -d /www/search/index
    Required: yes

-i [include substring]
    Substring of URLs to include. If set, only URLs that contain this string will be added to the index and traversed. May be specified multiple times.
    Example: -i /~pixel/ (would index only the files in my home dir)
    Required: no

-e [exclude substring]
    Substring of URLs to exclude. If set, URLs that contain this string will be excluded from the index. May be specified multiple times.
    Example: -e .cgi
    Required: no

-v
    Turns on verbose mode.
    Example: -v
    Required: no

-a
    Indicates that an existing index should be appended to, rather than overwritten.
    Example: -a
    Required: no

-m [mime type]
    Specifies MIME types to index. By default "text/plain" and "text/html" are indexed. May be specified multiple times.
    Example: -m text/plain
    Required: no

-t [num threads]
    Specifies the number of Java threads to use when spidering. The threads run in parallel. Increasing this number will increase the load on the server while indexing, but up to a point will increase the overall throughput. This parameter does not affect the size or quality of the resulting index.
    Example: -t 5
    Required: no (default is 2)

-s [description bytes]
    Specifies the number of characters to store as the description for each page.
    Example: -s 1024
    Required: no (default is 256)
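The interaction between -u, -i and -e is easier to see in code. The sketch below is not spindle's implementation. It is a hypothetical illustration of the rules described above, under three labeled assumptions: a URL with no explicit port is treated as using the scheme's default port, a URL passes the include filter when it contains any one of the -i substrings, and an -e match always wins over an -i match. The class and method names are invented for this example.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CrawlFilterSketch {

    // Port of a URI, falling back to the scheme default when none is given.
    static int effectivePort(URI uri) {
        if (uri.getPort() != -1) {
            return uri.getPort();
        }
        return "https".equalsIgnoreCase(uri.getScheme()) ? 443 : 80;
    }

    // Mirrors the -u rule: only follow links on the same host and port.
    static boolean sameHostAndPort(String startUrl, String candidateUrl) {
        URI start = URI.create(startUrl);
        URI candidate = URI.create(candidateUrl);
        return start.getHost() != null
                && start.getHost().equalsIgnoreCase(candidate.getHost())
                && effectivePort(start) == effectivePort(candidate);
    }

    // Mirrors the -i / -e rules (assumed: excludes take precedence,
    // and any single include substring is enough to accept a URL).
    static boolean accept(String url, List<String> includes, List<String> excludes) {
        for (String exclude : excludes) {
            if (url.contains(exclude)) {
                return false;
            }
        }
        if (includes.isEmpty()) {
            return true; // no -i given: everything not excluded is accepted
        }
        for (String include : includes) {
            if (url.contains(include)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String start = "http://www.bitmechanic.com/";
        List<String> includes = Arrays.asList("/~pixel/");
        List<String> excludes = Arrays.asList(".cgi");
        String[] candidates = {
            "http://www.bitmechanic.com/~pixel/index.html",
            "http://www.bitmechanic.com/~pixel/form.cgi",
            "http://www.bitmechanic.com/other.html",
            "http://www.bitmechanic.com:8080/~pixel/index.html"
        };
        for (String url : candidates) {
            boolean crawl = sameHostAndPort(start, url)
                    && accept(url, includes, excludes);
            System.out.println(url + " -> " + (crawl ? "index" : "skip"));
        }
        // With no filters at all, only the host/port rule applies.
        System.out.println(accept(candidates[2],
                Collections.<String>emptyList(), Collections.<String>emptyList()));
    }
}
```

If your site relies on a different reading of these rules (for example, requiring every -i substring to match), run a small verbose crawl with -v to see which URLs the spider actually requests.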

Searching

Use the Search class to search indexes created by the spider.

Searching on the command line

I suggest running your search on the command line first just to test that your index is working properly. To do this, run the following command:

    java com.bitmechanic.spindle.Search [index dir] [keyword]

for example:

    java com.bitmechanic.spindle.Search c:/spindle/bitmechanic-index spider

This will print a set of hits to STDOUT. Once you have this test working, you're ready to add the search to your web site.

Using the Search class

You can use the Search class programmatically to run searches against the index from your Java code:

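    import com.bitmechanic.spindle.Search;
    import java.util.ArrayList;
    import java.util.HashMap;

    // Wrapper class added so the example compiles; the name is arbitrary.
    public class SearchExample {

        // "throws Exception" is a conservative assumption in case
        // Search.search() declares a checked exception.
        public void runSearch() throws Exception {
            ArrayList hits = Search.search("c:/spindle/bitmechanic-index",
                                           "mykeyword");

            // These are the keys in the HashMap objects returned from the
            // search.
            String[] keys = { "url", "title", "score", "desc" };

            // Print every field of every hit, with a blank line between hits.
            for (int i = 0; i < hits.size(); i++) {
                HashMap map = (HashMap) hits.get(i);
                for (int x = 0; x < keys.length; x++) {
                    System.out.println(keys[x] + "=" + map.get(keys[x]));
                }
                System.out.println();
            }
        }
    }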
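Because each hit is a plain map keyed by "url", "title", "score" and "desc", display code can be written and tested without an index on disk. The following sketch shows a hypothetical helper, formatHit, that renders one hit as a single line and falls back to the URL when a page has no title. The helper, the class name and the sample values are assumptions made for illustration; only the four key names come from the example above.

```java
import java.util.HashMap;
import java.util.Map;

public class HitFormatSketch {

    // Renders one hit as "label (score: N) - url". The label is the page
    // title, or the URL when the title is missing or blank.
    static String formatHit(Map<?, ?> hit) {
        Object title = hit.get("title");
        Object url = hit.get("url");
        Object score = hit.get("score");
        boolean noTitle = title == null
                || title.toString().trim().length() == 0;
        String label = noTitle ? String.valueOf(url) : title.toString();
        return label + " (score: " + score + ") - " + url;
    }

    public static void main(String[] args) {
        // Hand-built stand-in for one element of the list returned by
        // Search.search(); the values here are made up.
        Map<String, Object> hit = new HashMap<String, Object>();
        hit.put("url", "http://www.bitmechanic.com/");
        hit.put("title", "bitmechanic");
        hit.put("score", "0.87");
        hit.put("desc", "Example description text");
        System.out.println(formatHit(hit));
    }
}
```

In real use, the map passed to formatHit would be the HashMap obtained from hits.get(i) in the loop above.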

Searching with listlib

In most cases, you will want to run a search from a JSP. For this, try using listlib. Put listlib.jar in your server's CLASSPATH, and add listlib.tld to your document root. Here's a sample JSP (which is included in the distribution as search.jsp):

<%
  String query = request.getParameter("query");
  if (query == null) query = "";
%>
<%@ taglib uri="/listlib.tld" prefix="list" %>
<html>
  <head>
    <title>Search Page</title>
  </head>

  <body>

  <h1>Search</h1>
  <form method="post" action="search.jsp">
    Search term: 
    <input type="text" name="query" value="<%=query%>">

    <input type="submit">
  </form>

<% if (!query.equals("")) { %>

  <!-- Create a spindle Search object -- it implements ListCreator -->
  <jsp:useBean id="search" class="com.bitmechanic.spindle.Search"/>

  <!-- Specify the directory that stores our Lucene index created by the
       spindle spider -->
  <jsp:setProperty name="search" property="dir" 
                   value="c:/bitmechanic/spindle/bitmechanic"/>

  <!-- Copy the query from the web form into the Search object -->
  <jsp:setProperty name="search" property="query"/>

  <!-- Run the search, and create our ListContainer with the results -->
  <list:init name="customers" listCreator="search" max="20">
  <jsp:useBean id="customers" class="com.bitmechanic.listlib.ListContainer"/>

  <p><hr><p>

  <list:hasResults>
     <!-- Using a scriptlet on the ListContainer to test size -->
     <% if (customers.getSize() == 1) { %>1 page <% } else { %>
           <list:prop property="size"/> pages <% } %> matched your query.
     <br>

     Now displaying <list:prop property="start"/>-<list:prop property="end"/>
     <br>

     <list:hasPrev>
        <a href="<list:prevLink/>">Previous</a>

        <list:hasNext> | </list:hasNext>
     </list:hasPrev>

     <list:hasNext>
        <a href="<list:nextLink/>">Next</a>

     </list:hasNext>
     <p>

     <list:iterate>
        <dl>
          <dt><a href="<list:iterateProp property="url"/>"><list:iterateProp property="title"/></a>  (score: <list:iterateProp property="score"/>)
          <dd><list:iterateProp property="desc"/><br>

              <list:iterateProp property="url"/></dd>
        </dl>

     </list:iterate>

  </list:hasResults>

  <list:hasNoResults>

     No pages matched your query.
  </list:hasNoResults>

  </list:init>

<% } %>
     
  </body>
</html>