SearchAssist - a simple, yet practical application of a data structure called Ternary Search Tree to build a portable Search Engine in Java and XML that weighs just 30 KB. The application shows how the dynamic loading feature in Java can be leveraged to make the Search Engine highly customizable. It also demonstrates how some of the rarely used features of Swing, XML/XPath and LiveConnect can be effectively utilized to make Applets very interesting.

Initial requirements: The application must take the Mozilla bookmarks as input. Mozilla bookmarks are saved as HTML files with a standard structure. An index of the bookmarks must be created, if they are to be searchable. A Swing UI to allow the user to enter the search words and view the results.

Input data format: When the initial requirements are analyzed and refined further, the potential applications of this project become obvious. For example, using XML instead of the HTML form of Mozilla bookmarks open up new scenarios in which it can be used. A light weight Help system for another application, or a photo album that can be searched using the person’s name. The possibilities are endless. To be able to support multiple nested levels of bookmarks/topics like the ones Mozilla allows, a suitable XML format must be designed.

The XML will have a Category tag as the root. A Category can have nested Categories or Items as leaves. Both Category and Item have Title tags. Each Item has an additional Link tag that, as the name suggests, contains the link to the actual text/website/photo as the case may be. Each Category/Item is uniquely addressable. Both Category and Item tags have an attribute called uniqueID, which contains a string that does not repeat in the document. Such a format is also ideal for storing nested topics for a Help system.

Indexing: To be able to search the XML file, an index of search words and pointers to the location where detailed information is located, must be created. For a seasoned Java programmer, the first thing that would come to mind would be a HashMap as it can be persisted to a file using Serialization. Although this is the right way of thinking, it must be remembered that the index must be compact and easily searchable. By easily, it means the user must be able to retrieve the information in as few steps, as possible. The HashMap, although one of the most widely used (sometimes, overused) data structures, does not completely satisfy the requirements at hand. It does not differentiate between an Integer, String and Array as long as the key is an Object. If there are 3 keywords words in the XML input file such as, “book, bookmark and booklet”, then a HashMap will treat all three of them as just Objects and hence they will be hashed to completely different locations. Although all three words had a common prefix namely “book”, the HashMap fails to utilize this pattern effectively, as it is not aware of the nature of the keys being used.

A Ternary Search Tree, on the other hand, is designed such that the nodes with the same prefix share the common nodes. Instead of treating the keyword as a single, unbreakable Object, it is split into an array of characters. Keys that share common prefixes reuse the same characters. The remaining part, which does not match, branches off. This way of storing saves a lot of space as the Tree is aware of the keyword’s content. Such a data structure automatically lends itself to providing another very useful feature for the UI – Auto-completion.

Auto-completion: Since the index is aware of the keywords it stores, the program navigates the Tree; such that the string of characters that have been visited match the text that the user has typed. If the user has typed “boo”, then the tree has seen – “b, o and o”. The Tree guesses that the user is typing one of – “book or bookmark or booklet”, if there are only these 3 words starting with “boo” in it. The Tree will supply the closest match that appears first in alphabetical order. If there is no match, then there will not be any auto-completion, so the user can stop without having to type the entire search word to find out.

Displaying search results: If the text typed by the user matches the keyword in the index, then the search results have to be displayed. Because the information can be as varied as simple text/websites/photos, the display of search results must be customizable. To achieve this, the actual implementation must be kept separate from the interface – simple Java programming idiom.

UI components and layout: Now that most of the important requirements have been addressed, the UI layout must be designed. The layout is very minimal. The most crucial Swing components are - a custom JTextField which does auto-completion, a JTextArea to display the List of nearest matches and a JPanel, to which a custom JComponent is added to display the search results.

Generating the Ternary Search Tree index file: The Mozilla bookmarks file has to be converted to XML format. A careful observation of the HTML bookmarks file reveals a simple structure which can be easily transformed to the XML structure using SAX. The HTML file contains a lot of attributes that are not required for the transformation and so can be safely ignored.
@See MozillaBookmark2XMLConverter.java

The resulting XML file that contains the Category, Title, Item and Link tags are fed to a small Java program that just echoes the XML tags and text content into another file. It is a simple SAX Handler. In addition to that, it generates a uniqueID for each Category and Item and adds it as an attribute and value pair.
@See XMLIDGenerator.java

For the index to be generated, a list of keywords and their corresponding uniqueIDs is required. Since the uniqueIDs have been generated in the previous step, the next step is to generate the keywords that will point to the uniqueIDs. Again, a simple SAX Handler retrieves the text of the Title tags and tokenizes them. This list of tokens from each Title tag will contain non-keywords or stop words such as articles, pronouns and prepositions that can be weeded out. The stop-words are stored in a file, and will be referred to by the SAX Handler. The filtered list of keywords is written back as a comma-separated list under a Keywords tag of the parent Category or Title tag.
@See KeywordTagAdder.java

With the XML file containing both the Keywords and the uniqueIDs, the Ternary Search Tree can be generated. As a keyword can occur in more than one Category or Item, the Ternary Search Tree stores a List of uniqueIDs instead of just one. This List can provide another useful feature for the UI. The size() method on the List can be used to display the number of hits for the corresponding keyword, denoted in square brackets in the JTextArea. The Ternary Search Tree thus populated is Serialized into a file and stored on the disk.
@See TSTGenerator.java

The steps preceding the actual index generation are required only if the Mozilla bookmarks file is used as the source of information. The index can be generated from any source. It is just a mapping of keywords to Lists of uniqueIDs.

The Mozilla bookmarks file can contain some characters which cause the XML parser to fail. Because HTML allows unclosed tags such as <P>, <BR>, <HR> and <DT>, they have to be replaced by their XML compliant forms - <P />, <BR />, <HR /> and <DT />. And as URLs contain ampersands (&) as part of HTTP Get requests, they too cause the parser to fail. These stand-alone ampersands have to be converted to their XML entity equivalent, namely &#038;. A small Java class finds and replaces such trouble making strings in the XML file before it is fed to the XML parser.
@See SpecialCharactersFilter.java

The steps described above constitute a simple information transformation pipeline.

Search Engine Applet implementation:

Under the hood: At the heart of the Applet is the Ternary Search Tree. The index is De-Serialized from the file and loaded into memory in the Applet’s init() method. To navigate the Tree, i.e. to query the index, the text content of the JTextField is used. It is wired to listen to keyPressed and keyReleased events using the addKeyListener(KeyListener) method. For every character pressed, the current position in the index moves deeper into the tree, provided the prefix exists. To provide the auto-completion functionality, the JTextField is customized by inheritance. The AutoCompleteTextField has an anonymous inner class that extends the KeyAdapter to listen to keyPressed and keyReleased events. The AutoCompleteTextField uses an AutoCompleteDictionary to do a lookup(textFieldContents) of the typed text. The AutoCompleteDictionary is an interface and so different implementations can be plugged in. The string that lookup(textFieldContents) returns is the auto-completion text suggested by the AutoCompleteDictionary. If the lookup(…) returns an empty string, then the textField is unchanged, indicating that the search word does not exist.
@See AutoCompleteTextField.java

Because the Ternary Search Tree is the dictionary here, it can be used directly to provide the auto-completion text. The TSTDictionary implements the AutoCompleteDictionary’s lookup(textFieldContents) method by invoking the matchPrefixString(textFieldContents) on the De-Serialized Ternary Search Tree it contains internally. If the Ternary Search Tree has a keyword whose prefix matches or completely matches the textFieldContents, then the keyword is returned. Otherwise, if there are no matches it returns null. Apart from returning the closest match, the TSTDictionary populates the JTextArea with the remaining near matches. The size() of the Lists the keywords contains is displayed alongside the keywords in square brackets indicating the number of hits.
@See TSTDictionary.java

More about the custom JTextField: When the user presses the Enter key in the AutoCompleteTextField, the ActionListener’s actionPerformed(…) method will be invoked. This method retrieves the List of uniqueIDs from the Ternary Search Tree using the textField contents. At any point, the textField will either contain a keyword that exists in the Tree, which will be auto-completed. The List of uniqueIDs thus obtained is rendered on the right-hand side.
@See TSTApplet.java

Customizable search result renderer: From the beginning, one of the main requirements has been that the rendering of uniqueIDs or search results must be customizable. To facilitate that, the TSTApplet uses an interface on which it invokes the renderNodes(List ids) method. This IDRenderer interface has 2 more methods that serve as lifecycle methods. The init(Attributes params) to initialize the component with the Attributes instance and the getRenderingComponent() that returns the actual Swing component that renders the search results. This component is added to the JPanel on the right-hand side. To TSTApplet is made aware of the class that implements IDRenderer by specifying its fully qualified class name as the IDRendererImpl Applet parameter. Using this, the Applet, in the init() method, dynamically loads the IDRenderer implementation class.
@See IDRenderer.java
@See TST.html
@See TSTApplet.java

HTML renderer: The index in this case is a map of keywords present in the Title tags of Categories and Items to their uniqueIDs. So, the IDRenderer implementation must display the XML fragment marked by the uniqueID. In order to do that, the XML file containing only the uniqueIDs, without the keywords has to be available to the Applet. Although the XML file with the uniqueIDs and keywords will serve the purpose, the additional keyword tags increases the file size and the memory required to load it. The IDRenderer which does this is XPathNodeAccessHTMLRenderer. On instantiation, the XML file is loaded as a DOM Document. To allow the XML file to be switched or replaced with an updated file, the name and location is not hard-coded. The XMLWithIdsFile Applet parameter points to the actual location. Even the index file’s location is not hard-coded. It is fetched using the SerializedTSTFile Applet parameter. These 2 input files are compressed and placed in a separate zip file called DataFiles.zip that should be present in the Applet’s classpath.
@See XPathNodeAccessHTMLRenderer
@See TST.html

XPath to the rescue: The XPathNodeAccessHTMLRenderer uses a JEditorPane to render the XML fragment as HTML with links. So, getRenderingComponent() returns the JEditorPane. When the user presses the Enter key in the textField to request the search results, the AutoCompleteTextField’s ActionListener fetches the List of uniqueIDs for the search word, if it exists and sends the List to the IDRenderer’s renderNodes(List ids). For each uniqueID in the List of ids, the renderNodes(...) method, retrieves the XML fragment pointed to by the uniqueID using XPath. Since the elements can only be one of Category or Item tags, the XPath query is very specific to reduce search time. With the Element thus obtained simple HTML Lists are generated by iterating through it recursively. The Title tags become the List items in bold and the Links become HTML Anchor tags displayed as nested HTML Lists.
@See XPathNodeAccessHTMLRenderer.java

Java-JavaScript communication: When a Link is clicked in the JEditorPane, a new browser window opens to display the URL. This requires Netscape LiveConnect to do the Java-JavaScript communication.
@See LinkClickListener.java

The JAR file must be signed because the XML parser reads some System properties which are not available to unsigned Applets.

Resources:

Requirements:

Limitations: Memory consuming as the whole index is loaded. The XML file also takes up memory. More mature index systems like JavaHelp should be used for larger indices.