Document Search Tutorial

Have a batch of files you need to search through? Want to search the documents' existing metadata plus custom-defined extracted keywords? This tutorial gets a search application for local documents up and running within a few minutes.

Requirements

If you want to try Fuse Document-Search on your local machine, you may want to run through the Install Prerequisites for Fuse tutorial to set up all the prerequisites.

You also need git installed and, of course, a set of documents that you want to search through.

Setup

After you have installed all prerequisites, clone the git repository that contains a prepackaged configuration to give you a quick and easy start with Fuse Document-Search.

git clone https://github.com/qsensei/fuse-docsearch.git
cd fuse-docsearch

Inside this repository, there are a couple of files:

  • a docker-compose.yml file which defines all needed Docker containers and their configuration
  • a pkg directory with a set of schema and resource files tailored for the document-search application
  • an env.sh file which contains the definition of environment variables which configure the basic functionality
  • and an update-pkg.sh helper script

First, edit the env.sh file. It contains two settings, DOCSEARCH_DIR and DOCSEARCH_PORT. Adjust DOCSEARCH_DIR to point at the directory containing the documents that you want to index with Fuse. You may also adjust DOCSEARCH_PORT to change the port on which Fuse listens.

Note

If you are using Docker Machine with VirtualBox to run Fuse locally on Windows or OS X, make sure to place your documents in a directory below C:\Users on Windows or /Users on OS X. This is due to a restriction of Docker Machine. Also use slashes instead of backslashes for Windows paths, e.g.

DOCSEARCH_DIR=/c/Users/my_user_name/My_Documents

When you are done editing env.sh, you can source it. This puts the variables defined in that file into effect for the current shell session. You have to do that whenever you open a new terminal window to interact with Fuse Document-Search (for example after a restart).

source ./env.sh

You can now use Docker Compose to start Fuse Document-Search:

docker-compose up -d

Hint

If this leaves you with an error message like this

ERROR: Validation failed in file './docker-compose.yml', reason(s):
router.ports is invalid: Invalid port ":80", should be [[remote_ip:]remote_port[-remote_port]:]port[/protocol]

you most likely forgot to source the env.sh file with source ./env.sh.
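To catch this before Docker Compose does, the two variables can be checked up front. The following is a minimal sketch; check_docsearch_env is a hypothetical helper that is not part of the fuse-docsearch repository, and it assumes only that env.sh defines DOCSEARCH_DIR and DOCSEARCH_PORT as described above.

```shell
# Hypothetical pre-flight check (not shipped with fuse-docsearch).
# Returns 0 when both settings from env.sh look usable, 1 otherwise.
check_docsearch_env() {
    # Both variables must be set and non-empty in the current shell.
    if [ -z "${DOCSEARCH_DIR:-}" ] || [ -z "${DOCSEARCH_PORT:-}" ]; then
        echo "DOCSEARCH_DIR or DOCSEARCH_PORT is unset; run: source ./env.sh" >&2
        return 1
    fi
    # The document directory must exist.
    if [ ! -d "$DOCSEARCH_DIR" ]; then
        echo "DOCSEARCH_DIR is not a directory: $DOCSEARCH_DIR" >&2
        return 1
    fi
    # The port must consist of digits only.
    case "$DOCSEARCH_PORT" in
        *[!0-9]*)
            echo "DOCSEARCH_PORT is not a number: $DOCSEARCH_PORT" >&2
            return 1
            ;;
    esac
    return 0
}
```

It could be used as check_docsearch_env && docker-compose up -d, so that the containers are only started when both settings are in place.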

The docker-compose up -d command starts all containers defined in the docker-compose.yml file. You can now access the Fuse search interface at the IP of your Docker machine and the DOCSEARCH_PORT defined above: http://<docker-machine-IP>:DOCSEARCH_PORT (for example http://192.168.99.100:8000). After startup, Fuse is already configured with the configuration package in the pkg directory, and the documents in the directory given by DOCSEARCH_DIR are already being processed. You need a little patience now, but after a couple of minutes you can see the first results.
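The address can also be assembled and polled from the shell. This is only a sketch: docsearch_url and wait_for_docsearch are hypothetical helpers, the machine name "default" in the usage comment is an assumption, and so is the expectation that the search interface answers a plain HTTP GET on its root path with a success status.

```shell
# Hypothetical helpers (not shipped with fuse-docsearch).

# Print the interface URL for a given IP; the port defaults to $DOCSEARCH_PORT.
docsearch_url() {
    printf 'http://%s:%s\n' "$1" "${2:-$DOCSEARCH_PORT}"
}

# Poll a URL every 2 seconds until it answers successfully
# or the given number of attempts (default 30) is used up.
wait_for_docsearch() {
    url=$1
    attempts=${2:-30}
    while [ "$attempts" -gt 0 ]; do
        if curl --silent --fail --output /dev/null "$url"; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 2
    done
    return 1
}

# Usage with Docker Machine (assumes a machine named "default"):
#   wait_for_docsearch "$(docsearch_url "$(docker-machine ip default)")"
```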

Hint

You can use the usual set of docker-compose commands. For example, to restart Fuse Document-Search, run docker-compose restart. To inspect the logs, run docker-compose logs. If, for whatever reason, you want to start all over again, stop the containers with docker-compose stop, remove them with docker-compose rm, and start up a new pristine instance of Fuse Document-Search with docker-compose up again. For more information, have a look at the Docker Compose documentation.

Keyword Extractions

Documents without metadata leave us with little to query on. Fortunately, we provide an extraction service centered around Fuse core that extracts keywords from raw text using a list of words or regular expressions.

For example, let’s say a Word document contains the following text (taken from a NAFTA document):

For documents like this, it would be useful to search and analyze by a list of companies, so that besides the given text we also have a set of keywords (like “GM”, “Ford”, “Chrysler” and “Volvo” in our example) that can be used to index, query, and analyze the companies mentioned in each document.

This can be achieved with keyword extraction, which is defined inside the Content Schema. Out of the box, the schema contains keyword extraction for the names of the Fortune 500 companies.

Here, an attribute with the name “ext:companies” is defined, which is populated with all found keywords taken from the Fortune500.txt file.

That file is located in the pkg directory and is fairly simple: just a newline-delimited list of keywords. You may edit that file and restart Fuse to see changes. (Note that it takes a while for changes to become visible, since all documents need to be re-processed with the new keyword definitions.)
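To get a feel for what a keyword list yields, the idea can be approximated on the command line. This is an illustration of list-based keyword matching only, not Fuse's extraction service; extract_keywords is a hypothetical helper, and it assumes the document has already been converted to plain text.

```shell
# Hypothetical helper illustrating list-based keyword matching.
# $1: newline-delimited keyword file (same format as Fortune500.txt)
# $2: plain-text document to scan
# Prints each keyword found in the document once, sorted.
extract_keywords() {
    # -F: treat keywords as fixed strings, -w: match whole words only,
    # -o: print only the matched keyword, -f: read keywords from a file
    grep -o -w -F -f "$1" "$2" | sort -u
}
```

Running extract_keywords pkg/Fortune500.txt some-document.txt would list the company names that occur in some-document.txt, which corresponds roughly to the values one would expect in the “ext:companies” attribute for that document.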

Add Another Keyword Extraction

If you want to add another facet that indexes another attribute with extracted keywords, follow these steps:

Edit pkg/contentschema.json and add another attribute to the “document” type:

"attributes": [
              ...
    {"name": "ext:animals", "extractions": {"keywords": ["cat", "mouse", "dog"]}}
]

Note

The keywords can also be supplied in other ways. See Configuring Extractions for how to:

  • read keywords from a file within the pkg directory
  • supply keywords from within the content schema
  • supply a list of regexes and string formats
  • configure case insensitivity

Then add another facet to pkg/indexschema.json:

"indexes": [
             ...
     {"name": "animal"}
]

and let this facet be populated with the newly defined attribute:

"sources": [
             ...
     {"index": "animal", "fuse:type": "document", "attribute": "ext:animals"}
]
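A missing bracket or stray comma is easy to introduce when editing the schema files by hand, so it can help to confirm that both files still parse as JSON before going further. The sketch below assumes python3 is available on your machine; validate_schemas is a hypothetical helper and checks JSON syntax only, not whether Fuse accepts the schema.

```shell
# Hypothetical helper: check that each given file is syntactically valid JSON.
# Uses Python's built-in json.tool module; parse errors are printed to stderr.
validate_schemas() {
    for schema in "$@"; do
        if ! python3 -m json.tool "$schema" > /dev/null; then
            echo "invalid JSON: $schema" >&2
            return 1
        fi
    done
    return 0
}
```

For the files edited above, that would be validate_schemas pkg/contentschema.json pkg/indexschema.json.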

When you are done making changes to the schema, you need to update Fuse with the changed schema. The helper script update-pkg.sh in the fuse-docsearch repository does this. Just call it:

./update-pkg.sh

Once the new schema has been successfully installed into Fuse, restart all containers with

docker-compose restart

After a while, your changes will become visible in the search interface.