Tuesday, May 1, 2007

Re: [BLUG] open source search engines?

I favor Namazu much more than ht://Dig.

http://www.namazu.org/

It can search any type of document provided you create a (Perl) filter
for it. It also has support for boolean expressions.

It has support(+) for the following (from /usr/share/namazu/filter):

apachecache.pl  gzip.pl       man.pl      postscript.pl  taro56.pl
bzip2.pl        hdml.pl       mhonarc.pl  powerpoint.pl  taro7_10.pl
compress.pl     hnf.pl        mp3.pl      rfc.pl         tex.pl
deb.pl          html.pl       msword.pl   rpm.pl
dvi.pl          macbinary.pl  ooo.pl      rtf.pl
excel.pl        mailnews.pl   pdf.pl      taro.pl

(+) Some of the filters require third-party tools to extract the data
into a more easily parsable form.

Creating filters is a straightforward process. I've created several.
A custom filter is useful when, for example:

* You have internal documentation in HTML with a strict style
guideline. (For example, a title page with the author and date modified,
which is more reliable than the HEAD tags.) You can take advantage of
the known style to pull meta-data from the document.

* You use a mail-to-HTML gateway like Pipermail and you want the correct
meta-data displayed.
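
To give a feel for the first case, here is a rough sketch of the extraction step. It is written in Python purely for illustration (real Namazu filters are Perl, and this does not use Namazu's filter API). The class names "doc-author" and "doc-modified" are hypothetical stand-ins for whatever your style guideline mandates.

```python
from html.parser import HTMLParser


class TitlePageParser(HTMLParser):
    """Collect text from elements whose class marks title-page meta-data."""

    # Hypothetical class names; substitute whatever the style guideline uses.
    FIELDS = {"doc-author": "author", "doc-modified": "date"}

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._current = None  # field being captured right now, if any

    def handle_starttag(self, tag, attrs):
        field = self.FIELDS.get(dict(attrs).get("class"))
        # The first occurrence of each field wins; later ones are ignored.
        if field and field not in self.meta:
            self._current = field
            self.meta[field] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.meta[self._current] += data

    def handle_endtag(self, tag):
        # Assumes the meta-data elements contain no nested tags.
        self._current = None


def extract_meta(html_text):
    """Return a dict of the title-page fields found in html_text."""
    parser = TitlePageParser()
    parser.feed(html_text)
    parser.close()
    return {key: value.strip() for key, value in parser.meta.items()}


sample = """<html><head><title>Untitled</title></head><body>
<p class="doc-author">Jane Doe</p>
<p class="doc-modified">2007-04-30</p>
<p>Body text.</p></body></html>"""

print(extract_meta(sample))
```

The point is only that a known, enforced layout lets you ignore the unreliable HEAD section and read the fields straight from the body.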

What is neat about Namazu is that in addition to a CGI-based interface,
it also provides a command-line interface. I get a kick out of searching
my documents from the command line.

Not all PDF documents will be easily parsable. Some PDF documents are
actually stored as images internally. (For instance, faxes which arrive
as PDF documents.) In such cases you would want to run OCR software on
the PDF. (You would likely need to convert it to another format first.)
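
One way to spot such documents, sketched below in Python, is to look at how much text the extraction step actually produced for each page. This assumes the extracted text separates pages with a form feed character (as pdftotext-style tools do). The function name and the 20-character threshold are my own arbitrary choices, not anything Namazu provides.

```python
def pages_needing_ocr(extracted_text, min_chars=20):
    """Return 1-based numbers of pages whose extracted text is nearly empty.

    Assumes pages are separated by form feeds; min_chars is an arbitrary
    cut-off for "this page is probably just a scanned image".
    """
    pages = extracted_text.split("\f")
    # A trailing form feed leaves an empty final chunk that is not a page.
    if pages and pages[-1] == "":
        pages.pop()
    return [
        number
        for number, page in enumerate(pages, start=1)
        # Count only non-whitespace characters on the page.
        if len("".join(page.split())) < min_chars
    ]


text = (
    "This page has a real text layer.\f"
    "\n\f"
    "So does this one, with enough words.\f"
)
print(pages_needing_ocr(text))
```

Any page this flags is a candidate for OCR before indexing. Pages that legitimately hold only a few words will be flagged as well, so treat the result as a hint rather than a verdict.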

Cheers,
Steven Black

On Tue, 2007-05-01 at 14:45 -0400, Joe Auty wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hello,
>
> Does anybody have any experience working with open source search
> engines? I've looked at ht://dig, but had some problems getting it to
> do what I wanted. Has anybody used any other?
>
> Requirements:
>
> - - must be able to search .doc, .pdf, and a wide variety of other formats
> - - must be able to pass on a username and password to a site using
> Apache basic authentication
> - - must work over SSL sites
> - - must return useful results :)
>
>
> - --
> Joe Auty
> NetMusician: web publishing software for musicians
> http://www.netmusician.org
> joe@netmusician.org
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.6 (Darwin)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFGN4q9CgdfeCwsL5ERArvUAJkBklb7yKsMZoQWz6dDzuI4ONl0qACcD8al
> 9nveG4STsH9pbYDlB1YPHgU=
> =5FAT
> -----END PGP SIGNATURE-----
>
> _______________________________________________
> BLUG mailing list
> BLUG@linuxfan.com
> http://mailman.cs.indiana.edu/mailman/listinfo/blug

