RFC for internal site search

Some ideas for internal site search.

  • each module should return a result object. we need one static function per module that knows how to search and where (which table, which columns) to search.
  • return a universal result set with the keys: title, shortdescription, link
  • maybe filter by rights (if a guest isn't allowed to view an article, it should not show up in their search results). this keeps guests from hitting too many "you are not allowed... please log in" screens.
  • caching
  • indexing
  • indexing of uploaded documents
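
The first three points could look roughly like this. It is only a sketch in Python, not SGL code: `SearchResult`, `FaqSearch`, `search_all`, the `min_role` field, the role ranking and the in-memory rows are all made-up names standing in for a real module and its table.

```python
from dataclasses import dataclass

# Hypothetical role ranking: a higher number means more rights.
ROLE_RANK = {"guest": 0, "member": 1, "admin": 2}


@dataclass
class SearchResult:
    """Universal result with the three keys proposed above."""
    title: str
    shortdescription: str
    link: str


class FaqSearch:
    """Stand-in for one module; ROWS plays the role of its table."""
    MODULE = "faq"
    ROWS = [
        {"id": 1, "question": "How do I reset my password?",
         "answer": "Use the reset form on the login page.",
         "min_role": "guest"},
        {"id": 2, "question": "How do I reset the cache?",
         "answer": "Admins can clear it from the maintenance screen.",
         "min_role": "admin"},
    ]

    @staticmethod
    def search(term, role):
        """Search this module's own columns, skipping rows the role may not view."""
        term = term.lower()
        results = []
        for row in FaqSearch.ROWS:
            if ROLE_RANK[role] < ROLE_RANK[row["min_role"]]:
                continue  # the user would only get a "not allowed" screen
            if term in row["question"].lower() or term in row["answer"].lower():
                results.append(SearchResult(
                    title=row["question"],
                    shortdescription=row["answer"][:80],
                    link=f"/faq/{row['id']}",
                ))
        return results


def search_all(modules, term, role):
    """Call every module's static search and collect the results per module name."""
    return {module.MODULE: module.search(term, role) for module in modules}
```

Calling `search_all([FaqSearch], "reset", "guest")` returns only the first row, while the same call with `"admin"` returns both. Filtering inside each module keeps the rights check next to the data it protects, at the cost of repeating it in every module.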

show the result set grouped by module, like:

  1 result in "faq" -> link to result
  13 results in "articles" -> link to article results
  ...
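
A minimal sketch of how such a summary could be built, assuming the results arrive as a mapping from module name to a list of hits. The function name `summarize` and the `/search/<module>` link pattern are invented for illustration.

```python
def summarize(results_by_module):
    """Build one summary line per module that has at least one hit."""
    lines = []
    for module, results in results_by_module.items():
        count = len(results)
        if count == 0:
            continue  # modules without hits are left out of the summary
        noun = "result" if count == 1 else "results"
        lines.append(f'{count} {noun} in "{module}" -> /search/{module}')
    return lines
```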

from phpdig docs:

PhpDig indexes HTML and text files by itself. PhpDig could index PDF, MS-Word, MS-Excel, and MS-PowerPoint files if you install external binaries on the server for this purpose. PhpDig is configured to use catdoc, xls2csv, pstotext or pdftotext, and ppt2text programs.

The author of PhpDig does not offer support for the binary programs. Contact the authors of those programs if you have trouble with compiling and/or installing them.

Of course, you can use other binary programs to extract text from PDF, MS-Word, MS-Excel, and MS-PowerPoint files.

To demonstrate the external binaries feature, you can search Hamlet (tragedy, Shakespeare, from MS-Word format) or L'Avare (comedy, Molière, from PDF format).

don't know if those programs are available for Windows, too... maybe use PEAR classes for indexing those docs?
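
The external-binary approach could be sketched like this in Python. It is not PhpDig's implementation: `EXTRACTORS` and `extract_text` are hypothetical names, the command lines assume the tools print the extracted text to standard output when called this way, and ppt2text is left out because its invocation is not given above. The check with `shutil.which` covers the case where a binary is missing on a given platform.

```python
import shutil
import subprocess
from pathlib import Path

# Extension -> command template; "{path}" is replaced by the document path.
# Assumption: each tool writes the extracted text to standard output.
EXTRACTORS = {
    ".pdf": ["pdftotext", "{path}", "-"],
    ".doc": ["catdoc", "{path}"],
    ".xls": ["xls2csv", "{path}"],
}

# Formats that are read directly, without an external program.
PLAIN_TEXT = (".txt", ".html", ".htm")


def extract_text(path, extractors=EXTRACTORS):
    """Return the text of a document, or None if it cannot be extracted here."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix in PLAIN_TEXT:
        return path.read_text(encoding="utf-8", errors="replace")
    template = extractors.get(suffix)
    if template is None or shutil.which(template[0]) is None:
        return None  # unknown file type, or the binary is not installed
    command = [str(path) if part == "{path}" else part for part in template]
    done = subprocess.run(command, capture_output=True, check=False)
    if done.returncode != 0:
        return None  # the tool failed; skip this document
    return done.stdout.decode("utf-8", errors="replace")
```

Returning `None` instead of raising lets an indexer skip documents it cannot read and carry on with the rest.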

good links for search related things:

external site search:

sbaturzio: IMHO an interface from SGL to htdig is a nice solution