Hi Robert,
Yes, there are multiple queries. In my scenario, "precision first"
usually implies that the number of returned results is limited. Users may
not have the patience either to wait for responses or to read through
pages of results. That's why I prefer a sequential process rather than a
parallel one: I can fetch a small and hopefully precise result set first,
then query for more if that set turns out to be too small, i.e. the
recall is not high enough.
For example, a query in Chinese applies a word-based analyzer first,
with a limit, say 1000:
    static int m_limit = 1000;

    Query query = _a_word_based_Chinese_query_here_;
    ArrayList<MyResult> resultList = new ArrayList<MyResult>();

    // Word-based pass: fetch up to m_limit hits from the standard searcher.
    TopDocs topDocs = m_standardSearcher.search(query, (Filter) null, m_limit);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        Document doc = m_standardSearcher.doc(scoreDoc.doc);
        float score = scoreDoc.score;
        resultList.add(new MyResult(doc, score));
    }
If the size of resultList does not reach 1000, another character-based
query is fired to fetch more results, up to (1000 - current size).
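The fill-up step above can be sketched in plain Java; the names here
(Hit, secondarySearch) are stand-ins I made up for illustration, not part
of the original code, and only the sequential-fallback bookkeeping is
shown:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntFunction;

// Sketch of the sequential "precision first, recall next" fallback.
// The word-based and character-based searches are abstracted away;
// only the fill-to-limit logic with duplicate suppression is shown.
public class SequentialFallback {
    static final int LIMIT = 1000;

    // docId identifies a hit; score is whatever the searcher returned.
    record Hit(int docId, float score) {}

    // primary: hits from the word-based (precise) query.
    // secondarySearch: runs the character-based query with the given limit.
    static List<Hit> search(List<Hit> primary, IntFunction<List<Hit>> secondarySearch) {
        List<Hit> results = new ArrayList<>(primary);
        if (results.size() < LIMIT) {
            int remaining = LIMIT - results.size();
            // Remember which docs the precise pass already found.
            Set<Integer> seen = new HashSet<>();
            for (Hit h : results) seen.add(h.docId());
            // Append character-based hits until the limit is reached,
            // skipping documents the first pass already returned.
            for (Hit h : secondarySearch.apply(remaining)) {
                if (results.size() >= LIMIT) break;
                if (seen.add(h.docId())) results.add(h);
            }
        }
        return results;
    }
}
```

The second query only ever runs when the first one under-fills the limit,
which is what keeps the common (precise) case to a single search.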
It's a very simple heuristic, and it proved fast enough on a single
P4 2GHz machine with 2GB RAM serving a 3GB Lucene index: results
returned within 1 sec on average.
The problem with all multiple, parallel, or distributed Lucene queries
is that score merging may not be reasonable, especially when the indexes
use different tokenization strategies.
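A toy illustration of that point, with made-up scores (not measured from
any real index): when two indexes score on different scales, a global
sort by raw score lets one index crowd out the other, while sequential
appending keeps the precise hits in front:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Made-up scores only: why sorting a combined list by raw score can be
// misleading when the two indexes tokenize (and hence score) differently.
public class ScoreMergeDemo {
    record Hit(String source, int docId, float score) {}

    // Naive merge: pool everything and sort by raw score, descending.
    public static List<Hit> mergeByScore(List<Hit> a, List<Hit> b) {
        List<Hit> all = new ArrayList<>(a);
        all.addAll(b);
        all.sort(Comparator.comparingDouble((Hit h) -> h.score()).reversed());
        return all;
    }

    // Sequential merge: precise hits stay ahead regardless of score scale.
    public static List<Hit> appendSequentially(List<Hit> a, List<Hit> b) {
        List<Hit> all = new ArrayList<>(a);
        all.addAll(b);
        return all;
    }
}
```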
You may also be interested in
http://issues.apache.org/jira/browse/NUTCH-92 ,
http://hellonline.com/blog/?p=55 , and
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12709.html
Thank you!
Cheers,
/Mike/
Robert Stojnic wrote:
Hm, wouldn't that require running multiple queries for a single user
query? If I'm understanding it correctly it refines search by trying
different queries, and merges the results?
For the wikipedia system, speed is of utmost importance, since it's
a high traffic site, and has very few resources (compared to other
sites of the same traffic).
r.
On 5/23/07, Tian-Jian Barabbas Jiang@Gmail <barabbas@gmail.com> wrote:
Although I bet you have already done it, here's my 2 cents:
I usually apply a concept in my IR system:
precision first, recall next.
For example, my system may do an exact match first, get the results
from
searcher.doc(topDocs.scoreDocs[i].doc)
and save them externally.
That allows me to merge some more partially matched results later.
Apparently this can be done by something like parallel
queries, but I like to merge them sequentially by myself.