Hi Robert,
Yes, there are multiple queries. In my scenario, "precision first"
usually implies that the number of returned results is limited. Users may
not have the patience either to wait for responses or to read through
pages of results. That's why I prefer a sequential process rather than a
parallel one: I can fetch a small and hopefully precise result set first,
then query for more if that set turns out to be too small, i.e. the
recall is not high enough.
For example, a query in Chinese applies a word-based analyzer first,
with a limit, say 1000:
    static int m_limit = 1000;

    Query query = _a_word_based_Chinese_query_here_;
    ArrayList<MyResult> resultList = new ArrayList<MyResult>();

    // Word-based pass: fetch up to m_limit hits from the standard searcher.
    TopDocs topDocs = m_standardSearcher.search(query, (Filter) null, m_limit);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        Document doc = m_standardSearcher.doc(scoreDoc.doc);
        float score = scoreDoc.score;
        resultList.add(new MyResult(doc, score));
    }
If the size of resultList does not reach 1000, another character-based
query is fired to fetch more results, up to (1000 - current size).
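The fill-up step above can be sketched in plain Java; the names here
(Hit, secondarySearch) are stand-ins I made up for illustration, not part
of the original code, and only the sequential-fallback bookkeeping is
shown:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntFunction;

// Sketch of the sequential "precision first, recall next" fallback.
// The word-based and character-based searches are abstracted away;
// only the fill-to-limit logic with duplicate suppression is shown.
public class SequentialFallback {
    static final int LIMIT = 1000;

    // docId identifies a hit; score is whatever the searcher returned.
    record Hit(int docId, float score) {}

    // primary: hits from the word-based (precise) query.
    // secondarySearch: runs the character-based query with the given limit.
    static List<Hit> search(List<Hit> primary, IntFunction<List<Hit>> secondarySearch) {
        List<Hit> results = new ArrayList<>(primary);
        if (results.size() < LIMIT) {
            int remaining = LIMIT - results.size();
            // Remember which docs the precise pass already found.
            Set<Integer> seen = new HashSet<>();
            for (Hit h : results) seen.add(h.docId());
            // Append character-based hits until the limit is reached,
            // skipping documents the first pass already returned.
            for (Hit h : secondarySearch.apply(remaining)) {
                if (results.size() >= LIMIT) break;
                if (seen.add(h.docId())) results.add(h);
            }
        }
        return results;
    }
}
```

The second query only ever runs when the first one under-fills the limit,
which is what keeps the common (precise) case to a single search.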
It's a very simple heuristic, and it proved fast enough on a single
P4 2GHz machine with 2GB RAM serving a 3GB Lucene index: results
returned within 1 sec on average.
The problem with all multiple, parallel, or distributed Lucene queries
is that score merging may not be reasonable, especially when the indexes
use different tokenization strategies.
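A toy illustration of that point, with made-up scores (not measured from
any real index): when two indexes score on different scales, a global
sort by raw score lets one index crowd out the other, while sequential
appending keeps the precise hits in front:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Made-up scores only: why sorting a combined list by raw score can be
// misleading when the two indexes tokenize (and hence score) differently.
public class ScoreMergeDemo {
    record Hit(String source, int docId, float score) {}

    // Naive merge: pool everything and sort by raw score, descending.
    public static List<Hit> mergeByScore(List<Hit> a, List<Hit> b) {
        List<Hit> all = new ArrayList<>(a);
        all.addAll(b);
        all.sort(Comparator.comparingDouble((Hit h) -> h.score()).reversed());
        return all;
    }

    // Sequential merge: precise hits stay ahead regardless of score scale.
    public static List<Hit> appendSequentially(List<Hit> a, List<Hit> b) {
        List<Hit> all = new ArrayList<>(a);
        all.addAll(b);
        return all;
    }
}
```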
You may also be interested in
http://issues.apache.org/jira/browse/NUTCH-92 ,
http://hellonline.com/blog/?p=55 , and
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12709.html
Thank you!
Cheers,
/Mike/
Robert Stojnic wrote:
Hm, wouldn't that require running multiple queries for a single user
query? If I'm understanding it correctly it refines search by trying
different queries, and merges the results?
For the wikipedia system, speed is of utmost importance, since it's
a high traffic site, and has very few resources (compared to other
sites of the same traffic).
r.
On 5/23/07, Tian-Jian Barabbas Jiang@Gmail <barabbas@gmail.com> wrote:
Although I bet you have already done it, here's my 2 cents:
I usually apply a concept in my IR system:
precision first, recall next.
For example, my system may do an exact match first, get the results
from
searcher.doc(topDocs.scoreDocs[i].doc)
and save them externally.
That allows me to merge some more partially matched results later.
Apparently this can be done by something like parallel
queries, but I like to merge them sequentially by myself.