No problems - I checked a couple of pages (including one or
two w/images), and there don't seem to be any issues.
--
------------------------------------------------------------
Robert Merkel rgmerk(a)mira.net
Go You Big Red Fire Engine
-- Unknown Audience Member at Adam Hills standup gig
------------------------------------------------------------
> The point of the parser was to detect the cases where the query
> was not well-formed, unbalanced brackets, "A and or B", et cetera,
> and then give some kind of syntax error to indicate what is wrong.
> What do you do with those cases now?
Now I let MySQL choke on it, and report back its error, which is
actually much more useful than it sounds. The part of the program
that reports MySQL errors is very good; I originally made it that
way for debugging, but it's not bad for this case either. I just
don't see that much benefit from making nicer error messages on
badly formed searches.
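Roughly, the error path is just this (a sketch using the plain
mysql_* calls; the variable names are made up, not lifted from the
actual code):

    $res = mysql_query( $sql, $db );
    if ( $res === false ) {
        # Pass MySQL's own diagnostic straight through to the user
        # instead of parsing the query ourselves.
        print "Search error: " . htmlspecialchars( mysql_error( $db ) );
    }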
> Btw. is it correct that you only highlight one search word per line
> in the result of the search?
I fixed that, and also the word-boundary problem, but it does
still limit the context display to 60 characters before and after
the first hit of each line. Personally, I think that's plenty to
get a sense of context, but I might be convinced otherwise.
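In case anyone wants to fiddle with that limit, the trimming amounts
to something like this (a sketch only; the real code differs in
detail):

    # Find the first hit on the line, then keep 60 characters of
    # context on either side of it.
    $pos = strpos( strtolower( $line ), strtolower( $term ) );
    if ( $pos !== false ) {
        $start   = max( 0, $pos - 60 );
        $context = substr( $line, $start, 60 + strlen( $term ) + 60 );
        if ( $start > 0 ) {
            $context = '...' . $context;
        }
    }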
> Ah, I see, sorry for not checking your code first. So that is why
> the scoring doesn't work anymore. The simplest way to get scoring
> back is probably to not eliminate duplicates when processing.
Hmm. That might be a good idea. I might even be able to add extra
duplicates for words in headings or something. I initially assumed
that eliminating duplicates would speed the search, but if it hurts
the scoring, that's not a good tradeoff.
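If I do try the heading trick, it would be something like this (an
untested sketch; the heading regex is only illustrative):

    # Repeat words from == headings == so MySQL's scorer counts
    # them more than once per article.
    preg_match_all( '/^=+\s*(.+?)\s*=+\s*$/m', $text, $m );
    $index = $text . ' ' . str_repeat( implode( ' ', $m[1] ) . ' ', 3 );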
> Not necessarily. The usual way to do this is define your own index
> table like
> Text_index(word, article, #occurrences)
> and then you let MySQL compute some sort of scoring and sort on
> that. This is tricky if you have OR and NOT but with only AND
> this is easy.
I think I'll try removing the dup-stripping first.
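For reference, the query your scheme implies would look something
like this (identifiers beyond text_index(word, article, occurrences)
are invented):

    # All terms required (implicit AND): each matched article must
    # contain every word, and occurrence counts sum to a score.
    $sql = "SELECT article, SUM(occurrences) AS score
              FROM text_index
             WHERE word IN ('fire', 'engine')
             GROUP BY article
            HAVING COUNT(DISTINCT word) = 2
             ORDER BY score DESC";
    $res = mysql_query( $sql, $db );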
> Having said that I now favour removing my search code and moving
> to MySQL's boolean search, because if you don't like its default
> scoring you can now use the +'s. It was fun writing a parser for
> boolean expressions but if we can get rid of that complicated
> piece of code and defer some of the work to the database I'm all
> for it. Simplify, simplify.
The first simplification I did was to get rid of the parser because
it wasn't necessary--SQL is already doing it, so I just pass on the
ANDs, ORs, and NOTs as they are. Yes, I put an implicit AND between
terms, because that makes fast, small result sets.
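Concretely, each term just gets its own MATCH clause--something like
this sketch (the table and column names are placeholders, not the
real schema):

    # Each search term becomes its own MATCH clause; the user's
    # ANDs, ORs and NOTs join them as ordinary SQL operators.
    $where = "MATCH(si_text) AGAINST('fire')" .
             " AND MATCH(si_text) AGAINST('engine')";
    $sql   = "SELECT si_title FROM searchindex WHERE $where";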
The boolean searching in MySQL 4.0 would be great--but that's a
BIG leap--MySQL 4.0 is not a stable product. It's alpha software,
and I'm not so sure that giving up the reliability of 3.23 is worth
the extra features. Does anyone on the list have experience with
MySQL 4.0 in a production environment?
MySQL 3.23 is very stable and reliable. Even recompiling it from
source was simple (I did that to get rid of the 4-letter minimum--
you can check that out at the new site--search for "PVC" for example).
The second change I made to the search was to parse the article
text into a separate field the way we were already doing for titles.
This field contains all the unique words of the article just once,
case folded and stripped of punctuation (so it fixes the ''
problem, for example). I even do some processing for things
like [[game]]s, which will put both "game" and "games" in the index.
We could expand this preprocessing to handle more cases like these.
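For the curious, the [[game]]s handling amounts to a regex
substitution along these lines (illustrative, not the exact code):

    # Turn "[[game]]s" into "game games" before indexing, so both
    # forms end up in the index field.
    $text = preg_replace( '/\[\[([^]|]+)\]\]([a-z]+)/',
                          '\1 \1\2', $text );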
We could also do our own scoring after MySQL returns the raw
results, but that would require making a pass through the entire
result set before displaying anything. Another thing about the
search in the new codebase is that it is blindingly fast--it
often returns results within 2 seconds. When it's that
fast, you don't need as many features because the user can do
multiple searches.
> However, in the long run we should probably implement our own
> indexing. That would allow us to tackle several problems:
> - the ' problem
> - searching UTF-8 with proper collation without hacking the
> character set
> - recognizing entities such as &ouml;
> - languages with inflections
All of these can be solved with the pre-processing already in
the new codebase--in fact the ' problem is already solved. I
haven't done anything with new character sets, but that should
be pretty easy--take a look at SearchUpdate.php.
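For example, entity recognition could be one more substitution in
the same preprocessing pass (a sketch only--this exact code is not
in SearchUpdate.php):

    $text = strtolower( $text );
    $text = str_replace( '&ouml;', 'o', $text );   # fold the entity
    # Strip remaining punctuation so words index cleanly.
    $text = preg_replace( '/[^[:alnum:] ]+/', ' ', $text );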
> - partial matches or ... we could wait for the MySQL team to
> implement the Generic user-suppliable UDF preparser as
> mentioned in their to-do list. Perhaps we should give them a
> call. :-)
I'm really big on stable, reliable software. Even if MySQL
chose to implement something like that, I wouldn't recommend
using it until it had been in production for a few months, and
we can't even say that of 4.0 yet.
Wikipedia now has more than 50,000 pages (28,000 articles), and the
1,000 most recent changes were made in the last 5 days.
But is there any way to judge/measure/monitor the quality of the
contribution as volume grows? Are new articles still written on new
topics of general interest, or do more and more cover obscure topics?
Do more duplicates appear? Is there any way to tell from statistics?
I guess the number of different authors that a new article attracts in
its first three months could be an interesting statistical measure.
--
Lars Aronsson <lars(a)aronsson.se>
tel +46-70-7891609
http://aronsson.se/  http://elektrosmog.nu/  http://susning.nu/
The search code in the new codebase behaves similarly to our current
code in that it assumes an implicit AND between search terms, and
doesn't return any results if no articles match all terms.
I wonder if this is the intuitive behavior for most users. I think
Google has conditioned people to type in as much relevant information
as possible to get better hits, and most search engines work that way.
In fact, the built-in mysql search code works that way too. Maybe we
should use it directly? That way, we could also present the results
according to relevancy (which mysql reports), rather than
alphabetically.
We would lose the boolean AND OR NOT operators, but newer versions of
mysql have substitutes: you use "+term" if you definitely want the term
in your results, and you use "-term" if you definitely don't want it.
This is almost as powerful as boolean searching.
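For example, a boolean-mode query in 4.0 would look roughly like
this (the table and column names are placeholders, and the PHP
wrapper is just for illustration):

    $sql = "SELECT cur_title FROM cur
             WHERE MATCH(cur_text)
                   AGAINST('+rail -model' IN BOOLEAN MODE)";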
Alternatively, we could have an "advanced search" page where you could
construct a boolean search, include/exclude specific namespaces etc.
Now that I think about it, a way to optionally search talk: and
wikipedia: would probably be desirable.
Axel
> Mistaken? In 1999 Unisys stated that its policy is to require
> a $5000 fee from websites that carry GIF images made by unlicensed
> software -- even nonprofit websites created and displayed with free
> software. Can Wikipedia prove that every GIF image uploaded to it
> has been created by a properly licensed GIF encoder? I think not.
Unisys can claim any damn thing it wants. But it's what the law
says that matters, and the law says that Unisys is just blowing
smoke up our ass on that claim. Only the claim on encoding software
has any legal merit.
Again, I'm not averse to excluding GIFs from Wikipedia for many
reasons, but fear of a legitimate patent infringement claim is
not one of them.
Oops. Sorry about the empty message.
I recompiled MySQL with the 3-letter minimum instead of four, and
implemented a search function on the new codebase. As with some
other decisions, I went with speed over functionality I didn't see
much use for, but I'm willing to be convinced. For example, title
matches and text matches are separated, and there is no total count
of matches at the top (that's a whole query in itself, and I didn't
think it was useful enough to be worth the time).
I trimmed down the amount of context shown with each hit, and added
line numbers, just so that one gets an idea of the use of the term
in context (and I put it in red).
When this search is satisfactory (I want to make at least one more
tweak to solve the MySQL '' problem), that will complete the major
functionality of the Wiki; only a few special pages remain to be
filled in, so now is a good time for some major testing.
Also, I turned on the PHP option for really pedantic error checking,
so if you get errors now that you didn't before, that's probably the
reason (and please report them to
http://sourceforge.net/tracker/?group_id=34373&atid=411192 ).
The test site is still
http://www.piclab.com/newwiki/wiki.phtml
> Still, as long as Wikipedia neither codes nor decodes GIFs, how can
> it be in violation?
It can't. Derek is completely mistaken on that score. Only software
that encodes or decodes GIF has any problem, and even the case for
decoders is pretty thin. So the patent itself is no reason to forgo
use of GIF in web sites. But avoiding their use does make a
political statement: in the long run it may reduce GIF use to such
low levels that free software developers might be able to produce
more non-patent-encumbered software for producing images.
And PNG is a superior format anyway (and I'm not just saying that
because I'm one of its developers--I was on the committee that
created GIF too).