Development for 2.x : Improved Search


Scott johnson-5
Hello,

[Thanks to Andy for previewing this before I sent it to the list as a whole]

Now, I'm a search geek, and I find that WordPress' search functionality just
doesn't cut it.  I think this is fixable in one of 2 ways -- actually there
are likely N ways to fix it, but these are the 2 I'd be interested in taking
on.

Note: I know that at least one high-profile WP user (Om Malik of
Business 2.0 / Gigaom.com <http://gigaom.com/>) really wants search
"fixed".  And if you search my blog, fuzzyblog.com, for the term "money" and
compare the results with Google or Blogdigger, it's clearly not doing enough.

a) Simple: Add MySQL full-text indexing to WordPress and modify the search
hooks to use it.  Moving to FT indices on MyISAM tables gives quite good
search out of the gate.  It's not perfect, but it scales to about
1.3 million posts with relatively linear response times.  It's public
knowledge that the 1st version of Feedster used MySQL full text until we hit
that point.  There are some character set issues, but it's a lot better than
what seems to be in place now.

Difficulty: not huge.  Willing to do in full myself.
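
A minimal sketch of what option (a) could involve, written as Python that
only builds the SQL and executes nothing.  The table and column names assume
the stock WordPress schema with the default wp_ prefix, the index name
ft_post_search is made up for illustration, and the %s placeholders follow
the parameter style common to Python MySQL drivers.  It is not taken from
any existing plugin.

```python
# Sketch only: builds the statements option (a) would need; nothing is executed.
# Assumed names: wp_posts, post_title, post_content, post_status (stock schema).
# Hypothetical name: ft_post_search (index name invented for this example).

ADD_INDEX_SQL = (
    "ALTER TABLE wp_posts "
    "ADD FULLTEXT INDEX ft_post_search (post_title, post_content)"
)

# Natural-language full-text query; MATCH ... AGAINST also yields the raw score.
SEARCH_SQL = (
    "SELECT ID, post_title, "
    "MATCH (post_title, post_content) AGAINST (%s) AS score "
    "FROM wp_posts "
    "WHERE post_status = 'publish' "
    "AND MATCH (post_title, post_content) AGAINST (%s) "
    "ORDER BY score DESC "
    "LIMIT %s"
)


def build_search(terms, limit=10):
    """Return (sql, params) for a parameterised full-text search."""
    # The search string is passed twice: once for the score, once for the filter.
    return SEARCH_SQL, (terms, terms, limit)
```

The MATCH column list has to be the same set of columns the FULLTEXT index
was created on, which is why both statements name post_title and
post_content.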

b) If the desire for N-database support means that WP doesn't want to do
this, then the next approach is to duplicate the SQL-based search approach of
mnoGoSearch, which uses SQL tables for the core word list and indices.  It's
an interesting approach and would take a bunch of work.  I could certainly
help with that, but it's a long-term, not short-term, project and would take
more than just me.

Difficulty: tedious and a fair bit of code.
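
To make option (b) concrete, here is a rough sketch of a word list and an
index kept in ordinary SQL tables, using Python's built-in sqlite3 so it
runs anywhere.  The schema, table names, and scoring (sum of term counts)
are assumptions made up for this example; they are not mnoGoSearch's actual
layout.

```python
import re
import sqlite3


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def create_schema(db):
    # Hypothetical schema: one row per distinct word, one row per (word, post).
    db.executescript("""
        CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE postings (
            word_id INTEGER, post_id INTEGER, hits INTEGER,
            PRIMARY KEY (word_id, post_id)
        );
    """)


def index_post(db, post_id, text):
    """Count each word in the post and store the counts in the index."""
    counts = {}
    for w in tokenize(text):
        counts[w] = counts.get(w, 0) + 1
    for w, n in counts.items():
        db.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (w,))
        (word_id,) = db.execute(
            "SELECT id FROM words WHERE word = ?", (w,)
        ).fetchone()
        db.execute(
            "INSERT OR REPLACE INTO postings VALUES (?, ?, ?)",
            (word_id, post_id, n),
        )


def search(db, query):
    """Return ids of posts containing ALL query words, best score first."""
    terms = sorted(set(tokenize(query)))
    if not terms:
        return []
    marks = ",".join("?" * len(terms))
    rows = db.execute(
        f"SELECT p.post_id, SUM(p.hits) AS score "
        f"FROM postings p JOIN words w ON w.id = p.word_id "
        f"WHERE w.word IN ({marks}) "
        f"GROUP BY p.post_id HAVING COUNT(DISTINCT w.id) = ? "
        f"ORDER BY score DESC, p.post_id",
        (*terms, len(terms)),
    ).fetchall()
    return [post_id for post_id, _ in rows]
```

Because it uses only plain tables, joins, and GROUP BY, the same idea ports
to any SQL database, which is the point of this option.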

c) I don't know what hosted WP does for tables, but the big limitation of
MyISAM is scalability, which is generally when you move to InnoDB.  Now
InnoDB, of course, doesn't have full-text indices, which raises another set
of issues.  There are also implications for all this if you're using 1 table
per user's posts versus 1 table for ALL users' posts.  Any insights would be
appreciated (I'm a search guy who's learning to hack WP but knows damn well
he doesn't have all the answers).

d) UI and posts versus pages.  My suggestion is to generate a composite
results page showing something like this imho:

Matching blog posts for FOO:

Sorted by date | [Sort by Relevance]  <== links

1. Blah Bar
2. Blah Foo
3. Blah Gah
4. Blah Etc

More...

Matching pages for FOO:

(same structure)
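
The composite page above could be assembled roughly as follows.  This is
only a sketch: the hit dictionaries and their keys (type, title, date,
score) are invented for the example and do not correspond to any WordPress
data structure.

```python
from datetime import date


def composite_results(hits, sort="date"):
    """Split hits into posts and pages, each sorted by date or relevance."""
    if sort == "date":
        key = lambda h: h["date"]
    else:
        key = lambda h: h["score"]
    groups = {"post": [], "page": []}
    for hit in hits:
        groups[hit["type"]].append(hit)
    # Newest (or most relevant) first within each group.
    return {
        kind: sorted(items, key=key, reverse=True)
        for kind, items in groups.items()
    }


def render(term, grouped):
    """Render the two result groups as plain text, posts before pages."""
    lines = []
    for kind, heading in (("post", "blog posts"), ("page", "pages")):
        lines.append(f"Matching {heading} for {term}:")
        for n, hit in enumerate(grouped[kind], start=1):
            lines.append(f"{n}. {hit['title']}")
        lines.append("")
    return "\n".join(lines).rstrip()
```

Sorting by date is the default here, with relevance available through the
sort argument, mirroring the two links in the mock-up.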

Thoughts?  *ducks*

Scott
_______________________________________________
wp-hackers mailing list
[hidden email]
http://lists.automattic.com/mailman/listinfo/wp-hackers

RE: Development for 2.x : Improved Search

Denis de Bernardy
> a) Simple : Add MySQL Full Text Indexing to Wordpress and
> modify the search hooks to use it. Moving to FT indices on
> MyISAM tables gives actually quite good search out of the
> gate. (...)
>
> Difficulty: not huge.  Willing to do in full myself.

This is mostly done already, and MyISAM full text indexing is about as bad
as bad can get.

http://www.semiologic.com/software/search-reloaded/

D.
 


RE: Development for 2.x : Improved Search

Denis de Bernardy
> > a) Simple : Add MySQL Full Text Indexing to Wordpress and
> > modify the search hooks to use it. Moving to FT indices on
> > MyISAM tables gives actually quite good search out of the
> > gate. (...)
> >
> > Difficulty: not huge.  Willing to do in full myself.
>
> This is mostly done already, and MyISAM full text indexing is
> about as bad as bad can get.
>
> http://www.semiologic.com/software/search-reloaded/

As additional information, a past version of the plugin did a slightly
better job than the above at the cost of a huge amount of compute power.  To
spare you some time:

1. Using a FT index on the text-only version of the formatted post excerpt
and content does not improve the results in any significant manner.

2. MySQL has a number of issues that are related to charsets.

These tend to worsen after MySQL 4.1 (at which point they introduced a
collection of new bugs, for good measure). The underlying mess is a
nightmare to sort out.

3. Trying to tweak the results by reworking the raw MySQL score can produce
meaningful enhancements but involves significant overhead.

Things I tried include keyword order, keyword presence in the post title,
double quotes to create keyword groups, and later on the use of Soundex.

I eventually dropped all of these ideas because working around MySQL's lack
of features by using PHP was simply ridiculous. If you give this a shot
yourself, store your indexes and search procedures in a real database, such
as pgsql.

4. Last but not least, several users sent me messages along the lines of the
following:

"Search reloaded returns results in a random order. Why doesn't it sort
results by date?"
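
For readers wondering what the score reworking in point 3 might look like,
here is a small sketch.  It is not the plugin's code: the function name, the
boost factors, and the substring matching are all assumptions chosen for
illustration, and it covers only the title and quoted-phrase ideas.

```python
def rescore(raw_score, title, body, query, title_boost=2.0, phrase_boost=1.5):
    """Adjust a raw full-text score using two of the signals from point 3.

    The boost values are arbitrary placeholders, not tuned numbers.
    """
    terms = query.lower().split()
    score = raw_score
    title_l = title.lower()
    # Boost once if any keyword appears in the post title.
    if any(t in title_l for t in terms):
        score *= title_boost
    # Boost if the keywords appear together, in order, as one phrase.
    if len(terms) > 1 and " ".join(terms) in f"{title} {body}".lower():
        score *= phrase_boost
    return score
```

Running this per result in application code is exactly the kind of overhead
point 3 warns about.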


Denis


Re: Development for 2.x : Improved Search

Scott johnson-5
Hi Denis,

I'll download your code and take a look at it.  There's a fair bit of
tweaking that you do to set up MySQL full-text search, and I don't know what
your assumptions were.

I'd also -- for any search task -- recommend offering search by date, newest
to oldest, as the default if not as an option.  The world is increasingly
about currency and, particularly in a blog context, this is key.

Finally, if memory serves me correctly, the default connector in MySQL
full-text search is OR, not AND, which means that users get what appear to be
random results.  When we switched this in Feedster, it essentially fixed the
problem from the user's perspective.
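
One way to get AND behaviour out of MySQL full-text search is boolean mode,
where a leading + marks a word as required.  The sketch below shows that
approach and is not a description of what Feedster actually did.  The query
assumes the stock wp_posts columns and sorts by date, newest first; the
helper name is made up.

```python
import re


def to_boolean_and(query):
    """Prefix every word with '+' so boolean mode requires all of them."""
    # \w+ keeps only word characters, so operators typed by the user
    # (quotes, +, -, *) are dropped rather than passed through.
    words = re.findall(r"\w+", query)
    return " ".join(f"+{w}" for w in words)


# Assumed schema: wp_posts with post_title, post_content, post_date.
BOOLEAN_SEARCH_SQL = (
    "SELECT ID, post_title FROM wp_posts "
    "WHERE MATCH (post_title, post_content) "
    "AGAINST (%s IN BOOLEAN MODE) "
    "ORDER BY post_date DESC"
)
```

For example, to_boolean_and("money blogs") gives "+money +blogs", which
would be passed as the single parameter of BOOLEAN_SEARCH_SQL.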

Reworking the relevance score is always tricky but it certainly can be
done.  http://www.queryserver.com/ is a product of mine from '97 (still
around) that normalizes relevance scores across all the major search engines
and produces a merged metasearch result.
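
As a rough illustration of normalizing and merging scores, the sketch below
uses simple min-max scaling per engine and keeps each document's best
normalized score.  That choice is an assumption made for the example; it is
not a description of how the product above works.

```python
def normalize(results):
    """Min-max scale one engine's {doc: score} map into the range [0, 1]."""
    if not results:
        return {}
    lo, hi = min(results.values()), max(results.values())
    if hi == lo:
        # All scores equal: treat every document as a top hit.
        return {doc: 1.0 for doc in results}
    return {doc: (s - lo) / (hi - lo) for doc, s in results.items()}


def merge(*engines):
    """Merge several engines' results, ranked by best normalized score."""
    merged = {}
    for results in engines:
        for doc, score in normalize(results).items():
            merged[doc] = max(merged.get(doc, 0.0), score)
    # Highest score first; ties are broken by document name.
    return sorted(merged, key=lambda d: (-merged[d], d))
```

Scaling first matters because raw scores from different engines are not on
the same scale, so comparing them directly would favour whichever engine
happens to produce larger numbers.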

*Scurries to download code and try it before I have to get offline*.

Thanks!

Scott

On 2/5/06, Denis de Bernardy <[hidden email]> wrote:

>
> > > a) Simple : Add MySQL Full Text Indexing to Wordpress and
> > > modify the search hooks to use it. Moving to FT indices on
> > > MyISAM tables gives actually quite good search out of the
> > > gate. (...)
> > >
> > > Difficulty: not huge.  Willing to do in full myself.
> >
> > This is mostly done already, and MyISAM full text indexing is
> > about as bad as bad can get.
> >
> > http://www.semiologic.com/software/search-reloaded/
>
> As additional information, a past version of the plugin did a slightly
> better job than the above at the cost of a huge compute power. To spare
> yourself some time:
>
> 1. Using a FT index on the text-only version of the formatted post excerpt
> and content does not improve the results in any significant manner.
>
> 2. MySQL has a number of issues that are related to charsets.
>
> These tend to worsen after MySQL 4.1 (at which point they introduced a
> collection of new bugs, for good measure). The underlying mess is a
> nightmare to sort out.
>
> 3. Trying to tweak the results by reworking the raw mysql score can
> produce meaningful enhancements but involved a significant overhead.
>
> Things I tried include the keyword order, their presence in the post
> title, presence or absence of double quotes to create keyword groups, and
> later on the use of a soundex.
>
> I eventually dropped all of these ideas because working around MySQL's
> lack of features by using php was simply ridiculous. If you give this a
> shot yourself, store your indexes and search procedures in a real
> database, such as pgsql.
>
> 4. Last but not least, several users sent me messages along the lines of
> the following:
>
> "Search reloaded returns results in a random order. Why doesn't it sort
> results by date?"
>
>
> Denis
>