Thread: Suggestion for improving Archives

Suggestion for improving Archives

From
Josh Berkus
Date:
Folks,

In addition to the pending migration of the Archives (or half the archives, or
whatever), I had another suggestion to make the archives less
resource-intensive yet more user-friendly:

Drop the search interface and replace it with links to pgsql.ru and Google
Groups.

Both of those resources are faster and better search engines than the Mhonarc
search could ever be, and neither eats CPU time on hub.org.   Yes?

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Suggestion for improving Archives

From
"Dave Page"
Date:


-----Original Message-----
From: pgsql-www-owner@postgresql.org on behalf of Josh Berkus
Sent: Fri 9/3/2004 5:19 PM
To: PostgreSQL WWW Mailing List
Subject: [pgsql-www] Suggestion for improving Archives

> Both of those resources are faster and better search engines than the Mhonarc
> search could ever be, and neither eats CPU time on hub.org.   Yes?

Um, Mhonarc is the mail-to-HTML program, which will still be required, and none of the searches currently run on hub.org
anyway.

Regards, Dave

Re: Suggestion for improving Archives

From
"Marc G. Fournier"
Date:
On Fri, 3 Sep 2004, Josh Berkus wrote:

> Folks,
>
> In addition to the pending migration of the Archives (or half the archives, or
> whatever), I had another suggestion to make the archives less
> resource-intensive yet more user-friendly:
>
> Drop the search interface and replace it with links to pgsql.ru and Google
> Groups.
>
> Both of those resources are faster and better search engines than the
> Mhonarc search could ever be, and neither eats CPU time on hub.org.
> Yes?

Note that the search functions haven't chewed up CPU on our servers in
over a month now ... John Hansen has been running search.postgresql.org
off of his server(s) for about that long now ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Suggestion for improving Archives

From
Oleg Bartunov
Date:
I could configure the search daemon at pgsql.ru to allow
search requests from archives.postgresql.org via a Perl interface,
so results could be wrapped into any design.

    Oleg
On Fri, 3 Sep 2004, Josh Berkus wrote:

> Folks,
>
> In addition to the pending migration of the Archives (or half the archives, or
> whatever), I had another suggestion to make the archives less
> resource-intensive yet more user-friendly:
>
> Drop the search interface and replace it with links to pgsql.ru and Google
> Groups.
>
> Both of those resources are faster and better search engines than the Mhonarc
> search could ever be, and neither eats CPU time on hub.org.   Yes?
>
>

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Suggestion for improving Archives

From
"John Hansen"
Date:
> Note that the search functions haven't chewed up CPU on our
> servers in over a month now ... John Hansen has been running
> search.postgresql.org off of his server(s) for about that long now ...
>

... It's been more than 2 months, but who's counting? :)

... John

Re: Suggestion for improving Archives

From
"John Hansen"
Date:
Hi,

>
> I could configure search daemon at pgsql.ru to allow search
> requests from archives.postgresql.org via perl interface, so
> results could be wrapped into any design.
>

I currently run search.postgresql.org, and it can be integrated into any
design as well, either using the (ASPseek) method of an HTML-like
template, or using XSL (the daemon can be configured to output XML).

> >
> > Drop the search interface and replace it with links to pgsql.ru and
> > Google Groups.
> >
> > Both of those resources are faster and better search engines than the
> > Mhonarc search could ever be, and neither eats CPU time on hub.org.   Yes?

However, only one of those resources, pgsql.ru, could be made to make
new threads available on an hourly basis, or maybe even in real time.

Either way, I'm not fussed, whichever works the best.

... John


Re: Suggestion for improving Archives

From
Josh Berkus
Date:
Guys,

> However, only one of those resources, pgsql.ru, could be made to make
> available new threads on an hourly basis, or maybe even realtime.
>
> Either way, I'm not fussed, whichever works the best.

Hmmmm .....

===========================
Click to Search the Archives:

-- PGSQL.ru's Full Text Search of Archives using OpenFTS (fast, all PostgreSQL
sites)

-- Google Groups (fast, general search of Usenet and PostgreSQL mailing lists)

-- Mhonarc Archive Search (slow but includes up-to-the-last-hour posts)

===========================

Good?

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Suggestion for improving Archives

From
"John Hansen"
Date:
>
> -- Mhonarc Archive Search (slow but includes up-to-the-last-hour posts)
>

Maybe I'm missing something...

Could you define 'slow' for me....
Any search I do on the archives comes back in less than a second.


... John

Re: Suggestion for improving Archives

From
"Marc G. Fournier"
Date:
On Sat, 4 Sep 2004, Josh Berkus wrote:

> Guys,
>
>> However, only one of those resources, pgsql.ru, could be made to make
>> available new threads on an hourly basis, or maybe even realtime.
>>
>> Either way, I'm not fussed, whichever works the best.
>
> Hmmmm .....
>
> ===========================
> Click to Search the Archives:
>
> -- PGSQL.ru's Full Text Search of Archives using OpenFTS (fast, all PostgreSQL
> sites)
>
> -- Google Groups (fast, general search of Usenet and PostgreSQL mailing lists)
>
> -- Mhonarc Archive Search (slow but includes up-to-the-last-hour posts)

When is the last time you used the search on archives.postgresql.org?  The
following was searching mvcc:

"Documents 1-10 of total 1576 found.   Searching in 390035 documents took
0.037 seconds."

the following was searching 'wal vadim':

"Documents 1-10 of total 880 found.   Searching in 390035 documents took
0.441 seconds."

the following was searching "postgresql releases 8.0":

"Documents 1-10 of total 2190 found.  Searching in 390035 documents took
1.882 seconds."

the following was searching "nested transaction support":

"Documents 1-10 of total 383 found.   Searching in 390035 documents took
4.714 seconds."

Not what I'd consider "slow" ... granted, that last one on Google took .2
seconds, but when we can build a server farm like them, then I'll be
worried about 4secs vs .2 :)

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Suggestion for improving Archives

From
"John Hansen"
Date:
> the following was searching "nested transaction support":
>
> "Documents 1-10 of total 383 found.   Searching in 390035 documents took
> 4.714 seconds."
>
> Not what I'd consider "slow" ... granted, that last one on
> Google took .2 seconds, but when we can build a server farm
> like them, then I'll be worried about 4secs vs .2 :)
>

And the archives is currently being crawled, and there was a vacuum
running (heavy db load).

nested transaction support:
Documents 1-10 of total 383 found.   Searching in 390035 documents took
0.104 seconds.

... John

Re: Suggestion for improving Archives

From
"Greg Sabino Mullane"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> the following was searching "nested transaction support":
>
> "Documents 1-10 of total 383 found. Searching in 390035 documents took
> 4.714 seconds."

The self-reported time should really not be used. I just ran a query, for
example, that took 8 seconds as measured by my local clock, but reported
searching in under 2 seconds: so obviously there are some other factors
here. (I'll give maybe half a second for network times on my end).
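
To make the gap between the two numbers concrete, here is a minimal Python sketch that pulls the engine's self-reported time out of a results page and measures wall-clock time around a fetch. The names `reported_seconds` and `timed`, and the `fetch` callable, are assumptions made for this illustration; they are not part of search.postgresql.org or any tool mentioned in this thread.

```python
import re
import time

# Matches the status line format quoted in this thread, e.g.
# "Searching in 390035 documents took 4.714 seconds."
REPORTED = re.compile(r"took\s+(\d+(?:\.\d+)?)\s+seconds")


def reported_seconds(page_text):
    """Return the search time the engine reports about itself, or None."""
    match = REPORTED.search(page_text)
    return float(match.group(1)) if match else None


def timed(fetch):
    """Call fetch() and return (elapsed wall-clock seconds, page text).

    fetch is any zero-argument callable that returns the results page as
    a string (hypothetical; plug in whatever HTTP client you use).
    """
    start = time.perf_counter()
    text = fetch()
    return time.perf_counter() - start, text
```

The difference between the elapsed value from `timed` and the value from `reported_seconds` is everything the engine does not count: network latency, page rendering on the server, and images or ads loading.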

I prefer pgsql.ru or google because they search the docs and the mailing
lists, and the quality of the results tends to be higher. While we are here,
the "for files modified" bit of the search.postgresql.org box does not seem
to work: searching for "nested transactions vadim" brings back 62 hits,
regardless of whether I set it to within one day or within 2 years. The
top hit is from June 2000. There is also no way to sort it by date,
which can be extremely important. The ads on every page are annoying as
well.

My own personal summary of advantages:

pgsql.ru: very fast, searches all sites at once, no advertisements,
nice "group by site" feature, cool Mozilla plugin, BSD-licensed tech,
written by PG developers

google: extremely fast, searches many other sources, minimal ads,
order by date, powerful "advanced search" available

search.postgresql.org: linked from main site?

- --
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200409041541

-----BEGIN PGP SIGNATURE-----

iD8DBQFBOh6NvJuQZxSWSsgRArohAJ9qoVBbhrc/vPntojFTXDocX5EZegCfeHC4
T5VIlIxwEklT6EGquje6w3Y=
=HclZ
-----END PGP SIGNATURE-----



Re: Suggestion for improving Archives

From
"Dave Page"
Date:


-----Original Message-----
From: pgsql-www-owner@postgresql.org on behalf of Greg Sabino Mullane
Sent: Sat 9/4/2004 8:57 PM
To: pgsql-www@postgresql.org
Subject: Re: [pgsql-www] Suggestion for improving Archives
 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
 
 
> the following was searching "nested transaction support":
>
> "Documents 1-10 of total 383 found. Searching in 390035 documents took
> 4.714 seconds."
 
> The self-reported time should really not be used. I just ran a query, for
> example, that took 8 seconds as measured by my local clock, but reported
> searching in under 2 seconds: so obviously there are some other factors
> here. (I'll give maybe half a second for network times on my end).

Factors that would apply equally if we used pgsql.ru. What you are most likely seeing is the ads and other graphics
loading. If we used pgsql.ru, you would still have those coming from hub.org.
 
 
> pgsql.ru: very fast, searches all sites at once, no advertisements,
> nice "group by site" feature, cool Mozilla plugin, BSD-licensed tech,
> written by PG developers

Would be linked from main site with ads if we started using it 'officially'.

> search.postgresql.org: linked from main site?

Yes, should we leave it hidden? :-) Note that in comparison to pgsql.ru, search.pg.org also groups by site, searches
all sites at once and is seriously hacked by PG developers (not written, I grant you). It is also open source (GPL, not
BSD, not that I think that is particularly important to any of the users).
 
 
Regards, Dave.

Re: Suggestion for improving Archives

From
"John Hansen"
Date:
> While we are here, the "for files modified" bit of
> the search.postgresql.org box does not seem to work:
> searching for "nested transactions vadim" brings back 62
> hits, regardless of whether I set it to within one day or
> within 2 years. The top hit is from June 2000. There is also
> no way to sort it by date, which can be extremely important.
> The ads on every page are annoying as well.
>

Seems to work fine for me: no results in the last 3 months, 5 in the last
6 and 12, 26 in the last 2 years. Sorting by date, rather than
relevance, could be added.



Re: Suggestion for improving Archives

From
Oleg Bartunov
Date:
On Sun, 5 Sep 2004, John Hansen wrote:

> > While we are here, the "for files modified" bit of
> > the search.postgresql.org box does not seem to work:
> > searching for "nested transactions vadim" brings back 62
> > hits, regardless of whether I set it to within one day or
> > within 2 years. The top hit is from June 2000. There is also
> > no way to sort it by date, which can be extremely important.
> > The ads on every page are annoying as well.
> >
>
> Seems to work fine for me: no results in the last 3 months, 5 in the last
> 6 and 12, 26 in the last 2 years. Sorting by date, rather than
> relevance, could be added.
>

Marc again dropped the Last-Modified header, so it's impossible
to sort results by date (in the general case) without a specific parser.
Also, he changed the message template. These changes cause recrawling of
the whole archive each time, overloading archives.postgresql.org.
A more specific search engine could use another source of information about
which messages to crawl, but the one we use at pgsql.ru is a general search
engine, and it can't get the modification date without a proper header.

I suggest:

1. Use 3-server architecture (image server, frontend, backend) which
   could be reduced to 2 servers (image+frontend, backend) -
   frontend could be plain apache+mod_accel and serve/cache all backends
   outputs, backend is a modperl or/and php enabled apache.

2. return last modification header - be friendly to crawlers and browsers
3. stop changing message template


    Oleg



    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Suggestion for improving Archives

From
"John Hansen"
Date:
> Marc again dropped last time modification header, so it's
> impossible to sort results by date (in general case ) without
> specific parser.

Yes, that is unfortunate, but the code required to make this happen puts
stress on the archives to some degree.

> Also, he changed template for message. These changes cause
> recrawling the whole archive each time and overloading
> archives.postgresql.org More specific search engine could use
> another source of information which messages to crawl, but
> one we use at pgsql.ru is a general search engine and it
> can't get modification date without proper header.

There should be no need to reindex the entire archive because of a
template change, since if you honor the embedded
<!--noindex-->..<!--/noindex--> tags, the body text never changes.
Unless of course, you want to keep an up-to-date cached copy.
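
As a rough sketch of that idea, the Python below fingerprints a page after dropping everything between the `<!--noindex-->` and `<!--/noindex-->` markers, so a template change inside those regions leaves the fingerprint untouched. The function name `body_fingerprint` and the whitespace normalisation are assumptions for illustration; this is not the code that search.postgresql.org or pgsql.ru actually runs.

```python
import hashlib
import re

# Regions the archive marks as template material, per the markers named above.
NOINDEX = re.compile(r"<!--noindex-->.*?<!--/noindex-->",
                     re.DOTALL | re.IGNORECASE)


def body_fingerprint(html):
    """Hash only the indexable part of a page.

    Text inside noindex regions is removed and runs of whitespace are
    collapsed before hashing, so edits to the surrounding template do
    not change the result.
    """
    indexable = NOINDEX.sub("", html)
    normalized = " ".join(indexable.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()
```

A crawler that stores this fingerprint per URL can skip reindexing whenever the new fingerprint equals the stored one, although it still has to download the page to compute it.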

>
> I suggest:
>
> 1. Use 3-server architecture (image server, frontend, backend) which
>    could be reduced to 2 servers (image+frontend, backend) -
>    frontend could be plain apache+mod_accel and serve/cache all backends
>    outputs, backend is a modperl or/and php enabled apache.
> 2. return last modification header - be friendly to crawlers and browsers

Though an accelerator would only work if the Last-Modified header is returned
by the backend, this might be worth looking into.

> 3. stop changing message template
>

Template changes are inevitable, they're part of progress :)

... John

Re: Suggestion for improving Archives

From
"John Hansen"
Date:
>
> Oleg, is there anything that I can put into <HEAD></HEAD> for
> this?  To avoid having to use PHP to do it?
>

<meta http-equiv="Last-Modified" content="Tue, 01 Jun 2004 09:44:10">

... John

Re: Suggestion for improving Archives

From
"Marc G. Fournier"
Date:
On Sun, 5 Sep 2004, John Hansen wrote:

>> Marc again dropped last time modification header, so it's
>> impossible to sort results by date (in general case ) without
>> specific parser.
>
> Yes, that is unfortunate, but the code required to make this happen puts
> stress on the archives to some degree.
>
>> Also, he changed template for message. These changes cause
>> recrawling the whole archive each time and overloading
>> archives.postgresql.org More specific search engine could use
>> another source of information which messages to crawl, but
>> one we use at pgsql.ru is a general search engine and it
>> can't get modification date without proper header.
>
> There should be no need to reindex the entire archive because of a
> template change, since if you honor the embedded
> <!--noindex-->..<!--/noindex--> tags, the body text never changes.
> Unless of course, you want to keep an up-to-date cached copy.

I think what Oleg is referring to is that search engines generally compare
the Last-Modified header before pulling in the whole file, to see if they
are the same or not ... php, unfortunately, sets that to now(), so as far
as SE's are concerned, every time they index is a new file :(
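
The comparison described above can be sketched in a few lines of Python. The function name `needs_refetch` and the policy of refetching whenever a date is missing or unparsable are assumptions made for this example, not the behaviour of any particular crawler.

```python
from email.utils import parsedate_to_datetime


def needs_refetch(stored_last_modified, current_last_modified):
    """Decide whether a crawler should download the page again.

    Both arguments are Last-Modified header values (strings) or None:
    the one remembered from the previous crawl and the one the server
    returns now. Missing or unparsable dates force a refetch.
    """
    if not stored_last_modified or not current_last_modified:
        return True
    try:
        return (parsedate_to_datetime(current_last_modified)
                > parsedate_to_datetime(stored_last_modified))
    except (TypeError, ValueError):
        # Unparsable date, or a mix of dates with and without a timezone.
        return True
```

If the server stamps every response with the current time, the second argument is always newer than the first, so this check returns True on every crawl and the whole archive is pulled again.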

I'm going to play with mhonarc this week to see if I can get it to
properly set Last-Modified to Date based on the message itself ... that
will clean up that mess ...

Oleg, is there anything that I can put into <HEAD></HEAD> for this?  To
avoid having to use PHP to do it?

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Suggestion for improving Archives

From
"John Hansen"
Date:
>
> What code ? I've seen that last modified header and now it's gone.
> No stress on the archives, it's pure question of several lines of code
>

Yes, which is exactly what we wanted to avoid, more php code.


> it's not a portal page, it's just a message, why should it
> changed so often. I think I should teach our crawler to
> recognize if changes were cosmetic using fuzzy checksum.
>

No, but even something as simple as adding a new mailing list would then
cause you to recrawl the entire site.
I agree that the last-modified header is the best solution. (the value
of it being equal to the message date, that is)


... John

Re: Suggestion for improving Archives

From
Oleg Bartunov
Date:
On Sun, 5 Sep 2004, John Hansen wrote:

> > Marc again dropped last time modification header, so it's
> > impossible to sort results by date (in general case ) without
> > specific parser.
>
> Yes, that is unfortunate, but the code required to make this happen puts
> stress on the archives to some degree.

What code? I've seen that Last-Modified header, and now it's gone.
No stress on the archives; it's purely a question of several lines of code.


>
> > Also, he changed template for message. These changes cause
> > recrawling the whole archive each time and overloading
> > archives.postgresql.org More specific search engine could use
> > another source of information which messages to crawl, but
> > one we use at pgsql.ru is a general search engine and it
> > can't get modification date without proper header.
>
> There should be no need to reindex the entire archive because of a
> template change, since if you honor the embedded
> <!--noindex-->..<!--/noindex--> tags, the body text never changes.
> Unless of course, you want to keep an up-to-date cached copy.
>

Hmm, this is a rather non-standard feature of archives.postgresql.org.
The problem is not with index/reindex! The problem is with the crawler, which
doesn't have enough information to make the right decision.
I don't like a non-standard solution/hack when there are standard and
reliable solutions.


> >
> > I suggest:
> >
> > 1. Use 3-server architecture (image server, frontend, backend) which
> >    could be reduced to 2 servers (image+frontend, backend) -
> >    frontend could be plain apache+mod_accel and serve/cache all backends
> >    outputs, backend is a modperl or/and php enabled apache.
> > 2. return last modification header - be friendly to crawlers and browsers
>
> Though an accelerator would only work if the Last-Modified header is returned
> by the backend, this might be worth looking into.
>

I don't see a problem with returning that header. But we'll have a standard
solution for a database-driven site with dynamic content. Note that
one frontend could serve/hide many backends.


> > 3. stop changing message template
> >
>
> Template changes are inevitable, they're part of progress :)
>

It's not a portal page, it's just a message; why should it change so
often? I think I should teach our crawler to recognize whether changes were
cosmetic using a fuzzy checksum.
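
One possible reading of "fuzzy checksum" is a similarity score over the visible text rather than an exact hash. The Python sketch below strips tags, splits the text into overlapping word groups (shingles) and compares two versions with Jaccard similarity. The technique, the names `shingles`, `similarity` and `is_cosmetic_change`, and the 0.9 threshold are all assumptions for illustration; the thread does not say how the pgsql.ru crawler would implement this.

```python
import re

TAG = re.compile(r"<[^>]+>")


def shingles(html, k=4):
    """Return the set of k-word sequences in the page's visible text."""
    words = TAG.sub(" ", html).lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(old_html, new_html, k=4):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    old_set, new_set = shingles(old_html, k), shingles(new_html, k)
    union = old_set | new_set
    return len(old_set & new_set) / len(union) if union else 1.0


def is_cosmetic_change(old_html, new_html, threshold=0.9):
    """Treat the change as cosmetic when the visible text barely moved."""
    return similarity(old_html, new_html) >= threshold
```

Markup-only edits score 1.0 because tags are discarded before comparison. Template changes that add or remove visible text lower the score, so the threshold would need tuning against real archive pages.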


> ... John
>

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Suggestion for improving Archives

From
"Marc G. Fournier"
Date:
k, check out:

http://archives.postgresql.org/sfpug/2004-09/msg00003.php

I have the meta tag in place ... please confirm that the format is okay,
as that is what mhonarc is getting from the message itself ... I can
reformat it if I have to using PHP, but would like to avoid it if at all
possible ... basically, if the default will work, I'd like to leave it as
is ...

On Mon, 6 Sep 2004, John Hansen wrote:

>>
>> Oleg, is there anything that I can put into <HEAD></HEAD> for
>> this?  To avoid having to use PHP to do it?
>>
>
> <meta http-equiv="Last-Modified" content="Tue, 01 Jun 2004 09:44:10">
>
> ... John
>

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664

Re: Suggestion for improving Archives

From
"John Hansen"
Date:
>
> k, check out:
>
> http://archives.postgresql.org/sfpug/2004-09/msg00003.php
>
> I have the meta tag in place ... please confirm that the
> format is okay, as that is what mhonarc is getting from the
> message itself ... I can reformat it if I have to using PHP,
> but would like to avoid it if at all possible ... basically,
> if the default will work, I'd like to leave it as is ...
>

Last-Modified: Sun,  5 Sep 2004 04:38:32 +0100 (BST)

The date format for last-modified is not defined afaik, at least various
web servers seem to have different formats, so I'm guessing this would
be acceptable.
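
For reference, HTTP/1.1 (RFC 2616, section 3.3.1) does specify a preferred date format for headers such as Last-Modified: the RFC 1123 form expressed in GMT, for example `Sun, 06 Nov 1994 08:49:37 GMT`. Whether a given crawler also accepts the mail-style date shown above is not something this sketch settles. If reformatting turns out to be necessary, a minimal Python version could look like this; the function name `mail_date_to_http_date` is hypothetical, and treating a date without a timezone as UTC is an assumption.

```python
from datetime import timezone
from email.utils import format_datetime, parsedate_to_datetime


def mail_date_to_http_date(mail_date):
    """Convert a mail Date: header value to the HTTP-date form in GMT."""
    parsed = parsedate_to_datetime(mail_date)
    if parsed.tzinfo is None:
        # Assumption: a date with no timezone is taken to be UTC already.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return format_datetime(parsed.astimezone(timezone.utc), usegmt=True)
```

Applied to the date quoted above, `mail_date_to_http_date("Sun,  5 Sep 2004 04:38:32 +0100 (BST)")` returns `Sun, 05 Sep 2004 03:38:32 GMT`.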

... John

Re: Suggestion for improving Archives

From
Josh Berkus
Date:
John,

> Maybe I'm missing something...
>
> Could you define 'slow' for me....
> Any search I do on the archives comes back in less than a second.

My apologies; using the Archive search was so painfully slow that I'd stopped
using it months ago.    I didn't know that you'd speeded it up.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Suggestion for improving Archives

From
"John Hansen"
Date:
>
> My apologies; using the Archive search was so painfully slow
> that I'd stopped
> using it months ago.    I didn't know that you'd speeded it up.
>

:)

... John