Thread: Postgresql performance question


Postgresql performance question

From:
"Mark Jones"
Date:
Hello

I am working on a project that acquires real-time data from an external
device; I need to store that data and be able to search through and retrieve
it quickly. My application receives packets ranging in size from 300 to 5000
bytes every 50 milliseconds, for a minimum duration of 24 hours, before the
data is purged or archived off disk. There are several fields in the data
that I would like to be able to search on to retrieve the data at a later
time. Using a SQL database such as Postgresql or Mysql seems like it would
make this task much easier. My questions are: is a SQL database such as
Postgresql able to handle this kind of activity, saving a record of 5000
bytes at a rate of 20 times a second, and how well will it perform at
searching through a database which contains nearly two million records at a
size of about 8-9 gigabytes, assuming that I have adequate computing
hardware? I am trying to determine whether a SQL database would work well
for this or whether I need to write my own custom database for this project.
If anyone has any experience doing anything similar with Postgresql, I would
love to know about your findings.

Thanks
Mark



Re: Postgresql performance question

From:
Rod Taylor
Date:
On Sun, 2003-03-02 at 18:52, Mark Jones wrote:
> Hello
>
> I am working on a project that acquires real-time data from an external
> device; I need to store that data and be able to search through and retrieve
> it quickly. My application receives packets ranging in size from 300 to 5000
> bytes every 50 milliseconds, for a minimum duration of 24 hours, before the
> data is purged or archived off disk. There are several fields in the data
> that I would like to be able to search on to retrieve the data at a later
> time. Using a SQL database such as Postgresql or Mysql seems like it would
> make this task much easier. My questions are: is a SQL database such as
> Postgresql able to handle this kind of activity, saving a record of 5000
> bytes at a rate of 20 times a second, and how well will it perform at
> searching through a database which contains nearly two million records at a
> size of about 8-9 gigabytes, assuming that I have adequate computing
> hardware? I am trying to determine whether a SQL database would work well
> for this or whether I need to write my own custom database for this project.
> If anyone has any experience doing anything similar with Postgresql, I would
> love to know about your findings.

Not such a big load, depending on the complexity of the queries being
performed.  From the sounds of it, they're pretty simple.

The real question is: the data collection is in real time, but can you
have a maintenance window (6 hours a week?) to do things like REINDEX?

Can you use a parallel loader or does it need to be loaded sequentially?

Will the data purge be spread over time or come in large lumps?


Simple queries on 2 million records happen in low double-digit
milliseconds on relatively low-end hardware with a decent index -- but
complex queries can take minutes on 1/10000th of the dataset.

20 inserts/second shouldn't be difficult to achieve even on the lowest
end hardware -- my laptop can do that -- giving each packet its own
transaction.
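
Purely as an illustration of the packet-per-transaction case -- the table
and column names below are invented, not something from your application:

    CREATE TABLE packets (
        recorded_at timestamptz NOT NULL,   -- capture time of the packet
        payload     bytea                   -- opaque packet body
    );
    CREATE INDEX packets_recorded_at_idx ON packets (recorded_at);

    -- one packet, one transaction, twenty times a second
    BEGIN;
    INSERT INTO packets (recorded_at, payload)
        VALUES (now(), 'raw packet bytes here');
    COMMIT;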

But if it's 20 packets with parts being inserted into 50 tables (now
1000 inserts/second), things could become a little trickier -- but
still easily doable.

The most important question is the maintenance window.

--
Rod Taylor <rbt@rbt.ca>

PGP Key: http://www.rbt.ca/rbtpub.asc

Attachments

Re: Postgresql performance question

From:
Rod Taylor
Date:
> device. As mentioned above, I may want to be able to sequentially read records
> from the database to recreate the data stream that I received and recorded. I
> would need to be able to read the records back fast enough to replay them at
> the same rate at which I received them.

It's only 20/second, and it doesn't appear as if you'll be getting
bursts -- so it should be easy to handle.  The trick is getting around
the lack of maintenance time.

7.4 will be much better suited to this application due to the index
corrections; in the meantime, however, you'll need to do a little
trickery.

That is, create a new table each month or so, then copy the last 24
hours of data into the new table, and change a marker so that queries go
against it.  A month puts 54M records into the table -- not such a big
deal if you have good indexes.

Create two 'control' tables.  One points to the table to run queries
against, the second points to the table(s) to inject new data into.  Teach
your interface and data loader to use these to determine which table to
work with.  An alternative approach is to use views and rules to make it
transparent -- alter the view to point at the new data source.

During the transition stage, you'll be querying out of the old table,
while doing the data load into both old and new with the data loader,
and copying the last 24 hours' worth of data into the new table.

Once the new table is ready (has the data copied), update the 'query'
control table to have the interface query the new table.  You can then
drop the old one.
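
Roughly, the view flavour of the swap could look like this -- every table,
view and column name here is invented, just to sketch the idea:

    -- the new month's table, same shape as the one currently in use
    -- (packets_2003_03 in this sketch, assumed to already exist)
    CREATE TABLE packets_2003_04 (
        recorded_at timestamptz NOT NULL,
        payload     bytea
    );
    CREATE INDEX packets_2003_04_recorded_at_idx
        ON packets_2003_04 (recorded_at);

    -- while the loader is writing into both tables, copy the last
    -- 24 hours across
    INSERT INTO packets_2003_04
        SELECT recorded_at, payload
        FROM packets_2003_03
        WHERE recorded_at > now() - interval '24 hours';

    -- repoint the view the query side reads from, then retire the old table
    CREATE OR REPLACE VIEW packets_live AS
        SELECT recorded_at, payload FROM packets_2003_04;
    DROP TABLE packets_2003_03;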


It's a kludgy way of doing it, but it gets around PostgreSQL's growing
index issue.  Anything else and you'll see hiccups during maintenance.
MySQL will require a similar technique, but for different reasons.

As stated above, PostgreSQL 7.4 is a few steps closer to being able to
deal with large table churn in a 24/7 environment.  If you do the table
swap on a monthly basis, you could do a bulk archive of that table's data
very easily -- in which case you'll almost never need to vacuum ;)



A different (and perhaps easier) approach is to create a partial index
for each day, doing everything else as you normally would on a single
table.  Since you will have a segmented index, you can easily remove the
old indexes once the data has been archived.  Once again, the goal is to
allow the index to shrink.
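
For instance -- the table and index names are just placeholders, and the
dates would roll forward each day:

    -- a partial index covering only one day's rows
    CREATE INDEX packets_20030302_idx ON packets (recorded_at)
        WHERE recorded_at >= '2003-03-02' AND recorded_at < '2003-03-03';

    -- once that day has been archived and its rows deleted, drop its index
    DROP INDEX packets_20030302_idx;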

This will require you to vacuum frequently and have a large FSM.  It's
also preferable to archive data gradually rather than in bulk: archive
10 minutes' worth every 10 minutes after 3 days have passed, rather than
an entire day at once.
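
Again just a sketch, assuming a single packets table with a recorded_at
timestamp column -- run something like this from cron every 10 minutes:

    -- each run removes roughly 10 minutes' worth of expired rows
    DELETE FROM packets WHERE recorded_at < now() - interval '3 days';

    -- keep the dead space reusable; needs generous FSM settings
    VACUUM packets;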

--
Rod Taylor <rbt@rbt.ca>

PGP Key: http://www.rbt.ca/rbtpub.asc

Attachments

Re: [HACKERS] Postgresql performance question

From:
Greg Copeland
Date:
Something I didn't see mentioned: does your data need to be made
available in real time?  Just because you're sampling 20/s doesn't have
to mean that the data is made available at 20 samples per second, or even
one sample every 1/20th of a second.  I mention this because you might find
that it's a little more resource-friendly to stuff many samples into the
same transaction and commit them together.  As an example, take 60 seconds'
worth of data (1200 samples) and send it to the database while continuing
to sample.  This may allow for more flexibility in your implementation.
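
In SQL terms the batching is nothing more exotic than wrapping the inserts
for one window in a single transaction -- assuming a hypothetical table
like packets(recorded_at timestamptz, payload bytea):

    BEGIN;
    INSERT INTO packets (recorded_at, payload)
        VALUES ('2003-03-02 10:00:00.000', 'sample 1');
    INSERT INTO packets (recorded_at, payload)
        VALUES ('2003-03-02 10:00:00.050', 'sample 2');
    -- ... and so on for the rest of the 60-second window (1200 rows) ...
    COMMIT;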

At any rate, I agree with the other comments.  Maintenance issues are
probably going to be the gotcha if you're not forward-looking.


Regards,

    Greg Copeland


On Sun, 2003-03-02 at 19:33, Mark Jones wrote:
> > The real question is, the data collection is in real-time, but can you
> > have a maintenance window (6 hours a week?) to do things like REINDEX?
>
> The database has to have the last 24 hours of data online and be accessible
> 24 hours a day, 7 days a week, 365 days a year -- basically no downtime. My
> application is a weather-related system, which must be available at any moment.
>
> > Will the data purge be spread over time or come in large lumps?
>
> The database that is online would only contain the last 24 hours of data.
> Once 24 hours of data has been collected, the data would begin to be purged,
> say by purging one hour's worth of data every hour.
>
> > But if it's 20 packets with parts being inserted into 50 tables (now
> > 1000 inserts / second) things could become a little trickier -- but
> > still easily doable.
>
> No, there will be only one table, with each packet having its own record.
> Each packet contains a few header fields that I would like to be able to
> search on. The rest of the data is purely binary data that goes along with
> the header. I have no interest in searching through it, only in being able to
> retrieve it based on its header. One important way of retrieving data would be
> by time: since every packet has a timestamp, I would like to select a range
> of records out of the database. There may be cases where I want to stream the
> data back out just the way it was recorded.
>
>
> > Can you use a parallel loader or does it need to be loaded sequentially?
>
> I am not exactly sure what you mean by a parallel loader, but the data would
> be loaded sequentially into the database just as I receive it from my external
> device. As mentioned above, I may want to be able to sequentially read records
> from the database to recreate the data stream that I received and recorded. I
> would need to be able to read the records back fast enough to replay them at
> the same rate at which I received them.
>
>
> Thanks Again
> Mark
>
>
--
Greg Copeland <greg@copelandconsulting.net>
Copeland Computer Consulting


Re: Postgresql performance question

From:
Mark Jones
Date:
> The real question is, the data collection is in real-time, but can you
> have a maintenance window (6 hours a week?) to do things like REINDEX?

The database has to have the last 24 hours of data online and be accessible
24 hours a day, 7 days a week, 365 days a year -- basically no downtime. My
application is a weather-related system, which must be available at any moment.

> Will the data purge be spread over time or come in large lumps?

The database that is online would only contain the last 24 hours of data.
Once 24 hours of data has been collected, the data would begin to be purged,
say by purging one hour's worth of data every hour.

> But if it's 20 packets with parts being inserted into 50 tables (now
> 1000 inserts / second) things could become a little trickier -- but
> still easily doable.

No, there will be only one table, with each packet having its own record.
Each packet contains a few header fields that I would like to be able to
search on. The rest of the data is purely binary data that goes along with
the header. I have no interest in searching through it, only in being able to
retrieve it based on its header. One important way of retrieving data would be
by time: since every packet has a timestamp, I would like to select a range
of records out of the database. There may be cases where I want to stream the
data back out just the way it was recorded.
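
Roughly, the table I have in mind would look something like this (the header
columns here are only placeholders for the real fields):

    CREATE TABLE packets (
        recorded_at  timestamptz NOT NULL,  -- timestamp from the packet header
        station_id   integer     NOT NULL,  -- example searchable header field
        packet_type  integer     NOT NULL,  -- example searchable header field
        payload      bytea       NOT NULL   -- opaque binary body, never searched
    );
    CREATE INDEX packets_recorded_at_idx ON packets (recorded_at);

    -- replaying a stretch of the stream in the order it was recorded
    SELECT recorded_at, payload
    FROM packets
    WHERE recorded_at BETWEEN '2003-03-02 10:00' AND '2003-03-02 11:00'
    ORDER BY recorded_at;

    -- hourly purge of anything older than 24 hours
    DELETE FROM packets WHERE recorded_at < now() - interval '24 hours';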


> Can you use a parallel loader or does it need to be loaded sequentially?

I am not exactly sure what you mean by a parallel loader, but the data would
be loaded sequentially into the database just as I receive it from my external
device. As mentioned above, I may want to be able to sequentially read records
from the database to recreate the data stream that I received and recorded. I
would need to be able to read the records back fast enough to replay them at
the same rate at which I received them.


Thanks Again
Mark



--
Mark Jones
http://www.hackerjones.org