Обсуждение: Online DW

Поиск
Список
Период
Сортировка

Online DW

От
Sridhar N Bamandlapally
Дата:
Hi

Is there any feature in PostgreSQL where online DW (Dataware housing) is possible ?

am looking for scenario like

1. Production DB will have CURRENT + LAST 7 DAYS data only

2. Archive/DW DB will have CURRENT + COMPLETE HISTORY

expecting something like streaming, but not ETL

Thanks
Sridhar

Re: Online DW

От
John R Pierce
Дата:
On 6/10/2016 1:11 AM, Sridhar N Bamandlapally wrote:
> Is there any feature in PostgreSQL where online DW (Dataware housing)
> is possible ?
>
> am looking for scenario like
>
> 1. Production DB will have CURRENT + LAST 7 DAYS data only
>
> 2. Archive/DW DB will have CURRENT + COMPLETE HISTORY
>
> expecting something like streaming, but not ETL

you'd need to manage that yourself.

I'd probably use a custom ETL job that pulls data from the production
database and sends it to the EDW using a message queue (EMS, JMS, AMQ,
etc).   the production database would be partitioned by day, and once a
day drop the 7+ day old partition, while the EDW wouldn't ever drop, and
if its partitioned, maybe by week or month.

It would probably be easier if the the EDW didn't need the latest
transactions, only data an hour to a day old.   If the EDW needed
near-live data, then I'd want to be using the same sort of messaging
queue platform (EMS, JMS, AMQ, etc) to send the data to the production
database, so that the EDW feed process could subscribe to the same events.

this sort of pattern works easier if your data is inserted once and not
updated, but it can be made to handle updates, too.   the message queue
paradigm ('publish/subscribe') is a very powerful way of implementing
complex distributed systems.




--
john r pierce, recycling bits in santa cruz



Re: [HACKERS] Online DW

От
Craig Ringer
Дата:
On 10 June 2016 at 16:11, Sridhar N Bamandlapally <sridhar.bn1@gmail.com> wrote:
Hi

Is there any feature in PostgreSQL where online DW (Dataware housing) is possible ?

am looking for scenario like

1. Production DB will have CURRENT + LAST 7 DAYS data only

2. Archive/DW DB will have CURRENT + COMPLETE HISTORY

expecting something like streaming, but not ETL

There's nothing built-in, but that's exactly the sort of thing pglogical is intended for. You can also build something along those lines with Londiste fairly easily.

Hopefully this is the sort of thing we can move toward with built-in logical replication in coming releases.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Online DW

От
Sridhar N Bamandlapally
Дата:
One thing looks possible ( feature not available), just an idea

example/syntax:

BEGIN NOARCHIVE;

  --- transaction-1 
  --- transaction-2 
  .
  .
  --- transaction-N 

END;

This/These will be performed in Production to clean-up archive which will not be sync with Archive/DW DB only

one heads-up is Archive/DW DB may need to build WITHOUT CONSTRAINTS

May need to introduce ARCHIVE system/tag in pg_hba.conf

Thanks
Sridhar
OpenText














On Fri, Jun 10, 2016 at 2:22 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 10 June 2016 at 16:11, Sridhar N Bamandlapally <sridhar.bn1@gmail.com> wrote:
Hi

Is there any feature in PostgreSQL where online DW (Dataware housing) is possible ?

am looking for scenario like

1. Production DB will have CURRENT + LAST 7 DAYS data only

2. Archive/DW DB will have CURRENT + COMPLETE HISTORY

expecting something like streaming, but not ETL

There's nothing built-in, but that's exactly the sort of thing pglogical is intended for. You can also build something along those lines with Londiste fairly easily.

Hopefully this is the sort of thing we can move toward with built-in logical replication in coming releases.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Online DW

От
John R Pierce
Дата:
On 6/10/2016 2:18 AM, Sridhar N Bamandlapally wrote:
> This/These will be performed in Production to clean-up archive which
> will not be sync with Archive/DW DB only
>
> one heads-up is Archive/DW DB may need to build WITHOUT CONSTRAINTS
>
> May need to introduce ARCHIVE system/tag in pg_hba.conf

there's a whole lot of implied magic here unless you want to get way
more specific what these features do, exactly, under all possible
conditions.



--
john r pierce, recycling bits in santa cruz



Re: [HACKERS] Online DW

От
Craig Ringer
Дата:
On 10 June 2016 at 18:56, John R Pierce <pierce@hogranch.com> wrote:
On 6/10/2016 2:18 AM, Sridhar N Bamandlapally wrote:
This/These will be performed in Production to clean-up archive which will not be sync with Archive/DW DB only

one heads-up is Archive/DW DB may need to build WITHOUT CONSTRAINTS

May need to introduce ARCHIVE system/tag in pg_hba.conf

there's a whole lot of implied magic here unless you want to get way more specific what these features do, exactly, under all possible conditions.

I'd go further than that and say I can't see how something like this could possibly work with physical (block based) replication. It's total hand-waving.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Online DW

От
Sridhar N Bamandlapally
Дата:

This is what I feel will give me solution to maintain production (current+7days) and archive(current+history) without any etl/scheduler

But there is no feature available in any database

Sridhar
Opentext

On 10 Jun 2016 19:03, "Craig Ringer" <craig@2ndquadrant.com> wrote:
On 10 June 2016 at 18:56, John R Pierce <pierce@hogranch.com> wrote:
On 6/10/2016 2:18 AM, Sridhar N Bamandlapally wrote:
This/These will be performed in Production to clean-up archive which will not be sync with Archive/DW DB only

one heads-up is Archive/DW DB may need to build WITHOUT CONSTRAINTS

May need to introduce ARCHIVE system/tag in pg_hba.conf

there's a whole lot of implied magic here unless you want to get way more specific what these features do, exactly, under all possible conditions.

I'd go further than that and say I can't see how something like this could possibly work with physical (block based) replication. It's total hand-waving.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Online DW

От
Sridhar N Bamandlapally
Дата:

One thing we can restrict to "begin noarchive" transaction block are DELETE and SELECT only

Sridhar
Opentext

On 10 Jun 2016 21:57, "Sridhar N Bamandlapally" <sridhar.bn1@gmail.com> wrote:

This is what I feel will give me solution to maintain production (current+7days) and archive(current+history) without any etl/scheduler

But there is no feature available in any database

Sridhar
Opentext

On 10 Jun 2016 19:03, "Craig Ringer" <craig@2ndquadrant.com> wrote:
On 10 June 2016 at 18:56, John R Pierce <pierce@hogranch.com> wrote:
On 6/10/2016 2:18 AM, Sridhar N Bamandlapally wrote:
This/These will be performed in Production to clean-up archive which will not be sync with Archive/DW DB only

one heads-up is Archive/DW DB may need to build WITHOUT CONSTRAINTS

May need to introduce ARCHIVE system/tag in pg_hba.conf

there's a whole lot of implied magic here unless you want to get way more specific what these features do, exactly, under all possible conditions.

I'd go further than that and say I can't see how something like this could possibly work with physical (block based) replication. It's total hand-waving.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Online DW

От
Francisco Olarte
Дата:
I may be wrong but ...

On Fri, Jun 10, 2016 at 6:33 PM, Sridhar N Bamandlapally
<sridhar.bn1@gmail.com> wrote:
> One thing we can restrict to "begin noarchive" transaction block are DELETE
> and SELECT only
> On 10 Jun 2016 21:57, "Sridhar N Bamandlapally" <sridhar.bn1@gmail.com>
> wrote:
>> This is what I feel will give me solution to maintain production
>> (current+7days) and archive(current+history) without any etl/scheduler

It seems ( to me ) you have a grossly underspecificied ( in the list )
problem and you have invented an even more underspecified keyword to
magically solve it, and expect someone to develop it.

And you haven't even bothered to avoid top posting, which is frowned
upon on this list and makes infering your problems solution even more
difficult.

If you want to have something like this you'll need to post much more
details on what you are proposing, what are the use cases for the
general public, etc.. Just eyeballing it I would estimate this will
need many pages just to state the problems and the intended semantics
of your proposal.

Regards.
   Francisco Olarte.


Re: [HACKERS] Online DW

От
Melvin Davidson
Дата:

On Fri, Jun 10, 2016 at 12:49 PM, Francisco Olarte <folarte@peoplecall.com> wrote:
I may be wrong but ...

On Fri, Jun 10, 2016 at 6:33 PM, Sridhar N Bamandlapally
<sridhar.bn1@gmail.com> wrote:
> One thing we can restrict to "begin noarchive" transaction block are DELETE
> and SELECT only
> On 10 Jun 2016 21:57, "Sridhar N Bamandlapally" <sridhar.bn1@gmail.com>
> wrote:
>> This is what I feel will give me solution to maintain production
>> (current+7days) and archive(current+history) without any etl/scheduler

It seems ( to me ) you have a grossly underspecificied ( in the list )
problem and you have invented an even more underspecified keyword to
magically solve it, and expect someone to develop it.

And you haven't even bothered to avoid top posting, which is frowned
upon on this list and makes infering your problems solution even more
difficult.

If you want to have something like this you'll need to post much more
details on what you are proposing, what are the use cases for the
general public, etc.. Just eyeballing it I would estimate this will
need many pages just to state the problems and the intended semantics
of your proposal.

Regards.
   Francisco Olarte.


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Please, you should specify your PostgreSQL version and O/S for questions like this.

As a solution to your problem, have you considered using a VIEW with a WHERE clause  similar to
WHERE your_date > current_date - interval '7 days' ?

--
Melvin Davidson
I reserve the right to fantasize.  Whether or not you
wish to share my fantasy is entirely up to you.

Re: [HACKERS] Online DW

От
"David G. Johnston"
Дата:
On Fri, Jun 10, 2016 at 4:11 AM, Sridhar N Bamandlapally <sridhar.bn1@gmail.com> wrote:
Hi

Is there any feature in PostgreSQL where online DW (Dataware housing) is possible ?

am looking for scenario like

1. Production DB will have CURRENT + LAST 7 DAYS data only

2. Archive/DW DB will have CURRENT + COMPLETE HISTORY

expecting something like streaming, but not ETL


​The entire DB couldn't operate this way since not every record has a concept of time and if you use any kind of physical time you are going to have issues as well.

First impression is you want to horizontally partition your "time-impacted" tables so that each partition contains only data having the same ISO Week number in the same ISO Year.

Remove older tables from the inheritance and stick them on a separate tablespace and/or stream them to another database.

As has been mentioned there are various tools out there today that can likely be used to fulfill whatever fundamental need you have.  "Not ETL" is not a need though, its at best a "nice-to-have" unless you are willing to forgo any solution to your larger problem just because the implementation is not optimal.

Unless you define your true goals and constraints its going to be hard to make recommendations.

David J.

Re: [HACKERS] Online DW

От
Dan Wierenga
Дата:


On Jun 10, 2016, at 9:27 AM, Sridhar N Bamandlapally <sridhar.bn1@gmail.com> wrote:

This is what I feel will give me solution to maintain production (current+7days) and archive(current+history) without any etl/scheduler

But there is no feature available in any database

Sridhar
Opentext 


If that's what you want, in any database, then MySQL's "SET sql_log_bin" statement could probably be used to accomplish what you want.

But, the engineering effort needed to get streaming replication to populate your data warehouse is probably much higher than just building an ETL job. Are you so certain your situation is so special that you can't use what the rest of the industry uses?

-Dan

Re: [HACKERS] Online DW

От
Sridhar N Bamandlapally
Дата:

Ok, let me put this way,

I need every transaction coming from application sync with both production and archive db,
but the transactions I do to clean old data(before 7 days) on production db in daily maintenance window should not sync with archive db,

Archive db need read-only, used for maintaining integrity with other business applications

Issue here is,
1. etl is scheduler, cannot run on every transaction, even if it does, its expensive

2. Materialize view(refresh on commit) or slony, will also sync clean-up transactions

3. Replication is not archive, definitely not option

I say, every online archive db is use case for this.

Thanks
Sridhar
Opentext


On 10 Jun 2016 22:36, "David G. Johnston" <david.g.johnston@gmail.com> wrote:
On Fri, Jun 10, 2016 at 4:11 AM, Sridhar N Bamandlapally <sridhar.bn1@gmail.com> wrote:
Hi

Is there any feature in PostgreSQL where online DW (Dataware housing) is possible ?

am looking for scenario like

1. Production DB will have CURRENT + LAST 7 DAYS data only

2. Archive/DW DB will have CURRENT + COMPLETE HISTORY

expecting something like streaming, but not ETL


​The entire DB couldn't operate this way since not every record has a concept of time and if you use any kind of physical time you are going to have issues as well.

First impression is you want to horizontally partition your "time-impacted" tables so that each partition contains only data having the same ISO Week number in the same ISO Year.

Remove older tables from the inheritance and stick them on a separate tablespace and/or stream them to another database.

As has been mentioned there are various tools out there today that can likely be used to fulfill whatever fundamental need you have.  "Not ETL" is not a need though, its at best a "nice-to-have" unless you are willing to forgo any solution to your larger problem just because the implementation is not optimal.

Unless you define your true goals and constraints its going to be hard to make recommendations.

David J.

Re: [HACKERS] Online DW

От
Eduardo Morras
Дата:
On Sat, 11 Jun 2016 09:59:59 +0530
Sridhar N Bamandlapally <sridhar.bn1@gmail.com> wrote:

> Ok, let me put this way,
>
> I need every transaction coming from application sync with both
> production and archive db,
> but the transactions I do to clean old data(before 7 days) on
> production db in daily maintenance window should not sync with
> archive db,
>
> Archive db need read-only, used for maintaining integrity with other
> business applications
>
> Issue here is,
> 1. etl is scheduler, cannot run on every transaction, even if it
> does, its expensive
>
> 2. Materialize view(refresh on commit) or slony, will also sync
> clean-up transactions
>
> 3. Replication is not archive, definitely not option
>
> I say, every online archive db is use case for this.

As I see, you have 2 options (A, and B)

A) With FDW

1) Create 2 databases on production:
  a) first, a normal database for production, with 7 last days data,
  b) second, as postgres_fdw remote database on archive server.

https://www.postgresql.org/docs/current/static/postgres-fdw.html

2) Make a trigger on each table in production database to replicate
inserts, updates and deletes to fdw remote tables.

3) At your scheduled time, truncate tables (don't delete content or the
trigger will fire up) on your production db.


Note, make date part of your primary keys, if not, the same pk may be
reused on production tables and the insert on archive fdw tables will
fail.


B) Use pgpoolII, use replication to store the same data on both
databases. On production db, delete old data. Don't connect both
databases with streaming replication nor slony or anything else,
pgpoolII will do the work.

http://pgpool.net/mediawiki/index.php/Main_Page
http://www.pgpool.net/docs/latest/tutorial-en.html#replication-config

Read full documentation, above url cites point directly to the replication part but read full documentation.

HTH

> Thanks
> Sridhar
> Opentext


---   ---
Eduardo Morras <emorrasg@yahoo.es>


Re: [HACKERS] Online DW

От
Craig Ringer
Дата:
On 11 June 2016 at 12:29, Sridhar N Bamandlapally <sridhar.bn1@gmail.com> wrote:

Ok, let me put this way,

I need every transaction coming from application sync with both production and archive db,
but the transactions I do to clean old data(before 7 days) on production db in daily maintenance window should not sync with archive db,


Would you please remove pgsql-hackers from the Cc list for future replies? This isn't specific discussion of a concrete feature patch or detailed design or bug, so it's more appropriate for pgsql-general.

I think everyone's understood what you want, but you can't just throw random syntax at the wall and expect someone to magically make it work. What you propose will not work with PostgreSQL's block based replication model. It just can't. You can't do this with built-in streaming replication.

You can, however, do roughly what you want with Londiste by (ab)using session_replication_role so that only some transactions get replicated. It is something that would be reasonably practical to add to pglogical too; in fact, you can already set pglogical up so it only replicates INSERTs and UPDATEs but not DELETEs, there's just no per-session/per-transaction/per-statement control over that yet. You could probably achieve it fairly easily with replication origin filtering and a custom replication origin for things you want to skip replication for though.

I say, every online archive db is use case for this.


Yeah, I don't think anyone's saying it wouldn't be useful. The point is that your proposal is extremely vague and lacks any useful detail, let alone a proposed implementation or a patch. I wish PostgreSQL had autonomous transactions, a shared-across-sessions JVM or Mono runtime with first-class Java or C# support, built-in commands to dump table definitions over SQL, and lots more, but it does no good to talk about them unless I'm willing to implement them or find someone else who will.

You are being pointed to alternatives that would meet your needs, but seem to be ignoring that because it's not the solution you have already decided you need for your problem.

I doubt anybody will implement this for you, especially since I don't think it's really possible in PostgreSQL's block-based physical replication architecture. So saying what you want repeatedly probably won't achieve anything. What are you going to do about it?


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Online DW

От
Eduardo Morras
Дата:
On Sat, 11 Jun 2016 09:59:59 +0530
Sridhar N Bamandlapally <sridhar.bn1@gmail.com> wrote:

> Ok, let me put this way,
>
> I need every transaction coming from application sync with both
> production and archive db,
> but the transactions I do to clean old data(before 7 days) on
> production db in daily maintenance window should not sync with
> archive db,
>
> Archive db need read-only, used for maintaining integrity with other
> business applications
>
> Issue here is,
> 1. etl is scheduler, cannot run on every transaction, even if it
> does, its expensive
>
> 2. Materialize view(refresh on commit) or slony, will also sync
> clean-up transactions
>
> 3. Replication is not archive, definitely not option
>
> I say, every online archive db is use case for this.

As I see, you have 2 options (A, and B)

A) With FDW

1) Create 2 databases on production:
  a) first, a normal database for production, with 7 last days data,
  b) second, as postgres_fdw remote database on archive server.

https://www.postgresql.org/docs/current/static/postgres-fdw.html

2) Make a trigger on each table in production database to replicate
inserts, updates and deletes to fdw remote tables.

3) At your scheduled time, truncate tables (don't delete content or the
trigger will fire up) on your production db.


Note, make date part of your primary keys, if not, the same pk may be
reused on production tables and the insert on archive fdw tables will
fail.


B) Use pgpoolII, use replication to store the same data on both
databases. On production db, delete old data. Don't connect both
databases with streaming replication nor slony or anything else,
pgpoolII will do the work.

http://pgpool.net/mediawiki/index.php/Main_Page
http://www.pgpool.net/docs/latest/tutorial-en.html#replication-config

Read full documentation, above url cites point directly to the replication part but read full documentation.

HTH

> Thanks
> Sridhar
> Opentext


---   ---
Eduardo Morras <emorrasg@yahoo.es>