Discussion: proposals for LLL, part 1
Ok, I'm not sure that LLL will appear in 6.4, but it's a good time to discuss it.

First, PostgreSQL is a multi-version system due to its non-overwriting storage manager. And so, the first proposal is to use this feature (multi-versioning) in the LLL implementation.

In multi-version systems, access methods don't use locks to read consistent data, and so readers don't block writers, writers don't block readers, and only same-row writers block writers. In such systems access methods return a snapshot of the data as they were at _some_ point in time. For the read committed isolation level this moment is the time when the statement began. For the serialized isolation level this is the time when the current transaction began.

Oracle uses rollback segments to reconstruct blocks that were changed after the statement/transaction began, and so a statement sees only data that were committed by then.

In our case we have to analyze tuple xmin/xmax to determine _when_ the corresponding transaction was committed in regard to the last transaction (LCX) that was committed when the statement/transaction began.

If xmin/xmax was committed before LCX then the tuple insertion/deletion is visible to the statement, else it is not visible.

To achieve this, the second proposal is to use a special SCN - System Change Number (C) Oracle :) - that will be incremented by 1 on each transaction commit. Each committed transaction will have a corresponding SCN (4 bytes -> equal to sizeof XID).

We have to keep the XID --> SCN mapping as long as there is a running transaction that is "interested" in that XID: when a transaction begins, it will determine the first (the oldest) running transaction's XID, and this will be the minimum XID whose SCN the transaction would like to know.

Access methods will have to determine the SCN for xmin/xmax only if FRX <= xmin/xmax <= LSX, where FRX is the XID of the first (oldest) running transaction and LSX is the last started transaction - at the moment when the statement (for read committed) or transaction (for serialized) began. For such xmin/xmax, their SCNs will be compared with the SCN determined at the moment the statement/transaction began...

Changes made by xmin/xmax < FRX are visible to the statement/transaction, and changes made by xmin/xmax > LSX are not visible - without any xmin/xmax SCN lookup.

For the XID --> SCN mapping I propose to use the simplest scheme: an ordered queue of SCNs (or something like this) - i.e. keep SCNs for all transactions, from the first one whose SCN could be required by some running transaction to the last started one.

This queue must be shared!

The size of this queue and the average number of commits/aborts per second will define how long transactions will be able to run. 30 xacts/sec and a 400K queue will allow 30 - 60 minute running transactions...

Keeping the queue in shared memory may be unacceptable in some cases... mmap or a shared buffer pool could be used to access the queue. We'll see...

Also note that Oracle has a special READ ONLY transaction mode. READ ONLY transactions are disallowed from changing anything in the database. This is a good mode for pg_dump (etc.) long-running applications. Because no one will be "interested" in the SCN of a READ ONLY transaction, such transactions can make a private copy of the relevant part of the queue, after which the queue could be truncated...

Having 4 bytes per SCN enables using special values to mark the corresponding transaction as running or aborted, avoiding a pg_log lookup when we need both the SCN and the state of a transaction.

...Well, it's time to sleep :)

To be continued...

Comments ?

Vadim
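A minimal sketch of the visibility test this scheme implies is below. All names here (SnapshotData, XidToScn, and so on) are illustrative stand-ins, not actual PostgreSQL symbols, and the shared SCN queue lookup is assumed rather than implemented:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t Xid;
    typedef uint32_t Scn;

    /* Snapshot state captured when the statement (read committed) or
     * transaction (serialized) begins. */
    typedef struct SnapshotData
    {
        Xid frx;    /* FRX: XID of the oldest transaction still running */
        Xid lsx;    /* LSX: XID of the last transaction started */
        Scn scn;    /* SCN of the last committed xact (LCX) at snapshot time */
    } SnapshotData;

    /* Assumed lookup into the shared, ordered queue of SCNs.  Special
     * return values could mark the xact as still running or aborted,
     * avoiding a separate pg_log lookup, as proposed above. */
    extern Scn XidToScn(Xid xid);

    /* Is a change made by `xid` (an xmin or xmax of a committed xact)
     * visible to the snapshot? */
    bool
    XidIsVisible(Xid xid, const SnapshotData *snap)
    {
        if (xid < snap->frx)
            return true;        /* committed before the oldest running xact */
        if (xid > snap->lsx)
            return false;       /* started after the snapshot was taken */
        /* Only inside [FRX, LSX] is the XID --> SCN lookup needed. */
        return XidToScn(xid) <= snap->scn;
    }

The point of the FRX/LSX bounds is that the (potentially contended) shared-queue lookup is needed only for the narrow window of transactions that were in flight around snapshot time.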
I am retaining your entire message here for reference.

I have a good solution for this. It will require only 4k of shared memory, and will have no restrictions on the age or number of transactions.

First, I think we only want to implement the "read committed isolation level", not serialized. Not sure why someone would want serialized.

OK, when a backend is looking at a row that has been committed, it must decide if the row was committed before or after my transaction started. If the transaction commit id (xmin) is greater than our current xid, we know we should not look at it, because it is for a transaction that started after our own transaction.

The problem is transactions started before our own (having xmins less than our own) that may have committed before or after our transaction.

Here is my idea. We add a field to the shared memory Proc structure that can contain up to 32 transaction ids. When a transaction starts, we spin through all other open Proc structures and record all currently-running transaction ids in our own Proc field used to store up to 32 transaction ids. While we do this, we remember the lowest of these open transaction ids.

This is our snapshot of current transactions at the time our transaction starts. While analyzing a row, if its xmin is greater than our transaction id, then that transaction was not even started before our transaction. If the xmin is lower than the min transaction id that we remembered from the Proc structures, it was committed before our transaction started. If it is greater than or equal to the min remembered transaction id, we must spin through our stored transaction ids. If it is in the stored list, we don't look at the row, because that transaction was not committed when we started our transaction. If it is not in the list, it must have been committed before our transaction started. We know this because any backend starting a transaction after ours would get a transaction id higher than ours.

Comments?

> Ok, I'm not sure that LLL will appear in 6.4, but it's a good time to
> discuss it.
>
> First, PostgreSQL is a multi-version system due to its
> non-overwriting storage manager. And so, the first proposal is to use
> this feature (multi-versioning) in the LLL implementation.
>
> In multi-version systems, access methods don't use locks to read
> consistent data, and so readers don't block writers, writers don't
> block readers, and only same-row writers block writers. In such
> systems access methods return a snapshot of the data as they were at
> _some_ point in time. For the read committed isolation level this
> moment is the time when the statement began. For the serialized
> isolation level this is the time when the current transaction began.
>
> Oracle uses rollback segments to reconstruct blocks that were
> changed after the statement/transaction began, and so a statement sees
> only data that were committed by then.
>
> In our case we have to analyze tuple xmin/xmax to determine _when_
> the corresponding transaction was committed in regard to the last
> transaction (LCX) that was committed when the statement/transaction
> began.
>
> If xmin/xmax was committed before LCX then the tuple
> insertion/deletion is visible to the statement, else it is not visible.
>
> To achieve this, the second proposal is to use a special SCN -
> System Change Number (C) Oracle :) - that will be incremented by 1
> on each transaction commit. Each committed transaction will have
> a corresponding SCN (4 bytes -> equal to sizeof XID).
>
> We have to keep the XID --> SCN mapping as long as there is a running
> transaction that is "interested" in that XID: when a transaction begins,
> it will determine the first (the oldest) running transaction's XID,
> and this will be the minimum XID whose SCN the transaction would like
> to know.
>
> Access methods will have to determine the SCN for xmin/xmax only if
> FRX <= xmin/xmax <= LSX, where FRX is the XID of the first (oldest)
> running transaction and LSX is the last started transaction - at the
> moment when the statement (for read committed) or transaction (for
> serialized) began. For such xmin/xmax, their SCNs will be compared
> with the SCN determined at the moment the statement/transaction
> began...
>
> Changes made by xmin/xmax < FRX are visible to the
> statement/transaction, and changes made by xmin/xmax > LSX are not
> visible - without any xmin/xmax SCN lookup.
>
> For the XID --> SCN mapping I propose to use the simplest scheme: an
> ordered queue of SCNs (or something like this) - i.e. keep SCNs
> for all transactions, from the first one whose SCN could be
> required by some running transaction to the last started one.
>
> This queue must be shared!
>
> The size of this queue and the average number of commits/aborts per
> second will define how long transactions will be able to run. 30
> xacts/sec and a 400K queue will allow 30 - 60 minute running
> transactions...
>
> Keeping the queue in shared memory may be unacceptable in some
> cases... mmap or a shared buffer pool could be used to access the
> queue. We'll see...
>
> Also note that Oracle has a special READ ONLY transaction mode.
> READ ONLY transactions are disallowed from changing anything in the
> database. This is a good mode for pg_dump (etc.) long-running
> applications. Because no one will be "interested" in the SCN of a
> READ ONLY transaction, such transactions can make a private copy
> of the relevant part of the queue, after which the queue could be
> truncated...
>
> Having 4 bytes per SCN enables using special values to mark the
> corresponding transaction as running or aborted, avoiding a pg_log
> lookup when we need both the SCN and the state of a transaction.
>
> ...Well, it's time to sleep :)
>
> To be continued...
>
> Comments ?
>
> Vadim

--
Bruce Momjian                        |  830 Blythe Avenue
maillist@candle.pha.pa.us            |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,   |  (610) 353-9879(w)
  +  Christ can be your backup.      |  (610) 853-3000(h)
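For concreteness, here is a rough sketch of the snapshot structure and visibility test Bruce describes, under the assumption that aborted transactions are filtered out separately via pg_log; all type and function names are made up for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t Xid;

    #define MAX_SNAPSHOT_XIDS 32    /* the 32-entry field Bruce proposes */

    typedef struct SnapshotData
    {
        Xid my_xid;                    /* our own transaction id */
        Xid min_xid;                   /* lowest xid seen in the Proc scan */
        int nxids;                     /* number of xids recorded */
        Xid xids[MAX_SNAPSHOT_XIDS];   /* xids running when we started */
    } SnapshotData;

    /* Visibility test for a committed xmin; an aborted transaction is
     * assumed to have been rejected already via a pg_log check. */
    bool
    XidIsVisible(Xid xmin, const SnapshotData *snap)
    {
        int i;

        if (xmin == snap->my_xid)
            return true;               /* our own change */
        if (xmin > snap->my_xid)
            return false;              /* started after our transaction */
        if (xmin < snap->min_xid)
            return true;               /* committed before we started */
        for (i = 0; i < snap->nxids; i++)
        {
            if (snap->xids[i] == xmin)
                return false;          /* still running when we started */
        }
        return true;                   /* not in list => committed before us */
    }

The min_xid test is the fast path: only xids in the narrow range [min_xid, my_xid) ever require scanning the stored list.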
Bruce Momjian wrote:
>
> I am retaining your entire message here for reference.
>
> I have a good solution for this. It will require only 4k of shared
> memory, and will have no restrictions on the age or number of
> transactions.
>
> First, I think we only want to implement the "read committed isolation
> level", not serialized. Not sure why someone would want serialized.

Serialized is the DEFAULT isolation level in the standard. It must be implemented. Would you like inconsistent results from pg_dump, etc.?

>
> OK, when a backend is looking at a row that has been committed, it must
> decide if the row was committed before or after my transaction started.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> If the transaction commit id (xmin) is greater than our current xid, we
> know we should not look at it, because it is for a transaction that
> started after our own transaction.

That's right for serialized, not for read committed. In read committed mode, the backend must decide if the row was committed before or after the STATEMENT started...

>
> The problem is transactions started before our own (having xmins less
> than our own) that may have committed before or after our transaction.
>
> Here is my idea. We add a field to the shared memory Proc structure
> that can contain up to 32 transaction ids. When a transaction starts,
> we spin through all other open Proc structures and record all
> currently-running transaction ids in our own Proc field used to store
> up to 32 transaction ids. While we do this, we remember the lowest of
> these open transaction ids.
>
> This is our snapshot of current transactions at the time our
> transaction starts. While analyzing a row, if its xmin is greater than
> our transaction id, then that transaction was not even started before
> our transaction. If the xmin is lower than the min transaction id that
> we remembered from the Proc structures, it was committed before our
> transaction started. If it is greater than or equal to the min
> remembered transaction id, we must spin through our stored transaction
> ids. If it is in the stored list, we don't look at the row, because
> that transaction was not committed when we started our transaction. If
> it is not in the list, it must have been committed before our
> transaction started. We know this because any backend starting a
> transaction after ours would get a transaction id higher than ours.

Yes, this is the way.

But first, why should we store the running transaction xids in shmem? Who is interested in these xids? We have to store in shmem only the min of these xids: vacuum must not delete rows deleted by transactions with xids greater than (or equal to) this min xid, or we risk getting inconsistent results...

Also, as you see, we have to lock the Proc structures in shmem to get the list of xids for each statement in read committed mode...

I don't know which way is better, but using a list of xids is much easier to implement...

Vadim
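The read committed vs. serialized distinction Vadim draws comes down to when the snapshot is taken, not how. A sketch of that dispatch, assuming a TakeSnapshot() that scans the shared Proc array (names are again hypothetical):

    typedef struct SnapshotData SnapshotData;   /* as in the sketch above */

    typedef enum { READ_COMMITTED, SERIALIZED } IsoLevel;

    extern IsoLevel      XactIsoLevel;          /* current isolation level */
    extern SnapshotData *TransactionSnapshot;   /* taken once, at xact start */
    extern SnapshotData *TakeSnapshot(void);    /* scans the shared Proc array */

    static SnapshotData *CurrentSnapshot;       /* what visibility checks use */

    /* Called at the start of every statement. */
    void
    AtStatementStart(void)
    {
        if (XactIsoLevel == READ_COMMITTED)
            CurrentSnapshot = TakeSnapshot();        /* fresh per statement */
        else
            CurrentSnapshot = TransactionSnapshot;   /* fixed for whole xact */
    }

This also makes Vadim's locking concern concrete: under read committed, the Proc-array scan (and whatever lock protects it) happens once per statement, not once per transaction.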
> Bruce Momjian wrote:
> >
> > I am retaining your entire message here for reference.
> >
> > I have a good solution for this. It will require only 4k of shared
> > memory, and will have no restrictions on the age or number of
> > transactions.
> >
> > First, I think we only want to implement the "read committed isolation
> > level", not serialized. Not sure why someone would want serialized.
>
> Serialized is the DEFAULT isolation level in the standard.
> It must be implemented. Would you like inconsistent results
> from pg_dump, etc.?

OK, I didn't know that.

> >
> > OK, when a backend is looking at a row that has been committed, it must
> > decide if the row was committed before or after my transaction started.
>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > If the transaction commit id (xmin) is greater than our current xid, we
> > know we should not look at it, because it is for a transaction that
> > started after our own transaction.
>
> That's right for serialized, not for read committed.
> In read committed mode, the backend must decide if the row was
> committed before or after the STATEMENT started...

OK.

> >
> > The problem is transactions started before our own (having xmins less
> > than our own) that may have committed before or after our transaction.
> >
> > Here is my idea. We add a field to the shared memory Proc structure
> > that can contain up to 32 transaction ids. When a transaction starts,
> > we spin through all other open Proc structures and record all
> > currently-running transaction ids in our own Proc field used to store
> > up to 32 transaction ids. While we do this, we remember the lowest of
> > these open transaction ids.
> >
> > This is our snapshot of current transactions at the time our
> > transaction starts. While analyzing a row, if its xmin is greater than
> > our transaction id, then that transaction was not even started before
> > our transaction. If the xmin is lower than the min transaction id that
> > we remembered from the Proc structures, it was committed before our
> > transaction started. If it is greater than or equal to the min
> > remembered transaction id, we must spin through our stored transaction
> > ids. If it is in the stored list, we don't look at the row, because
> > that transaction was not committed when we started our transaction. If
> > it is not in the list, it must have been committed before our
> > transaction started. We know this because any backend starting a
> > transaction after ours would get a transaction id higher than ours.
>
> Yes, this is the way.
> But first, why should we store the running transaction xids in shmem?
> Who is interested in these xids?
> We have to store in shmem only the min of these xids: vacuum must
> not delete rows deleted by transactions with xids greater than
> (or equal to) this min xid, or we risk getting inconsistent
> results...
> Also, as you see, we have to lock the Proc structures in shmem
> to get the list of xids for each statement in read committed
> mode...

You are correct. We need to lock the Proc structures during our scan, but we don't need to keep the list in shared memory. No reason to do it. Do we have to keep the Procs locked while we get our table data locks? I sure hope not. Not sure how we are going to prevent someone from committing their transaction between our Proc scan and when we start our transaction. Not even sure if I should be worried about that.

> I don't know which way is better, but using a list of xids
> is much easier to implement...

Sure, do a list.
Getting the min allows you to reduce the number of times the stored xid list has to be scanned. I had not thought about vacuum, but keeping the min in shared memory will certainly fix that issue.

--
Bruce Momjian                        |  830 Blythe Avenue
maillist@candle.pha.pa.us            |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,   |  (610) 353-9879(w)
  +  Christ can be your backup.      |  (610) 853-3000(h)
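Spelling out the vacuum interaction: with the global minimum kept in shared memory, vacuum's reclaim test could reduce to something like the following sketch (SharedMinXid and XidDidCommit are assumed names, not real symbols):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t Xid;

    extern Xid  SharedMinXid;           /* oldest min xid of any open
                                           snapshot, kept in shared memory */
    extern bool XidDidCommit(Xid xid);  /* pg_log lookup */

    /* May vacuum physically remove a tuple deleted by `xmax`?  Only if
     * the deleting transaction committed AND no running transaction
     * could still need to see the old version. */
    bool
    TupleIsReclaimable(Xid xmax)
    {
        return XidDidCommit(xmax) && xmax < SharedMinXid;
    }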
Bruce Momjian wrote:
>
> You are correct. We need to lock the Proc structures during our scan,
> but we don't need to keep the list in shared memory. No reason to do
> it. Do we have to keep the Procs locked while we get our table data
> locks? I sure hope not.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

No! Only while we are scanning the Procs...

> Not sure how we are going to prevent someone from committing their
> transaction between our Proc scan and when we start our transaction.
> Not even sure if I should be worried about that.

We shouldn't... It doesn't matter.

Vadim
> Bruce Momjian wrote:
> >
> > You are correct. We need to lock the Proc structures during our scan,
> > but we don't need to keep the list in shared memory. No reason to do
> > it. Do we have to keep the Procs locked while we get our table data
> > locks? I sure hope not.
>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> No! Only while we are scanning the Procs...
>
> > Not sure how we are going to prevent someone from committing their
> > transaction between our Proc scan and when we start our transaction.
> > Not even sure if I should be worried about that.
>
> We shouldn't... It doesn't matter.

One more item. If we don't lock Proc between the scan and our acquisition of a transaction id, it is possible some other backend will get a transaction id between the time we scan the Proc structures and the time we get our own transaction id, causing us to look at rows that are part of a non-committed transaction. I think we have to get our transaction id first, before scanning Proc. There is definitely an area of vulnerability there. I am now wondering how much we need to lock Proc during the scan.

--
Bruce Momjian                        |  830 Blythe Avenue
maillist@candle.pha.pa.us            |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,   |  (610) 353-9879(w)
  +  Christ can be your backup.      |  (610) 853-3000(h)
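Putting Bruce's ordering fix together with Vadim's minimal lock scope, snapshot acquisition might look like the sketch below. Everything here (ProcArray, LockProcs, GetNewTransactionId) is a stand-in for whatever the real structures end up being, and overflow of the 32-entry list is not handled:

    #include <stdint.h>

    typedef uint32_t Xid;

    #define InvalidXid        ((Xid) 0)
    #define MAX_SNAPSHOT_XIDS 32

    typedef struct SnapshotData
    {
        Xid my_xid;
        Xid min_xid;
        int nxids;
        Xid xids[MAX_SNAPSHOT_XIDS];
    } SnapshotData;

    typedef struct Proc { Xid xid; /* ... */ } Proc;

    extern Proc  ProcArray[];             /* shared memory, one per backend */
    extern int   MaxBackends;
    extern Xid   GetNewTransactionId(void);
    extern void  LockProcs(void);         /* stand-in spinlock calls */
    extern void  UnlockProcs(void);

    void
    TakeSnapshot(SnapshotData *snap)
    {
        int i;

        /* Get our own xid FIRST: any transaction starting after this
         * point gets a higher xid than ours and is screened out by the
         * xid > my_xid test, closing the window Bruce describes. */
        snap->my_xid  = GetNewTransactionId();
        snap->min_xid = snap->my_xid;
        snap->nxids   = 0;

        LockProcs();                      /* held only for the scan */
        for (i = 0; i < MaxBackends; i++)
        {
            Xid xid = ProcArray[i].xid;

            if (xid == InvalidXid || xid == snap->my_xid)
                continue;                 /* idle slot, or ourselves */
            snap->xids[snap->nxids++] = xid;   /* overflow check omitted */
            if (xid < snap->min_xid)
                snap->min_xid = xid;      /* remember the oldest */
        }
        UnlockProcs();

        /* A commit after the scan is harmless: that xid is already in
         * our list, so its changes simply stay invisible to us. */
    }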