Discussion: Arbitrary tuple size


Arbitrary tuple size

From:
wieck@debis.com (Jan Wieck)
Date:
Well,

    doing  arbitrary tuple size should be as generic as possible.
    Thus I think the best place to do it is down  in  the  heapam
    routines  (heap_fetch(), heap_getnext(), heap_insert(), ...).
    I'm not 100% sure but nothing should access a  heap  relation
    going  around  them.  Anyway,  if there are places, then it's
    time to clean them up.

    What about adding  one  more  ItemPointerData  to  the  tuple
    header  which holds the ctid of a DATA continuation tuple. If
    a tuple doesn't fit into one block, this will tell  where  to
    get  the  next  chunk of tuple data building a chain until an
    invalid ctid is found. The continuation  tuples  can  have  a
    negative  t_natts  to  be  easily  identified  and ignored by
    scanning routines.
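A miniature of the chained layout Jan describes, with every name invented for illustration (a plain int index stands in for ItemPointerData, and -1 plays the role of an invalid ctid):

```c
#include <assert.h>

/* Toy model of the proposal: every tuple header grows one extra "ctid"
 * pointing at a DATA continuation chunk, and a negative t_natts marks
 * continuation chunks so scanning routines can skip them.  The int index
 * stands in for a real ItemPointerData; -1 is the "invalid ctid". */
typedef struct Chunk {
    int t_natts;   /* negative: this is a continuation chunk   */
    int next;      /* "ctid" of the next chunk, -1 terminates  */
    int len;       /* bytes of tuple data stored in this chunk */
} Chunk;

/* Follow the chain from the head tuple, summing up the data that a
 * heapam fetch routine would paste back together into one tuple. */
static int chain_total_len(const Chunk *chunks, int head)
{
    int total = 0;
    for (int i = head; i != -1; i = chunks[i].next)
        total += chunks[i].len;
    return total;
}
```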

    By doing it this way we could also squeeze out some  currently
    wasted space.  All tuples that get inserted/updated are added
    to the end of the relation.  If a tuple currently doesn't fit
    into  the  freespace of the actual last block, that freespace
    is wasted and the tuple is placed into a new allocated  block
    at  the  end.  So  if  there is 5K freespace and another 5.5K
    tuple is added, the relation grows effectively by 10.5K!
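The space-waste arithmetic above can be written down directly (byte figures chosen to match the 5K/5.5K example):

```c
#include <assert.h>

/* When a tuple doesn't fit the freespace of the last block, that
 * freespace is abandoned and a fresh block holds the tuple, so the
 * relation effectively grows by both amounts. */
static int effective_growth(int freespace, int tuple_len)
{
    if (tuple_len <= freespace)
        return tuple_len;             /* fits: grow by the tuple only   */
    return freespace + tuple_len;     /* wasted tail plus the new tuple */
}
```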

    I'm not sure how to handle this with vacuum,  but  I  believe
    Vadim is able to put some well placed goto's that make it.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

Re: [HACKERS] Arbitrary tuple size

From:
Vadim Mikheev
Date:
Jan Wieck wrote:
> 
>     What about adding  one  more  ItemPointerData  to  the  tuple
>     header  which holds the ctid of a DATA continuation tuple. If

Oh no. Fortunately we don't need this: we can just add a new flag
to t_infomask and put the continuation tid at the end of the tuple chunk.
Ok?
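Vadim's variant avoids any new header field: only split tuples pay for the pointer. A sketch, with the flag bit and tid size invented here for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical t_infomask bit saying "the last sizeof(ItemPointerData)
 * bytes of this chunk are the continuation tid"; unsplit tuples carry
 * no extra bytes at all. */
#define HEAP_HASCONT  0x8000          /* invented flag bit          */
#define TID_SIZE      6               /* sizeof(ItemPointerData)    */

typedef struct MiniChunk {
    uint16_t t_infomask;
    int      t_len;                   /* total bytes in the chunk   */
} MiniChunk;

/* Usable tuple data in a chunk: everything except the trailing tid. */
static int payload_len(const MiniChunk *c)
{
    return (c->t_infomask & HEAP_HASCONT) ? c->t_len - TID_SIZE : c->t_len;
}
```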

>     a tuple doesn't fit into one block, this will tell  where  to
>     get  the  next  chunk of tuple data building a chain until an
>     invalid ctid is found. The continuation  tuples  can  have  a
>     negative  t_natts  to  be  easily  identified  and ignored by
>     scanning routines.
...
> 
>     I'm not sure how to handle this with vacuum,  but  I  believe
>     Vadim is able to put some well placed goto's that make it.

-:)))
Ok, ok - I have great number of goto-s in my pocket -:)

Vadim


Re: [HACKERS] Arbitrary tuple size

From:
Bruce Momjian
Date:
> Well,
> 
>     doing  arbitrary tuple size should be as generic as possible.
>     Thus I think the best place to do it is down  in  the  heapam
>     routines  (heap_fetch(), heap_getnext(), heap_insert(), ...).
>     I'm not 100% sure but nothing should access a  heap  relation
>     going  around  them.  Anyway,  if there are places, then it's
>     time to clean them up.
> 
>     What about adding  one  more  ItemPointerData  to  the  tuple
>     header  which holds the ctid of a DATA continuation tuple. If
>     a tuple doesn't fit into one block, this will tell  where  to
>     get  the  next  chunk of tuple data building a chain until an
>     invalid ctid is found. The continuation  tuples  can  have  a
>     negative  t_natts  to  be  easily  identified  and ignored by
>     scanning routines.
> 
>     By doing it this way we could also squeeze out some  currently
>     wasted space.  All tuples that get inserted/updated are added
>     to the end of the relation.  If a tuple currently doesn't fit
>     into  the  freespace of the actual last block, that freespace
>     is wasted and the tuple is placed into a new allocated  block
>     at  the  end.  So  if  there is 5K freespace and another 5.5K
>     tuple is added, the relation grows effectively by 10.5K!
> 
>     I'm not sure how to handle this with vacuum,  but  I  believe
>     Vadim is able to put some well placed goto's that make it.

I agree this is the way to go.  There is nothing I can think of that is
limited to how large a tuple can be.  It is just accessing it from the
heap routines that is the problem.  If the tuple is alloc'ed to be used,
we can paste together the parts on disk and return one tuple.  If they
are accessing the buffer copy directly, we would have to be smart about
going off the end of the disk copy and moving to the next segment.

The code is very clear now about accessing tuples or tuple copies.
Hopefully locking will not be an issue because you only need to lock the
main tuple.  No one is going to see the secondary part of the tuple.

If Vadim can do MVCC, he certainly can handle this, with the help of
goto.  :-)

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Arbitrary tuple size

From:
Bruce Momjian
Date:
> Jan Wieck wrote:
> > 
> >     What about adding  one  more  ItemPointerData  to  the  tuple
> >     header  which holds the ctid of a DATA continuation tuple. If
> 
> Oh no. Fortunately we need not in this: we can just add new flag
> to t_infomask and add continuation tid at the end of tuple chunk.
> Ok?

Sounds good.  I would rather not add stuff to the tuple header if we can
prevent it.

> >     I'm not sure how to handle this with vacuum,  but  I  believe
> >     Vadim is able to put some well placed goto's that make it.
> 
> -:)))
> Ok, ok - I have great number of goto-s in my pocket -:)

I can send you more.



Re: [HACKERS] Arbitrary tuple size

From:
wieck@debis.com (Jan Wieck)
Date:
Bruce Momjian wrote:

> > -:)))
> > Ok, ok - I have great number of goto-s in my pocket -:)
>
> I can send you more.

    I  have some cheap, spare longjmp()'s over here - anyone need
    them? :-)


Jan


Re: [HACKERS] Arbitrary tuple size

From:
wieck@debis.com (Jan Wieck)
Date:
Bruce Momjian wrote:

> >     What about adding  one  more  ItemPointerData  to  the  tuple
> >     header  which holds the ctid of a DATA continuation tuple. If
> >     a tuple doesn't fit into one block, this will tell  where  to
> >     get  the  next  chunk of tuple data building a chain until an
> >     invalid ctid is found. The continuation  tuples  can  have  a
> >     negative  t_natts  to  be  easily  identified  and ignored by
> >     scanning routines.

    Yes,  Vadim,  putting  a  flag into the bits already there to
    tell it is much better. The  information  that  a  particular
    tuple  is  an extension tuple should also go there instead of
    misusing t_natts.

>
> I agree this is the way to go.  There is nothing I can think of that is
> limited to how large a tuple can be.  It is just accessing it from the
> heap routines that is the problem.  If the tuple is alloc'ed to be used,
> we can paste together the parts on disk and return one tuple.  If they
> are accessing the buffer copy directly, we would have to be smart about
> going off the end of the disk copy and moving to the next segment.

    Who's accessing tuple attributes directly inside  the  buffer
    copy  (not  only the header which will still be unsplitted at
    the top of the chain)?

    Aren't these situations where it is done restricted to system
    catalogs?   I think we can live with the restriction that the
    tuple split  will  not  be  available  for  system  relations
    because  the  only place where the limit hit us is pg_rewrite
    and that can be handled by redesigning the storage  of  rules
    which is already required by the rule recompilation TODO.

    I  can't think that anywhere in the code a buffer from a user
    relation (except for sequences and that's another  story)  is
    accessed that clumsily.

>
> The code is very clear now about accessing tuples or tuple copies.
> Hopefully locking will not be an issue because you only need to lock the
> main tuple.  No one is going to see the secondary part of the tuple.



Jan


Re: [HACKERS] Arbitrary tuple size

From:
wieck@debis.com (Jan Wieck)
Date:
Bruce Momjian wrote:

> I agree this is the way to go.  There is nothing I can think of that is
> limited to how large a tuple can be.

    Outch - I can.

    Having  an  index  on a varlen field that now doesn't fit any
    more into an index block. Wouldn't this cause problems?  Well
    it's  bad  database  design to index fields that will receive
    that long  data  because  indexing  them  will  blow  up  the
    database but it must work anyway.


Jan


Re: [HACKERS] Arbitrary tuple size

From:
Tom Lane
Date:
wieck@debis.com (Jan Wieck) writes:
> Bruce Momjian wrote:
>> I agree this is the way to go.  There is nothing I can think of that is
>> limited to how large a tuple can be.

>     Outch - I can.

>     Having  an  index  on a varlen field that now doesn't fit any
>     more into an index block. Wouldn't this cause problems?

Aren't index tuples still tuples?  Can't they be split just like
regular tuples?
        regards, tom lane


Re: [HACKERS] Arbitrary tuple size

From:
wieck@debis.com (Jan Wieck)
Date:
Tom Lane wrote:

>
> wieck@debis.com (Jan Wieck) writes:
> > Bruce Momjian wrote:
> >> I agree this is the way to go.  There is nothing I can think of that is
> >> limited to how large a tuple can be.
>
> >     Outch - I can.
>
> >     Having  an  index  on a varlen field that now doesn't fit any
> >     more into an index block. Wouldn't this cause problems?
>
> Aren't index tuples still tuples?  Can't they be split just like
> regular tuples?

    Don't know, maybe.

    While  looking  for  some  places  where  tuple data might be
    accessed directly inside of the  buffers  I've  searched  for
    WriteBuffer() and friends. These are mostly used in the index
    access methods and some other places where I  expected  them,
    so  index  AM's  have  at  least to be carefully visited when
    implementing tuple split.


Jan


Re: [HACKERS] Arbitrary tuple size

From:
wieck@debis.com (Jan Wieck)
Date:
I wrote:

>
> Tom Lane wrote:
> >
> > Aren't index tuples still tuples?  Can't they be split just like
> > regular tuples?
>
>     Don't know, maybe.

    Actually   we   have  some  problems  with  indices  on  text
    attributes when the content exceeds HALF of the blocksize:

        FATAL 1:  btree: failed to add item to the page

    It crashes the backend AND seems to corrupt the index!  Looks
    to  me that at least the btree code needs to be able to store
    at minimum two items into one block and painfully fails if it
    can't.
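The failure is consistent with the invariant Jan names: a btree page must be able to hold at least two items, or a page split can make no progress. That caps a single index item at roughly half a block; a back-of-envelope check (the page-header size here is illustrative, not the real layout):

```c
#include <assert.h>

/* If a page must fit at least two items, one item can use at most half
 * the space left after the page header. */
static int max_index_item(int blcksz, int page_header)
{
    return (blcksz - page_header) / 2;
}
```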

    And just another one:

        pgsql=> create table t1 (a int4, b char(4000));
        CREATE
        pgsql=> create index t1_b on t1 (b);
        CREATE
        pgsql=> insert into t1 values (1, 'a');

        TRAP: Failed Assertion("!(( itid)->lp_flags & 0x01):",
                                    File: "nbtinsert.c", Line: 361)

    Bruce: One more TODO item!


Jan


Re: [HACKERS] Arbitrary tuple size

From:
Tatsuo Ishii
Date:
Going toward >8k tuples would be really good, but I suspect we may have
some difficulties with the LO stuff once we implement it. Also it seems
that it's not worth adapting LOs to the newly designed tuples.  I think
the design of the current LOs is so broken that we need to redesign them.

o it's slow: accessing a LO needs an open() that is not cheap; creating
many LOs makes the data/base/DBNAME/ directory fat.

o it consumes lots of i-nodes

o it breaks the tuple abstraction: this makes it difficult to maintain
the code.

I would propose the following for the new version of LO:

o create a new data type that represents the LO

o when defining the LO data type in a table, it actually points to a
LO "body" in another place where it is physically stored.

o the storage for LO bodies would be a hidden table that contains
several LOs, not a single one.

o we can have several tables for the LO bodies. Probably a LO body
table for each corresponding table (where the LO data type is defined)
is appropriate.

o it would be nice to place a LO table on a separate
directory/partition from the original table where LO data type is
defined, since a LO body table could become huge.
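The "hidden body table" idea above can be sketched in miniature: each LO becomes ordinary (loid, pageno, chunk) rows in one shared table, so MVCC and vacuum apply and no per-object file or inode is needed. All names here are invented:

```c
#include <assert.h>
#include <string.h>

/* One row of a hypothetical LO body table. */
typedef struct LoRow {
    int  loid;
    int  pageno;
    char chunk[16];
} LoRow;

/* Reassemble one LO by scanning the body table for its loid; rows are
 * assumed already sorted by pageno, as an index scan would return them. */
static void lo_read(const LoRow *rows, int nrows, int loid, char *out)
{
    out[0] = '\0';
    for (int i = 0; i < nrows; i++)
        if (rows[i].loid == loid)
            strcat(out, rows[i].chunk);
}
```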

Comments? Opinions?
---
Tatsuo Ishii


Re: [HACKERS] Arbitrary tuple size

From:
Vadim Mikheev
Date:
Jan Wieck wrote:
> 
> Bruce Momjian wrote:
> 
> > I agree this is the way to go.  There is nothing I can think of that is
> > limited to how large a tuple can be.
> 
>     Outch - I can.
> 
>     Having  an  index  on a varlen field that now doesn't fit any
>     more into an index block. Wouldn't this cause problems?  Well
>     it's  bad  database  design to index fields that will receive
>     that long  data  because  indexing  them  will  blow  up  the
>     database but it must work anyway.

Seems that in other DBMSes the length of an index tuple is more restricted
than that of a heap tuple. So I think we shouldn't worry about this case.

Vadim



Re: [HACKERS] Arbitrary tuple size

From:
Philip Warner
Date:
At 10:12 9/07/99 +0900, Tatsuo Ishii wrote:
>
>o create a new data type that represents the LO
>



>o when defining the LO data type in a table, it actually points to a
>LO "body" in another place where it is physically stored.

Much as the purist in me hates the concept of hard links (as in Leon's
suggestions), this *may* be a good application for them. Certainly that's
how Dec(Oracle)/Rdb does it. Since most blobs will be totally rewritten
when they are updated, this represents a slightly smaller problem in
terms of MVCC.
 

>o we can have several tables for the LO bodies. Probably a LO body
>table for each corresponding table (where LO data type is defined) is
>appropreate. 

Did you mean a table for each field? Or a table for each table (which may have more than 1 LO field). See comments
below.

>o it would be nice to place a LO table on a separate
>directory/partition from the original table where LO data type is
>defined, since a LO body table could become huge.

I would very much like to see the ability to have multi-file databases and
tables - ie. the ability to store a table or index in a separate file.
Perhaps with a user-defined partitioning function for table rows. The idea
being that:
 

1. User specifies that a table can be stored in one of (say) three files.
2. When a record is first stored, the partitioning function is called to
determine the file 'storage area' to use [or a random selection method is
used].
 

If you are going to allow LOs to be stored in multiple files, it seems a pity not to add some or all of this feature.


Additionally, the issue of pg_dump support for LOs needs to be addressed.


That's about it for me,

Philip Warner.

----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: +61-03-5367 7422            |                 _________  \
Fax: +61-03-5367 7430            |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: [HACKERS] Arbitrary tuple size

From:
Bruce Momjian
Date:
> > I agree this is the way to go.  There is nothing I can think of that is
> > limited to how large a tuple can be.  It is just accessing it from the
> > heap routines that is the problem.  If the tuple is alloc'ed to be used,
> > we can paste together the parts on disk and return one tuple.  If they
> > are accessing the buffer copy directly, we would have to be smart about
> > going off the end of the disk copy and moving to the next segment.
> 
>     Who's accessing tuple attributes directly inside  the  buffer
>     copy  (not  only the header which will still be unsplitted at
>     the top of the chain)?


Every call to heap_getnext(), for one.  It locks the buffer, and returns
a pointer to the tuple.  The next heap_getnext(), or heap_endscan()
releases the lock.  The cost of returning every tuple as palloc'ed
memory would be huge.  We may be able to get away with just returning
palloc'ed stuff for long tuples.  That may be a simple, clean solution
that would be isolated.
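Bruce's compromise can be sketched: the common short tuple is returned as a pointer straight into the (locked) buffer, and only tuples split across blocks pay for an allocated reassembly. malloc stands in for palloc, and every name here is invented:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the tuple either as a direct buffer pointer (one chunk) or as
 * an allocated copy pasted together from its on-disk parts. */
static char *fetch_tuple(char **chunks, int nchunks, int *is_copy)
{
    if (nchunks == 1) {
        *is_copy = 0;
        return chunks[0];            /* direct pointer into the buffer */
    }
    size_t total = 0;
    for (int i = 0; i < nchunks; i++)
        total += strlen(chunks[i]);
    char *copy = malloc(total + 1);  /* long tuple: paste the parts */
    copy[0] = '\0';
    for (int i = 0; i < nchunks; i++)
        strcat(copy, chunks[i]);
    *is_copy = 1;
    return copy;
}
```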

In fact, if we want a copy, we call heap_copytuple() to return a
palloc'ed copy.  This interface has been cleaned up so it should be
clear what is happening.  The old code was messy about this.

See my comments from heap_fetch(), which does require the user to supply
a buffer variable, so they can unlock it when they are done.  The old
code allowed you to pass a NULL as a buffer pointer, so there was no
locking done, and that is bad!

---------------------------------------------------------------------------

/* ----------------
 *      heap_fetch      - retrieve tuple with tid
 *
 *      Currently ignores LP_IVALID during processing!
 *
 *      Because this is not part of a scan, there is no way to
 *      automatically lock/unlock the shared buffers.
 *      For this reason, we require that the user retrieve the buffer
 *      value, and they are required to BufferRelease() it when they
 *      are done.  If they want to make a copy of it before releasing it,
 *      they can call heap_copytuple().
 * ----------------
 */
void
heap_fetch(Relation relation,
           Snapshot snapshot,
           HeapTuple tuple,
           Buffer *userbuf)






Re: [HACKERS] Arbitrary tuple size

From:
Bruce Momjian
Date:
> > Aren't index tuples still tuples?  Can't they be split just like
> > regular tuples?
> 
>     Don't know, maybe.
> 
>     While  looking  for  some  places  where  tuple data might be
>     accessed directly inside of the  buffers  I've  searched  for
>     WriteBuffer() and friends. These are mostly used in the index
>     access methods and some other places where I  expected  them,
>     so  index  AM's  have  at  least to be carefully visited when
>     implementing tuple split.

See my recent mail.  heap_getnext and heap_fetch().  Can't get lower
access than that.




Re: [HACKERS] Arbitrary tuple size

From:
Bruce Momjian
Date:
I knew there had to be a reason that some tests were BLCKSZ/2 and some
BLCKSZ.

Added to TODO:
* Allow index on tuple greater than 1/2 block size

Seems we have to allow columns over 1/2 block size for now.  Most people
wouldn't index on them.


> >     Don't know, maybe.
> 
>     Actually   we   have  some  problems  with  indices  on  text
>     attributes when the content exceeds HALF of the blocksize:
> 
>         FATAL 1:  btree: failed to add item to the page
> 
>     It crashes the backend AND seems to corrupt the index!  Looks
>     to  me that at least the btree code needs to be able to store
>     at minimum two items into one block and painfully fails if it
>     can't.
> 
>     And just another one:
> 
>         pgsql=> create table t1 (a int4, b char(4000));
>         CREATE
>         pgsql=> create index t1_b on t1 (b);
>         CREATE
>         pgsql=> insert into t1 values (1, 'a');
> 
>         TRAP: Failed Assertion("!(( itid)->lp_flags & 0x01):",
>                                     File: "nbtinsert.c", Line: 361)
> 
>     Bruce: One more TODO item!
> 
> 
> Jan
> 
> 
> 
> 




Re: [HACKERS] Arbitrary tuple size

From:
Bruce Momjian
Date:

If we get wide tuples, we could just throw all large objects into one
table, and have an index on it.  We can then vacuum it to compact space, etc.


> Going toward >8k tuples would be really good, but I suspect we may
> some difficulties with LO stuffs once we implement it. Also it seems
> that it's not worth to adapt LOs with newly designed tuples.  I think
> the design of current LOs are so broken that we need to redesign them.
> 
> o it's slow: accessing a LO need a open() that is not cheap.  creating
> many LOs makes data/base/DBNAME/ directory fat.
> 
> o it consumes lots of i-nodes
> 
> o it breaks the tuple abstraction: this makes difficult to maintain
> the code.
> 
> I would propose followings for the new version of LO:
> 
> o create a new data type that represents the LO
> 
> o when defining the LO data type in a table, it actually points to a
> LO "body" in another place where it is physically stored.
> 
> o the storage for LO bodies would be a hidden table that contains
> several LOs, not single one.
> 
> o we can have several tables for the LO bodies. Probably a LO body
> table for each corresponding table (where LO data type is defined) is
> appropreate. 
> 
> o it would be nice to place a LO table on a separate
> directory/partition from the original table where LO data type is
> defined, since a LO body table could become huge.
> 
> Comments? Opinions?
> ---
> Tatsuo Ishii
> 
> 





Re: [HACKERS] Arbitrary tuple size

From:
Tatsuo Ishii
Date:
>If we get wide tuples, we could just throw all large objects into one
>table, and have an on it.  We can then vacuum it to compact space, etc.

I thought about that too. But if a table contains lots of LOs, scanning
it will take a long time. On the other hand, if LOs are stored outside
the table, scanning time will be shorter as long as we don't need to read
the content of each LO type field.
--
Tatsuo Ishii





Re: [HACKERS] Arbitrary tuple size

From:
Vadim Mikheev
Date:
Bruce Momjian wrote:
> 
> If we get wide tuples, we could just throw all large objects into one
> table, and have an on it.  We can then vacuum it to compact space, etc.

Storing a 2Gb LO in a table is not a good thing.

Vadim


Re: [HACKERS] Arbitrary tuple size

From:
Bruce Momjian
Date:
> >If we get wide tuples, we could just throw all large objects into one
> >table, and have an on it.  We can then vacuum it to compact space, etc.
> 
> I thought about that too. But if a table contains lots of LOs,
> scanning of it will take for a long time. On the otherhand, if LOs are
> stored outside the table, scanning time will be shorter as long as we
> don't need to read the content of each LO type field.

Use an index to get to the LO's in the table.




Re: [HACKERS] Arbitrary tuple size

From:
Bruce Momjian
Date:
> Bruce Momjian wrote:
> > 
> > If we get wide tuples, we could just throw all large objects into one
> > table, and have an on it.  We can then vacuum it to compact space, etc.
> 
> Storing 2Gb LO in table is not good thing.
> 
> Vadim
> 

Ah, but we have segmented tables now.  It will auto-split at 1 gig.
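The segment arithmetic is simple: with 8K blocks and a 1 gig segment limit, block N of a relation lives in file segment N / 131072. A sketch (both figures are the defaults of the era, used here for illustration):

```c
#include <assert.h>

/* Which on-disk file segment holds a given block of a relation. */
static int segment_for_block(long block, long blcksz, long seg_bytes)
{
    return (int)(block / (seg_bytes / blcksz));
}
```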



Re: [HACKERS] Arbitrary tuple size

From:
Vadim Mikheev
Date:
Bruce Momjian wrote:
> 
> > Bruce Momjian wrote:
> > >
> > > If we get wide tuples, we could just throw all large objects into one
> > > table, and have an on it.  We can then vacuum it to compact space, etc.
> >
> > Storing 2Gb LO in table is not good thing.
> >
> > Vadim
> >
> 
> Ah, but we have segemented tables now.  It will auto-split at 1 gig.

Well, now consider an update of a 2Gb row!
I worry not about the non-overwriting but about writing
a 2Gb log record to WAL - we'll not be able to do it, for sure.

Isn't this why Informix restricts tuple len to 32k only?
And the same is what Oracle does.
Both of them have the ability to use > 1 page for a single row,
but they have this restriction anyway.

I don't like _arbitrary_ tuple size.
I vote for some limit. 32K or 64K, at max.

Vadim


Re: [HACKERS] Arbitrary tuple size

From:
Hannu Krosing
Date:
Vadim Mikheev wrote:
> 
> Bruce Momjian wrote:
> >
> > > Bruce Momjian wrote:
> > > >
> > > > If we get wide tuples, we could just throw all large objects into one
> > > > table, and have an on it.  We can then vacuum it to compact space, etc.
> > >
> > > Storing 2Gb LO in table is not good thing.
> > >
> > > Vadim
> > >
> >
> > Ah, but we have segemented tables now.  It will auto-split at 1 gig.
> 
> Well, now consider update of 2Gb row!
> I worry not due to non-overwriting but about writing
> 2Gb log record to WAL - we'll not be able to do it, sure.

Can't we write just some kind of diff (only the changed pages) to WAL,
either starting at some threshold or just based on the seek/write logic of
LOs?
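The diff idea above reduces the WAL record from the whole row to just the pages whose bytes changed. A minimal sketch, with page size and the byte-compare granularity chosen for illustration:

```c
#include <assert.h>
#include <string.h>

/* Count the pages of a row image that differ between the old and new
 * versions; only those would need to go to the log. */
static int count_changed_pages(const char *oldbuf, const char *newbuf,
                               int len, int pagesz)
{
    int changed = 0;
    for (int off = 0; off < len; off += pagesz) {
        int n = (len - off < pagesz) ? (len - off) : pagesz;
        if (memcmp(oldbuf + off, newbuf + off, n) != 0)
            changed++;               /* this page would be logged */
    }
    return changed;
}
```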

It will add complexity, but having some arbitrary limits seems very
wrong.

It will also make indexing LOs more complex, but as we don't currently
index them anyway, it's not a big problem yet.

Setting the limit higher (like 16M, where all my current LOs would fit :))
is just postponing the problem. Does "who will need more than 640k of RAM"
sound familiar?

> Isn't it why Informix restrict tuple len to 32k only?
> And the same is what Oracle does.

Does anyone know what the limit for Oracle8i is? As they advertise it as a
replacement file system among other things, I guess it can't be too low -
I suspect 2G at minimum.

> Both of them have ability to use > 1 page for single row,
> but they have this restriction anyway.
> 
> I don't like _arbitrary_ tuple size.

Why not ?

IMHO we should allow _arbitrary_ (in reality probably <= MAXINT), but
optimize for some known size and tell users that if they exceed it,
performance will suffer.

So when I have 99% of my LOs in the 10k-80k range but a few 512k-2M ones,
I can just live with the bigger ones having bad performance instead of
implementing an additional LO manager in the frontend too.

> I vote for some limit.

Why limit ?

> 32K or 64K, at max.

Why so low? Please make it at least configurable, preferably at runtime.

And if you go that way, please assume this limit (in code) for tuple size
only, and not in the FE/BE protocol - it will make it easier for someone
to fix the backend to work with larger tuples later.

The LOs should remain load-on-demand anyway; it should just be made more
transparent for end-users.

> Vadim


Re: [HACKERS] Arbitrary tuple size

From:
wieck@debis.com (Jan Wieck)
Date:
Tatsuo Ishii wrote:

>
> Going toward >8k tuples would be really good, but I suspect we may
> some difficulties with LO stuffs once we implement it. Also it seems
> that it's not worth to adapt LOs with newly designed tuples.  I think
> the design of current LOs are so broken that we need to redesign them.
>
> [... LO stuff deleted ...]

    I wasn't talking about a new datatype that can exceed the
    tuple limit. The general tuple split I want will also handle
    the case where a row with 40 text attributes of 1K each gets
    stored. That's something different.


Jan


Re: [HACKERS] Arbitrary tuple size

From:
wieck@debis.com (Jan Wieck)
Date:
Vadim wrote:

>
> Bruce Momjian wrote:
> >
> > > Bruce Momjian wrote:
> > > >
> > > > If we get wide tuples, we could just throw all large objects into one
> > > > table, and have an on it.  We can then vacuum it to compact space, etc.
> > >
> > > Storing 2Gb LO in table is not good thing.
> > >
> > > Vadim
> > >
> >
> > Ah, but we have segemented tables now.  It will auto-split at 1 gig.
>
> Well, now consider update of 2Gb row!
> I worry not due to non-overwriting but about writing
> 2Gb log record to WAL - we'll not be able to do it, sure.
>
> Isn't it why Informix restrict tuple len to 32k only?
> And the same is what Oracle does.
> Both of them have ability to use > 1 page for single row,
> but they have this restriction anyway.
>
> I don't like _arbitrary_ tuple size.
> I vote for some limit. 32K or 64K, at max.

    To have some limit seems reasonable to me (I've also read
    the other comments). When dealing with regular tuples, first
    off the query to insert or update them gets read into
    memory. Next the querytree with the Const vars is built,
    rewritten, and planned. Then the tuple is built in memory
    and maybe copied somewhere else (fulltext index trigger).
    So that amount of memory will be allocated many times!

    There is some natural limit on the tuple size depending on
    the available swapspace. Not everyone has a multiple-GB
    swapspace setup. Making it a well-known hard limit that
    doesn't hurt even if 20 backends do things simultaneously
    is better.

    I vote for a limit too. 64K should be enough.
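Jan's back-of-envelope point can be made concrete: a big tuple is materialized several times on its way through the system (query string, Const node, formed tuple, trigger copies), and many backends do so at once, so the real memory cost of any tuple-size limit is a multiple of it. The figures below are illustrative:

```c
#include <assert.h>

/* Rough peak memory for concurrently processed tuples of a given size. */
static long peak_tuple_memory(long tuple_len, int copies, int backends)
{
    return tuple_len * copies * backends;
}
```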


Jan


Re: [HACKERS] Arbitrary tuple size

From:
Bruce Momjian
Date:
> > Ah, but we have segmented tables now.  It will auto-split at 1 gig.
> 
> Well, now consider an update of a 2Gb row!
> I worry not because of non-overwriting, but about writing
> a 2Gb log record to WAL - we won't be able to do that, for sure.
> 
> Isn't that why Informix restricts tuple length to 32k?
> Oracle does the same.
> Both of them have the ability to use more than one page for a single row,
> but they have this restriction anyway.
> 
> I don't like _arbitrary_ tuple size.
> I vote for some limit. 32K or 64K, at max.

Yes, but having it all in one table avoids an fopen() call for every
access and an inode for every large object, and allows vacuum to clean
up multiple versions.  Just an idea.  I see your point.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Arbitrary tuple size

From:
Bruce Momjian
Date:
> > Well, now consider an update of a 2Gb row!
> > I worry not because of non-overwriting, but about writing
> > a 2Gb log record to WAL - we won't be able to do that, for sure.
> 
> Can't we write just some kind of diff (only changed pages) in WAL,
> either starting at some threshold or just based on the seek/write logic of
> LOs?
> 
> It will add complexity, but having some arbitrary limits seems very
> wrong.
> 
> It will also make indexing LOs more complex, but as we don't currently
> index them anyway, it's not a big problem yet.

Well, we do indexing of large objects by using the OS directory code to
find a given directory entry.

> Why not ?
> 
> IMHO we should allow _arbitrary_ (in reality probably <= MAXINT), but 
> optimize for some known size and tell the users that if they exceed it
> the performance would suffer. 

If they go over a certain size, they can decide to store it in the file
system, as many users are doing now.




Re: [HACKERS] Arbitrary tuple size

From:
Brook Milligan
Date:
> I don't like _arbitrary_ tuple size.
> I vote for some limit. 32K or 64K, at max.

      To have some limit seems reasonable to me (I've also read
      the other comments). When dealing with regular tuples, first ...

Isn't anything other than arbitrary sizes just making us encounter the
same problem later?  Clearly, there are real hardware limits, but we
shouldn't build them into the code.  It seems to me the solution is to
allow arbitrary (i.e., hardware-driven) limits, document what is
necessary to support certain operations, and let the fanatics buy
mega-systems if they need to support huge tuples.  As long as the code
is optimized for more reasonable situations, there should be no
penalty.

Cheers,
Brook


Re: [HACKERS] Arbitrary tuple size

From:
Vadim Mikheev
Date:
Jan Wieck wrote:
> 
> Bruce Momjian wrote:
> 
> > I agree this is the way to go.  There is nothing I can think of that is
> > limited to how large a tuple can be.
> 
>     Ouch - I can.
> 
>     Having an index on a varlen field that no longer fits
>     into an index block. Wouldn't this cause problems?  Well,
>     it's bad database design to index fields that will receive
>     such long data, because indexing them will blow up the
>     database, but it must work anyway.

It seems that in other DBMSes the length of an index tuple is more
restricted than that of a heap tuple, so I think we shouldn't worry
about this case.

Vadim



Re: [HACKERS] Arbitrary tuple size

From:
Bruce Momjian
Date:
> > Aren't index tuples still tuples?  Can't they be split just like
> > regular tuples?
> 
>     Don't know, maybe.
> 
>     While  looking  for  some  places  where  tuple data might be
>     accessed directly inside of the  buffers  I've  searched  for
>     WriteBuffer() and friends. These are mostly used in the index
>     access methods and some other places where I  expected  them,
>     so  index  AM's  have  at  least to be carefully visited when
>     implementing tuple split.

See my recent mail: heap_getnext() and heap_fetch().  You can't get
lower-level access than that.




Re: [HACKERS] Arbitrary tuple size

From:
Bruce Momjian
Date:

If we get wide tuples, we could just throw all large objects into one
table, and have an index on it.  We can then vacuum it to compact space, etc.


> Going toward >8k tuples would be really good, but I suspect we may have
> some difficulties with LO stuff once we implement it. Also it seems
> that it's not worth adapting LOs to the newly designed tuples.  I think
> the design of the current LOs is so broken that we need to redesign them.
> 
> o it's slow: accessing a LO needs an open() that is not cheap.  Creating
> many LOs makes the data/base/DBNAME/ directory fat.
> 
> o it consumes lots of i-nodes
> 
> o it breaks the tuple abstraction: this makes the code difficult
> to maintain.
> 
> I would propose followings for the new version of LO:
> 
> o create a new data type that represents the LO
> 
> o when defining the LO data type in a table, it actually points to a
> LO "body" in another place where it is physically stored.
> 
> o the storage for LO bodies would be a hidden table that contains
> several LOs, not single one.
> 
> o we can have several tables for the LO bodies. Probably a LO body
> table for each corresponding table (where the LO data type is defined) is
> appropriate.
> 
> o it would be nice to place a LO table on a separate
> directory/partition from the original table where LO data type is
> defined, since a LO body table could become huge.
> 
> Comments? Opinions?
> ---
> Tatsuo Ishii
> 
> 
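Tatsuo's "hidden table of LO bodies" quoted above can be sketched as rows of fixed-size chunks keyed by (lo_id, chunk_no); the names and the 8000-byte chunk size (to fit one 8K page) are assumptions for illustration, not the eventual implementation:

```c
#include <assert.h>
#include <string.h>

#define LO_CHUNK_SIZE 8000      /* assumed payload per chunk, fits an 8K page */

/* One row of the hypothetical hidden LO-body table. */
typedef struct LoChunk {
    unsigned int lo_id;         /* which large object */
    int          chunk_no;      /* 0-based position within it */
    int          len;           /* bytes used in data[] */
    char         data[LO_CHUNK_SIZE];
} LoChunk;

/* Split a LO body into chunk rows; returns the number of chunks written. */
static int lo_split(unsigned int lo_id, const char *body, int body_len,
                    LoChunk *out, int max_chunks)
{
    int n = 0;
    int off = 0;

    while (off < body_len && n < max_chunks) {
        int take = body_len - off;

        if (take > LO_CHUNK_SIZE)
            take = LO_CHUNK_SIZE;
        out[n].lo_id = lo_id;
        out[n].chunk_no = n;
        out[n].len = take;
        memcpy(out[n].data, body + off, take);
        off += take;
        n++;
    }
    return n;
}

/* Demo: split a body of 'body_len' x's and verify the round-trip length;
 * returns the chunk count, or -1 if the lengths don't add up. */
static int lo_split_demo(int body_len)
{
    static char body[65536];
    static LoChunk chunks[16];
    int n, i, total = 0;

    memset(body, 'x', sizeof body);
    n = lo_split(7, body, body_len, chunks, 16);
    for (i = 0; i < n; i++)
        total += chunks[i].len;
    return (total == body_len) ? n : -1;
}
```

Scanning the base table then only touches the small (lo_id, chunk_no) references, which addresses the scan-speed worry raised later in the thread.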





Re: [HACKERS] Arbitrary tuple size

From:
Bruce Momjian
Date:
> >If we get wide tuples, we could just throw all large objects into one
> >table, and have an index on it.  We can then vacuum it to compact space, etc.
> 
> I thought about that too. But if a table contains lots of LOs,
> scanning it will take a long time. On the other hand, if LOs are
> stored outside the table, scanning time will be shorter as long as we
> don't need to read the content of each LO type field.

Use an index to get to the LO's in the table.




Re: [HACKERS] Arbitrary tuple size

From:
Vadim Mikheev
Date:
Bruce Momjian wrote:
> 
> > Bruce Momjian wrote:
> > >
> > > If we get wide tuples, we could just throw all large objects into one
> > > table, and have an index on it.  We can then vacuum it to compact space, etc.
> >
> > Storing 2Gb LO in table is not good thing.
> >
> > Vadim
> >
> 
> Ah, but we have segmented tables now.  It will auto-split at 1 gig.

Well, now consider an update of a 2Gb row!
I worry not because of non-overwriting, but about writing
a 2Gb log record to WAL - we won't be able to do that, for sure.

Isn't that why Informix restricts tuple length to 32k?
Oracle does the same.
Both of them have the ability to use more than one page for a single row,
but they have this restriction anyway.

I don't like _arbitrary_ tuple size.
I vote for some limit. 32K or 64K, at max.

Vadim
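Vadim's WAL objection is about record size, not row size. A hedged sketch of the alternative he implies is impractical: logging a huge update as a chain of bounded continuation records (the 32K cap is an assumed figure, echoing the Informix limit he mentions, not a real WAL parameter):

```c
#include <assert.h>

#define XLOG_MAX_PAYLOAD (32 * 1024)    /* assumed per-record payload cap */

/* How many bounded WAL records an update touching 'row_len' bytes would
 * need, if each record carries at most XLOG_MAX_PAYLOAD bytes of data. */
static long long xlog_records_needed(long long row_len)
{
    if (row_len <= 0)
        return 0;
    return (row_len + XLOG_MAX_PAYLOAD - 1) / XLOG_MAX_PAYLOAD;
}
```

A 2Gb row would need 65536 chained records that all have to replay atomically on recovery; a hard 32K/64K tuple limit avoids that machinery entirely, which is the substance of the vote above.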


Re: [HACKERS] Arbitrary tuple size

From:
Tatsuo Ishii
Date:
>If we get wide tuples, we could just throw all large objects into one
>table, and have an index on it.  We can then vacuum it to compact space, etc.

I thought about that too. But if a table contains lots of LOs,
scanning it will take a long time. On the other hand, if LOs are
stored outside the table, scanning time will be shorter as long as we
don't need to read the content of each LO type field.
--
Tatsuo Ishii





Re: [HACKERS] Arbitrary tuple size

From:
Bruce Momjian
Date:
> > I agree this is the way to go.  There is nothing I can think of that is
> > limited to how large a tuple can be.  It is just accessing it from the
> > heap routines that is the problem.  If the tuple is alloc'ed to be used,
> > we can paste together the parts on disk and return one tuple.  If they
> > are accessing the buffer copy directly, we would have to be smart about
> > going off the end of the disk copy and moving to the next segment.
> 
>     Who's accessing tuple attributes directly inside the buffer
>     copy (not only the header, which will still be unsplit at
>     the top of the chain)?


Every call to heap_getnext(), for one.  It locks the buffer, and returns
a pointer to the tuple.  The next heap_getnext(), or heap_endscan()
releases the lock.  The cost of returning every tuple as palloc'ed
memory would be huge.  We may be able to get away with just returning
palloc'ed stuff for long tuples.  That may be a simple, clean solution
that would be isolated.

In fact, if we want a copy, we call heap_copytuple() to return a
palloc'ed copy.  This interface has been cleaned up so it should be
clear what is happening.  The old code was messy about this.
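The "palloc'ed stuff only for long tuples" idea can be sketched as follows, with malloc standing in for palloc; the function and its two-piece shape are hypothetical illustrations, not the real heapam interface:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: a fetch either returns a pointer straight into the shared buffer
 * (fast path, no allocation) or, for a tuple split across blocks, pastes
 * the pieces together into fresh memory.  'copied' tells the caller
 * whether the result must be freed (i.e. whether it was "palloc'ed"). */
static char *fetch_tuple(char *piece1, int len1,
                         char *piece2, int len2, int *copied)
{
    char *t;

    if (len2 == 0) {            /* fits in one block: hand back the buffer */
        *copied = 0;
        return piece1;
    }
    t = malloc(len1 + len2);    /* long tuple: reassemble a private copy */
    memcpy(t, piece1, len1);
    memcpy(t + len1, piece2, len2);
    *copied = 1;
    return t;
}

/* Short tuple: caller gets the in-buffer pointer, nothing to free. */
static int demo_short_is_not_copied(void)
{
    char blk[8] = "abc";
    int copied = -1;
    char *t = fetch_tuple(blk, 3, NULL, 0, &copied);

    return copied == 0 && t == blk;
}

/* Long tuple: caller gets a pasted-together private copy. */
static int demo_long_is_pasted(void)
{
    char a[] = "abc", b[] = "def";
    int copied = -1;
    char *t = fetch_tuple(a, 3, b, 3, &copied);
    int ok = (copied == 1 && memcmp(t, "abcdef", 6) == 0);

    free(t);
    return ok;
}
```

The appeal is exactly what the paragraph above says: the common case keeps today's zero-copy behavior, and the allocation cost is isolated to the rare multi-block tuples.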

See my comments from heap_fetch(), which does require the user to supply
a buffer variable, so they can unlock it when they are done.  The old
code allowed you to pass a NULL as a buffer pointer, so there was no
locking done, and that is bad!

---------------------------------------------------------------------------

/* ----------------
 *      heap_fetch      - retrieve tuple with tid
 *
 *      Currently ignores LP_IVALID during processing!
 *
 *      Because this is not part of a scan, there is no way to
 *      automatically lock/unlock the shared buffers.
 *      For this reason, we require that the user retrieve the buffer
 *      value, and they are required to BufferRelease() it when they
 *      are done.  If they want to make a copy of it before releasing
 *      it, they can call heap_copytuple().
 * ----------------
 */
void
heap_fetch(Relation relation,
           Snapshot snapshot,
           HeapTuple tuple,
           Buffer *userbuf)






Re: [HACKERS] Arbitrary tuple size

From:
wieck@debis.com (Jan Wieck)
Date:
Tatsuo Ishii wrote:

>
> Going toward >8k tuples would be really good, but I suspect we may have
> some difficulties with LO stuff once we implement it. Also it seems
> that it's not worth adapting LOs to the newly designed tuples.  I think
> the design of the current LOs is so broken that we need to redesign them.
>
> [... LO stuff deleted ...]

    I wasn't talking about a new datatype that can exceed the
    tuple limit. The general tuple split I want will also handle
    storing a row with 40 text attributes of 1K each.
    That's something different.


Jan

--


Re: [HACKERS] Arbitrary tuple size

From:
The Hermit Hacker
Date:
On Fri, 9 Jul 1999, Vadim Mikheev wrote:

> 
> Bruce Momjian wrote:
> > 
> > > Bruce Momjian wrote:
> > > >
> > > > If we get wide tuples, we could just throw all large objects into one
> > > > table, and have an index on it.  We can then vacuum it to compact space, etc.
> > >
> > > Storing 2Gb LO in table is not good thing.
> > >
> > > Vadim
> > >
> > 
> > Ah, but we have segmented tables now.  It will auto-split at 1 gig.
> 
> Well, now consider update of 2Gb row!
> I worry not due to non-overwriting but about writing
> 2Gb log record to WAL - we'll not be able to do it, sure.

What I'm kinda curious about is *why* you would want to store a LO in the
table in the first place?  And, consequently, as Bruce had
suggested...index it?  Unless something has changed recently that I
totally missed, the only time the index would be used is if a query was
based on a) the start of the string (i.e. ^<string>) or b) the complete
string (i.e. ^<string>$) ...

So what benefit would an index be on a LO?
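Marc's point holds: an ordered index only helps when the pattern pins down a prefix. A sorted array standing in for the btree makes this concrete (illustrative only, not backend code):

```c
#include <assert.h>
#include <string.h>

/* A sorted array stands in for a btree index on a text column.  A query
 * anchored at the start (^abc) becomes a range probe on the prefix; a
 * pattern with a leading wildcard gives no prefix to probe and would
 * force a full scan instead. */
static int prefix_probe(const char **sorted, int n, const char *prefix)
{
    int lo = 0, hi = n;
    int plen = (int) strlen(prefix);

    while (lo < hi) {           /* find the first entry >= prefix */
        int mid = (lo + hi) / 2;

        if (strncmp(sorted[mid], prefix, plen) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* matches exist iff the first candidate actually carries the prefix */
    return lo < n && strncmp(sorted[lo], prefix, plen) == 0;
}

/* Probe a fixed toy "index" for the given prefix. */
static int demo_probe(const char *prefix)
{
    static const char *idx[] = { "apple", "apricot", "banana", "cherry" };

    return prefix_probe(idx, 4, prefix);
}
```

For a multi-megabyte LO there is rarely a meaningful short prefix to anchor on, which is why indexing LO contents buys so little.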

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org 



Re: [HACKERS] Arbitrary tuple size

From:
Philip Warner
Date:
At 09:04 28/07/99 -0300, The Hermit Hacker wrote:
>On Fri, 9 Jul 1999, Vadim Mikheev wrote:
>
>> 
>> Bruce Momjian wrote:
>> > 
>> > > Bruce Momjian wrote:
>> > > >
>> > > > If we get wide tuples, we could just throw all large objects into one
>> > > > table, and have an index on it.  We can then vacuum it to compact space,
etc.
>> > >
>> > > Storing 2Gb LO in table is not good thing.
>> > >
>> > > Vadim
>> > >
>> > 
>> > Ah, but we have segmented tables now.  It will auto-split at 1 gig.
>> 
>> Well, now consider an update of a 2Gb row!
>> I worry not because of non-overwriting, but about writing
>> a 2Gb log record to WAL - we won't be able to do that, for sure.
>
>What I'm kinda curious about is *why* you would want to store a LO in the
>table in the first place?  And, consequently, as Bruce had
>suggested...index it?  Unless something has changed recently that I
>totally missed, the only time the index would be used is if a query was
>based on a) start of string (ie. ^<string>) or b) complete string (ie.
>^<string>$) ...
>
>So what benefit would an index be on a LO?
>

Some systems (Dec RDB) won't even let you index the contents of an LO.
Anyone know what other systems do?

Also, to repeat question from an earlier post: is there a plan for the BLOB
implementation that is available for comment/contribution?


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: +61-03-5367 7422            |                 _________  \
Fax: +61-03-5367 7430            |                 ___________ |
Http://www.rhyme.com.au          |                /           \
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/