Обсуждение: Advices on custom data type and extension development
Dear friends, I'm planning to develop an extension, and I'm here for getting some help. But I would like to share the problem I intend to solve. Maybe my desired solution is not a good option. What I have: * a lot of data being generated every day, which are mainly queried by an immutable column of type date or timestamp; * as a standard, almost every table has a bigserial id column as a primary key; * data is huge enough to demand table partitioning, which is implemented as suggested in Postgres documentation, by using triggers and table inheritance. A function called by cron deal with creation of new partitions. What I would like to develop first is a custom type (let's call it datedserial) for replacing bigserial as the primary key: * the type would be 8 bytes long, being 4 dedicated to storing the Date, and 4 dedicated to storing a serial within that day; * the text representation of the type would show itsdate and its serial number (something like '2015-10-02.0000007296' as a canonical form, but which could accept inputs like '20151002.0000007296'); * as a consequence of this internal representation,the serial part could not be greater than 4 billion and some; * support for operator classes allowing the type being used in GIN and GIST indexes would be optional for now. That would allow me to have a compact primary key which I could use to partition the table based on the object's date. That would also allow me to partition detail tables on the foreign key column having this data type. Besides that, just by examining the value, mainly when used as a foreign key, I could infer where the record belongs to. When I have a working custom data type, I would go to the next and harder part. I would like to create a new structure like a sequence, and it should behave exactly like sequences, but separated by a date space. So I would have functions similar to the following: * createsequencegroup(sequence_group_name text): create a new named structure for managing the sequence group; * nextval(sequence_group_name text, context_date date): return next value of the sequence (as a datedserial) belonging to the sequence group and associated with the context date. The value returned have the context_date in its date part and the next value for that date in the sequence part. The first call for a specific date would return 1 for the sequence part. Concerning to concurrency and transactions, the function behaves exactly like nextval(sequence_group_name text); * currval(sequence_group_name text, context_date date): the currval function counterpart; * setval(sequence_group_name text, context_date date, int4 value): the setval function counterpart; * freeze_before(sequence_group_name text, freeze_date date): disallow using the sequence group with context dates before the freeze_date. I would consider extending the data type to allow including information about the cluster which generated the values. This way, the user could set a configuration entry defining a byte value for identifying the cluster among others involved in replication, so that the sequence group could have different sequences not only for different dates, but for different nodes as well. As I've said, I would like to package the resulting work as an extension. For now, I would like some help about where to start. I've downloaded the postgres source code and have successfully compiled it using my Ubuntu desktop, although have not tested the resulting binary. Should I create a folder in the contrib directory and use another extension as a starting point? Is this the recommended path? Or is this too much and I should create a separate project? Thanks in advance. Best regards, Luciano Barcellos
On Mon, Jan 18, 2016 at 9:36 PM, Luciano Coutinho Barcellos <luciano@geocontrol.com.br> wrote: > * a lot of data being generated every day, which are mainly queried > by an immutable column of type date or timestamp; > * as a standard, almost every table has a bigserial id column as a > primary key; > * data is huge enough to demand table partitioning, which is > implemented as suggested in Postgres documentation, by using triggers and > table inheritance. A function called by cron deal with creation of new > partitions. > > What I would like to develop first is a custom type (let's call it > datedserial) for replacing bigserial as the primary key: > > * the type would be 8 bytes long, being 4 dedicated to storing the > Date, and 4 dedicated to storing a serial within that day; Seriously, you should consider having a primary key with two columns, of type date and int. It would take exactly the same space as your current plan, and performance should be very close to what you propose. As long as you aren't using some ORM that is too dumb to deal with this, it should be far easier than creating the custom type. If you can't overcome the limitations of the "standard" or your development framework any other way, you plan sounds like the next best thing. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Seriously, you should consider having a primary key with two
columns, of type date and int. It would take exactly the same
space as your current plan, and performance should be very close to
what you propose. As long as you aren't using some ORM that is too
dumb to deal with this, it should be far easier than creating the
custom type.
+1
Most ORMs cannot handle ENUMs, let alone user defined composite types.
That, or they flood the database with SELECT * FROM pg_type WHERE ... queries. I'm looking at you, Cake.
You're far better off trying a (date,integer) key as Kevin said.
If the ORM doesn't allow that, I'd suggest a custom function that encodes the date bit-shifted to the high 4 bytes, and then adds in the four bytes from a cycling sequence. At least then you've got a shot at partitioning, though the lower/upper bounds of the partitions would not make sense to the casual observer.
Luciano Coutinho Barcellos wrote: > * the type would be 8 bytes long, being 4 dedicated to storing the > Date, and 4 dedicated to storing a serial within that day; Another thing to consider -- have you carefully defined what the "current day" is? This might sound a stupid question, but as far as I remember Brazil has at least two timezones, which means that you could have one date while far east and a different one at the west border of the country. If you misplace an order that a customer filed after 11pm, they will be pretty pissed. (This of course becomes more pressing if you have things outside the country.) If you simply state that dates are whatever is current in UTC zone, you're covered. (Pray you never get an order during a leap second.) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services