Обсуждение: genomic locus

Поиск
Список
Период
Сортировка

genomic locus

От
Gene Selkov
Дата:
Greetings everyone,

I need a data type to represent genomic positions, which will consist of a string and a pair of integers with interval logic and access methods. Sort of like my seg type, but more straightforward.

I noticed somebody took a good care of seg while I was away for the last 20 years, and I am extremely grateful for that. I have been using it. In the meantime, things have changed and now I am almost clueless about how you deal with contributed modules and what steps I should take once I get it to work. Also, is it a good idea to clone and fix seg for this purpose, or is there a more appropriate template? Or maybe just building it from scratch will be a better idea?

I have seen a lot of bit rot in other extensions (never contributed) that I have not maintained since 2009 and I now I am unable to fix some of them, so I wonder how much of old knowledge is still applicable. In other words, is what I see in new code just a change of macros or the change of principles?

Thanks,

--Gene

Re: genomic locus

От
Michael Paquier
Дата:
On Sat, Dec 16, 2017 at 4:49 AM, Gene Selkov <selkovjr@gmail.com> wrote:
> I noticed somebody took a good care of seg while I was away for the last 20
> years, and I am extremely grateful for that. I have been using it. In the
> meantime, things have changed and now I am almost clueless about how you
> deal with contributed modules and what steps I should take once I get it to
> work. Also, is it a good idea to clone and fix seg for this purpose, or is
> there a more appropriate template? Or maybe just building it from scratch
> will be a better idea?

Wah. You are the author of the seg module and your name is listed in a
set of commits from 2000:
https://www.postgresql.org/docs/devel/static/seg.html
seg is now shaped as what is called an extension (see
https://www.postgresql.org/docs/9.6/static/external-extensions.html),
which is made roughly a library and a set of SQL commands that the
server can load automatically after enabling them with the command
CREATE EXTENSION. If you wish to fix seg in some way, you could always
patch them. But I am not sure what you are trying to fix, so more
details would be welcome.

> I have seen a lot of bit rot in other extensions (never contributed) that I
> have not maintained since 2009 and I now I am unable to fix some of them, so
> I wonder how much of old knowledge is still applicable. In other words, is
> what I see in new code just a change of macros or the change of principles?

APIs in Postgres are usually stable. You should be able to update your
own extensions. If you want to discuss about a couple of things in
particular, don't hesitate!
-- 
Michael


Re: genomic locus

От
Andrew Dunstan
Дата:

On 12/15/2017 05:50 PM, Michael Paquier wrote:
>
>> I have seen a lot of bit rot in other extensions (never contributed) that I
>> have not maintained since 2009 and I now I am unable to fix some of them, so
>> I wonder how much of old knowledge is still applicable. In other words, is
>> what I see in new code just a change of macros or the change of principles?
> APIs in Postgres are usually stable. You should be able to update your
> own extensions. If you want to discuss about a couple of things in
> particular, don't hesitate!


I keep most of the out-of-tree extensions I maintain green by building
and testing them in a buildfarm member. That way I become aware pretty
quickly if any API change has broken them, as happened just the other
day in fact. To do this requires writing a small perl paqckage. There
are three examples in the buildfarm client sources at
https://github.com/PGBuildFarm/client-code/tree/master/PGBuild/Modules
and one of these is included in the buildfarm client releases.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: genomic locus

От
Robert Haas
Дата:
On Fri, Dec 15, 2017 at 2:49 PM, Gene Selkov <selkovjr@gmail.com> wrote:
> I need a data type to represent genomic positions, which will consist of a
> string and a pair of integers with interval logic and access methods. Sort
> of like my seg type, but more straightforward.
>
> I noticed somebody took a good care of seg while I was away for the last 20
> years, and I am extremely grateful for that. I have been using it. In the
> meantime, things have changed and now I am almost clueless about how you
> deal with contributed modules and what steps I should take once I get it to
> work. Also, is it a good idea to clone and fix seg for this purpose, or is
> there a more appropriate template? Or maybe just building it from scratch
> will be a better idea?

Have you thought about just using a composite type?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: genomic locus

От
PostgreSQL - Hans-Jürgen Schönig
Дата:
hello …

it seems they have put a lot of work into that.

regards,

hans


On 15 Dec 2017, at 20:49, Gene Selkov <selkovjr@gmail.com> wrote:

Greetings everyone,

I need a data type to represent genomic positions, which will consist of a string and a pair of integers with interval logic and access methods. Sort of like my seg type, but more straightforward.

I noticed somebody took a good care of seg while I was away for the last 20 years, and I am extremely grateful for that. I have been using it. In the meantime, things have changed and now I am almost clueless about how you deal with contributed modules and what steps I should take once I get it to work. Also, is it a good idea to clone and fix seg for this purpose, or is there a more appropriate template? Or maybe just building it from scratch will be a better idea?

I have seen a lot of bit rot in other extensions (never contributed) that I have not maintained since 2009 and I now I am unable to fix some of them, so I wonder how much of old knowledge is still applicable. In other words, is what I see in new code just a change of macros or the change of principles?

Thanks,

--Gene

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: genomic locus

От
Craig Ringer
Дата:
On 16 December 2017 at 03:49, Gene Selkov <selkovjr@gmail.com> wrote:
Greetings everyone,

I need a data type to represent genomic positions, which will consist of a string and a pair of integers with interval logic and access methods. Sort of like my seg type, but more straightforward.

I noticed somebody took a good care of seg while I was away for the last 20 years, and I am extremely grateful for that. I have been using it. In the meantime, things have changed and now I am almost clueless about how you deal with contributed modules and what steps I should take once I get it to work. Also, is it a good idea to clone and fix seg for this purpose, or is there a more appropriate template? Or maybe just building it from scratch will be a better idea?

I have seen a lot of bit rot in other extensions (never contributed) that I have not maintained since 2009 and I now I am unable to fix some of them, so I wonder how much of old knowledge is still applicable. In other words, is what I see in new code just a change of macros or the change of principles?

Welcome back!

I think your chances of modifying 'seg' are slim, since you'd surely have to change the on-disk format and that'd break existing users. Plus probably change the API.

If you think it'd make logical sense to extend seg with a string descriptor of some sort and could come up with a name/use case that's not quite so narrowly focused as genetics alone, then I could see adding it as a secondary type in the same extension.

But it's more likely that the best course would be to extract the seg extension from core, rename it, hack it as desired, and build it as an extension maintained out-of-tree. 

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: genomic locus

От
Oleg Bartunov
Дата:
On Fri, Dec 15, 2017 at 10:49 PM, Gene Selkov <selkovjr@gmail.com> wrote:
> Greetings everyone,

Привет !

>
> I need a data type to represent genomic positions, which will consist of a
> string and a pair of integers with interval logic and access methods. Sort
> of like my seg type, but more straightforward.

Why not use composite type ? For simple interval approach it's worked for us
(see attached hdate.sql). If you need to specify distribution
function, than it may be
worth to see orion project http://orion.cs.purdue.edu/index.html
6 years ago we was thinking about implementation special UNCERTAINTY data type
(http://www.sai.msu.su/~megera/postgres/talks/big_uncertain_data.pdf), but never
started :( It'd be nice if you start this very interesting for science project.

btw, now you can use range data type, check
https://wiki.postgresql.org/images/7/73/Range-types-pgopen-2012.pdf



>
> I noticed somebody took a good care of seg while I was away for the last 20
> years, and I am extremely grateful for that. I have been using it. In the
> meantime, things have changed and now I am almost clueless about how you
> deal with contributed modules and what steps I should take once I get it to
> work. Also, is it a good idea to clone and fix seg for this purpose, or is
> there a more appropriate template? Or maybe just building it from scratch
> will be a better idea?
>
> I have seen a lot of bit rot in other extensions (never contributed) that I
> have not maintained since 2009 and I now I am unable to fix some of them, so
> I wonder how much of old knowledge is still applicable. In other words, is
> what I see in new code just a change of macros or the change of principles?

nothing special, copy, modify, compile




>
> Thanks,
>
> --Gene

Вложения

Re: genomic locus

От
Andrey Zhidenkov
Дата:
Uncertain type is a great idea. I needed type like this to store information about premiere date or people birth date in movie database. Finally I made two columns: <date> and <precocious>. Precocious could me 'year', 'month' and so on and if precocious was 'year' the date should always had to be like 'yyyy-01-01'. I have just read your presentation from 2011 and have a question - is it possible to compare two uncertain dates like 'January 2017' and 'year 2017'?

On 19 Dec. 2017 23:22, "Oleg Bartunov" <obartunov@gmail.com> wrote:
On Fri, Dec 15, 2017 at 10:49 PM, Gene Selkov <selkovjr@gmail.com> wrote:
> Greetings everyone,

Привет !

>
> I need a data type to represent genomic positions, which will consist of a
> string and a pair of integers with interval logic and access methods. Sort
> of like my seg type, but more straightforward.

Why not use composite type ? For simple interval approach it's worked for us
(see attached hdate.sql). If you need to specify distribution
function, than it may be
worth to see orion project http://orion.cs.purdue.edu/index.html
6 years ago we was thinking about implementation special UNCERTAINTY data type
(http://www.sai.msu.su/~megera/postgres/talks/big_uncertain_data.pdf), but never
started :( It'd be nice if you start this very interesting for science project.

btw, now you can use range data type, check
https://wiki.postgresql.org/images/7/73/Range-types-pgopen-2012.pdf



>
> I noticed somebody took a good care of seg while I was away for the last 20
> years, and I am extremely grateful for that. I have been using it. In the
> meantime, things have changed and now I am almost clueless about how you
> deal with contributed modules and what steps I should take once I get it to
> work. Also, is it a good idea to clone and fix seg for this purpose, or is
> there a more appropriate template? Or maybe just building it from scratch
> will be a better idea?
>
> I have seen a lot of bit rot in other extensions (never contributed) that I
> have not maintained since 2009 and I now I am unable to fix some of them, so
> I wonder how much of old knowledge is still applicable. In other words, is
> what I see in new code just a change of macros or the change of principles?

nothing special, copy, modify, compile




>
> Thanks,
>
> --Gene

Re: genomic locus

От
Gene Selkov
Дата:
> On Dec 15, 2017, at 4:50 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> If you wish to fix seg in some way, you could always
> patch them. But I am not sure what you are trying to fix, so more
> details would be welcome.

I was contemplating how much functionality I could borrow from seg to build another interval type and what unforeseen
hurdleswould emerge while I was digging into it. It turned out to be less straightforward than I thought.  

>> I have seen a lot of bit rot in other extensions (never contributed) that I
>> have not maintained since 2009 and I now I am unable to fix some of them, so
>> I wonder how much of old knowledge is still applicable. In other words, is
>> what I see in new code just a change of macros or the change of principles?
>
> APIs in Postgres are usually stable. You should be able to update your
> own extensions. If you want to discuss about a couple of things in
> particular, don't hesitate!

Thank you Michael. I will summarize the problems I have already encountered in a later reply to this thread.

I do find the API to be unchanged, but I get the sense that some macros are new. Maybe it’s just my bad memory.
Overall,I am pleased with a much better automation of extension building and testing. 

—Gene



Re: genomic locus

От
Gene Selkov
Дата:
Nice work, Andrew!

I wish I knew about it earlier.

> On Dec 16, 2017, at 8:53 AM, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
>
> I keep most of the out-of-tree extensions I maintain green by building
> and testing them in a buildfarm member. That way I become aware pretty
> quickly if any API change has broken them, as happened just the other
> day in fact. To do this requires writing a small perl paqckage. There
> are three examples in the buildfarm client sources at
> https://github.com/PGBuildFarm/client-code/tree/master/PGBuild/Modules
> and one of these is included in the buildfarm client releases.


Re: genomic locus

От
Gene Selkov
Дата:

On Dec 17, 2017, at 7:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Dec 15, 2017 at 2:49 PM, Gene Selkov <selkovjr@gmail.com> wrote:
I need a data type to represent genomic positions, which will consist of a
string and a pair of integers with interval logic and access methods. Sort
of like my seg type, but more straightforward.

Have you thought about just using a composite type?

Yes, I have. That is sort of what I have been doing; a composite type certainly gets the job done but I don’t feel it reduces query complexity, at least from the user’s point of view. Maybe I don’t know enough.

Here’s an example of how I imagine a composite genomic locus (conventionally represented as text ‘:’ integer ‘-‘ integer):

CREATE TYPE locus AS (contig text, coord int4range);
CREATE TABLE test_locus (
  pos locus,
  ref text,
  alt text,
  id text
);
CREATE INDEX test_locus_coord_ix ON test_locus (((pos).coord));
\copy test_locus from test_locus.tab

Where test_locus.tab has stuff like:

(chr3,"[178916937,178916940]") GAA CHP2_PIK3CA_2
(chr3,"[178916939,178916948]") AGAAAAGAT CHP2_PIK3CA_2
(chr3,"[178916940,178916941]") G A CHP2_PIK3CA_2
(chr3,"[178916943,178916944]") A G CHP2_PIK3CA_2
(chr3,"[178916943,178916946]") AAG CHP2_PIK3CA_2
(chr3,"[178916943,178916952]") AAGATCCTC CHP2_PIK3CA_2
(chr3,"[178916944,178916945]") A G CHP2_PIK3CA_2
(chr3,"[178916945,178916946]") G C CHP2_PIK3CA_2
(chr3,"[178916945,178916946]") G T CHP2_PIK3CA_2
(chr3,"[178916945,178916948]") GAT CHP2_PIK3CA_2

When the table is loaded, I can pull the subset shown above with this query:

SELECT * FROM test_locus WHERE (pos).contig = 'chr3' AND (pos).coord && '[178916937, 178916948]’;
              pos               |    ref    | alt |      id
--------------------------------+-----------+-----+---------------
 (chr3,"[178916937,178916941)") | GAA       |     | CHP2_PIK3CA_2
 (chr3,"[178916939,178916949)") | AGAAAAGAT |     | CHP2_PIK3CA_2
 . . . .

So far so good. It gets the job done. However, it is only a small step towards a fully encapsulated, monolithic type I want it to be. The above query It is marginally better than its atomic-type equivalent:

SELECT * FROM test WHERE contig = 'chr3' AND greatest(start, 178916937) <= least(stop, 178916948);
 contig |   start   |   stop    |    ref    | alt |      id
--------+-----------+-----------+-----------+-----+---------------
 chr3   | 178916937 | 178916940 | GAA       |     | CHP2_PIK3CA_2
 chr3   | 178916939 | 178916948 | AGAAAAGAT |     | CHP2_PIK3CA_2
 . . . .

and it requires addition syntax transformations steps to go from conventional locus representation 'chr3:178916937-178916940' to composite '(chr3,"[178916937,178916940]”)’ and back.

Of course, the relative benefits of partial encapsulation I achieve by bundling text with int4range accumulate, compared to (text, int4, int4), as queries grow more complex. But because the elements of a composite type still require a separate query term for each of them (unless there is some magic I am not aware of), the complexity of a typical query I need to run exceeds my feeble sight-reading capacity. I want things that are conceptually simple to be expressed in simple terms, if possible.

Like so:

CREATE EXTENSION locus;
CREATE TABLE test_locus (
  pos locus,
  ref text,
  alt text,
  id text
);
\copy test_locus from data/oncomine.hotspot.tab

SELECT * FROM test_locus WHERE pos && 'chr3:178916937-178916948';
           pos            |    ref    | alt |      id
--------------------------+-----------+-----+---------------
 chr3:178916937-178916940 | GAA       |     | CHP2_PIK3CA_2
 chr3:178916939-178916948 | AGAAAAGAT |     | CHP2_PIK3CA_2
 chr3:178916940-178916941 | G         | A   | CHP2_PIK3CA_2
 chr3:178916943-178916944 | A         | G   | CHP2_PIK3CA_2
 chr3:178916943-178916946 | AAG       |     | CHP2_PIK3CA_2
 chr3:178916943-178916952 | AAGATCCTC |     | CHP2_PIK3CA_2
 chr3:178916944-178916945 | A         | G   | CHP2_PIK3CA_2
 chr3:178916945-178916946 | G         | C   | CHP2_PIK3CA_2
 chr3:178916945-178916946 | G         | T   | CHP2_PIK3CA_2
 chr3:178916945-178916948 | GAT       |     | CHP2_PIK3CA_2
(10 rows)

I have encountered some pesky geometry / indexing problems while building this extension (https://github.com/selkovjr/locus), but I hope I can solve them at least at the level afforded by the composite type, while keeping the clean interface of a monolithic type. I understand I could probably achieve the same cleanliness by defining functions and operators over the complex type, but by the time I’m done with that, will I have coded about the same amount of stuff as required to build an extended type?


Regards,

—Gene




Re: genomic locus

От
Gene Selkov
Дата:

On Dec 18, 2017, at 5:00 AM, PostgreSQL - Hans-Jürgen Schönig <postgres@cybertec.at> wrote:

it seems they have put a lot of work into that.

Yes, that is a curious piece of code and I will give it a test, although it is not related to the problem at hand. It is about representing sequences, while I am struggling to create a convenient representation of loci.

I liked the slide show you linked. It touched on a pertinent subject (although again, not related to loci). Just last week, I attempted to help a relative who works on a microbiome project to sort through the guidelines about the encoding of altitude, elevation, and depth thought up by somebody at EBI without much regard for their usability. We haven’t had much success in sorting that out because the guidelines were obviously written for other people to follow; their practical application resulted in total confusion.

For example, is this definition redundant or contradictory:

"The altitude of the sample is the vertical distance between Earth's surface above sea level and the sampled position in the air

Is that AGL or AMSL? Why not use a dictionary definition? Because of stuff like that, I treat all products by biologists with suspicion, until proven workable — especially stuff developed by consortia of multiple institutions.

Anyway, thanks for the link.

—Gene

Re: genomic locus

От
Gene Selkov
Дата:

On Dec 18, 2017, at 6:59 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

If you think it'd make logical sense to extend seg with a string descriptor of some sort and could come up with a name/use case that's not quite so narrowly focused as genetics alone, then I could see adding it as a secondary type in the same extension.

But it's more likely that the best course would be to extract the seg extension from core, rename it, hack it as desired, and build it as an extension maintained out-of-tree. 

That is exactly what I’ve been doing for a few days now, and the process is testing my sanity.

Attaching a string to an interval seems like an easy enough undertaking, and I have got it to work at the UI level, at least. The queries I wanted to be able to make against it run without problems and produce the desired results. Here is my first attempt: https://github.com/selkovjr/locus

Problems arise around queries I didn’t expect to be making and there are issues around indexing that I am not sure how to solve.

The main problem is that attaching a tag to an interval makes it incommensurate with intervals having a different tag. That makes them hard to index with an access method based on containment, such as GiST.

Problem 1. What is a union of ‘1:6000-7000’ and ‘X:10000-20000’? Intuitively, it should be NULL, however, I am not sure the method allows for that; it was developed for objects living in the same metric space. I have mechanistically reproduce the indexing methods of seg, but the resulting index is broken. All queries against an indexed table return a null result.

Problem 2. While the intersection (overlap, &c.) of any two loci produces obvious results, non-intersection does not. When I query for all loci not overlapping ‘1:6000-7000’, I expect to find all non-overlapping loci on contig 1. I don’t want the query to return anything from other contigs, because it is obvious that features on different contigs do not overlap. I may be able to fix that by making separate functions for non-overlaps and adding a constraint to them, but that seems like a kludge.

Problem 3 (alternative to 1). I realize that any clustering can help build an efficient index, no matter how bizarre. So I could, for example, ignore the contigs altogether and build a single index tree, using only position co-ordinates and pretending that all positions are on the same contig; the question then is whether and how such lossy index will affect the ordering of query results. Can I use a separate function for ordering? I have yet to make an experiment. Not that this would be equivalent to indexing the attributes of a composite type separately (if I understood it correctly).

An alternative to neglecting the contig element might be to use it as a second dimension. Expressed that way, a union of several loci might consist of a set of contig names attached to the bounding interval. Not sure whether that makes any sense; in the first approximation, I imagine something equivalent to storing each contig’s data in a separate table with a separate index, except derived from a single actual database table, but I have no clue for how to go about doing that.


Thanks,

—Gene


Re: genomic locus

От
Gene Selkov
Дата:
Hi Oleg,

Great to hear from you. I wondered how many of the old-timers were still around.

Why not use composite type ? For simple interval approach it's worked for us
(see attached hdate.sql).

I have just begun looking at your hdate example; I see potentially useful stuff in it, but the first thing that I noticed is hat it is not fully equivalent to my problem. It looks like you only need to match intervals, while I need to match intervals and something else — ideally, in a single operation. I attempted to explain that in my reply to Craig Ringer.

If you need to specify distribution
function,

Not in this case; there is no uncertainty associated with the loci; where there is uncertainty is in the existence of a feature called at a locus: is it real or is it a technogenic artifact? But that is a different problem for a later day.

than it may be
worth to see orion project http://orion.cs.purdue.edu/index.html
6 years ago we was thinking about implementation special UNCERTAINTY data type
(http://www.sai.msu.su/~megera/postgres/talks/big_uncertain_data.pdf), but never
started :( It'd be nice if you start this very interesting for science project.

I love uncertainty, and I’ve always wished I could make it computable. I also wish folks around me had the same appreciation for it. My job is to say yes or no where the data suggest maybe, or maybe not. Needless to say, I feel a bit exercised.

I am reading the info you provided with keen interest.


Great stuff, I was not aware of it. I saw it in early development but did not know it made it to the core. I tried it (and will go and update a few kludgy apps where I had to use bad surrogates). It is not directly applicable to genomic loci because it will require additional constraints for intelligent matching. I want to go for compete encapsulation of constraints.

Part of the reason for such a perverse desire is that I use the database as a calculator — that is, I load some data in a one-off experiment and I literally type everything in psql while I muddle through. There is a limit on how much I can type and not screw things up beyond comprehension, so I want the query language to be as easy and interactive as possible. Having to drag along a set of additional constraints is not quite interactive and is error-prone.


Regards,

—Gene

Re: genomic locus

От
Teodor Sigaev
Дата:
Hi!

> Problem 1. What is a union of ‘1:6000-7000’ and ‘X:10000-20000’? Intuitively, it 
> should be NULL, however, I am not sure the method allows for that; it was 
> developed for objects living in the same metric space. I have mechanistically 
> reproduce the indexing methods of seg, but the resulting index is broken. All 
> queries against an indexed table return a null result.
You can use fixed-size signature for string part (contig?) of locus, it will 
help to find that string part by equlity or intersection of need it. Signature 
of inner key should be bitwise-ORed of all child keys. Looking on query example, 
seems, you need to build signature tree and, if signature is the same of two 
key, take into account start-end range.


> Problem 2. While the intersection (overlap, &c.) of any two loci produces 
> obvious results, non-intersection does not. When I query for all loci not 
> overlapping ‘1:6000-7000’, I expect to find all non-overlapping loci on contig 
> 1. I don’t want the query to return anything from other contigs, because it is 
> obvious that features on different contigs do not overlap. I may be able to fix 
> that by making separate functions for non-overlaps and adding a constraint to 
> them, but that seems like a kludge.
Seems like you have two different non-overlap operation...


> Problem 3 (alternative to 1). I realize that any clustering can help build an 
> efficient index, no matter how bizarre. So I could, for example, ignore the 
> contigs altogether and build a single index tree, using only position 
> co-ordinates and pretending that all positions are on the same contig; the 
> question then is whether and how such lossy index will affect the ordering of 
> query results. Can I use a separate function for ordering? I have yet to make an 
> experiment. Not that this would be equivalent to indexing the attributes of a 
> composite type separately (if I understood it correctly).
> 
Ordering and GiST aren't very close things :) You can try to have a look to 
KNN-search feature of GiST. https://www.pgcon.org/2010/schedule/events/227.en.html






-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: genomic locus

От
Teodor Sigaev
Дата:
Hmm, would you try to implement separate type for querying? Similar to tsquery, 
lquery (for ltree), jsquery etc.

Gene Selkov wrote:
> 
>> On Dec 17, 2017, at 7:57 PM, Robert Haas <robertmhaas@gmail.com 
>> <mailto:robertmhaas@gmail.com>> wrote:
>>
>> On Fri, Dec 15, 2017 at 2:49 PM, Gene Selkov <selkovjr@gmail.com 
>> <mailto:selkovjr@gmail.com>> wrote:
>>> I need a data type to represent genomic positions, which will consist of a
>>> string and a pair of integers with interval logic and access methods. Sort
>>> of like my seg type, but more straightforward.
>>
>> Have you thought about just using a composite type?
> 
> Yes, I have. That is sort of what I have been doing; a composite type certainly 
> gets the job done but I don’t feel it reduces query complexity, at least from 
> the user’s point of view. Maybe I don’t know enough.
> 
> Here’s an example of how I imagine a composite genomic locus (conventionally 
> represented as text ‘:’ integer ‘-‘ integer):
> 
> CREATE TYPE locus AS (contig text, coord int4range);
> CREATE TABLE test_locus (
>    pos locus,
>    ref text,
>    alt text,
>    id text
> );
> CREATE INDEX test_locus_coord_ix ON test_locus (((pos).coord));
> \copy test_locus from test_locus.tab
> 
> Where test_locus.tab has stuff like:
> 
> (chr3,"[178916937,178916940]")GAACHP2_PIK3CA_2
> (chr3,"[178916939,178916948]")AGAAAAGATCHP2_PIK3CA_2
> (chr3,"[178916940,178916941]")GACHP2_PIK3CA_2
> (chr3,"[178916943,178916944]")AGCHP2_PIK3CA_2
> (chr3,"[178916943,178916946]")AAGCHP2_PIK3CA_2
> (chr3,"[178916943,178916952]")AAGATCCTCCHP2_PIK3CA_2
> (chr3,"[178916944,178916945]")AGCHP2_PIK3CA_2
> (chr3,"[178916945,178916946]")GCCHP2_PIK3CA_2
> (chr3,"[178916945,178916946]")GTCHP2_PIK3CA_2
> (chr3,"[178916945,178916948]")GATCHP2_PIK3CA_2
> 
> When the table is loaded, I can pull the subset shown above with this query:
> 
> SELECT * FROM test_locus WHERE (pos).contig = 'chr3' AND (pos).coord && 
> '[178916937, 178916948]’;
>                pos               |    ref    | alt |      id
> --------------------------------+-----------+-----+---------------
>   (chr3,"[178916937,178916941)") | GAA       |     | CHP2_PIK3CA_2
>   (chr3,"[178916939,178916949)") | AGAAAAGAT |     | CHP2_PIK3CA_2
>   . . . .
> 
> So far so good. It gets the job done. However, it is only a small step towards a 
> fully encapsulated, monolithic type I want it to be. The above query It is 
> marginally better than its atomic-type equivalent:
> 
> SELECT * FROM test WHERE contig = 'chr3' AND greatest(start, 178916937) <= 
> least(stop, 178916948);
>   contig |   start   |   stop    |    ref    | alt |      id
> --------+-----------+-----------+-----------+-----+---------------
>   chr3   | 178916937 | 178916940 | GAA       |     | CHP2_PIK3CA_2
>   chr3   | 178916939 | 178916948 | AGAAAAGAT |     | CHP2_PIK3CA_2
>   . . . .
> 
> and it requires addition syntax transformations steps to go from conventional 
> locus representation 'chr3:178916937-178916940' to composite 
> '(chr3,"[178916937,178916940]”)’ and back.
> 
> Of course, the relative benefits of partial encapsulation I achieve by bundling 
> text with int4range accumulate, compared to (text, int4, int4), as queries grow 
> more complex. But because the elements of a composite type still require a 
> separate query term for each of them (unless there is some magic I am not aware 
> of), the complexity of a typical query I need to run exceeds my feeble 
> sight-reading capacity. I want things that are conceptually simple to be 
> expressed in simple terms, if possible.
> 
> Like so:
> 
> CREATE EXTENSION locus;
> CREATE TABLE test_locus (
>    pos locus,
>    ref text,
>    alt text,
>    id text
> );
> \copy test_locus from data/oncomine.hotspot.tab
> 
> SELECT * FROM test_locus WHERE pos && 'chr3:178916937-178916948';
>             pos            |    ref    | alt |      id
> --------------------------+-----------+-----+---------------
>   chr3:178916937-178916940 | GAA       |     | CHP2_PIK3CA_2
>   chr3:178916939-178916948 | AGAAAAGAT |     | CHP2_PIK3CA_2
>   chr3:178916940-178916941 | G         | A   | CHP2_PIK3CA_2
>   chr3:178916943-178916944 | A         | G   | CHP2_PIK3CA_2
>   chr3:178916943-178916946 | AAG       |     | CHP2_PIK3CA_2
>   chr3:178916943-178916952 | AAGATCCTC |     | CHP2_PIK3CA_2
>   chr3:178916944-178916945 | A         | G   | CHP2_PIK3CA_2
>   chr3:178916945-178916946 | G         | C   | CHP2_PIK3CA_2
>   chr3:178916945-178916946 | G         | T   | CHP2_PIK3CA_2
>   chr3:178916945-178916948 | GAT       |     | CHP2_PIK3CA_2
> (10 rows)
> 
> I have encountered some pesky geometry / indexing problems while building this 
> extension (https://github.com/selkovjr/locus), but I hope I can solve them at 
> least at the level afforded by the composite type, while keeping the clean 
> interface of a monolithic type. I understand I could probably achieve the same 
> cleanliness by defining functions and operators over the complex type, but by 
> the time I’m done with that, will I have coded about the same amount of stuff as 
> required to build an extended type?
> 
> 
> Regards,
> 
> —Gene
> 
> 
> 
> 

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: genomic locus

От
Gene Selkov
Дата:

On Dec 22, 2017, at 1:53 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:

Hmm, would you try to implement separate type for querying? Similar to tsquery, lquery (for ltree), jsquery etc.

That sounds like a good idea if I want to make an app that will only be accessed through a purpose-built front end. Now I’m messing with lots of data, making numerous small one-off experiments. Since all of that stuff consumes my brain power and keystrokes, I want to minimize all the ancillary stuff or at least make it invisible. But I’ll probably go ahead and add a separate query-friendly type  that if nothing else helps.

I think I can wrangle this type into GiST just by tweaking consistent(), union(), and picksplit(), if I manage to express my needs in C without breaking too many things. My first attempt segfaulted.

Here’s my plan of attack. I think that by setting a union of inconsistent loci (those on different contigs) to [0, MAX_INT], I will expose such union to a huge penalty, even without having to do anything with the current penalty(). No query will go there.

The required change to consistent() is obvious: contigs do not match — go away.

The situation with picksplit() may be more tricky; I can’t imagine all possible consequences until I’ve seen how it works. Currently, with simple intervals, it sorts them by center points and splits the sorted list in half. I have already changed the internal sort function, picksplit_item_cmp(), to make sure the data are sorted on the contig first, then by center point (or any other geometric feature). I am thinking about splitting the list at the first inconsistent contig, sending the consistent first part with its well-defined geometry to the left page, and filling the right page with whatever remains. If the right page has inconsistent contigs, its bounding box will be [0, MAX_INT], and it should be again picked for splitting at the next iteration (if I understand the algorithm correctly).

If all goes to plan, I will end up with an index tree partitioned by contig at the top level and geometrically down from there. That will be as close as I can get to an array of config-specific indices, without having to store data in separate tables.

What do you think of that?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I have a low-level technical question. Because I can’t anticipate the maximum length of contig names (and do not want to waste space), I have made the new locus type a varlena, like this:

#include "utils/varlena.h"

typedef struct LOCUS
{
  int32 l_len_; /* varlena header (do not touch directly!) */
  int32 start;
  int32 end;
  char  contig[FLEXIBLE_ARRAY_MEMBER];
} LOCUS;

#define LOCUS_SIZE(str) (offsetof(LOCUS, contig) + sizeof(str))

That flexible array member messes with me every time I need to copy it while deriving a new locus object from an existing one (or from a pair). What I ended up doing is this:

  LOCUS  *l = PG_GETARG_LOCUS_P(0);
  LOCUS  *new_locus;
  char   *contig;
  int    size;
  new_locus = (LOCUS *) palloc0(sizeof(*new_locus));
  contig = pstrdup(l->contig); // need this to determine the length of contig name at runtime
  size = LOCUS_SIZE(contig);
  SET_VARSIZE(new_locus, size);
  strcpy(new_locus->contig, contig);

Is there a more direct way to clone a varlena structure (possibly assigning an differently-sized contig to it)? One that is also memory-safe?


Thank you,

—Gene


Gene Selkov wrote:
On Dec 17, 2017, at 7:57 PM, Robert Haas <robertmhaas@gmail.com <mailto:robertmhaas@gmail.com>> wrote:

On Fri, Dec 15, 2017 at 2:49 PM, Gene Selkov <selkovjr@gmail.com <mailto:selkovjr@gmail.com>> wrote:
I need a data type to represent genomic positions, which will consist of a
string and a pair of integers with interval logic and access methods. Sort
of like my seg type, but more straightforward.

Have you thought about just using a composite type?
Yes, I have. That is sort of what I have been doing; a composite type certainly gets the job done but I don’t feel it reduces query complexity, at least from the user’s point of view. Maybe I don’t know enough.
Here’s an example of how I imagine a composite genomic locus (conventionally represented as text ‘:’ integer ‘-‘ integer):
CREATE TYPE locus AS (contig text, coord int4range);
CREATE TABLE test_locus (
  pos locus,
  ref text,
  alt text,
  id text
);
CREATE INDEX test_locus_coord_ix ON test_locus (((pos).coord));
\copy test_locus from test_locus.tab
Where test_locus.tab has stuff like:
(chr3,"[178916937,178916940]")GAACHP2_PIK3CA_2
(chr3,"[178916939,178916948]")AGAAAAGATCHP2_PIK3CA_2
(chr3,"[178916940,178916941]")GACHP2_PIK3CA_2
(chr3,"[178916943,178916944]")AGCHP2_PIK3CA_2
(chr3,"[178916943,178916946]")AAGCHP2_PIK3CA_2
(chr3,"[178916943,178916952]")AAGATCCTCCHP2_PIK3CA_2
(chr3,"[178916944,178916945]")AGCHP2_PIK3CA_2
(chr3,"[178916945,178916946]")GCCHP2_PIK3CA_2
(chr3,"[178916945,178916946]")GTCHP2_PIK3CA_2
(chr3,"[178916945,178916948]")GATCHP2_PIK3CA_2
When the table is loaded, I can pull the subset shown above with this query:
SELECT * FROM test_locus WHERE (pos).contig = 'chr3' AND (pos).coord && '[178916937, 178916948]’;
              pos               |    ref    | alt |      id
--------------------------------+-----------+-----+---------------
 (chr3,"[178916937,178916941)") | GAA       |     | CHP2_PIK3CA_2
 (chr3,"[178916939,178916949)") | AGAAAAGAT |     | CHP2_PIK3CA_2
 . . . .
So far so good. It gets the job done. However, it is only a small step towards a fully encapsulated, monolithic type I want it to be. The above query It is marginally better than its atomic-type equivalent:
SELECT * FROM test WHERE contig = 'chr3' AND greatest(start, 178916937) <= least(stop, 178916948);
 contig |   start   |   stop    |    ref    | alt |      id
--------+-----------+-----------+-----------+-----+---------------
 chr3   | 178916937 | 178916940 | GAA       |     | CHP2_PIK3CA_2
 chr3   | 178916939 | 178916948 | AGAAAAGAT |     | CHP2_PIK3CA_2
 . . . .
and it requires addition syntax transformations steps to go from conventional locus representation 'chr3:178916937-178916940' to composite '(chr3,"[178916937,178916940]”)’ and back.
Of course, the relative benefits of partial encapsulation I achieve by bundling text with int4range accumulate, compared to (text, int4, int4), as queries grow more complex. But because the elements of a composite type still require a separate query term for each of them (unless there is some magic I am not aware of), the complexity of a typical query I need to run exceeds my feeble sight-reading capacity. I want things that are conceptually simple to be expressed in simple terms, if possible.
Like so:
CREATE EXTENSION locus;
CREATE TABLE test_locus (
  pos locus,
  ref text,
  alt text,
  id text
);
\copy test_locus from data/oncomine.hotspot.tab
SELECT * FROM test_locus WHERE pos && 'chr3:178916937-178916948';
           pos            |    ref    | alt |      id
--------------------------+-----------+-----+---------------
 chr3:178916937-178916940 | GAA       |     | CHP2_PIK3CA_2
 chr3:178916939-178916948 | AGAAAAGAT |     | CHP2_PIK3CA_2
 chr3:178916940-178916941 | G         | A   | CHP2_PIK3CA_2
 chr3:178916943-178916944 | A         | G   | CHP2_PIK3CA_2
 chr3:178916943-178916946 | AAG       |     | CHP2_PIK3CA_2
 chr3:178916943-178916952 | AAGATCCTC |     | CHP2_PIK3CA_2
 chr3:178916944-178916945 | A         | G   | CHP2_PIK3CA_2
 chr3:178916945-178916946 | G         | C   | CHP2_PIK3CA_2
 chr3:178916945-178916946 | G         | T   | CHP2_PIK3CA_2
 chr3:178916945-178916948 | GAT       |     | CHP2_PIK3CA_2
(10 rows)
I have encountered some pesky geometry / indexing problems while building this extension (https://github.com/selkovjr/locus), but I hope I can solve them at least at the level afforded by the composite type, while keeping the clean interface of a monolithic type. I understand I could probably achieve the same cleanliness by defining functions and operators over the complex type, but by the time I’m done with that, will I have coded about the same amount of stuff as required to build an extended type?
Regards,
—Gene

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                  WWW: http://www.sigaev.ru/

Re: genomic locus

От
Teodor Sigaev
Дата:

> I think I can wrangle this type into GiST just by tweaking consistent(), 
> union(), and picksplit(), if I manage to express my needs in C without breaking 
> too many things. My first attempt segfaulted.
Actually, consistent() can determ actual query data type by strategy number. See 
examples in ltree, intarray


> If all goes to plan, I will end up with an index tree partitioned by contig at 
> the top level and geometrically down from there. That will be as close as I can 
> get to an array of config-specific indices, without having to store data in 
> separate tables.
> 
> What do you think of that?

I have some doubt that you can distinguish root page, but it's possible to 
distinguish leaf pages, intarray and tsearch do that.

Reading your plan, I found an idea for GIN: key for GIN is a pair of (contig, 
one genome position). So, any search for interset operation with be actually a 
range search from (contig, start) to (contig, end)


> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> I have a low-level technical question. Because I can’t anticipate the maximum 
> length of contig names (and do not want to waste space), I have made the new 
> locus type a varlena, like this:
> 
> #include "utils/varlena.h"
> 
> typedef struct LOCUS
> {
>    int32 l_len_; /* varlena header (do not touch directly!) */
>    int32 start;
>    int32 end;
>    char  contig[FLEXIBLE_ARRAY_MEMBER];
> } LOCUS;
> 
> #define LOCUS_SIZE(str) (offsetof(LOCUS, contig) + sizeof(str))
sizeof? or strlen ?

> 
> That flexible array member messes with me every time I need to copy it while 
> deriving a new locus object from an existing one (or from a pair). What I ended 
> up doing is this:
> 
>    LOCUS  *l = PG_GETARG_LOCUS_P(0);
>    LOCUS  *new_locus;
>    char   *contig;
>    int    size;
>    new_locus = (LOCUS *) palloc0(sizeof(*new_locus));
>    contig = pstrdup(l->contig); // need this to determine the length of contig 
l->contig should be null-terminated for pstrdup, but if so, you don't need to 
pstrdup() it - you could use l->contig directly below. BTW, LOCUS_SIZE should 
add 1 byte for '\0' character in this case.


> name at runtime
>    size = LOCUS_SIZE(contig);
>    SET_VARSIZE(new_locus, size);
>    strcpy(new_locus->contig, contig);
> 
> Is there a more direct way to clone a varlena structure (possibly assigning an 
> differently-sized contig to it)? One that is also memory-safe?
Store length of contig in LOCUS struct.


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/