Обсуждение: Add xicorr(X, Y): support for the xi (ξ) correlation coefficient by Chatterjee

Поиск

Список

Период

Сортировка

Add xicorr(X, Y): support for the xi (ξ) correlation coefficient by Chatterjee

От

Florents Tselai

Дата:

31 августа 2025 г., 15:29:51

Hi hackers,

In my analytics work, I frequently conduct extensive correlation discovery.

i.e., given a list of columns, run corr(X, Y) over different pairs and see what pairs score high.

Standard Postgres as-is offers the well-known corr(X, Y)

which is based on the classic Spearman correlation.

Its main drawback is that it detects linear associations.

Over the last 20 years, several measures have been proposed that can detect non-linear relationships as well.

including the Kendall rank and the Maximal Information Coefficient.

The latest celebrity in the area is the xi (ξ) correlation coefficient proposed by Chatterjee [0].

It's rank-based, and is very appealing due to its relatively simple implementation.

You can view a by-hand computation in this video (https://www.youtube.com/watch?v=2OTHH8wz25c)

I've already released pgxicor [1], an extension.

However, since Scipy has already added this to its library [2], I thought I'd propose it for core PG as well.

Here’s a first cut of a patch at this stage I’m mainly looking to gauge interest in including this in core.

Future versions will likely refine the implementation details (e.g., use ArrayType instead of a growable buffer of doubles,

revisit the way ties are handled, and decide whether clamping of negative values is appropriate).

[0] https://arxiv.org/pdf/1909.10140 https://souravchatterjee.su.domains/beam-correlation-trans.pdf

[1] https://github.com/Florents-Tselai/pgxicor

[2] https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chatterjeexi.html

https://discuss.scientific-python.org/t/new-function-scipy-stats-xi-correlation/1498

Вложения

v1-0001-Add-xicorr-X-Y.patch

Re: Add xicorr(X, Y): support for the xi (ξ) correlation coefficient by Chatterjee

От

Dean Rasheed

Дата:

08 сентября 2025 г., 10:20:40

On Sun, 31 Aug 2025 at 13:30, Florents Tselai <florents.tselai@gmail.com> wrote:
>
> Hi hackers,
>
> In my analytics work, I frequently conduct extensive correlation discovery.
> i.e., given a list of columns, run corr(X, Y) over different pairs and see what pairs score high.
> Standard Postgres as-is offers the well-known corr(X, Y)
> which is based on the classic Spearman correlation.
> Its main drawback is that it detects linear associations.

For the record, the Postgres correlation function corr(x,y) computes
the Pearson correlation, not the Spearman correlation.

Assuming there are no ties, the Spearman correlation can be calculated
by applying corr() to the ranks of x and y. For example:

SELECT corr(x, y) AS pearson_corr,
       corr(rx, ry) AS spearman_corr
  FROM (
  SELECT x, y,
         rank() OVER (ORDER BY x) AS rx,
         rank() OVER (ORDER BY y) AS ry
    FROM foo
) t;

> Over the last 20 years, several measures have been proposed that can detect non-linear relationships as well.
> including the Kendall rank and the Maximal Information Coefficient.
>
> The latest celebrity in the area is the xi (ξ) correlation coefficient proposed by Chatterjee [0].
> It's rank-based, and is very appealing due to its relatively simple implementation.
> You can view a by-hand computation in this video (https://www.youtube.com/watch?v=2OTHH8wz25c)

Hmm, interesting.

Based on the way ties were resolved in that video, the same result can
be computed using row_number() instead of rank() as follows:

SELECT 1 - 3 * sum(abs(delta_ry)) / (count(*)^2 - 1) FROM (
  SELECT ry - lag(ry) OVER(ORDER BY x) AS delta_ry FROM (
    SELECT x, row_number() OVER (ORDER BY y, x) AS ry FROM foo
  ) t1
) t2;

but I wouldn't fancy trying to do the full ties computation in SQL.

> I've already released pgxicor [1], an extension.
> However, since Scipy has already added this to its library [2], I thought I'd propose it for core PG as well.
>
> Here’s a first cut of a patch at this stage I’m mainly looking to gauge interest in including this in core.
> Future versions will likely refine the implementation details (e.g., use ArrayType instead of a growable buffer of
doubles,
> revisit the way ties are handled, and decide whether clamping of negative values is appropriate).

Regarding the patch itself, I have 2 main issues:

1). It only works for type float8. It seems to me that any rank-based
correlation function ought to work for any sortable datatype, as the
query above does.

2). It reads all the data into in-memory arrays for processing, which
could consume excessive amounts of memory.

However, even if those issues were addressed, my feeling is that this
is too specialised to be considered for inclusion in core. The fact
that it exists in Scipy and not a core python module is a hint at
that. There are a lot of other Scipy stat functions, the vast majority
of which aren't included in core Postgres.

Regards,
Dean

Re: Add xicorr(X, Y): support for the xi (ξ) correlation coefficient by Chatterjee

От

Andrey Borodin

Дата:

08 сентября 2025 г., 12:03:18

> On 8 Sep 2025, at 12:20, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> However, even if those issues were addressed, my feeling is that this
> is too specialised to be considered for inclusion in core. The fact
> that it exists in Scipy and not a core python module is a hint at
> that. There are a lot of other Scipy stat functions, the vast majority
> of which aren't included in core Postgres.

Kind of +1.
IMO, it's better to have Spearman and Kendal coefficients before original research from 2019.
I don't see anything wrong with having all them in core eventually, cost of maintaining settled math tools must not be
toohigh. 

Best regards, Andrey Borodin.

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Add xicorr(X, Y): support for the xi (ξ) correlation coefficient by Chatterjee

Add xicorr(X, Y): support for the xi (ξ) correlation coefficient by Chatterjee

Вложения

Re: Add xicorr(X, Y): support for the xi (ξ) correlation coefficient by Chatterjee

Re: Add xicorr(X, Y): support for the xi (ξ) correlation coefficient by Chatterjee