Re: Does people favor to have matrix data type?

Поиск
Список
Период
Сортировка
От Kouhei Kaigai
Тема Re: Does people favor to have matrix data type?
Дата
Msg-id 9A28C8860F777E439AA12E8AEA7694F8011F8DC7@BPXM15GP.gisp.nec.co.jp
обсуждение исходный текст
Ответ на Re: Does people favor to have matrix data type?  (Jim Nasby <Jim.Nasby@BlueTreble.com>)
Список pgsql-hackers
> -----Original Message-----
> From: Jim Nasby [mailto:Jim.Nasby@BlueTreble.com]
> Sent: Wednesday, June 01, 2016 11:32 PM
> To: Kaigai Kouhei(海外 浩平); Gavin Flower; Joe Conway; Ants Aasma; Simon Riggs
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Does people favor to have matrix data type?
> 
> On 5/30/16 9:05 PM, Kouhei Kaigai wrote:
> > Due to performance reason, location of each element must be deterministic
> > without walking on the data structure. This approach guarantees we can
> > reach individual element with 2 steps.
> 
> Agreed.
> 
> On various other points...
> 
> Yes, please keep the discussion here, even when it relates only to PL/R.
> Whatever is being done for R needs to be done for plpython as well. I've
> looked at ways to improve analytics in plpython related to this, and it
> looks like I need to take a look at the fast-path function stuff. One of
> the things I've pondered for storing ndarrays in Postgres is how to
> reduce or eliminate the need to copy data from one memory region to
> another. It would be nice if there was a way to take memory that was
> allocated by one manager (ie: python's) and transfer ownership of that
> memory directly to Postgres without having to copy everything. Obviously
> you'd want to go the other way as well. IIRC cython's memory manager is
> the same as palloc in regard to very large allocations basically being
> ignored completely, so this should be possible in that case.
> 
> One thing I don't understand is why this type needs to be limited to 1
> or 2 dimensions? Isn't the important thing how many individual elements
> you can fit into GPU? So if you can fit a 1024x1024, you could also fit
> a 100x100x100, a 32x32x32x32, etc. At low enough values maybe that stops
> making sense, but I don't see why there needs to be an artificial limit.
>
Simply, I didn't looked at the matrix larger than 2-dimensional. 
Is it a valid mathematic concept?
Because of the nature of GPU, it is designed to map threads on X-, Y-,
and Z-axis. However, not limited to 3-dimensions, because programmer can
handle upper 10bit of X-axis as 4th dimension for example.

> I think what's important for something like kNN is that the storage is
> optimized for this, which I think means treating the highest dimension
> as if it was a list. I don't know if it then matters whither the lower
> dimensions are C style vs FORTRAN style. Other algorithms might want
> different storage.
>
FORTRAN style is preferable, because it allows to use BLAS library without
data format transformation.
I'm not sure why you prefer a list structure on the highest dimension.
A simple lookup table is enough and suitable for massive parallel processors.

> Something else to consider is the 1G toast limit. I'm pretty sure that's
> why MADlib stores matricies as a table of vectors. I know for certain
> it's a problem they run into, because they've discussed it on their
> mailing list.
> 
> BTW, take a look at MADlib svec[1]... ISTM that's just a special case of
> what you're describing with entire grids being zero (or vice-versa).
> There might be some commonality there.
> 
> [1] https://madlib.incubator.apache.org/docs/v1.8/group__grp__svec.html
>
Once we try to deal with a table as representation of matrix, it goes
beyond the scope of data type in PostgreSQL. Likely, users have to take
something special jobs to treat it, more than operator functions that
support matrix data types.

For a matrix larger than toast limit, my thought is that; a large matrix
which is consists of multiple grid can have multiple toast pointers for
each sub-matrix. Although individual sub-matrix must be up to 1GB, we can
represent an entire matrix as a set of grid more than 1GB.

As I wrote in the previous message, a matrix structure head will have offset
to each sub-matrix of grid. The sub-matrix will be inline, or external toast
if VARATT_IS_1B_E(ptr).
Probably, we also have to hack row deletion code not to leak sub-matrix in
the toast table.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
--
> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> Experts in Analytics, Data Architecture and PostgreSQL
> Data in Trouble? Get it in Treble! http://BlueTreble.com
> 855-TREBLE2 (855-873-2532)   mobile: 512-569-9461

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andreas Karlsson
Дата:
Сообщение: Re: Parallel safety tagging of extension functions
Следующее
От: Andres Freund
Дата:
Сообщение: Re: Perf Benchmarking and regression.