Re: Adding facility for injection points (or probe points?) for more advanced tests
From | Michael Paquier |
---|---|
Subject | Re: Adding facility for injection points (or probe points?) for more advanced tests |
Date | |
Msg-id | ZVNftCZgry5pSDzD@paquier.xyz |
In reply to | Re: Adding facility for injection points (or probe points?) for more advanced tests (Andres Freund <andres@anarazel.de>) |
Responses | Re: Adding facility for injection points (or probe points?) for more advanced tests |
List | pgsql-hackers |
On Fri, Nov 10, 2023 at 06:32:27PM -0800, Andres Freund wrote:
> I would like to see a few example tests using this facility - without that
> it's a bit hard to judge how the impact on core code would be and how easy
> tests are to write.

Sure. I was wondering if people would be interested in that first.

> It also seems like there's a few bits and pieces missing to actually be able
> to write interesting tests. It's one thing to be able to inject code, but what
> you commonly want to do for tests is to actually wait for such a spot in the
> code to be reached, then perhaps wait inside the "modified" code, and do
> something else in the test script. But as-is a decent amount of C code would
> need to be written to write such a test, from what I can tell?

Depends on what you'd want to achieve. As I mentioned at the top of the thread, errors, fatals, panics and hardcoded waits are the most common cases I've seen in the last few years. Conditional waits are not in the main patch but these are simple to support (I mean, as in the 0003 attached with a TAP example).

While at it, I have extended the patch to store a library name and a function name in the hash table, so that the callback is loaded each time an injection point is run. (Perhaps the list of callbacks already loaded in a process should be saved in a session-level static list/array to avoid loading the same callbacks again. I am not sure that's worth doing for a test facility, assuming that the number of times a callback is called in a single session is usually very limited. Anyway, that would be simple to add if people prefer this addition.)

Anyway, here is a short list of commits that could have benefited from this facility. There are many more, but this is a list I grabbed quickly from my notes:

1) 8a4237908c0f

2) cb0cca188072

3) 7863ee4def65 (See https://postgr.es/m/YnT/Y2sEYj7pyOdc@paquier.xyz where an expensive TAP test was included, and I've seen users facing this bug in real life). Revert of the original is clean here as well. The trick is simple: stop a restartpoint during a promotion, and let the restartpoint finish after the promotion.

4) 409f9ca44713, where injecting an error would stress the consistency of the data reset (I mentioned an injected error at https://postgr.es/m/YWZk6nmAzQZS4B/z@paquier.xyz). This reverts cleanly even today.

5) b4721f39505b, quite similar (I mentioned an error injection exactly here: https://postgr.es/m/20181011033810.GB23570@paquier.xyz). This one requires an error when a transaction is started, something that can be achieved if the error is triggered conditionally (note that a hard failure would prevent the transaction from beginning with the initial snapshot taken in InitPostgres, but the module could just use a static variable to track that).

Among these, I have implemented two examples on top of the main patch set in 0002 and 0003: 4) as a TAP test with replication commands and an error injection, and 3) that relies on a custom wait event and a condition variable to make the test posted on the other thread cheaper, with an injection point waiting for a condition variable in the middle of a restartpoint in the checkpointer. I don't mean to necessarily include all that in the upstream tree, these are just here for reference first.

3) is the most interesting in this set, for sure. That was a nasty problem, and some cheap coverage in the core tree could be really good for it, so I'd like to propose it for commit after more polishing. The test for bug 3) I am referring to originally took 30~45s to run, and it was unstable as it could time out. With an injection point it takes 1~2s. (Note that test_injection_points gains wait/wake logic to be able to use condition variables to wait on the restartpoint of a promoted standby.)

Neither test is ready for prime time yet, but that's enough for a set of examples IMHO to show what can be done.

Does that answer your questions?
--
Michael
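
For readers following the thread, here is a minimal C sketch of two ideas from the message: a registry entry that stores a library name and a function name per injection point, with an optional per-process cache of resolved callbacks, and a callback that fails conditionally by tracking prior hits in a static variable. This is a toy model, not code from the patch. Every name in it (`injection_entry`, `fake_load`, `injection_run`, `error_after_first_hit`, `demo_lib`, the point name) is hypothetical, and `fake_load` only stands in for a real shared-library lookup.

```c
/* Toy model only: none of these names come from the patch. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*injection_callback) (const char *point_name);

/* One registry entry: point name -> (library, function). */
typedef struct
{
    const char *name;
    const char *library;
    const char *function;
    injection_callback cached;  /* NULL until resolved with caching on */
} injection_entry;

static int load_count = 0;      /* number of simulated library loads */
static int error_count = 0;     /* number of simulated errors raised */

/*
 * Conditional error callback: let the first hit pass, "fail" afterwards.
 * A static variable remembers whether the point was already reached.
 */
static void
error_after_first_hit(const char *point_name)
{
    static int hits = 0;

    (void) point_name;
    if (hits++ > 0)
        error_count++;          /* stand-in for raising an error */
}

/* Stand-in for resolving a function from a shared library. */
static injection_callback
fake_load(const char *library, const char *function)
{
    load_count++;
    if (strcmp(library, "demo_lib") == 0 &&
        strcmp(function, "error_after_first_hit") == 0)
        return error_after_first_hit;
    return NULL;
}

static injection_entry registry[] = {
    {"start-transaction", "demo_lib", "error_after_first_hit", NULL},
};

/* Run the callback attached to a point; returns 1 if one ran, else 0. */
int
injection_run(const char *name, int use_cache)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof(registry) / sizeof(registry[0]); i++)
    {
        injection_entry *e = &registry[i];
        injection_callback cb;

        if (strcmp(e->name, name) != 0)
            continue;
        /* Without caching, the callback is resolved on every run. */
        cb = use_cache ? e->cached : NULL;
        if (cb == NULL)
            cb = fake_load(e->library, e->function);
        if (use_cache)
            e->cached = cb;
        if (cb == NULL)
            return 0;
        cb(name);
        return 1;
    }
    return 0;
}
```

Calling `injection_run("start-transaction", 0)` repeatedly resolves the callback on each call, while passing `1` as the second argument resolves it once and reuses the cached pointer. Whether that saving matters for a test facility is the open question raised in the message.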