Re: Double insertion from scala spark job

From: Antoine DUBOIS
Subject: Re: Double insertion from scala spark job
Date:
Msg-id: 1856232242.3494395.1612947743989.JavaMail.zimbra@cc.in2p3.fr
In reply to: Re: Double insertion from scala spark job (Dave Cramer <davecramer@postgres.rocks>)
List: pgsql-jdbc
Hi Dave,

I agree that without more code or data it's hard; I just wanted to share the issue with you.
It's an edge case even on my side and only happens with some of my data samples.
I simply wanted to report that I ran into such an edge case, in case I'm not the only one.
It might also be a bug in Spark rather than in the JDBC driver.
Thank you for your answer, and have a great day.


De: "Dave Cramer" <davecramer@postgres.rocks>
À: "antoine dubois" <antoine.dubois@cc.in2p3.fr>
Cc: "List" <pgsql-jdbc@postgresql.org>
Envoyé: Mercredi 10 Février 2021 01:44:44
Objet: Re: Double insertion from scala spark job



On Tue, 9 Feb 2021 at 06:48, Antoine DUBOIS <antoine.dubois@cc.in2p3.fr> wrote:
Hello,

I'm working with Spark and PostgreSQL to compute statistics.
I've run into strange behaviour in my job: when writing the output to PostgreSQL I sometimes get a double insertion into my table (violating a unique constraint).
Detail: Key (xxx, xxx, xxx, xxx, xxx, xxxx, xxxx)=(2021-02-05 00:00:00, data, moredate, evenmoredata, somuchmoredata, dataagain, somuchofit) already exists.  Call getNextException to see other errors in the batch.
The data end up duplicated only with PostgreSQL: if I write the same data into MySQL or into a Parquet file, with the same input and the same processing, I don't observe this behaviour.
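
For reference, the error above suggests calling getNextException; below is a rough sketch (not my actual job code) of how one could walk down to the chained batch error, assuming Spark wraps the SQLException in its own job exception:

import java.sql.SQLException

def printJdbcErrors(t: Throwable): Unit = {
  // Walk the cause chain until we find the SQLException (Spark may wrap it).
  var cause: Throwable = t
  while (cause != null && !cause.isInstanceOf[SQLException]) {
    cause = cause.getCause
  }
  cause match {
    case sql: SQLException =>
      // For batch failures the useful message is often attached as the
      // "next" exception rather than the top-level one.
      var e: SQLException = sql
      while (e != null) {
        println(e.getMessage)
        e = e.getNextException
      }
    case _ => t.printStackTrace()
  }
}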
Dev spec:
Scala 2.12
Spark Version 3.0.1
JDK 8
jdbc  "org.postgresql" % "postgresql" % "42.2.18"

PostgreSQL 12.5

My code is pretty simple: it applies a SQL query to a Parquet file and writes the result like this:

outputDF.write
  .format("jdbc")
  .option("driver", "org.postgresql.Driver")
  .option("url", "jdbc:postgresql://<HOST>:<PORT>/<SCHEMA>?user=<USERNAME>&password=<PASSWORD>")
  .option("dbtable", "mytable")
  .mode("append")
  .save()
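
As an illustration only, here is a sketch of the same write (reusing outputDF from above) with an explicit dropDuplicates on the key columns, in case deduplicating before the write works around the issue; the column names are placeholders for the real unique-constraint columns (masked as xxx in the error above):

import org.apache.spark.sql.SaveMode

// Placeholder column names: substitute the actual columns of the unique constraint.
val deduped = outputDF.dropDuplicates(Seq("keyCol1", "keyCol2", "keyCol3"))

deduped.write
  .format("jdbc")
  .option("driver", "org.postgresql.Driver")
  .option("url", "jdbc:postgresql://<HOST>:<PORT>/<SCHEMA>?user=<USERNAME>&password=<PASSWORD>")
  .option("dbtable", "mytable")
  .mode(SaveMode.Append)
  .save()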

What leads me to think it's a PostgreSQL JDBC bug rather than anything else is that the same command, writing to MySQL or to a Parquet file, produces no duplicates in this particular edge case, which I hit with only some of my input files.
If any of you have an idea of what could cause such behaviour (a special character in the input, something misconfigured, maybe an option I don't know about that could help solve this issue), I'd be glad to hear it.

I've reached a point where I'm not sure of anything any longer.
I hope someone will have some thoughts about it.

You are the first person to report such a problem.

Without additional information, such as your code, there's little we can do.


Dave Cramer
www.postgres.rocks

