Question:
I am trying to write PySpark DataFrames to ADW (Oracle Autonomous Data Warehouse) using JDBC in a Jupyter Lab environment, but the performance is low.
dataframe.write.format("jdbc") \
    .mode("overwrite") \
    .option("batchsize", batchsize) \
    .option("createTableColumnTypes", create_str) \
    .option("rewriteBatchedStatements", "true") \
    .option("url", jdbc_url) \
    .option("dbtable", table) \
    .option("user", self.user) \
    .option("password", self.password) \
    .option("driver", "oracle.jdbc.OracleDriver") \
    .save()
I'm using the rewriteBatchedStatements and batchsize parameters, but the performance is still bad.
Using other tools like DBeaver, the load performance is better. Could you suggest a guide or best practices to achieve this connection?
Environment: ojdbc8 driver, Spark 3.5.0, Oracle 19c.
Answer:
Oracle ADW performs best when the driver can batch inserts efficiently. A few things to check:

First, rewriteBatchedStatements is a MySQL Connector/J property; the Oracle JDBC driver ignores it, so it does nothing here. With ojdbc8, Spark's own batchsize option controls batching, so try a larger value (for example 10000), and repartition the DataFrame (or set the numPartitions option) so several executors write in parallel.

Second, you can pass an Oracle driver connection property to raise the driver-side batch value:

.option("oracle.jdbc.defaultBatchValue", "5000")

Note that defaultRowPrefetch (default 10) tells the driver how many rows to fetch per round trip, but it only affects reads, not writes, so it will not speed up your load.
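Putting the suggestions above together, a minimal sketch of the write might look like this. All connection values (URL, table, credentials) and the batch sizes are placeholder assumptions you would replace with your own; the actual save() call is shown commented out since it needs a live Spark session and database.

```python
# Sketch: PySpark JDBC write options for Oracle ADW with batching tuned.
# Every value below is a placeholder assumption, not your real config.

jdbc_options = {
    "url": "jdbc:oracle:thin:@//myhost:1521/myservice",  # placeholder URL
    "dbtable": "MY_TABLE",                               # placeholder table
    "user": "MY_USER",                                   # placeholder user
    "password": "MY_PASSWORD",                           # placeholder password
    "driver": "oracle.jdbc.OracleDriver",
    # Spark's write-side batching: rows sent per JDBC batch round trip.
    "batchsize": "10000",
    # Oracle driver property suggested above (assumed effective on ojdbc8).
    "oracle.jdbc.defaultBatchValue": "5000",
    # Cap on parallel write connections; match it to your DataFrame partitions.
    "numPartitions": "8",
}

# With a live SparkSession and DataFrame df, the write would be:
# df.repartition(8).write.format("jdbc") \
#     .mode("overwrite") \
#     .options(**jdbc_options) \
#     .save()
```

The dict form with .options(**jdbc_options) is equivalent to chaining .option() calls and keeps the tuning knobs in one place.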
Hope it helps.