Slow performance when reading a Teradata table in Python

How can I fix slow performance when reading a Teradata table in Python?

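I am trying to read a table from Teradata, and it takes a long time. The table has 5 million rows and 60 columns, and loading it into memory takes 30 minutes. I am using the teradatasql package, yet the same table took only 5 minutes to load into R with the RJDBC package.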

Python code (this takes 30 minutes)

import teradatasql
import pandas as pd

conn = teradatasql.connect(host=host,user=user_name,password=password,database=database)
df = pd.read_sql("SELECT * FROM big_table",conn)

R code (takes only 3 minutes)

library(RJDBC)

# teradata connection
con_tera <- dbConnect(drv_tera,"jdbc:teradata://{ip_host}/DATABASE=DBI_MIN,DBS_PORT=1025",Sys.getenv("tera_DB_USER"),Sys.getenv("tera_DB_PASS"))

# create query
final_query <- 'select * from big_table'

# get data
dataset_caribu <- dbGetQuery(con_tera,final_query)

I tried increasing the cursor's array size in Python, but it did not improve the execution time much.

Solution

pandas.read_sql is slower than using the teradatasql driver directly.

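I tested this with a simple Python script against a table with 5 million rows and 60 columns, where 80% of the column values were non-NULL and 20% were NULL. My results were:

fetchall took 638.6090559959412 seconds,or 10.64348426659902 minutes,and returned 5000000 rows
read_sql took 2293.84486413002 seconds,or 38.23074773550034 minutes,and returned 5000000 rows

On the same data, read_sql took roughly 3.6 times as long as cursor.fetchall.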
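To use the driver directly and still end up with a DataFrame, one option is to fetch the rows through a cursor and build the DataFrame afterwards. The following is only a minimal sketch under stated assumptions, not the script used for the timings above: the helper names fetch_rows, rows_to_dataframe and timed are hypothetical, and the code relies only on the standard DB-API cursor methods (execute, fetchall, description, close), which teradatasql provides.

```python
import time
from contextlib import closing


def fetch_rows(conn, query):
    """Run query on a DB-API connection and return (column_names, rows)."""
    # closing() works with any cursor that has a close() method.
    with closing(conn.cursor()) as cur:
        cur.execute(query)
        rows = cur.fetchall()
        # In DB-API 2.0, description has one entry per column, name first.
        columns = [col[0] for col in cur.description]
    return columns, rows


def rows_to_dataframe(columns, rows):
    """Build a pandas DataFrame from rows that were already fetched."""
    import pandas as pd  # imported here so fetch_rows works without pandas

    return pd.DataFrame(rows, columns=columns)


def timed(label, func, *args):
    """Call func(*args), print the elapsed wall-clock time, return the result."""
    start = time.perf_counter()
    result = func(*args)
    print(f"{label} took {time.perf_counter() - start:.1f} seconds")
    return result
```

With the conn object from the question, usage might look like columns, rows = timed("fetchall", fetch_rows, conn, "SELECT * FROM big_table") followed by df = rows_to_dataframe(columns, rows). Timing the two steps separately shows whether the time goes into the fetch itself or into building the DataFrame.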