5-Minute Quickstart¶
heavyai follows the Python DB API 2.0, so experience with other Python
database clients will carry over to heavyai.
Note
This tutorial assumes you have a HeavyDB server running on localhost:6274
with the default logins and databases, and have loaded the example
flights_2008_10k dataset. This dataset can be loaded with the
insert_sample_data script included in the HeavyDB install directory.
Installing heavyai¶
heavyai¶
heavyai can be installed with conda (from conda-forge) or with pip.
# conda
conda install -c conda-forge heavyai
# pip
pip install heavyai
If you have an NVIDIA GPU on the same machine where your heavyai code will be running, you’ll want to install cudf as well, to return result sets into GPU memory as a cudf GPU DataFrame:
cudf¶
The pre-built cudf is only available on conda, so using a conda environment is required.
conda install -c nvidia -c conda-forge -c defaults cudf cudatoolkit
Connecting¶
Self-Hosted Install¶
For self-hosted HeavyDB installs, use protocol='binary' (the default) to
connect with HeavyDB, as this will have better performance than using
protocol='http' or protocol='https'.
Create a Connection using the connect() method along with user, password,
host, and dbname:
>>> from heavyai import connect
>>> con = connect(user="admin", password="HyperInteractive", host="localhost",
... dbname="heavyai")
>>> con
Connection(heavydb://admin:***@localhost:6274/heavyai?protocol=binary)
Alternatively, you can pass in a SQLAlchemy-compliant connection string to
the connect()
method:
>>> uri = "heavydb://admin:HyperInteractive@localhost:6274/heavyai?protocol=binary"
>>> con = connect(uri=uri)
>>> con
Connection(heavydb://admin:***@localhost:6274/heavyai?protocol=binary)
HeavyDB Cloud¶
When connecting to HeavyDB Cloud, the two methods above work the same way,
except that you can only use protocol='https'. For a step-by-step
walk-through with screenshots, please see this blog post.
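As an illustration, a Cloud connection string differs from the self-hosted one above only in the host, port, and protocol. The hostname, password, and port 443 below are placeholder assumptions, not real Cloud values; substitute the details from your own Cloud dashboard:

```python
# Building a HeavyDB Cloud connection URI by hand. The host, password,
# and port are placeholders -- use the values from your Cloud dashboard.
user = "admin"
password = "your-cloud-password"   # placeholder
host = "your-instance.heavy.ai"    # placeholder hostname
uri = f"heavydb://{user}:{password}@{host}:443/heavyai?protocol=https"
print(uri)
```

The resulting string can then be passed to connect(uri=uri), exactly as in the self-hosted example.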
Querying¶
A few options are available for getting the results of a query into your Python process.
Into GPU memory via cudf (Connection.select_ipc_gpu())
Into CPU shared memory via Apache Arrow and pandas (Connection.select_ipc())
Into Python objects via Apache Thrift (Connection.execute())
The best option depends on the hardware you have available, your connection to the database, and what you plan to do with the returned data. In general, the third method, using Thrift to serialize and deserialize the data, will be slower than the GPU or CPU shared memory methods. The shared memory methods require that your HeavyDB database is running on the same machine as your Python process.
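The trade-offs above can be sketched as a small decision helper. This is purely illustrative; the function below is not part of the heavyai API:

```python
# Illustrative helper (not part of heavyai): pick the name of the
# Connection method to use, given the local conditions described above.
def pick_fetch_method(gpu_available: bool, same_machine: bool) -> str:
    if same_machine and gpu_available:
        return "select_ipc_gpu"  # cudf GPU DataFrame via IPC
    if same_machine:
        return "select_ipc"      # pandas DataFrame via Arrow shared memory
    return "execute"             # plain Python objects via Apache Thrift

# A remote client without a GPU falls back to Thrift
print(pick_fetch_method(gpu_available=False, same_machine=False))  # execute
```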
Note
We currently support Timestamp(0|3|6) data types, i.e. seconds,
milliseconds, and microseconds granularity. Support for nanoseconds,
Timestamp(9), is in progress.
pandas.read_sql()¶
With a Connection defined, you can use pandas.read_sql() to read your data
into a pandas DataFrame. This will be slower than using
Connection.select_ipc(), but works regardless of where the Python code is
running (select_ipc() must be on the same machine as the HeavyDB install,
while pandas.read_sql() works everywhere):
>>> from heavyai import connect
>>> import pandas as pd
>>> con = connect(user="admin", password="HyperInteractive", host="localhost",
... dbname="heavyai")
>>> df = pd.read_sql("SELECT depdelay, arrdelay FROM flights_2008_10k limit 100", con)
Cursors¶
After connecting to HeavyDB, a cursor can be created with Connection.cursor()
:
>>> c = con.cursor()
>>> c
<heavyai.cursor.Cursor at 0x110fe6438>
Or by using a context manager:
>>> with con.cursor() as c:
... print(c)
<heavyai.cursor.Cursor object at 0x1041f9630>
Arbitrary SQL can be executed using Cursor.execute()
.
>>> c.execute("SELECT depdelay, arrdelay FROM flights_2008_10k limit 100")
<heavyai.cursor.Cursor at 0x110fe6438>
This sets the rowcount property to the number of returned rows:
>>> c.rowcount
100
The description attribute contains a list of Description objects, a
namedtuple with the usual attributes required by the spec. There’s one entry
per returned column, and we fill the name, type_code, and null_ok
attributes.
>>> c.description
[Description(name='depdelay', type_code=0, display_size=None, internal_size=None, precision=None, scale=None, null_ok=True),
Description(name='arrdelay', type_code=0, display_size=None, internal_size=None, precision=None, scale=None, null_ok=True)]
Cursors are iterable; iterating yields one tuple of values per row:
>>> result = list(c)
>>> result[:5]
[(38, 28), (0, 8), (-4, 9), (1, -1), (1, 2)]
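Because each row is a plain tuple, the column names from Cursor.description can be zipped back in to get dict-style records. The values below are hand-written stand-ins for list(c) and the names from c.description in the session above:

```python
# Stand-in values for list(c) and the column names from c.description
rows = [(38, 28), (0, 8), (-4, 9)]
names = ["depdelay", "arrdelay"]

# One dict per row, keyed by column name
records = [dict(zip(names, row)) for row in rows]
print(records[0])  # {'depdelay': 38, 'arrdelay': 28}
```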
Loading Data¶
The fastest way to load data is Connection.load_table_arrow(). Internally,
this uses pyarrow and the Apache Arrow format to exchange data with the
HeavyDB database.
>>> import pyarrow as pa
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1, 2], "B": ['c', 'd']})
>>> table = pa.Table.from_pandas(df)
>>> con.load_table_arrow("table_name", table)
This accepts either a pyarrow.Table or a pandas.DataFrame, which will be
converted to a pyarrow.Table before loading.
You can also load a pandas.DataFrame using the Connection.load_table() or
Connection.load_table_columnar() methods.
>>> df = pd.DataFrame({"A": [1, 2], "B": ["c", "d"]})
>>> con.load_table_columnar("table_name", df, preserve_index=False)
If you aren’t using Arrow or pandas, you can pass a list of tuples to
Connection.load_table_rowwise().
>>> data = [(1, "c"), (2, "d")]
>>> con.load_table_rowwise("table_name", data)
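If your data starts out columnar (for example, a dict of lists like the DataFrame constructor above), it can be transposed into the row-wise shape with zip before calling load_table_rowwise():

```python
# Columnar data, matching the DataFrame examples above
columns = {"A": [1, 2], "B": ["c", "d"]}

# Transpose to one tuple per row -- the shape load_table_rowwise() expects.
# Dict insertion order is preserved in Python 3.7+, so column order is stable.
rows = list(zip(*columns.values()))
print(rows)  # [(1, 'c'), (2, 'd')]
```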
The high-level Connection.load_table() method will choose the fastest method
available based on the type of data:
Connection.load_table_arrow() cannot accept a DataFrame with more than 2^31 - 1 elements.
Lists of tuples are always loaded with Connection.load_table_rowwise().
A pandas.DataFrame or pyarrow.Table will be loaded using Connection.load_table_arrow().
If upload fails using the Arrow method, a pandas.DataFrame can be loaded using Connection.load_table_columnar().
Database Metadata¶
Some helpful metadata are available on the Connection
object.
Get a list of tables with
Connection.get_tables()
>>> con.get_tables()
['flights_2008_10k', 'stocks']
Get column information for a table with
Connection.get_column_details()
>>> con.get_column_details('stocks')
[ColumnDetails(name='date_', type='STR', nullable=True, precision=0, scale=0, comp_param=32),
 ColumnDetails(name='trans', type='STR', nullable=True, precision=0, scale=0, comp_param=32),
 ...
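ColumnDetails is a namedtuple, so its fields are attribute-accessible. The entries below are hand-built stand-ins mirroring the output above (with a reduced set of fields, for brevity):

```python
from collections import namedtuple

# Reduced stand-in for heavyai's ColumnDetails namedtuple, for illustration
ColumnDetails = namedtuple("ColumnDetails", ["name", "type", "nullable"])
details = [
    ColumnDetails(name="date_", type="STR", nullable=True),
    ColumnDetails(name="trans", type="STR", nullable=True),
]

# Attribute access makes it easy to pull out, e.g., all column names
names = [c.name for c in details]
print(names)  # ['date_', 'trans']
```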
Runtime User-Defined Functions¶
A Connection instance is callable; it can be used as a decorator on Python functions to register them as Runtime UDFs:
>>> @con('int32(int32, int32)')
... def totaldelay(dep, arr):
... return dep + arr
...
>>> query = ("SELECT depdelay, arrdelay, totaldelay(depdelay, arrdelay)"
... " FROM flights_2008_10k limit 100")
>>> df = con.select_ipc(query)
>>> df.head()
depdelay arrdelay EXPR$2
0 8 -14 -6
1 19 2 21
2 8 14 22
3 -4 -6 -10
4 34 34 68
Note
Runtime UDFs can be defined only if the HeavyDB server has enabled support
for them (see the --enable-runtime-udfs option of heavydb) and the rbc
package is installed. This is still experimental functionality, and it
currently does not work on the Windows operating system.