5-Minute Quickstart¶
heavyai follows the Python DB API 2.0, so experience with other Python
database clients will carry over to heavyai.
Note
This tutorial assumes you have a HeavyDB server running on localhost:6274
with the default logins and databases, and have loaded the example
flights_2008_10k dataset. This dataset can be loaded with the
insert_sample_data script included in the HeavyDB install directory.
Installing heavyai¶
heavyai¶
heavyai can be installed with conda (from conda-forge) or with pip.
# conda
conda install -c conda-forge heavyai
# pip
pip install heavyai
If you have an NVIDIA GPU on the same machine where your heavyai code will be running, you’ll want to install cudf as well, to return result sets into GPU memory as a cudf GPU DataFrame:
cudf¶
The pre-built cudf is only available on conda, so using a conda environment is required.
conda install -c nvidia -c conda-forge -c defaults cudf cudatoolkit
Connecting¶
Self-Hosted Install¶
For self-hosted HeavyDB installs, use protocol='binary' (the default) to
connect with HeavyDB, as this will have better performance than using
protocol='http' or protocol='https'.
Create a Connection using the connect() method along with user, password,
host, and dbname:
>>> from heavyai import connect
>>> con = connect(user="admin", password="HyperInteractive", host="localhost",
... dbname="heavyai")
>>> con
Connection(heavydb://admin:***@localhost:6274/heavyai?protocol=binary)
Alternatively, you can pass in a SQLAlchemy-compliant connection string to
the connect()
method:
>>> uri = "heavydb://admin:HyperInteractive@localhost:6274/heavyai?protocol=binary"
>>> con = connect(uri=uri)
>>> con
Connection(heavydb://admin:***@localhost:6274/heavyai?protocol=binary)
HeavyDB Cloud¶
When connecting to HeavyDB Cloud, the two methods above work the same way,
except that you can only use protocol='https'. For a step-by-step
walk-through with screenshots, please see this blog post.
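As an illustration, a Cloud connection string differs from the self-hosted one above only in the host, port, and protocol. The hostname, password, and port 443 below are placeholder assumptions, not real Cloud values; substitute the details from your own Cloud dashboard:

```python
# Building a HeavyDB Cloud connection URI by hand. The host, password,
# and port are placeholders -- use the values from your Cloud dashboard.
user = "admin"
password = "your-cloud-password"   # placeholder
host = "your-instance.heavy.ai"    # placeholder hostname
uri = f"heavydb://{user}:{password}@{host}:443/heavyai?protocol=https"
print(uri)
```

The resulting string can then be passed to connect(uri=uri), exactly as in the self-hosted example.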
Querying¶
A few options are available for getting the results of a query into your Python process.
Into GPU memory via cudf (Connection.select_ipc_gpu())
Into CPU shared memory via Apache Arrow and pandas (Connection.select_ipc())
Into Python objects via Apache Thrift (Connection.execute())
The best option depends on the hardware you have available, your connection to the database, and what you plan to do with the returned data. In general, the third method, using Thrift to serialize and deserialize the data, will be slower than the GPU or CPU shared memory methods. The shared memory methods require that your HeavyDB database is running on the same machine as your Python process.
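The trade-offs above can be sketched as a small decision helper. This is purely illustrative; the function below is not part of the heavyai API:

```python
# Illustrative helper (not part of heavyai): pick the name of the
# Connection method to use, given the local conditions described above.
def pick_fetch_method(gpu_available: bool, same_machine: bool) -> str:
    if same_machine and gpu_available:
        return "select_ipc_gpu"  # cudf GPU DataFrame via IPC
    if same_machine:
        return "select_ipc"      # pandas DataFrame via Arrow shared memory
    return "execute"             # plain Python objects via Apache Thrift

# A remote client without a GPU falls back to Thrift
print(pick_fetch_method(gpu_available=False, same_machine=False))  # execute
```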
Note
We currently support Timestamp(0|3|6) data types, i.e. seconds,
milliseconds, and microseconds granularity. Support for nanoseconds,
Timestamp(9), is in progress.
pandas.read_sql()¶
With a Connection defined, you can use pandas.read_sql() to read your data
into a pandas DataFrame. This will be slower than using
Connection.select_ipc(), but works regardless of where the Python code is
running (select_ipc() must be on the same machine as the HeavyDB install,
while pandas.read_sql() works everywhere):
>>> from heavyai import connect
>>> import pandas as pd
>>> con = connect(user="admin", password="HyperInteractive", host="localhost",
... dbname="heavyai")
>>> df = pd.read_sql("SELECT depdelay, arrdelay FROM flights_2008_10k limit 100", con)
Cursors¶
After connecting to HeavyDB, a cursor can be created with Connection.cursor()
:
>>> c = con.cursor()
>>> c
<heavyai.cursor.Cursor at 0x110fe6438>
Or by using a context manager:
>>> with con.cursor() as c:
... print(c)
<heavyai.cursor.Cursor object at 0x1041f9630>
Arbitrary SQL can be executed using Cursor.execute()
.
>>> c.execute("SELECT depdelay, arrdelay FROM flights_2008_10k limit 100")
<heavyai.cursor.Cursor at 0x110fe6438>
This sets the rowcount property to the number of returned rows:
>>> c.rowcount
100
The description attribute contains a list of Description objects, a
namedtuple with the usual attributes required by the spec. There’s one entry
per returned column, and we fill the name, type_code, and null_ok
attributes.
>>> c.description
[Description(name='depdelay', type_code=0, display_size=None, internal_size=None, precision=None, scale=None, null_ok=True),
Description(name='arrdelay', type_code=0, display_size=None, internal_size=None, precision=None, scale=None, null_ok=True)]
Cursors are iterable; iterating yields one tuple of values per row:
>>> result = list(c)
>>> result[:5]
[(38, 28), (0, 8), (-4, 9), (1, -1), (1, 2)]
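Because each row is a plain tuple, the column names from Cursor.description can be zipped back in to get dict-style records. The values below are hand-written stand-ins for list(c) and the names from c.description in the session above:

```python
# Stand-in values for list(c) and the column names from c.description
rows = [(38, 28), (0, 8), (-4, 9)]
names = ["depdelay", "arrdelay"]

# One dict per row, keyed by column name
records = [dict(zip(names, row)) for row in rows]
print(records[0])  # {'depdelay': 38, 'arrdelay': 28}
```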
Loading Data¶
The fastest way to load data is Connection.load_table_arrow(). Internally,
this uses pyarrow and the Apache Arrow format to exchange data with the
HeavyDB database.
>>> import pyarrow as pa
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1, 2], "B": ['c', 'd']})
>>> table = pa.Table.from_pandas(df)
>>> con.load_table_arrow("table_name", table)
This accepts either a pyarrow.Table or a pandas.DataFrame, which will be
converted to a pyarrow.Table before loading.
You can also load a pandas.DataFrame using the Connection.load_table() or
Connection.load_table_columnar() methods.
>>> df = pd.DataFrame({"A": [1, 2], "B": ["c", "d"]})
>>> con.load_table_columnar("table_name", df, preserve_index=False)
If you aren’t using Arrow or pandas, you can pass a list of tuples to
Connection.load_table_rowwise().
>>> data = [(1, "c"), (2, "d")]
>>> con.load_table_rowwise("table_name", data)
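If your data starts out columnar (for example, a dict of lists like the DataFrame constructor above), it can be transposed into the row-wise shape with zip before calling load_table_rowwise():

```python
# Columnar data, matching the DataFrame examples above
columns = {"A": [1, 2], "B": ["c", "d"]}

# Transpose to one tuple per row -- the shape load_table_rowwise() expects.
# Dict insertion order is preserved in Python 3.7+, so column order is stable.
rows = list(zip(*columns.values()))
print(rows)  # [(1, 'c'), (2, 'd')]
```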
The high-level Connection.load_table() method will choose the fastest method
available based on the type of data:
Connection.load_table_arrow() cannot accept a DataFrame with more than 2^31 - 1 elements.
Lists of tuples are always loaded with Connection.load_table_rowwise().
A pandas.DataFrame or pyarrow.Table will be loaded using Connection.load_table_arrow().
If upload fails using the Arrow method, a pandas.DataFrame can be loaded using Connection.load_table_columnar().
Database Metadata¶
Some helpful metadata are available on the Connection
object.
Get a list of tables with
Connection.get_tables()
>>> con.get_tables()
['flights_2008_10k', 'stocks']
Get column information for a table with
Connection.get_column_details()
>>> con.get_column_details('stocks')
[ColumnDetails(name='date_', type='STR', nullable=True, precision=0, scale=0, comp_param=32),
 ColumnDetails(name='trans', type='STR', nullable=True, precision=0, scale=0, comp_param=32),
 ...
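ColumnDetails is a namedtuple, so its fields are attribute-accessible. The entries below are hand-built stand-ins mirroring the output above (with a reduced set of fields, for brevity):

```python
from collections import namedtuple

# Reduced stand-in for heavyai's ColumnDetails namedtuple, for illustration
ColumnDetails = namedtuple("ColumnDetails", ["name", "type", "nullable"])
details = [
    ColumnDetails(name="date_", type="STR", nullable=True),
    ColumnDetails(name="trans", type="STR", nullable=True),
]

# Attribute access makes it easy to pull out, e.g., all column names
names = [c.name for c in details]
print(names)  # ['date_', 'trans']
```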
Runtime User-Defined Functions¶
A Connection instance is callable; it can be used as a decorator on Python functions to register them as Runtime UDFs:
>>> @con('int32(int32, int32)')
... def totaldelay(dep, arr):
... return dep + arr
...
>>> query = ("SELECT depdelay, arrdelay, totaldelay(depdelay, arrdelay)"
... " FROM flights_2008_10k limit 100")
>>> df = con.select_ipc(query)
>>> df.head()
depdelay arrdelay EXPR$2
0 8 -14 -6
1 19 2 21
2 8 14 22
3 -4 -6 -10
4 34 34 68
Note
Runtime UDFs can be defined only if the HeavyDB server has enabled support
for them (see the --enable-runtime-udfs option of heavydb) and the rbc
package is installed. This is still experimental functionality, and it
currently does not work on the Windows operating system.