Verdict can run on top of Apache Hive, Apache Impala, Apache Spark, and Amazon Redshift. We are adding drivers for other database systems, such as Facebook Presto, Google BigQuery, and Google Dataproc.

Using Verdict is easy. Following this guide, you can finish the setup in five minutes if you have any of the supported systems ready. Verdict takes a slightly different approach depending on the database system it works with; once connected, however, you can issue the same SQL queries.

Download and Install

See this page to download jar or zip archives relevant to your data analytics platforms.

Verdict on Apache Spark

Verdict works with Spark by creating Spark’s HiveContext internally. In this way, Verdict can load persisted tables through Hive Metastore.

We show how to use Verdict in spark-shell and pyspark. Using Verdict in a Spark application written in either Scala or Python is the same.

Due to the seamless integration of Verdict on top of Spark (and PySpark), Verdict can be used within Apache Zeppelin notebooks and Python Jupyter notebooks. Our documentation provides more information.

Spark 1.6

You can start spark-shell with Verdict as follows.

$ spark-shell --jars verdict-spark-lib-(version).jar

After spark-shell starts, import and use Verdict as follows.

import edu.umich.verdict.VerdictSparkHiveContext

scala> val vc = new VerdictSparkHiveContext(sc)   // sc: SparkContext instance

scala> vc.sql("show databases").show(false)       // Displays the databases (often called schemas)

// Creates samples for the table. This step needs to be done only once per table.
// The created samples are automatically persisted through HiveContext and can be used in other
// Spark applications.
scala> vc.sql("create sample of database_name.table_name").show(false)

// Now Verdict automatically uses available samples for speeding up this query.
scala> vc.sql("select count(*) from database_name.table_name").show(false)

The return value of VerdictSparkHiveContext#sql() is Spark's DataFrame class; thus, any methods that work on a Spark DataFrame work seamlessly on Verdict's answer.
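The speedup comes from answering aggregate queries on a precomputed sample instead of the full table. The following Python sketch illustrates the underlying idea only; it is a conceptual toy, not Verdict's actual implementation, and the function names (create_sample, approx_count) are our own.

```python
import random

def create_sample(table, ratio=0.01, seed=42):
    """Build a uniform random sample of the table (done once, offline)."""
    rng = random.Random(seed)
    return [row for row in table if rng.random() < ratio], ratio

def approx_count(sample, ratio, predicate=lambda row: True):
    """Estimate count(*) by scaling the count observed on the sample."""
    return round(sum(1 for row in sample if predicate(row)) / ratio)

# A toy "table" of a million rows.
table = range(1_000_000)
sample, ratio = create_sample(table, ratio=0.01)

# The estimate scans about 1% of the rows but approximates the full count.
estimate = approx_count(sample, ratio)
print(estimate)
```

Because the sample is built once and persisted, subsequent aggregate queries pay only the cost of scanning the (much smaller) sample, which is where the speedup in the queries above comes from.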

Spark 2.0 or later

You can start spark-shell with Verdict as follows.

$ spark-shell --jars verdict-spark-lib-(version).jar

After spark-shell starts, import and use Verdict as follows.

import edu.umich.verdict.VerdictSpark2Context

scala> val vc = new VerdictSpark2Context(sc)      // sc: SparkContext instance

scala> vc.sql("show databases").show(false)       // Displays the databases (often called schemas)

// Creates samples for the table. This step needs to be done only once per table.
// The created samples are automatically persisted and can be used in other
// Spark applications.
scala> vc.sql("create sample of database_name.table_name").show(false)

// Now Verdict automatically uses available samples for speeding up this query.
scala> vc.sql("select count(*) from database_name.table_name").show(false)

The return value of VerdictSpark2Context#sql() is Spark's Dataset class; thus, any methods that work on a Spark Dataset work seamlessly on Verdict's answer.

PySpark 1.6

You can start pyspark shell with Verdict as follows.

$ export PYTHONPATH=$(pwd)/python:$PYTHONPATH

$ pyspark --driver-class-path verdict-spark-lib-(version).jar

Limitation: Note that, for the --driver-class-path option to work, the jar file must be placed on the Spark driver node. The --jars option can be used for Spark 2.0 or later.

After pyspark shell starts, import and use Verdict as follows.

from pyverdict import VerdictHiveContext

vc = VerdictHiveContext(sc)        # sc: SparkContext instance

vc.sql("show databases").show()    # Displays the databases (often called schemas)

# Creates samples for the table. This step needs to be done only once per table.
# The created samples are automatically persisted through HiveContext and can be used in other
# pyspark applications.
vc.sql("create sample of database_name.table_name").show()

# Now Verdict automatically uses available samples for speeding up this query.
vc.sql("select count(*) from database_name.table_name").show()

The return value of VerdictHiveContext#sql() is pyspark's DataFrame class; thus, any methods that work on a pyspark DataFrame work seamlessly on Verdict's answer.

PySpark 2.0 or later

This will be added shortly.

Impala, Hive, Redshift

We will use our command-line interface (called verdict-shell) to connect to these databases. You can also connect to Verdict programmatically through its JDBC driver (see this page).

Notes

verdict-shell relies on the JDBC drivers provided by individual database vendors; thus, if your database is not compatible with the packaged drivers, verdict-shell will not be able to connect to it. In this case, please contact our team for support. We will add the right set of JDBC drivers promptly.

Apache Impala

Type the following command in a terminal to launch verdict-shell and connect to Impala.

$ bin/verdict-shell -h "impala://hostname:port/schema;key1=value1;key2=value2;..." -u username -p password

Note that parameters are delimited with semicolons (;). The connection string is quoted because semicolons have a special meaning in bash. The username and password can also be passed as parameters in the connection string.
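Programmatically, such a connection string is just a dialect://host:port/schema prefix followed by semicolon-delimited key=value pairs. The following small Python helper (build_conn_str is our own illustration, not part of Verdict, and the host and port values are made up) shows how one could be assembled:

```python
def build_conn_str(dialect, host, port, schema, **params):
    """Assemble a verdict-shell style connection string:
    dialect://host:port/schema;key1=value1;key2=value2;..."""
    base = f"{dialect}://{host}:{port}/{schema}"
    extras = ";".join(f"{k}={v}" for k, v in sorted(params.items()))
    return f"{base};{extras}" if extras else base

# Username and password can be passed as parameters in the string itself.
conn = build_conn_str("impala", "host1", 21050, "default",
                      user="alice", password="secret")
print(conn)  # impala://host1:21050/default;password=secret;user=alice
```

When passing such a string on the command line, remember to quote it, since unquoted semicolons terminate the command in bash.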

Verdict also supports Kerberos connections. For this, add principal=user/host@domain as one of the key-value pairs.

After verdict-shell launches, you can issue regular SQL queries as follows.

verdict:impala> show databases;

// Creates samples for the table. This step needs to be done only once for the table.
verdict:impala> create sample of database_name.table_name;

verdict:impala> select count(*) from database_name.table_name;

verdict:impala> !quit

Apache Hive

Type the following command in a terminal to launch verdict-shell and connect to Hive.

$ bin/verdict-shell -h "hive2://hostname:port/schema;key1=value1;key2=value2;..." -u username -p password

Note that parameters are delimited with semicolons (;). The connection string is quoted because semicolons have a special meaning in bash. The username and password can also be passed as parameters in the connection string.

Verdict supports Kerberos connections. For this, add principal=user/host@domain as one of the key-value pairs.

After verdict-shell launches, you can issue regular SQL queries as follows.

verdict:Apache Hive> show databases;

// Creates samples for the table. This step needs to be done only once for the table.
verdict:Apache Hive> create sample of database_name.table_name;

verdict:Apache Hive> select count(*) from database_name.table_name;

verdict:Apache Hive> !quit

Amazon Redshift

Type the following command in a terminal to launch verdict-shell and connect to Amazon Redshift.

$ bin/verdict-shell -h "redshift://endpoint:port/database;key1=value1;key2=value2;..." -u username -p password

Note that parameters are delimited with semicolons (;). The connection string is quoted because semicolons have a special meaning in bash. The username and password can also be passed as parameters in the connection string.

After verdict-shell launches, you can issue regular SQL queries as follows.

// Displays the schemas in the database to which you are connected
verdict:PostgreSQL> show schemas;

// Creates samples for the table. This step needs to be done only once for the table.
verdict:PostgreSQL> create sample of schema_name.table_name;

verdict:PostgreSQL> select count(*) from schema_name.table_name;

verdict:PostgreSQL> !quit

The search path can be set with the use schema_name; statement. Currently, only a single schema name can be set as the search path using the use statement.
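With a single-schema search path, an unqualified table name resolves against exactly one schema. The following toy Python sketch illustrates that resolution rule only; it is our own conceptual illustration (resolve, and the sales catalog, are made up), not Verdict's code:

```python
def resolve(table_name, search_path, catalog):
    """Resolve a possibly-unqualified table name.
    search_path holds at most one schema name (as set by `use schema_name;`)."""
    if "." in table_name:                      # already schema-qualified
        schema, table = table_name.split(".", 1)
    elif search_path:                          # fall back to the search path
        schema, table = search_path, table_name
    else:
        raise ValueError(f"no schema for unqualified name: {table_name}")
    if table not in catalog.get(schema, ()):
        raise LookupError(f"table not found: {schema}.{table}")
    return schema, table

catalog = {"sales": {"orders", "customers"}}
print(resolve("orders", "sales", catalog))          # ('sales', 'orders')
print(resolve("sales.customers", None, catalog))    # ('sales', 'customers')
```

Schema-qualified names always work regardless of the search path; the use statement only changes how unqualified names are resolved.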

What’s Next

See what types of queries are supported by Verdict on this page, and enjoy the speedup provided by Verdict for those queries.

Learn on this page how to quickly visualize your query answers using Verdict in Apache Zeppelin or Jupyter.

If you have use cases that are not supported by Verdict, please contact us at verdict-user@umich.edu, or create an issue in our GitHub repository. We will answer your questions or requests shortly (within a few days at most).