Configurations
Changing Verdict configuration
- Creating a configuration file (i.e.,
verdict.properties
) in theconfig
directory. In order for the options in this file to be effective, Verdict should be build again using amvn package
command. - Setting environment variables: When Verdict’s Java instance is created, it scans all environment variables and stores them in Verdict’s configuration if they start with
verdict
. For setting environment variables, periods (.
) in option keys must be replaced with underscores (_
). For example,verdict.loglevel
option can be set byverdict_loglevel=debug
. - Passing parameters when making a connection
- Spark: You can pass parameters using the
--conf
argument. For example$ spark-shell --jars target/ --conf verdict.loglevel=warn
will set the Verdict’s log level to
warn
. - Hive, Impala: You can pass the parameters in the JDBC connection string. For example
$ veeline/bin/veeline -h "impala://host:port/default;verdict.loglevel=warn"
will set the Verdict’s log level to
warn
.
- Spark: You can pass parameters using the
- Using a Verdict query: You can issue a
set key=value
query after a Verdict instance is created.
The higher-numbered items take higher precedences.
Verdict configuration options
Basic options
Key | Default values | Description |
---|---|---|
verdict.bypass |
false | If set true, Verdict simply passes all the query to the underlying database and returns the result to the user without any modifications. The only exception is set bypass=false . |
verdict.loglevel |
info | Determines the log level; should be one of “debug”, “info”, “warn”, “error”. loglevel is a synonym for verdict.loglevel . |
verdict.relative_target_cost |
10% | Keeps the cost of Verdict’s query processing under a certain level. Verdict chooses a set of the largest samples under this limit. |
JDBC options
Key | Default values | Description |
---|---|---|
verdict.jdbc.user |
If set, pass as a “user” value when making a JDBC connection to an underlying database. | |
verdict.jdbc.password |
If set, pass as a “password” value when making a JDBC connection to an underlying database. | |
verdict.jdbc.kerberos_principal |
n/a | If set, makes an appropriate Kerberos authentication when making a JDBC connection to an underlying database. principal is a synonym for verdict.jdbc.kerberos_principal . |
verdict.jdbc.ignore_user_credentials |
false | If set to true, “user” and “password” values are not passed even if they are set in verdict.jdbc.user and verdict.jdbc.password . |
verdict.jdbc.dbname |
One of “impala”, “hive2”. The database name specified in the JDBC connection string is set. | |
verdict.jdbc.host |
127.0.0.1 | The host that will be used for a JDBC connection. |
verdict.jdbc.port |
The port number that will be used for a JDBC connection. | |
verdict.jdbc.schema |
default | The default schema (or database) that a JDBC connection will connect to. |
verdict.jdbc.hive2.default_port |
10000 | The default port number that will be used for a hive2 connection if the port number is not explicitly set in the passed JDBC connection string. |
verdict.jdbc.hive2.class_name |
com.cloudera.hive.jdbc4.HS2Driver | The class name of the JDBC driver that will be used for a hive2 connection. |
verdict.jdbc.impala.default_port |
21050 | The default port number that will be used for an impala connection if the port number is not explicitly set in the passed JDBC connection string. |
verdict.jdbc.impala.class_name |
com.cloudera.impala.jdbc4.Driver | The class name of the JDBC driver that will be used for an impala connection. |
Spark options
Key | Default values | Description |
---|---|---|
verdict.spark.cache_samples |
true | (Lazily) cache the sample tables created by Verdict using Spark’s DataFrame#cache() function. The sample tables are typically significantly smaller than the original tables. Caching sample tables will lower query response times. |
Metadata (and Verdict samples) options
Key | Default values | Description |
---|---|---|
verdict.meta_data.meta_database_suffix |
_verdict | When set to a non-empty string, Verdict creates a separate database (or called a schema or catalog in some systems) for storing its sample tables and related metadata. If the original database’s name is “default”, all the sample tables for that database are stored in the database named “default_verdict”. The user must have the privilege to create a new database unless the database already exists. Also, the user who wants to create samples must have the privilege to create new tables in the database. |
verdict.meta_data.meta_name_table |
verdict_meta_name | A table for storing the names of the sample tables. |
verdict.meta_data.meta_size_table |
verdict_meta_size | A table for storing the sizes of the sample tables. |
verdict.meta_data.refresh_policy |
per_session | How often to refresh metadata. One of “per_session”, “per_query”, “manual”. If set to “manual”, “refresh” query must be used for updating metadata. |
Error bound options
Key | Default values | Description |
---|---|---|
verdict.error_bound.confidence_internal_probability |
95% | The probability with which Verdict’s error bounds hold. Should be one of 80%, 85%, 90%, 95%, 99%, 99.5%, 99.9%. |
verdict.error_bound.error_column_pattern |
err bound of %s | Column names for error bounds. Must include a single “%s”. |
verdict.error_bound.method |
subsampling | A method for error bound computation. One of “subsampling”, “bootstrapping” (only for research), “analytic”, “no_error_bound”. |
verdict.error_bound.subsampling.partition_column |
__vpart | A column name internally used by subsampling for partitioning tuples. |
verdict.error_bound.subsampling.probability_column |
__vprob | A column name internally used by subsampling for sampling probabilities. |
verdict.error_bound.subsampling.partition_count |
100 | The number of partitions used by subsampling for computing error bounds. |
verdict.error_bound.bootstrapping.random_value_column_name |
verdict_rand | only for research |
verdict.error_bound.bootstrapping.bootstrap_multiplicity_colname |
__mul | only for research |
verdict.error_bound.bootstrapping.num_of_trials |
100 | only for research |
Note that changing subsampling-related parameters affect sample creation process.
Passing database-specific options through Verdict
Verdict-on-Spark
Simply launch Spark (or PySpark) as usual. Verdict runs as an imported module on Spark; thus, it does not affect Spark’s regular interface.
Verdict-on-{Hive, Impala}
Specify the parameters in the JDBC connection string. Or, if you are programmatically connecting to Verdict using the JDBC interface, you can pass java.util.Properties
as an optional argument. For example, the following command will set your Hive applications to use a specific job queue.
$ veeline/bin/veeline -h "hive2://host:port/default;mapred.job.queue.name=root.example_queue"
Verdict passes all key-value parameters (except for the ones that start with verdict
) when it makes an internal JDBC connection to the database it works with.