Services such as AWS s3, Azure Blob Storage, Google Cloud Storage provide highly scalable cloud based repositories of files, which can be used to store unstructured and semi-structured data.
Apache Drill is a free, open source query engine that allows one to perform ANSI SQL queries against NoSQL databases and semi-structured data.
One can connect Sisense to the Apache Drill service using the following connectors:
- RESTful API
A variety of data sources can be queried with Drill including:
- Amazon s3
- Azure Blob Storage
- Google Cloud Storage
- Network files
- Local files
Drill allows the user to load data from multiple data formats, including
- CSV (Comma-Separated-Values)
- TSV (Tab-Separated-Values)
- PSV (Pipe-Separated-Values)
- Like Sisense, Drill can leverage self-describing data formats in real time (e.g.; Parquet, JSON, AVRO, and NoSQL databases).
- Using Drill, we can connect to csv files (stored in the cloud, locally, or on a network) through an ODBC or JDBC driver. This will allow us to use Sisense’s data accumulator feature on csv files.
- While cloud storage providers may have their own querying services (such as AWS Athena, Google BigQuery), by using Apache Drill, one only needs to be familiar with one tool to query multiple services, and unlike applications such as Athena, the user is not charged by the amount of data scanned in the query.
- Drill allows the user to query across multiple data sources, in a single query such as using data from both Hadoop and s3. This might be useful if for example, one wished to filter one data set by data in another source prior to bringing it into Sisense.
A detailed description of the architecture can be found here, but at a high level, we use Drillbits to query the data. There are two implementation modes for Drill.
- Embedded/single node approach can be successfully installed on a typical Sisense server machine quickly.
- Distributed mode, used for large scale implementations.
The embedded (single node) service can be run in Windows, Linux, or Mac OS X, and can be implemented relatively quickly (in a matter of a few hours).
The single node implementation has the following prerequisites.
This is generally run on a clustered Hadoop environment. A clustered (multi-server) installation of ZooKeeper is required.
The distributed mode implementation has the following prerequisites.
- (Required) Running Oracle JDK version 7 or version 8 if running Drill 1.6 or later.
- (Required) Running a ZooKeeper quorum
- (Recommended) Running a Hadoop cluster
- (Recommended) Using DNS
Installation and Connecting Sisense to Apache Drill
You can find a walkthrough on installing and connecting Sisense to Drill here.
Case Study - Getting 37 million rows of NYC Citibike data from AWS s3
The NYC Citibike program provides bikes that can be rented by New Yorkers for 30 or 45 minute rides.
We had over 44 months of data about the individual trips, stored in a AWS s3 bucket.
- 44 csv files, one for each month of trip data.
- 14 columns of data (Start Time, Start Location, End Time, End Location etc)
- 37 million rows in total.
Apache Drill was installed in embedded mode (single node) on the Sisense server, a 16GB, 4 Core (2 Intel® Xeon), 2GHz box.
We connected to s3 using Apache Drill and used the schema functionality to create a view in Drill, based on the contents of the s3 bucket.
Installation of Apache Drill, connecting to AWS s3 and configuring the ODBC and JDBC connections for Sisense took under 2 hours in total.
- Using the ODBC connector, it took roughly 22 minutes to load the 37 million rows from s3 bucket into Sisense.
- Surprisingly, using the JDBC connector took longer, 39 minutes.
Please sign in to leave a comment.