At Visual Meta we build and maintain a platform that must be highly available. There should be no situation in which our webpage is inaccessible to a user, which means no single component may fail and bring the whole system down. At the same time, the platform should be responsive and easy to use.
We rely on a significant set of applications and frameworks: web servers, caches, databases, resource managers, distributed processing systems, and full-text search indexes. All of them have to be redundant to keep the system highly available. The failure of one or even several servers, a network outage, the death of a process; none of these should cause overall downtime.
MySQL, the database system we use as our main data store, has long been a single point of failure. If the database process goes down or the server hardware fails, our platform is seriously harmed: it takes time to react and handle the problem, and worst of all, while the frontend is affected we risk losing money. We wanted to make our data store redundant and highly available, which is why we tried Apache Phoenix.
“Apache Phoenix is a relational database layer over HBase”.
If you already use Hadoop and HBase, Phoenix is a good extension. It provides the well-known SQL query language, which is much more convenient than the HBase shell language or the HBase Java API. A primary benefit is that it runs queries in parallel across all region servers, so HBase works faster and more efficiently. It is easy to install, because it is just a library and doesn't require a separate deployment. It ships with a JDBC driver, so migrating from a regular relational database system doesn't require major code changes.
Among the features that Phoenix provides are secondary indexes (global and local), sequences, joins, and standard SQL access over JDBC. The official Phoenix web page contains a complete list of the supported functionality.
We ran our experiments using the following setup:
- 7 HBase region servers
- 23 GB of data split across different tables and schemas
Our first step after installing Phoenix in our existing Hadoop + HBase setup was to try some of the queries that we run most often and that have the biggest performance impact on our system.
Then we found out that Phoenix requires some syntax changes, which are documented in the table below.

Syntax differences between Phoenix and MySQL:

| Phoenix | MySQL |
|---|---|
| `UPSERT INTO t (col) VALUES (val);` | `INSERT INTO t (col) VALUES (val);` |
| `UPSERT INTO t (PK, col1) SELECT PK, 'val1' FROM t WHERE col2 = 'val2';` (*) | `UPDATE t SET col1 = 'val1' WHERE col2 = 'val2';` |
| `CREATE TABLE t (id BIGINT(20) UNSIGNED NOT NULL);`<br>`CREATE SEQUENCE t_seq;`<br>`UPSERT INTO t (col) VALUES (NEXT VALUE FROM t_seq);` | `CREATE TABLE t (id BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT);`<br>`INSERT INTO t (col) VALUES (NULL);` |
| `UPSERT INTO t (date_col) SELECT TO_DATE('some_date') FROM t;` (**) | `UPDATE t SET date_col = 'some_date';` |
| `SELECT DATE(NOW()) - 7 FROM t;` | `SELECT TIMESTAMPADD(DAY, -7, DATE(NOW())) FROM t;` |
| `SELECT NOW() - 4.0 / 24.0 FROM t;` (***) | `SELECT DATE_SUB(NOW(), INTERVAL 4 HOUR);` |
| `SELECT (NOW() - date_col) * 24.0 FROM t;` | `SELECT HOUR(TIMEDIFF(NOW(), date_col)) FROM t;` |
| `SELECT * FROM t WHERE LENGTH(REGEXP_SUBSTR(col, 'regex')) > 0;` | `SELECT * FROM t WHERE col RLIKE 'regex';` |
| `SELECT * FROM t WHERE col IS NULL;` | `SELECT * FROM t WHERE ISNULL(col);`<br>`SELECT * FROM t WHERE col IS NULL;` |
(*) This kind of upsert statement (i.e. upsert select) is going to be simplified in the future once this ticket is resolved.
(**) Assuming, for simplicity, that date_col is the PK. The TO_DATE function would also be required when date values appear in a WHERE condition, not only in projections.
(***) Phoenix date arithmetic works in days, so dividing by 24.0 expresses the subtracted value in hours (4.0 / 24.0 days = 4 hours).
After adapting our subset of queries to be Phoenix-compatible, we ran some performance tests with a simple Java program that connects to Phoenix and reports the response time after running a query. This is also how we confirmed that connecting through the JDBC driver was straightforward.
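A minimal sketch of such a timing harness, assuming a standard Phoenix JDBC setup (the ZooKeeper quorum address `zk-host:2181` and the query are illustrative, not our actual configuration):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixTiming {

    // Hypothetical ZooKeeper quorum address; replace with your own cluster.
    static final String PHOENIX_URL = "jdbc:phoenix:zk-host:2181";

    // Runs a query and returns the elapsed wall-clock time in milliseconds.
    static long timeQuery(Connection conn, String sql) throws Exception {
        long start = System.nanoTime();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                // Drain the result set so the full query cost is measured.
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(PHOENIX_URL)) {
            long ms = timeQuery(conn, "SELECT col1 FROM t WHERE col2 = 123");
            System.out.println("Query took " + ms + " ms");
        }
    }
}
```

Because Phoenix speaks JDBC, the same harness runs against MySQL by swapping the URL for a `jdbc:mysql:...` one, which made the side-by-side comparison straightforward.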
We also had a very positive experience when we tried some of our queries. Two of them were:
QUERY 1 → `SELECT col1 FROM t WHERE col2 = 123`
QUERY 2 → `SELECT id FROM t WHERE LENGTH(REGEXP_SUBSTR(col, '(350 char regex)')) > 0`
You can see the performance comparison on the following chart:
The first query was executed on a table containing 20 million rows. As the chart shows, the performance of Phoenix when using indexes is much better than the results obtained with MySQL. Note that we tried both index types (global vs. local) to achieve better performance; however, the strategy for choosing between the two was not so clear.
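For reference, the two index types are created like this in Phoenix (the table and column names here follow QUERY 1 and are illustrative):

```sql
-- Global index: kept in a separate HBase table; reads are cheap because the
-- query can be served entirely from the index, but every write to t also
-- updates the index table.
CREATE INDEX iglobal_t_col2 ON t (col2) INCLUDE (col1);

-- Local index: stored alongside the data regions; cheaper to maintain on
-- writes, at the cost of extra work at read time to locate the index data.
CREATE LOCAL INDEX ilocal_t_col2 ON t (col2);
```

As a rule of thumb from the Phoenix documentation, global indexes suit read-heavy workloads and local indexes suit write-heavy ones.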
For the second query we can see again that Phoenix is much faster than MySQL, even without indexes.
However, we started to see some problems when we tried running join queries. First of all, some joins would not work with the default settings. The problem is that Phoenix by default uses the hash join algorithm, which is faster than the alternative sort merge join algorithm but requires enough memory for the hash cache it builds. To fix this situation we tried two options:
1. Switching to the sort merge join algorithm via the `USE_SORT_MERGE_JOIN` query hint.
2. Increasing the memory available for the hash cache on the region servers.
Option 2 was good in the sense that it allowed us to run join queries with the hash join, but we still saw timeouts on some join queries even then. Example:
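Assuming the standard Phoenix memory settings are what is being tuned here, the relevant knobs live in `hbase-site.xml`; a sketch with illustrative values (consult the Phoenix tuning guide before changing them):

```xml
<!-- Upper bound on the size of a single hash cache shipped to the
     region servers (default is roughly 100 MB). -->
<property>
  <name>phoenix.query.maxServerCacheBytes</name>
  <value>209715200</value>
</property>
<!-- Percentage of the heap that all Phoenix memory-intensive operations,
     including hash caches, may use on the region server. -->
<property>
  <name>phoenix.query.maxGlobalMemoryPercentage</name>
  <value>25</value>
</property>
```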
```sql
SELECT COUNT(t1.id) FROM t1 LEFT JOIN t2 ON t2.id = t1.col1
WHERE t1.date_col < (NOW() - 7) AND t2.col2 = 123;
```

```
| COUNT("T1.:ID") |
java.lang.RuntimeException: java.sql.SQLTimeoutException: Operation timed out
```
One more issue we observed with joins was that a query stopped working once we added an index intended to make it faster:

```sql
SELECT /*+ USE_SORT_MERGE_JOIN */ COUNT(DISTINCT t1.id) FROM t1, t2
WHERE t2.id = t1.id AND t1.col1 IN (1,2,3,4,5,6,7) AND t2.col2 = 123;

CREATE LOCAL INDEX ilocal_t1_col1 ON t1 (col1);
-- 20,739,473 rows affected (933.623 seconds)

SELECT COUNT(DISTINCT t1.id) FROM t1, t2
WHERE t2.id = t1.id AND t1.col1 IN (1,2,3,4,5,6,7) AND t2.col2 = 123;
```

So the query would fail with both the hash join and the sort merge join algorithm.
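In cases like this, Phoenix's `EXPLAIN` statement at least shows which join strategy the planner picked for a given query; a sketch using the query above:

```sql
-- EXPLAIN prints the physical plan; hash joins show up as
-- "PARALLEL INNER-JOIN TABLE ..." steps, sort merge joins as "SORT-MERGE-JOIN".
EXPLAIN SELECT COUNT(DISTINCT t1.id) FROM t1, t2
WHERE t2.id = t1.id AND t1.col1 IN (1,2,3,4,5,6,7) AND t2.col2 = 123;
```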
We decided not to proceed with Phoenix as a replacement for MySQL, because it does not fulfill our requirements concerning join queries and sub-selects: many of them were either slow or did not work at all. In general, we found Phoenix not yet mature enough to serve as our main data store. However, we may reconsider it in the future, because the project still looks promising.
Even though we are not going to use Phoenix now, we are considering other possible applications. As we already use HBase, we want to try Phoenix as an extension of it. The good features we described in the beginning are definitely worth further exploration. One example is secondary indexes: when used for filtered search, they reduce full-scan queries to an instant lookup or a range scan, which is a frequent use case in our system. Counting all records in a table, or a subset of them, is another important use case that Phoenix handles really well. Finally, the SQL language is much easier to use than the HBase shell language, which is useful when we need to get insight into the data during testing or while investigating problems.
*This article was created by Lucia Pasarin Perea & Viacheslav Inozemtsev.