lundi 29 juin 2015

Slave Election is welcoming GTID

Slave election is a popular HA architecture,  first MySQL MariaDB toolkit to manage switchover and failover in a correct way was introduce by Yoshinori Matsunobu into MHA.

Failover and switchover in asynchronous clusters require caution:

- The CAP theorem need to be satisfy. Getting strong consistency, require the slave election to reject transactions ending up in the old master when electing the candidate master.

- Slave election need to take care that all events on the old master are applied to the candidate master before switching roles.

- Should be instrumented to found a good candidate master and make sure it's setup to take the master role.

- Need topology detection, a master role can't be pre defined, as the role is moving around nodes .

- Need monitoring to escalate switchover to failover.

MHA as been coded at a time no unique event id was possible in a cluster,  each event was track as independent coordinate on each node, making MHA architecture to have an internal way to rematch coordinate on all nodes.

With introduction of GTID, MHA brings the heritage and looks like unnecessary complex, with an agent base solution and ssh connections requirement to all nodes .

A lighter MHA was needed for MariaDB when the replication is using GTID, and that's what my colleague Guillame Lefranc have been addressing inside a new MariaDB toolkit

In MariaDB GTID usage is as simple as:

#>stop slave;change master to master_use_gtid=current_pos;start slave; 

As a bonus, the code is in golang and do not require any external dependencies
We can enjoy a singe command line procedure in interactive mode.

mariadb-repmgr -hosts=,, -user=admin:xxxxx -rpluser=repl:xxxxxx -pre-failover-script="/root/" -post-failover-script="/root/" -verbose -maxdelay 15    
Don't be afraid default is to run in interactive mode and it does not launch anything yet.

In my post configuration script i usually update some haproxy configuration store in a NAS or a SAN and reload or shoot in the head all proxies

Note that the new elected master will be passed as second argument of the script.

I strongly advice not to try to auto failover base on some monitoring, get a good replication monitoring tool and analyze all master status alerts, checking for false positive situation before enjoying pre coded failover.

Loss less semi-synchronous replication in MDEV-162  and multiple performance improvements of semi-synchronous MDEV-7257, have made it to MariaDB 10.1, it can be use to greatly improve zero data lost in case of failure . Combine with parallel replication it's now possible to have an HA architecture that is as robust as asynchronous can be, and under replication delay control is crash safe as well.    

Galera aka MariaDB Cluster as a write speed limit bound to upper network speed, it come at the advantage to always offer crash safe consistency. Slave election HA have the master disk speed limit and do not suffer lower network speed but is losing consistency in failover when slave can't catch.

Interesting time to see how flash storage adoption flavor one or the other architecture.

vendredi 17 avril 2015

Social Networking Using OQGraph

I was given the chance to experiment typical social networking query on an existing 60 Millions edges dataset

How You're Connected

Such algorithms and others are simply hardcoded into the OQGraph. 

With the upgrade of OQGraph V3 into MariaDB 10 we can proceed directly on top of the exiting tables holding the edges kine of featured VIRTUAL VIEW. 

  `id1` int(10) unsigned NOT NULL,
  `id2` int(10) unsigned NOT NULL,
  `relation_type` tinyint(3) unsigned DEFAULT NULL,
  KEY `id1` (`id1`),
  KEY `id2` (`id2`)

oqgraph=# select count(*) from relations;

| count(*) |
| 59479722 |
1 row in set (23.05 sec)

Very nice integration of table discovery that save me referring to documentation to found out all columns definition.  

CREATE TABLE `oq_graph`
ENGINE=OQGRAPH `data_table`='relations' `origid`='id1' `destid`='id2';

oqgraph=# SELECT * FROM oq_graph WHERE latch='breadth_first' AND origid=175135 AND destid=7;
| latch         | origid | destid | weight | seq  | linkid |
| breadth_first | 175135 |      7 |   NULL |    0 | 175135 |
| breadth_first | 175135 |      7 |      1 |    1 |      7 |
2 rows in set (0.00 sec)

oqgraph=# SELECT * FROM oq_graph WHERE latch='breadth_first' AND origid=175135 AND destid=5615775;
| latch         | origid | destid  | weight | seq  | linkid   |
| breadth_first | 175135 | 5615775 |   NULL |    0 |   175135 |
| breadth_first | 175135 | 5615775 |      1 |    1 |        7 |
| breadth_first | 175135 | 5615775 |      1 |    2 | 13553091 |
| breadth_first | 175135 | 5615775 |      1 |    3 |  1440976 |
| breadth_first | 175135 | 5615775 |      1 |    4 |  5615775 |
5 rows in set (0.44 sec)

What we first highlight is that underlying table indexes KEY `id1` (`id1`), KEY `id2` (`id2`) are used by OQgrah to navigate the vertices via a number of key reads and range scans, such 5 level relation was around 2689 jump and 77526  range access to the table . 

Meaning the death of the graph was around 2500 with an average of 30 edges per vertex 


oqgraph=# SELECT * FROM oq_graph_myisam WHERE latch='breadth_first' AND origid=175135 AND destid=5615775;
| latch         | origid | destid  | weight | seq  | linkid   |
| breadth_first | 175135 | 5615775 |   NULL |    0 |   175135 |
| breadth_first | 175135 | 5615775 |      1 |    1 |        7 |
| breadth_first | 175135 | 5615775 |      1 |    2 | 13553091 |
| breadth_first | 175135 | 5615775 |      1 |    3 |  1440976 |
| breadth_first | 175135 | 5615775 |      1 |    4 |  5615775 |
5 rows in set (0.11 sec)

Need to investigate more such speed difference using MyISAM. Ideas are welcome ?

jeudi 16 avril 2015

Howto - Move a table to different schema with no outage

I remember a time when it was debate if views can be useful for a web oriented workload ?

This post is about one good use case:

The  story is that some tables have been creating into a schema and used by the application into same connection.

Later on some more schema have been added to separate data for multiple application domain but still using original table, kind of cross domain universal table.

With addition of many new domains, a new global schema was added storing freshly create universal tables.

The question was how to move back the old table in the correct new schema without stopping availability of the service ?

We decided to use a view that point to the physical table. Change the application to use the view and later atomically switch the table and the view.

Here is the test case for doing that :

-- Create schemas

-- Create table in schema 1
CREATE TABLE schema1.t1 (
  id int

-- Create views in schema 2
CREATE VIEW schema2.t1 AS SELECT * FROM schema1.t1;
-- Create dummy view on view in schema 1 
CREATE VIEW schema1.t1_new AS SELECT * FROM schema2.t1;

-- Changing the API 

-- Switch schema 1 table and schema 2 view
RENAME TABLE schema2.t1 TO schema2.t1_old,
  schema1.t1 TO schema2.t1,
  schema1.t1_new TO schema1.t1;

Is there some other path ? Surely some triggers + insert ignore like done in OAK or Pt Online Alter table but i also remember a time when it was debate if triggers can be useful for a web oriented workload :)

Thanks to Nicolas @ccmbenchmark for contributing the test case.

mardi 25 février 2014

Introduction to Job Queue daemon plugin

Dr. Adrian Partl is working in the E-Science group of the Leibniz Institute for Astrophysics Potsdam (AIP), where the key topics are cosmic magnetic fields and extragalactic, astrophysics is the branch of astronomy concerned with objects outside our own Milky Way galaxy

Why did you decided to create a Job Queue plugin, what issues does it solve?

A: Basically our MySQL databases hold astronomic simulations and observations content, the datasets are in multi Terra Bytes size and queries can take long time, astronomers can definitely wait for data acquisition, but jump on the data as soon as they are available.  Job Queue offer a protection from too many parallel query executions and prevent our servers to be spammed. Multiple queues are here to give us priority between users, today queries are executed as soon as a slot is available. Some timeouts per group can be define and queries will be killed passing that delay.

Would you like telling us more about your personal background?

A: I studied astronomy and have a PHD in astrophysics. For my PHD I focused on high performance computing by parallelizing a radiation transport simulation code to enable running it in large computational cluster. Now a day i'm more specialized in programming and managing big dataset. I stop doing scientists tasks, but i enjoy helping in making those publications happen by providing all the IT infrastructure for doing the job.

How did you came to MySQL ?

A: In the past we used SQL Server but we rapidly rich the performance limits of a single box, we found out that it can be very expensive to expend it for sharding.

We moved to MySQL and mostly MyISAM storage engine.  We are also using Spider storage engine since 3 years, for creating the shards. We needed true parallel queries, to do so we created PAQU a fork of Shard Query to better integrate with Spider, The map-reduce tasks in PaQu are all done by submitting multiple subsequent "direct background queries" to the Spider engine and we shortcut Gearman in shard-query. With this in place it is possible to manage map-reduce tasks using our Job Queue plugin.

S: Spider is now integrated in MariaDB10 and it is making fast improvements regarding map-reduce jobs, using UDF functions with multiple channels on partitions and for some simple aggregation query plans. Are you using advanced DBT3 big queries algorithms like BKA joins and MRR? Did you explore new engines like TokuDB that could bring massive compression, and disk IO saving to your dataset.

A: I will definitely have look at this. In the past we have experimented column stores, but it's not really adapted to what we do. Scientists extract all columns despite they don't use all of them. Better getting more, then to re extract :)       

When did you start working on Job Queue and how much time did it take? Did you found enough informations during the task of developing a plugin ? What was useful to you?

A: I took me one and a half year, i started by reading MySQL source code. Some books helped me, MySQL Internals from Sacha Pachev at Percona and MySQL plugins development from Sergei Golubchick at SkySQL and Andrew Hutchings at HP. Reading the source code of handler_socket plugin from Yoshinori Matsunobu definitely put me on faster track.

S: Yes we all miss Yoshinori but he is now more social than ever:), did you also search help from our public freenode IRC MariaDB channel.

A: Not at all, but i will visit knowing now about it.

How is the feedback from the community so far?

It did not yet pickup, but i ported the PgSphere API from PostgreSQL. The project is call mysql_sphere, it's still lacking indexes but it is fully functional and that project get so far very good feedback.

Any wishes to the core ?   

A: GiST index API like in PostgreSQL would be very nice to have, i have recently started a proxying storage engine to support multi dimensional R-Tree, but i would really like to add indexing on top of the existing storage engine.

S: ConnectDB made by Olivier Bertrand share the same requirements, to create  indexing proxy you still need to create a full engine for this, we support R-tree in InnoDB and MyISAM but this a valid point, we do not have functional indexes API like GiST. This has been already discuss internally but never been implemented.  

The results of the job execution are materialized in tables, can you force a storage engine for a job result ?  

A: This is not yet possible at the moment but easy to implement.

What OS and Forks are known to be working with Jog Queue?  

A: It’s not very deep tested because we mostly use it internally on linux and MySQL 5.5 and we have tested it on MariaDB recently, i don't see any reason why it would not work for other OS. Feedback are of course very welcome!

Do you plan to add features in upcoming release?

A: We don't really need additional features now a day, but we are open to any user requests.

S: Run some query on a scheduler ?

A: Can be done. I could allocate time if it make sense for users.  
Job Queue is part of a bigger project Daiquiri, using Gearmand can you elaborate?  

A: Yes Daiquiri is our PHP web framework for publication of datasets.This is manage by Dr. Jochen Klar and control dataset permissions and roles independently of the grants of MySQL. Job Queue is an optional component on top of it, for submitting jobs to multiple predefine dataset. We allow our users to enter free queries. Daiquiri is our front office for Paqu and Job Queue plugin. We are using Gearman in Daiquiri to dump user requests to CSV or into specialized data formats.

S: We have recently implemented Roles in MariaDB 10, you may enjoy this as well but for sure it may not feet all specific custom requirements.

Where can we learn more about Job Queue?  


S: Transporting MySQL and MariaDB to the space last frontier, there are few days like that one when i discovered your work making me proud to work for an Open Source company. Many thanks Adrian for your contributions!

S: If you found this plugin useful and would like to use it, tell it to our engineer team by voting to this Public Jira Task. If your share the same needs to have GiST like indexing API please vote for this Public Jira Task.  

mercredi 11 décembre 2013

MariaDB world record price per row 0.0000005$ on a single DELL R710

Don't look at an industry benchmark here, it's a real client story.

200 Billion records in a month and it should be transactional but not durable.

For regular workload we use LOAD DATA INFILE into partitioned InnoDB, but here we have estimated 15TB of RAID storage. This is a lot of disks and it can't no more stay inside a single server internal storage.

MariaDB 5.5 come with TokuDB storage engine for compression, but is it possible in the time frame impose by the workload?

We start benchmarking 380G of raw input data files,  6 Billion rows.

First let's check the compression with the dataset.

Great job my TokuDB 1/5, without tuning a single parameter other than durability! well i love you more every day my TokuDB.

My ex InnoDB, 30% compression missed in 8K, very bad compression ratio and slow insertion time. Don't worry InnoDB i still love you in memory :) 

Ok every love affair have a dark side :)

So now you can see that it works for 200 Billions rows because it give 277 hours of processing time at 200K insert/s.

In a month if we impose 12 hours, 6 days a week of processing with full capacity this is 288 hours.

That was very short, getting compression over 200 Billions records and without sharding will be hard.

Fortunately MariaDB 10 have native network partitioning using the spider contribution don't miss that.

vendredi 5 juillet 2013

MariaDB Storage Engine for CCM forum

CCM Benchmark is one of the leading forum provider on the web,  ROI is a major concern for them  and historically MyISAM was used on the forum replication cluster.  Reason is that MyISAM gave better ROI/performance on data that is hardly electable to cache mechanism.

This post is for MySQL users at scale,  if the number of servers or datacenter cost is not an issue for you, better get some more memory or flash storage and ou will found Lucifer server to demonstrate that your investment is not a lost of money or just migrate to Mongo.  

Quoting Damien Mangin, CTO at CCM "I like my data to be small, who want's to get to a post where the question is not popular and have no answer. Despite cleaning we still get more data than what commodity hardware memory can offer and storing all post in memory would be a major waste of money".

Like many other big web players at an other scale, Damien need to scale on disk not because it's good, but because you can catch more with less hardware. Doing this you need to control the cache missed at the level that you found acceptable and that give constant response time for your workload.

What data size do we get  retaining the most popular forum posts ?

TokuDB Fast
InnoDB 8K
TokuDB Small
InnoDB 4K

What hardware do we have ?

PUMA : MariaDB 5.5 InnoDB 32G RAM

|__ LUCIFER : MariaDB 5.5 InnoDB compressed 8K 64G RAM

|__ GERTRUDE : MariaDB 5.5 MyISAM 32G RAM

|__ MYSQL1 : MariaDB 5.5 MyISAM 32G RAM

|__ MYSQL3 : MariaDB 5.5 TokuDB Fast 32G RAM

What are the top 10 queries, response time on each server ?


SELECT categorie, best_answer_id FROM ccmforum_index WHERE id=169328

No surprise here  that table is small and we notice that that TokuDB and InnoDB compression does not affect the response time of the queries.


SELECT id,message FROM ccmforum WHERE id IN(?,?,?,?,?)

In range of 1 to 5000 values in the IN clause.
This table is the big baby that generate RND IOps .

Interesting you get the raison here of why MyISAM is better than InnoDB at equal hardware on disk bound workload.

3 times better is something that matter as the second most frequent query.
We get almost equal performance for MyISAM(mysql1) and TokuDB(mysql3) knowing that TokuDB get all data in RAM and MyISAM 75% ; and InnoDB (puma) uncompressed 50%.


SELECT parentx FROM uforums WHERE module="download" AND info_id=223


SELECT i.categorie,c.resume,c.title,count(i.categorie) AS nbFROM ccmforum_index i INNER JOIN ccmforum_cat c ON i.categorie=c.idWHERE i.parentx IN(32932,213290,2937,15002,13612,10016,154379,116397,79497,31886,4235,5038,5222,84819,81100,36025,8274,162824,10620,21731,12130,123360,232454) AND c.visibilite=0 AND c.acces=0
GROUP BY i.categorieORDER BY nb DESC


SELECT,s.contribs,s.contribs_technique,p.devise,UNIX_TIMESTAMP(m.ts_create) AS date,,p.photo_etag,m.nick,UNIX_TIMESTAMP(s.ts_last_post) AS ts_last_post,p.siteperso AS website,(m.rang+1) AS level,m.contributeur AS contributor,m.blockedFROM commentcamarche.ccmmembres m INNER JOIN ccmforum_stats s ON LEFT JOIN commentcamarche.ccmprofils p ON IN(1191274)


SELECT,i.titre,i.auteur,UNIX_TIMESTAMP( AS date,i.membre,UNIX_TIMESTAMP(i.datex) AS datex,i.etat,i.categorie,i.parentx,i.member_id,i.reponses,i.dernier,i.dernier_membre,i.premier,i.premier_membre,UNIX_TIMESTAMP(i.datex) AS unix_datex,UNIX_TIMESTAMP( AS unix_date,0 AS view,i.appreciation FROM ccmforum_index i   WHERE i.categorie IN (2,105,10,111,108,106,110,109,107) AND i.etat!=0


select sum(count) as cpt from ccmforum_count


SELECT id,nick FROM commentcamarche.ccmmembres WHERE nick="hyxo"


SELECT m.nick,m.mail,m.valid,s2.site_id AS id_site_create,
(m.ts_create) AS ts_create,UNIX_TIMESTAMP(s.ts_last_post) AS ts_last_post,UNIX_TIMESTAMP(p.ts_last_edit) AS ts_last_edit,m.rang+1 AS level,m.contributeur AS contributor,m.following,m.followers,
.signature,p.configuration AS config,p.domaines AS interest_areas,p.devise AS quote,,m.sexe AS gender,m.ville AS city,m.pays AS country,CONCAT(p.anniversaire_annee,'-',p.anniversaire_mois,'-',p.anniversaire_jour) AS birthdate,
.siteperso AS website,m.newsletter AS optin_ccm,m.optin_part,m.`blocked`,m.messagerie AS accept_pm,m.notifications, AS picture,p.photo_etag AS picture_etag,m.domaine AS registration_domain,
.date AS show_date,p.ville AS show_city, p.pays AS show_country, p.anniversaire AS show_birthdate, p.sexe AS show_gender,
.email AS show_mail, LENGTH(p.siteperso) AS show_website,
.job,,d.biography, AS websiteMD,d.twitter,d.facebook,d.linkedin,d.googleplus,d.firstname,d.lastnameFROM   commentcamarche.ccmmembres mLEFT JOIN commentcamarche.ccmprofils p ON = m.idLEFT JOIN commentcamarche.ccmmembres_data d ON = m.idLEFT JOIN ccmforum_stats s ON = m.idINNER JOIN globals.sites s2 ON s2.domain=m.domaineWHERE

Take away 

TokuDB proved identical MyISAM response time but being at least 2 time smaller on disk, we did not check InnoDB compression on 32G should be a more fair test but it was not the point as CCM have a server with memory to cover InnoDB fatness.

We notice that TokuDB like InnoDB does not bring count(*) query faster if data stay in the cache but TokuDB compression does not hurt the performance in all major queries.