Dr. Adrian Partl works in the E-Science group of the Leibniz Institute for Astrophysics Potsdam (AIP), where the key topics are cosmic magnetic fields and extragalactic astrophysics, the branch of astronomy concerned with objects outside our own Milky Way galaxy.
Why did you decide to create a Job Queue plugin, and what issues does it solve?
A: Basically, our MySQL databases hold the content of astronomical simulations and observations. The datasets are multiple terabytes in size and queries can take a long time. Astronomers can definitely wait for the data, but they jump on the results as soon as they are available. Job Queue protects us from too many parallel query executions and prevents our servers from being spammed. Multiple queues give us priorities between users; today, queries are executed as soon as a slot is available. Timeouts can be defined per group, and queries are killed once they exceed that delay.
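As a rough illustration of the mechanism described in this answer (a limited number of execution slots, several queues with different priorities, and a timeout per queue), here is a minimal simulation in Python. It is not the plugin's code: the class names `Job`, `JobQueue`, and `Scheduler`, the one-second clock tick, and the rule "higher priority value is served first" are all assumptions made for the sketch.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    runtime: int  # simulated seconds the query would need to finish


@dataclass
class JobQueue:
    name: str
    priority: int  # assumption: a higher value is served first
    timeout: int   # jobs still running after this many seconds are killed
    pending: deque = field(default_factory=deque)


class Scheduler:
    """Toy scheduler: fixed slots, prioritized queues, per-queue timeouts."""

    def __init__(self, queues, slots):
        self.queues = sorted(queues, key=lambda q: -q.priority)
        self.slots = slots
        self.running = []  # entries are (job, queue, started_at)
        self.clock = 0
        self.log = []      # entries are (job name, "done" or "killed")

    def submit(self, queue_name, job):
        queue = next(q for q in self.queues if q.name == queue_name)
        queue.pending.append(job)

    def _fill_slots(self):
        # Start waiting jobs while slots are free, highest priority first.
        for queue in self.queues:
            while queue.pending and len(self.running) < self.slots:
                job = queue.pending.popleft()
                self.running.append((job, queue, self.clock))

    def tick(self):
        # Advance simulated time by one second, retire jobs, refill slots.
        self.clock += 1
        still_running = []
        for job, queue, started in self.running:
            elapsed = self.clock - started
            if elapsed >= job.runtime:
                self.log.append((job.name, "done"))
            elif elapsed >= queue.timeout:
                self.log.append((job.name, "killed"))
            else:
                still_running.append((job, queue, started))
        self.running = still_running
        self._fill_slots()

    def run_until_idle(self):
        self._fill_slots()
        while self.running:
            self.tick()
        return self.log
```

With two slots, a high-priority "short" queue (timeout 2) and a low-priority "long" queue (timeout 5), a long-running job submitted to the short queue is killed at its timeout, and a job in the long queue starts as soon as a slot is freed.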
Would you like to tell us more about your personal background?
A: I studied astronomy and have a PhD in astrophysics. For my PhD I focused on high-performance computing, parallelizing a radiation transport simulation code so it could run on large computational clusters. Nowadays I am more specialized in programming and managing big datasets. I have stopped doing scientific work myself, but I enjoy helping make those publications happen by providing all the IT infrastructure for the job.
How did you come to MySQL?
A: In the past we used SQL Server, but we rapidly reached the performance limits of a single box, and we found that expanding it for sharding can be very expensive.
We moved to MySQL, mostly with the MyISAM storage engine. We have also been using the Spider storage engine for 3 years to create the shards. We needed true parallel queries, so we created PaQu, a fork of Shard-Query that integrates better with Spider. The map-reduce tasks in PaQu are all done by submitting multiple subsequent "direct background queries" to the Spider engine, and we bypass Gearman in Shard-Query. With this in place, it is possible to manage map-reduce tasks using our Job Queue plugin.
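To make the map-reduce idea concrete, here is a small Python sketch of how an aggregate such as an average can be split into per-shard partial queries that run in parallel and are then combined. It does not reproduce PaQu or Spider: in-memory SQLite databases stand in for the MySQL shards, a thread pool stands in for the background queries, and the table `halos`, its column `mass`, and the function names are hypothetical.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def make_shard(masses):
    # Stand-in for one shard: an in-memory SQLite database.
    # check_same_thread=False lets a worker thread use the connection.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE halos (mass REAL)")
    conn.executemany("INSERT INTO halos VALUES (?)", [(m,) for m in masses])
    return conn


def map_shard(conn):
    # Map step: each shard returns a partial aggregate (row count, sum).
    count, total = conn.execute(
        "SELECT COUNT(*), SUM(mass) FROM halos"
    ).fetchone()
    return count, total or 0.0  # SUM is NULL on an empty shard


def parallel_avg(shards):
    # Run the map step on all shards concurrently.
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(map_shard, shards))
    # Reduce step: combine the partial counts and sums into one average.
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return total / count if count else None
```

The average is rewritten as COUNT and SUM because partial averages cannot be combined correctly when shards hold different numbers of rows.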
S: Spider is now integrated in MariaDB 10 and is improving quickly for map-reduce jobs, using UDFs with multiple channels on partitions, and for some simple aggregation query plans. Are you using the advanced algorithms for big DBT3 queries, such as BKA joins and MRR? Did you explore new engines like TokuDB, which could bring massive compression and disk I/O savings to your dataset?
A: I will definitely have a look at this. In the past we experimented with column stores, but they are not really suited to what we do. Scientists extract all columns even though they don't use all of them. Better to get more than to re-extract :)
When did you start working on Job Queue, and how much time did it take? Did you find enough information while developing the plugin? What was useful to you?
A: It took me one and a half years. I started by reading the MySQL source code. Some books helped me: MySQL Internals by Sasha Pachev at Percona, and MySQL Plugin Development by Sergei Golubchik at SkySQL and Andrew Hutchings at HP. Reading the source code of the handler_socket plugin from Yoshinori Matsunobu definitely put me on a faster track.
S: Yes, we all miss Yoshinori, but he is now more social than ever :). Did you also seek help on our public MariaDB IRC channel on freenode?
A: Not at all, but I will visit now that I know about it.
How is the feedback from the community so far?
A: It has not picked up yet, but I also ported the PgSphere API from PostgreSQL. The project is called mysql_sphere; it still lacks indexes, but it is fully functional, and that project has received very good feedback so far.
Any wishes for the core?
A: A GiST index API like the one in PostgreSQL would be very nice to have. I recently started a proxying storage engine to support multi-dimensional R-trees, but I would really like to add indexing on top of an existing storage engine.
S: ConnectDB, made by Olivier Bertrand, shares the same requirement: to create an indexing proxy you still need to write a full engine. We support R-trees in InnoDB and MyISAM, but this is a valid point: we do not have a functional index API like GiST. This has already been discussed internally but has never been implemented.
The results of a job execution are materialized in tables. Can you force a storage engine for a job result?
A: This is not possible at the moment, but it would be easy to implement.
Which operating systems and forks are known to work with Job Queue?
A: It has not been tested very thoroughly, because we mostly use it internally on Linux with MySQL 5.5, and we recently tested it on MariaDB. I don't see any reason why it would not work on other operating systems. Feedback is of course very welcome!
Do you plan to add features in upcoming releases?
A: We don't really need additional features nowadays, but we are open to any user requests.
S: Running some queries on a scheduler?
A: That can be done. I could allocate time if it makes sense for users.
Job Queue is part of a bigger project, Daiquiri, which uses Gearmand. Can you elaborate?
A: Yes, Daiquiri is our PHP web framework for publishing datasets. It is managed by Dr. Jochen Klar and controls dataset permissions and roles independently of MySQL grants. Job Queue is an optional component on top of it, for submitting jobs to multiple predefined datasets. We allow our users to enter free-form queries. Daiquiri is our front office for PaQu and the Job Queue plugin. We use Gearman in Daiquiri to dump user requests to CSV or into specialized data formats.
S: We recently implemented roles in MariaDB 10. You may enjoy this as well, although it may not fit all your specific custom requirements.
Where can we learn more about Job Queue?
S: Taking MySQL and MariaDB to space, the final frontier: there are few days like the one when I discovered your work, which made me proud to work for an open source company. Many thanks, Adrian, for your contributions!