Thomas Pellissier Tanon

Is there something better than Blazegraph for Wikidata?

The following are some quickly written ideas about the current state of SPARQL systems that can be used to query Wikidata. It is a bit of a follow-up to James' blog post on the subject and the great Wikimedia work on a possible alternative to Blazegraph for the Wikidata Query Service. These are just my own thoughts to keep the conversation going. They are very likely to contain incorrect claims; many are just gut feelings not backed by benchmarks. Feedback and benchmarks are more than welcome.

The challenge

The current Wikidata Query Service is powered by Blazegraph, an RDF database implementing SPARQL and aimed at fast analytical queries. It saw a lot of progress before the company developing it was acquired by Amazon. It was chosen as the backend software for the new Wikidata Query Service in 2015 after some great comparison work.

However, since its adoption it has been left unmaintained and struggles with Wikidata's very high update rate. Indeed, it seems to me that the Blazegraph developers focused on complex analytical query performance at the cost of update speed, maybe because most RDF datasets are mostly static and do not change much. For example, it led them to allow only sequential, not concurrent, updates.

To formalize the problem, the Wikidata Query Service presents a very atypical workload:

  • A very high rate of small updates similar to OLTP-style workloads to be able to keep up with Wikidata changes.
  • Complex queries that sometimes span a large part of the database, similar to OLAP-style workloads.
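To make the two workloads concrete, here is a hypothetical pair of queries in the style WDQS receives (a sketch using WDQS's built-in prefixes; the items and properties are only illustrative):

```sparql
# OLTP-style: a tiny update propagating a single edit
# (hypothetical example: updating Berlin's population)
DELETE { wd:Q64 wdt:P1082 ?old }
INSERT { wd:Q64 wdt:P1082 3700000 }
WHERE  { wd:Q64 wdt:P1082 ?old }

# OLAP-style: an analytical query scanning a large slice of the graph
SELECT ?country (COUNT(?city) AS ?cities) WHERE {
  ?city wdt:P31/wdt:P279* wd:Q515 ;  # instance of (a subclass of) city
        wdt:P17 ?country .           # country
} GROUP BY ?country ORDER BY DESC(?cities)
```

The first query touches a handful of triples; the second one may need to visit millions, which is why a single engine tuned for one pattern tends to suffer on the other.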

Supporting this kind of hybrid workload while providing full SPARQL support seems to be an important requirement for any system that might be considered to replace Blazegraph. However, targeting only one of the two kinds of workload (OLAP vs. OLTP) is not specific to Blazegraph but very common among databases: most systems focus on either OLAP or OLTP, with OLAP being the main focus of most SPARQL systems.

The other systems

Note: I restricted myself here to open-source systems implementing SPARQL. The list is fairly incomplete; please reach out if you know of another interesting system to add. For people wanting to learn more about how to implement SPARQL and about existing implementations in general, there is this great survey.

The main contender: Virtuoso

Virtuoso is a SPARQL implementation developed for more than a decade by a small company, OpenLink Software. At its core it is a SQL database targeting OLAP workloads, with a layer on top converting SPARQL to SQL. It seems to provide great performance, powering very large endpoints like UniProt. However, according to the WDQS Backend Alternative work, Virtuoso is also tuned for bulk loading with high-frequency reads, not for mixed read/write. But this is also the case for Blazegraph, so it would be interesting to run a good benchmark to see whether it actually outperforms Blazegraph.
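To illustrate the "SPARQL to SQL" layer, here is a minimal sketch of the classic translation of a basic graph pattern into self-joins over a single triples(s, p, o) table. This is not Virtuoso's actual translation (it uses its own table layout, dictionary-encoded IDs, and a cost-based planner); it only shows the general idea of the technique:

```python
# Sketch: translate a SPARQL basic graph pattern into SQL over one
# triples(s, p, o) table, one table alias per triple pattern.
# Variables are strings starting with '?'; everything else is a constant.

def bgp_to_sql(patterns):
    select, where, seen = [], [], {}
    for i, triple in enumerate(patterns):
        for col, term in zip("spo", triple):
            ref = f"t{i}.{col}"
            if term.startswith("?"):
                if term in seen:
                    # Variable reused: emit a join condition.
                    where.append(f"{ref} = {seen[term]}")
                else:
                    seen[term] = ref
                    select.append(f"{ref} AS {term[1:]}")
            else:
                # Constant term: emit a filter condition.
                where.append(f"{ref} = '{term}'")
    tables = ", ".join(f"triples t{i}" for i in range(len(patterns)))
    sql = f"SELECT {', '.join(select)} FROM {tables}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql

# "?city is an instance of city and is located in ?country"
print(bgp_to_sql([("?city", "wdt:P31", "wd:Q515"),
                  ("?city", "wdt:P17", "?country")]))
```

Each triple pattern becomes one scan of the triples table, so an n-pattern query becomes an n-way join; picking a good join order is exactly where the SQL engine's optimizer earns its keep.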

The read-only systems: QLever, HDT, QEndpoint...

This section has been rewritten after a great feedback from Pavel Klinov. Thank you so much!

To my knowledge, all of these systems rely on techniques derived from those introduced by RDF-3X: they encode the RDF terms as consecutive integers and use this property to build fast and compact indexes. It seems possible to make these systems read-write by implementing techniques inspired by LSM trees. The basic idea of LSM trees is to build immutable partial indexes: updates are batched in memory and, after a given threshold is reached, written out as a new partial index. To keep reads fast, implementations often run background tasks that merge the smaller indexes into bigger ones. But this would be significant implementation work, so these systems are unlikely to be suitable for Wikidata in the short term.
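The LSM idea above can be sketched in a few lines. This is a toy model over dictionary-encoded triples (tuples of integers), not the design of QLever, HDT, or QEndpoint: writes go to an in-memory batch, a full batch is frozen into an immutable sorted run, reads consult the batch plus every run, and a compaction step merges the runs:

```python
import bisect

class LsmTripleStore:
    """Toy LSM-style store for dictionary-encoded triples (int tuples)."""

    def __init__(self, memtable_limit=4):
        self.memtable = []   # mutable in-memory write batch
        self.runs = []       # immutable sorted partial indexes
        self.limit = memtable_limit

    def insert(self, triple):
        self.memtable.append(triple)
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        # Freeze the batch into a new immutable sorted run.
        self.runs.append(sorted(self.memtable))
        self.memtable = []

    def compact(self):
        # Background-style task: merge all small runs into one big one.
        merged = sorted(t for run in self.runs for t in run)
        self.runs = [merged] if merged else []

    def contains(self, triple):
        # A read must consult the memtable and every partial index,
        # which is why compaction matters for read performance.
        if triple in self.memtable:
            return True
        for run in self.runs:
            i = bisect.bisect_left(run, triple)
            if i < len(run) and run[i] == triple:
                return True
        return False
```

A real implementation would also need tombstones for deletions, on-disk run formats, and concurrency control, which is where the "significant implementation work" lies.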

The toolkits: Apache Jena and Eclipse RDF4J

Jena and RDF4J are two well-maintained, mature, and feature-rich RDF toolkits written in Java. Both provide native storage systems with SPARQL support. However, they seem to be mostly focused on light or medium workloads. Their latest storage systems, TDB2 for Jena and the LMDB-based store for RDF4J, both allow only a single writer thread and use a copy-on-write mechanism. So we end up with the same problem as with Virtuoso. Unlike Virtuoso, they do not seem to be widely used for large datasets. I would be very curious to see good benchmarks on how well they behave, especially now that RDF4J seems to have made some performance efforts in the past year.

The work-in-progress system: Oxigraph

Oxigraph is a work-in-progress SPARQL system aimed at write-heavy workloads. It is based on RocksDB, a key-value store that targets such workloads. I develop it myself as a hobby project, at a very slow pace. The SPARQL implementation is complete and basic OLTP operations seem fairly competitive. However, before it could possibly work at Wikidata scale, it still requires at least two significant improvements: the initial loader is quite slow (loading Wikidata currently takes more than a week), and Oxigraph does not yet have a proper query planner and optimizer, leading to abysmal performance on complex analytical queries. So Oxigraph is unlikely to solve Wikidata's problems any time soon, even setting aside the "bus factor 1" problem.

The dead experiment: History Query Service

The History Query Service is a tool I developed to run big analytical queries on Wikidata's history. To this end, I built custom storage tailored to its needs using RocksDB, with RDF4J on top for SPARQL evaluation. Sadly, it suffered from serious shortcomings: building the index took a week on the full 2018 Wikidata history dumps, and I never managed to make it work again on more recent dumps. It also never supported updates. The query evaluation was based on the state of RDF4J at the time, leading to fair but not great query plans and quite slow execution times.

However, I believe it demonstrated some interesting ideas: building custom storage for Wikidata's history is possible, does not lead to a huge space blow-up compared to indexing only the latest version, and still allows fairly efficient querying. I hope at some point to build an optimized version 2 on top of Oxigraph. But for that to be worth it, Oxigraph needs a good and fast SPARQL query planner and evaluator, so it is not for tomorrow.

Conclusion

Like everyone before me, I find that there is no good answer for a Blazegraph replacement. At first glance, Virtuoso seems worth benchmarking to see whether it is actually better or worse than Blazegraph. I hope that Oxigraph might provide something in the future, but at the current development speed that will not happen for many years. Investigating Jena or RDF4J with custom storage providing good enough update performance, following the History Query Service ideas, might also be worth doing.