Every time we build a new feature we try to use open source technologies. We simply like it. For sure we have some commercial products such as HP Vertica, simply because open source columnar databases don’t have enough quality yet. We have a lot of freedom internally in the engineering around choosing the opens source technologies. Usually you just make a simple research around pros and cons of some specific area or technology and you just select “the best” technology available. Choosing criteria are mainly features, documentation quality and completeness, stability, performance and extensibility. Sometimes there is not enough technologies available so you choose a product that is not that easy to work with.
We are always trying to put something back to the open source project itself, newly created project (if possible) or we always do some bug fixes or report some bugs.
One of many open source projects we use in GoodData is the OpenOffice.org. It is not always easy to work with it, but I would like to share our experience here. We use this library in our tabular report export in GoodData Platform. We use it for almost 5 years, such a long time! And, it is one of the most controversial software we have in house. Even though OpenOffice has a plenty of features, we use just one that OpenOffice fits perfectly - XLS to PDF conversion.
We’ve started with simple CSV export for customers who needed at least some external version of computed report. In the next phase, we’ve added even better exports that keep the metric formatting, the same as customers knew from the GoodData UI.
Following XLS export solution includes also PDF export. We’ve chosen Spreadsheet::WriteExcel for XLS export because part of our backend is written in Perl, and we’ve chosen OpenOffice with Perl UNO bridge (OpenOffice::UNO) for PDF conversion. Number formatting was done externally using SpiderMonkey. Talking about OpenOffice pros, that are no doubt quality exports from XLS to PDF with unicode support. There is no free alternative to this solution so far, but let’s talk about some cons also (we realised after implementation):
- memory consumption
- memory leaking
- incomplete documentation for UNO
We use this solution till today and it’s not the best one. We’ve already discussed multiple times about re-implementing this for web-based solution which would use headless browser to “snap” a result to PDF. We use similar solution for our chart export. Anyway, there is almost always something more important that you have to work on and you sometimes don’t have enough time to re-implement something that “already works”. Currently, we daily export approximately 10.000 tabular reports, just 800 to PDF and the rest is XLS (~3300), HTML (~3200) and CSV (~2700).
You can simply realised that customers more often use other formats. But it may also be due to less quality of PDF export. Talking about the cons above, there is more often events such timeout or crashing during exporting huge tables and so customers choose other formats.
In the usage model we finally matured to the situation where 1 worker that process export has 1 OpenOffice process in headless mode. It can do 10 exports and then the worker is restarted including OpenOffice process, mainly due to memory leaking. Unfortunately, the OpenOffice process is launched by oosplash process, which has the control and so you couldn’t manage the suffice process directly:
That means another impact in case of restarting the worker - oosplash sometimes doesn’t react to to the termination signal and then we have parentless processes on the server that allocate memory and you have to kill them separately. Usually it is the oosplash:
or the suffice itself:
It is related to wrong messaging between OpenOffice processes - oosplash wrongly propagates
SIGTERM signal and translates it as
SIGKILL for soffice.bin, which doesn’t provide correct termination and cleanup - is necessary to do cleanup manually (pipe file). We currently use
SIGINT for termination which behaves better but it does not always ensure proper ending and cleanup. Few teammates already challenged this issue but we still are not in state when it runs automatically without outside interventions.
Another Feature Request
And now imagine a situation where you need to implement new feature request. Recently, it was landscape mode export. Trying to google this, you realise that a lot of people is solving this problem, anyway only a few of them are successful. I agreed with my teammate we will do a parallel research. One of us look for perl solution and the other for java based solution. My expectation was that we find java based solution (thanks to little bit better documentation and bigger usage). In the end we found possible solution in almost same time which was surprise to us. Another surprise was that we have the same result with two different approaches.
Perl snippet looks as follows:
and Java is here:
For Perl solution we reach the goal by modifying the page style - IsLandscape is just flag without any impact to the result. The key part is changing the Width and Height. For Java solution we use print properties settings, that we tried to do also in Perl without any success. If I add that finding solution was the whole day of trying, you can simply realised the quality of documentation.
Despite all of difficulties mentioned above, OpenOffice still fills up its role and doesn’t have any other alternate solution without huge and painful architecture change. It is not only our problem, it is a problem in a lot of other companies. Try to search on Google for “open office pdf conversion” - 324 million results vs “excel pdf conversion” - 162 million results. It means there is still a lot of interest in this open technology. If company generates Office documents and needs PDF format, the OpenOffice is a choice. OpenOffice is still favourite product and is still being developed. Unfortunately the headless mode doesn’t have a big attention from developers.
If I have to do this decision again, I would select different technology. Most probably something web based, with headless browser (maybe http://phantomjs.org/).
For you who are interested in OpenOffice I would recommend this tutorial and if you want to play in Java around with it try LibreOfficeCSVGenerator with interesting abstraction of OOoServer and BootstrapSocketConnector, which solve the initialisation and termination of headless OpenOffice process. If you are about to generate PDF directly, you can try great library - iText which is open source (until version 2.7.1, then AGPL) or if you want to export your web based outputs to PDF, go with PhantomJS.