Update on performance issues - Forum announcements


18/04/2024 15:54

Sergio Graziosi

Over the past months, EPPI Reviewer users have gracefully endured a number of performance issues, and while we've been busy trying to tackle and resolve them as soon as possible, we haven't been good enough at informing users about what was happening and why. This "off schedule" post is our attempt to remedy this and, for transparency purposes, explain what has been happening behind the scenes.
Back in September 2023 the need to rapidly upgrade our underlying infrastructure became impossible to ignore, and we set out to redesign it, with an eye on new technologies made available through the Azure infrastructure. We erred (very much) on the side of optimism and decided to try a relatively new system called "Azure SQL Managed Instance". The EPPI Centre webserver (running webpages and apps) was moved to a new-generation Virtual Machine, which proved to work more than well enough. On the database side, it was clear from Microsoft's documentation that they believe this class of services (more and more "virtualised") is the future, so we thought it was a good idea to switch to "Azure SQL Managed Instance", and thus future-proof our technical infrastructure.

That was a big mistake. Despite spending time evaluating performance before making the decision to jump on board, it turned out that our performance-testing routines, while simulating "regular usage" well enough, were not good enough to replicate the full range of activity patterns. In fact, EPPI Reviewer does, on regular occasions, need to run queries which stress the disk IO subsystem of the Managed Instance far too much. The result was a very noticeable degradation of performance for specific functions, like duplicate checking, where the system inevitably does need to read vast amounts of data in order to function. Other, regular functions were indeed running faster than before, but that did not help at all when some other functionalities were barely useable. The backups subsystem was also a source of problems: it is fully automated and thus not very customisable, and the amount of data held in EPPI Reviewer meant that the poor IO performance offered by the "Azure SQL Managed Instance" became an issue both in terms of how long it took for backups to complete and in terms of the very noticeable performance impact they had while they were running.

After liaising with Microsoft to try to resolve these (and other) issues, we managed to make backups work somewhat better, but the root cause of the problems (the limited bandwidth to/from the disks subsystem) could not be worked around, so in late Autumn we reluctantly made the decision to move back to an architecture based on virtual machines only. This time, we knew from the previous attempt what to look out for, so we could re-design our performance-testing routines to be more comprehensive, and we were finally able to switch back to a new-generation SQL Virtual Machine, which is currently in production and performing well overall.

Meanwhile, however, an entirely unrelated problem was brewing and went almost unnoticed for a little while. In March, we started recording a number of episodes whereby EPPI Reviewer became extremely slow. Such episodes appeared to last between 14 and 20 minutes, but were also "clustered" together, interspersed by short periods where things went back to normal. Isolating the root cause has been the main concern for us ever since. It soon became clear that it was SQL Server itself somehow causing this - during these "episodes", the SQL Server process started reading data at "saturation" level, using all the IO bandwidth, thus grinding all regular activity to a halt. These incidents were originally quite rare, but did become more frequent between late March and April and would have impacted most EPPI Reviewer users at one point or another.
SQL Server of course comes with a range of subsystems dedicated to analysing performance and current activity. Unfortunately, they are almost entirely blind to the root cause, and did not report it at all. This is why it took us an awfully long time to isolate the cause, as the troubleshooting process got side-tracked by all the things that SQL Server's own reporting functions did show - creating a series of red herrings. It turns out that the root cause was a single Full-Text index (used to support full-text searches on titles and abstracts). Upon inserting a new reference or updating an existing one, the relevant full-text index needs to be updated, and sometimes it was failing to do so, and would instead keep trying for 15-17 minutes, during which all other queries were badly affected, and queries that needed to update references were entirely blocked. SQL Server's "Full-Text" engine is indeed a separate application, which is probably why SQL Server’s own performance-analysis systems are almost entirely blind to it (not completely blind, luckily, as eventually they did indirectly point us in the right direction).

We do not know exactly what causes all this. In the relevant logs, the problem gets reported as "Error '0x80043630'" (which, helpfully, stands for "The filter daemon process MSFTEFD timed out for an unknown reason"), but we do now know with relative confidence that the diagnosis above is indeed correct. Based on this knowledge, we initially disabled the updating of the full-text index on Titles and Abstracts, which made sure the "episodes" stopped occurring. On Saturday 13 April, we re-constructed the relevant full-text index from scratch (which creates a "pristine", well-optimised index, but did take 8 hours to complete, so it's not something that can be done often) and since then, EPPI Reviewer has been fully functional and the "episodes" have not re-occurred. We are also putting in place new dedicated systems to keep this and all other EPPI Reviewer full-text indexes in good shape, and are relatively confident that they will be sufficient to prevent the problem from happening again in the foreseeable future. Of course, the real world might prove us wrong (again), but given the lack of documentation on this specific error, the only option we have is to monitor the situation (which is now possible and straightforward, as we do know where to look) and re-evaluate things in case the problem occurs again.
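
For readers curious about what "knowing where to look" might involve, here is a minimal sketch of one possible check: counting log lines that mention the error code quoted above, grouped by day. It is an illustration only, not the monitoring we actually run. The function name `count_timeouts_per_day` and the log layout (each line starting with a `YYYY-MM-DD` date) are assumptions made up for this example; real log files may well be laid out differently.

```python
import re
from collections import Counter

# Error code quoted in the post; everything else here is hypothetical.
ERROR_CODE = "0x80043630"

# Assumption: each log line starts with a date in YYYY-MM-DD form.
_DATE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})")


def count_timeouts_per_day(lines):
    """Count lines mentioning ERROR_CODE, grouped by their leading date.

    Lines that mention the code but have no recognisable leading date
    are counted under the key "unknown".
    """
    counts = Counter()
    for line in lines:
        # Case-insensitive match, in case the hex digits are upper-cased.
        if ERROR_CODE.lower() not in line.lower():
            continue
        match = _DATE_RE.match(line)
        counts[match.group(1) if match else "unknown"] += 1
    return dict(counts)


if __name__ == "__main__":
    # Made-up sample lines, only to show how the function is called.
    sample = [
        "2024-04-02 10:15:00 Error '0x80043630' occurred",
        "2024-04-02 10:31:00 Error '0x80043630' occurred",
        "2024-04-03 09:00:00 nothing to report",
    ]
    print(count_timeouts_per_day(sample))
```

A sudden rise in the daily count would be a hint that the "episodes" are returning, at which point the index can be looked at before users are badly affected.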

Overall, it has been a complicated semester for us in the EPPI Reviewer team, and no doubt, a frustrating one for all EPPI Reviewer users. All this also had a very significant impact on our ability to write new features, as we spent so much of our working hours investigating "issues" instead. From now on, hopefully, normal service should be resumed, both in terms of everyday usability and of writing new features.
