Version 6.18.0.0 is a major upgrade that adds new functionality in several areas. Highlights include: new "LLM Evaluation" features; support for GPT 5.x, Claude and other LLM models; and the ability to supply LLMs with a general "per item" (system) prompt. Also included: new functionality to plan, document and evaluate the coverage of OpenAlex (and more) in living-evidence scenarios, and an R&D implementation of the Buscar stopping rule for Priority Screening simulations.
New Features:
LLM Evaluations
In the "Search & Classify" tab, a new button ("LLM Evaluation") provides access to a new page dedicated to the evaluation of prompts and large language models for coding purposes.
The new "LLM Evaluation" button appears if LLM credit is present for the current review, and/or if LLM evaluations are already present for the current review.
The usefulness and accuracy of any LLM are a direct function of the quality of the prompts used to interrogate it; at the same time, it is uncontroversial that no reliable, known set of rules exists for generating "fit for purpose" prompts. This creates the need to proceed by cycles of trial and error, progressively refining prompts so as to incrementally improve the quality of LLM responses.
Until now, this was possible in EPPI Reviewer, but it required a high degree of manual (and well-organised) work. To reduce the required effort, the new functionality streamlines the "evaluation of current prompts" stage.
To use it, it is of course necessary to already have a "gold standard" dataset, containing items coded correctly with the codes/coding tool that contains the prompts to be evaluated.
The "run new evaluation" form available in the new page requires a name for the evaluation; it then needs to be pointed to a code that has been assigned to the items holding the "gold standard" data.
Once a code is picked, EPPI Reviewer will count how many (Included and/or Excluded) items have the chosen code. This value, along with the "number of iterations" setting (see below), is used to calculate the overall batch size, which cannot exceed the usual limit of 1000 "LLM API requests" per batch.
Users also need to indicate the coding tool containing the prompts to be evaluated, to select a robot/LLM, and to indicate how many times each item should be submitted to the LLM (the "Number of iterations"). LLMs never provide deterministic answers and will on occasion change their answer for no apparent reason. The only way to find out how often this may happen for a given "prompt and item" pair is to submit the same data to the LLM multiple times, which is why the "number of iterations" setting has been implemented.
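As a rough illustration of the arithmetic involved (the names below are ours, not EPPI Reviewer's), the overall batch size is simply the product of the two figures, checked against the 1000-request cap:

```python
# Illustrative only: how the overall batch size relates to the number of
# gold-standard items and the "number of iterations" setting.
MAX_REQUESTS_PER_BATCH = 1000  # the usual per-batch limit mentioned above

def batch_size(items_with_code: int, iterations: int) -> int:
    """Each item is submitted once per iteration, so the batch size is the product."""
    size = items_with_code * iterations
    if size > MAX_REQUESTS_PER_BATCH:
        raise ValueError(
            f"{size} LLM API requests exceed the {MAX_REQUESTS_PER_BATCH}-request limit; "
            "reduce the number of iterations or use a smaller gold-standard set."
        )
    return size

# Example: 180 gold-standard items, 5 iterations -> 900 requests (allowed).
print(batch_size(180, 5))
```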
Users can then click "Submit", which will generate a "batch Job" in the regular LLM jobs queue. The Jobs queue is shown in the page, allowing to easily and immediately follow proceedings. Once the job finishes, clicking "Refresh" in the "Evaluations List" panel will make the job appear in the list of past evaluations, with up-to-date overall figures.
This list provides an overview of evaluations, reporting the overall numbers of true/false positives and true/false negatives. Clicking on the evaluation name will fetch and display all the evaluation results.
Each evaluation stores an HTML representation of the coding tool used, which therefore provides a permanent record of the exact prompts submitted. It is also possible to download the raw data about the evaluation, for re-analysis in third-party software.
The "Summary performance metrics" table shows a list of codes belonging to the relevant coding tool, with per-code figures re "Sensitivity (Recall)", "Specificity", "Target Hits", "Actual Hits", "Precision", "F1 Score", and "Accuracy". "Target hits" counts how many items had the relevant code in the gold standard and multiplies it for the number of iterations. "Actual hits" shows how many times the relevant code has been assigned to items for the totality of the batch. Thus, when "Target" and "Actual" hits differ, it is already clear that the machine has made mistakes, while when the two figures are equal, to know whether mistakes happened, it is necessary to check whether the code has been assigned to the correct items.
It is important to note here that the evaluation functions only consider whether or not an item has received a given code, and do **not** consider the text optionally associated with it. Therefore these evaluations provide a "full and exhaustive" picture only for "boolean" prompts, and are thus most useful for evaluating "screening" prompts.
To find out exactly how well each prompt performed, the results also show a list of "Contingency" (/confusion) tables, one per code. These list the numbers of True Positives, True Negatives, False Positives and False Negatives. Figures shown are totals across all iterations (the mean per single iteration is displayed too). Whenever the figure shown is not zero, the cell is clickable, and clicking it will take the user to the corresponding list of items. This directly supports the need to refine prompts and re-submit the individual items for which the machine provided wrong answers.
The same new page also contains an ancillary new function, called "Create train/test datasets".
Given a gold standard dataset, it is good practice to split it into two sets: one used to develop prompts (in this case; in other cases, to build a classifier, etc.) and one used to test against fresh data that was not used in training. By default, this functionality splits the "Items with this code" into two equal sets, but it does of course allow users to specify how many items should go into each set. Once the required data has been specified, clicking "create" will generate two new codes, called "Train set" and "Test set", and randomly assign items to them in the quantities specified. This function is useful for prompt-evaluation purposes, but can also be used to aid building and evaluating custom classifiers, and more.
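A minimal sketch of the kind of random split this performs, assuming a plain list of item IDs (the names below are ours, purely for illustration):

```python
import random

def train_test_split(item_ids: list, train_size: int, seed=None):
    """Randomly assign item IDs to a "Train set" and a "Test set" of the requested sizes."""
    rng = random.Random(seed)
    shuffled = item_ids[:]          # copy, so the original order is untouched
    rng.shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]

# Example: 200 items with the chosen code, split into two equal sets (the default).
items = list(range(1, 201))
train, test = train_test_split(items, train_size=len(items) // 2, seed=42)
print(len(train), len(test))        # 100 100
```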
Support for additional LLM models
This release includes support for GPT 5.1, GPT 5.2, Claude Opus 4.5, Claude Sonnet 4.5, DeepSeek V3.2 and Mistral Large 3.
With the addition of the two Claude/Anthropic models, this release marks the point where EPPI Reviewer supports models from all mainstream Large Language Model providers. At the same time, some older models are starting to be "retired". So far, this is the case only for the "Mistral Large 24.11" model. As things stand (retirement dates tend to change without warning), the next models set to expire are Llama 3.1 (13 June) and DeepSeek-R1 (3 July).
LLM coding: user-defined system prompt
When using LLMs for robot-coding purposes, EPPI Reviewer automatically generates a "system" prompt that instructs the machine to provide answers in a computer-friendly format. This is then followed by the per-code, user-written prompts, and finally by the "user prompt", consisting of the "title and abstract" or "full text" of the item being processed. This arrangement works, but we have strong reasons to suspect that it can be very useful to provide the machine with additional "context" information, which EPPI Reviewer can now supply as part of the aforementioned system prompt. This happens when the description of the relevant coding tool starts with "Contextual_prompt:". When this is the case, all the text after the triggering prefix is included at the start of the system prompt submitted to the large language model of choice, followed by the non-optional instructions about how to format the replies, and then by the per-code prompts.
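Purely as an illustration of the arrangement described above (the formatting-instruction text below is a placeholder, not EPPI Reviewer's actual wording, and for simplicity the per-code prompts are shown concatenated into the system prompt), the prompts for a single item might be assembled along these lines:

```python
CONTEXT_PREFIX = "Contextual_prompt:"

# Placeholder only: EPPI Reviewer's real formatting instructions are different.
FORMATTING_INSTRUCTIONS = "Answer each question and return your answers in the required structured format."

def build_prompts(tool_description: str, per_code_prompts: list, item_text: str):
    """Assemble an illustrative (system prompt, user prompt) pair for one item."""
    parts = []
    if tool_description.startswith(CONTEXT_PREFIX):
        # Everything after the trigger prefix becomes the leading "context" block.
        parts.append(tool_description[len(CONTEXT_PREFIX):].strip())
    parts.append(FORMATTING_INSTRUCTIONS)      # the non-optional formatting instructions
    parts.extend(per_code_prompts)             # the per-code, user-written prompts
    system_prompt = "\n\n".join(parts)
    user_prompt = item_text                    # title and abstract, or full text
    return system_prompt, user_prompt
```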
Completing/un-completing coding from the Item Details page
In the item details page, EPPI Reviewer already allows users to mark the coding shown as "complete" if/when it belongs to the logged-in person, or to un-complete completed coding (if present). It also allows users to "live compare" all available coding versions through the "live comparison" feature (accessed from the "coding record" tab). This latter feature now also allows Review Administrators to "complete and lock", "complete", or "un-complete" the coding versions shown. This functionality is activated via the dedicated "Enable Editing" checkbox; when it is active, the buttons to perform the complete/un-complete actions will appear according to the state of the coding shown in the "Live comparison" panel.
"All coding report" on "Items With this Code"
In the "Coding Progress" area of the "Review Home" tab, Review Administrators have access to "Get 'all coding' report for this tool". These functions can be used to generate reports that include all coding data, from all users, completed and incomplete. In big reviews, "all coding" reports can easily grow very big, to the point of becoming too-big-to-handle, which in turn generates headaches if and when specific workflows rely on them. To solve the problem, in the "More..." section of the "Get all coding" panel there is a new "Only with this code" option. When ticked, it requires to select a code from the coding-tools column on the right, in order to decide which code to use. It will then generate the reports normally, but instead of considering all items that happen to have coding data for the relevant coding tool, it will restrict results also to items with the code indicated.
In "Update Review", "Search and browse", the list of text-searches has been enhanced
- It is now possible to "enumerate" (short for: getting and recording all results directly from the OpenAlex API) the search results at any time. Doing so fixes the list of results at "enumeration time".
- Upon importing the search results, results are automatically enumerated if necessary, otherwise the already recorded set of results is imported.
- The date when the results have been enumerated (and therefore "frozen in time") is always recorded and shown (if present).
- Combining searches also enumerates (and fixes) the search results of the searches used. When combining already enumerated searches, the process is significantly faster.
- It is always possible to re-run primary text-searches, so as to obtain up-to-date results (this has not changed).
This set of changes has been introduced to allow keeping an exact trace of what was imported and how, and/or what was searched and found but not imported. It is part of a set of functionalities included in this release which, taken together, make it possible to fully evaluate whether OpenAlex can be used as the single source of updates for a given living review, and/or which other sources need to be used alongside it.
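For readers unfamiliar with the term, "enumerating" a text-search simply means paging through the OpenAlex API until every result has been retrieved and recorded. The standalone sketch below illustrates the idea using OpenAlex's public cursor paging; it is not EPPI Reviewer's internal implementation, and the stored fields are our own choice:

```python
import requests

def enumerate_openalex_search(query: str, mailto: str = "you@example.org") -> list:
    """Retrieve and record every result of an OpenAlex works search via cursor paging."""
    results, cursor = [], "*"
    while cursor:
        response = requests.get(
            "https://api.openalex.org/works",
            params={"search": query, "per-page": 200, "cursor": cursor, "mailto": mailto},
            timeout=30,
        )
        response.raise_for_status()
        page = response.json()
        # Keep only the fields needed to "freeze" the result list at enumeration time.
        results.extend(
            {"id": work["id"], "title": work.get("display_name"), "doi": work.get("doi")}
            for work in page["results"]
        )
        cursor = page["meta"].get("next_cursor")  # None once the last page is reached
    return results
```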
Better default names and data for sources created via OpenAlex
In "update review" there are four different ways to import new items in the review: from AutoUpdate feeds, from Related Searches, from Text searches, and as "individually selected" items. For each type, upon importing, EPPI Reviewer creates a new source and automatically generates the source name and data. The ways in which source names are created have been improved significantly. What gets saved in the other source fields has also been extended and improved, making it easier to maintain a reliable record of what was imported and how. Moreover, whenever "import filters" were active upon import, the filter values are now automatically recorded in the "notes" field of each source.
OpenAlex "Origin Report"
In "Update Review", "Match records", the "Actions on items with this code" section contains a new button, "Origin Report". Having picked "this code", clicking this button will generate a new report, which:
- Considers all Included and Excluded items that belong to the code chosen.
- Produces Summary tables, counting how many items belong to each "type" of OpenAlex search.
- For each item, this report shows whether the item appears in AutoUpdate results (along with its scores), in any "related search", or in any enumerated text-search. It also reports the source with which the item was imported.
This report, together with the previous two new features (above), supports the streamlined evaluation of the coverage provided by OpenAlex, compared to any other (manually imported) source of items. It is therefore especially useful to set up and maintain living-evidence update cycles. The report is designed to be "readable" by both people and machines, so as to facilitate data analysis. It can be saved in either HTML or Excel file format.
Improvements to the Meta Analysis page
In the "Outcomes Table", besides the automatically-generated (up to 30) "outcome classification" columns, 2 new (related) types of custom columns are now available: "Answer (Outcome level)" and "Question (Outcome level)".
The former is equivalent to the automatically-generated "outcome classification" columns, and was added to aid practicality and for consistency reasons. The latter new type of column ("Question (Outcome level)") is genuine new functionality and completes the range of options provided. These 2 new type of columns, together with the two existing types of custom columns ("Answer (Item level)" and "Question (Item level)") thus allow to use the Outcomes table to find/filter/sort by all possible criteria, matching any and all types of data collected during data extraction.
The "extreme customisation" facilities offered by the Outcomes Table are intended to support the precise identification of the desired outcomes. Once this is done, it is then possible to use the Meta Analysis functions of EPPI Reviewer itself, or to export the data and analyse it elsewhere. For the latter purpose, EPPI Reviewer offers two "types" of exporting routines. The regular route exports the outcomes table "as shown" (with minor differences). Otherwise the "Raw" format exports a lot more, and crucially includes the exact values that were "data-extracted", rather than only the calculated effect sizes and SE values.
Until now, both "regular" and "raw" export types had limitations and a number of problems/bugs that limited their intended usefulness. Both systems have been re-written from scratch and should now be uncompromisingly fit for purpose.
The most significant improvements (besides general reliably) are:
- Both formats now include the ItemId, before they only reported the "Short title", which can sometimes be ambiguous.
- Both formats fix the inconsistencies we previously had between naming of fields, specifically about "outcome name" and "outcome description". This has been fixed also for the "Outcome table" itself, as shown in the UI.
- Order and number of columns in the raw format is now optimised for functionality, before it was not organised in a consistent/logical manner.
Priority Screening Simulations
Priority Screening Simulations are available from the "Search & Classify" tab: running a simulation will now generate data on "when it would be safe to stop screening" according to the Buscar "stopping algorithm". These simulations are useful for Research and Development purposes and, with this addition, will help to inform our future decision on whether to implement this algorithm in the regular Priority Screening workflow.
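For readers curious about the underlying idea, the Buscar criterion applies a hypergeometric test: it asks how surprising the recent run of irrelevant items would be if enough relevant items were still unscreened for a chosen recall target to be missed. The sketch below is our own loose rendering of that test (based on the published description of the criterion, not on EPPI Reviewer's code) and glosses over edge cases:

```python
from scipy.stats import hypergeom

def buscar_p_score(labels: list, n_total: int, recall_target: float = 0.95) -> float:
    """labels: 0/1 relevance judgements in screening order; n_total: size of the whole dataset.
    Returns a p-score for the hypothesis that the recall target has been missed;
    a low value suggests it may be safe to stop screening."""
    n_seen, r_seen = len(labels), sum(labels)
    # Smallest total number of relevant items that would mean recall < recall_target.
    k_missed_total = int(r_seen / recall_target) + 1
    p = 1.0
    for i in range(n_seen):
        k_after = sum(labels[i:])                  # relevant items found after point i
        n_after = n_seen - i                       # items screened after point i
        pool = n_total - i                         # items still unscreened at point i
        relevant_left = k_missed_total - (r_seen - k_after)  # relevant left under the null
        p = min(p, hypergeom.cdf(k_after, pool, relevant_left, n_after))
    return p

# Example: 5,000 items in total, a short screening sequence ending in a run of 0s.
# With so few items screened the score stays close to 1, i.e. not yet safe to stop.
print(buscar_p_score([1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], n_total=5000))
```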
Enhancements:
In the "Classify" (Apply classifiers) panel, pre-built classifiers now have a short description enriched with links to external resources/documentation.
In Item Details, the "URL" and "DOI" fields now have a "copy" button, which allows to copy their values with a single click.
EPPI Visualiser:
Panels in the visualisation home page now remember their "expanded or collapsed" state; it is thus possible to navigate to inner pages, return to the home page, and find the panels in the same open/closed state as before. This does not include retaining the data they were showing, for those panels that show user-determined content.
Bug fixes:
Many machine learning jobs, including build, rebuild and apply classifiers, "Check screening" and "Priority Screening Simulations", are now more reliable. Because of a problem rooted in the basic architecture of EPPI Reviewer, these jobs could fail at what appeared to be random times (affecting up to 1% of job executions). This problem is now solved for the workloads mentioned; in the following release we expect to extend this improvement to priority screening jobs, which are currently still affected.
Priority screening training: on rare occasions, the routines that collect data to "train and score" could get stuck and fail to complete in a timely fashion. This would result in a training round not completing, without generating obvious errors in the UI. To solve this, the whole routine has been re-written, making it far more reliable (and neither faster nor slower in normal circumstances), as well as more accurate in the presence of contradictory data (i.e. when an item has been assigned to both include and exclude codes).
Incorrect counts and item lists from "OpenAlex matching". In "Update Review", "Match records", figures about how many items are matched to OpenAlex records (and how well they are matched) are shown. It is also possible to click on the corresponding links and be sent to the matching list of EPPI Reviewer items. We discovered a few rare cases where either the counts or the corresponding item list could be slightly wrong, missing one or two items in a thousand, or counting the same item twice. To resolve this, we re-wrote all associated routines from scratch; the figures are now accurate, and the corresponding item lists are faster to load.
The "Meta Analysis" page allows users to generate custom columns which can be used to identify precisely which outcomes belong to a specific Meta Analysis. This system could break if/when the code used for any one of the custom columns had been deleted. To fix this problem, we implemented a new system that automatically removes custom columns whenever the code they refer to has been deleted.
"Delete source forever": the function to irreversibly delete a source and all its items would fail if any of the affected items contained timepoints. Moreover, the operation would fail "ungracefully", resulting in no immediately visible/recognisable error messages. This problem is now solved.
Deleting a whole coding tool: this operation could fail (and did show an error message) if the coding tool was used in a visualisation and, within it, in a preconfigured map. To resolve this, the code that checks for consequences before deleting coding tools and codes has been updated. When necessary, the user interface will tell users "you need to remove this tool/code from the visualisation/map first". In other cases, it will warn users that a given deletion may have (undesired) effects on existing visualisations. Moreover, in the "setup visualisations" page, if/when a map refers to now-deleted codes, the affected dimensions in the map are highlighted as requiring attention. In EPPI Visualiser, any affected map will appear in the visualisation homepage, but its "open map" button will be disabled.
Updated on: 16 Apr 2026