This release marks the point where the LLM coding features have reached a sufficiently mature stage to be made available (on request) to all users. This release also includes a brand new machine-learning-powered function to retrospectively check screening decisions.
LLM-powered coding
The core of the feature is a system to instruct a large language model (LLM) through user-written prompts. The LLM available at the moment is OpenAI's GPT-4o; however, we will support other models, including those from open-source providers. Having the prompts tied to individual codes allows EPPI Reviewer to submit them along with the data pertaining to one item, process the LLM's answer, and code the submitted item accordingly.
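As a rough illustration of this flow, here is a minimal sketch in Python; the names used (e.g. `CodePrompt`, `call_llm`) are hypothetical and do not reflect EPPI Reviewer's internal API, and the "YES"/"NO" answer convention is simply an assumption made for the example.

```python
# Hypothetical sketch of the per-code prompt workflow; names are illustrative,
# not EPPI Reviewer's actual API.
from dataclasses import dataclass

@dataclass
class CodePrompt:
    code_id: int
    prompt: str          # user-written instructions attached to this code

@dataclass
class Item:
    item_id: int
    title: str
    abstract: str

def code_item_with_llm(item: Item, prompts: list[CodePrompt], call_llm) -> dict[int, bool]:
    """For each code, send its prompt plus the item's text to the LLM and
    record whether the model says the code applies."""
    decisions = {}
    for cp in prompts:
        answer = call_llm(
            system=cp.prompt,
            user=f"TITLE: {item.title}\nABSTRACT: {item.abstract}",
        )
        # Assumed convention: the prompt asks the model to reply "YES" or "NO".
        decisions[cp.code_id] = answer.strip().upper().startswith("YES")
    return decisions
```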
It is well known that LLMs can make mistakes: even with carefully engineered prompts, error rates may be low overall, yet still affect certain combinations of studies and questions systematically. Care must be taken to ensure that this does not undermine your evidence synthesis.
Therefore, the set of features implemented in EPPI Reviewer is explicitly designed to support the careful, per-reference and per-question evaluation of results and, of course, the iterative refinement of prompts.
LLM-driven coding can now be enabled on a per-review basis, strictly on-demand. This is because we want to make sure that anyone using these features will have read the documentation and guidance we're in the process of updating. The general "how to" instructions are here and will be expanded/updated shortly.
Every call to the LLM API incurs a cost which will need to be paid for by advance credit or the use of your own OpenAI API key (coming soon).
Please contact EPPI Support to get LLM coding enabled in your reviews, receive the documentation, and set up the payment system.
Summary of LLM-related features
General:
- LLM jobs can be submitted on a per-item basis, in which case they always use (only) title and abstract, and execute in real time.
- LLM jobs can also be submitted in batches of up to 1000, which get queued, logged and executed by EPPI Reviewer on the server side.
- LLM batches can use titles and abstracts or full text documents (PDFs only).
- LLM batches can run in parallel, minimising the impact of bottlenecks created by someone submitting lots of large jobs.
- Users can cancel submitted jobs (from the job details, in the "queued" list), as long as a given job hasn't started yet.
- The list of past jobs (user- and review-specific) can be consulted at any time, allowing users to check costs, batch-specific errors and more.
Specific features aimed at supporting careful evaluation and refinement of prompts:
- The "all coding report" (for a given coding tool) in Excel and/or JSON (own format). This allows people to double-code along with the LLM, so to spot specific questions that produce high levels of disagreement.
- "Delete all coding from this 'reviewer' in this coding tool" - allows to get a fresh start whenever needed.
- Copy a coding tool within a review, to allow creating a "version history", without too much hassle.
Cost, payment and accounting:
- EPPI Support can link any review or site license to any "Credit purchase", so end users can manage their own LLM costs by purchasing credit for this purpose (any leftovers can be used to pay for subscriptions too).
- Costs are calculated in a way that mirrors the native LLM system, so they depend on the number of "input and output tokens".
- Each call to the LLM API is costed, and then aggregated within the cost of the batch it belongs to (if any).
- We do not aim to make a profit, or to match the LLM's native cost exactly (which also varies with the GBP to USD exchange rate!), so we will update the LLM costs regularly, but not on an hourly or daily basis.
- Current costs in GBP are £2 per million input tokens and £9 per million output tokens (see the worked example below).
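To make the pricing concrete, here is a small worked example based on the rates above; the function name and token counts are invented for illustration, and the real accounting inside EPPI Reviewer may round and aggregate differently.

```python
# Illustrative cost calculation using the published rates above.
RATE_INPUT_GBP_PER_TOKEN = 2 / 1_000_000    # £2 per million input tokens
RATE_OUTPUT_GBP_PER_TOKEN = 9 / 1_000_000   # £9 per million output tokens

def call_cost_gbp(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single LLM API call, in GBP."""
    return (input_tokens * RATE_INPUT_GBP_PER_TOKEN
            + output_tokens * RATE_OUTPUT_GBP_PER_TOKEN)

# Example: a call consuming 3,000 input tokens and producing 400 output tokens
# costs 3000 * 0.000002 + 400 * 0.000009 = £0.0096 (about one penny).
print(call_cost_gbp(3_000, 400))
```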
New feature: Check Screening
This feature is intended to help people double-check their screening decisions, especially at the title and abstract stage. To use it, people will pick two codes, typically listing "all items included" and "all items excluded". The latter pot is then randomly split into 5 segments.
For each segment, EPPI Reviewer will use titles and abstracts to build a model, using all includes and 4/5ths of the excludes. It will then apply the classifier to the "kept aside" segment of excluded items. Results are then merged and sent back as "search results".
This provides an ordered list, where the items at the top are those with the highest probability of being "includes". It thus provides a quick way to check whether mistakes were made.
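The sketch below outlines the general idea using scikit-learn, with a TF-IDF plus logistic regression classifier as a stand-in; EPPI Reviewer's actual machine-learning model is not the one shown here, so treat this purely as an illustration of the 5-segment procedure described above.

```python
# Outline of the "Check Screening" procedure: split the excludes into 5 segments,
# train on all includes plus 4/5 of the excludes, score the held-out 1/5,
# then merge and rank the results.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def check_screening(include_texts, exclude_texts, n_segments=5):
    """Return excluded items ordered by their predicted probability of being an include."""
    scores = np.zeros(len(exclude_texts))
    kf = KFold(n_splits=n_segments, shuffle=True, random_state=0)
    for train_idx, held_out_idx in kf.split(exclude_texts):
        train_texts = include_texts + [exclude_texts[i] for i in train_idx]
        labels = [1] * len(include_texts) + [0] * len(train_idx)
        vectoriser = TfidfVectorizer()
        X_train = vectoriser.fit_transform(train_texts)
        model = LogisticRegression(max_iter=1000).fit(X_train, labels)
        X_held = vectoriser.transform([exclude_texts[i] for i in held_out_idx])
        scores[held_out_idx] = model.predict_proba(X_held)[:, 1]
    # Highest-scoring excludes are the most likely to be missed includes.
    order = np.argsort(-scores)
    return [(int(i), float(scores[i])) for i in order]
```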
LLM Options
LLM coding allows people to set a few options, mostly designed to ensure that users can always tell which coding was created by the robot. These settings are important, but the UI previously kept them hidden by default. They are now more visible, in a manner that is hopefully not too intrusive.
LLM batches queue
EPPI Reviewer can now process multiple batches in parallel. It also implements new logic to select which job to run next. When at least one other job is already running, instead of picking the oldest queued job, it will look for jobs triggered by someone else (not the people who triggered jobs that are currently running). This ensures that no single person can saturate the batches queue and make everyone else wait for hours or more.
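A minimal sketch of that selection rule (in hypothetical Python, not the actual server-side code) might look like this:

```python
# Hypothetical job-selection rule: prefer the oldest queued job from a user
# who has nothing running already; otherwise fall back to the oldest queued job.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Job:
    job_id: int
    user_id: int
    submitted_at: datetime

def pick_next_job(queued: list[Job], running: list[Job]) -> Optional[Job]:
    if not queued:
        return None
    busy_users = {job.user_id for job in running}
    candidates = [job for job in queued if job.user_id not in busy_users] or queued
    return min(candidates, key=lambda job: job.submitted_at)
```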
Moreover, it is now possible to cancel submitted jobs, as long as they are not already running; the cancel button appears in the "details" box, which helps prevent accidental clicks and ensures that only people who have control over a given job can cancel it (along with other checks, of course). We have also made a number of tweaks to the user interface to enhance usability.
Finally, if people make a mistake and submit a batch for coding, but select a coding tool that contains no prompts, the job will abort after trying to process the first item. Previously, it would try and fail to process each item, wasting time and processing power.
List of past robot jobs
This list provides access to the log of past jobs belonging to the logged-in user and/or the current review. The list is accessible from the "Review home\edit review" panel (limited to review admins) and from the "LLM batches" panel (all regular review users).
The list allows users to consult all details of past jobs, including their cost and, if present, any error details.
Priority screening: counting items screened in a session
The Item Details page in priority screening now includes a small counter, which keeps track of how many items have been screened in a given session. The count is retained if users leave the page and then resume screening, as long as they don't change or reload the review.
New Feature: copy coding tool within a review
The main purpose of this function (available in "edit coding tools", "edit tool") is to allow iterative refinement of LLM prompts, while retaining a record of previous versions.
The copied coding tool will be an exact replica, down to the name of the coding tool itself. We recommend editing the name(s) of the original and/or copied tool right after creating a copy, to avoid confusion. Moreover, for record-keeping purposes, we recommend locking the original tool, so as to prevent accidental editing.
Small change, "All Coding" Excel report
This report produces tables where reviewers appear in columns, repeated for each code included in the report. The order of such columns was previously determined by the order in which people created their first coding in the relevant coding tool. This meant that, while iterating prompt refinement (and deleting all coding done by the LLM), the order of reviewers could change across iterations, making the comparison of results harder. Reviewers are now ordered by their ID, thus keeping their order constant over time and iterations.
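For illustration only (the data and field names below are invented), the change amounts to sorting the reviewer columns by a stable key:

```python
# Old behaviour: column order depended on who created the first coding,
# which could change after deleting and re-creating the LLM's coding.
reviewers = [
    {"id": 42, "name": "LLM robot",  "first_coded": "2024-06-03"},
    {"id": 7,  "name": "Reviewer A", "first_coded": "2024-06-01"},
    {"id": 13, "name": "Reviewer B", "first_coded": "2024-06-02"},
]
old_order = sorted(reviewers, key=lambda r: r["first_coded"])  # unstable across iterations
new_order = sorted(reviewers, key=lambda r: r["id"])           # fixed by reviewer ID
```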
Bugfix: importing from OpenAlex allows users to specify comma-separated strings to filter out (exclude) some records. Such lists would not work if a space was included before or after the commas; this problem is now solved.
Bugfix: when creating "Reference Groups" while also changing reviews, it was possible to confuse the system and create reference groups in the wrong review. The resulting "Group N" codes would then refuse to be deleted. This could only happen by executing a very specific, and extremely rare, sequence of actions, which is why this bug remained undetected for years. The problem is now solved.