Data mining - Questions about using EPPI-Reviewer

Forum (Archive)

This forum is kept largely for historic reasons and for our latest changes announcements. (It was focused around the older EPPI Reviewer version 4.)

There are many informative posts and answers to common questions, but you may find our videos and other resources more informative if you are an EPPI Reviewer WEB user.

If you do have questions or require support, please email eppisupport@ucl.ac.uk.

31/01/2014 09:41

Mark Corbett

Joined: 24/01/2013

Posts: 14

Data mining

Hi
Just some feedback and questions about the "find similar documents" data mining functions:

For the TF*IDF option I was timed out several times until I'd removed many terms, and then I finally got some results in the search tab. Is there a maximum number of terms, or a recommended number? (Or can the time-out period be extended?) Can more than one term be selected at once? (Otherwise the deleting process is very slow.)

For TerMine I got an error message: Data at the root level is invalid, Line 1 position 1.

I also got an error message for Yahoo: The remote server returned an error: 404 (Not found)

Mark

31/01/2014 12:17

Sergio Graziosi

Joined: 17/10/2011

Posts: 319

Re: Data mining

Dear Mark,

Thanks for letting us know! Yahoo has been announcing the closure of its term-extraction service for years, and apparently has finally done it for real now. We will remove the option at the next available opportunity.
The TerMine issue was a licensing glitch and was swiftly resolved by NaCTeM. It is working now.

The timeouts are a complex issue: they depend on many factors, mostly (but not only) the current workload. This means that it's worth trying again every now and then, as timeouts should be the exception and not the norm. We can increase the timeout (and will consider doing so), but it's a tricky decision: keeping a relatively short timeout limit allows us to make sure the underlying data-access system remains in top form and doesn't hide inefficiencies. This in turn allows our system to support our ever-increasing user base. In other words, I would raise the timeout threshold only if I were convinced that there is no way to make the slow procedure more efficient.

Anyway, the guidance on how many terms can be used has to be fuzzy. We don't want to impose artificial limits on what can be done, as they would necessarily need to err on the safe side, significantly limiting the usefulness of EPPI-Reviewer functions. The practical consequence is that you will have to keep using the trial-and-error approach: searching 10,000 items with hundreds of weighted terms will always be a computationally costly task, so doing so may always trigger a timeout error. [Yes, there is a tension between our general approach to timeouts and the practical guidance above; we will give this some serious thought.]
On the other hand, deleting more than one term in one go is not possible at present, but I can see why it may be useful, so we will enable this option with the next update (still several weeks away, I'm afraid).
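
To illustrate why the number of weighted terms matters for the TF*IDF search, here is a minimal Python sketch of scoring items against a set of weighted terms. It is not EPPI-Reviewer's actual implementation: the function names, the whitespace tokenizer, the smoothed IDF formula and the sample titles are all assumptions made for illustration. The point is only that every item is scored against every term, so the work grows roughly with (number of items) x (number of terms).

```python
import math
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lower-case and split on whitespace.
    return text.lower().split()

def idf_weights(docs):
    """Smoothed inverse document frequency for every term in the collection."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # count each term once per document
    return {term: math.log((1 + n) / (1 + count)) + 1.0
            for term, count in df.items()}

def score(doc, weighted_terms):
    """Sum of (term frequency * term weight) over the selected terms."""
    tf = Counter(tokenize(doc))
    return sum(tf[term] * weight for term, weight in weighted_terms.items())

def find_similar(docs, weighted_terms, top_n=3):
    """Rank all documents by score; cost is about len(docs) * len(weighted_terms)."""
    ranked = sorted(range(len(docs)),
                    key=lambda i: score(docs[i], weighted_terms),
                    reverse=True)
    return ranked[:top_n]

# Made-up example titles, for illustration only.
docs = [
    "school based obesity prevention trial",
    "obesity prevention in primary school children",
    "housing policy and employment outcomes",
    "employment support for young adults",
]
idf = idf_weights(docs)
query_terms = {t: idf[t] for t in ("obesity", "prevention", "school")}
top = find_similar(docs, query_terms, top_n=2)
print(top)
```

With only three terms and four items this is instant; removing terms from a long list reduces the per-item work proportionally, which is consistent with the search succeeding once many terms had been deleted.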

Finally, we are now using a support vector machine for data mining in EPPI-Reviewer (applied to the screening stages). This is still in testing, but we can enable it for your review if you'd be interested?

Thanks and best wishes,

Sergio

31/01/2014 13:14

Mark Corbett

Joined: 24/01/2013

Posts: 14

Re: Data mining

Thanks for that, Sergio. I was looking at the data mining functionality in terms of possibly using it in future reviews (rather than in my current review), but thanks for the offer of the support vector machine. I'm interested in how this will improve on the current functionality?

31/01/2014 16:31

Sergio Graziosi

Joined: 17/10/2011

Posts: 319

Re: Data mining

The system we are testing right now is completely different from what you have tried. The basic idea is that the software will look at your decisions about which items should be included or excluded and gradually learn to predict your choices. Based on this "knowledge" it will present you with the next item to be screened, picking the one that currently has the highest probability of being included.

The result is that you can normally find the vast majority of included items by screening only a fraction of all the imported references. However, there can never be absolute certainty that all the "to be included" references have been identified until all of them have actually been looked at. In all cases, it speeds up big reviews because it allows you to start finding the relevant papers earlier and to proceed with some of the next stages while the (relatively unproductive) tail of the screening rounds is still going on.
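
The screening loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system being tested in EPPI-Reviewer: the bag-of-words features, the simple hinge-loss trainer standing in for a support vector machine, the parameter values and the made-up titles and decisions are all hypothetical. The classifier's score is used as a stand-in for "probability of being included".

```python
from collections import Counter

def features(text):
    # Hypothetical features: bag of lower-cased words.
    return Counter(text.lower().split())

def decision(w, x):
    """Linear score; higher means 'more likely to be included'."""
    return sum(w[t] * v for t, v in x.items())

def train_linear_svm(examples, epochs=50, lam=0.01, lr=0.1):
    """Linear SVM fitted by plain subgradient descent on the hinge loss.
    examples: list of (features, label) pairs, label +1 (include) or -1 (exclude)."""
    w = Counter()
    for _ in range(epochs):
        for x, y in examples:
            margin = y * decision(w, x)
            for t in list(w):              # L2 regularisation: shrink all weights
                w[t] *= (1 - lr * lam)
            if margin < 1:                 # inside the margin: push towards label
                for t, v in x.items():
                    w[t] += lr * y * v
    return w

# Made-up titles and reviewer decisions, for illustration only.
items = {
    0: "school obesity prevention trial",
    1: "housing policy employment",
    2: "obesity prevention programme in school children",
    3: "employment support housing benefit",
    4: "childhood obesity school intervention",
    5: "labour market policy analysis",
}
reviewer_decision = {0: 1, 1: -1, 2: 1, 3: -1, 4: 1, 5: -1}

screened = {0: 1, 1: -1}                   # seed: one include, one exclude
unscreened = {i: features(t) for i, t in items.items() if i not in screened}
order = []
while unscreened:
    training = [(features(items[i]), y) for i, y in screened.items()]
    w = train_linear_svm(training)
    nxt = max(unscreened, key=lambda i: decision(w, unscreened[i]))
    order.append(nxt)
    screened[nxt] = reviewer_decision[nxt]  # the reviewer screens the item
    del unscreened[nxt]
print(order)
```

In this toy run the two remaining relevant items are presented before the irrelevant ones, which is the effect described above: the relevant references tend to surface early, leaving a relatively unproductive tail.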

You can find more information in the following provisional paper: http://www.oapublishinglondon.com/images/article/pdf/1389491803.pdf
I will also privately send you a short (also provisional) guide to the specific system we are testing.

Best wishes,

Sergio
