HelpForum

Forum (Archive)

This forum is kept largely for historical reasons and for our latest-changes announcements. (It focused on the older EPPI Reviewer version 4.)

There are many informative posts and answers to common questions, but if you are an EPPI Reviewer WEB user you may find our videos and other resources more useful.

You can search the forum for existing posts. If you have questions or require support, please email eppisupport@ucl.ac.uk.

<< Back to main Help page

Managing Duplicates - Advanced Mark Automatically
New Post
13/01/2014 16:23
 

Hi there,

We have a large number of duplicates to get rid of in our review, and we would like to know the minimum similarity threshold you would recommend to automatically mark duplicates. So far we have reduced to 0.95 similarity but still have around 5000 to go through!

Thanks,

Jennifer

 
New Post
14/01/2014 10:02
 

Hello Jennifer,

The lowest threshold you would want to set depends on the similarity scores that you see in your duplicate groups. The similarity scores, in turn, often depend on the completeness of your imported items: if your records have missing data, the scores will be quite low. Likewise, if some of your author names are given in full while others use initials, you will see lower similarity scores.
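To illustrate the point about record completeness: EPPI Reviewer's internal similarity metric is not documented here, so the sketch below uses Python's standard-library `difflib` ratio purely as a stand-in. The citation strings are invented examples, but they show why records with initials or missing fields score below 1.0 against a complete record.

```python
# Illustration only: difflib's ratio stands in for EPPI Reviewer's
# (undocumented here) similarity metric.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two citation strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical records for the same paper, imported from different sources.
full = "Smith, John A. (2013) Effects of exercise on mood. J Health Psychol."
initials = "Smith, J. (2013) Effects of exercise on mood. J Health Psychol."
missing = "Smith, John A. (2013) Effects of exercise on mood."

print(similarity(full, full))      # identical records score exactly 1.0
print(similarity(full, initials))  # initials pull the score below 1.0
print(similarity(full, missing))   # a missing journal field lowers it too
```

Any real metric will differ in detail, but the direction of the effect is the same: the less complete or consistent the imported fields, the lower the score for a genuine duplicate pair.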

What I normally do is run it at the default initially (1.0) to complete as many groups as possible. I then look through the incomplete groups to get a feel for the range of similarity scores. Running it again at 0.95 (as you did) is what I would have done. I would then look at a number of incomplete groups, check their similarity scores, and estimate how low I could go. If I found items with similarity scores above 0.9 that weren't duplicates, that would tell me not to go that low. If everything above 0.9 was a duplicate, I would run it again at that level to catch as many more groups as possible.

I would continue to lower the score as long as I couldn't find a group that wasn't a duplicate. The lowest I have probably gone is about 0.85 but that was based on the similarity scores I was seeing in my data.
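The procedure above can be sketched as a simple loop: step the threshold down from 1.0 and stop as soon as a spot check finds a group at the candidate level that is not a true duplicate. The function and the `checked` results below are hypothetical stand-ins for the manual checks done in EPPI Reviewer's duplicates interface, with 0.85 as the floor mentioned above.

```python
# Hypothetical sketch of the threshold-lowering procedure described above.
# spot_check stands in for manually reviewing groups at a candidate level.

def safe_threshold(spot_check, start=1.0, floor=0.85, step=0.05):
    """Lower the auto-mark threshold while spot checks find no false matches.

    spot_check(t) should return True only if every manually checked group
    with a similarity score at or above t was a genuine duplicate.
    """
    threshold = start
    candidate = round(threshold - step, 2)
    while candidate >= floor and spot_check(candidate):
        threshold = candidate
        candidate = round(threshold - step, 2)
    return threshold

# Invented spot-check results: everything at or above 0.90 was a true
# duplicate, but a non-duplicate pair appeared at 0.85.
checked = {1.0: True, 0.95: True, 0.9: True, 0.85: False}
print(safe_threshold(lambda t: checked[t]))  # → 0.9
```

The key design point matches Jeff's advice: the threshold only moves down when the data earns it, so the lowest "safe" level is a property of your particular records, not a fixed number.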

Best regards,

Jeff

 
New Post
14/01/2014 10:26
 

Thank you Jeff, this is very helpful.

Best,

Jennifer

 


Copyright 2021 by EPPI-Centre