Help Forum

Forum (Archive)

This forum is kept largely for historical reasons and for announcements of our latest changes. (It was focused on the older EPPI Reviewer version 4.)

There are many informative posts and answers to common questions, but if you are an EPPI Reviewer WEB user you may find our videos and other resources more useful.

You can search the forum for existing posts. If you do have questions or require support, please email eppisupport@ucl.ac.uk.


New Post
26/01/2011 13:12
 

I'm currently having to go through a list of duplicates manually (as their similarity score is <1). Most of these references will not actually be relevant for our review, so they will be excluded. Is there a way of marking these references as 'excluded' whilst I go through and mark them as duplicates? It seems a shame to have to screen them all for duplicates and then go through them again to exclude them from the review. Thank you.

 
New Post
26/01/2011 15:40
 

Dear Beki,
Thanks for posting! The short answer is no, I'm afraid: we don't have a way to "mark as excluded" or assign codes to references while working in the duplicates window. However, you are suggesting a very useful feature, and you have got me and my colleagues seriously thinking about how to implement this.
It will not be easy, as there are numerous implications to take into account (not to mention the need to offer a clear and ergonomic user interface). I am not saying that this feature will not make it onto our roadmap; all I'm saying is that I can't make any promises. However, I have in mind a few (pretty urgent) additions to the duplicate checking features, so I will discuss your ideas with the rest of the team and try to agree on how to implement it; with a little luck, we might be able to write it along with the other new de-duplication features.

Unfortunately, I don't expect this to be ready in days, so I'm afraid it might not be published in time for you to take advantage of it.

Back on the subject: it is true that 'automatically marking' duplicates with a threshold lower than 1 is dangerous; however, it might be worth taking a calculated risk (especially for reviews that contain 10,000+ references). If you have already manually evaluated at least a hundred groups, you may be able to identify a 'safe' threshold to use with the 'advanced mark automatically' function. Take a look at the 'completed' groups and find the highest similarity score applied to items that you have identified as 'not a duplicate' (I'll call this value T). Let's say that T is 0.92xxxx; in this case, you may argue that selecting a threshold of 0.96 (halfway between 1 and T) is likely to be safe. If T was 0.88xxxx, the 'safe' threshold could be 0.94, and so on. If all the references you've seen so far are in fact legitimate duplicates, I would suggest continuing the manual procedure until you find a few non-duplicates.
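For illustration only, here is a minimal sketch (in Python) of the 'halfway' rule described above. It assumes the manually checked items are available as a list of (similarity score, your decision) pairs; that list and the safe_threshold function are hypothetical helpers for the arithmetic, not an EPPI Reviewer feature or export.

# Minimal sketch of the 'halfway' rule described above (illustrative only).
# 'checked' is a hypothetical list of manually evaluated items as
# (similarity_score, marked_as_duplicate) pairs; it is not an EPPI Reviewer export.
def safe_threshold(checked):
    """Return a threshold halfway between 1 and the highest similarity
    score seen on an item judged 'not a duplicate' (the value T above)."""
    non_duplicate_scores = [score for score, is_dup in checked if not is_dup]
    if not non_duplicate_scores:
        # No non-duplicates seen yet: keep checking manually, as suggested above.
        return None
    t = max(non_duplicate_scores)
    return (1 + t) / 2

# Example: the highest 'not a duplicate' score is 0.92, so the suggestion is 0.96.
checked = [(0.99, True), (0.95, True), (0.92, False), (0.88, False)]
print(safe_threshold(checked))  # prints roughly 0.96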

This way of proceeding is not guaranteed to be 100% accurate; it will be up to you to decide whether it's good enough (and of course, you may change how you calculate the 'safe' threshold). The big limitation of 'automatic marking' is that there is no easy 'undo' route, which is why it's risky! One mitigation is that the operation can be cancelled while it runs: to double-check the accuracy of the process, you can lower the threshold a bit, let the system evaluate a few groups, click cancel and manually look at what happened. Note that, to save time, the engine skips already completed groups.

I am very aware that duplicate checking is complex and time-consuming, so I'm hoping this message will help you a little.
Best wishes,
Sergio

 

 
New Post
27/01/2011 12:07
 

Hi Sergio,

Thanks for your thoughts. I decided to try re-running the de-duplication with the threshold set at 0.95, but I got the same problem as before:

('"MARK AUTOMATICALLY FAILED:
Could  not retrieve the current group details.
Data portal.Fetch failed (Object reference not set to an instance of an object)
Please contact the support team.")

Can you help? (I hope this doesn't mean I'll lose the ones I've marked manually!)

Thanks,

Beki

 

 
New Post
27/01/2011 12:19
 

Hi Beki,

This is really disappointing! I'm sorry to hear you are still having trouble. I had a quick look at your data, and it all looks all right at first sight. May I ask you to allow me to try the 'advanced mark automatically' function with a 0.95 threshold (on your review), to see if I get the same error?

Thanks,

Sergio

 
New Post
27/01/2011 12:36
 

Actually, with a second, more targeted search, I've found one (and only one) group that has wrong data. How one group out of 18,000 ended up wrong still escapes me; I'll investigate and report back. This is easy to fix, but I'd like to find out why it happened as well.

Sorry for the inconvenience,

Sergio

 
New Post
27/01/2011 12:54
 

Hi Sergio,

Thanks for looking into it. Having done as you suggested in your initial response, I may decide to drop the threshold lower than 0.95 (depending on how much difference that has made to the number needing to be manually screened). Once you've investigated, would it be OK for me to re-run it with a lower threshold?

Beki

 
New Post
27/01/2011 13:10
 

Hi Again,

I've finished collecting data: I know what happened in detail; I'll need to figure out why, but that is something I can do later, without keeping you on hold. I'm now running the 0.95 marking for you, just to be sure that it will complete correctly. I will let you know when it is over; at that point it will be absolutely fine for you to try lower thresholds.

Sorry once again for the trouble,

Sergio

 
New Post
27/01/2011 13:41
 

Hi Beki,

It is all done now. Please feel free to continue your work as usual.

Best wishes,

Sergio

 
New Post
27/01/2011 14:03
 

Thanks Sergio!

 
New Post
02/02/2011 16:40
 

Sergio Graziosi wrote

[...]
I know what happened in detail; I'll need to figure out why, but that is something I can do later, without keeping you on hold.
[...]

And finally, I've found out why it happened and fixed it. It was a quite unlikely case of data collision, much more difficult to replicate than to fix. Thanks for letting us know!

Sergio

 