HelpForum

Forum (Archive)

This forum is kept largely for historic reasons and for our latest changes announcements. (It was focused around the older EPPI Reviewer version 4.)

There are many informative posts and answers to common questions, but you may find our videos and other resources more informative if you are an EPPI Reviewer WEB user.

If you do have questions or require support, please email eppisupport@ucl.ac.uk.


New Post
06/12/2011 19:19
 

 Hi,

I am part of a team working on a scoping review. After running a first automatic duplicate check, out of 6630 or so articles, 1085 duplicates were found and marked. We then started screening the titles and abstracts and noticed that there were still duplicates. We performed a visual screening for duplicates and found around 200 more duplicates.  Although the titles and authors were similar, it seems the duplicate search was not able to identify all the duplicates. Is this common? Is there something we could have done to improve the search for duplicates?

Best regards,

Andrea

 
New Post
07/12/2011 11:49
 

Hello Andrea,

When you run the ‘Get new duplicates’ function, the system looks through all of the items in the review and compares each item with all of the others. Possible duplicates are grouped together into duplicate groups, which are displayed on the left side of the screen in the ‘Manage Duplicate Groups’ window. The system calculates a score for each match based on the authors and titles, book authors and titles (when applicable) and journal names (when applicable). If the score is 1, the chance of an exact match is 100%. If there are slight differences in how the authors are written (such as initials rather than full names), the score for the match becomes lower. These scores can be seen if you click on a group on the left and look at the details of the items on the right (in the ‘Manage Duplicate Groups’ window).
When you run the ‘Mark Automatically’ function, the system will look at the items and their scores, and if an item's score is 1 it will mark that item as a duplicate. It is important to note that if an item has already been coded, the ‘Mark Automatically’ function will ignore that item, to avoid the situation where a coded item could be marked as a duplicate.
If you are finding items that the ‘Mark Automatically’ function did not pick up, the most likely reasons are that the score was less than 1 or that the item already had a code assigned to it when the ‘Mark Automatically’ function was run.

You can adjust the threshold of the ‘Mark Automatically’ function so it is less than 1. This can be done in the ‘Manual/Advance’ tab within the ‘Manage Duplicate Groups’ window. Lowering the threshold may pick up the ‘close’ matches, but setting it too low could result in non-duplicates being matched. You can look at the scores to get a feel for how low you should go.
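As a rough illustration of the behaviour described above, here is a minimal Python sketch of the ‘Mark Automatically’ decision. It is not EPPI-Reviewer's actual code: the `Candidate` class, the `mark_automatically` function, the item IDs and the `>=` comparison against the threshold are all assumptions made for the example. It only shows the two rules mentioned: an item is marked when its score reaches the threshold, and an item that already has a code is skipped.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one item in a duplicate group."""
    item_id: int
    score: float    # similarity score between 0 and 1
    is_coded: bool  # True if a code has already been applied to the item


def mark_automatically(candidates, threshold=1.0):
    """Return the IDs that would be marked as duplicates.

    Assumed rules: score must reach the threshold, and coded items
    are always left alone.
    """
    return [
        c.item_id
        for c in candidates
        if c.score >= threshold and not c.is_coded
    ]


# Made-up items: an exact match, a close match, and a coded exact match.
candidates = [
    Candidate(item_id=101, score=1.0, is_coded=False),
    Candidate(item_id=102, score=0.881, is_coded=False),
    Candidate(item_id=103, score=1.0, is_coded=True),
]

strict = mark_automatically(candidates)                   # default threshold of 1
relaxed = mark_automatically(candidates, threshold=0.85)  # lowered threshold
print(strict, relaxed)
```

With the default threshold only the uncoded exact match is picked up. Lowering the threshold also picks up the close match, but the coded item is skipped in both cases.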

In your case, lowering the threshold may not help as you have already done quite a bit of coding. If an item has had a code applied to it, the ‘Mark Automatically’ function will ignore that item. What you may want to do instead is to either add the duplicate items to an existing group (if similar items are already in a group) or place the items in a new duplicate group. You will find details on how to do this in the user manual under ‘Manually adding a duplicate’. Manually adding a duplicate is done in the ‘Manual/Advance’ tab within the ‘Manage Duplicate Groups’ window. You can select the item(s) from a list at the bottom of the page and pick the group to assign it to from the list of groups on the left side of the page. If a similar group doesn’t exist you can create a new duplicate group to add the items to.

Best regards,

Jeff
 

 
New Post
07/12/2011 15:49
 

 Hi Jeff,

Thank you for your rapid and detailed response. 

We did not run a "Mark automatically" duplicate check, but rather chose the "Get new duplicates" function in order to individually view and mark all the possible duplicates. There are quite a few articles that the system was not able to identify when the authors' names or the title were written differently. Indeed, the articles we found did not show up at all in the "Get new duplicates" search (not even at a low percentage). If we had run a "Mark automatically" search and had used a low threshold, would these duplicate articles have been found by the system? Wouldn't the results be the same, except that the duplicates would have been marked?

 We were able to deal with the articles we found as duplicates and add them manually to existing groups or create new groups. This was quite easy to do, but time consuming.

Thank you again,

Andrea

 
New Post
07/12/2011 16:12
 

Hello Andrea,

Do you have an example of items that were not picked up by 'Get new duplicates'? If you let me know the ID numbers I can find them in the database. I would need you to identify which items were grouped and which ones were missed (e.g. items 12345 and 23456 were grouped together but the system missed item 3456, which should have been in the same group, OR items 12345 and 23456 should have been grouped but were not).

'Get new duplicates' should be finding the similar items, but if some are slipping past we want to know about them to better understand why.

Thanks for your assistance,

Jeff

 

 
New Post
07/12/2011 20:37
 

 Hi Jeff,

Indeed, our team created a document that lists all the duplicates we manually added to a group or used to create a new group.  

Here are some examples of articles which were added to a group:

  • 4524577 added to 4523837 + 2066662 (group 387721)
  • 4661901 added to 4526755 + 4526073 (group 388043)
  • 2066725 added to 4526084 + 4526817 (group 388053)

Here are some examples of new groups:

  • 4523736 + 2066502 = group 390720
  • 2066516 + 4523774 = group 390721
  • 2066281 + 4523567 = group 390722
  • In fact, if I am not mistaken, groups 390560 to 390766 are new groups that we created.

Many thanks,

Andrea

 

 

 

 

 

 
New Post
08/12/2011 14:54
 

Hello Andrea,

Thank you for sending those details to us.  What you are seeing is what we would have expected.

If we look at the first case ‘4524577 added to 4523837 + 2066662 (group 387721)’ and compare the data, I can see that 4523837 and 2066662 have a similarity score of 0.881. The author names in 2066662 use the full first names while the author names in 4523837 use initials. This would account for the score being less than 1. The journal and the titles are the same in both (‘Studies in Health Technology & Informatics’ and ‘Architectural and usability considerations in the development of a Web 2.0-based EHR.’).
If we now look at 4524577, I can see that the authors are the same as in 4523837. The journal name is the same as well. The main difference occurs in the title field. For 4524577 the title is ‘Architectural and usability considerations in the development of a Web 2.0-based EHR... electronic health record.’ (the text ‘.. electronic health record.’ has been added to the title). The system is seeing the title as being very different, so it is not matching it with the other 2 items.

Having a quick look at the other examples, I can see that similar things occur. For ‘4661901 added to 4526755 + 4526073 (group 388043)’ the titles are the same for all 3 items. The author field is a bit different for the item not included (‘Shihab M ;’ vs ‘Shihab Mahmud M;’). The big difference is that item 4661901 (the one not placed in the group) did not have any data in the Journal field. The journal field is part of the matching/scoring algorithm, so a missing journal name can have a huge effect. The journal information may have been missing in the original search results or may have been lost if the import filter was not a good match for the file to be imported.
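To make the effect of a missing or altered field more concrete, here is a small Python sketch. It is only an illustrative stand-in, not the scoring algorithm EPPI-Reviewer uses: the equal weighting of authors, title and journal, the use of `difflib.SequenceMatcher`, and the names `field_similarity` and `match_score` are all assumptions. The records are composites built from details quoted in this thread.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity of two field values, between 0 and 1 (case-insensitive)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(item_a: dict, item_b: dict,
                fields=("authors", "title", "journal")) -> float:
    """Average the per-field similarities (assumed equal weights)."""
    total = sum(
        field_similarity(item_a.get(f, ""), item_b.get(f, "")) for f in fields
    )
    return total / len(fields)


base = {
    "authors": "Shihab M",
    "title": "Architectural and usability considerations in the "
             "development of a Web 2.0-based EHR.",
    "journal": "Studies in Health Technology & Informatics",
}

# Same record with the journal field left empty, as with item 4661901.
no_journal = dict(base, journal="")

# Same record with extra text appended to the title, as with item 4524577.
longer_title = dict(base, title=base["title"] + ".. electronic health record.")

print(match_score(base, base))
print(match_score(base, no_journal))
print(match_score(base, longer_title))
```

In this sketch an empty journal field contributes nothing to the average, so with three equally weighted fields the score drops by a third even though the authors and title are identical. Note that this stand-in penalises the appended title text only mildly, whereas the real system evidently treats such titles as very different, so the numbers here should not be read as what EPPI-Reviewer would report.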

At this time the user does not have the option of adjusting the threshold of what is considered a match when 'Get new duplicates' is run. This is something we are looking at adding to the program but this will not happen in the near future.

If you have any questions about this please let me know.

Best regards,

Jeff

 


Copyright 2021 by EPPI-Centre