Dear Abdul,
I had a look at your review: your claim that "both articles were identical" worried me. This should never happen: it would be very wrong if two articles with identical "fields" (title, authors, journal, and so on) weren't picked up by the automatic duplicate checking. After looking around for a while, I did find two items that are the same (they represent the same original article), but their fields are quite different, so I can see why this happened.
Forgive me, but I'll need to go into some technical detail to explain this.
First, the two items I've spotted (there may be others) are 7787378 and 7787450; their short titles are "Cooper (2011)" and "James (2011)". If you look at the two of them, you'll notice that both have a very long list of authors, and that item 7787378 has a much longer list than 7787450 (it includes organisation information). Since the article title is relatively short, and there is also a small difference in the journal title, the overall differences were considered too big, and the duplicate detection routine ignored the similarity.

The automatic procedures can't discern similarities that are obvious to humans; they just use a fixed algorithm to calculate a similarity score and work on that. This score is used to decide whether items are possible candidates or not: below a given similarity threshold, items will not be suggested as duplicates. Our job (as the developers) is to set the threshold so that it works well in most situations. This is a difficult task, because a low threshold creates more false positives, which in turn generates more work for all reviewers (who need to sift through more, unnecessary candidates), or leads to actual mistakes, if reviewers "automatically assign" items as duplicates using a low similarity threshold and never check the results in person.
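To make the mechanics concrete, here is a minimal sketch (in Python) of how a score-and-threshold check of this kind could work. This is illustration only, not our actual implementation: the compared fields, their weights, the 0.75 threshold, and the example records are all made up.

    # Toy sketch (NOT the real algorithm): a weighted, field-by-field
    # similarity score compared against a fixed threshold. Field names,
    # weights and the 0.75 threshold are invented for illustration.
    from difflib import SequenceMatcher

    def field_similarity(a: str, b: str) -> float:
        """Similarity ratio in [0, 1] between two field values."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def similarity_score(item1: dict, item2: dict) -> float:
        """Weighted average of per-field similarities."""
        weights = {"title": 0.4, "authors": 0.3, "journal": 0.2, "year": 0.1}
        return sum(
            w * field_similarity(item1.get(f, ""), item2.get(f, ""))
            for f, w in weights.items()
        )

    THRESHOLD = 0.75  # below this, items are never suggested as candidates

    cooper = {
        "title": "Some article title",
        "authors": "Cooper A; James B",  # short author list
        "journal": "J Example Med",
        "year": "2011",
    }
    james = {
        "title": "Some article title",
        "authors": "James B; Cooper A; ... (very long list, with organisations)",
        "journal": "Journal of Example Medicine",
        "year": "2011",
    }

    score = similarity_score(cooper, james)
    print(f"score = {score:.2f}, candidate = {score >= THRESHOLD}")
    # If the much longer author list and the abbreviated journal title
    # drag the weighted score below THRESHOLD, this pair is never
    # suggested, even though a human would spot the match immediately.

Pushing the threshold down would catch pairs like this one, but at the cost of flooding reviewers with false positives: that is exactly the trade-off described above.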
Our approach, therefore, was to accept the risk of letting a few actual duplicates "slip past" the automatic procedure, rather than risk generating too much work or even undue exclusions from a review. The flip side is that your only option for the remaining duplicates is "manual detection". In your case, ordering the main list of documents by title is a fast and effective way to spot possible duplicates visually. [You may want to increase the page size; this is done through the "select the fields you want to display" button on the main documents page (the third button from the right). The maximum is 4000 items per page.]
Once you identify some actual duplicates, you can proceed in two ways: the first is quick and dirty (suitable if you don't need an accurate "duplicates" count), the second a little slower.
First route: delete/exclude the additional copies. You could also create a "duplicates" administrative code and assign items to it; at the end of this exercise you can list all the items with that code and delete/exclude them in one go.
Second route: mark items as duplicates, grouping duplicate items together in the proper way:
- Select the two (or more) duplicate items (making sure you are selecting items that should be part of the same group).
- Click "manage duplicates" and go to the Manual/Advanced tab.
- Click "Find\Find groups that contain the selected Documents".
  - If no group is found, click the Plus button ("Add new Group") and follow the on-screen instructions.
  - If a group is found, click the "Add Selected Item(s)" button. This tries to add the selected items to the selected group; any item that is already part of it will be ignored.
  - If more than one group is found, let us know: I will explain why this may happen and how to sort these cases out (the sketch after this list shows one way to picture it).
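Incidentally, a handy way to picture these groups (this is an analogy of mine, not a description of the software's internals): each group behaves like a disjoint set of items with one representative, and "more than one group found" means your selection spans two sets that may need merging. A toy Python sketch of that idea, where the "Smith" items are invented examples:

    # Illustrative analogy only (an assumption, not the real internals):
    # duplicate groups as disjoint sets, with merging via union-find.
    class DuplicateGroups:
        def __init__(self):
            self.parent = {}  # item -> representative of its group

        def find(self, item):
            """Return the representative of the group containing item."""
            self.parent.setdefault(item, item)
            while self.parent[item] != item:
                self.parent[item] = self.parent[self.parent[item]]
                item = self.parent[item]
            return item

        def add_to_group(self, *items):
            """Put all items in one group, merging groups if needed."""
            root = self.find(items[0])
            for it in items[1:]:
                self.parent[self.find(it)] = root

    groups = DuplicateGroups()
    groups.add_to_group("Cooper (2011)", "James (2011)")    # one group
    groups.add_to_group("Smith (2011)", "Smith (2011) v2")  # another group
    # Selecting one item from each group would surface TWO groups;
    # sorting it out amounts to merging them into one:
    groups.add_to_group("James (2011)", "Smith (2011)")
    print(groups.find("Cooper (2011)") == groups.find("Smith (2011) v2"))  # True

Again, this is just an analogy to show why separate groups can coexist and what "merging" means; in practice, simply let us know if you hit the "more than one group" situation.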
For completeness: you could also proceed in a very different way, that is, "fix" the field differences manually and then re-run the "find new duplicates" function. This would work, but it won't be faster and may create other issues, so I don't normally recommend this approach.
Sorry for the long message, I hope this makes sense to you.
Sergio