Version 220.127.116.11 is a relatively large one; most of the effort went into expanding the duplicate-checking functionality. In the meantime, internal testing of semi-automatic screening has produced very encouraging results, and work on the long-awaited online shop is proceeding steadily.
There are lots of new features and bug fixes in the duplicate-checking area. As a consequence, the best practices have changed a little. All details will be added to the user manual shortly.
The main purpose of the new features is to solve the problems that appear when overlapping groups are created. This happens if “get new duplicates” is clicked more than once, after importing new items. Since the master item is chosen automatically, each time new items are grouped for the first time, new and larger versions of existing groups may appear. The new features are meant to help you cope with such situations in several ways.
New features (Manual/Advanced tab):
- Find related groups: it is now possible to find groups based on three different criteria:
+ Find related groups will list all groups that share at least one document with the currently selected one.
+ Find groups that contain the selected Document(s).
+ Find groups based on a list of Document IDs.
Finding related groups is useful when dealing with overlapping groups. Finding by selected document or document ID is useful to manually check why a given item is or is not marked as a duplicate.
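The three lookups above can be pictured with a minimal sketch. It assumes a hypothetical in-memory model where each group is a set of document IDs; the function names and the data layout are illustrative only and do not reflect how the application stores or queries groups. Whether the selected group itself appears in the "related" list is also an assumption (here it is left out).

```python
from typing import Dict, Iterable, List, Set

# Hypothetical model: group ID -> set of document IDs belonging to that group.
Groups = Dict[int, Set[int]]


def find_related_groups(groups: Groups, selected_group: int) -> List[int]:
    """Return the groups that share at least one document with the selected group."""
    selected_docs = groups[selected_group]
    return sorted(
        gid
        for gid, docs in groups.items()
        if gid != selected_group and docs & selected_docs
    )


def find_groups_containing(groups: Groups, document_ids: Iterable[int]) -> List[int]:
    """Return the groups that contain at least one of the given document IDs.

    Covers both 'selected document(s)' and 'list of document IDs' lookups,
    since in this sketch they differ only in where the IDs come from.
    """
    wanted = set(document_ids)
    return sorted(gid for gid, docs in groups.items() if docs & wanted)


# Illustrative data: groups 1 and 2 overlap on document 12.
example_groups: Groups = {1: {10, 11, 12}, 2: {12, 13}, 3: {20, 21}}
```

In this toy data, asking for the groups related to group 1 returns group 2 only, which is the kind of overlap the new functions are meant to expose.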
- Reset: there are two ways to delete unwanted, wrong or outdated grouping data.
Option 1): You can delete all duplicate groups and keep information about documents already marked as duplicates. This will give you a fresh start to re-evaluate duplicates without losing the work already done.
Note that documents already marked as duplicates will not be re-evaluated, and this will have a few consequences:
1) When you 'Get new Duplicates' you should get a smaller number of groups as all 'completed' groups should not reappear.
2) Overlapping groups will not show up again; they may re-appear only if you import new items and 'get new duplicates' once more.
3) Information about the old groups will be lost! You will not be able to find out the similarity scores of items you have already marked as duplicates.
Option 2): You can delete all duplicate information, effectively going back to square one.
This will give you a fresh start to re-evaluate duplicates in case you believe what you have done so far is likely to be wrong.
Note that documents already marked as duplicates will reappear.
You might want to proceed with this rather radical choice in case:
1) You have used the 'Advanced Mark Automatically' feature with thresholds that were too permissive, and you have marked too many false positives as duplicates.
In this case, deleting all de-dup data and starting over again is likely to be faster than manually looking for errors.
2) You have a large number of overlapping groups and you have not invested a lot of time in manually evaluating groups.
Getting a 100% fresh start will eliminate overlapping groups and allow you to re-run the automatic marking procedure with little waste of time.
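The difference between the two reset options can be sketched as follows. This is only an illustration under stated assumptions: the `DedupState` class, its fields and the function names are hypothetical, not the application's real data model.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class DedupState:
    # Hypothetical structure: group ID -> member document IDs.
    groups: Dict[int, Set[int]] = field(default_factory=dict)
    # Hypothetical structure: duplicate document ID -> master document ID.
    marked_duplicates: Dict[int, int] = field(default_factory=dict)


def reset_groups_only(state: DedupState) -> None:
    """Option 1: delete all groups but keep the duplicate markings."""
    state.groups.clear()


def reset_everything(state: DedupState) -> None:
    """Option 2: delete groups and duplicate markings (back to square one)."""
    state.groups.clear()
    state.marked_duplicates.clear()


def documents_to_evaluate(state: DedupState, all_documents: Iterable[int]) -> Set[int]:
    """Documents a later 'get new duplicates' run would consider.

    Documents already marked as duplicates are skipped, as described for option 1.
    """
    return set(all_documents) - set(state.marked_duplicates)
```

After option 1 the marked duplicates stay out of the next evaluation; after option 2 every document is a candidate again, which is why previously marked duplicates reappear.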
New features (document details window):
- If the current document is marked as duplicate, the ID of its master will be shown.
- At the same time a “Restore” button is shown: you may use it to “unduplicate” the current item.
- Manual merge of groups now works in a slightly different way. Before the update, in certain situations it was possible to reach a dead end, which could prevent you from marking all the needed items as duplicates.
- Greying out individual group members (“read-only” items). In this case too, the behaviour was tweaked to prevent deadlocks and the creation of inconsistent data. The most relevant changes are: 1. Items that are marked as master in another group are now “read-only” in all other groups. 2. Items that are marked as duplicate in another group now always appear as “read-only”, checked and “not a duplicate” in all other groups.
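The two read-only rules can be expressed as a small decision function. This is a sketch only: the `Group` class, the `display_state` function and the returned dictionary keys are hypothetical names chosen for illustration, and the order in which the two rules are checked is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Group:
    # Hypothetical model of a duplicate group.
    master: int                                        # document ID of the master item
    members: Set[int]                                  # all document IDs in the group
    duplicates: Set[int] = field(default_factory=set)  # members marked as duplicates


def display_state(groups: Dict[int, Group], group_id: int, doc_id: int) -> Dict[str, object]:
    """Describe how a document would be shown inside the given group."""
    others = [g for gid, g in groups.items() if gid != group_id]
    # Rule 2: marked as duplicate in another group.
    if any(doc_id in g.duplicates for g in others):
        return {"read_only": True, "checked": True, "status": "not a duplicate"}
    # Rule 1: master of another group.
    if any(g.master == doc_id for g in others):
        return {"read_only": True}
    # No overlap-related restriction applies.
    return {"read_only": False}


# Illustrative data: document 11 is a duplicate in group 1 and also a member of group 2;
# document 12 is the master of group 2 and also a member of group 1.
example = {
    1: Group(master=10, members={10, 11, 12}, duplicates={11}),
    2: Group(master=12, members={11, 12, 13}),
}
```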
When dealing with overlapping groups, we now recommend a new best practice.
a) Users should use the “find related groups” function to identify overlapping groups (a greyed out group or group member is always a sign of overlap).
b) In the list of related groups, find the smallest group.
c) Manually add the items you need to the smallest group. This can be done on a per item basis or with the “add group” function.
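Steps a) to c) can be sketched in code, again assuming a hypothetical model where each group is a set of document IDs. The function names are illustrative, and the choice to break size ties by the lowest group ID is an assumption made only to keep the example deterministic. In the application, deciding which items to add remains a manual judgement; the sketch only lists the candidates.

```python
from typing import Dict, List, Set

# Hypothetical model: group ID -> set of document IDs.
Groups = Dict[int, Set[int]]


def overlapping_groups(groups: Groups, selected_group: int) -> List[int]:
    """Step a): the selected group plus every group sharing a document with it."""
    selected_docs = groups[selected_group]
    return sorted(gid for gid, docs in groups.items() if docs & selected_docs)


def smallest_group(groups: Groups, group_ids: List[int]) -> int:
    """Step b): the group with the fewest members (ties broken by lowest ID)."""
    return min(group_ids, key=lambda gid: (len(groups[gid]), gid))


def candidates_to_add(groups: Groups, selected_group: int) -> Set[int]:
    """Step c): items in the overlapping groups that the smallest group lacks."""
    related = overlapping_groups(groups, selected_group)
    target = smallest_group(groups, related)
    all_docs = set().union(*(groups[gid] for gid in related))
    return all_docs - groups[target]


# Illustrative data: groups 1 and 2 overlap on document 12; group 2 is the smallest.
example_groups: Groups = {1: {10, 11, 12}, 2: {12, 13}, 3: {20, 21}}
```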
Still to come:
- The ability to manually create new groups.
- A “Delete this group” function.
Automatically generate codes:
- It is now possible to filter the clustering input by code: only items with the chosen code will be clustered.
- Fixed a bug that meant that deleted items (deleted as part of a source deletion) could still be picked up in the clustering.
- When editing or creating codes and code sets, hitting “Enter” twice will now save the changes.
- Added to the search dialog box: function to search for items with and without documents uploaded.
- Fixed a bug in our behind-the-scenes author handling. In some very unusual situations, it was possible to confuse our author-name recognition system. This meant that manually editing the “author” fields could result in losing all their content on save.
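For the two clustering changes listed under “Automatically generate codes” (filtering the input by code and excluding deleted items), a minimal sketch of the selection logic might look like this. The `Item` class and the `clustering_input` function are hypothetical and only illustrate the described behaviour.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Item:
    # Hypothetical item record.
    item_id: int
    codes: FrozenSet[str] = frozenset()  # codes assigned to the item
    deleted: bool = False                # True if removed, e.g. via a source deletion


def clustering_input(items: Iterable[Item], code: Optional[str] = None) -> List[Item]:
    """Select the items to cluster.

    Deleted items are always excluded; if a code is given, only items
    carrying that code are kept.
    """
    return [
        item
        for item in items
        if not item.deleted and (code is None or code in item.codes)
    ]


# Illustrative data.
example_items = [
    Item(1, frozenset({"include"})),
    Item(2, frozenset({"exclude"})),
    Item(3, frozenset({"include"}), deleted=True),
]
```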