Hi again,
As promised, here is a little more information about the inner workings of duplicate checking. I hope it makes clear why groups or references are sometimes greyed out unexpectedly.
When “get new duplicates” is clicked, the grouping algorithm evaluates (and quantifies) the differences between all the references being checked. It then groups them together, choosing the groups in the way that minimises the differences within each group. In this phase the “Master Item” is also chosen: it is the group member that has the minimum overall difference from all the other group members. In this way, the automatically selected master is the one with the highest chance of being a legitimate representative of the group.
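To make the master selection concrete, here is a minimal Python sketch. It is an illustration only, not the actual implementation: the function name `choose_master`, the item ids and the hand-written difference table are all assumptions, and the real difference measure is not described in this post.

```python
from typing import Callable, Sequence


def choose_master(members: Sequence[str], diff: Callable[[str, str], float]) -> str:
    """Return the member with the smallest total difference to all other members."""
    return min(
        members,
        key=lambda m: sum(diff(m, other) for other in members if other != m),
    )


# Hypothetical pairwise differences between three references (lower = more similar).
SCORES = {("A", "B"): 0.1, ("A", "C"): 0.4, ("B", "C"): 0.2}


def table_diff(x: str, y: str) -> float:
    """Look up the difference for an unordered pair of ids."""
    return SCORES[tuple(sorted((x, y)))]


# Totals: A = 0.5, B = 0.3, C = 0.6, so B is picked as master.
master = choose_master(["A", "B", "C"], table_diff)
```

Note that adding a fourth reference to this group can change the totals and therefore the chosen master, which is the root of the problem described next.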
This approach works very well initially, but it starts to show its limits when “get new duplicates” is clicked more than once, normally after importing new items into a review. At this stage, it is necessary to match the fresh “get duplicates” results with what is already present in the database (so as not to lose the marking work that has already been done). The only legitimate way to do this is to compare master items: whenever they match, the system assumes that the new and old groups are the same, and adds any additional new group members to the existing group. If a group (or better, its master item) in the new results does not match any existing group, a new group is created and all of its members are inserted in the database. This works perfectly for groups that are genuinely new, but not for a group that already exists and has simply acquired a new master.
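The matching step can be sketched roughly as follows. This is a simplified model under my own assumptions (groups held as a dictionary from master id to a set of member ids, and a made-up function name `reconcile`), not the real database logic.

```python
def reconcile(
    existing: dict[str, set[str]], new_groups: dict[str, set[str]]
) -> dict[str, set[str]]:
    """Match new groups to existing ones by master id (master -> other members)."""
    merged = {master: set(members) for master, members in existing.items()}
    for master, members in new_groups.items():
        if master in merged:
            # Same master: treat as the same group and add any new members.
            merged[master] |= members
        else:
            # Unknown master: stored as a brand-new group.
            merged[master] = set(members)
    return merged


# Case 1: the master is unchanged, so the new member r5 joins the old group.
same_master = reconcile({"r1": {"r2"}}, {"r1": {"r2", "r5"}})

# Case 2: after an import the new run picks r4 as master for the same references.
# The old group (master r1) and the new one (master r4) now coexist.
new_master = reconcile({"r1": {"r2", "r3"}}, {"r4": {"r1", "r2", "r3"}})
```

In the second case the result contains two overlapping groups for what is conceptually one set of duplicates.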
The master is chosen automatically as the best representative of a group, and similarity scores are provided only between the master and the other group members. As a result, when new references have been imported and “get new duplicates” is triggered, the resulting groups might have a new master, making the corresponding old group outdated (in conceptual terms). The problem is that the old groups might have already been evaluated (automatically or manually), so disabling (and discarding) the old group doesn’t look like a wise choice. My solution was to provide two mechanisms to manually recover from these awkward situations: (a) disabling items already “marked as duplicates” and (b) providing a way to merge two groups.
(a) has two separate parts: first, group members that are already marked as duplicates are greyed out individually. Second, if a group has a master item that is already marked as duplicate somewhere else, then the whole group is disabled (greyed out). This second safeguard is there to prevent the creation of meaningless duplicate chains, where item A is marked as duplicate of B and B is marked as duplicate of C.
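As a rough illustration of the two rules in (a), the following Python sketch models the duplicate markings as a dictionary. The function name `greyed_out`, the `duplicate_of` mapping, and the reading that a member is greyed out when it is marked as a duplicate of a different master are all assumptions made for the example.

```python
def greyed_out(
    master: str, members: set[str], duplicate_of: dict[str, str]
) -> tuple[bool, set[str]]:
    """Return (whole group disabled?, individually greyed-out members).

    duplicate_of maps an item id to the master it was marked a duplicate of.
    """
    # Rule 2: the master itself is already a duplicate somewhere else.
    group_disabled = master in duplicate_of
    # Rule 1 (assumed reading): members already marked as duplicates of another master.
    greyed = {m for m in members if m in duplicate_of and duplicate_of[m] != master}
    return group_disabled, greyed


# B is already marked as a duplicate of C, and D as a duplicate of E.
marks = {"B": "C", "D": "E"}

# A group led by B is disabled as a whole, which blocks the chain A -> B -> C.
chain_case = greyed_out("B", {"A"}, marks)

# A group led by X stays enabled, but its member D is greyed out.
member_case = greyed_out("X", {"D", "Y"}, marks)
```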
(b) is quite simple: as explained in the previous post, it is just a shortcut to manually add items from group X into group Y. It is entirely reversible, and it will ignore items that are present in both groups.
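In set terms, (b) could look something like the sketch below. The names `merge_into` and `undo_merge` are invented for the illustration, and remembering the added items is just one possible way to make the operation reversible.

```python
def merge_into(source: set[str], target: set[str]) -> set[str]:
    """Add the items of source to target in place; return the items actually added."""
    added = source - target  # items already in both groups are ignored
    target |= added
    return added


def undo_merge(target: set[str], added: set[str]) -> None:
    """Reverse a merge by removing exactly the items it added."""
    target -= added


group_x = {"a", "b", "c"}
group_y = {"b", "d"}

added = merge_into(group_x, group_y)  # added is {"a", "c"}; "b" was already in Y
after_merge = set(group_y)            # {"a", "b", "c", "d"}

undo_merge(group_y, added)            # group_y is back to {"b", "d"}
```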
Finally, I have realised that a third component is missing! What I’m describing here usually happens in large reviews, which have more than 10,000 references and could easily have thousands of duplicate groups. In such situations, dealing with multiple versions of the same group can be hard, because manually finding the related groups is difficult and time-consuming. For this reason, the additional function that I’m planning to write is something like the following (details might change in due course):
(c) Find related groups. This would be a mechanism to list the groups that share some items (including manually added items), so that applying (b) and/or untangling overlapping groups in other ways (see previous post) will be a simple matter of point and click.
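Since (c) is only planned, the following is just one way it might work, sketched in Python under assumed names (`find_related_groups`, group ids such as `g1`) and an assumed representation where each group is the set of all its item ids, master included.

```python
from collections import defaultdict


def find_related_groups(groups: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each group id to the ids of the other groups sharing at least one item."""
    # Index: item id -> ids of every group that contains it.
    item_to_groups: dict[str, set[str]] = defaultdict(set)
    for group_id, items in groups.items():
        for item in items:
            item_to_groups[item].add(group_id)

    related: dict[str, set[str]] = {group_id: set() for group_id in groups}
    for group_ids in item_to_groups.values():
        for group_id in group_ids:
            related[group_id] |= group_ids - {group_id}
    return related


groups = {
    "g1": {"r1", "r2", "r3"},
    "g2": {"r3", "r4"},  # shares r3 with g1
    "g3": {"r5"},        # shares nothing
}
related = find_related_groups(groups)
# related == {"g1": {"g2"}, "g2": {"g1"}, "g3": set()}
```

Building the item-to-groups index first avoids comparing every pair of groups, which matters when a review has thousands of groups.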
Unfortunately, I’m still buried under another big project: I’m currently testing our online shop. I believe this is the most eagerly awaited new feature and hence has the highest priority (apart from user support, of course!).
Thanks for reading,
Sergio