Feature Request - Duplicate Management

Chris_Scott · August 18, 2021, 2:44pm

Maybe I missed it, but there does not seem to be any duplicate management functionality, at least not without formulas.

It would be great to be able to highlight a column of data and run a “remove duplicates” or “highlight duplicates” function via the context menu.

Is this possible?

paul-grist · August 18, 2021, 5:51pm

Hi @Chris_Scott, you’re right, managing duplicates currently depends on formulas, see for example: Ensure unique values or detect duplicates.

We do have plans to make it easier to avoid duplicates when importing new data into a document that already has data in it. Is that your situation?

Chris_Scott · August 18, 2021, 7:38pm

@paul-grist That is my ideal use case. Removal on import. An easier way to manage them once imported would also be great, but on import would be excellent!

alexmojaki · August 18, 2021, 10:00pm

If you create a summary table grouped by the column you want to deduplicate, then sort the summary table by descending count, all the duplicated values will be at the top.

You can then use linking to select the original table by the summary table, meaning that clicking a row in the summary table will show only the corresponding rows in the original table.
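For anyone who wants to see the idea outside Grist, here is a rough sketch in plain Python of what the summary table and the linked view compute. It is not Grist's API: the function names, the `email` column, and the sample rows are all made up for illustration.

```python
from collections import Counter

def summarize_duplicates(rows, key):
    """Count rows per value of `key`, most frequent first (like a summary
    table sorted by descending count)."""
    counts = Counter(row[key] for row in rows)
    return counts.most_common()

def rows_for_value(rows, key, value):
    """Return the original rows behind one summary row (like a linked view)."""
    return [row for row in rows if row[key] == value]

# Hypothetical sample data.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},
]

summary = summarize_duplicates(rows, "email")
duplicated = [value for value, count in summary if count > 1]
```

Any value with a count above 1 is duplicated, and `rows_for_value` returns the rows you would need to review for it.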

Automatically removing duplicates requires defining which rows get removed. What do you want to happen when you import data that already exists in the table? Keep the new row from the import (can you assume there’s only one?) or the existing one in Grist?

Chris_Scott · August 19, 2021, 3:57am

It seems to me that keeping the new one would make the most sense, at least in our case… But I can see that this could vary. Could there be a way to offer both options and let the user choose the one that makes the most sense at the time?
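To make the two options concrete, here is a minimal sketch in plain Python of an import that lets the user pick which row wins. This is only an illustration of the choice being discussed, not how Grist implements imports; `merge_import`, the `keep` parameter, and the sample rows are hypothetical.

```python
def merge_import(existing, incoming, key, keep="new"):
    """Merge imported rows into existing rows, one row per value of `key`.

    keep="new"      -> an imported row replaces the existing row with that key
    keep="existing" -> an imported row is skipped if its key is already present
    """
    if keep not in ("new", "existing"):
        raise ValueError("keep must be 'new' or 'existing'")

    merged = {}
    # If the existing data already contains duplicates, the first one wins.
    for row in existing:
        merged.setdefault(row[key], row)
    for row in incoming:
        # With keep="new", a later imported row overwrites an earlier one,
        # so the last duplicate in the import wins.
        if keep == "new" or row[key] not in merged:
            merged[row[key]] = row
    return list(merged.values())

# Hypothetical sample data.
existing = [{"email": "a@example.com", "name": "Old A"},
            {"email": "b@example.com", "name": "B"}]
incoming = [{"email": "a@example.com", "name": "New A"},
            {"email": "c@example.com", "name": "C"}]

kept_new = merge_import(existing, incoming, "email", keep="new")
kept_existing = merge_import(existing, incoming, "email", keep="existing")
```

The sketch also shows why the earlier question matters: if the import itself contains the same key more than once, some rule still has to decide which of those rows survives.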

MaximilianKohler · October 26, 2024, 4:26am

My use case is removing duplicates from existing spreadsheets with more than a million rows.

The linked thread and tutorial are useful, but it doesn’t look like that approach can find dupes across multiple pages/sheets. If you implement this feature, please make it able to do that.

For example, I may have 1-2 sheets with a million rows, and another with 5,000, and I’d want to remove the 5,000 from the other two sheets.
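In plain Python terms, what I am after looks roughly like the sketch below: build a set of keys from the small sheet, then filter each large sheet against it. This is only an illustration under assumptions, not an existing Grist feature; the function name, the `email` key column, and the sample rows are made up.

```python
def iter_without(rows, rows_to_remove, key):
    """Yield the rows whose `key` value does not appear in `rows_to_remove`.

    The small sheet is turned into a set once, so each row of the large
    sheet costs a single hash lookup instead of a scan of the small sheet.
    """
    blocked = {row[key] for row in rows_to_remove}
    for row in rows:
        if row[key] not in blocked:
            yield row

# Hypothetical sample data standing in for the large and small sheets.
big_sheet_1 = [{"email": "a@example.com"}, {"email": "b@example.com"}]
big_sheet_2 = [{"email": "b@example.com"}, {"email": "c@example.com"}]
small_sheet = [{"email": "b@example.com"}]

cleaned_1 = list(iter_without(big_sheet_1, small_sheet, "email"))
cleaned_2 = list(iter_without(big_sheet_2, small_sheet, "email"))
```

Because the function is a generator, the large sheet can be streamed row by row; only the keys of the small sheet need to be held in memory.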