Is it possible to clean up duplicate entries with just SQL?

Aeo*_*cid 3 mysql mysql-5.7

I accidentally introduced a bug in my software that caused it to create duplicate entries. Now I have a lot of duplicate entries, the total rows of that table is around 111 million.

This is my table structure for that table.

CREATE TABLE `graph_player_ranked` (
  `player_id` int(11) NOT NULL,
  `season_id` int(11) NOT NULL,
  `playlist_id` int(11) NOT NULL,
  `timestamp` datetime NOT NULL,
  `matches_played` int(11) DEFAULT NULL,
  `rank_points` int(11) DEFAULT NULL,
  `tier` int(11) DEFAULT NULL,
  `division` int(11) DEFAULT NULL,
  `no_update_at` datetime NOT NULL,
  PRIMARY KEY (`player_id`,`season_id`,`playlist_id`,`timestamp`),
  KEY `player_id` (`player_id`,`playlist_id`,`timestamp`),
  KEY `idx_timestamp` (`timestamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Run Code Online (Sandbox Code Playgroud)

Please don't comment about the possibly bad structure, I am switching to a new structure but I have to fix this issue first. However, should I remove timestamp from the PK?

The idea was to only add a new row when "rank_points" OR "tier" OR "division" changed OR "matches_played" increased. At some time I accidentally made it create a new row when "matches_played" was the same as the previous row or larger.

Example of duplicate data:

数据样本

I want to remove all duplicate entries with just one row. So for the above image that would be:

  1. Remove all of the rows in a red rectangle
  2. Grab "timestamp" from the first row in the rectangle.
  3. Grab "no_update_at" timestamp from the last row in the rectangle.
  4. Place one row back with all previous data and the above two timestamps.

I have no idea how to achieve this, the only solution I can think of is creating an external tool to do this. But that would probably take very long to process all rows.

Edit: Sample data for two players can be found here https://dl.dropboxusercontent.com/u/184110260/sample_data/graph_player_ranked.sql

And*_*y M 5

Your data sample seems to suggest that the two datetime columns are the only columns that can change in each group of duplicates. In other words, duplicates are determined by columns player_id, season_id, playlist_id, matches_played, rank_points, tier, division.

One other assumption that your data sample prompts is that a row with a greater timestamp corresponds to a row with a greater no_update_at as well. (In fact, the two columns always have the same value in your example.)

Finally, for the purpose of this answer, I am also going to assume that when you are saying "first" or "last" (row in the rectangle), you are talking about the earliest or latest datetime value.

With those out of the way, you can generate the desired result set from your table if you just group the rows by the columns listed above and take MIN(timestamp) for timestamp and MAX(no_update_at) for no_update_at – that is, like this:

SELECT
  player_id,
  season_id,
  playlist_id,
  MIN(timestamp) AS timestamp,
  matches_played,
  rank_points,
  tier,
  division,
  MAX(no_update_at) AS no_update_at
FROM
  graph_player_ranked
GROUP BY
  player_id,
  season_id,
  playlist_id,
  matches_played,
  rank_points,
  tier,
  division
;
Run Code Online (Sandbox Code Playgroud)

You could store that in a temporary table:

CREATE TEMPORARY TABLE tmp_graph_player_ranked
AS
SELECT
  ...
;
Run Code Online (Sandbox Code Playgroud)

then truncate the source:

TRUNCATE graph_player_ranked;
Run Code Online (Sandbox Code Playgroud)

and populate it from the temporary table:

INSERT INTO
  graph_player_ranked
    (
      player_id,
      season_id,
      playlist_id,
      timestamp,
      matches_played,
      rank_points,
      tier,
      division,
      no_update_at
    )
SELECT
  player_id,
  season_id,
  playlist_id,
  timestamp,
  matches_played,
  rank_points,
  tier,
  division,
  no_update_at
FROM
  tmp_graph_player_ranked
;
Run Code Online (Sandbox Code Playgroud)

当然,如果该表被其他表引用,您不能简单地删除所有数据——您可能需要在重新填充表之前禁用或删除外键,然后再启用或重新创建它们。(这是一个完全不同的问题,需要单独解决,以防您需要帮助。)