I have a table with the following columns:
URL_ID
URL_ADDR
URL_Time
I want to remove duplicates on the URL_ADDR column using a MySQL query.
Is it possible to do this without any programming?
Dan*_*llo 31
Consider the following test case:
CREATE TABLE mytb (url_id int, url_addr varchar(100));
INSERT INTO mytb VALUES (1, 'www.google.com');
INSERT INTO mytb VALUES (2, 'www.microsoft.com');
INSERT INTO mytb VALUES (3, 'www.apple.com');
INSERT INTO mytb VALUES (4, 'www.google.com');
INSERT INTO mytb VALUES (5, 'www.cnn.com');
INSERT INTO mytb VALUES (6, 'www.apple.com');
Our test table now contains:
SELECT * FROM mytb;
+--------+-------------------+
| url_id | url_addr          |
+--------+-------------------+
|      1 | www.google.com    |
|      2 | www.microsoft.com |
|      3 | www.apple.com     |
|      4 | www.google.com    |
|      5 | www.cnn.com       |
|      6 | www.apple.com     |
+--------+-------------------+
6 rows in set (0.00 sec)
We can then use the multiple-table DELETE syntax as follows:
DELETE t2
FROM mytb t1
JOIN mytb t2 ON (t2.url_addr = t1.url_addr AND t2.url_id > t1.url_id);
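Before running the DELETE, it can be worth previewing which rows the join matches. The query below is a minimal sketch that reuses the same join condition against the mytb test table; it only reads data. DISTINCT is there because a row with several lower-numbered duplicates would otherwise be listed once per match.

```sql
-- Preview only: lists the rows the multiple-table DELETE would remove.
SELECT DISTINCT t2.url_id, t2.url_addr
FROM mytb t1
JOIN mytb t2 ON (t2.url_addr = t1.url_addr AND t2.url_id > t1.url_id)
ORDER BY t2.url_id;
```

With the test data above, this lists rows 4 (www.google.com) and 6 (www.apple.com).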
...which deletes the duplicate entries and keeps only the row with the lowest url_id for each URL:
SELECT * FROM mytb;
+--------+-------------------+
| url_id | url_addr          |
+--------+-------------------+
|      1 | www.google.com    |
|      2 | www.microsoft.com |
|      3 | www.apple.com     |
|      5 | www.cnn.com       |
+--------+-------------------+
4 rows in set (0.00 sec)
UPDATE, further to the new comments above:
If the duplicate URLs are not formatted identically, you may want to apply the REPLACE() function to strip the www. or http:// parts. For example:
DELETE t2
FROM mytb t1
JOIN mytb t2 ON (REPLACE(t2.url_addr, 'www.', '') =
REPLACE(t1.url_addr, 'www.', '') AND
t2.url_id > t1.url_id);
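To strip both prefixes at once, the REPLACE() calls can be nested. The following is a sketch against the mytb test table; the column alias normalized_addr is only an illustrative name. Note that REPLACE() is case-sensitive and replaces every occurrence of the substring, not just a leading one, so it is worth inspecting the normalized values before deleting anything.

```sql
-- Inspect the normalized form first (read-only).
SELECT url_id,
       url_addr,
       REPLACE(REPLACE(url_addr, 'http://', ''), 'www.', '') AS normalized_addr
FROM mytb;

-- Then delete the later duplicates, comparing on the normalized form.
DELETE t2
FROM mytb t1
JOIN mytb t2 ON (REPLACE(REPLACE(t2.url_addr, 'http://', ''), 'www.', '') =
                 REPLACE(REPLACE(t1.url_addr, 'http://', ''), 'www.', '') AND
                 t2.url_id > t1.url_id);
```

Because the join compares computed values, an ordinary index on url_addr will probably not help here, so this may be slow on a large table.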
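You may want to try the method mentioned at http://labs.creativecommons.org/2010/01/12/removing-duplicate-rows-in-mysql/. Note that the IGNORE clause of ALTER TABLE was removed in MySQL 5.7.4, so this only works on older servers.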
ALTER IGNORE TABLE your_table ADD UNIQUE INDEX `tmp_index` (URL_ADDR);
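Where ALTER IGNORE TABLE is not available, a similar effect can be sketched by copying the rows into a new table that already has the unique index, using INSERT IGNORE so that rows which would violate the index are skipped. The table names your_table_dedup and your_table_old are hypothetical, and the ORDER BY is intended to make the lowest URL_ID the first row inserted for each address; verify the result before swapping the tables.

```sql
-- Hypothetical table names; adjust to your schema.
CREATE TABLE your_table_dedup LIKE your_table;
ALTER TABLE your_table_dedup ADD UNIQUE INDEX `tmp_index` (URL_ADDR);

-- Rows whose URL_ADDR is already present are silently skipped.
INSERT IGNORE INTO your_table_dedup
SELECT * FROM your_table ORDER BY URL_ID;

-- After checking the copy, swap the tables:
-- RENAME TABLE your_table TO your_table_old, your_table_dedup TO your_table;
```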
This keeps the row with the highest URL_ID for each URL_ADDR:
DELETE FROM your_table
WHERE URL_ID NOT IN
    (SELECT ID FROM
        (SELECT MAX(URL_ID) AS ID
         FROM your_table
         WHERE URL_ID IS NOT NULL
         GROUP BY URL_ADDR) X);
/* It sounds like you would need to GROUP BY a calculated form,
   e.g. using REPLACE to strip out "www."; see Daniel's answer. */
(The derived table X is there to avoid the error "You can't specify target table 'tablename' for update in FROM clause".)
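Whichever DELETE you use, a quick check afterwards can confirm that no duplicates remain. This sketch assumes the same placeholder table name your_table and the URL_ADDR column from the question; the alias copies is illustrative.

```sql
-- Lists any URL_ADDR that still appears more than once.
SELECT URL_ADDR, COUNT(*) AS copies
FROM your_table
GROUP BY URL_ADDR
HAVING COUNT(*) > 1;
```

An empty result means every remaining URL_ADDR is unique.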
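You can group by URL_ADDR, which effectively gives you only the distinct values of the URL_ADDR field. This only selects rows; it does not delete anything from the table.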
select
URL_ID,
URL_ADDR,
URL_Time
from
some_table
group by
URL_ADDR;
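On servers where the ONLY_FULL_GROUP_BY SQL mode is enabled (the default since MySQL 5.7.5), selecting non-aggregated columns that are not in the GROUP BY clause is rejected. One possible rewrite, assuming URL_ID is unique per row, picks the lowest URL_ID for each address in a derived table and joins back to fetch the rest of that row. The aliases t and first_rows are illustrative.

```sql
-- One full row per URL_ADDR: the one with the lowest URL_ID.
SELECT t.URL_ID, t.URL_ADDR, t.URL_Time
FROM some_table t
JOIN (SELECT MIN(URL_ID) AS URL_ID
      FROM some_table
      GROUP BY URL_ADDR) first_rows
  ON first_rows.URL_ID = t.URL_ID;
```

Unlike the plain GROUP BY form, this makes it explicit which row is returned for each address.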
Enjoy!