How do I define a regular expression that removes text-masked spam links ("spam1 dot com") from a Java string?

I have a list of websites that are known spam links:

List<String> bannedSites = Arrays.asList("spam1.com", "spam2.com", "spam3.com");

Is there a regular expression that removes links matching these banned sites from the following text?

Dear Arezzo,
Please check out my website at spam1.com or http://www.spam1.com
or http://spam1.com or spam1 dot com to win millions of dollars in prizes.
Thank you.
Big Spammer

Note that the links can appear in several URL formats, which aioobe's solution identifies well:

    String input = "Dear Arezzo,\n"
            + "Please check out my website at spam1.com or http://www.spam1.com "
            + "or http://spam1.com or spam1 dot com to win millions of dollars in prizes.\n"
            + "Thank you.";

    List<String> bannedSites = Arrays.asList("spam1.com", "spam2.com", "spam3.com");

    // Build one alternation covering every banned site.
    StringBuilder re = new StringBuilder();
    for (String bannedSite : bannedSites) {
        if (re.length() > 0)
            re.append("|");
        // Either a full http:// URL (optional "www.") or the bare domain.
        re.append(String.format("http://(www\\.)?%s\\S*|%1$s", Pattern.quote(bannedSite)));
    }

    System.out.println(input.replaceAll(re.toString(), "LINK REMOVED"));

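However, while the code above works for the URL formats spam1.com, http://www.spam1.com, and http://spam1.com, it misses several text formats:

spam1 dot com
spam1[.com]
spam1 .com
spam1 . com

How can I modify the regular expression to target these text formats as well?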

The idea is to produce a result like this:

Dear Arezzo,
Please check out my website at [LINK REMOVED] or [LINK REMOVED]
or [LINK REMOVED] or [LINK REMOVED] to win millions of dollars in prizes.
Thank you.
Big Spammer

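As I said in the comments below, I may not need to remove the entire string "spam1 dot com". If I can eliminate just the "spam1" part, so that it becomes "[LINK REMOVED] dot com", that would do the job.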

Answer

Here's a start.

import java.util.*;
import java.util.regex.Pattern;

class Test {
    public static void main(String[] args) {

        String input = "Dear Arezzo,\n"
            + "Please check out my website at spam1.com "
            + "or http://www.spam1.com or http://spam1.com or " 
            + "spam1 dot com to win millions of dollars in prizes.\n"
            + "Thank you.";

        List<String> bannedSites = Arrays.asList("spam1","spam2","spam3");

        StringBuilder re = new StringBuilder();
        for (String bannedSite : bannedSites) {
            if (re.length() > 0)
                re.append("|");
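            String quotedSite = Pattern.quote(bannedSite);
            // URL forms: http:// or https://, optional "www.", then the rest of the token.
            re.append("https?://(www\\.)?" + quotedSite + "\\S*");
            // Text forms: "spam1.com", "spam1 dot com", "spam1 . com" (the separator is optional).
            re.append("|" + quotedSite + "\\s*(dot|\\.)?\\s*(com|net|org)");
            //re.append("|" ... your variation here);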
        }

        System.out.println(input.replaceAll(re.toString(),"LINK REMOVED"));
    }
}

Output:

Dear Arezzo,
Please check out my website at LINK REMOVED or LINK REMOVED or LINK REMOVED or LINK REMOVED to win millions of dollars in prizes.
Thank you.


Extend the regular expression as needed.
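One possible extension, offered as a sketch rather than a definitive filter: the pattern above does not match the bracketed form spam1[.com], because it has no place for "[" or "]". The version below allows an optional "[" before the separator and an optional "]" after the TLD, and adds word boundaries to reduce false positives. The class name SpamLinkFilter, the helpers buildPattern and clean, and the TLD list (com|net|org) are assumptions made for illustration; they are not part of the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SpamLinkFilter {

    // Hypothetical helper: builds one pattern for URL forms and text-masked forms.
    static Pattern buildPattern(List<String> bannedNames) {
        // Alternation of the quoted site names, e.g. \Qspam1\E|\Qspam2\E
        StringBuilder names = new StringBuilder();
        for (String name : bannedNames) {
            if (names.length() > 0)
                names.append("|");
            names.append(Pattern.quote(name));
        }
        String site = "(?:" + names + ")";

        // URL forms: http:// or https://, optional "www.", then the rest of the token.
        String url = "https?://(?:www\\.)?" + site + "\\S*";

        // Separator between name and TLD: optional spaces, optional "[",
        // then "." or the word "dot", then optional spaces.
        String sep = "\\s*\\[?\\s*(?:\\.|\\bdot\\b)\\s*";

        // Assumed TLD list, followed by an optional closing "]".
        String tld = "(?:com|net|org)\\b\\]?";

        String masked = "\\b" + site + sep + tld;
        return Pattern.compile(url + "|" + masked, Pattern.CASE_INSENSITIVE);
    }

    // Hypothetical helper: replaces every match with a fixed marker.
    static String clean(String input, List<String> bannedNames) {
        return buildPattern(bannedNames).matcher(input).replaceAll("[LINK REMOVED]");
    }

    public static void main(String[] args) {
        List<String> bannedNames = Arrays.asList("spam1", "spam2", "spam3");
        String input = "Visit spam1 dot com, spam1[.com], spam1 .com or spam1 . com today.";
        // Prints: Visit [LINK REMOVED], [LINK REMOVED], [LINK REMOVED] or [LINK REMOVED] today.
        System.out.println(clean(input, bannedNames));
    }
}
```

Unlike the pattern above, this sketch requires a separator ("." or "dot"), so a bare "spam1 com" is left alone. Whether that is the right trade-off depends on how aggressive the filter should be.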
