I have a list of websites that represent spam links:
List<String> bannedSites = Arrays.asList("spam1.com", "spam2.com", "spam3.com");
and a message such as the following:
Dear Arezzo,
Please check out my website at spam1.com or http://www.spam1.com or http://spam1.com or spam1 dot com to win millions of dollars in prizes.
Thank you.
Big Spammer
Note that the links can appear in several URL formats, which aioobe's solution identifies well:
String input = "Dear Arezzo,\n"
        + "Please check out my website at spam1.com or http://www.spam1.com "
        + "or http://spam1.com or spam1 dot com to win millions of dollars in prizes. "
        + "Thank you.";
List<String> bannedSites = Arrays.asList("spam1.com", "spam3.com");

// Build one alternation that covers every banned site.
StringBuilder re = new StringBuilder();
for (String bannedSite : bannedSites) {
    if (re.length() > 0)
        re.append("|");
    // Match either an http:// URL (optionally with www.) or the bare domain.
    re.append(String.format("http://(www\\.)?%s\\S*|%1$s", Pattern.quote(bannedSite)));
}

System.out.println(input.replaceAll(re.toString(), "LINK REMOVED"));
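However, while the code above works for the URL formats spam1.com, http://www.spam1.com and http://spam1.com, it misses several textual formats:
spam1 dot com
spam1[.com]
spam1 .com
spam1 . com
How can I modify the regular expression to target these textual formats as well?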
The idea is to produce a result like this:
Dear Arezzo,
Please check out my website at [LINK REMOVED] or [LINK REMOVED] or [LINK REMOVED] or [LINK REMOVED] to win millions of dollars in prizes.
Thank you.
Big Spammer
As I said in a comment below, I may not need to ban the whole string "spam1 dot com". If I can eliminate just the spam1 part, so that it becomes "[LINK REMOVED] dot com", that would do the job.
Solution
Here's a start.
import java.util.*;
import java.util.regex.Pattern;

class Test {
    public static void main(String[] args) {
        String input = "Dear Arezzo,\n"
                + "Please check out my website at spam1.com "
                + "or http://www.spam1.com or http://spam1.com or "
                + "spam1 dot com to win millions of dollars in prizes.\n"
                + "Thank you.";

        // Only the site names are listed; the TLD is matched by the regex.
        List<String> bannedSites = Arrays.asList("spam1", "spam2", "spam3");

        StringBuilder re = new StringBuilder();
        for (String bannedSite : bannedSites) {
            if (re.length() > 0)
                re.append("|");

            String quotedSite = Pattern.quote(bannedSite);

            // Full URLs, with http or https and an optional www. prefix.
            re.append("https?://(www\\.)?" + quotedSite + "\\S*");
            // Bare or spelled-out forms such as "spam1.com" or "spam1 dot com".
            re.append("|" + quotedSite + "\\s*(dot|\\.)?\\s*(com|net|org)");
            //re.append("|" ... your variation here);
        }

        System.out.println(input.replaceAll(re.toString(), "LINK REMOVED"));
    }
}
Output:
Dear Arezzo,
Please check out my website at LINK REMOVED or LINK REMOVED or LINK REMOVED or LINK REMOVED to win millions of dollars in prizes.
Thank you.

Extend the regular expression as needed.
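As one possible extension, here is a minimal sketch that also covers the bracketed and spaced-out forms from the question (spam1[.com], spam1 .com, spam1 . com). The class name SpamLinkFilter, the helper names buildPattern and removeLinks, the SEPARATOR and TLD constants, and the short TLD list are all assumptions made for illustration, not part of the original answer. Unlike the code above, this version requires a separator between the name and the TLD, and it uses the bracketed [LINK REMOVED] replacement from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SpamLinkFilter {

    // Assumed list of top-level domains to look for; extend as needed.
    private static final String TLD = "(?:com|net|org)";

    // Separator between the site name and the TLD: optional whitespace,
    // an optional "[", then either a literal dot or the word "dot".
    private static final String SEPARATOR = "\\s*\\[?\\s*(?:\\.|dot)\\s*";

    // Builds a single alternation covering every banned site name.
    static Pattern buildPattern(List<String> bannedNames) {
        List<String> alternatives = new ArrayList<>();
        for (String name : bannedNames) {
            String quoted = Pattern.quote(name);
            // Full URLs: http or https, optional www., anything up to whitespace.
            alternatives.add("https?://(?:www\\.)?" + quoted + "\\S*");
            // Textual forms: name, separator, TLD, and an optional closing "]".
            alternatives.add("\\b" + quoted + SEPARATOR + TLD + "\\b\\]?");
        }
        return Pattern.compile(String.join("|", alternatives));
    }

    // Replaces every match of a banned site with a placeholder.
    static String removeLinks(String input, List<String> bannedNames) {
        return buildPattern(bannedNames).matcher(input).replaceAll("[LINK REMOVED]");
    }

    public static void main(String[] args) {
        List<String> bannedNames = Arrays.asList("spam1", "spam2", "spam3");
        String input = "Visit spam1.com or http://www.spam1.com or spam1 dot com "
                + "or spam1[.com] or spam1 .com or spam1 . com today";
        System.out.println(removeLinks(input, bannedNames));
    }
}
```

With the sample input in main, each of the six variants is replaced by [LINK REMOVED], while a bare mention of spam1 without a TLD is left alone. Matching is case-sensitive here, and the URL alternative will also swallow punctuation that directly follows a link, so both points may need adjusting for real messages.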