Extracting Web Page Content with C# Regular Expressions

2014-07-30 Source: 易贤网

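An important step in a search engine is extracting the meaningful content from a web page. Put simply, this means removing the HTML markup from the HTML text and keeping only what we see when we open the document in a browser such as IE (images are not considered here).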

We divide the markup in the HTML text into comments, script, style, and all other tags, and remove each kind in turn:

1. Remove comments. The regex is:

output = Regex.Replace(input, @"<!--.*?-->", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);

2. Remove scripts. The regexes are:

output = Regex.Replace(input, @"<script[^>]*?>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);

output2 = Regex.Replace(output, @"<noscript[^>]*?>.*?</noscript>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);

3. Remove styles. The regex is:

output = Regex.Replace(input, @"<style[^>]*?>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);

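4. Remove other HTML tags

// Turn <br> into a newline, then strip every remaining tag.
result = result.Replace("<br>", "\r\n");
result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty, RegexOptions.IgnoreCase);

// Decode common entities only after the tags are gone, so that escaped
// text such as "&lt;b&gt;" is not mistaken for a tag. "&amp;" goes last.
result = result.Replace("&nbsp;", " ");
result = result.Replace("&quot;", "\"");
result = result.Replace("&lt;", "<");
result = result.Replace("&gt;", ">");
result = result.Replace("&amp;", "&");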
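The order of operations in step 4 matters: if entities such as &lt; are decoded before the tags are stripped, escaped text turns into something that looks like a tag and is removed along with the real markup. The following is a minimal sketch of the "strip first, decode afterwards" order. It assumes System.Net.WebUtility.HtmlDecode is available, which the original code does not use; the class and method names are hypothetical.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class EntityDemo
{
    // Strip tags first, then decode entities, so that escaped text such as
    // "&lt;b&gt;" survives as the literal characters "<b>".
    public static string StripThenDecode(string html)
    {
        // Accept <br>, <br/> and <br /> in any letter case.
        string result = Regex.Replace(html, @"<br\s*/?>", "\r\n", RegexOptions.IgnoreCase);

        // Remove every remaining tag.
        result = Regex.Replace(result, @"<[^>]*>", string.Empty);

        // Decode named and numeric entities in one pass.
        result = WebUtility.HtmlDecode(result);

        // HtmlDecode maps &nbsp; to U+00A0; normalise it to a plain space.
        return result.Replace('\u00A0', ' ');
    }
}
```

For example, StripThenDecode("<p>a &lt;b&gt;</p>") keeps the escaped "<b>" in the output as text, whereas decoding first would have caused it to be stripped.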

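As you can see in the code above, I used the RegexOptions.Singleline option. This option is important: it makes "." (the dot) match newline characters as well. Script, style, and comment blocks usually span several lines, so without this option the regexes listed above will in most cases fail to remove them from a web page.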
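The effect of RegexOptions.Singleline can be seen with a small sketch. The sample input and the class and member names below are hypothetical; the pattern is the script regex from step 2.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Hypothetical sample: a script block that spans several lines.
    public const string Sample =
        "<p>before</p><script>\nvar x = 1;\n</script><p>after</p>";

    public const string Pattern = @"<script[^>]*?>.*?</script>";

    // Without Singleline, "." stops at '\n', so the multi-line block
    // is never matched and the input comes back unchanged.
    public static string StripWithoutSingleline(string html)
    {
        return Regex.Replace(html, Pattern, string.Empty, RegexOptions.IgnoreCase);
    }

    // With Singleline, "." also matches '\n' and the whole block is removed.
    public static string StripWithSingleline(string html)
    {
        return Regex.Replace(html, Pattern, string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

Calling StripWithoutSingleline(SinglelineDemo.Sample) leaves the script block in place, while StripWithSingleline(SinglelineDemo.Sample) returns "<p>before</p><p>after</p>".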

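HTML syntax has grown quite complex over the years, and the list above covers only the most important kinds of markup. I will add more tag-removal regexes during the development of Rost WebSpider.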

Below is a C# class that extracts the meaningful content from an HTML string:

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

class HtmlExtract
{
    #region private attributes

    private string _strHtml;

    #endregion

    #region public methods

    public HtmlExtract(string inStrHtml)
    {
        // Treat null as empty so the regex calls below never throw.
        _strHtml = inStrHtml ?? string.Empty;
    }

    // Runs the four removal steps in order and returns the remaining text.
    public string ExtractText()
    {
        string result = _strHtml;
        result = RemoveComment(result);
        result = RemoveScript(result);
        result = RemoveStyle(result);
        result = RemoveTags(result);
        return result.Trim();
    }

    #endregion

    #region private methods

    private string RemoveComment(string input)
    {
        // Remove HTML comments, including ones that span several lines.
        return Regex.Replace(input, @"<!--.*?-->", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    private string RemoveStyle(string input)
    {
        // Remove all style blocks.
        return Regex.Replace(input, @"<style[^>]*?>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    private string RemoveScript(string input)
    {
        // Remove script blocks and their noscript fallbacks.
        string result = Regex.Replace(input, @"<script[^>]*?>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        result = Regex.Replace(result, @"<noscript[^>]*?>.*?</noscript>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return result;
    }

    private string RemoveTags(string input)
    {
        // Turn <br> into a newline, then strip every remaining tag.
        string result = input.Replace("<br>", "\r\n");
        result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty, RegexOptions.IgnoreCase);

        // Decode common entities after the tags are gone; "&amp;" goes last.
        result = result.Replace("&nbsp;", " ");
        result = result.Replace("&quot;", "\"");
        result = result.Replace("&lt;", "<");
        result = result.Replace("&gt;", ">");
        result = result.Replace("&amp;", "&");
        return result;
    }

    #endregion
}
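When a crawler runs the class above over many pages, the removal of comments, scripts, and styles could be folded into two regexes that are built once and reused. The following is only a sketch under that assumption, not part of the original class: the names BlockRemover and RemoveNonContent are hypothetical, and the backreference \1 is a swapped-in technique that matches the closing tag of whichever element opened the block.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BlockRemover
{
    private const RegexOptions Opts =
        RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled;

    // Built once and reused for every page.
    private static readonly Regex Comment = new Regex(@"<!--.*?-->", Opts);

    // \1 refers back to the element name captured by the first group,
    // so <script> is closed by </script>, <style> by </style>, and so on.
    private static readonly Regex Block =
        new Regex(@"<(script|noscript|style)\b[^>]*>.*?</\1\s*>", Opts);

    // Removes comments and script/noscript/style blocks; other tags are kept.
    public static string RemoveNonContent(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }
        string result = Comment.Replace(html, string.Empty);
        return Block.Replace(result, string.Empty);
    }
}
```

The remaining tags and entities would still need to be handled afterwards, as in RemoveTags.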
