#Natural language processing# Module for automatic summarization of text documents and HTML pages.
A reworked version of the https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative).
#Web crawler# Automatically extract the main text content (and more) from an HTML document.
PHP library that determines which CSS is used by HTML snippets.
Xtract-html is a tool for extracting the HTML display code from a website; you can also use it on your own website.
Xtract-htmlV2 is a tool for getting the HTML code from any website you want. It is the successor to the previous version.
Media Graper is an open-source tool for Linux that extracts all the images, links, and videos from a web page.
A simple extractor based on BeautifulSoup. You can use it to iterate through all the HTML files in the website root directory and get the text, placeholders, and other text.