data-extraction · GitHub Topics

getmaxun / maxun

#Web crawler# A visual web-scraping platform that collects data through point-and-click

TypeScript 11.09 k

4 hours ago

vi3k6i5 / flashtext

#Natural language processing# Extract Keywords from sentence or Replace keywords in sentences.

search-in-text keyword-extraction natural-language-processing word2vec data-extraction

Python 5.64 k

9 months ago

D4Vinci / Scrapling

#Web crawler# 🕷️ An undetectable, powerful, flexible, high-performance Python library that makes Web Scraping simple and easy again!

crawler crawling Hacktoberfest Playwright Python scraping selectors stealth-game web-scraper web-scraping web-scraping-python webscraping xpath automation artificial-intelligence ai-scraping data data-extraction

Python 2.89 k

4 hours ago

JonathanLink / PDFLayoutTextStripper

Converts a pdf file into a text file while keeping the layout of the original pdf. Useful to extract the content from a table in a pdf file for instance. This is a subclass of PDFTextStripper class (f...

layout text Java pdf extract data-extraction pdfbox

Java 1.59 k

1 year ago

hi-primus / optimus

#Computer science# 🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

Apache Spark pyspark data-wrangling bigdata data-science data-transformation machine-learning data-profiling data-extraction data-exploration data-analysis data-preparation cudf dask data-cleaning

Python 1.5 k

4 months ago

raznem / parsera

#Web crawler# Lightweight library for scraping web-sites with LLMs

data-extraction large-language-models scraping Python Open Source webscraping artificial-intelligence ai-scraping Playwright

Python 1.07 k

2 days ago

polyrabbit / hacker-news-digest

#Web crawler# 📰 Let ChatGPT Summarize Hacker News for You

hacker-news Python data-extraction hacker-news-reader RSS spider crawler machine-learning news-aggregator ChatGPT ChatGPT API openai openai-api

Python 713

5 days ago

thinh-vu / vnstock

A beginner-friendly yet powerful Python toolkit for financial analysis and automation — built to make modern investing accessible to everyone

stock-market data-extraction quantitative-finance quantitative-analysis quantitative-trading

Python 712

14 hours ago

adrienjoly / npm-pdfreader

🚜 Parse text and tables from PDF files.

data-extraction pdf-converter Parsing JavaScript tabular-data

HTML 672

3 months ago

a-maliarov / amazoncaptcha

Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.

captcha captcha-solver amazon Python pillow training-data data-extraction

Python 467

10 months ago

py-pdf / benchmarks

Benchmarking PDF libraries

benchmark data-extraction mupdf pdf pypdf2 text-extraction

Python 269

1 year ago

jpjacobpadilla / Stealth-Requests

Undetected Web-Scraping & Seamless HTML Parsing in Python!

Python http-client data html-parsing http-requests requests web-crawler web-scraping webscraping xpath data-extraction

Python 234

2 months ago

serpapi / clauneck

A tool for scraping emails, social media accounts, and much more information from websites using Google Search Results.

automation command-line-interface command-line-tool data-extraction data-extractor email email-extract-with-proxy email-extraction email-extractor email-marketing Open Source Ruby rubygem web-crawling webscraping

Ruby 176

1 year ago

molybdenum-99 / infoboxer

Wikipedia information extraction library

wikipedia MediaWiki data-extraction

Ruby 175

1 year ago

sypht-team / sypht-python-client

A python client for the Sypht API

data-extraction information-extraction API Python python3-library invoice extract extract-data-from-pdf pdf-parser

Python 162

9 months ago

johnbumgarner / newspaper3_usage_overview

This repository provides usage examples for the Python module Newspaper3k.

news scraping-websites Python data-extraction beautifulsoup python-requests nlp-parsing

Python 146

1 year ago

dilawar / PlotDigitizer

A Python utility to digitize plots.

data-extraction Python image-processing

Python 139

8 months ago

173TECH / sayn

Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).

analytics etl data-modeling data-engineering data-science Python SQL automation elt data-extraction

Python 122

2 days ago

nfx / go-htmltable

Structured HTML table data extraction from URLs in Go that has almost no external dependencies

Go data-extraction HTML

Go 121

5 days ago

CambioML / any-parser

#Large language models# Accurate, private and configurable document retrieval LLM

data-extraction document large-language-models pdf privacy structured-data unstructured-data

Python 121

17 days ago