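Trino is a distributed SQL query engine for big data (formerly PrestoSQL).
StarRocks is a next-generation, high-speed MPP (Massively Parallel Processing) database for all analytics scenarios. Its vision is to make data analysis simpler and more agile. Users can run fast analytics across many data analysis scenarios with StarRocks without complex preprocessing.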
#Data Warehouse#Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop....
Translation - The fastest way to access and manage PyTorch and TensorFlow datasets. Easily build scalable data pipelines. Leading Data 2.0 http://activeloop.ai
Upserts, Deletes And Incremental Processing on Big Data.
lakeFS - Data version control for your data lake | Git for data
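Translation - An open source platform that provides resilience and manageability for data lakes built on object storage.
A one-stop, easily extensible real-time computing platform built on Apache Flink for developing and operating FlinkSQL and SQL jobs.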
LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
The LeoFS Storage System
World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
Translation - Scalable fuzzy matching for data mastering, deduplication, and entity resolution.
Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.
A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)
DuckDB-powered data lake analytics from Postgres
Open Control Plane for Tables in Data Lakehouse
Use SQL to build ELT pipelines on a data lakehouse.
#Awesome#A curated list of open source tools used in analytics platforms and data engineering ecosystem
The Internals of Delta Lake