lakeFS - Data version control for your data lake | Git for data
An open source platform that provides resilience and manageability for data lakes built on object storage
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every d...
A few projects related to data engineering, including data modeling, infrastructure setup on the cloud, data warehousing, and data lake development.
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is license...
Personal Data Engineering Projects
Data API Framework for AI Agents and Data Apps
Generic Data Ingestion & Dispersal Library for Hadoop
Enterprise-grade, production-hardened, serverless data lake on AWS
Real Time Big Data / IoT Machine Learning (Model Training and Inference) with HiveMQ (MQTT), TensorFlow IO and Apache Kafka - no additional data store like S3, HDFS or Spark required
Use SQL to build ELT pipelines on a data lakehouse.
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
BtrBlocks: Efficient Columnar Compression for Data Lakes (SIGMOD 2023 Paper)
#Large Language Models# 🤖 The semantic engine for LLMs, bringing semantic context to AI agents. 🔥
Resources for video demonstrations and blog posts related to DataOps on AWS
An efficient storage and compute engine for both on-prem and cloud-native data analytics.