data-lake · GitHub Topics

lakeFS - Data version control for your data lake | Git for data

An open-source platform that provides resilience and manageability for data lakes built on object storage

data-engineering data-versioning Go object-storage data-lake aws-s3 data-quality azure-blob-storage google-cloud-storage git-for-data Apache Spark hadoop-filesystem datalake data-version-control azure-storage

Go 4.62 k

16 hours ago

dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

data Python data-engineering data-lake data-loading data-warehouse elt extract load transform

Python 3.46 k

3 hours ago

apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

Apache Spark hive SQL thrift jdbc spark-sql data-lake hadoop Kubernetes Hacktoberfest

Scala 2.18 k

3 days ago

bytedance / bitsail

BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every d...

flink big-data data-integration data-lake data-pipeline data-synchronization high-performance real-time

Java 1.65 k

1 year ago

san089 / Udacity-Data-Engineering-Projects

A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.

data data-engineering data-engineering-pipeline etl-pipeline cassandra-database postgresql-database data-modeling data-warehouse data-lake airflow cluster Apache Cassandra infrastructure PostgreSQL Amazon Web Services aws-ec2 aws-sdk aws-s3 cloudformation

Python 1.6 k

3 years ago

san089 / goodreads_etl_pipeline

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.

etl-pipeline etl-framework Apache Spark apache-airflow airflow redshift emr-cluster livy s3 data-lake scheduler data-migration data-engineering data-engineering-pipeline Python etl-job

Python 1.36 k

5 years ago

Teradata / kylo

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is license...

Apache Spark nifi data-lake teradata hadoop

Java 1.11 k

2 years ago

alanchn31 / Data-Engineering-Projects

Personal Data Engineering Projects

data-lake data-engineering data-warehouse Apache Cassandra MongoDB scrapy Apache Spark airflow PostgreSQL star-schema data-modeling

Jupyter Notebook 921

2 years ago

Canner / vulcan-sql

Data API Framework for AI Agents and Data Apps

api-builder data-lake data-warehouse database SQL analytics reporting Spreadsheet BigQuery duckdb PostgreSQL snowflake restful-api TypeScript clickhouse ksqldb artificial-intelligence ai-agent

TypeScript 676

9 months ago

lakekeeper / lakekeeper

Lakekeeper is an Apache-Licensed, secure, fast and easy to use Apache Iceberg REST Catalog written in Rust.

catalog data-lake iceberg lakehouse Rust

Rust 558

21 hours ago

uber / marmaray

Generic Data Ingestion & Dispersal Library for Hadoop

hadoop data-lake avro-schema Apache Spark

Java 478

2 years ago

aws-solutions-library-samples / data-lakes-on-aws

Enterprise-grade, production-hardened, serverless data lake on AWS

Serverless Framework data-lake analytics Amazon Web Services etl data-engineering lake-formation Infrastructure as code best-practices

Python 448

15 days ago

kaiwaehner / hivemq-mqtt-tensorflow-kafka-realtime-iot-machine-learning-training-inference

Real Time Big Data / IoT Machine Learning (Model Training and Inference) with HiveMQ (MQTT), TensorFlow IO and Apache Kafka - no additional data store like S3, HDFS or Spark required

kafka hivemq MQTT kafka-streams kafka-connect ksql Tensorflow gRPC Java Python data-lake confluent ksqldb Terraform Google Cloud Kubernetes cloud MongoDB

Jupyter Notebook 414

4 years ago

cuebook / cuelake

Use SQL to build ELT pipelines on a data lakehouse.

apache-iceberg delta lakehouse datalake data-lake elt etl data-engineering data-integration data-ingestion Apache Spark spark-sql data-transfer pipelines data-pipeline zeppelin-notebook SQL

JavaScript 286

3 years ago

Canner / wren-engine

#Large Language Model# 🤖 The Semantic Engine for Model Context Protocol (MCP) Clients and AI Agents 🔥

business-intelligence data data-analysis data-analytics data-lake data-warehouse SQL semantic semantic-layer large-language-model Hacktoberfest agent agentic-ai artificial-intelligence mcp mcp-server

Java 272

3 days ago

awslabs / amazon-s3-find-and-forget

Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)

data-lake amazon-s3 s3 gdpr Amazon Web Services parquet ccpa big-data privacy data

Python 243

1 month ago