No-code LLM platform to launch APIs and ETL pipelines that structure unstructured documents.
#Editor#Orchest: build data pipelines, the easy way 🛠️
StreamX was created to make stream processing simpler: a one-stop big data platform offering unified stream-batch processing and an integrated lakehouse solution.
#Computer Science#Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage/tracing and metadata. Runs and scales everywhere Python does.
Implementing best practices for PySpark ETL jobs and applications.
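A common best practice such templates encourage is keeping extract, transform, and load steps as small, separately testable functions. A minimal plain-Python sketch of that shape (function names and data are illustrative, not taken from the repository above):

```python
# Hypothetical sketch of the extract/transform/load separation; in a real
# PySpark job each step would operate on DataFrames and write to storage.

def extract(rows):
    """Extract: yield raw records from a source (here, an in-memory list)."""
    yield from rows

def transform(records):
    """Transform: a small, pure step that is easy to unit-test in isolation."""
    for r in records:
        yield {**r, "amount": round(r["amount"] * 1.2, 2)}

def load(records):
    """Load: collect results for a sink (a real job would write to a table)."""
    return list(records)

if __name__ == "__main__":
    raw = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}]
    print(load(transform(extract(raw))))
```

Because each stage takes and returns plain iterables, the pipeline composes as `load(transform(extract(source)))` and each stage can be tested on a handful of records.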
A few Data Engineering projects, including data modeling, cloud infrastructure setup, data warehousing, and data lake development.
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
#Computer Science#A scalable general-purpose micro-framework for defining dataflows. THIS REPOSITORY HAS BEEN MOVED TO www.github.com/dagworks-inc/hamilton
#LLM#Enterprise-grade, API-first LLM workspace for unstructured documents, including data extraction, redaction, rights management, a prompt playground, and more!
#Computer Science#A high-performance Clojure data processing system
A blazingly fast general-purpose blockchain analytics engine specialized in systematic MEV detection
A simplified, lightweight ETL Framework based on Apache Spark
#LLM#Integrate LLMs into any pipeline - fit/predict pattern, JSON-driven flows, and built-in concurrency support.
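The fit/predict pattern borrows scikit-learn's estimator convention: "fit" records examples or configuration, "predict" runs the LLM per input. A hypothetical sketch of that idea with a stubbed client (`FakeLLM`, `PromptClassifier`, and all names here are invented for illustration, not the library's actual API):

```python
class FakeLLM:
    """Stand-in for a real LLM client: echoes the prompt it receives."""
    def complete(self, prompt):
        return f"[completion of: {prompt}]"

class PromptClassifier:
    """fit() records few-shot examples; predict() builds one prompt per input."""
    def __init__(self, llm):
        self.llm = llm
        self.examples = []

    def fit(self, texts, labels):
        self.examples = list(zip(texts, labels))
        return self  # chainable, like scikit-learn estimators

    def predict(self, texts):
        shots = "\n".join(f"{t} -> {l}" for t, l in self.examples)
        return [self.llm.complete(f"{shots}\n{t} ->") for t in texts]

clf = PromptClassifier(FakeLLM()).fit(["great!"], ["positive"])
print(clf.predict(["awful"]))
```

Keeping the LLM behind a fit/predict interface lets such a step drop into any pipeline slot that already expects an estimator.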
Pythonic Stream-like manipulation of iterables.
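The stream-like style wraps an iterable in a fluent, lazily evaluated chain of `map`/`filter`/`take` calls. A generic stdlib-only sketch of the idea (this `Stream` class is illustrative, not the actual API of the project above):

```python
from itertools import islice

class Stream:
    """A minimal fluent wrapper over an iterator; every step stays lazy."""
    def __init__(self, iterable):
        self._it = iter(iterable)

    def map(self, fn):
        return Stream(map(fn, self._it))

    def filter(self, pred):
        return Stream(filter(pred, self._it))

    def take(self, n):
        return Stream(islice(self._it, n))

    def to_list(self):
        return list(self._it)

# Lazily square even numbers and take the first three.
result = Stream(range(100)).filter(lambda x: x % 2 == 0).map(lambda x: x * x).take(3).to_list()
print(result)  # [0, 4, 16]
```

Because each method returns a new `Stream` over a lazy iterator, nothing is computed until `to_list()` consumes it, so the chain works on arbitrarily large or infinite sources.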
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All compone...
#Computer Science#A simple Spark-powered ETL framework that just works 🍺
Jayvee is a domain-specific language and runtime for automated processing of data pipelines
This is a template you can use for your next data engineering portfolio project.
Service for bulk-loading data to databases with automatic schema management (Redshift, Snowflake, BigQuery, ClickHouse, Postgres, MySQL)
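The core idea behind automatic schema management is inferring column types from sample records before bulk-loading. A hypothetical sketch in plain Python (the type mapping and DDL dialect here are illustrative, roughly Postgres-flavored, and not this service's actual behavior):

```python
# Map exact Python types to illustrative SQL types; unknown types fall back to TEXT.
SQL_TYPES = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE PRECISION", str: "TEXT"}

def infer_schema(rows):
    """Assign each column the SQL type of its first non-null value."""
    schema = {}
    for row in rows:
        for col, val in row.items():
            if col not in schema and val is not None:
                schema[col] = SQL_TYPES.get(type(val), "TEXT")
    return schema

def create_table_ddl(table, rows):
    """Emit a CREATE TABLE statement from the inferred schema."""
    cols = ", ".join(f"{c} {t}" for c, t in infer_schema(rows).items())
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols})"

rows = [{"id": 1, "price": 9.99, "name": "widget"}]
print(create_table_ddl("products", rows))
```

A production loader would additionally diff the inferred schema against the existing table and issue `ALTER TABLE` statements for new columns, which is what makes the schema management "automatic."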
The goal of this project is to track expenses from Uber Rides and Uber Eats through data engineering processes using technologies such as Apache Airflow, AWS Redshift, and Power BI.