SeaTunnel (formerly named Waterdrop) is an easy-to-use, high-performance distributed data integration platform supporting real-time synchronization of massive data; it can stably synchronize tens of billions of records per day.
ingestr is a CLI tool to seamlessly copy data between any databases with a single command.
Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
Concurrent and multi-stage data ingestion and data processing with Elixir
Pravega - Streaming as a new software defined storage primitive
Build data pipelines with SQL and Python, ingest data from different sources, add quality checks, and build end-to-end flows.
Copy to/from Parquet in S3 or Azure Blob Storage from within PostgreSQL
Orbital automates integration between data sources (APIs, databases, queues, and functions): BFFs, API composition, and ETL pipelines that adapt as your specs change.
Use SQL to build ELT pipelines on a data lakehouse.
#NLP# A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way 🌰
The Data Engineering Book - a data engineering book by Thai people, for Thai people
Apache Paimon Rust - the Rust implementation of Apache Paimon.
Apache Spark examples exclusively in Java
Sample code for the AWS Big Data Blog post "Building a scalable streaming data processor with Amazon Kinesis Data Streams on AWS Fargate"
Enables custom tracing of Java applications in Dynatrace
#Blockchain# Download and warehouse historical trading data