Ansu-John/README.md

Hi, I'm Ansu John 👋

Senior Data Engineer with hands-on experience designing production-grade data pipelines, real-time streaming architectures, AI-augmented data platforms, and cloud-native data warehousing solutions. I build end-to-end systems — from raw ingestion to analytical marts — with a focus on reliability, observability, and modern data stack best practices.

📍 Kochi, India  |  🔗 LinkedIn


🚀 Featured Projects

Production-grade systems built to demonstrate senior Data Engineering capabilities across streaming, AI pipelines, observability, and data governance.


An end-to-end Modern Data Stack pipeline delivering sub-minute latency from checkout event to BI dashboard.

  • Ingestion: Python (Faker) on AWS EC2 → Aiven Apache Kafka (mTLS secured)
  • Processing: PySpark Structured Streaming on Databricks → Delta Lake (Bronze)
  • Warehousing: Python connector → Snowflake (Silver) with dbt transformations → Gold Marts
  • Design decisions: Chose PySpark over SaaS ELT tools for true event-driven streaming with exactly-once semantics; implemented a Delta → Pandas → Snowflake REST pattern to work around Databricks Serverless DML restrictions

Stack: Python · Apache Kafka · PySpark · Databricks · Delta Lake · Snowflake · dbt · AWS EC2 · Aiven Cloud
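
The exactly-once goal mentioned in the design decisions is commonly reached by pairing at-least-once delivery with an idempotent sink. The sketch below is not the project's code: `IdempotentSink`, `write_batch`, and the `event_id` field are hypothetical names, and an in-memory dict stands in for the warehouse table. It only illustrates the idea of skipping a replayed micro-batch (Spark's `foreachBatch` passes a batch id that can serve this purpose) and upserting rows by key.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentSink:
    """In-memory stand-in for a warehouse table keyed by event_id (hypothetical)."""

    rows: dict = field(default_factory=dict)
    committed_batches: set = field(default_factory=set)

    def write_batch(self, batch_id: int, events: list) -> int:
        """Apply a micro-batch once and return the number of new rows.

        A replay of an already committed batch_id is a no-op.
        """
        if batch_id in self.committed_batches:
            return 0
        written = 0
        for event in events:
            # Upsert by key so a duplicate event cannot create a second row.
            if event["event_id"] not in self.rows:
                written += 1
            self.rows[event["event_id"]] = event
        self.committed_batches.add(batch_id)
        return written


sink = IdempotentSink()
batch = [
    {"event_id": "e1", "amount": 20.0},
    {"event_id": "e2", "amount": 35.5},
]
first = sink.write_batch(0, batch)   # first delivery of batch 0
replay = sink.write_batch(0, batch)  # same batch redelivered after a retry
```

In a real warehouse the same effect would usually come from a keyed `MERGE` plus a record of committed batch ids, rather than from Python state.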


An enterprise-grade RAG pipeline and AI agent for extracting complex financial and contractual data from unstructured PDFs, cataloging it in Snowflake, and serving intelligent answers via LangGraph.

  • Extraction: PySpark + pdfplumber parses complex nested tables and checkboxes from PDFs stored in AWS S3
  • Cataloging: Data loaded into Snowflake; embeddings generated via Snowflake Cortex (e5-base-v2) — data never leaves the secure database boundary
  • AI Agent: LangGraph state machine routes queries to Cortex LLM (llama3-70b) for grounded, context-aware responses
  • Testing: CI/CD-ready pytest suite with full mocking to validate pipeline logic without consuming database compute credits

Stack: Python · PySpark · AWS S3 · Snowflake · Snowflake Cortex · LangChain · LangGraph · Pytest
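
The testing approach above (full mocking so that no warehouse compute is used) can be sketched with the standard library alone. This is an illustration under assumptions, not the project's test suite: `load_contract_rows`, the `contracts` table, and its columns are hypothetical, and the connection is assumed to follow the usual DB-API shape (`cursor()`, `executemany()`, `commit()`).

```python
from unittest.mock import MagicMock


def load_contract_rows(conn, rows):
    """Hypothetical loader: insert extracted rows through a DB-API style connection."""
    cur = conn.cursor()
    cur.executemany(
        "INSERT INTO contracts (doc_id, field, value) VALUES (%s, %s, %s)",
        rows,
    )
    conn.commit()
    return len(rows)


def test_load_contract_rows_uses_no_real_connection():
    # MagicMock records every call, so no database session is ever opened.
    conn = MagicMock()
    rows = [("doc-1", "total_amount", "1200.00")]

    assert load_contract_rows(conn, rows) == 1

    # Verify the pipeline logic: one bulk insert, then one commit.
    conn.cursor.return_value.executemany.assert_called_once()
    conn.commit.assert_called_once()


test_load_contract_rows_uses_no_real_connection()
```

Under pytest the final explicit call is unnecessary, since the runner discovers `test_` functions on its own.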


A stateful, event-driven, observable AI workflow that automates the full lifecycle of research document processing — from PDF upload to structured AI-generated synthesis.

  • API: FastAPI returns 202 Accepted instantly; heavy processing runs asynchronously via RabbitMQ
  • Worker: FastStream consumer with exponential backoff, configurable retries, and a Dead Letter Queue (DLQ)
  • LangGraph pipeline: Three nodes — Azure AI Document Intelligence (OCR + layout), OpenAI embeddings → Chroma/pgvector, GPT-4o synthesis with JSON-constrained outputs
  • Observability: End-to-end Langfuse tracing on every LLM call, graph transition, and tool invocation
  • CI: GitHub Actions — Ruff + Pyright + Pytest (80% coverage gate); all external services mocked

Stack: Python 3.12 · FastAPI · FastStream · RabbitMQ · LangGraph · Azure AI Document Intelligence · Azure Blob Storage · OpenAI · Chroma · pgvector · SQLAlchemy 2.0 · Langfuse
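
The worker's retry behavior (exponential backoff, configurable retries, dead-lettering) can be illustrated in plain Python. This sketch does not use FastStream or RabbitMQ: `backoff_delays`, `process_with_retry`, and the list used as a dead letter queue are hypothetical stand-ins, and the default delays are assumed values.

```python
import time


def backoff_delays(base, factor, max_retries):
    """Delay before each retry: base, base*factor, base*factor**2, ..."""
    return [base * factor ** attempt for attempt in range(max_retries)]


def process_with_retry(message, handler, dead_letters, *, max_retries=3,
                       base_delay=0.5, factor=2.0, sleep=time.sleep):
    """Run handler(message); retry with exponential backoff on failure.

    After the initial attempt plus max_retries retries have all failed,
    the message is appended to dead_letters and False is returned.
    """
    delays = backoff_delays(base_delay, factor, max_retries)
    for attempt in range(max_retries + 1):
        try:
            handler(message)
            return True
        except Exception:
            if attempt == max_retries:
                break  # retries exhausted
            sleep(delays[attempt])
    dead_letters.append(message)
    return False


def failing_handler(message):
    """Simulates a downstream service that is unavailable."""
    raise RuntimeError("downstream unavailable")


dlq = []
# A no-op sleep keeps the demo instant; a real worker would wait between attempts.
ok = process_with_retry({"doc_id": "42"}, failing_handler, dlq, sleep=lambda s: None)
```

Injecting `sleep` as a parameter keeps the retry policy testable without real waiting. In a broker-backed setup the delay and dead-lettering would more likely be handled by queue configuration than by application code.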


An intelligent observability platform that ingests system metrics into TimescaleDB and exposes an AI-powered diagnostic agent via FastAPI and the Model Context Protocol (MCP).

  • Ingestion: Batch-ingest CPU, memory, disk, and network metrics into a TimescaleDB hypertable with automatic compression
  • Analytics: Advanced SQL using time_bucket, PERCENTILE_CONT, regr_slope/regr_r2 for windowed aggregations, outlier detection (Z-score), and trend analysis (linear regression)
  • AI Agent: LangGraph supervisor with FastMCP tools classifies natural-language queries and returns structured diagnostic reports
  • Caching: Redis 7 caches identical queries; duplicate requests are rate-limited (HTTP 429)
  • Production-ready: Multi-stage Docker build, Kubernetes manifest (ArgoCD-ready), CI with Ruff + Pyright

Stack: Python 3.12 · FastAPI · TimescaleDB · SQLAlchemy 2.0 · LangGraph · FastMCP · Anthropic Claude · Redis · Docker · Kubernetes
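
The analytics described above run in SQL, but the underlying arithmetic is easy to show in Python. The sketch below is an illustration, not the project's queries: it computes Z-score outliers and the least-squares slope and R² that `regr_slope` and `regr_r2` return in SQL. The function names, the threshold of 3, the use of the population standard deviation, and the sample values are all assumptions.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return indices whose |z-score| exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return []  # constant series: nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def slope_and_r2(xs, ys):
    """Least-squares slope of y on x and the squared correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # A flat y series is treated here as a perfect (horizontal) fit.
    r2 = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, r2


# Hypothetical CPU readings: a steady baseline with one spike at the end.
cpu = [10.0] * 19 + [100.0]
spikes = zscore_outliers(cpu)

# Hypothetical disk usage growing by 2 units per time bucket.
slope, r2 = slope_and_r2([0, 1, 2, 3], [1.0, 3.0, 5.0, 7.0])
```

Running these aggregates inside the database avoids pulling raw metric rows into the application, which is presumably why the project keeps them in SQL.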


🛠️ Core Skills

| Domain | Technologies |
| --- | --- |
| Streaming & Pipelines | Apache Kafka · PySpark Structured Streaming · Delta Lake · Databricks |
| Data Warehousing | Snowflake · Snowflake Cortex · dbt · TimescaleDB · PostgreSQL |
| AI & LLM Pipelines | LangGraph · LangChain · FastMCP · RAG · OpenAI · Anthropic Claude |
| Cloud & Infrastructure | AWS (S3, EC2) · Azure (Blob Storage, Document Intelligence) · Docker · Kubernetes |
| APIs & Messaging | FastAPI · RabbitMQ · FastStream · Redis |
| Languages | Python · Scala · SQL |
| Quality & Observability | Ruff · Pyright · Pytest · Langfuse · GitHub Actions |

📂 Other Projects

ETL & Data Engineering

ETL Verifier — Post-load data validation after ETL using Scala.

Big Data Analysis — Spark RDD / DataFrame operations using PySpark 3.x.

Machine Learning

Feature Engineering with MLlib · Regression Models · Classification Models · Linear Regression with Spark · Logistic Regression with Spark · Clustering · Movie Recommender System

AI & Real-Life Projects

Natural Language Processing · Credit Card Fraud Detection · IBM HR Analytics — Employee Attrition · AB Testing · Customer Segmentation · Retail Sales Forecasting

Tutorials

Study Notes · Introduction to Python


💬 Feel free to message me on LinkedIn for any comments or collaboration opportunities.

Pinned

  1. Big-Data-Analysis (Public)

    Explore various Spark RDD / DataFrame operations using the PySpark library.

    Jupyter Notebook

  2. Classification-Models (Public)

    Build and evaluate various machine learning classification models using Python.

    Jupyter Notebook

  3. Natural-Language-Processing (Public)

    Explore various natural language processing models using Python.

    Jupyter Notebook

  4. Regression-Models (Public)

    Build and evaluate various machine learning regression models using Python.

    Jupyter Notebook

  5. Customer-Segmentation (Public)

    Implement customer segmentation using Python 3.

    Jupyter Notebook

  6. ETL-Verifier (Public)

    Data validation after ETL using Scala.

    Scala