Role Overview & Key Responsibilities
Data Pipeline Operations & On-Call : Own on-call rotation for ingestion pipelines (Kafka, AWS Glue); triage and resolve pipeline failures, schema mismatches, and throughput degradation; author RCAs.
Data Quality Monitoring : Implement and maintain data quality checks across Bronze->Silver->Gold lakehouse layers (S3->Kafka->Snowflake/Redshift); alert on anomalies, missing data, or drift.
ML Model Health & MLOps : Monitor deployed models for accuracy degradation, data drift, and concept drift; manage model redeployment workflows; maintain ML experiment tracking.
AI Platform Reliability (Bedrock + LangChain) : Monitor AWS Bedrock inference latency, token usage, error rates, and cost; operate LangChain agent pipelines; use Langfuse for AI evaluation and observability.
DORA Metrics - Data & AI Lens : Track deployment and release health for data pipeline and model updates; measure lead time for data model changes; monitor pipeline reliability as a DORA proxy.
Schema & Contract Management : Monitor AWS Glue Schema Registry for schema evolution events; validate Avro contract compliance for new producer payloads; coordinate schema changes with module teams.
Snowflake / Redshift Operations : Manage query performance, warehouse sizing, cost controls, and data retention policies; monitor Gold-layer data freshness and SLA compliance.
Incident Escalation : Serve as first-line triage for all data and AI incidents; escalate to core data/ML engineers only when root cause requires architectural changes or new feature work.

Required Skills & Experience
Data Engineering (Strong)
4+ years of data engineering experience with production-grade pipelines
Proficient with Apache Kafka: consumer groups, topic management, lag monitoring, DLQ handling
Experience with AWS Glue, AWS Glue Schema Registry, and Avro/Parquet data formats
Hands-on with Snowflake or Redshift: query optimization, cost management, RBAC
Familiarity with lakehouse patterns: Bronze/Silver/Gold (S3-based) data architecture

ML/AI Operations (Core Competency)
Experience with MLOps practices: model versioning, drift detection, retraining pipelines
Familiarity with AWS Bedrock, SageMaker, or equivalent managed ML inference platforms
Working knowledge of LangChain or LlamaIndex for LLM application pipelines
Experience with AI/LLM observability tools (Langfuse, LangSmith, or equivalent)
Understanding of RAG (Retrieval-Augmented Generation) architectures and vector stores

Operational Excellence (Core Competency)
Experience applying DORA metrics to data and ML delivery pipelines
On-call experience for data infrastructure; structured incident management and RCA
Data quality framework implementation: Great Expectations, dbt tests, or custom checks
Experience with monitoring and alerting for streaming pipelines (Kafka lag, throughput)

Backend & AWS Exposure
Python proficiency - scripting, pipeline development, data transformation
AWS services: S3, Lambda, Glue, Bedrock, CloudWatch, SQS/SNS, IAM
Familiarity with containerized workloads on Kubernetes (EKS)
Experience with dbt or similar data transformation frameworks is a plus

Nice to Have
Exposure to ontology or knowledge graph systems (RDF, OWL, or property graphs)
Familiarity with Temporal for workflow orchestration of ML pipelines
Experience with multi-tenant data platforms and row-level security patterns
Understanding of GDPR-compliant data handling and encryption key management

Data Engineer (with AWS & AI/ML)
