We’re seeking a Senior Data Engineer to join our team and lead the design and development of data pipelines across both streaming and batch layers. You’ll work closely with fraud analysts and backend engineers to ensure timely, accurate, and reliable data flows. In this role, you'll help build a secure, compliant, and API-ready data lake that powers both operational and analytical systems.
What You’ll Be Doing
- Architect and maintain our data lake using AWS S3, AWS Glue, Lake Formation, and Athena, enabling scalable, schema-aware data storage and querying.
- Design and optimize ETL/ELT pipelines using PySpark, Glue Jobs, or Airflow, with robust handling of schema evolution and data partitioning.
- Use AWS DMS to replicate and consolidate data from transactional systems (MySQL, PostgreSQL) into the data lake in near real time.
- Support both real-time streaming (Kinesis, Kafka) and batch pipelines for key functions such as reconciliation, fraud scoring, and operational reporting.
- Implement data quality checks, observability metrics, and auditing mechanisms to ensure data integrity and regulatory compliance.
- Work with modern open table formats (Apache Iceberg, Delta Lake) to enable features such as upserts, time travel, and incremental queries (see the PySpark sketch after this list).
- Enforce data governance using IAM, encryption, and fine-grained access policies via Lake Formation.
- Empower teams by building data marts, materialized views, and APIs for dashboards, machine learning models, and alerts.
- Accelerate development using GenAI tools to generate PySpark scaffolds, tests, documentation, and DDL scripts while maintaining quality and traceability.
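To give a flavour of this work, here is a minimal PySpark sketch of a partitioned, upsert-capable load into an open table format. It assumes a Spark session already configured with an Iceberg catalog (for example via Glue job parameters); the bucket, catalog, table, and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

# Assumes the session is already wired to an Iceberg catalog named
# "glue_catalog"; all paths and table names below are illustrative only.
spark = SparkSession.builder.appName("transactions-upsert").getOrCreate()

# Read today's raw transaction drop from S3 (hypothetical path and schema).
updates = (
    spark.read.parquet("s3://example-raw-bucket/transactions/dt=2024-01-01/")
    .withColumn("txn_date", F.to_date("created_at"))
)

updates.createOrReplaceTempView("txn_updates")

# Idempotent upsert into a partitioned Iceberg table, so late or replayed
# records do not create duplicates and time travel remains available.
spark.sql("""
    MERGE INTO glue_catalog.payments.transactions AS t
    USING txn_updates AS s
      ON t.txn_id = s.txn_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```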
Responsibilities
- Design and maintain scalable, cost-effective data workflows using AWS-native tools for both real-time and batch processing.
- Write efficient, testable PySpark scripts, Glue jobs, and SQL transformations for complex logic such as joins, window functions, and aggregations (see the example sketch after this list).
- Optimize S3 storage layout with appropriate file sizing, compression, and partitioning, using table formats like Iceberg or Hudi.
- Manage data cataloging via the AWS Glue Data Catalog: set up crawlers, validate schemas, and apply metadata tags.
- Use Athena, Redshift Spectrum, or EMR to query large datasets and validate results.
- Integrate closely with fraud and reconciliation systems to ensure real-time data availability and correctness.
- Contribute to CI/CD pipelines for data jobs, including testing, rollback mechanisms, and deployment monitoring.
- Partner with Data Governance, InfoSec, and Engineering teams to enforce data protection, encryption, and compliance standards.
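As a concrete example of the transformation work above, here is a short PySpark sketch of a windowed deduplication followed by a join and daily aggregation, the kind of logic a reconciliation job might need. Table and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("daily-reconciliation").getOrCreate()

# Hypothetical inputs: raw payment events and a merchant reference table.
payments = spark.table("payments.raw_events")
merchants = spark.table("payments.merchants")

# Keep only the latest event per payment using a window function,
# then join to the reference data and aggregate per merchant per day.
latest = Window.partitionBy("payment_id").orderBy(F.col("event_ts").desc())

daily_totals = (
    payments
    .withColumn("rn", F.row_number().over(latest))
    .filter(F.col("rn") == 1)
    .join(merchants, "merchant_id", "left")
    .groupBy("merchant_name", F.to_date("event_ts").alias("event_date"))
    .agg(
        F.count("*").alias("payment_count"),
        F.sum("amount").alias("gross_amount"),
    )
)

# Write a partitioned, query-friendly layout for downstream consumers.
daily_totals.write.mode("overwrite").partitionBy("event_date") \
    .parquet("s3://example-curated-bucket/reconciliation/daily_totals/")
```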
Requirements
- 4+ years of hands-on data engineering experience, preferably in fintech or other high-throughput environments.
- Strong expertise with AWS Glue (Jobs and Crawlers), S3, Athena, Lake Formation, and Redshift.
- Solid understanding of ETL/ELT pipelines using PySpark or Spark SQL, and of orchestration tools such as Airflow or Step Functions.
- Experience with streaming technologies (Kinesis, Kafka, Amazon MSK).
- Familiarity with AWS DMS for near real-time data replication.
- Hands-on experience with open table formats (Iceberg, Delta Lake, or Hudi).
- Strong SQL skills and an understanding of query performance tuning.
- Exposure to data observability tools (Great Expectations, Monte Carlo) and experience troubleshooting data pipelines (a plain-PySpark version of such a quality gate is sketched after this list).
- Knowledge of data security practices, encryption protocols, and IAM for access control.
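For illustration, a minimal data-quality gate of the kind implied above, written in plain PySpark rather than any particular observability framework. The dataset path, column names, and thresholds are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-gate").getOrCreate()

# Hypothetical curated dataset produced by an upstream job.
df = spark.read.parquet("s3://example-curated-bucket/reconciliation/daily_totals/")

# Compute a handful of basic expectations in a single pass.
checks = df.agg(
    F.count("*").alias("row_count"),
    F.sum(F.when(F.col("gross_amount") < 0, 1).otherwise(0)).alias("negative_amounts"),
    F.sum(F.when(F.col("merchant_name").isNull(), 1).otherwise(0)).alias("null_merchants"),
).first()

# Fail the job loudly if expectations are violated, so the orchestrator
# (Airflow, Step Functions) can halt downstream consumers and alert.
if checks["row_count"] == 0:
    raise ValueError("Data quality check failed: output dataset is empty")
if checks["negative_amounts"] > 0 or checks["null_merchants"] > 0:
    raise ValueError(f"Data quality check failed: {checks.asDict()}")
```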
Nice-to-Haves
- Experience in regulated environments (e.g., PCI DSS, central bank oversight) with strict audit and data retention requirements.
- Exposure to ML feature stores, streaming aggregations, or fraud analytics platforms.
- Familiarity with BI tools such as QuickSight, Metabase, or Looker.
- Experience using GitOps, Terraform, or AWS CDK for infrastructure-as-code in data workflows (a minimal CDK sketch follows this list).
- Proven ability to collaborate with fraud, finance, and compliance teams in a cross-functional setting.
- Use of generative AI tools to improve operational efficiency, e.g., automating reconciliation or anomaly detection alerts.
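On the infrastructure-as-code point, a minimal AWS CDK (Python, CDK v2) sketch of data-lake plumbing: an encrypted, versioned S3 bucket plus a Glue database. Stack, bucket, and database names are illustrative only.

```python
from aws_cdk import App, Stack, RemovalPolicy, aws_s3 as s3, aws_glue as glue
from constructs import Construct


class DataLakeStack(Stack):
    """Illustrative stack: an encrypted, versioned S3 bucket plus a Glue database."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Curated-zone bucket with encryption, versioning, and no public access.
        s3.Bucket(
            self,
            "CuratedBucket",
            encryption=s3.BucketEncryption.S3_MANAGED,
            versioned=True,
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
            removal_policy=RemovalPolicy.RETAIN,
        )

        # Glue database that crawlers and jobs can register tables into.
        glue.CfnDatabase(
            self,
            "CuratedDatabase",
            catalog_id=self.account,
            database_input=glue.CfnDatabase.DatabaseInputProperty(name="curated"),
        )


app = App()
DataLakeStack(app, "ExampleDataLakeStack")
app.synth()
```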