Project Overview | Cybersecurity Threat Detection with SageMaker

Overview

This project showcases a serverless, machine learning-powered cybersecurity solution deployed entirely on AWS. It detects anomalous network activity, such as DDoS attacks, unauthorized access, and phishing attempts, using a trained model hosted on SageMaker. The system automates data ingestion, preprocessing, model training, deployment, and real-time inference. It is secured with IAM policies, monitored via CloudWatch, and designed for scalability and automation.

Data Storage and Management

Amazon S3

Stores raw network traffic logs collected from various sources.
Holds preprocessed datasets and extracted features used for training.
Archives model artifacts including training outputs and serialized models.
Serves as the input and output location for SageMaker and Lambda, with bucket access tightly controlled via IAM.

Machine Learning Pipeline

Amazon SageMaker

Trains an XGBoost model to classify network activity as normal or malicious.
Deploys the trained model as a real-time inference endpoint.
Automates the ML lifecycle with SageMaker Pipelines:
- Data preprocessing
- Feature engineering
- Model training
- Evaluation
- Deployment
Endpoint is integrated with Lambda for real-time threat detection.
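
The Lambda-to-endpoint call could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's actual code: the helper names, the endpoint name, and the 0.5 threshold are hypothetical, and the response is assumed to be a single probability returned as plain text, which is how a binary XGBoost classifier served with CSV input typically responds.

```python
def to_csv_row(features):
    """Serialize an ordered sequence of numeric features as one CSV line."""
    return ",".join(str(float(x)) for x in features)


def score_event(features, endpoint_name, client=None, threshold=0.5):
    """Send one feature vector to a SageMaker endpoint and label the result.

    Assumptions (not from the project): the endpoint accepts text/csv and
    returns one probability as text; `threshold` is a hypothetical cut-off.
    """
    if client is None:
        # Imported lazily so the helper can be exercised without AWS access.
        import boto3
        client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=to_csv_row(features),
    )
    # The response body is a stream; read and parse the single score.
    score = float(response["Body"].read().decode("utf-8").strip())
    return {"score": score, "malicious": score >= threshold}
```

Passing the client in as a parameter keeps the function testable with a stub and lets a Lambda handler reuse one client across invocations.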

Data Preprocessing and Feature Engineering

AWS Lambda

Triggered to process raw logs stored in S3.
Extracts features such as IP entropy, packet size variance, and protocol usage.
Outputs structured datasets back to S3 for SageMaker ingestion.
Configured with IAM roles for secure access to S3 and SageMaker.
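
To make the three named features concrete, here is a minimal sketch of how they might be computed over one window of parsed log records. The record keys (src_ip, packet_size, protocol), the function names, and the choice of Shannon entropy and population variance are assumptions for illustration; the project's actual definitions may differ.

```python
import math
from collections import Counter
from statistics import pvariance


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def extract_features(records):
    """Summarize one window of log records into a flat feature dict.

    Each record is assumed to be a dict with the hypothetical keys
    'src_ip', 'packet_size', and 'protocol'.
    """
    if not records:
        return {"ip_entropy": 0.0, "packet_size_variance": 0.0, "protocol_share": {}}
    sizes = [r["packet_size"] for r in records]
    protocols = Counter(r["protocol"] for r in records)
    return {
        # High entropy: traffic spread over many sources; low: few sources.
        "ip_entropy": shannon_entropy(r["src_ip"] for r in records),
        "packet_size_variance": pvariance(sizes),
        # Fraction of records per protocol within the window.
        "protocol_share": {p: n / len(records) for p, n in protocols.items()},
    }
```

Source-IP entropy is useful because a flood from many spoofed addresses and a scan from a single host both shift it away from a normal baseline, in opposite directions.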

Monitoring and Logging

Amazon CloudWatch

Captures logs from Lambda preprocessing and SageMaker inference.
Tracks performance metrics such as latency, invocation count, and error rates.
Extensible with CloudWatch Alarms to notify on anomalies or failures.
Logs include flagged threats and prediction confidence scores.
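
One way to get flagged threats and confidence scores into CloudWatch is to print one JSON object per prediction from Lambda, since Lambda forwards standard output to CloudWatch Logs. The sketch below assumes that approach; the function name, field names, and threshold are hypothetical rather than taken from the project.

```python
import json
import time


def threat_log_record(source_ip, score, threshold=0.5, now=None):
    """Build one JSON log line describing a prediction.

    Field names and the default threshold are illustrative assumptions.
    `now` (epoch seconds) can be injected to make output deterministic.
    """
    record = {
        "timestamp": int(now if now is not None else time.time()),
        "source_ip": source_ip,
        "confidence": round(score, 4),
        "flagged": score >= threshold,
    }
    # sort_keys gives a stable field order, which keeps log lines diff-friendly.
    return json.dumps(record, sort_keys=True)
```

Inside a handler this would be used as print(threat_log_record(ip, score)). Structured lines like this can then be filtered by field, for example on flagged, when defining metric filters or alarms.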

Security and Permissions

AWS IAM

IAM roles scoped for least privilege:
- Lambda: execution role with access to S3 and SageMaker.
- SageMaker: role with access to training data and model artifacts.

Policies enforce encryption at rest and in transit.
All services interact securely via IAM-authenticated API calls.
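
As an illustration of what "least privilege" could mean for the Lambda execution role, the sketch below builds an identity policy document as a Python dict. The bucket name, prefixes, and endpoint ARN are placeholders, and the exact statements are an assumption; the project's real policies are not shown in this overview.

```python
import json


def lambda_policy(bucket, raw_prefix, processed_prefix, endpoint_arn):
    """Return a least-privilege policy document for the preprocessing Lambda.

    All argument values are hypothetical placeholders supplied by the caller.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Read only the raw logs.
                "Sid": "ReadRawLogs",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{raw_prefix}/*",
            },
            {   # Write only the extracted features.
                "Sid": "WriteFeatures",
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/{processed_prefix}/*",
            },
            {   # Invoke one specific endpoint, nothing else in SageMaker.
                "Sid": "InvokeEndpoint",
                "Effect": "Allow",
                "Action": "sagemaker:InvokeEndpoint",
                "Resource": endpoint_arn,
            },
            {   # Refuse any S3 request that is not sent over TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Action": "s3:*",
                "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
        ],
    }
```

Calling json.dumps(lambda_policy(...), indent=2) yields the JSON to attach to the role. The transport condition is more commonly placed in a bucket policy; it is included here only to show how encryption in transit can be expressed as a policy condition.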

Architecture Summary

Architecture diagram: S3 stores raw and preprocessed data and model artifacts; Lambda preprocesses logs and extracts features; SageMaker trains the XGBoost model and deploys a real-time endpoint; CloudWatch collects logs and metrics; IAM manages least-privilege access.
Caption: Secure, scalable threat detection architecture with SageMaker, Lambda, S3, CloudWatch, and IAM.

Storage – Amazon S3: Raw logs, preprocessed datasets, model artifacts; IO for Lambda/SageMaker.
Compute – AWS Lambda: Automates preprocessing & feature extraction; orchestrates IO.
Machine Learning – Amazon SageMaker: Trains & deploys model; manages lifecycle with Pipelines.
Monitoring – Amazon CloudWatch: Logs/metrics; alarms for anomalies.
Security – AWS IAM: Least-privilege access; encryption at rest/in transit.

Summary of Architecture Flow

Raw logs are uploaded to S3.
Lambda is triggered to preprocess data and extract features.
Preprocessed data is stored back in S3 and fed into SageMaker for training.
Trained model is deployed as a SageMaker endpoint.
Inference requests are sent to the endpoint via Lambda or other triggers.
CloudWatch logs all activity and monitors for anomalies.
IAM roles enforce secure access across all services.
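
The first three steps of the flow can be sketched as two small helpers a Lambda handler might use, assuming the function is triggered by S3 event notifications (the overview does not say which trigger is used). The raw/ and processed/ prefixes and the .csv output extension are hypothetical.

```python
import posixpath
from urllib.parse import unquote_plus

RAW_PREFIX = "raw/"              # hypothetical prefix for uploaded logs
PROCESSED_PREFIX = "processed/"  # hypothetical prefix for feature files


def iter_s3_objects(event):
    """Yield (bucket, key) for each object in an S3 event notification."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Keys arrive URL-encoded in S3 notifications (spaces become '+').
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def processed_key(raw_key):
    """Map a key under raw/ to the matching .csv key under processed/."""
    if not raw_key.startswith(RAW_PREFIX):
        raise ValueError(f"key is outside {RAW_PREFIX}: {raw_key}")
    stem, _ext = posixpath.splitext(raw_key[len(RAW_PREFIX):])
    return f"{PROCESSED_PREFIX}{stem}.csv"
```

Writing output under a different prefix than the one that triggers the function matters in practice: if both shared a prefix, each write would trigger the Lambda again.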

Skills Demonstrated

End-to-end ML pipeline automation with SageMaker Pipelines
Feature engineering and preprocessing with Lambda
Secure data handling with least-privilege IAM roles and enforced encryption
Real-time inference with SageMaker endpoints, with logging and monitoring in CloudWatch
Scalable, serverless architecture design
Threat detection modeling with XGBoost