from data chaos to decision-ready intelligence
A full-stack AWS and Databricks Lakehouse that uses RAG to turn fragmented legacy data into a high-performance knowledge engine.
#AWSBedrock #Databricks #UnityCatalog #DeltaLiveTables #Python #GitLab #CI/CD #Streamlit
Modernizing Intelligence Discovery
I’ve spent 20 years pulling systems apart to understand how they tick. Today, I apply that same forensic curiosity to the Intelligence Community’s biggest challenge: Data Saturation.
When legacy silos create 48-hour intelligence latencies, I build the solution. I engineered a Full-Stack Knowledge Synthesis Engine that leverages AWS and Databricks to automate the path from raw field reports to verified insights. Using Medallion Architecture and RAG, I created a system where analysts don't have to hunt for data; they simply ask for it. This IL5-ready architecture replaces manual extraction with automated semantic discovery, ensuring that in a world of endless data, the most critical insights are always the first to be found.
The modernization blueprint: data at the speed of decision
Legacy Challenge: The Intelligence Gap
Thousands of siloed, unstructured reports required manual analysis, creating a 48-hour lag between data ingestion and actionable intelligence.
Modern Solution: The Knowledge Engine
A serverless, automated pipeline that ingests and indexes data in real-time, reducing the path from raw data to natural language query to under 5 minutes.
-
Databricks Auto Loader
Auto Loader is the ingestion engine that continuously processes new GDELT files as they land in the S3 bucket. It eliminates directory rescans, handles schema drift, and scales to billions of files. For real-time pipelines it provides:
Incremental discovery of new objects
Automatic schema inference and evolution
Checkpointing for exactly-once processing
Streaming Ingestion Example:
# Incrementally discover and read new GDELT CSV files as they arrive
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/mnt/schemas/gdelt")
    .load("/mnt/raw/gdelt/")
)

# Stream the raw records into the Bronze Delta table with exactly-once checkpointing
(df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/gdelt_bronze")
    .outputMode("append")
    .start("/mnt/bronze/gdelt"))
Delta Live Tables (DLT)
DLT manages the transformation pipeline from Bronze > Silver > Gold using declarative logic. It ensures reliability through automatic lineage, monitoring, and data quality enforcement.
Bronze: Raw ingested GDELT records
Silver: Cleaned, deduplicated, PII-masked, language-normalized
Gold: Curated, analytics-ready tables for embedding and RAG
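Before they are wired into DLT, the Silver-layer cleaning rules can be prototyped as plain functions. The sketch below is illustrative only: the email-masking regex, the `language` and `summary` field names, and the redaction token are assumptions, not the production rules.

```python
import re

# Illustrative PII rule: redact anything shaped like an email address
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask_pii(text: str) -> str:
    """Replace email addresses with a redaction token."""
    return EMAIL_RE.sub("[REDACTED]", text)

def normalize_record(record: dict) -> dict:
    """Normalize the language code and mask PII in free-text fields."""
    return {
        **record,
        "language": record.get("language", "").strip().lower(),
        "summary": mask_pii(record.get("summary", "")),
    }
```

Keeping these rules as standalone functions makes them unit-testable before they run inside a streaming pipeline.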
Data Quality with dlt.expect
DLT expectations enforce constraints and optionally drop or fail invalid records.
import dlt

@dlt.table
@dlt.expect("valid_event_date", "event_date IS NOT NULL")
@dlt.expect_or_drop("valid_tone", "tone BETWEEN -100 AND 100")
def gdelt_silver():
    # Read the Bronze stream, then deduplicate on the GDELT event key
    df = spark.readStream.table("bronze_gdelt")
    return df.dropDuplicates(["global_event_id"])
This ensures only valid, high-quality events progress to the Silver layer.
-
Why Claude 3 Sonnet via Amazon Bedrock?
Claude 3 Sonnet provides:
Large context window (200K tokens) for multi-document synthesis
High factual accuracy and low hallucination rates
Fast inference suitable for real-time mission workflows
Native AWS integration, simplifying security and deployment
This makes it ideal for synthesizing GDELT-derived intelligence.
Vector Search with Databricks + Unity Catalog
Embeddings are stored in Databricks Vector Search, which is fully governed by Unity Catalog:
Centralized permissions for tables, models, and vector indexes
Row-level and column-level security
Full lineage tracking and auditability
Secure isolation across workspaces
This ensures embeddings remain protected while still enabling high-performance semantic retrieval.
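Under the hood, the index ranks stored embeddings by similarity to the query embedding. A stdlib-only toy of that top-k ranking follows; the three-dimensional vectors and document ids are made up for illustration (real embeddings from the model have hundreds of dimensions):

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query: list, docs: dict, k: int = 2) -> list:
    """Return the ids of the k stored embeddings most similar to the query."""
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:k]

# Made-up embeddings keyed by document id
index = {
    "protest_report": [0.9, 0.1, 0.0],
    "trade_summary":  [0.1, 0.9, 0.1],
    "unrest_brief":   [0.8, 0.2, 0.1],
}
print(top_k([1.0, 0.0, 0.0], index))  # → ['protest_report', 'unrest_brief']
```

The managed index performs the same ranking at scale with approximate nearest-neighbor search rather than a full scan.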
LangChain Retrieval Logic
LangChain orchestrates the RAG loop by retrieving relevant vectors and passing them to Claude 3.
from langchain.vectorstores import DatabricksVectorSearch
from langchain.chains import RetrievalQA

# claude_llm is the Bedrock-backed Claude 3 LLM configured elsewhere;
# the exact DatabricksVectorSearch import path varies by LangChain version.
retriever = DatabricksVectorSearch(
    index_name="gdelt_events_index"
).as_retriever(search_kwargs={"k": 5})

qa = RetrievalQA.from_chain_type(
    llm=claude_llm,
    retriever=retriever,
)

response = qa.run("Summarize political unrest events in South America.")
This creates a clean, modular retrieval pipeline.
-
AWS Lambda API Layer
Lambda provides a lightweight, serverless inference endpoint for the mission portal or external systems.
Bedrock Call Using Boto3
import boto3
import json

client = boto3.client("bedrock-runtime")

# Claude 3 models on Bedrock use the Messages API request format
payload = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 300,
    "messages": [
        {"role": "user", "content": "Summarize the latest GDELT events."}
    ],
}

response = client.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=json.dumps(payload),
)

result = json.loads(response["body"].read())
print(result["content"][0]["text"])
This enables low-latency, scalable inference.
Streamlit UI
Streamlit provides a fast, interactive interface for analysts:
Real-time querying
Visualization of retrieved context
Secure integration with Lambda or Databricks endpoints
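A minimal sketch of the analyst-facing app is below, assuming the Lambda endpoint from the previous section sits behind an API Gateway URL. The URL, the `question`/`answer` field names, and the page layout are all placeholders, not the production interface.

```python
import json
import urllib.request

def build_query(question: str) -> bytes:
    """Serialize the analyst's question for the inference endpoint."""
    return json.dumps({"question": question}).encode("utf-8")

def ask_engine(question: str, endpoint: str) -> dict:
    """POST a question to the Lambda-backed endpoint and return its JSON reply."""
    req = urllib.request.Request(
        endpoint,
        data=build_query(question),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def run_app() -> None:
    """Streamlit page; launch with `streamlit run app.py`."""
    import streamlit as st  # imported lazily so the helpers stay testable
    st.title("Knowledge Synthesis Engine")
    question = st.text_input("Ask about the indexed reports")
    if question:
        # Placeholder endpoint URL
        reply = ask_engine(question, "https://example.execute-api.us-east-1.amazonaws.com/prod/query")
        st.write(reply.get("answer", "No answer returned."))
```

Keeping the HTTP logic out of the Streamlit callbacks makes the request/response handling testable without a running UI.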
GitLab CI/CD Snippet
A minimal pipeline step for deploying or testing Bedrock inference (yaml):
bedrock_inference_test:
  image: python:3.10
  script:
    - pip install boto3
    - python scripts/test_bedrock_inference.py
  only:
    - main
This ensures the RAG pipeline is validated on every merge.
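For reference, here is a hedged sketch of what scripts/test_bedrock_inference.py could contain. Only the script name comes from the pipeline above; the request-builder, the RUN_LIVE_BEDROCK environment flag, and the live-call step are assumptions. It validates the request body locally and only calls Bedrock when explicitly enabled:

```python
import json
import os

def build_claude_request(prompt: str, max_tokens: int = 300) -> str:
    """Build a Claude 3 Messages-API request body for Bedrock invoke_model."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    })

def main() -> None:
    body = build_claude_request("Summarize the latest GDELT events.")
    parsed = json.loads(body)
    assert parsed["messages"][0]["role"] == "user"  # cheap local smoke test

    # Hypothetical opt-in flag: only hit the live endpoint when set in CI
    if os.environ.get("RUN_LIVE_BEDROCK"):
        import boto3
        client = boto3.client("bedrock-runtime")
        response = client.invoke_model(
            modelId="anthropic.claude-3-sonnet-20240229-v1:0",
            body=body,
        )
        print(json.loads(response["body"].read())["content"][0]["text"])

if __name__ == "__main__":
    main()
```

Splitting the payload construction from the live call lets the merge-request job fail fast on malformed requests without spending inference quota.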