Research
Work organized by lab and institution. Each entry outlines my contributions, methods, and future directions.
DRAM Processing-in-Memory for Fully Homomorphic Encryption
SAFARI Research Group · ETH Zürich
Prof. Onur Mutlu · with Ismail Emir Yuksel, Mayank Kabra
May 2025 – December 2025
Designed DRAM-based processing-in-memory architectures that run fully homomorphic encryption directly in memory, cutting latency and energy for large-scale polynomial multiplication (4096-element polynomials across 131,072 ciphertexts).
Key Contributions
- Designed and benchmarked three DRAM data-placement strategies (FIGARO, LISA+FIGARO, LISA+RowCopy), reaching 318× speedup over pure in-memory FIGARO and 6.06× lower latency than processor-centric baselines.
- Cut memory energy 41.5× with RowClone-optimized polynomial-shift strategies that minimize costly inter-bank data movement across a 16-bank hierarchy.
- Enabled concurrent processing of 131,072 ciphertexts by combining column-granularity (FIGARO) and hierarchical row-layout (LISA) in-memory paradigms with minimal row-buffer contention.
- Extended Ramulator 2.0 and CipherMatch simulators with subarray-level access tracking and RowCopy latency models to validate architectural feasibility.
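To illustrate why shift cost matters for this workload, the sketch below models polynomial multiplication as a sequence of shifts and accumulations. It is a software model only, under assumptions not stated above: a negacyclic ring Z_q[x]/(x^n + 1), which is common in FHE schemes, and a small modulus. The function names `negacyclic_shift` and `shift_add_multiply` are hypothetical, and nothing here models how shifts map onto FIGARO, LISA, or RowClone operations.

```python
def negacyclic_shift(poly, k, q):
    """Multiply poly by x^k in Z_q[x]/(x^n + 1) (assumed ring)."""
    n = len(poly)
    k %= 2 * n  # x^(2n) = 1 in this ring
    out = [0] * n
    for i, coeff in enumerate(poly):
        # Each wrap past degree n-1 flips the sign because x^n = -1.
        wraps, j = divmod(i + k, n)
        sign = -1 if wraps % 2 else 1
        out[j] = (sign * coeff) % q
    return out


def shift_add_multiply(a, b, q):
    """Compute a * b by shifting a once per nonzero coefficient of b."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    acc = [0] * len(a)
    for k, bk in enumerate(b):
        if bk == 0:
            continue  # zero coefficients need no shift at all
        shifted = negacyclic_shift(a, k, q)
        acc = [(x + bk * y) % q for x, y in zip(acc, shifted)]
    return acc


if __name__ == "__main__":
    # (1 + x)^2 = 1 + 2x + x^2 with n = 4, q = 17 (toy parameters)
    print(shift_add_multiply([1, 1, 0, 0], [1, 1, 0, 0], 17))
```

In this model every nonzero coefficient of one operand triggers one full-width shift of the other, so for 4096-element polynomials the shift count grows quickly. That is the kind of data movement a shift-placement strategy would aim to reduce.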
Multimodal Neuromorphic Digit Recognition (NeuTNN)
NeuroAI Computer Architecture Lab (NCAL) · Carnegie Mellon University
Prof. John Shen · with Shanmuga Venkatachalam, Liam Carden
Ongoing
Extended the NeuTNN architecture to classify visual and auditory digits jointly using only biologically plausible R-STDP learning (no backpropagation), building an edge-ready system that holds stable multimodal representations under strict biological and hardware constraints.
Key Contributions
- Lifted audio-only accuracy from 21% to 90.26% with a log-mel spike-encoding pipeline (per-bin median thresholding), with no changes to the architecture or learning rule.
- Achieved 100% multimodal accuracy on a 72-segment model via a block-diagonal segment mask enforcing strict modality separation, stable across 2:1 and 1:1 visual-to-audio ratios.
- Sustained 100% accuracy at 9.7% synaptic density (~460K of 4.75M synapses) through systematic pruning analysis, quantifying the efficiency relevant to NCAL's edge hardware targets.
- Proposed a dynamic confidence-weighting mechanism, which uses the winning body-potential magnitude as an inference-time reliability signal, to improve robustness under single-modality degradation with no retraining or architecture change.
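The confidence-weighting idea in the last bullet can be sketched as follows. This is a minimal illustration, not the NeuTNN implementation. It assumes each modality exposes a list of non-negative per-class body potentials, takes the largest potential as that modality's confidence, and fuses the modalities with weights proportional to confidence. The function name `fuse_by_confidence` and the normalization scheme are hypothetical.

```python
def fuse_by_confidence(visual_potentials, audio_potentials):
    """Fuse two modalities, weighting each by its winning potential.

    Assumes non-negative per-class potentials of equal length.
    Returns (predicted_class_index, weights_by_modality).
    """
    if len(visual_potentials) != len(audio_potentials):
        raise ValueError("both modalities must score the same classes")

    modalities = {"visual": visual_potentials, "audio": audio_potentials}

    # Confidence signal: magnitude of the winning (largest) potential.
    confidence = {name: max(p) for name, p in modalities.items()}
    total = sum(confidence.values())

    if total <= 0:
        # No modality responded; fall back to equal weights.
        weights = {name: 1.0 / len(modalities) for name in modalities}
    else:
        weights = {name: c / total for name, c in confidence.items()}

    n_classes = len(visual_potentials)
    fused = [
        sum(weights[name] * modalities[name][k] for name in modalities)
        for k in range(n_classes)
    ]
    predicted = max(range(n_classes), key=fused.__getitem__)
    return predicted, weights


if __name__ == "__main__":
    # Toy values: a confident visual response and a flat, degraded audio one.
    print(fuse_by_confidence([0.1, 0.9, 0.2], [0.3, 0.25, 0.35]))
```

Because the weights are computed from potentials already available at inference time, a scheme like this needs no retraining: a degraded modality with a weak winning potential simply contributes less to the fused decision.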
3D CNFET Accelerator — Pin Allocation & PCB Bring-Up
Nexus Research Group · Carnegie Mellon University
Tathagata Srimani
January 2026 – Present
Supported bring-up of a monolithic-3D CNFET accelerator within the lab's 3D Integration program, focused on pin allocation and PCB placement-and-routing for the chip's high-speed external test interface.
Key Contributions
- Refined pin allocation across 220+ functional pins over six VHDCI connectors and chip-on-board routing, preserving signal grouping for dual instruction/data buses with zero routing conflicts on the mapped plan.
- Applied SerDes and differential-pair PCB constraints for GHz+ signaling — 100Ω ±10% impedance, <50ps inter-pair skew, 3× spacing for crosstalk isolation.
- Supported DFT pin planning across scan domains and 15+ redundancy configuration pins to preserve yield and diagnostic coverage across VHDCI channels.
- Applied split-voltage power-distribution and ground-stitching constraints (250µm perimeter via spacing, <5% droop targets) for signal integrity under high-speed switching.
Hardware-Aware Sensor Placement for Autonomous Vehicles
CMU ECE — Cyber-Physical Systems (course research)
Embedded Systems: CPS Design
Spring 2026
Optimizes which sensor — lidar, radar, camera, or disabled — fills each of five fixed AV chassis slots under a target SoC's ingest-bandwidth budget, balancing detection rate, time-to-detect, and cost.
Key Contributions (Phase 1 of 4 complete)
- Reframed the problem as hardware/software co-design in which the "software" is the sensor suite and the binding constraint is ingest bandwidth (not memory bandwidth, which was never the limiting factor in v1), making the hardware-aware claim substantive.
- Built a CMA-ES optimizer over a continuous relaxation of the 5-slot assignment, with candidates evaluated in parallel MuJoCo ray-cast simulations honoring per-sensor FOV, range, and latency.
- Defined the objective L = w_acc·detection + w_lat·latency + w_cost·cost with a hard ingest-bandwidth cap and two weight presets (safety_first, efficiency).
- Delivered the Phase 1 evaluation infrastructure: spawn-safe parallel workers, Common Random Numbers for fair within-generation comparison, provenance logging (git SHA, MuJoCo version, host), and declarative RUN_CONFIGS for scenario/platform swaps.
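The objective and the continuous relaxation described above might look roughly like the sketch below. Several details are assumptions rather than the project's actual values: L is minimized, so `w_acc` is negative and higher detection lowers L; a continuous candidate is decoded by taking the argmax within each slot's group of four scores; and the per-sensor bandwidth figures and preset weights are placeholders. In the real pipeline, detection, latency, and cost come from the MuJoCo simulations; here they are passed in as plain numbers.

```python
import math

SENSORS = ("disabled", "lidar", "radar", "camera")

# Hypothetical per-sensor ingest bandwidth in MB/s (placeholder values).
INGEST_MBPS = {"disabled": 0.0, "lidar": 70.0, "radar": 5.0, "camera": 120.0}

# Hypothetical weights; only the preset names come from the project.
PRESETS = {
    "safety_first": {"w_acc": -1.0, "w_lat": 0.5, "w_cost": 0.1},
    "efficiency": {"w_acc": -1.0, "w_lat": 0.2, "w_cost": 0.6},
}


def decode_assignment(x, n_slots=5):
    """Map a continuous vector to one sensor per slot (assumed argmax decode)."""
    k = len(SENSORS)
    if len(x) != n_slots * k:
        raise ValueError(f"expected {n_slots * k} values, got {len(x)}")
    assignment = []
    for slot in range(n_slots):
        scores = x[slot * k:(slot + 1) * k]
        best = max(range(k), key=scores.__getitem__)
        assignment.append(SENSORS[best])
    return tuple(assignment)


def objective(assignment, detection, latency, cost, preset, bandwidth_cap_mbps):
    """L = w_acc*detection + w_lat*latency + w_cost*cost, lower is better.

    Returns math.inf when the suite exceeds the ingest-bandwidth cap,
    so infeasible candidates can never beat feasible ones.
    """
    ingest = sum(INGEST_MBPS[s] for s in assignment)
    if ingest > bandwidth_cap_mbps:
        return math.inf
    w = PRESETS[preset]
    return w["w_acc"] * detection + w["w_lat"] * latency + w["w_cost"] * cost


if __name__ == "__main__":
    suite = decode_assignment([0, 1, 0, 0] * 5)  # lidar in every slot
    print(suite, objective(suite, 0.9, 0.1, 0.8, "safety_first", 300.0))
```

Returning infinity for infeasible suites is one simple way to enforce a hard cap inside an evolutionary optimizer such as CMA-ES; a penalty term or a repair step would be alternative design choices.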
Planned Work & Future Directions
- Phase 2 — Reliability: 30-seed bootstrap confidence intervals; four scenarios (straight, urban-cluttered, highway-speed, adversarial-blind); sensor-noise models for rain (lidar), glare (camera), and multipath ghosts (radar).
- Phase 3 — Ablations & SoC sweep: m_lidar, bandwidth-cap, and w_cost sweeps; a five-platform SoC sweep yielding a "bandwidth needed for a 4-sensor AV stack" figure.
- Phase 4 — Manuscript: replace point estimates with CI-bracketed results and add a hardware-co-design section grounded in the SoC sweep.
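The 30-seed bootstrap confidence intervals planned for Phase 2 could be computed along these lines. This is a sketch under assumptions: it uses a percentile bootstrap of the mean, which is one common choice, and the resample count, the 95% level, and the function name `bootstrap_ci` are illustrative rather than taken from the project plan.

```python
import random
import statistics


def bootstrap_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean.

    samples: one metric value per seed (for example, 30 detection rates).
    Returns (mean, lower, upper).
    """
    if not samples:
        raise ValueError("need at least one sample")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(samples)

    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n))
        for _ in range(n_resamples)
    )

    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return statistics.fmean(samples), lower, upper


if __name__ == "__main__":
    # Toy per-seed detection rates, not real results.
    per_seed = [0.80 + 0.01 * (i % 5) for i in range(30)]
    mean, lower, upper = bootstrap_ci(per_seed)
    print(f"mean={mean:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

Reporting the interval alongside the mean is what lets Phase 4 replace point estimates with CI-bracketed results.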