Conference Program

2024-12-14 (Day 1)
8:45 - 9:00 Opening Remarks: General Chairs and Program Committee Chairs
9:00 - 9:10 ChinaSys Rising Star Award and Outstanding Doctoral Dissertation Award Ceremony
Keynote I
(Session Chair: 章明星, Tsinghua University)
9:10 - 9:50 Industrial Practice of High-Performance Secure Concurrency
Speaker: 付明 (Huawei)
9:50 - 10:30 Architecture Design of Specialized AI Processor Chips for Vision Multimodal Models
Speaker: 梁晓峣 (Shanghai Jiao Tong University)
Tea Break (10:30 - 10:40)
Session 1: Serverless
(Session Chair: 庞浦, Shanghai Jiao Tong University)
10:40 - 10:57 Derm: SLA-aware Resource Management for Highly Dynamic Microservices
Speaker: 陈钌 (University of Macau)
10:57 - 11:14 TrEnv: Transparently Share Serverless Execution Environments Across Different Functions and Nodes
Speaker: 黄嘉良 (Tsinghua University, Alibaba)
11:15 - 12:00 Lightning Talk I
Lunch Break & Poster Session (12:00 - 13:30)
Session 2: Operating Systems
(Session Chair: 杜冬冬, Shanghai Jiao Tong University)
13:30 - 13:47 Understanding the Linux Kernel, Visually
Speaker: 刘瀚之 (Nanjing University)
13:47 - 14:04 Skyloft: A General High-Efficient Scheduling Framework in User Space
Speaker: 田凯夫 (Tsinghua University)
14:04 - 14:21 Fast Core Scheduling with Userspace Process Abstraction
Speaker: 林家桢 (Tsinghua University)
14:21 - 14:38 BULKHEAD: Secure, Scalable, and Efficient Kernel Compartmentalization with PKS
Speaker: 郭迎港 (Nanjing University)
Session 3: Storage Systems
(Session Chair: 张峰, Renmin University of China)
14:38 - 14:55 CHIME: A Cache-Efficient and High-Performance Hybrid Index on Disaggregated Memory
Speaker: 罗旭川 (Fudan University)
14:55 - 15:12 Boosting Data Center Performance via Intelligently Managed Multi-backend Disaggregated Memory
Speaker: 杨涵章 (Shanghai Jiao Tong University)
15:12 - 15:29 NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering
Speaker: 周哲 (Huawei)
15:29 - 15:46 Mirage: Generating Enormous Databases for Complex Workloads
Speaker: 黄煦华 (East China Normal University)
Tea Break (15:46 - 16:00)
Session 4: Machine Learning Systems
(Session Chair: 张钊宁, National University of Defense Technology)
16:00 - 16:17 Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
Speaker: 林超凡 (Tsinghua University)
16:17 - 16:34 Llumnix: Dynamic Scheduling for Large Language Model Serving
Speaker: 赵汉宇 (Alibaba)
16:34 - 16:51 Heterogeneous Collaborative Speculative Decoding: An Acceleration Method for LLM on Personal Devices
Speaker: 张立博 (National University of Defense Technology)
16:51 - 17:08 In-Storage Attention Offloading for Efficient Long-Context LLM Inference
Speaker: 李恩典 (Peking University)
Session 5: Industry
(Session Chair: 马腾, Alibaba)
17:08 - 17:25 OceanBase Paetica: A Hybrid Shared-Nothing/Shared-Everything Database for Supporting Single Machine and Distributed Cluster
Speaker: 徐泉清 (Ant Group)
17:25 - 17:42 Heterogeneous Converged OS: Operating System Innovation and Challenges in the Era of Heterogeneous Intelligent Computing
Speaker: 林飞龙 (Huawei)
17:42 - 17:59 GraphUniverse: A New Paradigm for Data Management and Analytics
Speaker: 林恒 (Ant Group)
Poster Session (18:00 - 18:30)
Banquet (18:30, 2nd-floor banquet hall, 东凯悦酒店)

2024-12-15 (Day 2)
Keynote II
(Session Chair: 宋卓然, Shanghai Jiao Tong University)
9:00 - 9:40 A Low-Overhead Reversible Cache Coherence Protocol
Speaker: 钱学海 (Tsinghua University)
9:40 - 10:20 Enabling Edge-Device Deployment of Foundation Models
Speaker: 曹婷 (Microsoft Research Asia)
Best Presentation Award Ceremony
Tea Break (10:20 - 10:30)
Session 6: Best Paper
(Session Chair: 宋新开, Institute of Computing Technology, CAS)
10:30 - 10:47 Serialization/Deserialization-free State Transfer in Serverless Workflows
Speaker: 魏星达 (Shanghai Jiao Tong University)
10:47 - 11:04 AmgT: Algebraic Multigrid Solver on Tensor Cores
Speaker: 曾礼杰 (China University of Petroleum (Beijing))
11:04 - 11:21 PolarDB-MP: A Multi-Primary Cloud-Native Database via Disaggregated Shared Memory
Speaker: 章颖强 (Alibaba Cloud)
11:21 - 12:00 Lightning Talk II
Lunch Break (12:00 - 13:30)
Session 7: Computer Architecture
(Session Chair: 周哲, Huawei)
13:30 - 13:47 Constructing Block-Interface All-Flash Array with Zoned-Namespace SSDs
Speaker: 彭力 (Peking University)
13:47 - 14:04 ChameleonEC: Exploiting Tunability of Erasure Coding for Low-Interference Repair
Speaker: 蔡煜晖 (Xiamen University)
14:04 - 14:21 A Scalable, Efficient, and Robust Dynamic Memory Management Library for HLS-based FPGAs
Speaker: 王庆刚 (Huazhong University of Science and Technology)
14:21 - 14:38 Gaze into the Pattern: Characterizing Spatial Patterns with Footprint-Internal Correlations for Hardware Prefetching
Speaker: 陈子啸 (Shanghai Jiao Tong University)
Session 8: High-Performance Computing
(Session Chair: 甘一鸣, Institute of Computing Technology, CAS)
14:38 - 14:55 Exploring Hierarchical Patterns for Alert Aggregation in Supercomputers
Speaker: 孙永谦 (Nankai University)
14:55 - 15:12 Gemini: Mapping and Architecture Co-exploration for Large-scale DNN Chiplet Accelerators
Speaker: 蔡经纬 (Tsinghua University)
15:12 - 15:29 Towards Highly Compatible I/O-aware Workflow Scheduling on HPC Systems
Speaker: 唐宇 (National University of Defense Technology)
15:29 - 15:46 RPCAcc: A High-Performance and Reconfigurable PCIe-attached RPC Accelerator
Speaker: 张杰 (Zhejiang University)
Tea Break (15:46 - 16:00)
Session 9: GPU
(Session Chair: 周迪宇, Peking University)
16:00 - 16:17 Efficient 4-bit Matrix Unit via Primitivization
Speaker: 陈亦 (Institute of Computing Technology, CAS)
16:17 - 16:34 DTuner: Efficiently Tuning and Compiling for Dynamic Shape Tensor Programs
Speaker: 刘硕 (University of Science and Technology of China)
16:34 - 16:51 PHOENIXOS: A Concurrent OS-level GPU Checkpoint and Restore System
Speaker: 黄卓彬 (Shanghai Jiao Tong University)
16:51 - 17:08 Hydrogen: Contention-Aware Hybrid Memory for Heterogeneous CPU-GPU Architectures
Speaker: 李一苇 (Tsinghua University)
Session 10: Accelerator
(Session Chair: 张余豪, Tianjin University)
17:08 - 17:25 WarpDrive: GPU-Based Fully Homomorphic Encryption Acceleration Leveraging Tensor and CUDA Cores
Speaker: 范广 (Ant Research)
17:25 - 17:42 ActiveN: A Scalable and Flexibly-programmable Event-driven Neuromorphic Processor
Speaker: 刘晓义 (Tsinghua University)
17:42 - 17:59 Hassert: Hardware Assertion-Based Agile Verification Framework with FPGA Acceleration
Speaker: 张子卿 (Institute of Computing Technology, CAS)
17:59 - 18:16 Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning
Speaker: 翟祎 (University of Science and Technology of China)
Closing (18:16 - 18:21)

2024-12-14 (Day 1)
Lightning Talk I
11:15 - 11:16 Trinity: A General Purpose FHE Accelerator
Speaker: 邓翔龙 (Institute of Information Engineering, CAS)
11:16 - 11:17 SQLStateGuard: Statement-Level SQL Injection Defense Based on Learning-Driven Middleware
Speaker: 王天一 (Lanzhou University)
11:17 - 11:18 Beaver: A High-Performance and Crash-Consistent File System Cache via PM-DRAM Collaborative Memory Tiering
Speaker: 潘庆霖 (Institute of Software, CAS)
11:18 - 11:19 A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules
Speaker: 梁超毅 (Fudan University)
11:19 - 11:20 Swift Unfolding of Communities: GPU-Accelerated Louvain Algorithm
Speaker: 林夕 (Nanjing University)
11:20 - 11:21 Mobilizing underutilized storage nodes via job path: A job-aware file striping approach
Speaker: 鲜港 (China Aerodynamics Research and Development Center)
11:21 - 11:22 EZTopo: An Automatic Custom Topology Generation Framework for Large Scale NoC Design
Speaker: 唐岩 (National University of Defense Technology)
11:22 - 11:23 VertexSurge: Variable Length Graph Pattern Match on Billion-edge Graphs
Speaker: 谢威宇 (Tsinghua University)
11:23 - 11:24 GPU Performance Optimization via Inter-group Cache Cooperation
Speaker: 王国升 (Wuhan University of Technology)
11:24 - 11:25 CROSS: Compiler-Driven Optimization of Sparse DNNs Using Sparse/Dense Computation Kernels
Speaker: 黄世远 (Shanghai Jiao Tong University)
11:25 - 11:26 DDP-Fsim: Efficient and Scalable Fault Simulation for Deterministic Patterns with Two-Dimensional Parallelism
Speaker: 谷丰 (Institute of Computing Technology, CAS)
11:26 - 11:27 Towards Hotness-aware Object Locality Optimization for Memory Tiering
Speaker: 黄瑞哲 (Peking University)
11:27 - 11:28 T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge
Speaker: 魏剑宇 (Microsoft Research Asia)
11:28 - 11:29 Fast State Restoration in LLM Serving with HCache
Speaker: 高世伟 (Tsinghua University)
11:29 - 11:30 UNICO: Unifying Coroutines in Rust
Speaker: 徐启航 (Peking University)
11:30 - 11:31 A Node Failure Anomaly Prediction Method for Supercomputing Systems
Speaker: 赵一宁 (Computer Network Information Center, CAS)
11:31 - 11:32 Scaling Disk Failure Prediction via Multi-Source Stream Mining
Speaker: 韩淑捷 (Northwestern Polytechnical University)
11:32 - 11:33 Sparsification-Driven Accelerator Design for Video Generation Models
Speaker: 刘军 (Shanghai Jiao Tong University)
11:33 - 11:34 TB-STC: Transposable Block-wise N:M Structured Sparse Tensor Core
Speaker: 刘军 (Shanghai Jiao Tong University)
11:34 - 11:35 Scalable Billion-point Approximate Nearest Neighbor Search Using SmartSSDs
Speaker: 田冰 (Huazhong University of Science and Technology)
11:35 - 11:36 Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
Speaker: 杨非 (Zhejiang Lab)
11:36 - 11:37 FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale
Speaker: 朱泽雨 (Institute of Automation, CAS)
11:37 - 11:38 BulkCC: Scaling Concurrent Data Structures to GPU Parallelism
Speaker: 芮轲 (Institute of Computing Technology, CAS)
11:38 - 11:39 SMIless: Serving DAG-based Inference with Dynamic Invocations under Serverless Computing
Speaker: 卢澄志 (University of Macau)
11:39 - 11:40 HyFiSS: A Hybrid Fidelity Stall-Aware Simulator for GPGPUs
Speaker: 杨建超 (National University of Defense Technology)
11:40 - 11:41 M-ANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type
Speaker: 胡洧铭 (Shanghai Jiao Tong University)
11:41 - 11:42 VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference
Speaker: 刘子汉 (Shanghai Jiao Tong University)
11:42 - 11:43 COMPASS: SRAM-Based Computing-in-Memory SNN Accelerator with Adaptive Spike Speculation
Speaker: 汪宗武 (Shanghai Jiao Tong University)
11:43 - 11:44 Scaling Up Memory Disaggregated Applications With SMART
Speaker: 任峰 (Qiyuan Lab)
11:44 - 11:45 Criticality-Aware Instruction-Centric Bandwidth Partitioning for Data Center Applications
Speaker: 朱立人 (Peking University)
11:45 - 11:46 Flame: A Centralized Cache Controller for Serverless Computing
Speaker: 杨亚南 (China Telecom Cloud Computing Research Institute)

2024-12-15 (Day 2)
Lightning Talk II
11:22 - 11:23 Stream-Based Data Placement for Near-Data Processing with Extended Memory
Speaker: 李一苇 (Tsinghua University)
11:23 - 11:24 MV4PG: Materialized Views for Property Graphs
Speaker: 徐柴俊 (University of Science and Technology of China)
11:24 - 11:25 Tribase: A Vector Data Query Engine for Reliable and Lossless Pruning Compression using Triangle Inequalities
Speaker: 许骞 (Renmin University of China)
11:25 - 11:26 SparSynergy: Unlocking Flexible and Efficient DNN Acceleration through Multi-Level Sparsity
Speaker: 杨靖奎 (National University of Defense Technology)
11:26 - 11:27 gVulkan: Scalable GPU Pooling for Pixel-Grained Rendering in Ray Tracing
Speaker: 顾翼成 (Shanghai Jiao Tong University)
11:27 - 11:28 On-demand and Parallel Checkpoint/Restore for GPU Applications
Speaker: 杨彦凝 (Shanghai Jiao Tong University)
11:28 - 11:29 WeiPipe: Weight Pipeline Parallelism for Communication-Effective Long-Context Large Model Training
Speaker: 林俊峰 (Tsinghua University)
11:29 - 11:30 NodeSentry: Unsupervised Anomaly Detection in Production HPC Systems Using Model Sharing
Speaker: 孙永谦 (Nankai University)
11:30 - 11:31 Mille-feuille: A Tile-Grained Mixed Precision Single-Kernel Conjugate Gradient Solver on GPUs
Speaker: 杨德闯 (China University of Petroleum (Beijing))
11:31 - 11:32 Medusa: Accelerating Serverless LLM Inference with Materialization
Speaker: 曾少勋 (Tsinghua University)
11:32 - 11:33 UniNDP: A Unified Compilation and Simulation Tool for Near DRAM Processing Architectures
Speaker: 谢童欣 (Tsinghua University)
11:33 - 11:34 LegoZK: A Dynamically Reconfigurable Accelerator for Zero-Knowledge Proof
Speaker: 杨正帮 (Institute of Information Engineering, CAS)
11:34 - 11:35 Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
Speaker: 张欣怡 (East China Normal University)
11:35 - 11:36 PimPam: Efficient Graph Pattern Matching on Real Processing-in-Memory Hardware
Speaker: 蔡双羽 (Tsinghua University)
11:36 - 11:37 WinAcc: Window-based Acceleration of Neural Networks Using Block Floating Point
Speaker: 鞠鑫 (National University of Defense Technology)
11:37 - 11:38 Topology-aware Preemption for Co-located LLM Workloads
Speaker: 张平 (Baichuan AI)
11:38 - 11:39 EDGaE: Efficient Distributed GNN Training System at Edge
Speaker: 靳松 (Wuhan University)
11:39 - 11:40 Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation
Speaker: 王磊 (Microsoft Research Asia)
11:40 - 11:41 Componentized OS: Flexible and Efficient Architecture for Specialized Operating Systems
Speaker: 蒋周奇 (Huazhong University of Science and Technology)
11:41 - 11:42 DRack: A CXL-Disaggregated Rack Architecture to Boost Inter-Rack Communication
Speaker: 张旭 (Institute of Computing Technology, CAS)
11:42 - 11:43 Improving the Ability of Thermal Radiation Based Hardware Trojan Detection
Speaker: 苏颋 (National University of Defense Technology)
11:43 - 11:44 Constructing a Supplementary Benchmark Suite to Represent Android Applications with User Interactions by Using Performance Counters
Speaker: 欧阳铖浩 (Shenzhen Institute of Advanced Technology, CAS)
11:44 - 11:45 Performance Analysis of Light-weight Memory Isolation Methods
Speaker: 裴辰举 (Jilin University)
11:45 - 11:46 Efficient LLM Inference on GPUs with Operator Optimization and Compilation
Speaker: 许珈铭 (Shanghai Jiao Tong University)
11:46 - 11:47 Detecting Broken Object-Level Authorization Vulnerabilities in Database-Backed Applications
Speaker: 黄永恒 (Institute of Computing Technology, CAS)
11:47 - 11:48 Using Analytical Performance/Power Model and Fine-Grained DVFS to Enhance AI Accelerator Energy Efficiency
Speaker: 王梓博 (Nanjing University)

2024-12-14 (Day 1)
Poster Session
1 Trinity: A General Purpose FHE Accelerator
2 SQLStateGuard: Statement-Level SQL Injection Defense Based on Learning-Driven Middleware
3 Beaver: A High-Performance and Crash-Consistent File System Cache via PM-DRAM Collaborative Memory Tiering
4 A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules
5 Exploring Hierarchical Patterns for Alert Aggregation in Supercomputers
6 Serialization/Deserialization-free State Transfer in Serverless Workflows
7 Swift Unfolding of Communities: GPU-Accelerated Louvain Algorithm
8 Mobilizing underutilized storage nodes via job path: A job-aware file striping approach
9 EZTopo: An Automatic Custom Topology Generation Framework for Large Scale NoC Design
10 Heterogeneous Collaborative Speculative Decoding: An Acceleration Method for LLM on Personal Devices
11 VertexSurge: Variable Length Graph Pattern Match on Billion-edge Graphs
12 Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning
13 GPU Performance Optimization via Inter-group Cache Cooperation
14 CROSS: Compiler-Driven Optimization of Sparse DNNs Using Sparse/Dense Computation Kernels
15 OceanBase Paetica: A Hybrid Shared-Nothing/Shared-Everything Database for Supporting Single Machine and Distributed Cluster
16 CHIME: A Cache-Efficient and High-Performance Hybrid Index on Disaggregated Memory
17 Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
18 DDP-Fsim: Efficient and Scalable Fault Simulation for Deterministic Patterns with Two-Dimensional Parallelism
19 Towards Hotness-aware Object Locality Optimization for Memory Tiering
20 Gemini: Mapping and Architecture Co-exploration for Large-scale DNN Chiplet Accelerators
21 Accelerating Transparent Memory Compression in the Cloud
22 Derm: SLA-aware Resource Management for Highly Dynamic Microservices
23 T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge
24 Fast State Restoration in LLM Serving with HCache
25 RPCAcc: A High-Performance and Reconfigurable PCIe-attached RPC Accelerator
26 UNICO: Unifying Coroutines in Rust
27 A Node Failure Anomaly Prediction Method for Supercomputing Systems
28 Scaling Disk Failure Prediction via Multi-Source Stream Mining
29 Sparsification-Driven Accelerator Design for Video Generation Models
30 TB-STC: Transposable Block-wise N:M Structured Sparse Tensor Core
31 Scalable Billion-point Approximate Nearest Neighbor Search Using SmartSSDs
32 AmgT: Algebraic Multigrid Solver on Tensor Cores
33 A Scalable, Efficient, and Robust Dynamic Memory Management Library for HLS-based FPGAs
34 Boosting Data Center Performance via Intelligently Managed Multi-backend Disaggregated Memory
35 Hydrogen: Contention-Aware Hybrid Memory for Heterogeneous CPU-GPU Architectures
36 Gaze into the Pattern: Characterizing Spatial Patterns with Footprint-Internal Correlations for Hardware Prefetching
37 WarpDrive: GPU-Based Fully Homomorphic Encryption Acceleration Leveraging Tensor and CUDA Cores
38 Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
39 FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale
40 BulkCC: Scaling Concurrent Data Structures to GPU Parallelism
41 Skyloft: A General High-Efficient Scheduling Framework in User Space
42 SMIless: Serving DAG-based Inference with Dynamic Invocations under Serverless Computing
43 NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering
44 HyFiSS: A Hybrid Fidelity Stall-Aware Simulator for GPGPUs
45 M-ANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type
46 PolarDB-MP: A Multi-Primary Cloud-Native Database via Disaggregated Shared Memory
47 VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference
48 COMPASS: SRAM-Based Computing-in-Memory SNN Accelerator with Adaptive Spike Speculation
49 Scaling Up Memory Disaggregated Applications With SMART
50 Mirage: Generating Enormous Databases for Complex Workloads
51 Criticality-Aware Instruction-Centric Bandwidth Partitioning for Data Center Applications
52 Stream-Based Data Placement for Near-Data Processing with Extended Memory
53 MV4PG: Materialized Views for Property Graphs
54 DTuner: Efficiently Tuning and Compiling for Dynamic Shape Tensor Programs
55 Tribase: A Vector Data Query Engine for Reliable and Lossless Pruning Compression using Triangle Inequalities
56 PHOENIXOS: A Concurrent OS-level GPU Checkpoint and Restore System
57 TrEnv: Transparently Share Serverless Execution Environments Across Different Functions and Nodes
58 SparSynergy: Unlocking Flexible and Efficient DNN Acceleration through Multi-Level Sparsity
59 gVulkan: Scalable GPU Pooling for Pixel-Grained Rendering in Ray Tracing
60 In-Storage Attention Offloading for Efficient Long-Context LLM Inference
61 On-demand and Parallel Checkpoint/Restore for GPU Applications
62 WeiPipe: Weight Pipeline Parallelism for Communication-Effective Long-Context Large Model Training
63 NodeSentry: Unsupervised Anomaly Detection in Production HPC Systems Using Model Sharing
64 Understanding the Linux Kernel, Visually
65 Constructing Block-Interface All-Flash Array with Zoned-Namespace SSDs
66 ActiveN: A Scalable and Flexibly-programmable Event-driven Neuromorphic Processor
67 Mille-feuille: A Tile-Grained Mixed Precision Single-Kernel Conjugate Gradient Solver on GPUs
68 Medusa: Accelerating Serverless LLM Inference with Materialization
69 ChameleonEC: Exploiting Tunability of Erasure Coding for Low-Interference Repair
70 UniNDP: A Unified Compilation and Simulation Tool for Near DRAM Processing Architectures
71 LegoZK: A Dynamically Reconfigurable Accelerator for Zero-Knowledge Proof
72 Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
73 PimPam: Efficient Graph Pattern Matching on Real Processing-in-Memory Hardware
74 WinAcc: Window-based Acceleration of Neural Networks Using Block Floating Point
75 BULKHEAD: Secure, Scalable, and Efficient Kernel Compartmentalization with PKS
76 Topology-aware Preemption for Co-located LLM Workloads
77 EDGaE: Efficient Distributed GNN Training System at Edge
78 Hassert: Hardware Assertion-Based Agile Verification Framework with FPGA Acceleration
79 Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation
80 Efficient 4-bit Matrix Unit via Primitivization
81 Componentized OS: Flexible and Efficient Architecture for Specialized Operating Systems
82 DRack: A CXL-Disaggregated Rack Architecture to Boost Inter-Rack Communication
83 Llumnix: Dynamic Scheduling for Large Language Model Serving
84 Flame: A Centralized Cache Controller for Serverless Computing
85 Improving the Ability of Thermal Radiation Based Hardware Trojan Detection
86 Towards Highly Compatible I/O-aware Workflow Scheduling on HPC Systems
87 Constructing a Supplementary Benchmark Suite to Represent Android Applications with User Interactions by Using Performance Counters
88 Fast Core Scheduling with Userspace Process Abstraction
89 Performance Analysis of Light-weight Memory Isolation Methods
90 Efficient LLM Inference on GPUs with Operator Optimization and Compilation
91 Detecting Broken Object-Level Authorization Vulnerabilities in Database-Backed Applications
92 DELTA: Memory-Efficient Training via Dynamic Fine-Grained Recomputation and Swapping
93 Using Analytical Performance/Power Model and Fine-Grained DVFS to Enhance AI Accelerator Energy Efficiency