Conference Program

2024-12-14 (Day 1)
8:45 - 9:00 Opening Remarks: General Chairs and Program Committee Chairs
9:00 - 9:10 ChinaSys Rising Star Award and Outstanding Doctoral Dissertation Award Ceremony
Keynote I
(Session Chair: 章明星, Tsinghua University)
9:10 - 9:50 Industrial Practice of High-Performance Secure Concurrency
Speaker: 付明 (Huawei)
9:50 - 10:30 Architecture Design of Specialized AI Processor Chips for Vision Multimodal Models
Speaker: 梁晓峣 (Shanghai Jiao Tong University)
Tea Break (10:30 - 10:40)
Session 1: Serverless
(Session Chair: 庞浦, Shanghai Jiao Tong University)
10:40 - 10:57 Derm: SLA-aware Resource Management for Highly Dynamic Microservices
Speaker: 陈钌 (University of Macau)
10:57 - 11:14 TrEnv: Transparently Share Serverless Execution Environments Across Different Functions and Nodes
Speaker: 黄嘉良 (Tsinghua University, Alibaba)
11:15 - 12:00 Lightning Talk I
Lunch Break & Poster Session (12:00 - 13:30)
Session 2: Operating Systems
(Session Chair: 杜冬冬, Shanghai Jiao Tong University)
13:30 - 13:47 Understanding the Linux Kernel, Visually
Speaker: 刘瀚之 (Nanjing University)
13:47 - 14:04 Skyloft: A General High-Efficient Scheduling Framework in User Space
Speaker: 田凯夫 (Tsinghua University)
14:04 - 14:21 Fast Core Scheduling with Userspace Process Abstraction
Speaker: 林家桢 (Tsinghua University)
14:21 - 14:38 BULKHEAD: Secure, Scalable, and Efficient Kernel Compartmentalization with PKS
Speaker: 郭迎港 (Nanjing University)
Session 3: Storage Systems
(Session Chair: 张峰, Renmin University of China)
14:38 - 14:55 CHIME: A Cache-Efficient and High-Performance Hybrid Index on Disaggregated Memory
Speaker: 罗旭川 (Fudan University)
14:55 - 15:12 Boosting Data Center Performance via Intelligently Managed Multi-backend Disaggregated Memory
Speaker: 杨涵章 (Shanghai Jiao Tong University)
15:12 - 15:29 NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering
Speaker: 周哲 (Huawei)
15:29 - 15:46 Mirage: Generating Enormous Databases for Complex Workloads
Speaker: 黄煦华 (East China Normal University)
Tea Break (15:46 - 16:00)
Session 4: Machine Learning Systems
(Session Chair: 张钊宁, National University of Defense Technology)
16:00 - 16:17 Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
Speaker: 林超凡 (Tsinghua University)
16:17 - 16:34 Llumnix: Dynamic Scheduling for Large Language Model Serving
Speaker: 赵汉宇 (Alibaba)
16:34 - 16:51 Heterogeneous Collaborative Speculative Decoding: An Acceleration Method for LLM on Personal Devices
Speaker: 张立博 (National University of Defense Technology)
16:51 - 17:08 In-Storage Attention Offloading for Efficient Long-Context LLM Inference
Speaker: 李恩典 (Peking University)
Session 5: Industry
(Session Chair: 马腾, Alibaba)
17:08 - 17:25 OceanBase Paetica: A Hybrid Shared-Nothing/Shared-Everything Database for Supporting Single Machine and Distributed Cluster
Speaker: 徐泉清 (Ant Group)
17:25 - 17:42 Heterogeneous Converged OS: Operating System Innovation and Challenges in the Era of Heterogeneous Intelligent Computing
Speaker: 林飞龙 (Huawei)
17:42 - 17:59 GraphUniverse: A New Paradigm for Data Management and Analytics
Speaker: 林恒 (Ant Group)
Poster Session (18:00 - 18:30)
Banquet (18:30, 2nd-floor banquet hall, 东凯悦酒店)

2024-12-15 (Day 2)
Keynote II
(Session Chair: 宋卓然, Shanghai Jiao Tong University)
9:00 - 9:40 A Low-Overhead Reversible Cache Coherence Protocol
Speaker: 钱学海 (Tsinghua University)
9:40 - 10:20 Enabling Edge-Device Deployment of Foundation Models
Speaker: 曹婷 (Microsoft Research Asia)
Best Presentation Award Ceremony
Tea Break (10:20 - 10:30)
Session 6: Best Paper
(Session Chair: 宋新开, Institute of Computing Technology, CAS)
10:30 - 10:47 Serialization/Deserialization-free State Transfer in Serverless Workflows
Speaker: 魏星达 (Shanghai Jiao Tong University)
10:47 - 11:04 AmgT: Algebraic Multigrid Solver on Tensor Cores
Speaker: 曾礼杰 (China University of Petroleum (Beijing))
11:04 - 11:21 PolarDB-MP: A Multi-Primary Cloud-Native Database via Disaggregated Shared Memory
Speaker: 章颖强 (Alibaba Cloud)
11:21 - 12:00 Lightning Talk II
Lunch Break (12:00 - 13:30)
Session 7: Computer Architecture
(Session Chair: 周哲, Huawei)
13:30 - 13:47 Constructing Block-Interface All-Flash Array with Zoned-Namespace SSDs
Speaker: 彭力 (Peking University)
13:47 - 14:04 ChameleonEC: Exploiting Tunability of Erasure Coding for Low-Interference Repair
Speaker: 蔡煜晖 (Xiamen University)
14:04 - 14:21 A Scalable, Efficient, and Robust Dynamic Memory Management Library for HLS-based FPGAs
Speaker: 王庆刚 (Huazhong University of Science and Technology)
14:21 - 14:38 Gaze into the Pattern: Characterizing Spatial Patterns with Footprint-Internal Correlations for Hardware Prefetching
Speaker: 陈子啸 (Shanghai Jiao Tong University)
Session 8: High-Performance Computing
(Session Chair: 甘一鸣, Institute of Computing Technology, CAS)
14:38 - 14:55 Exploring Hierarchical Patterns for Alert Aggregation in Supercomputers
Speaker: 孙永谦 (Nankai University)
14:55 - 15:12 Gemini: Mapping and Architecture Co-exploration for Large-scale DNN Chiplet Accelerators
Speaker: 蔡经纬 (Tsinghua University)
15:12 - 15:29 Towards Highly Compatible I/O-aware Workflow Scheduling on HPC Systems
Speaker: 唐宇 (National University of Defense Technology)
15:29 - 15:46 RPCAcc: A High-Performance and Reconfigurable PCIe-attached RPC Accelerator
Speaker: 张杰 (Zhejiang University)
Tea Break (15:46 - 16:00)
Session 9: GPU
(Session Chair: 周迪宇, Peking University)
16:00 - 16:17 Efficient 4-bit Matrix Unit via Primitivization
Speaker: 陈亦 (Institute of Computing Technology, CAS)
16:17 - 16:34 DTuner: Efficiently Tuning and Compiling for Dynamic Shape Tensor Programs
Speaker: 刘硕 (University of Science and Technology of China)
16:34 - 16:51 PHOENIXOS: A Concurrent OS-level GPU Checkpoint and Restore System
Speaker: 黄卓彬 (Shanghai Jiao Tong University)
16:51 - 17:08 Hydrogen: Contention-Aware Hybrid Memory for Heterogeneous CPU-GPU Architectures
Speaker: 李一苇 (Tsinghua University)
Session 10: Accelerator
(Session Chair: 张余豪, Tianjin University)
17:08 - 17:25 WarpDrive: GPU-Based Fully Homomorphic Encryption Acceleration Leveraging Tensor and CUDA Cores
Speaker: 范广 (Ant Research)
17:25 - 17:42 ActiveN: A Scalable and Flexibly-programmable Event-driven Neuromorphic Processor
Speaker: 刘晓义 (Tsinghua University)
17:42 - 17:59 Hassert: Hardware Assertion-Based Agile Verification Framework with FPGA Acceleration
Speaker: 张子卿 (Institute of Computing Technology, CAS)
17:59 - 18:16 Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning
Speaker: 翟祎 (University of Science and Technology of China)
Closing (18:16 - 18:21)

2024-12-14 (Day 1)
Lightning Talk I
11:15 - 11:16 Trinity: A General Purpose FHE Accelerator
Speaker: 邓翔龙 (Institute of Information Engineering, CAS)
11:16 - 11:17 SQLStateGuard: Statement-Level SQL Injection Defense Based on Learning-Driven Middleware
Speaker: 王天一 (Lanzhou University)
11:17 - 11:18 Beaver: A High-Performance and Crash-Consistent File System Cache via PM-DRAM Collaborative Memory Tiering
Speaker: 潘庆霖 (Institute of Software, CAS)
11:18 - 11:19 A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules
Speaker: 梁超毅 (Fudan University)
11:19 - 11:20 Swift Unfolding of Communities: GPU-Accelerated Louvain Algorithm
Speaker: 林夕 (Nanjing University)
11:20 - 11:21 Mobilizing underutilized storage nodes via job path: A job-aware file striping approach
Speaker: 鲜港 (China Aerodynamics Research and Development Center)
11:21 - 11:22 EZTopo: An Automatic Custom Topology Generation Framework for Large Scale NoC Design
Speaker: 唐岩 (National University of Defense Technology)
11:22 - 11:23 VertexSurge: Variable Length Graph Pattern Match on Billion-edge Graphs
Speaker: 谢威宇 (Tsinghua University)
11:23 - 11:24 GPU Performance Optimization via Inter-group Cache Cooperation
Speaker: 王国升 (Wuhan University of Technology)
11:24 - 11:25 CROSS: Compiler-Driven Optimization of Sparse DNNs Using Sparse/Dense Computation Kernels
Speaker: 黄世远 (Shanghai Jiao Tong University)
11:25 - 11:26 DDP-Fsim: Efficient and Scalable Fault Simulation for Deterministic Patterns with Two-Dimensional Parallelism
Speaker: 谷丰 (Institute of Computing Technology, CAS)
11:26 - 11:27 Towards Hotness-aware Object Locality Optimization for Memory Tiering
Speaker: 黄瑞哲 (Peking University)
11:27 - 11:28 T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge
Speaker: 魏剑宇 (Microsoft Research Asia)
11:28 - 11:29 Fast State Restoration in LLM Serving with HCache
Speaker: 高世伟 (Tsinghua University)
11:29 - 11:30 UNICO: Unifying Coroutines in Rust
Speaker: 徐启航 (Peking University)
11:30 - 11:31 A Node Failure Anomaly Prediction Method for Supercomputing Systems
Speaker: 赵一宁 (Computer Network Information Center, CAS)
11:31 - 11:32 Scaling Disk Failure Prediction via Multi-Source Stream Mining
Speaker: 韩淑捷 (Northwestern Polytechnical University)
11:32 - 11:33 Sparsification-Driven Accelerator Design for Video Generation Models
Speaker: 刘军 (Shanghai Jiao Tong University)
11:33 - 11:34 TB-STC: Transposable Block-wise N:M Structured Sparse Tensor Core
Speaker: 刘军 (Shanghai Jiao Tong University)
11:34 - 11:35 Scalable Billion-point Approximate Nearest Neighbor Search Using SmartSSDs
Speaker: 田冰 (Huazhong University of Science and Technology)
11:35 - 11:36 Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
Speaker: 杨非 (Zhejiang Lab)
11:36 - 11:37 FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale
Speaker: 朱泽雨 (Institute of Automation, CAS)
11:37 - 11:38 BulkCC: Scaling Concurrent Data Structures to GPU Parallelism
Speaker: 芮轲 (Institute of Computing Technology, CAS)
11:38 - 11:39 SMIless: Serving DAG-based Inference with Dynamic Invocations under Serverless Computing
Speaker: 卢澄志 (University of Macau)
11:39 - 11:40 HyFiSS: A Hybrid Fidelity Stall-Aware Simulator for GPGPUs
Speaker: 杨建超 (National University of Defense Technology)
11:40 - 11:41 M-ANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type
Speaker: 胡洧铭 (Shanghai Jiao Tong University)
11:41 - 11:42 VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference
Speaker: 刘子汉 (Shanghai Jiao Tong University)
11:42 - 11:43 COMPASS: SRAM-Based Computing-in-Memory SNN Accelerator with Adaptive Spike Speculation
Speaker: 汪宗武 (Shanghai Jiao Tong University)
11:43 - 11:44 Scaling Up Memory Disaggregated Applications With SMART
Speaker: 任峰 (Qiyuan Lab)
11:44 - 11:45 Criticality-Aware Instruction-Centric Bandwidth Partitioning for Data Center Applications
Speaker: 朱立人 (Peking University)
11:45 - 11:46 Flame: A Centralized Cache Controller for Serverless Computing
Speaker: 杨亚南 (China Telecom Cloud Computing Research Institute)

2024-12-15 (Day 2)
Lightning Talk II
11:22 - 11:23 Stream-Based Data Placement for Near-Data Processing with Extended Memory
Speaker: 李一苇 (Tsinghua University)
11:23 - 11:24 MV4PG: Materialized Views for Property Graphs
Speaker: 徐柴俊 (University of Science and Technology of China)
11:24 - 11:25 Tribase: A Vector Data Query Engine for Reliable and Lossless Pruning Compression using Triangle Inequalities
Speaker: 许骞 (Renmin University of China)
11:25 - 11:26 SparSynergy: Unlocking Flexible and Efficient DNN Acceleration through Multi-Level Sparsity
Speaker: 杨靖奎 (National University of Defense Technology)
11:26 - 11:27 gVulkan: Scalable GPU Pooling for Pixel-Grained Rendering in Ray Tracing
Speaker: 顾翼成 (Shanghai Jiao Tong University)
11:27 - 11:28 On-demand and Parallel Checkpoint/Restore for GPU Applications
Speaker: 杨彦凝 (Shanghai Jiao Tong University)
11:28 - 11:29 WeiPipe: Weight Pipeline Parallelism for Communication-Effective Long-Context Large Model Training
Speaker: 林俊峰 (Tsinghua University)
11:29 - 11:30 NodeSentry: Unsupervised Anomaly Detection in Production HPC Systems Using Model Sharing
Speaker: 孙永谦 (Nankai University)
11:30 - 11:31 Mille-feuille: A Tile-Grained Mixed Precision Single-Kernel Conjugate Gradient Solver on GPUs
Speaker: 杨德闯 (China University of Petroleum (Beijing))
11:31 - 11:32 Medusa: Accelerating Serverless LLM Inference with Materialization
Speaker: 曾少勋 (Tsinghua University)
11:32 - 11:33 UniNDP: A Unified Compilation and Simulation Tool for Near DRAM Processing Architectures
Speaker: 谢童欣 (Tsinghua University)
11:33 - 11:34 LegoZK: A Dynamically Reconfigurable Accelerator for Zero-Knowledge Proof
Speaker: 杨正帮 (Institute of Information Engineering, CAS)
11:34 - 11:35 Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
Speaker: 张欣怡 (East China Normal University)
11:35 - 11:36 PimPam: Efficient Graph Pattern Matching on Real Processing-in-Memory Hardware
Speaker: 蔡双羽 (Tsinghua University)
11:36 - 11:37 WinAcc: Window-based Acceleration of Neural Networks Using Block Floating Point
Speaker: 鞠鑫 (National University of Defense Technology)
11:37 - 11:38 Topology-aware Preemption for Co-located LLM Workloads
Speaker: 张平 (Baichuan AI)
11:38 - 11:39 EDGaE: Efficient Distributed GNN Training System at Edge
Speaker: 靳松 (Wuhan University)
11:39 - 11:40 Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation
Speaker: 王磊 (Microsoft Research Asia)
11:40 - 11:41 Componentized OS: Flexible and Efficient Architecture for Specialized Operating Systems
Speaker: 蒋周奇 (Huazhong University of Science and Technology)
11:41 - 11:42 DRack: A CXL-Disaggregated Rack Architecture to Boost Inter-Rack Communication
Speaker: 张旭 (Institute of Computing Technology, CAS)
11:42 - 11:43 Improving the Ability of Thermal Radiation Based Hardware Trojan Detection
Speaker: 苏颋 (National University of Defense Technology)
11:43 - 11:44 Constructing a Supplementary Benchmark Suite to Represent Android Applications with User Interactions by Using Performance Counters
Speaker: 欧阳铖浩 (Shenzhen Institute of Advanced Technology, CAS)
11:44 - 11:45 Performance Analysis of Light-weight Memory Isolation Methods
Speaker: 裴辰举 (Jilin University)
11:45 - 11:46 Efficient LLM Inference on GPUs with Operator Optimization and Compilation
Speaker: 许珈铭 (Shanghai Jiao Tong University)
11:46 - 11:47 Detecting Broken Object-Level Authorization Vulnerabilities in Database-Backed Applications
Speaker: 黄永恒 (Institute of Computing Technology, CAS)
11:47 - 11:48 Using Analytical Performance/Power Model and Fine-Grained DVFS to Enhance AI Accelerator Energy Efficiency
Speaker: 王梓博 (Nanjing University)

2024-12-14 (Day 1)
Poster Session
1 Trinity: A General Purpose FHE Accelerator
2 SQLStateGuard: Statement-Level SQL Injection Defense Based on Learning-Driven Middleware
3 Beaver: A High-Performance and Crash-Consistent File System Cache via PM-DRAM Collaborative Memory Tiering
4 A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules
5 Exploring Hierarchical Patterns for Alert Aggregation in Supercomputers
6 Serialization/Deserialization-free State Transfer in Serverless Workflows
7 Swift Unfolding of Communities: GPU-Accelerated Louvain Algorithm
8 Mobilizing underutilized storage nodes via job path: A job-aware file striping approach
9 EZTopo: An Automatic Custom Topology Generation Framework for Large Scale NoC Design
10 Heterogeneous Collaborative Speculative Decoding: An Acceleration Method for LLM on Personal Devices
11 VertexSurge: Variable Length Graph Pattern Match on Billion-edge Graphs
12 Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning
13 GPU Performance Optimization via Inter-group Cache Cooperation
14 CROSS: Compiler-Driven Optimization of Sparse DNNs Using Sparse/Dense Computation Kernels
15 OceanBase Paetica: A Hybrid Shared-Nothing/Shared-Everything Database for Supporting Single Machine and Distributed Cluster
16 CHIME: A Cache-Efficient and High-Performance Hybrid Index on Disaggregated Memory
17 Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
18 DDP-Fsim: Efficient and Scalable Fault Simulation for Deterministic Patterns with Two-Dimensional Parallelism
19 Towards Hotness-aware Object Locality Optimization for Memory Tiering
20 Gemini: Mapping and Architecture Co-exploration for Large-scale DNN Chiplet Accelerators
21 Accelerating Transparent Memory Compression in the Cloud
22 Derm: SLA-aware Resource Management for Highly Dynamic Microservices
23 T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge
24 Fast State Restoration in LLM Serving with HCache
25 RPCAcc: A High-Performance and Reconfigurable PCIe-attached RPC Accelerator
26 UNICO: Unifying Coroutines in Rust
27 A Node Failure Anomaly Prediction Method for Supercomputing Systems
28 Scaling Disk Failure Prediction via Multi-Source Stream Mining
29 Sparsification-Driven Accelerator Design for Video Generation Models
30 TB-STC: Transposable Block-wise N:M Structured Sparse Tensor Core
31 Scalable Billion-point Approximate Nearest Neighbor Search Using SmartSSDs
32 AmgT: Algebraic Multigrid Solver on Tensor Cores
33 A Scalable, Efficient, and Robust Dynamic Memory Management Library for HLS-based FPGAs
34 Boosting Data Center Performance via Intelligently Managed Multi-backend Disaggregated Memory
35 Hydrogen: Contention-Aware Hybrid Memory for Heterogeneous CPU-GPU Architectures
36 Gaze into the Pattern: Characterizing Spatial Patterns with Footprint-Internal Correlations for Hardware Prefetching
37 WarpDrive: GPU-Based Fully Homomorphic Encryption Acceleration Leveraging Tensor and CUDA Cores
38 Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
39 FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale
40 BulkCC: Scaling Concurrent Data Structures to GPU Parallelism
41 Skyloft: A General High-Efficient Scheduling Framework in User Space
42 SMIless: Serving DAG-based Inference with Dynamic Invocations under Serverless Computing
43 NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering
44 HyFiSS: A Hybrid Fidelity Stall-Aware Simulator for GPGPUs
45 M-ANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type
46 PolarDB-MP: A Multi-Primary Cloud-Native Database via Disaggregated Shared Memory
47 VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference
48 COMPASS: SRAM-Based Computing-in-Memory SNN Accelerator with Adaptive Spike Speculation
49 Scaling Up Memory Disaggregated Applications With SMART
50 Mirage: Generating Enormous Databases for Complex Workloads
51 Criticality-Aware Instruction-Centric Bandwidth Partitioning for Data Center Applications
52 Stream-Based Data Placement for Near-Data Processing with Extended Memory
53 MV4PG: Materialized Views for Property Graphs
54 DTuner: Efficiently Tuning and Compiling for Dynamic Shape Tensor Programs
55 Tribase: A Vector Data Query Engine for Reliable and Lossless Pruning Compression using Triangle Inequalities
56 PHOENIXOS: A Concurrent OS-level GPU Checkpoint and Restore System
57 TrEnv: Transparently Share Serverless Execution Environments Across Different Functions and Nodes
58 SparSynergy: Unlocking Flexible and Efficient DNN Acceleration through Multi-Level Sparsity
59 gVulkan: Scalable GPU Pooling for Pixel-Grained Rendering in Ray Tracing
60 In-Storage Attention Offloading for Efficient Long-Context LLM Inference
61 On-demand and Parallel Checkpoint/Restore for GPU Applications
62 WeiPipe: Weight Pipeline Parallelism for Communication-Effective Long-Context Large Model Training
63 NodeSentry: Unsupervised Anomaly Detection in Production HPC Systems Using Model Sharing
64 Understanding the Linux Kernel, Visually
65 Constructing Block-Interface All-Flash Array with Zoned-Namespace SSDs
66 ActiveN: A Scalable and Flexibly-programmable Event-driven Neuromorphic Processor
67 Mille-feuille: A Tile-Grained Mixed Precision Single-Kernel Conjugate Gradient Solver on GPUs
68 Medusa: Accelerating Serverless LLM Inference with Materialization
69 ChameleonEC: Exploiting Tunability of Erasure Coding for Low-Interference Repair
70 UniNDP: A Unified Compilation and Simulation Tool for Near DRAM Processing Architectures
71 LegoZK: A Dynamically Reconfigurable Accelerator for Zero-Knowledge Proof
72 Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
73 PimPam: Efficient Graph Pattern Matching on Real Processing-in-Memory Hardware
74 WinAcc: Window-based Acceleration of Neural Networks Using Block Floating Point
75 BULKHEAD: Secure, Scalable, and Efficient Kernel Compartmentalization with PKS
76 Topology-aware Preemption for Co-located LLM Workloads
77 EDGaE: Efficient Distributed GNN Training System at Edge
78 Hassert: Hardware Assertion-Based Agile Verification Framework with FPGA Acceleration
79 Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation
80 Efficient 4-bit Matrix Unit via Primitivization
81 Componentized OS: Flexible and Efficient Architecture for Specialized Operating Systems
82 DRack: A CXL-Disaggregated Rack Architecture to Boost Inter-Rack Communication
83 Llumnix: Dynamic Scheduling for Large Language Model Serving
84 Flame: A Centralized Cache Controller for Serverless Computing
85 Improving the Ability of Thermal Radiation Based Hardware Trojan Detection
86 Towards Highly Compatible I/O-aware Workflow Scheduling on HPC Systems
87 Constructing a Supplementary Benchmark Suite to Represent Android Applications with User Interactions by Using Performance Counters
88 Fast Core Scheduling with Userspace Process Abstraction
89 Performance Analysis of Light-weight Memory Isolation Methods
90 Efficient LLM Inference on GPUs with Operator Optimization and Compilation
91 Detecting Broken Object-Level Authorization Vulnerabilities in Database-Backed Applications
92 DELTA: Memory-Efficient Training via Dynamic Fine-Grained Recomputation and Swapping
93 Using Analytical Performance/Power Model and Fine-Grained DVFS to Enhance AI Accelerator Energy Efficiency