My research interests lie in Natural Language Processing (NLP), multi-modal learning, and their applications. I'm particularly interested in efficient and controllable generation (e.g., unsupervised, plug-and-play), multi-modal interaction (e.g., visual dialogue, captioning), and the combination of the two. My ultimate goal is to empower any off-the-shelf language model with the ability to understand real-world experiences and interact with people.
Specifically, I'm now working on Controllable, Efficient, and Multimodal Text Generation. I'm also interested in open problems in LLM-based models.
I am open to research collaborations. I am also looking for potential internship positions for the summer of 2025.
We evaluated 22 recent vision-language models from 6 developers with the HELM framework, measuring their visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. Like HELM, we release all the prompts and raw predictions on our website for full transparency. VHELM is intended to be a living benchmark: we hope to continue adding new datasets, models, and metrics over time.
We benchmarked OpenAI's o1(-preview) on 37 medical datasets; the model outperformed GPT-4 by 6.2% in diagnostic accuracy. We identified strengths and areas for growth in AI's clinical reasoning and provided comprehensive discussion and analysis.
Our recaptioning pipeline is simple: we first fine-tune a LLaMA-3-8B-powered LLaVA-1.5 model, then employ it to recaption ~1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models.
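At its core, the pipeline is a large batched captioning loop. Below is a minimal sketch of that step, assuming the fine-tuned captioner is exposed through Hugging Face's generic image-to-text pipeline; the model id and shard path are hypothetical placeholders, not the released artifacts.

```python
# A minimal sketch of the recaptioning step, not the paper's exact code.
# Assumes the fine-tuned LLaVA-style captioner is exposed through Hugging Face's
# generic image-to-text pipeline; the model id and shard path are hypothetical.
from pathlib import Path
from transformers import pipeline

captioner = pipeline("image-to-text", model="your-org/llava-llama3-recaptioner")

def recaption_shard(image_dir: str):
    """Yield (filename, new_caption) pairs for every image in one shard."""
    for path in sorted(Path(image_dir).glob("*.jpg")):
        out = captioner(str(path), max_new_tokens=128)
        yield path.name, out[0]["generated_text"]

for name, caption in recaption_shard("datacomp_shard_00000"):
    print(name, caption)
```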
This paper delves into the pivotal questions of our proximity to AGI and the strategies necessary for its realization through extensive surveys, discussions, and original perspectives. We start by articulating the requisite capability frameworks for AGI, integrating the internal, interface, and system dimensions.
We present Eagle (RWKV-5) and Finch (RWKV-6). Our architectural design advancements include multiheaded matrix-valued states and a dynamic recurrence mechanism that improve expressivity while maintaining the inference efficiency characteristics of RNNs.
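To make the state design concrete, here is a toy single-head sketch of a matrix-valued recurrence in the spirit of Eagle/Finch: the per-channel decay `w` is a static learned parameter in Eagle and data-dependent in Finch (the dynamic recurrence). This is an illustration only, not the actual implementation.

```python
# Toy single-head sketch of a matrix-valued recurrent state, in the spirit of
# Eagle/Finch. Illustrative only: the real models add further terms (e.g., a
# bonus for the current token) and run as fused kernels, not Python loops.
import torch

def matrix_state_recurrence(r, k, v, w):
    """r, k, v, w: (T, d) tensors. w is the per-channel decay: a static learned
    parameter in Eagle (RWKV-5), data-dependent in Finch (RWKV-6)."""
    T, d = k.shape
    S = torch.zeros(d, d)                    # matrix-valued state, not a vector
    outputs = []
    for t in range(T):
        S = torch.diag(w[t]) @ S + torch.outer(k[t], v[t])  # decay + rank-1 write
        outputs.append(r[t] @ S)             # read the state with the receptance
    return torch.stack(outputs)              # (T, d)

r, k, v, w = torch.rand(4, 16, 8)            # T=16 steps, d=8 channels
y = matrix_state_recurrence(r, k, v, w)
```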
We introduce a comprehensive safety evaluation suite, covering both out-of-distribution (OOD) generalization with four new datasets and adversarial robustness with one novel attack and two existing attack strategies.
We propose LayerNorm tuning, a simple yet effective strategy for fine-tuning MLLMs. Compared to LoRA tuning, LayerNorm tuning reduces the trainable parameters by a significant 41.9% while improving model performance by 20%.
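The recipe is easy to reproduce in PyTorch: freeze the whole model, then re-enable gradients only for the LayerNorm affine parameters. A minimal sketch (GPT-2 stands in for an MLLM backbone purely for illustration):

```python
# A minimal sketch of LayerNorm tuning: freeze the whole pretrained model, then
# re-enable gradients only for the LayerNorm affine parameters.
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for an MLLM

for param in model.parameters():
    param.requires_grad = False              # freeze everything

for module in model.modules():
    if isinstance(module, nn.LayerNorm):     # ...except LayerNorm weights/biases
        for param in module.parameters():
            param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```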
Two of the most fine-grained multimodal dialogue datasets to date, with entity- and turn-level images, built on Wizard of Wikipedia and DailyDialog, together with a unified multimodal dialogue system supporting either a shared or separate encoder-decoder setup.
Without any explicit prompting for truthful or ethical behavior, simply tuning an LLM on multi-modal instruction datasets leads to noticeable improvements on the TruthfulQA and Ethics benchmarks.
A zero-shot text generation model controlled by visual and textual signals, without extra training on images. ZeroGen achieves SOTA performance on three vision-language tasks (two captioning tasks and controllable news generation).
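As a rough illustration of this plug-and-play family of methods (not necessarily ZeroGen's exact scoring rule), one can rescore the language model's top next-token candidates with a frozen CLIP image-text similarity, so the image guides decoding without any training on images:

```python
# Schematic sketch of plug-and-play, vision-guided decoding: rescore the LM's
# top next-token candidates with a frozen CLIP image-text score, so no training
# on images is needed. Illustrative of the idea, not ZeroGen's exact objective.
import torch
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

lm = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def next_token(prefix_ids, image, alpha=0.5, k=20):
    """prefix_ids: (1, T) token ids; image: a PIL image guiding the generation."""
    lm_logits = lm(prefix_ids).logits[0, -1]
    top = lm_logits.topk(k)
    # Decode each candidate continuation and score it against the image.
    texts = [tok.decode(torch.cat([prefix_ids[0], c.view(1)])) for c in top.indices]
    inputs = proc(text=texts, images=image, return_tensors="pt",
                  padding=True, truncation=True)
    clip_scores = clip(**inputs).logits_per_image[0]          # (k,) similarities
    mixed = (1 - alpha) * top.values.log_softmax(-1) + alpha * clip_scores.log_softmax(-1)
    return top.indices[mixed.argmax()]                        # greedy mixed choice
```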
A VAE model for unsupervised topic modeling and controllable text generation (CTG). It employs two continuous latent spaces, with a conditional dependency between them, for topic and sequence modeling. The model builds the sequence latent space with a series of flexible Householder transformations to create plausible content.
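For background, a Householder transformation is an orthogonal reflection, and chaining several of them is a cheap way to enrich a diagonal Gaussian posterior. A minimal statement of one step (general form, not the paper's exact parameterization):

```latex
% One Householder step: reflect the latent z about the hyperplane orthogonal to v_t.
H_t = I - 2\,\frac{v_t v_t^{\top}}{\lVert v_t \rVert^{2}}, \qquad z_t = H_t\, z_{t-1}
% Each H_t is orthogonal (H_t^{\top} H_t = I), so chaining T reflections enriches
% a diagonal Gaussian posterior while preserving volume.
```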
A model-agnostic framework for flexible, semi-supervised, and controllable text generation. The framework is “plug-and-play”: only a small portion of the pre-trained model's parameters needs to be fine-tuned.
The first large VAE model built on adaptive, parameter-efficient PLMs that can be optimized with minimal trainable parameters. Latent Attention is proposed to better construct the VAE's latent space from the transformer encoder. AdaVAE achieves competitive performance in language modeling and low-resource classification with only 14.66% of parameters activated.
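One plausible way to realize such attention-based latent construction is to let a learned query attend over the encoder's hidden states and pool them into posterior parameters; the sketch below illustrates the idea, and the class and names are hypothetical rather than AdaVAE's actual code.

```python
# Minimal sketch of attention-based latent construction for a VAE: a learned
# query attends over transformer encoder states and pools them into posterior
# parameters. The class and names are hypothetical, not AdaVAE's actual code.
import torch
import torch.nn as nn

class LatentAttentionPooler(nn.Module):
    def __init__(self, hidden: int, latent: int, heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, hidden))   # learned query
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)

    def forward(self, encoder_states):                          # (B, T, hidden)
        q = self.query.expand(encoder_states.size(0), -1, -1)
        pooled, _ = self.attn(q, encoder_states, encoder_states)
        pooled = pooled.squeeze(1)                              # (B, hidden)
        return self.to_mu(pooled), self.to_logvar(pooled)

mu, logvar = LatentAttentionPooler(hidden=768, latent=32)(torch.randn(2, 16, 768))
```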
This survey gives an introduction to existing generation schemes and the problems associated with text auto-encoders, a review of several controllable-generation applications that instantiate these general formulations, and a discussion of future research.
A dataset called Stego-Sandbox that simulates real social-network scenarios, and an effective linguistic steganalysis framework integrating linguistic and contextual features.
Experiences
Sep. 2024 - Present,
VLAA Lab, UC Santa Cruz,
Ph.D. Student, Multimodal & AI Safety.
Aug. 2021 - Jun. 2024,
University of Chinese Academy of Sciences,