My research interests lie in Natural Language Processing (NLP), multi-modal learning, and their applications. I'm particularly interested in efficient and controllable generation (e.g., unsupervised, plug-and-play), multi-modal interaction (e.g., visual dialogue, captioning), and the combination of the two. My ultimate goal is to empower any off-the-shelf language model with the ability to understand real-world experiences and interact with people.
Specifically, I'm now working on Controllable, Efficient, and Multimodal Text Generation. I'm also interested in open problems in LLM-based models.
I am open to research collaborations. I am also looking for potential internship positions for the summer of 2025.
We evaluated 22 recent vision-language models from 6 developers with the HELM framework, measuring their visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. Like HELM, we release all the prompts and raw predictions on our website for full transparency. VHELM is intended to be a living benchmark: we hope to continue adding new datasets, models, and metrics over time.
We benchmarked OpenAI's o1(-preview) on 37 medical datasets; the model outperformed GPT-4 by 6.2% in diagnostic accuracy. We identified strengths and areas for growth in AI's clinical reasoning and provided comprehensive discussion and analysis.
Our recaptioning pipeline is simple: we first fine-tune a LLaMA-3-8B-powered LLaVA-1.5 model, then employ it to recaption ~1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models.
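At its core, the pipeline is a large batched captioning loop. Below is a minimal sketch of that step, assuming the fine-tuned captioner is exposed through Hugging Face's generic image-to-text pipeline; the model id and shard path are hypothetical placeholders, not the released artifacts.

```python
# A minimal sketch of the recaptioning step, not the paper's exact code.
# Assumes the fine-tuned LLaVA-style captioner is exposed through Hugging Face's
# generic image-to-text pipeline; the model id and shard path are hypothetical.
from pathlib import Path
from transformers import pipeline

captioner = pipeline("image-to-text", model="your-org/llava-llama3-recaptioner")

def recaption_shard(image_dir: str):
    """Yield (filename, new_caption) pairs for every image in one shard."""
    for path in sorted(Path(image_dir).glob("*.jpg")):
        out = captioner(str(path), max_new_tokens=128)
        yield path.name, out[0]["generated_text"]

for name, caption in recaption_shard("datacomp_shard_00000"):
    print(name, caption)
```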
This paper delves into the pivotal questions of our proximity to AGI and the strategies necessary for its realization through extensive surveys, discussions, and original perspectives. We start by articulating the requisite capability frameworks for AGI, integrating the internal, interface, and system dimensions.
We present Eagle (RWKV-5) and Finch (RWKV-6). Our architectural design advancements include multiheaded matrix-valued states and a dynamic recurrence mechanism that improve expressivity while maintaining the inference efficiency characteristics of RNNs.
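To make the state design concrete, here is a toy single-head sketch of a matrix-valued recurrence in the spirit of Eagle/Finch: the per-channel decay `w` is a static learned parameter in Eagle and data-dependent in Finch (the dynamic recurrence). This is an illustration only, not the actual implementation.

```python
# Toy single-head sketch of a matrix-valued recurrent state, in the spirit of
# Eagle/Finch. Illustrative only: the real models add further terms (e.g., a
# bonus for the current token) and run as fused kernels, not Python loops.
import torch

def matrix_state_recurrence(r, k, v, w):
    """r, k, v, w: (T, d) tensors. w is the per-channel decay: a static learned
    parameter in Eagle (RWKV-5), data-dependent in Finch (RWKV-6)."""
    T, d = k.shape
    S = torch.zeros(d, d)                    # matrix-valued state, not a vector
    outputs = []
    for t in range(T):
        S = torch.diag(w[t]) @ S + torch.outer(k[t], v[t])  # decay + rank-1 write
        outputs.append(r[t] @ S)             # read the state with the receptance
    return torch.stack(outputs)              # (T, d)

r, k, v, w = torch.rand(4, 16, 8)            # T=16 steps, d=8 channels
y = matrix_state_recurrence(r, k, v, w)
```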
We introduce a comprehensive safety evaluation suite, covering both out-of-distribution (OOD) generalization with four new datasets and adversarial robustness with one novel attack and two existing attack strategies.
We propose LayerNorm tuning, a simple yet effective strategy for fine-tuning MLLMs. Compared to LoRA tuning, LayerNorm tuning reduces the trainable parameters by a significant 41.9% while improving model performance by 20%.
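The recipe is easy to reproduce in PyTorch: freeze the whole model, then re-enable gradients only for the LayerNorm affine parameters. A minimal sketch (GPT-2 stands in for an MLLM backbone purely for illustration):

```python
# A minimal sketch of LayerNorm tuning: freeze the whole pretrained model, then
# re-enable gradients only for the LayerNorm affine parameters.
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for an MLLM

for param in model.parameters():
    param.requires_grad = False              # freeze everything

for module in model.modules():
    if isinstance(module, nn.LayerNorm):     # ...except LayerNorm weights/biases
        for param in module.parameters():
            param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```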
Two of the most fine-grained multimodal dialogue datasets to date, with entity- and turn-level images, built on Wizard of Wikipedia and DailyDialog, together with a unified multimodal dialogue system supporting either a shared or separate encoder-decoder setup.
Without any explicit prompting for truthful or ethical behavior, simply tuning an LLM on multi-modal instruction datasets leads to noticeable improvements on the TruthfulQA and Ethics benchmarks.
A zero-shot text generation model controlled by visual and textual signals, without extra training on images. ZeroGen achieves SOTA performance on three vision-language tasks (two captioning tasks and controllable news generation).
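As a rough illustration of this plug-and-play family of methods (not necessarily ZeroGen's exact scoring rule), one can rescore the language model's top next-token candidates with a frozen CLIP image-text similarity, so the image guides decoding without any training on images:

```python
# Schematic sketch of plug-and-play, vision-guided decoding: rescore the LM's
# top next-token candidates with a frozen CLIP image-text score, so no training
# on images is needed. Illustrative of the idea, not ZeroGen's exact objective.
import torch
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

lm = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def next_token(prefix_ids, image, alpha=0.5, k=20):
    """prefix_ids: (1, T) token ids; image: a PIL image guiding the generation."""
    lm_logits = lm(prefix_ids).logits[0, -1]
    top = lm_logits.topk(k)
    # Decode each candidate continuation and score it against the image.
    texts = [tok.decode(torch.cat([prefix_ids[0], c.view(1)])) for c in top.indices]
    inputs = proc(text=texts, images=image, return_tensors="pt",
                  padding=True, truncation=True)
    clip_scores = clip(**inputs).logits_per_image[0]          # (k,) similarities
    mixed = (1 - alpha) * top.values.log_softmax(-1) + alpha * clip_scores.log_softmax(-1)
    return top.indices[mixed.argmax()]                        # greedy mixed choice
```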
A VAE model for unsupervised topic modeling and controllable text generation (CTG). It employs two continuous latent spaces, with a conditional dependency between them, for topic and sequence modeling. The model builds the sequence latent space with a series of flexible Householder transformations to create plausible content.
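For background, a Householder transformation is an orthogonal reflection, and chaining several of them is a cheap way to enrich a diagonal Gaussian posterior. A minimal statement of one step (general form, not the paper's exact parameterization):

```latex
% One Householder step: reflect the latent z about the hyperplane orthogonal to v_t.
H_t = I - 2\,\frac{v_t v_t^{\top}}{\lVert v_t \rVert^{2}}, \qquad z_t = H_t\, z_{t-1}
% Each H_t is orthogonal (H_t^{\top} H_t = I), so chaining T reflections enriches
% a diagonal Gaussian posterior while preserving volume.
```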
A model-agnostic framework for flexible, semi-supervised, and controllable text generation. The framework is “plug-and-play”: only a small portion of the pre-trained model's parameters needs to be fine-tuned.
The first large VAE model built on adaptive, parameter-efficient PLMs that can be optimized with minimal trainable parameters. Latent Attention is proposed to better construct the VAE's latent space from the transformer encoder. AdaVAE achieves competitive performance in language modeling and low-resource classification with only 14.66% of parameters activated.
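One plausible way to realize such attention-based latent construction is to let a learned query attend over the encoder's hidden states and pool them into posterior parameters; the sketch below illustrates the idea, and the class and names are hypothetical rather than AdaVAE's actual code.

```python
# Minimal sketch of attention-based latent construction for a VAE: a learned
# query attends over transformer encoder states and pools them into posterior
# parameters. The class and names are hypothetical, not AdaVAE's actual code.
import torch
import torch.nn as nn

class LatentAttentionPooler(nn.Module):
    def __init__(self, hidden: int, latent: int, heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, hidden))   # learned query
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)

    def forward(self, encoder_states):                          # (B, T, hidden)
        q = self.query.expand(encoder_states.size(0), -1, -1)
        pooled, _ = self.attn(q, encoder_states, encoder_states)
        pooled = pooled.squeeze(1)                              # (B, hidden)
        return self.to_mu(pooled), self.to_logvar(pooled)

mu, logvar = LatentAttentionPooler(hidden=768, latent=32)(torch.randn(2, 16, 768))
```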
This survey gives an introduction to existing generation schemes and the problems associated with text auto-encoders, a review of several controllable-generation applications that instantiate these general formulations, and a discussion of future research.
A dataset called Stego-Sandbox that simulates real social-network scenarios, and an effective linguistic steganalysis framework integrating linguistic and contextual features.
Experiences
Sep. 2024 - Present,
VLAA Lab, UC Santa Cruz,
Ph.D. Student, Multimodal & AI Safety.
Aug. 2021 - Jun. 2024,
University of Chinese Academy of Sciences,