Mobile cameras have revolutionized content creation, allowing casual users to capture professional-looking photos. However, capturing the perfect moment can still be challenging, making post-capture editing desirable. In this work, we introduce ExShot, a mobile-oriented expression editing system that delivers a high-quality, fast, and interactive editing experience. Unlike existing methods that rely on learned expression priors, we leverage mobile photo sequences to extract expression information on demand. This design enables ExShot to address challenges related to diverse expressions, facial details, environment entanglement, and interactive editing. At its core lies ExprNet, a lightweight deep learning model that extracts and refines expression features. To train our model, we captured portrait images with diverse expressions and applied pre-processing and lighting augmentation techniques to ensure data quality. Comprehensive evaluations demonstrate that ExShot outperforms other editing approaches by up to 29.02\% in PSNR. Ablation studies validate the effectiveness of our design choices, and user studies with 28 participants confirm both a strong desire for expression editing and the superior synthesis quality of ExShot, while also identifying areas for further investigation.