Deep-learning-based vision systems are widely used in manufacturing for defect inspection and process monitoring, but their performance can drop sharply under rare visual disturbances, such as sudden illumination changes, occlusions, and pose variations, that were not seen during training. Collecting and labeling data that covers all such real-world conditions is costly and risks production downtime. This paper therefore proposes a framework that combines digital-twin-based domain expansion with test-time adaptation. A Segment-Guided Network (SGN) is trained on image–mask pairs generated from a high-fidelity Unity-based simulation in which lighting factors are systematically controlled. At test time, the segmentation model is adapted stably, without additional labels, using semantic self-supervisory signals produced by the SGN. Experiments on a robot manipulator dataset and an in-house digital twin dataset under various domain shift conditions show that the proposed method improves Dice scores by up to 27.47 percentage points over a source-only baseline and by up to 25.05 percentage points over other test-time adaptation methods. These results indicate more robust and reliable visual perception in manufacturing environments under rare compound disturbances, while keeping labeling cost and production downtime low.
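The reported gains are differences in Dice score expressed in percentage points. As a minimal sketch of how such a number could be computed, assuming binary segmentation masks and the standard Dice definition 2|P ∩ T| / (|P| + |T|), the following pure-Python example may help. The function names `dice_score` and `percentage_point_gain` and the toy masks are illustrative assumptions, not taken from the paper's code or data.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask as nested 0/1 rows (assumed format)


def dice_score(pred: Mask, target: Mask) -> float:
    """Standard Dice coefficient 2|P ∩ T| / (|P| + |T|) for binary masks."""
    intersection = 0
    pred_sum = 0
    target_sum = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += p * t
            pred_sum += p
            target_sum += t
    if pred_sum + target_sum == 0:
        # Both masks empty: treat as perfect agreement (a common convention).
        return 1.0
    return 2.0 * intersection / (pred_sum + target_sum)


def percentage_point_gain(baseline: float, adapted: float) -> float:
    """Absolute difference between two Dice scores in [0, 1], in percentage points."""
    return 100.0 * (adapted - baseline)


# Toy 2x3 masks (hypothetical values, for illustration only).
target = [[1, 1, 0],
          [1, 1, 0]]
source_only_pred = [[1, 0, 0],
                    [0, 0, 0]]  # overlaps the target in 1 pixel
adapted_pred = [[1, 1, 0],
                [1, 0, 1]]      # overlaps the target in 3 pixels

baseline_dice = dice_score(source_only_pred, target)  # 2*1 / (1 + 4)
adapted_dice = dice_score(adapted_pred, target)       # 2*3 / (4 + 4)
gain = percentage_point_gain(baseline_dice, adapted_dice)
print(f"baseline={baseline_dice:.2f} adapted={adapted_dice:.2f} gain={gain:.2f} pp")
```

A percentage-point gain is an absolute difference between two scores, not a relative improvement: moving from a Dice of 0.40 to 0.75 is a gain of 35 points, even though the relative increase is much larger.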